Alignment Delegation Protocol — Aligned Sovereign Intelligence Institute

Alignment Delegation Protocol (ADP)

Schmidt Sciences 2026 · Interpretability RFP

Large language models harbor a sparse but causally decisive substrate for deceptive behavior: Hallucination-Associated Neurons (H-Neurons), fewer than 0.1% of feedforward network neurons, whose activation patterns reliably predict and produce hallucination, sycophantic capitulation, and safety-filter evasion. ADP makes this interpretability signal actionable in production — detecting, suppressing, and routing in real time.

Interactive demo: two-tier cascade with live routing (CETT · classifier · suppression)

Replication: pilot notebook for Mistral-7B (Colab / Kaggle)

Pilot Results — Mistral-7B-Instruct-v0.3 (4-bit NF4, TriviaQA 50-item held-out)

AUROC (5-fold CV): 0.66 ± 0.12 · per-fold values [0.80, 0.50, 0.80, 0.60, 0.60]
H-Neuron fraction: 0.0038% · 5 of 131,072 FFN neurons
Compute overhead: <0.01% · CETT monitor per forward pass
Optimal threshold: 0.45 · calibrated on held-out split
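
The headline figures above can be re-derived from the reported per-fold values and neuron counts. The following is a minimal check using only the Python standard library; the variable names are illustrative and not taken from the pilot notebook. It suggests that the reported ±0.12 corresponds to the population standard deviation (dividing by n), while the sample standard deviation (dividing by n - 1) comes out slightly higher.

```python
from statistics import mean, pstdev, stdev

# Per-fold AUROC values as reported in the pilot results.
fold_auroc = [0.80, 0.50, 0.80, 0.60, 0.60]

mean_auroc = mean(fold_auroc)
spread_population = pstdev(fold_auroc)  # divides by n
spread_sample = stdev(fold_auroc)       # divides by n - 1

# Folds strictly above the chance level of 0.50.
folds_above_chance = sum(a > 0.5 for a in fold_auroc)

# H-Neuron fraction: 5 flagged neurons out of 131,072 FFN neurons.
h_neurons, ffn_neurons = 5, 131_072
fraction_pct = 100 * h_neurons / ffn_neurons

print(f"mean AUROC          = {mean_auroc:.2f}")         # 0.66
print(f"population std      = {spread_population:.2f}")  # 0.12
print(f"sample std          = {spread_sample:.2f}")      # 0.13
print(f"folds above chance  = {folds_above_chance} of {len(fold_auroc)}")  # 4 of 5
print(f"H-Neuron fraction   = {fraction_pct:.4f}%")      # 0.0038%
```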

Mean AUROC (0.66) is above chance, and four of the five folds exceed 0.50; the remaining fold sits exactly at chance (0.50). This is consistent with the CETT pipeline being operational on the target model family. The inter-fold spread (σ = 0.12) is expected at n = 10 items per fold; reliable estimation requires the 500+ item evaluation sets planned for Year 1.
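
To illustrate why fold size drives the spread, the sketch below simulates a detector whose true AUROC is 0.66 and measures how much the estimated AUROC varies across repeated evaluation sets of 10 and 500 items. This is a toy model, not the pilot pipeline: the Gaussian score distributions, the balanced class split, the trial count, and the function names `auroc` and `simulate_spread` are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, mean, pstdev


def auroc(pos_scores, neg_scores):
    """AUROC from the rank-sum (Mann-Whitney U) statistic.

    Assumes continuous scores, so exact ties are not handled specially.
    """
    labeled = sorted([(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores])
    rank_sum = sum(rank for rank, (_, y) in enumerate(labeled, start=1) if y == 1)
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    u = rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def simulate_spread(n_items, target_auroc=0.66, n_trials=1000, seed=0):
    """Return (mean, std) of AUROC estimates over simulated evaluation sets.

    Toy assumption: negatives ~ N(0, 1) and positives ~ N(shift, 1), with
    shift chosen so that P(positive score > negative score) = target_auroc.
    """
    rng = random.Random(seed)
    shift = NormalDist().inv_cdf(target_auroc) * math.sqrt(2)
    n_pos = n_items // 2          # assumed balanced classes
    n_neg = n_items - n_pos
    estimates = []
    for _ in range(n_trials):
        pos = [rng.gauss(shift, 1.0) for _ in range(n_pos)]
        neg = [rng.gauss(0.0, 1.0) for _ in range(n_neg)]
        estimates.append(auroc(pos, neg))
    return mean(estimates), pstdev(estimates)


mean_10, std_10 = simulate_spread(10)
mean_500, std_500 = simulate_spread(500)
print(f"n = 10  : mean AUROC {mean_10:.2f}, std {std_10:.2f}")
print(f"n = 500 : mean AUROC {mean_500:.2f}, std {std_500:.2f}")
```

Under these assumptions the spread at n = 10 is much larger than at n = 500, since the standard error of an AUROC estimate shrinks roughly with the square root of the evaluation-set size.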

Reference: Gao et al. (2025) — H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs · arXiv:2512.01797