Aligned Institute
Alignment Delegation Protocol — Aligned Sovereign Intelligence Institute

Alignment Delegation Protocol (ADP)

Schmidt Sciences 2026  ·  Interpretability RFP

Large language models harbor a sparse but causally decisive substrate for deceptive behavior: Hallucination-Associated Neurons (H-Neurons; Gao et al., 2025). These neurons make up fewer than 0.1% of feed-forward network (FFN) neurons, yet their activation patterns reliably predict and produce hallucination, sycophantic capitulation, and safety-filter evasion. ADP makes this interpretability signal actionable in production: detecting, suppressing, and routing in real time.

Pilot Results — Mistral-7B-Instruct-v0.3  (4-bit NF4, TriviaQA 50-item held-out)
AUROC (5-fold CV): 0.66 ± 0.12  ·  per-fold [0.80, 0.50, 0.80, 0.60, 0.60]
H-Neuron fraction: 0.0038%  ·  5 of 131,072 FFN neurons
Compute overhead: <0.01%  ·  CETT monitor per forward pass
Optimal threshold: 0.45  ·  calibrated on held-out split
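
The summary figures above can be re-derived from the reported per-fold values as a quick consistency check. The sketch below is illustrative only: the variable names are assumptions, and it uses nothing beyond the numbers listed in the pilot results. It suggests that the reported ±0.12 corresponds to the population standard deviation across folds; the sample (n - 1) standard deviation comes out slightly higher.

```python
import statistics

# Per-fold AUROC values as reported in the pilot results.
fold_auroc = [0.80, 0.50, 0.80, 0.60, 0.60]

mean_auroc = statistics.fmean(fold_auroc)
pop_std = statistics.pstdev(fold_auroc)    # divides by n
sample_std = statistics.stdev(fold_auroc)  # divides by n - 1

# H-Neuron fraction: 5 of 131,072 FFN neurons, expressed as a percentage.
h_neuron_pct = 5 / 131_072 * 100

print(f"mean AUROC        = {mean_auroc:.2f}")
print(f"population std    = {pop_std:.2f}")
print(f"sample std        = {sample_std:.2f}")
print(f"H-Neuron fraction = {h_neuron_pct:.4f}%")
```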

Mean AUROC is above chance (0.66), and no fold falls below chance, although one fold sits exactly at 0.50. This is consistent with the CETT pipeline being operational on the target model family. Inter-fold variance (σ = 0.12) is expected at n = 10 items per fold; reliable estimation requires the 500+ item evaluation sets planned for Year 1.
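
To illustrate why fold size dominates the variance, the sketch below applies the Hanley-McNeil (1982) approximation for the standard error of an AUROC estimate. It assumes balanced classes (equal numbers of hallucinated and non-hallucinated items), which the pilot description does not state, and `auroc_standard_error` is an illustrative helper rather than part of any ADP codebase.

```python
import math


def auroc_standard_error(auroc: float, n_pos: int, n_neg: int) -> float:
    """Hanley-McNeil (1982) approximation to the standard error of an AUROC."""
    q1 = auroc / (2.0 - auroc)
    q2 = 2.0 * auroc**2 / (1.0 + auroc)
    variance = (
        auroc * (1.0 - auroc)
        + (n_pos - 1) * (q1 - auroc**2)
        + (n_neg - 1) * (q2 - auroc**2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


# ASSUMPTION: balanced classes; class balance is not reported for the pilot.
for n_items in (10, 50, 500):
    half = n_items // 2
    se = auroc_standard_error(0.66, half, half)
    print(f"n = {n_items:>3}: approximate SE = {se:.3f}")
```

Under these assumptions, the standard error at 10 items is of the same order as the observed inter-fold spread, and it shrinks roughly with the square root of the evaluation-set size.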

Reference: Gao et al. (2025) — H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs  ·  arXiv:2512.01797
ADP Architecture Diagram