
Alignment Delegation Protocol (ADP)
Large language models harbor a sparse but causally decisive substrate for deceptive behavior: Hallucination-Associated Neurons (H-Neurons). These make up fewer than 0.1% of feedforward network (FFN) neurons, and their activation patterns reliably predict and produce hallucination, sycophantic capitulation, and safety-filter evasion. ADP makes this interpretability signal actionable in production by detecting, suppressing, and routing in real time.
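The detect, suppress, and route loop described above can be sketched for a single FFN layer as follows. This is a minimal illustration under stated assumptions, not the ADP implementation: `monitor_step`, `MonitorResult`, the `suppress_scale` parameter, and the route names are hypothetical, and a plain mean of absolute activations stands in for the monitor score, which the text does not define.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MonitorResult:
    score: float              # monitor score over the tracked neurons
    flagged: bool             # True when the score exceeds the threshold
    activations: List[float]  # activations after optional suppression
    route: str                # "primary" or "fallback" (hypothetical names)


def monitor_step(activations: Sequence[float],
                 h_neuron_ids: Sequence[int],
                 threshold: float,
                 suppress_scale: float = 0.0) -> MonitorResult:
    """One detect / suppress / route step for a single FFN layer."""
    # Detect: placeholder score = mean absolute activation of tracked neurons.
    # (Assumption: the real monitor score is not specified in the text.)
    score = sum(abs(activations[i]) for i in h_neuron_ids) / len(h_neuron_ids)
    flagged = score > threshold

    out = list(activations)
    if flagged:
        # Suppress: scale only the tracked neurons; all others are untouched.
        for i in h_neuron_ids:
            out[i] *= suppress_scale

    # Route: send flagged passes to a fallback path (hypothetical policy).
    route = "fallback" if flagged else "primary"
    return MonitorResult(score, flagged, out, route)


if __name__ == "__main__":
    result = monitor_step([0.1, 2.0, 0.3, 1.0], h_neuron_ids=[1, 3], threshold=1.0)
    print(result.flagged, result.route, result.activations)
```

In this toy call the two tracked neurons average 1.5, which exceeds the threshold of 1.0, so both are zeroed and the pass is routed to the fallback path. In a real deployment the threshold would come from calibration on held-out data rather than being hard-coded.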
AUROC per fold: [0.80, 0.50, 0.80, 0.60, 0.60] (σ = 0.12)
5 of 131,072 FFN neurons
CETT monitor per forward pass
calibrated on held-out split
AUROC at or above chance in every fold (one of the five folds sits exactly at 0.50) indicates that the CETT pipeline is operational on the target model family. The inter-fold standard deviation (σ = 0.12) is expected at n = 10 items per fold; reliable estimation requires the 500+ item evaluation sets planned for Year 1.
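The fold-level figures can be checked with a few lines of arithmetic. The sketch below assumes the five bracketed values reported above are per-fold AUROCs; the variable names are illustrative. It shows that the quoted σ = 0.12 matches the population standard deviation of those values, and it counts how many folds are strictly above the 0.5 chance level.

```python
import statistics

# Assumption: these are the per-fold AUROC values reported above.
fold_auroc = [0.80, 0.50, 0.80, 0.60, 0.60]

mean_auroc = statistics.fmean(fold_auroc)
pop_sd = statistics.pstdev(fold_auroc)    # population SD (divides by n)
sample_sd = statistics.stdev(fold_auroc)  # sample SD (divides by n - 1)
folds_above_chance = sum(a > 0.5 for a in fold_auroc)

print(f"mean={mean_auroc:.2f} pstdev={pop_sd:.2f} stdev={sample_sd:.2f}")
print(f"{folds_above_chance} of {len(fold_auroc)} folds strictly above 0.5")
```

The mean comes out at 0.66, the population standard deviation at 0.12, and the sample standard deviation at roughly 0.13; four of the five folds are strictly above 0.5. With only five folds of ten items each, the gap between the two standard deviations is itself a reminder of how coarse these estimates are.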
