Open questions

Unsettled questions surfaced across the corpus, grouped by note. Each section links back to its source note. This page is generated by scripts/aggregate_open_questions.py — do not edit by hand.

Constitutional AI Training (RLAIF)

The orchestrator’s open-questions framing follows the Limitations and open problems section of the source note, but synthesises across the populated sections to identify the questions that bear on the next generation of alignment research rather than only on CAI itself. The questions below are ranked by the size of the bet each represents on whether the constitutional-AI approach scales.

The four questions that decide whether CAI survives at frontier scale

Does AI feedback remain reliable once the policy exceeds the labeller’s competence on the dimension being judged? This is the central scalable-oversight question. CAI works because the AI labeller is comparable to or stronger than the policy at judging harmlessness. Past that crossover, sycophancy [10] and blind spots that labeller and policy share through their common substrate become first-order risks rather than minor failure modes. Debate, recursive reward modelling, and process supervision are the candidate next-generation oversight protocols; CAI is one degenerate case of that broader family.
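The labelling step this question turns on can be sketched as follows. In the published CAI setup the feedback model sees two responses as a multiple-choice question, and its normalised log-probabilities for the two options become a soft preference target. The function names, the example log-probabilities, and the agreement check against reference labels are illustrative assumptions, not any lab's actual pipeline.

```python
import math

def soft_preference(logp_a: float, logp_b: float) -> tuple[float, float]:
    """Normalise the labeller's log-probabilities for options (A) and (B)."""
    m = max(logp_a, logp_b)  # subtract the max for numerical stability
    ea, eb = math.exp(logp_a - m), math.exp(logp_b - m)
    z = ea + eb
    return ea / z, eb / z

def agreement_rate(soft_labels, reference):
    """Fraction of pairs where the labeller's preferred option matches a
    reference label ('A' or 'B'). Ties are counted as a vote for 'A'."""
    hits = 0
    for (p_a, p_b), ref in zip(soft_labels, reference):
        choice = "A" if p_a >= p_b else "B"
        hits += choice == ref
    return hits / len(reference)

# Hypothetical labeller log-probabilities for three comparison pairs.
pairs = [(-0.2, -1.8), (-1.1, -0.9), (-0.7, -0.7)]
labels = [soft_preference(a, b) for a, b in pairs]
rate = agreement_rate(labels, ["A", "B", "B"])
```

One way to look for the crossover, under these assumptions, would be to track the agreement rate separately on comparison pairs where the policy is believed to outperform the labeller.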

Is training-time alignment alone ever sufficient at production scale? Anthropic’s own decision to ship Constitutional Classifiers [13] as a separate runtime defence layer is empirical evidence that the CAI-trained policy is necessary but insufficient against universal jailbreaks at the current Claude capability tier. The deeper question — whether any training-time-only alignment method clears the production safety bar at higher Responsible Scaling Policy capability thresholds, or whether the future production stack is always going to be training + runtime defences + capability-gated deployment — is open.

Whose constitution survives at frontier scale? Collective Constitutional AI [8] showed that a publicly sourced constitution produces a model with comparable performance and measurably less political-topic skew. But ~1,000 participants is not a legitimacy proof at the scale at which Claude operates. The follow-on questions (which public, weighted how, deliberating with what information access, adjudicated by what mechanism) are governance questions, not technical ones, and they are where AISI-class evaluation work intersects most directly with the CAI literature.

Does the constitution live in interpretable features inside the trained policy? Behavioural CAI works: the policy refuses on adversarial inputs more reliably than RLHF baselines. Whether the policy has internalised the principles, or merely learned the surface behaviour they entail, is the inner-alignment question. As of 2026 the interpretability literature is just starting to probe this: sparse autoencoders and dictionary learning can ask “what features fire when the policy is being constitutional,” but no work I have found has decisively distinguished a CAI-trained policy that believes the constitution from one that merely performs it.
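A minimal sketch of the "what features fire" probe, assuming dictionary-feature activations have already been extracted for two sets of responses. The function name, input layout, and toy numbers are hypothetical. The ranking surfaces features correlated with constitutional behaviour; it does not separate internalised principles from learned surface behaviour, which is why the question stays open.

```python
from statistics import mean

def rank_features_by_gap(constitutional, baseline, top_k=3):
    """Rank feature indices by the gap in mean activation between two
    response sets. Each input is a list of per-response activation
    vectors of equal length (hypothetical SAE feature activations)."""
    n_features = len(constitutional[0])
    gaps = []
    for j in range(n_features):
        gap = (mean(row[j] for row in constitutional)
               - mean(row[j] for row in baseline))
        gaps.append((gap, j))
    # Largest absolute gap first: strongly up- or down-weighted features.
    gaps.sort(key=lambda t: abs(t[0]), reverse=True)
    return [(j, gap) for gap, j in gaps[:top_k]]

# Toy activations: 2 responses per set, 3 features each.
constitutional = [[0.0, 2.0, 0.5], [0.0, 3.0, 0.5]]
baseline = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
top = rank_features_by_gap(constitutional, baseline, top_k=2)
```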

Secondary open questions surfaced by this research:

The critique-quality threshold. Zhang [zhang-constitution-collapse-2025] and Chacón Menke & Tan [chacon-small-cai-2025] document that CAI does not work uniformly across model scales — small models (7-9B) produce self-critiques too low-quality to drive useful self-improvement, and mode collapse in the RL stage becomes the dominant failure. Where the critique-quality threshold lies, and what determines it (parameter count, instruction-tuning depth, base reasoning capability), is empirically open.
DPO vs. PPO at frontier scale. Xu et al. [14] push back on the open-source DPO consensus with a careful study showing PPO wins at frontier scale with the right hyperparameters. Whether this generalises across post-training pipelines and whether the gap closes with DPO refinements (KTO, IPO, ORPO) remains unsettled.
Iterative self-rewarding stability. Yuan et al. [22] showed that a model trained with three rounds of self-rewarding DPO outperforms Claude 2 on AlpacaEval 2.0; the CREAM and Temporal-SRLM follow-ups document degradation past round three without regularisation. Whether there is a stable iterative self-improvement protocol that survives many rounds decides whether the AI labeller can be made strictly stronger than the policy by bootstrapping, rather than by relying on a separate, larger, externally trained labeller.
Cross-organisation convergence of constitution-trained models. As more frontier labs adopt CAI-flavoured pipelines (Tulu 3, Llama 3 Instruct, Anthropic’s Claude, and whatever the next-generation Gemini training stack looks like), an empirical question becomes tractable: do models trained against different constitutions converge in behaviour on shared eval sets, or do they preserve constitution-mirrored differences? The answer determines whether constitutions are policy tools or merely PR tools.
CAI under fine-tuning attack. Constitutional Classifiers can be bypassed by adversarial fine-tuning when the attacker has fine-tuning access (Trojan-Speak class of attacks). The robustness of constitution-trained policies under post-training adversarial fine-tuning is a related and largely unstudied question.
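For the DPO vs. PPO item above, the per-pair objectives of DPO and one of its refinements, IPO, can be written out directly. This is a sketch on scalar sequence log-probabilities with hypothetical values for beta and tau; it is not the setup used by Xu et al. [14], and PPO, KTO, and ORPO are not shown.

```python
import math

def implicit_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Difference of policy/reference log-ratios for the chosen (w)
    and rejected (l) responses."""
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)

def dpo_loss(margin, beta=0.1):
    """DPO: -log(sigmoid(beta * margin)), computed in a stable form."""
    x = beta * margin
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

def ipo_loss(margin, tau=0.1):
    """IPO: squared distance of the margin from the target 1 / (2 * tau)."""
    return (margin - 1.0 / (2.0 * tau)) ** 2

# Hypothetical sequence log-probabilities for one preference pair.
h = implicit_margin(logp_w=-1.0, logp_l=-2.0, ref_logp_w=-1.5, ref_logp_l=-1.5)
print(dpo_loss(h), ipo_loss(h))
```

The DPO term keeps decreasing as the margin grows, while the IPO term is minimised at a finite margin; that difference is one reason the refinements behave differently from DPO under continued training.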

Daniel ML Evans
