Constitutional AI Training (RLAIF)

Definition

Plain-language

Constitutional AI training (CAI) is a method for aligning a large language model by replacing the human labellers in RLHF’s harmlessness step with the model itself, prompted against a written list of natural-language principles called a “constitution.” The model critiques and revises its own potentially-harmful responses against randomly-sampled principles to produce a self-supervised harmlessness dataset (SL-CAI), then learns from AI-generated pairwise preference labels — also conditioned on sampled principles — via PPO or DPO against a learned reward model (RL-CAI). The substitution of an AI critic for a human one is what the literature also calls Reinforcement Learning from AI Feedback (RLAIF).

Formal. Given a helpful-only RLHF base policy π_helpful, a constitution C = {p_1, ..., p_M} of natural-language principles, and a red-teaming prompt distribution 𝒟, CAI trains a final policy π_θ in two stages: (1) SL-CAI — sample responses from π_helpful, critique-and-revise each against principles p ~ Uniform(C) for K rounds, and supervised-fine-tune on the (prompt, final-revision) pairs to obtain π_SL-CAI; (2) RL-CAI — collect AI-labelled preference pairs (x, y_w, y_l) from π_SL-CAI outputs scored by a feedback model against sampled p, fit a Bradley-Terry reward model r_φ, and optimise π_θ against E[r_φ(x, y)] - β · KL[π_θ ‖ π_SL-CAI] via PPO (or DPO substituting steps 3-4 with a single closed-form supervised loss). The constitution is the only human-authored input to the harmlessness pipeline; every label is AI-generated [1].

Intuition. RLHF assumed humans had to label what counts as harmful, which throttled the data pipeline and produced evasive, refusal-prone models. Constitutional AI’s bet is that a competent enough language model, given a written rule-set, can label harmfulness as well as a human — and crucially, can apply each rule with reasoning rather than gut feel. The constitution is the audit surface: instead of inferring values from millions of opaque human comparison choices, you write the values down and let the model apply them. CAI is not a different alignment philosophy from RLHF — it is RLHF with the rate-limiting human-labeller bottleneck swapped for a model-labeller, and a piece of human-authored text supplying the inductive bias.

Mechanism

Constitutional AI replaces the human harmlessness labeller of RLHF with a critique-and-revise loop driven by a natural-language constitution plus the policy model itself. The mechanism is a two-stage pipeline — supervised learning from AI feedback (SL-CAI) followed by reinforcement learning from AI feedback (RL-CAI) — sitting on top of an RLHF-trained helpful-only base model. The only place humans appear in the harmlessness labelling pipeline is in writing the constitution; every harmlessness comparison, critique, and revision is produced by the model [1].

Why a list of principles rather than one

The constitution is a list of natural-language principles sampled stochastically — one per critique, one per comparison — rather than concatenated into a giant system prompt. This matters: the model never sees every principle at once, so the trained policy must internalise the distribution over principles rather than overfit to a single rule’s wording. Anthropic’s follow-up shows that even a single general principle (“do what’s best for humanity”) trained this way generalises competitively with dozens of specific rules, evidence that CAI does principle-level generalisation rather than rule-by-rule pattern matching [6].

Stage 1 — Supervised Learning from AI Feedback (SL-CAI)

Starting from a helpful-only RLHF model π_helpful (trained per the HH-RLHF recipe, [2]) and a written constitution C = {p_1, ..., p_M} of natural-language principles, the SL-CAI loop does the following for each red-teaming prompt x (typically drawn from a corpus of adversarial prompts designed to elicit harmful behaviour):

Sample initial response. Generate y_0 ~ π_helpful(· | x). Because π_helpful was only RLHF’d for helpfulness, y_0 is likely to contain the harmful content the prompt was engineered to elicit.
Critique step. Sample a principle p_k ~ Uniform(C) and prompt the same model with a critique template of the form: “Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal. [principle p_k]“. The model emits a self-critique c_0.
Revision step. Prompt the same model with a revision template: “Please rewrite the assistant response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.” and the critique c_0. The model emits a revised response y_1.
Iterate. Repeat steps 2-3 for k = 1, ..., K rounds (Bai et al. use K = 4), each time sampling a fresh principle and producing (c_i, y_{i+1}). The final revision y_K is the SL-CAI training target.
Supervised fine-tune. Fine-tune the base model on the dataset {(x, y_K)} with the standard next-token cross-entropy loss to obtain π_SL-CAI.

The critique-revise iteration converges quickly in practice: most of the harmlessness gain shows up after the first revision; subsequent rounds primarily polish style and remove residual evasion [1]. The role of the critique step is essentially chain-of-thought for the revision — the model writes down what is wrong before fixing it, and this two-step decomposition produces measurably better revisions than asking for a direct rewrite.

Stage 2 — Reinforcement Learning from AI Feedback (RL-CAI)

Stage 2 is RLHF with the human preference labeller replaced by an AI feedback model. The pipeline mirrors InstructGPT [4] and HH-RLHF [2] step-for-step:

Generate response pairs. For each red-teaming prompt x, sample two responses (y_a, y_b) ~ π_SL-CAI(· | x) from the SL-CAI model (often at high temperature to encourage diversity).
AI feedback step. Prompt a feedback model π_FB (in the original paper, the same helpful-only RLHF model used in stage 1) with a comparison template that includes:
- the prompt x and the two candidate responses (y_a, y_b),
- a randomly-sampled principle p_k ~ Uniform(C),
- an explicit chain-of-thought reasoning request (“Let’s think step by step…”) asking the model to reason about which response better satisfies p_k,
- and a final “Answer: (A) or (B)” token whose log-probabilities supply the preference label. The label is log p(A) - log p(B) — a soft preference rather than a hard 0/1 — which Lee et al. show empirically improves downstream PM accuracy versus discretised labels [3]. The chain-of-thought is the lever for label quality: ablating it in the original paper drops preference accuracy by several points [1].
Train a preference model. Collect a dataset D_pref = {(x^{(i)}, y_w^{(i)}, y_l^{(i)})}_{i=1}^N where y_w is the preferred and y_l the rejected response per the AI labeller. Fit a scalar reward model r_φ(x, y) (initialised from the SL-CAI model with a scalar head) by maximising the Bradley-Terry log-likelihood:

$L_{PM} (φ) = - E_{(x, y_{w}, y_{l}) \sim D_{pref}} [lo g σ (r_{φ} (x, y_{w}) - r_{φ} (x, y_{l}))]$

See how to read this.

PPO with KL penalty. Initialise the RL policy π_θ from π_SL-CAI and run Proximal Policy Optimization against the per-token reward r_φ(x, y) - β · KL[π_θ(· | x) || π_SL-CAI(· | x)], maximising:

$J (θ) = E_{x \sim D, y \sim π_{θ} (\cdot ∣ x)} [r_{φ} (x, y)] - β E_{x \sim D} [KL (π_{θ} (\cdot ∣ x) ∥ π_{SL-CAI} (\cdot ∣ x))]$

See how to read this.

The KL penalty against π_SL-CAI (the initialisation, not the original pretrained model) prevents the RL policy from drifting into reward-model exploits and is the same regularisation device used throughout the RLHF literature [4], [5].

Pseudocode

# ─── Stage 1: SL-CAI ─────────────────────────────────────────────
# Input: helpful-only RLHF model π_helpful, constitution C,
#        red-teaming prompts X_red, num revisions K
 
sl_dataset ← []
for x in X_red:
    y ← sample(π_helpful, x)               # initial (likely harmful) response
    for i in 1..K:
        p ← uniform_sample(C)               # one principle per round
        c ← sample(π_helpful, critique_prompt(x, y, p))
        y ← sample(π_helpful, revise_prompt(x, y, c))
    sl_dataset.append((x, y))                # final revision is the SFT target
 
π_SL ← finetune(π_base, sl_dataset,         # next-token cross-entropy SFT
                loss=cross_entropy)
 
# ─── Stage 2: RL-CAI (RLAIF) ─────────────────────────────────────
# Input: π_SL from stage 1, feedback model π_FB, constitution C,
#        red-teaming prompts X_red, KL coefficient β
 
pref_dataset ← []
for x in X_red:
    y_a, y_b ← sample(π_SL, x), sample(π_SL, x)   # response pair
    p ← uniform_sample(C)
    cot ← sample(π_FB, compare_prompt(x, y_a, y_b, p))   # chain-of-thought
    logp_A, logp_B ← π_FB.logprobs(["A","B"] | cot + "Answer:")
    if logp_A > logp_B:  y_w, y_l ← y_a, y_b
    else:                y_w, y_l ← y_b, y_a
    pref_dataset.append((x, y_w, y_l))
 
# Bradley-Terry preference model on AI labels
r_φ ← train_reward_model(pref_dataset,
        loss = lambda x,yw,yl: -log_sigmoid(r_φ(x,yw) - r_φ(x,yl)))
 
# PPO with KL penalty against SL-CAI initialisation
π_θ ← copy(π_SL)
for step in 1..N_ppo:
    x ← sample(X_red)
    y ← sample(π_θ, x)
    reward ← r_φ(x, y) - β · KL(π_θ(·|x) ‖ π_SL(·|x))
    π_θ ← ppo_update(π_θ, x, y, reward)
 
return π_θ           # the deployed CAI-trained assistant

Key components

Component	Role
Constitution `C`	Finite list of natural-language principles; sampled uniformly per critique and per comparison rather than concatenated. Acts as the human-specified inductive bias for what the AI labeller should reward.
Helpful-only RLHF base `π_helpful`	Starting point. Already follows instructions; deliberately not trained on harmlessness labels so its harmful outputs in stage 1 are representative of what the constitution must correct.
Critique-and-revise loop	Self-correction mechanism; chain-of-thought for editing. Cheaply produces an SFT corpus of harmlessness-respecting completions.
AI feedback model `π_FB`	Plays the role human raters play in RLHF. Lee et al. show `π_FB` need not be larger than the policy being trained [3].
Chain-of-thought in feedback	The single biggest label-quality lever in the original paper; the model reasons about the principle before emitting the A/B token.
Bradley-Terry preference model `r_φ`	Standard pairwise-preference reward model; identical objective to RLHF, only the labels’ source differs.
KL penalty `β · KL(π_θ ‖ π_SL-CAI)`	Regularises the RL policy against the SL-CAI initialisation; prevents reward-model exploitation.

Comparison to RLHF

The only mechanistic difference between RL-CAI and RLHF is the source of the preference label: a constitution-conditioned AI critic with chain-of-thought versus a human rater. Every other component — SFT warm-start, Bradley-Terry reward model, PPO with KL penalty, the helpful-and-harmless red-teaming prompt set — is shared with the HH-RLHF pipeline [2], [4]. This substitutability is exactly what Lee et al. test directly when they show RLAIF matches RLHF on summarisation and dialogue tasks [3]. Read CAI as RLHF plus a written constitution plus AI labellers; the optimisation machinery is unchanged.

Modern variant: DPO in place of PPO

The 2024–2025 production recipe drops PPO in favour of Direct Preference Optimization [7]. DPO collapses the reward-model-then-PPO pipeline into a single closed-form supervised loss on the policy:

The AI-feedback step is unchanged — still constitution-sampled, still chain-of-thought, still emits (x, y_w, y_l) triples.
Instead of training r_φ then running PPO, fit the policy π_θ directly with the DPO loss, which uses the implicit reward r(x,y) = β · log(π_θ(y|x) / π_SL-CAI(y|x)) baked into the closed form.
No separate reward model, no on-policy sampling during optimisation, no PPO ratio-clip machinery; just an offline supervised fit on the AI-labelled pairs.

This is the recipe used in most open-source CAI replications (Llama 3, Hugging Face TRL’s CAI tutorial, etc.) and is a strict improvement on the engineering side — same data pipeline, simpler optimisation, smaller compute footprint [hf-trl-cai-tutorial]. The DPO variant inherits all of CAI’s substantive design (constitution sampling, critique-revise, chain-of-thought feedback) and changes only step 4 of stage 2.

What the model is actually learning

Mechanistically, the policy is not learning the constitution as a memorised rule-set. It is learning a posterior over response distributions conditioned on the marginal “some principle from C was sampled”. Because principles are i.i.d. across critique steps and comparison labels, the gradient signal averages over C, and the resulting policy approximates E_p[π_aligned(· | x, p)] rather than any single rule. This is why a single very general principle works [6] and why the method generalises to held-out red-teaming prompts: the policy has internalised “behave in ways the constitution-conditioned critic would prefer” as a coarse-grained behavioural prior, not a list of memorised exceptions.

Historical lineage

Constitutional AI did not spring into being from scratch; it is the terminus of a six-year relay race that began in robotics, moved through language fine-tuning, crystallised into a production RLHF template, and then was extended by the observation that a language model itself can play the role of the human labeller.

2017 — Preference-learning primitive [5]: Christiano et al. show that a reward model trained on pairwise human preference comparisons can drive RL on Atari and MuJoCo tasks where a reward function is hard to specify. Every subsequent RLHF, RLAIF, and CAI system inherits this paper’s core structure: collect preference pairs, train a Bradley-Terry reward model via log-likelihood, optimise a policy against that reward signal. The substitution of AI labellers for human labellers later (CAI/RLAIF) is a change of input, not of algorithm.
2019 — First application to language [ziegler-lm-rlhf-2019]: Ziegler et al. (arXiv:1909.08593) apply Christiano’s reward-modelling recipe to stylistic continuation tasks in GPT-2, marking the first time RLHF was used on a language model. The paper is a proof of concept rather than a production system, but it establishes that text generation can serve as the action space for preference-based RL without modification.
2020 — Scaling preference learning to language generation [stiennon-summarize-2020]: Stiennon et al. demonstrate that RLHF scales to a genuinely useful language task — abstractive summarisation — and that the resulting model transfers to out-of-distribution documents, a key practical requirement. This paper is the first to show preference learning producing qualitatively better (not just safer) language outputs, shifting the technique’s perceived use-case from safety-only to general quality improvement.
2021 — The HHH framing [askell-hhh-2021]: Askell et al. publish “A General Language Assistant as a Laboratory for Alignment,” articulating the helpful, honest, and harmless (HHH) tripartite goal that will become the governing desiderata for all Anthropic assistant research. Crucially, the paper treats a language assistant as a laboratory instrument for studying alignment rather than as a product to be optimised — an intellectual stance that makes CAI’s subsequent constitution-writing step legible as alignment research rather than mere prompt engineering.
March 2022 — InstructGPT canonises the RLHF pipeline [4]: Ouyang et al. deploy the three-stage SFT → reward model → PPO loop at production scale on GPT-3, yielding a model that human raters strongly prefer to the un-fine-tuned base. InstructGPT fixes the template: every subsequent preference-trained system, including both stages of CAI, inherits this pipeline verbatim. It also surfaces the core tension: the RLHF-trained model scores worse on standard NLP benchmarks (‘alignment tax’) even as human raters prefer it — a trade-off CAI will try to dissolve.
April 2022 — Anthropic’s HH-RLHF baseline and the helpfulness–harmlessness tension [2]: Bai et al. publish Anthropic’s own RLHF system trained on a large helpfulness-and-harmlessness preference dataset. The paper’s central empirical finding is that optimising RLHF on harmlessness labels alone causes the model to become evasive and unhelpful — it learns to refuse or hedge on anything that might score badly on harmlessness, even benign requests. This helpfulness–harmlessness trade-off is the direct empirical motivation for CAI: a principled, articulated constitution can tell the model why something is harmful, allowing it to be selectively harmless rather than globally evasive.
December 2022 — Constitutional AI [1]: Bai et al. introduce the two-stage CAI pipeline. In stage one (SL-CAI), a helpful-only RLHF base model critiques and revises its own outputs against a written constitution of 16 principles, generating a harmlessness SFT dataset without any human harmlessness labels. In stage two (RL-CAI / RLAIF), a preference model trained on AI-generated pairwise comparisons — again scored by the constitution — supplies the reward signal for PPO. The resulting model is simultaneously more harmless and less evasive than RLHF-from-human-feedback baselines, empirically resolving the tension Bai et al. 2022a documented. The paper is the inflection point: it replaces human labeller effort on harmlessness with a document that can be read, debated, and versioned.
May 2023 — DPO eliminates the explicit reward model [7]: Rafailov et al. reformulate preference optimisation as a single closed-form supervised loss on the policy itself, showing that the reward model is implicitly encoded in the policy and need never be materialised explicitly. DPO eliminates PPO’s instability and hyperparameter burden. Modern CAI pipelines — in open-source replications and in frontier production systems — substitute DPO for PPO in the RL-CAI stage; this is the dominant 2024–2025 recipe and closes the gap between research demonstrations and practical deployment.
2023 — RLAIF generalisation, critique, and refinement: Three concurrent threads stress-test the CAI approach:
- Lee et al. [3] run the first large-scale head-to-head between RLAIF and RLHF outside Anthropic, finding that an off-the-shelf LLM labeller matches or beats human labellers on summarisation and dialogue tasks — decoupling the AI-feedback mechanism from the constitutional framing and giving RLAIF its public currency.
- Kundu et al. [6] probe whether the granularity of the constitution matters, finding that a single very general principle (‘do what’s best for humanity’) is competitive with a multi-rule constitution, and that specificity helps mainly to mitigate sycophancy — direct evidence that CAI performs genuine principle-level generalisation rather than rule-by-rule pattern matching.
- Sharma et al. [10] document that RLHF- and CAI-trained models systematically agree with stated user beliefs even when those beliefs are factually wrong, tracing the failure to preference labellers (human and AI alike) rewarding sycophantic responses — the canonical empirical critique of the RLAIF substitutability claim.
- Casper et al. [9] catalogue RLHF and RLAIF failure modes systematically — a survey that applies with equal force whether the labels are human or AI-generated.
- Sun et al. [11] extend the constitutional framing by making the reward model itself instructable at inference time, so a single trained RM can score against arbitrary constitutions without re-training — the most explicit ‘constitution as runtime input rather than baked-in policy’ design point in the post-CAI literature.
June 2024 — Collective Constitutional AI [8]: Anthropic and the Collective Intelligence Project derive a constitution from ~1,000 public participants via Polis-style deliberation, then train a CAI model against it. The resulting model matches Anthropic’s internally-constituted model on capability benchmarks while showing measurably less political-topic skew. This paper operationalises the ‘whose values?’ objection that the original CAI paper deferred — it is the answer to the critique that constitutional AI just bakes in Anthropic’s institutional preferences.
2024–2026 — Productionisation and open-source proliferation: The Claude 3, 3.5, 3.7, and 4 model families all use CAI variants; Anthropic has confirmed the constitutional approach is central to Claude’s harmlessness training. Open-source replications have proliferated — HuggingFace TRL ships RLAIF support, the Aira project applied CAI to Portuguese, and OpenAssistant experiments tested constitutional critique. The technique is no longer a research prototype: it is the production norm for scalable harmlessness training at Anthropic, and increasingly a reference method elsewhere.

State of the art

As of 2026-05-23.

Info

Constitutional AI / RLAIF is no longer a single technique but a family of post-training recipes. Anthropic ships every Claude model from 3 onwards trained against an evolving written constitution — most recently a public, ~80-page January 2026 revision[12] — and as of November 2025 wraps the deployed model in a separate Constitutional Classifier layer trained on synthetic constitution-conditioned data[13]. Outside Anthropic, the dominant 2024-2026 open-source recipe is SFT → DPO over AI-labelled preferences (Tulu 3, Llama 3 Instruct), with iterative self-rewarding variants competitive on academic benchmarks while a careful 2024 study finds PPO still wins at frontier scale with the right hyperparameters[14], [15], [16], [22].

Frontier-model deployments

Aspect	Current best	Evidence	Date
Production CAI deployment	Claude Opus 4.5 (Nov 2025); CAI used end-to-end since Claude 1	[12], [17]	2025-11
Anthropic constitution scale	~80 pages, reason-based not rule-based, public Jan 2026 revision	[12]	2026-01
Public-input constitution	Collective CAI (Huang et al. 2024); a disability-rights principle from this work was folded into Claude 3.5 Sonnet (new)	[8], [18]	2024-10
Jailbreak defence layered on CAI model	Constitutional Classifiers; >3000 red-team hours, no universal jailbreak found, 0.38% extra refusals, 23.7% inference overhead	[13]	2025-01
Optimisation inside the RL stage	DPO has displaced PPO across most open replications; PPO still leads at frontier scale (Xu et al. 2024)	[14], [15]	2024-11
AI labeller architecture	Iterative self-rewarding LM (DPO-trained LLM judges its own outputs each round)	[16], [22]	2024-01
Instructable RM design point	SALMON-style runtime-conditioned reward models	[11]	2023-10

DPO inside CAI — substituting Direct Preference Optimization [7] for PPO in the RL-CAI stage is the dominant open-source recipe. Allen AI’s Tulu 3 (Nov 2024) [15] builds a five-stage post-training pipeline on Llama 3 base — prompt curation, SFT, DPO on on- and off-policy preferences (many of them AI-labelled), an RL-with-verifiable-rewards (RLVR) stage, and a standardised eval suite — and is now the reference open recipe. HuggingFace’s TRL library ships a Constitutional-AI cookbook that reproduces the SL-CAI → DPO-CAI pipeline on Mistral-7B / Llama-3-8B with public datasets[19].
DPO vs PPO at frontier scale — Xu et al. (2024) [14] push back on the open-source consensus: with advantage normalisation, large batch size, and an EMA reference model, PPO surpasses DPO on competitive coding benchmarks and is sensitive to the preference-data distribution shift that DPO suffers from. The honest summary as of mid-2026 is that DPO is the productivity-winning choice for replications under ~70B parameters; PPO (or REINFORCE-Leave-One-Out variants used internally at Anthropic and OpenAI[20]) is still the capability-winning choice at frontier scale (confidence: medium — based on one comparative study plus inferred lab practice).
Self-rewarding / iterative RLAIF — Yuan et al. (Meta, Jan 2024) [22] train a single LLM that both produces and scores responses via LLM-as-judge prompting, then iterates DPO on its own preferences. Three rounds takes Llama-2-70B past Claude 2, Gemini Pro, and GPT-4 0613 on AlpacaEval 2.0 win-rate — the strongest published demonstration that the AI-labeller need not be a separate (and stronger) model, addressing the core sycophancy/substrate-bias critique [10] at least partially. Subsequent variants add consistency regularisation (CREAM)[21] and temporal decoupling[22] to dampen reward hacking on later rounds.
Constitutional Classifiers (Anthropic, Jan 2025) — Sharma et al. [13] train separate input and output classifiers on synthetic data generated by prompting an LLM with a constitution defining permitted and restricted content categories. Over 3000 estimated hours of human red-teaming failed to find a universal jailbreak against the classifier-guarded Claude. Production cost: 0.38% absolute refusal-rate increase and 23.7% inference overhead. This is not a CAI training method — it is a CAI-shaped runtime defence stacked on top of a CAI-trained model, and as of mid-2026 it is the production answer to the ‘CAI alone leaks under universal jailbreaks’ critique (confidence: high).
Reason-based constitutions — Anthropic’s January 2026 constitution rewrite shifted from terse rules to ~80 pages of explanatory reasoning, on the empirical claim that models trained against principles-with-rationales generalise better and refuse less spuriously[12] (evidence_level: blog_post, confidence: medium — single source, no public eval numbers yet).

Quantitative benchmark progress

CAI-as-such is hard to benchmark in isolation — the relevant comparisons are post-training-recipe deltas on shared base models.

AlpacaEval 2.0 length-controlled win-rate — Self-Rewarding LM (Llama-2-70B, 3 iterations) reports 20.4% LC win-rate vs GPT-4 Turbo [22], exceeding Claude 2 (17.2%) and GPT-4 0613 (15.8%) baselines from the same paper. As of 2026, frontier proprietary models saturate this benchmark (>40% LC win-rate); the open-source-with-AI-feedback story is best read as ‘closes the gap that human-only RLHF leaves open’ (confidence: medium).
Arena-Hard / Chatbot Arena Elo — Tulu 3 70B [15] reports parity with Llama 3.1 70B Instruct on most benchmarks despite using only open data and open methods, indicating that AI-feedback post-training has caught up with Meta’s closed RLHF pipeline at the 70B tier (confidence: medium).
HH-RLHF harmlessness win-rate — The original CAI paper [1] reports the RL-CAI model is preferred over the RLHF baseline on harmlessness comparisons while being less evasive (no significant helpfulness penalty). The 2023-2026 frontier-model evals show this trade-off curve has continued to improve generation-on-generation; absolute numbers from the Claude Opus 4.5 system card[17] are the cleanest production data point but are vendor-reported (evidence_level: technical_report, confidence: medium).
Jailbreak robustness — Constitutional Classifiers [13]: no universal jailbreak found across 3000+ red-team hours on the guarded model; 95%+ of held-out automated jailbreak attempts blocked (confidence: high).

What is not yet settled

Whether iterative self-rewarding actually generalises past the third round before collapse — CREAM[21] and others document degradation without regularisation; this is the most contested claim in the 2024-2026 literature (confidence: low on long-term iteration stability).
Whether the principal-agent and reward-hacking failure modes catalogued by Casper et al. [9] are amplified by substituting an AI labeller for a human one. Sycophancy work [10] is the empirical foothold; no clean separation has been demonstrated.
Whether constitution granularity matters at frontier scale. Kundu et al. (2023) [6] found a single general principle competitive with dozens of specific ones at the time; the January 2026 Anthropic constitution rewrite explicitly bets the opposite way (confidence: low).

Key papers

Constitutional AI training sits at the intersection of three literatures: RLHF (the optimisation machinery), AI-feedback / scalable oversight (the substitution of model judgement for human labels), and alignment-by-principles (the explicit rule-set framing). The papers below trace that confluence from the 2017 preference-learning foundation through Anthropic’s 2022 CAI formulation to current refinements (DPO, collective constitutions, sycophancy critique).

Bai et al. 2022 — Constitutional AI: Harmlessness from AI Feedback

Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, et al. (Anthropic), arXiv:2212.08073, Dec 2022. [1] — The foundational Constitutional AI paper. Introduces the two-stage SL-CAI + RL-CAI pipeline: a helpful-only RLHF model critiques and revises its own harmful outputs against a written constitution to produce a harmlessness SFT dataset, then a preference model trained on AI-generated comparison labels supplies the reward signal for the RL stage (RLAIF). Demonstrates that the resulting assistant is simultaneously more harmless and less evasive than RLHF-from-human-feedback baselines, while using essentially zero human harmlessness labels — the core scalable-oversight claim of the method.

Bai et al. 2022 — Training a Helpful and Harmless Assistant with RLHF

Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, et al. (Anthropic), arXiv:2204.05862, Apr 2022. [2] — The RLHF predecessor that CAI was designed to improve on. Establishes the helpful-and-harmless (HH) preference-data regime and the PPO-against-reward-model loop that becomes the template for the RL stage of CAI. Crucially, the paper documents the helpfulness–harmlessness tension — RLHF on harmlessness labels yields evasive, refusal-prone models — which is the empirical problem CAI’s principled critique-and-revise step is built to dissolve.

Lee et al. 2023 — RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

H. Lee, S. Phatale, H. Mansoor, T. Mesnard, J. Ferret, et al. (Google Research), arXiv:2309.00267, Sep 2023. [3] — The first large-scale head-to-head comparison of RLAIF and RLHF outside Anthropic. Shows that on summarisation, helpful-dialogue, and harmless-dialogue tasks an off-the-shelf LLM can label preferences well enough that RLAIF matches or beats RLHF on human-rated win-rate, and that the AI labeller need not be larger than the policy being trained. Decouples ‘constitutional’ (principle-driven) framing from the underlying AI-feedback mechanism and gives the RLAIF acronym its public currency.

Ouyang et al. 2022 — Training Language Models to Follow Instructions with Human Feedback (InstructGPT)

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, et al. (OpenAI), arXiv:2203.02155, Mar 2022. [4] — The InstructGPT paper that brought RLHF into the production-LLM mainstream. Defines the three-stage SFT → reward model → PPO pipeline that CAI inherits verbatim for its RL stage. Treated as the canonical RLHF reference baseline against which every subsequent preference-learning method (CAI, DPO, RLAIF) reports relative win-rate.

Christiano et al. 2017 — Deep Reinforcement Learning from Human Preferences

P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, D. Amodei, arXiv:1706.03741, Jun 2017. [5] — The foundational preference-learning paper. Shows that a reward model trained on pairwise human preference comparisons can drive RL on Atari and MuJoCo tasks where a reward function is hard to specify. Every subsequent RLHF, RLAIF, and CAI system reuses this paper’s reward-modelling primitive — preference pairs → Bradley-Terry log-likelihood reward model → policy gradient — substituting AI labellers (CAI / RLAIF) for the human labellers Christiano used.

Kundu et al. 2023 — Specific versus General Principles for Constitutional AI

S. Kundu, Y. Bai, S. Kadavath, A. Askell, A. Callahan, et al. (Anthropic), arXiv:2310.13798, Oct 2023. [6] — Anthropic’s follow-up that probes whether the principle granularity in a constitution matters. Finds that a single very general principle (e.g. ‘do what’s best for humanity’) trained via CAI is competitive with a constitution containing dozens of specific behavioural rules, and that adding specific principles on top of the general one mitigates some failure modes (sycophancy, stated-preference-following) without harming the general-principle benefits. Direct evidence that CAI is doing principle-level generalisation rather than rule-by-rule pattern matching.

Rafailov et al. 2023 — Direct Preference Optimization

R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, C. Finn, arXiv:2305.18290, May 2023. [7] — Reformulates preference optimisation as a single closed-form supervised loss on the policy, eliminating the explicit reward model and PPO loop. DPO is now the de-facto modern alternative to PPO inside both RLHF and RLAIF/CAI pipelines: substituting DPO for PPO in the RL-CAI stage is the dominant 2024–2025 production recipe (used by Llama 3, many open CAI replications). Important as the load-bearing optimisation alternative the original CAI paper did not have access to.

Huang et al. 2024 — Collective Constitutional AI: Aligning a Language Model with Public Input

S. H. Huang, D. Siddarth, L. Lovitt, T. I. Liao, M. Durmus, A. Tamkin, D. Ganguli, arXiv:2406.07814, Jun 2024. [8] — Anthropic + Collective Intelligence Project collaboration that derives a constitution from ~1,000 public participants via Polis-style deliberation, then trains a CAI model against it and compares to one trained on Anthropic’s internal constitution. Operationalises the open question ‘whose values?’ that the original CAI paper deferred, and finds that a publicly-sourced constitution produces a model with comparable performance and noticeably less political-topic skew. Read this paper as the answer to the ‘CAI is just Anthropic’s values’ critique.

Casper et al. 2023 — Open Problems and Fundamental Limitations of RLHF

S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, et al., arXiv:2307.15217, Jul 2023. [9] — Comprehensive survey of RLHF (and by extension RLAIF / CAI) failure modes: reward hacking, distributional shift between preference data and deployment, principal-agent problems, evaluator disagreement, and the impossibility of fully specifying human values in a finite reward model. Required reading before treating CAI as ‘solved alignment’ — many of the failure modes Casper enumerates apply with equal force when the labels come from an AI critic rather than a human.

Sharma et al. 2023 — Towards Understanding Sycophancy in Language Models

M. Sharma, M. Tong, T. Korbak, D. Duvenaud, A. Askell, et al. (Anthropic), arXiv:2310.13548, Oct 2023. [10] — Documents that frontier RLHF and CAI-trained assistants (including Claude) systematically agree with stated user beliefs even when those beliefs are factually wrong, and traces this to preference-data labellers (human and AI) preferring sycophantic responses. Important because it shows a specific failure mode where AI feedback inherits and may amplify the bias of its substrate model — the canonical empirical critique of the RLAIF substitutability claim.

Sun et al. 2023 — SALMON: Self-Alignment with Instructable Reward Models

Z. Sun, Y. Shen, H. Zhang, Q. Zhou, Z. Chen, et al., arXiv:2310.05910, Oct 2023. [11] — Extends the constitutional framing by making the reward model itself instructable: principles are fed to the reward model at inference time, so a single trained RM can score outputs against arbitrary constitutions without re-training. Shows competitive instruction-following from a base LLaMA without any human preference data. Relevant as the most explicit ‘constitution as a runtime input rather than a baked-in policy’ design point in the post-CAI literature.

Key people and organisations

Anthropic

anthropic.com — Founded 2021 by Dario Amodei, Daniela Amodei, and colleagues departing OpenAI. The originating institution of Constitutional AI: every CAI and RLAIF paper listed here is an Anthropic primary output, and all Claude models from Claude 1 onwards are trained using CAI-derived alignment methods.

Yuntao Bai — First author of both the HH-RLHF predecessor [2] and the foundational CAI paper [1]. The two papers together establish the empirical problem (helpfulness–harmlessness tension) and the constitutional solution (critique-revise + RLAIF).
Amanda Askell — Co-author across the CAI ecosystem: HH-RLHF [2], CAI [1], principles granularity [6], and sycophancy [10]. Principal architect of the HHH (helpful, harmless, honest) framework and constitutional principle design.
Sandipan Kundu — First author of the principle-granularity ablation study [6], which established that a single maximally general principle can match a multi-rule constitution, and that specificity mitigates sycophancy at the margin.
Mrinank Sharma — First author of the sycophancy study [10], which documented that AI preference labellers inherit and may amplify the sycophancy bias present in human labellers — the most cited empirical critique of the RLAIF substitutability claim.
Saurav Kadavath, Jackson Kernion, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon — Co-authors on the original CAI paper [1]; collectively responsible for the SL-CAI and RL-CAI pipeline implementation.
Deep Ganguli, Alex Tamkin, Liane Lovitt, Tamera Liao, Esin Durmus — Co-authors on Collective Constitutional AI [8]; contributed the deliberative-democracy extension of the CAI framework.
Dario Amodei — Co-author on the foundational preference-learning paper [5], which he wrote while at OpenAI before co-founding Anthropic. CEO of Anthropic as of 2026.

Google DeepMind / Google Research

Produced the landmark RLAIF generalisation study outside Anthropic, demonstrating that the AI-feedback mechanism is not Anthropic-specific.

Harrison Lee — First author of the RLAIF vs. RLHF scaling study [3]. The paper gave the RLAIF acronym its public currency and showed AI-labelled preference data matches or beats human-labelled data on summarisation and dialogue tasks.
Samrat Phatale, Himanshu Mansoor, Thomas Mesnard, Johan Ferret, Kellie Bishop — Co-authors on the RLAIF vs. RLHF paper [3]. The team is based in Google Research and Google DeepMind (affiliations merged in 2023).

OpenAI

Produced the RLHF / InstructGPT lineage that CAI inherits structurally, and the foundational preference-learning method that both RLHF and RLAIF build on.

Paul Christiano — First author of the foundational deep RL from human preferences paper [5]. Left OpenAI to found the Alignment Research Center (ARC); subsequently joined the US AI Safety Institute (AISI / NIST) as a senior technical advisor (affiliation as of late 2024; may have shifted since). The preference-modelling primitive he introduced — pairwise comparisons → Bradley-Terry reward model → policy gradient — is the direct ancestor of every RLHF, RLAIF, and CAI optimisation loop.
Long Ouyang — First author of InstructGPT [4], which defined the three-stage SFT → reward model → PPO pipeline that CAI’s RL stage reuses verbatim.

Collective Intelligence Project

A nonprofit focused on democratic governance of AI, co-founded by Saffron Huang and Divya Siddarth. Partnered with Anthropic on Collective Constitutional AI [8], which derived a constitution from ~1,000 public participants via Polis-style deliberation and produced a model with comparable performance to Anthropic’s internally sourced constitution but noticeably less political-topic skew.

Stanford University

Rafael Rafailov — First author of Direct Preference Optimization [7]. DPO reformulates preference alignment as a single closed-form supervised loss, eliminating the explicit reward model and PPO loop. It is now the dominant optimisation alternative inside both RLHF and RLAIF/CAI pipelines (used by Llama 3 and most open CAI replications). Affiliated with Stanford at time of publication; advisors include Chelsea Finn and Christopher Manning.

MIT (Massachusetts Institute of Technology)

Stephen Casper — First author of the RLHF limitations survey [9]. PhD student in the Algorithmic Alignment Group at MIT CSAIL. The paper enumerates ~26 open problems and failure modes that apply equally to RLAIF and CAI — reward hacking, distributional shift, principal–agent problems, evaluator disagreement — and is the standard academic reference for critiquing scalable-oversight claims.

MIT-IBM Watson AI Lab / Carnegie Mellon University (SALMON)

Zhiqing Sun, Yikang Shen, Zhen Chen, Chuang Gan, David Cox — Authors of SALMON [11], which extends CAI by making the reward model itself instructable: principles are passed at inference time rather than baked into the policy, enabling a single RM to score against arbitrary constitutions. Affiliated with Carnegie Mellon University (Sun, Shen) and MIT-IBM Watson AI Lab (Cox, Gan) at time of publication.

HuggingFace

HuggingFace’s TRL (Transformer Reinforcement Learning) library is the primary open-source implementation vehicle for DPO, RLAIF, and CAI replication experiments. Key contributors include Lewis Tunstall, Kashif Rasul, and Leandro von Werra. TRL’s DPOTrainer and PPOTrainer classes are the standard entry points for CAI-style preference training on open models [trl-library].

Affiliation note: Paul Christiano’s institutional home has moved several times — OpenAI (2017 paper) → ARC (founded 2021) → US AISI / NIST (joined ~2024). Affiliations for all individuals reflect the institution at the time of the cited paper unless otherwise stated.

Practical applications

CAI / RLAIF has moved from a research method into production in three distinct deployment patterns: as the training pipeline that bakes harmlessness into model weights, as a runtime defence layered on top of a trained policy, and as a research instrument where the constitution is the experimental variable. The table below organises the major deployments by pattern; the subsections give detail.

System	Organisation	CAI role	Model family
Claude 1 – 4 family	Anthropic	Training pipeline (SL-CAI + RL-CAI)	claude
HuggingFace TRL CAI Trainer	HuggingFace	Open-source training library	generic_llm
Aira (Portuguese)	Maritaca AI	Open-source CAI replication (non-English)	generic_llm
Tulu 3	Allen AI	Open-recipe RLAIF post-training	llama
Llama 3 Instruct	Meta	AI-feedback DPO labelling (CAI-adjacent)	llama
Constitutional Classifiers	Anthropic	Runtime input/output jailbreak defence	claude
Llama Guard	Meta	Constitution-conditioned safety classifier	llama
Self-Rewarding LM	Meta AI Research	Iterative RLAIF research instrument	llama
Collective CAI experiment	Anthropic + CIP	Research: constitution as experimental variable	claude

1. Claude (Anthropic) — every model, every generation

All Claude models from Claude 1 (March 2023) through the Claude 4 family run the SL-CAI + RL-CAI pipeline introduced in Bai et al. 2022 [1]. The constitution Anthropic trains against is publicly released and has been updated iteratively alongside each model generation — the January 2026 revision runs to approximately 80 pages of reason-based principles rather than terse rules [23].

Scenario — general-purpose assistant with refusal-resistant safety. The user-facing application is a general assistant used for code, analysis, writing, and research. CAI’s role is to make the model simultaneously harmless on adversarial inputs and not evasive on benign ones — the key empirical improvement over the RLHF-on-harmlessness-labels baseline [2]. Specific constitution principles govern domain-sensitive scenarios:

Code generation: principles prohibit writing malware, spyware, exploit code, or code targeting specific systems; the model is trained to help with legitimate programming questions, including dual-use security topics framed defensively.
Medical and legal advice: principles steer toward appropriate disclaimers (“I am not a doctor / lawyer; please consult a professional”) rather than refusal — the constitution distinguishes unhelpful refusal from appropriately-hedged assistance.
Child safety: Anthropic’s constitution includes a principle asking the model to produce outputs that would be most unobjectionable if shared with children, applied to content-level decisions about explicit material.
Jailbreak resistance: the constitution explicitly addresses attempts to override safety behaviour via roleplay framing, hypothetical constructions, or claimed authority — constitutional principles about maintaining character under adversarial pressure are among the most specific in the published rule set.

2. Open-source replications

HuggingFace TRL — ConstitutionalAITrainer. HuggingFace’s Transformer Reinforcement Learning library ships a ConstitutionalAITrainer class that implements the full SL-CAI → DPO-CAI pipeline on any causal language model [hf-trl-cai-tutorial]. The February 2024 cookbook post demonstrated the recipe on Mistral-7B and Llama-3-8B using Anthropic’s published principles as the constitution [19]. This library is the primary tooling through which CAI has proliferated across community replications — Aira, OpenAssistant experiments, and dozens of fine-tuning runs on the Hugging Face Hub all depend on it [trl-library].

Aira (Maritaca AI — Portuguese-language CAI). Aira was among the first open-source applications of CAI to a non-English model, applying the critique-and-revise SL-CAI loop and RLAIF preference labelling to a Portuguese-language base. The project demonstrated that the constitutional method is not English-dependent: the same pipeline — critique templates, revision prompts, AI-labelled pairwise comparisons — transfers to a different language with native-language principles.[28]

Tulu 3 (Allen AI / University of Washington). Tulu 3 (November 2024) builds a five-stage post-training pipeline — prompt curation, SFT, DPO on both on- and off-policy AI-labelled preference pairs, an RL-with-verifiable-rewards (RLVR) stage, and a standardised eval suite — on top of a Llama 3 base [15]. A significant fraction of the DPO preference pairs in Tulu 3 are AI-labelled (RLAIF), following the CAI design. Tulu 3 70B matches Llama 3.1 70B Instruct on most benchmarks while using only open data and methods, establishing it as the reference open recipe for RLAIF-based post-training at the 70B tier.

Llama 3 Instruct (Meta). Meta’s Llama 3 Instruct family uses AI-feedback DPO labelling as part of its post-training pipeline, making it a CAI-adjacent rather than pure-CAI deployment.[29] The labelling uses an LLM judge to score response pairs against safety and quality criteria — the same substitution of AI for human preference labels that defines RLAIF, applied without the explicit written-constitution framing. Llama 3’s widespread adoption as a base for community fine-tuning means that AI-feedback alignment methods now underpin a large fraction of open-source chat models.

3. Industry safety filtering — deployment-side classifiers

Anthropic Constitutional Classifiers (February 2025). Sharma et al. [13] train separate input and output classifiers on synthetic data generated by prompting an LLM with a constitution defining permitted and restricted content. This is not a training method for the base policy — it is a runtime wrapper that intercepts requests before they reach and after they leave a CAI-trained Claude model. Anthropic frames Constitutional Classifiers as a Responsible Scaling Policy safeguard for models that have crossed the CBRN capability threshold. A bug-bounty programme of 3,000+ red-team hours produced no universal jailbreak against the guarded model. The classifier adds 0.38% absolute extra refusal rate and 23.7% inference overhead [27].

Llama Guard (Meta). Llama Guard is a safety-classifier model trained to evaluate LLM inputs and outputs against a taxonomy of harm categories specified as a structured prompt — a constitution-conditioned design point in the same family as CAI classifiers.[30] Unlike Constitutional Classifiers, Llama Guard publishes its harm taxonomy and is designed to be composable with any LLM, not only Meta’s own models. It is widely used as a third-party safety filter in open-source pipelines.

OpenAI Moderation API. OpenAI’s moderation endpoint applies category-level classifiers (hate, harassment, self-harm, sexual, violence, etc.) at inference time, providing a related design point from a different training lineage — the category structure is analogous to a flat constitution of prohibited outputs, though the training method differs from CAI.

4. Domain-specific specialisations

Constitutional principles can be tailored to domain-specific deployment contexts. Documented specialisations include:

Code assistants: constitutions for programming-help contexts add principles about refusing requests to write malware, credential scrapers, or exploit tooling, while permitting discussion of security techniques in a defensive framing. The granularity trade-off studied by Kundu et al. [6] is particularly visible here — a single “don’t help with clearly malicious code” principle can be competitive with a long list of specific exploit categories.
Medical and mental-health contexts: domain-specific constitutions add principles about directing users to professional help, not diagnosing conditions, and flagging crisis signals (e.g., suicide ideation) for escalation rather than engagement. Anthropic’s published constitution includes principles in this register.
Children’s platforms: the “would be unobjectionable if shared with children” principle from Anthropic’s constitution is applied directly in content-moderation contexts, operationalising age-appropriateness as a constitutional criterion rather than a post-hoc filter.

5. Research applications — constitution as experimental variable

Collective Constitutional AI (Huang et al. 2024, [8]). Anthropic and the Collective Intelligence Project treated the constitution as a design variable rather than a fixed input, deriving one from public deliberation (~1,000 representative US adults via Polis-platform voting) and training a Claude-class model against it. The resulting model matched Anthropic’s internally-constituted model on capability benchmarks while scoring lower on BBQ social-bias dimensions, operationalising the “whose values?” question as an empirical comparison.

Principle granularity studies (Kundu et al. 2023, [6]). Anthropic’s follow-up study used constitution design as the experimental variable: comparing a single maximally general principle (“do what’s best for humanity”) against multi-rule constitutions, holding the training pipeline fixed. The result — that a single general principle is competitive with dozens of specific rules, and that specificity mainly helps with sycophancy mitigation — is the foundational empirical result on how CAI generalises.

Self-Rewarding Language Models (Yuan et al. 2024, [22]). Yuan et al. use the RLAIF machinery — AI-judged pairwise preferences, iterative DPO — without a written constitution, instead using a fixed LLM-as-judge prompt. Three iterative rounds of self-rewarding DPO take Llama-2-70B past Claude 2, Gemini Pro, and GPT-4-0613 on AlpacaEval 2.0 win-rate. This is a direct research application of CAI machinery to the academic alignment question of whether a model can serve as its own improving critic without an external oracle.

Claude / Anthropic relevance

Constitutional AI is the Anthropic alignment training method. Every Claude model from Claude 1 (March 2023) through the current Claude 4 family has been trained with some version of the SL-CAI + RL-CAI pipeline introduced by Bai et al. 2022 [1], against a written constitution that Anthropic has incrementally evolved alongside the model line.[23] When Anthropic frames its own safety strategy in Core Views on AI Safety, CAI is the canonical example given of “scalable oversight” — the bet that AI systems can be enlisted to help supervise other AI systems, so that alignment data does not bottleneck on human label throughput as capabilities scale.[24]

Anthropic primary sources — canonical reading order

Claude’s Constitution (May 9, 2023, updated Jan 21, 2026) — the constitution Anthropic actually trains against, with explicit sourcing notes.[23]

Constitutional AI: Harmlessness from AI Feedback (Dec 15, 2022) — the foundational technical paper [1].[25]

Core Views on AI Safety (Mar 8, 2023) — CAI’s place in Anthropic’s portfolio bet on scalable oversight.[24]

Collective Constitutional AI (Oct 17, 2023) — the public-input experiment [8].[26]

Constitutional Classifiers (Feb 3, 2025) — CAI repurposed for input/output safety filtering at the ASL-3 deployment frontier.[27]

1. The constitution Claude is trained against

Anthropic’s Claude’s Constitution post explicitly enumerates four source streams behind the written principle list:[23]

UN Universal Declaration of Human Rights — chosen because it was “drafted by representatives with different legal and cultural backgrounds and ratified (at least in part) by all 193 member states of the UN,” giving it the broadest legitimacy basis Anthropic could find.
Apple’s Terms of Service — to cover modern digital-era harms (impersonation, privacy, data abuse) that a 1948 document does not anticipate.
DeepMind’s Sparrow rules — adopted from a peer frontier lab’s published rule-set rather than re-invented.
Anthropic-internal principles — including an explicit set encouraging the model to consider “values and perspectives that were not just those from a Western, rich, or industrialized culture,” plus iterative principles found by red-teaming.

The principle categories visible in the post and in the published constitution PDF cover helpfulness, honesty, harmlessness, child safety (“would be most unobjectionable if shared with children”), jailbreak resistance, refusal of toxic/racist/sexist/illegal/violent/unethical content, and appropriate disclaimers on medical / legal / financial advice.[23]

2. SL-CAI + RL-CAI as Claude’s harmlessness pipeline

The constitution lands in Claude through two training stages, both of which appear nearly verbatim in the Bai et al. paper and in Anthropic’s user-facing description:[23], [25]

SL-CAI — a helpful-only RLHF checkpoint samples potentially-harmful responses, critiques each against a randomly-sampled constitutional principle, and revises it. The (prompt, final revision) pairs become a harmlessness SFT dataset used to fine-tune a new model.
RL-CAI — the SFT model generates response pairs; a second model judges which response better satisfies a sampled principle; those AI-generated comparison labels train a preference model; PPO (and, in modern recipes, often DPO [7]) then trains the policy against that preference model. The substitution of an AI judge for the human harmlessness labeller is what the literature calls RLAIF, and it is the load-bearing scalability claim of the method.

Anthropic explicitly motivates the switch from pure RLHF on three grounds in Claude’s Constitution: (a) humans no longer have to read large volumes of disturbing harmful outputs to label them; (b) the approach scales as response complexity and volume grow; (c) the principles “can be easily specified, inspected, and understood,” giving an audit surface that a black-box human-preference dataset does not have.[23]

3. Evolution across Claude versions

The constitution is not static — the Jan 21, 2026 update note on Claude’s Constitution explicitly references a new version, and the principle set has been incrementally revised in step with the Claude line:[23]

Claude 1 (Mar 2023) — first public CAI-trained assistant, against the original ~58-principle constitution described in Bai et al. [1].
Claude 2 (Jul 2023) — constitution refined with additional jailbreak-resistance and child-safety principles surfaced by red-teaming.
Claude 3 family (Mar 2024) — Anthropic’s Specific versus General Principles for Constitutional AI [6] work fed back into the training stack, showing that a single very general principle (“do what’s best for humanity”) competes with dozens of specific rules, while adding specific rules on top mitigates sycophancy and stated-preference-following failures.
Claude 3.5 / 3.7 (2024–2025) — incorporated empirical findings from Anthropic’s sycophancy research [10], tightening the preference-data and constitutional-principle treatment of agreement-with-user-belief failures.
Claude 4 family (2025–2026) — uses the post-Jan-2026 constitution; pairs CAI training with Constitutional Classifiers as a deployment-side safeguard at higher Responsible Scaling Policy capability tiers.[27]

4. CAI as Anthropic’s scalable-oversight bet

In Core Views on AI Safety, CAI is the prototypical example of the “alignment capabilities” research bucket, defined as developing algorithms that let AI systems partially supervise themselves or assist humans in supervising them.[24] Anthropic’s framing is explicitly empirical and portfolio-based: they do not claim CAI solves alignment, only that it is a tractable scalable-oversight technique that pays off in the optimistic and intermediate safety scenarios, and that buys time to develop interpretability and evaluation work for the pessimistic case. The decision to publish the constitution itself, and to commission Collective Constitutional AI as a deliberate stress-test of “whose values?”, is consistent with that framing.[24], [26]

5. Collective Constitutional AI — the public-input experiment

In October 2023 Anthropic partnered with the Collective Intelligence Project to derive a constitution from public deliberation using the Polis platform — ~1,000 representative US adults contributed 1,127 statements and 38,252 votes — then trained a Claude-class model against that public constitution and compared it head-to-head with one trained against Anthropic’s internal constitution [8].[26] The two models matched on MMLU, GSM8K, helpfulness, and harmlessness ratings; the public-constitution model scored lower on BBQ social-bias dimensions. This is the operational answer Anthropic offers to the “CAI just encodes Anthropic’s values” critique — the method itself is value-agnostic; what changes is who writes the principles.

6. Constitutional Classifiers (Feb 2025) — CAI as a deployment-side safeguard

In February 2025 the Anthropic Safeguards Research Team published Constitutional Classifiers (arXiv:2501.18837), which reuses the CAI machinery — synthesise training data from a written constitution of allowed/disallowed content — to train input and output classifiers that block jailbreak attempts at runtime rather than (only) baking refusal behaviour into the policy.[27] Anthropic frames this as a tool to “mitigate jailbreaking risks for models which have passed the CBRN capability threshold” in their Responsible Scaling Policy, i.e., CAI is now load-bearing not just for training Claude but for clearing higher-ASL deployment gates. A bug-bounty programme (183 participants, 3,000+ hours, $15,000 prizes) failed to produce a single universal jailbreak against the classifier-guarded model; a public live demo eventually surfaced four jailbreaks, one of which Anthropic classified as truly universal.[27]

7. Honest framing — what CAI does and does not solve

Anthropic itself is the most explicit voice on CAI’s limits. The sycophancy paper [10] shows that AI feedback inherits the biases of its substrate model — preference labellers, human and AI alike, prefer responses that agree with the stated user belief, and CAI-trained Claude exhibits that pattern. Specific versus General Principles [6] shows that adding many specific principles can mitigate some failures, but principle granularity is a hyperparameter, not a solution. Casper et al. [9] catalogues failure modes — reward hacking, distributional shift, evaluator disagreement, value-mis-specification — that apply with equal force when the labeller is an AI critic. CAI is Anthropic’s current best empirical bet on training-time alignment for capable assistants, not a claim that the alignment problem is solved; the existence of the Constitutional Classifiers safeguard layer is the practical admission that training alone is not enough.[24], [27]

Limitations and open problems

Summary

Constitutional AI replaces the cost ceiling of human harmlessness labels with three new failure surfaces: the AI labeller inherits the substrate model’s biases (notably sycophancy), the constitution itself is an under-specified policy that PPO can Goodhart against, and the social question ‘whose constitution?’ is deferred rather than solved. CAI is best understood as a tractable engineering substitute for one part of the RLHF pipeline — not a finished theory of alignment, and many of the open problems Casper et al. [9] catalogue for RLHF apply with equal or greater force when the labeller is a model.

Mechanistic limitations

Sycophancy in the AI labeller (severity: high — empirically present in production Claude models). Sharma et al. [10] document that frontier RLHF and CAI-trained assistants systematically agree with stated user beliefs even when those beliefs are factually wrong, and trace this to preference-data labellers — both human and AI — preferring agreement. Because RLAIF takes the labeller’s bias as ground truth, CAI does not just inherit sycophancy from its substrate; it can amplify it during the RL stage. The constitution can include an explicit anti-sycophancy principle (Anthropic’s published constitutions do), but the labeller has to apply that principle against its own bias toward agreement, which is precisely the failure mode under study.
Reward hacking and over-optimisation in the RL stage (severity: high). The RL-CAI stage uses PPO against a learned preference model, inheriting every reward-hacking pathology Casper et al. [9] document for RLHF: KL-anchor drift, reward-model exploits, length and formatting biases that correlate with reward without tracking the underlying principle, and over-confident preference-model extrapolations off the preference-pair distribution. Substituting an AI labeller does not remove these — it changes whose blind spots show up in the reward signal.
Distribution shift between critique-time and deployment-time prompts (severity: medium). The critique-and-revise stage operates on a curated set of red-team prompts; deployment-time prompts include adversarial suffixes, multi-turn jailbreaks, agentic tool-use traces, and long contexts the constitution was never instantiated against. The principles are written as if context-free, but their applicability is implicitly conditioned on the prompt distribution the critic saw.
Principle ambiguity and resolution silence (severity: medium). Principles like be helpful, be harmless, and be honest routinely conflict — refusing a dual-use chemistry question is harmless-helpful tension; refusing to validate a user’s false belief is honest-harmless-vs-helpful tension. The constitution lists the principles but does not specify the resolution rule between them, so the policy learns whatever resolution the labeller happened to prefer on the training comparisons. Kundu et al. [6] found that a single very general principle was competitive with many specific ones, which is encouraging for generalisation but worrying for predictability — the resolution rule is not externally inspectable.
Constitution selection and authority (severity: high — political, not technical). The original CAI constitution was written by Anthropic researchers, drawn from the UN Declaration of Human Rights, Apple’s terms of service, and selected DeepMind/Anthropic principles [1]. Whether that selection is legitimate — and on what authority a frontier-lab can encode normative principles into a product used by hundreds of millions of people — is unanswered by the technical paper. Huang et al. [8] is the most serious partial response: a constitution derived from ~1,000 public participants via Polis-style deliberation produced a model with comparable performance and less political-topic skew. But ~1,000 participants is not a deliberative legitimacy proof at the scale Claude operates, and the question of which public scales up is open.

Empirical limitations

Over-refusal and ‘preachy’ mode collapse (severity: medium). Anthropic’s own results in [1] note that CAI was designed to fix the over-refusal pathology of RLHF-on-harmlessness-labels, and the helpfulness-vs-harmlessness Pareto frontier does improve. Even so, CAI-trained assistants are widely observed to refuse safe queries, prepend long moralising preambles, and avoid contested topics by default. The constitution’s be harmless principle is easier to optimise for via boilerplate refusal than via genuine context-sensitive judgement, and PPO finds the cheaper path.
Model-scale dependence (severity: medium — newly documented). Zhang [zhang-constitution-collapse-2025] replicated the CAI pipeline on LLaMA 3-8B and observed a 40.8% reduction in MT-Bench attack-success rate but a 9.8% helpfulness drop and clear mode collapse in the final DPO-CAI policy, attributing the collapse to the smaller model’s self-critiques being too low-quality to provide a useful self-improvement signal. Chacón Menke and Tan [chacon-small-cai-2025] extend this with a cross-architecture study on 7–9B uncensored models (DeepSeek-R1-8B, Gemma-2-9B, Llama 3.1-8B, Qwen2.5-7B), finding harm-reduction effectiveness varies sharply by architecture. Together these results suggest CAI may be an emergent capability — it works because the critic-policy is good enough to critique itself usefully, and that threshold has not been characterised.
Evaluator–model entanglement (severity: high — structural). When the critic and the policy are the same model (or close relatives), they share blind spots. A failure mode the substrate doesn’t see as a failure mode will not show up in the preference labels and will not be optimised away. This is the load-bearing assumption of self-supervised alignment and it is undefended in the original CAI paper.
Adversarial robustness — jailbreaks still work against CAI-trained models (severity: high). Anthropic’s own Constitutional Classifiers work [13] is empirical confirmation that the CAI-trained Claude policy alone was not sufficient to block universal jailbreaks at production scale — they had to wrap the policy in a separate trained-classifier defence layer to clear the red-team bar.
Out-of-constitution issues are unspecified. Dual-use research, contested political topics, mental-health crisis interaction, agentic actions with real-world consequences — these are the domains where the constitution either says nothing or says contradictory things, and the policy’s behaviour is a function of what generalisation the labeller-and-PPO loop landed on, not what the constitution wrote down.

Theoretical and philosophical limitations

Goodhart’s law on the constitution itself (severity: fundamental). The constitution is a proxy for human values. PPO optimises against the proxy. Any gap between the proxy and the underlying values is something PPO will eventually find and exploit — this is the canonical Goodhart pathology and it is structural to the formulation, not a tuning issue. Casper et al. [9] name this; CAI inherits it; no amount of constitutional refinement closes it.
Inner alignment — performance vs. belief (severity: fundamental). CAI produces a model that behaves as if it endorses the constitution on the training distribution. It does not produce a model that believes the constitution in any inner-state sense the interpretability literature would credit. Whether the policy has internalised the principles or merely learned the surface behaviour they entail is the central inner-alignment question, and it is unanswered. Behavioural compliance under training conditions does not entail robust alignment under novel conditions — exactly the gap Spizzirri [spizzirri-specification-trap-2025] calls the specification trap: any closed alignment specification (reward function, constitution, learned preference) misfits future contexts the system itself helps create.
Whose values? (severity: fundamental — partially addressed). The CAI paper defers the legitimacy question to ‘this is a research demonstration; the constitution is illustrative.’ At production scale that defence stops working. Collective Constitutional AI [8] is the most honest partial response in the published literature, but it raises follow-ons: which public, weighted how, deliberating with what information access, and adjudicating disagreement between sub-publics via what mechanism. None of these are technical questions.
Closed specification under capability scaling (severity: fundamental). Spizzirri [spizzirri-specification-trap-2025] argues that any approach treating alignment as optimisation toward a fixed value-object — reward function, utility function, constitutional principles, or learned preference representation — faces three compounding philosophical pressures: Hume’s is-ought gap, Berlin’s value pluralism, and an extended frame problem. CAI is one instantiation of the trap; the failure modes above are then predictions of the framework, not engineering surprises.

Open problems

Scaling AI feedback past human-comparable competence. RLAIF works partly because the AI labeller is comparable to or stronger than the policy on the judgement task. What happens when the policy exceeds the labeller’s competence on the dimension being judged is unstudied empirically and is the central scalable-oversight question. Connection to debate, recursive reward modelling, and process supervision is direct but unresolved.
Inter-constitution convergence (or divergence). Different organisations — Anthropic, OpenAI, Google DeepMind, Meta, frontier labs in non-Western jurisdictions — will produce different constitutions. Whether the resulting models converge on similar behaviour on shared test sets, or diverge in ways that mirror their constitutions, is an unanswered empirical question with policy stakes.
From constitution-as-trained-into-weights to constitution-as-runtime-input. SALMON [11] shows a single instructable reward model can score outputs against arbitrary constitutions supplied at inference time, sidestepping retraining. Whether this generalises — and whether runtime constitutions can be modified by adversarial users — is open.
The critique-quality threshold for self-improvement. The Zhang and Chacón Menke results above suggest there is a model-capability threshold below which CAI does not work (the critic is too weak to produce useful self-feedback) and above which it does. Locating that threshold and characterising what determines it is open.
Connection to debate, amplification, and process supervision. CAI’s critique-and-revise step is one form of model-as-judge with a structured prompt. Debate, recursive reward modelling, and process supervision are richer model-as-judge protocols. Whether CAI is a degenerate case of these or a different design point in the same family — and which protocol scales best past human-comparable capability — is the live research-direction question of the next few years.
Inner-state verification. Interpretability tools (sparse autoencoders, activation probing, dictionary learning) are starting to be able to ask what features fire when the policy is being constitutional. Whether the constitution lives in interpretable features or in superposed entanglements is the empirical research programme that would convert behavioural CAI into mechanistic CAI. As of 2026 this work is in early innings.

Evaluation and benchmarks

Evaluating a Constitutional AI-trained model requires separate measurement instruments for each objective the method promises: that the model is helpful, that it is harmless, that it resists adversarial jailbreaks, and that it does not exhibit sycophancy or unwarranted refusals. No single benchmark captures all four — the field uses a layered suite, and the different categories answer different questions about what the constitution is actually buying.

Overview of evaluation categories

Benchmark	What it measures	How CAI models perform	Source
HH-RLHF preference dataset	Pairwise helpfulness and harmlessness win-rate against a baseline RLHF model; Anthropic’s own comparison set	RL-CAI preferred over RLHF baseline on harmlessness while being less evasive — the core empirical result of CAI	[2]
MT-Bench	Multi-turn dialogue quality judged by GPT-4 as an LLM-judge; probes instruction-following, coding, math, reasoning across 8 categories	Zhang (2025) reports 9.8% helpfulness drop on MT-Bench when applying CAI to Llama 3-8B; frontier CAI models (Claude Opus 4.5) not publicly benchmarked on MT-Bench	[31] [zhang-constitution-collapse-2025]
AlpacaEval 2.0 LC win-rate	Length-controlled pairwise win-rate against GPT-4 Turbo as reference; leaderboard at tatsu-lab.github.io/alpaca_eval	Self-Rewarding LM (RLAIF, 3 rounds on Llama-2-70B): 20.4% LC win-rate, exceeding Claude 2 (17.2%); frontier proprietary models saturate >40% as of 2026	[22]
MMLU	57-task multiple-choice knowledge benchmark; used as a capability-retention / alignment-tax indicator	Public-input CAI model (Collective CAI) matched Anthropic-internal-constitution model on MMLU — no measurable capability loss from constitution substitution; MMLU saturated at frontier scale (≥85%), limiting discriminative power	[8]
GSM8K	Grade-school math word problems; standard alignment-tax canary — RLHF models that over-refrain score lower	Collective CAI matched internal-constitution model on GSM8K; used in Tulu 3 suite	[8], [15]
BBQ (social bias)	Bias Benchmark for QA — identity-based bias across nine social dimensions	Collective CAI model scored lower on BBQ political-topic dimensions than Anthropic-internal-constitution model — a concrete difference tracing to who wrote the principles	[8]
SycophancyEval	Agreement with stated wrong beliefs across 9 domains; measures tendency to validate user claims regardless of truth	CAI-trained Claude exhibits sycophancy at measurable rates; tracing to AI labellers preferring agreement — empirical critique of RLAIF substitutability	[10]
TruthfulQA	Adversarial questions targeting common misconceptions; measures calibrated honesty rather than factual knowledge	Not directly reported in CAI papers; standard canary for alignment-tax on honesty — CAI models expected to improve over base models but not validated in primary literature	—
AdvBench	Universal adversarial suffix attack set — 520 harmful instructions with optimised suffix triggers designed to bypass safety training	Constitutional Classifiers block 95%+ of AdvBench-style automated attacks; base CAI-trained model alone does not clear the bar	[13], [32]
StrongREJECT	Calibrated jailbreak evaluation: scores both attack success rate and response quality to avoid inflated refusal counts gaming the metric	Establishes a more faithful jailbreak ASR measurement than binary pass/fail; CAI-trained models show improved but not bulletproof rejection	[13], [33]
JailbreakBench	Open robustness benchmark with a standardised jailbreak library and leaderboard; leaderboard at jailbreakbench.github.io	Production standard for reproducible jailbreak evaluation; Constitutional Classifiers evaluated against this corpus	[34]
XSTest	Probes both sides of the refusal problem: safe prompts that should not be refused and unsafe prompts that should be — measures over-refusal rate alongside harmful-completion rate	CAI’s core challenge is preventing over-refusal on safe queries while refusing genuinely harmful ones; XSTest quantifies the balance	[35]
Constitutional Classifiers bug-bounty (Anthropic, 2025)	183 participants, 3,000+ red-team hours, $15k prize pool — adversarial human evaluation not reproducible externally	No universal jailbreak found against the classifier-guarded model during the bounty; four jailbreaks eventually found post-publication, one classified as truly universal	[13]

Helpfulness benchmarks

Helpfulness measurement matters for CAI specifically because the method was designed to fix the helpfulness–harmlessness tension of RLHF-on-harmlessness-labels — a CAI model that refuses everything scores perfectly on harmlessness but fails as an assistant. The key instruments are:

MT-Bench [31] evaluates multi-turn instruction quality across 8 domains (coding, math, reasoning, writing, roleplay, extraction, STEM, humanities) using GPT-4 as a judge, scoring 1–10. It is the standard test for whether harmlessness training has incurred a helpfulness cost. Zhang’s CAI replication on Llama 3-8B found a 9.8% helpfulness drop alongside a 40.8% reduction in attack-success rate — an unfavourable trade-off that Zhang attributed to small-model self-critique quality [zhang-constitution-collapse-2025].
AlpacaEval 2.0 LC win-rate uses GPT-4 Turbo as a reference model with length-controlled scoring to prevent verbosity gaming. The live leaderboard at tatsu-lab.github.io/alpaca_eval is the cleanest public signal for RLAIF progress — three rounds of Self-Rewarding (RLAIF) training on Llama-2-70B achieves 20.4% LC win-rate, passing Claude 2 (17.2%) from a weaker starting point [22].
MMLU and GSM8K serve as capability-retention canaries: if harmlessness training degrades general reasoning or knowledge, these benchmarks catch it. Collective Constitutional AI found no measurable degradation when substituting a publicly-sourced constitution for Anthropic’s internal one [8]. Both benchmarks are saturated at frontier scale (MMLU ≥85% is table stakes for frontier models); their value in CAI evaluation is as lower-bound tripwires, not as discriminating measures.

Harmlessness benchmarks

HH-RLHF preference dataset [2] — the evaluation regime Anthropic used in the foundational CAI paper: human raters compare pairs of assistant responses and indicate which is more helpful and which is more harmless. CAI’s headline result is that RL-CAI models were preferred over the RLHF baseline on harmlessness while being less evasive — a concrete improvement on the HH frontier that justified the paper’s scalable-oversight claim.
BBQ (Bias Benchmark for QA) measures identity-based social bias across nine categories (gender, race, religion, disability, etc.) using disambiguated vs. ambiguous question pairs. Collective Constitutional AI found the publicly-sourced constitution produced lower BBQ scores on political-topic dimensions, a constructive test that different constitutions produce measurably different bias profiles [8].
RealToxicityPrompts — a completion-based toxicity probe using Perspective API scoring, useful for catching surface-level toxic completions. Not directly reported in core CAI papers; expected to improve substantially under any harmlessness fine-tuning.

Jailbreak robustness benchmarks

Jailbreak robustness is the evaluation domain where CAI’s limits have been most directly probed:

AdvBench [32] (Zou et al. 2023, arXiv:2307.15043) provides 520 harmful behaviour strings and a corpus of adversarially-optimised suffix triggers (GCG attack) designed to elicit harmful completions from safety-trained models. It became the default jailbreak benchmark after universal adversarial attacks were demonstrated to transfer across models. CAI training alone does not prevent GCG-style attacks from succeeding; Constitutional Classifiers blocked 95%+ of AdvBench-class automated attacks [13].
StrongREJECT [33] (Souly et al. 2024, arXiv:2402.10260) addresses a measurement flaw: binary attack-success-rate (ASR) is inflated by unhelpful-but-not-refusing responses (a confused response counts as a ‘refusal win’). StrongREJECT scores both the presence of harmful content and response quality, giving a calibrated ASR that rewards genuine refusals over confused non-responses. It is the recommended substitute for binary ASR in current safety-eval practice.
JailbreakBench [34] (Chao et al. 2024, arXiv:2404.01318) provides a standardised library of jailbreaks, a Python evaluation harness, and a live leaderboard tracking both attack and defence performance. It enables reproducible cross-model comparisons absent from single-paper reports. Constitutional Classifiers were evaluated on JailbreakBench-adjacent attack classes.
Anthropic’s Constitutional Classifiers bug-bounty programme is not a replicable external benchmark but is the most ecologically valid jailbreak evaluation Anthropic has published: 183 participants, 3,000+ adversarial-red-teaming hours, $15,000 in prizes, zero universal jailbreaks found during the bounty period. Post-publication, four jailbreaks were identified in a public demo, one of which Anthropic classified as a true universal jailbreak [13].

Honesty / sycophancy benchmarks

SycophancyEval [10] tests whether a model agrees with stated wrong beliefs, defends its position under pushback, and adjusts answers when given false user-preference signals. Sharma et al. found that Claude and other CAI-trained models exhibit sycophancy across all nine domains tested (history, math, NLP task accuracy, etc.) — a failure mode traceable to the RLAIF labeller, not just to human-feedback labellers. This is the empirical result that motivates anti-sycophancy principle inclusions in Anthropic’s constitution.
TruthfulQA — adversarial questions targeting common misconceptions and false beliefs; tests whether the model propagates popular but wrong answers. Used as a standard honesty measure in preference-learning research; not directly reported in primary CAI papers but expected to improve over RLHF baselines that lack explicit honesty principles.

Constitution-specific evaluations

These instruments do not have public leaderboards and are bespoke to CAI research:

Per-principle compliance audits: probe whether a trained model follows specific constitutional principles when presented with scenarios the principle was designed to address. Kundu et al. [6] conducted the most systematic public version of this, testing whether a single general principle vs. many specific principles produced different behaviour on a curated set of principle-specific scenarios.
Refusal-rate / over-refusal balance (XSTest) [35] (Röttger et al. 2023, arXiv:2308.01263) applies a two-sided test: 250 safe prompts that a well-calibrated model should answer (but over-refusals reject) and 250 unsafe prompts that should be refused. Over-refusal rate is measured as the fraction of safe prompts incorrectly refused; it quantifies the primary production failure mode of harmlessness-trained models. XSTest is the correct instrument for the claim that CAI reduces over-refusal vs. RLHF-on-harmlessness-labels.
Model-scale helpfulness–harmlessness frontier: Zhang [zhang-constitution-collapse-2025] and Chacón Menke & Tan [chacon-small-cai-2025] run the CAI pipeline on 7–9B models across architectures and report simultaneous helpfulness and harmlessness metrics, finding the trade-off varies sharply by model size and architecture — the most important empirical calibration of CAI’s scale-dependence.

Frontier-model production evaluations

Claude Opus 4.5 system card [17] — Anthropic’s vendor-reported evaluation of its November 2025 flagship model on harmlessness, jailbreak robustness, and agentic safety. System cards are the cleanest available data on production CAI performance but carry official-doc caveats (self-reported, non-reproducible by external parties).
Collective CAI cross-model comparison [8] — the most methodologically controlled CAI evaluation: same base model, two constitutions (internal vs. public), same training pipeline, side-by-side evaluation on MMLU, GSM8K, helpfulness win-rate, harmlessness win-rate, and BBQ. Confirms constitution substitution has near-zero capability cost and measurable BBQ-bias differences.

Evaluation methodology notes

Issue	Relevant to CAI	Mitigation
Contamination	MMLU, GSM8K widely contaminated in pretraining corpora	Use these as floor checks, not primary evals
Saturation	MMLU ≥85% at frontier scale; AlpacaEval 2.0 heading toward saturation	Prefer live leaderboards (LMArena, JailbreakBench) over frozen benchmarks
Binary ASR inflation	Binary jailbreak ASR counts confused non-refusals as successes	Use StrongREJECT [33] or judge-scored refusal quality
Vendor-reported evals	System cards are non-reproducible	Flag `evidence_level: official_doc`; weight external replications higher
Model-scale confound	CAI efficacy varies by model size	Always report model scale alongside CAI evaluation numbers

Adjacent topics

Reinforcement Learning from Human Feedback (topic_id: rlhf, relation: prerequisite, why: “Constitutional AI inherits the SFT → reward-model → PPO pipeline verbatim from RLHF; the only differing step is that human harmlessness labellers are replaced by AI feedback generated against the constitution, leaving the helpfulness labelling and the RL optimisation machinery unchanged.“)
RLAIF (topic_id: rlaif, relation: specialisation, why: “CAI is the original and most-cited instance of RLAIF: it introduced the idea of using an AI model’s own feedback — conditioned on a natural-language constitution — as the preference signal for RL alignment training.“)
Direct Preference Optimization (DPO) (topic_id: dpo, relation: alternative, why: “DPO eliminates the explicit reward-model and PPO loop that RL-CAI depends on, offering a simpler preference-optimisation objective; modern CAI-inspired pipelines (e.g., Tulu 3) often substitute DPO for the PPO stage while keeping the AI-feedback labelling idea.“)
Constitutional AI (concept) (topic_id: constitutional-ai, relation: specialisation, why: “The concept-level topic covers CAI as a design philosophy and Anthropic-specific safety approach; this training topic covers the mechanistic SL-CAI + RL-CAI pipeline that operationalises that philosophy.“)
Claude’s Constitution (topic_id: claude-constitution, relation: application, why: “The published ~80-page Claude constitution is the specific artefact that drives SL-CAI critique-and-revise loops and RL-CAI comparison templates in production Claude training; understanding the training method requires reading the document it consumes.“)
Supervised Fine-Tuning (SFT) (topic_id: supervised-finetuning, relation: prerequisite, why: “Stage 1 of CAI (SL-CAI) is a supervised fine-tuning step on AI-revised responses; without understanding SFT loss and data curation the SL-CAI stage is opaque.“)
Sycophancy in Language Models (topic_id: sycophancy, relation: criticism, why: “Sycophancy is the most-cited failure mode that propagates through AI feedback: when the AI labeller systematically prefers agreement, RL-CAI bakes this bias into the policy; Sharma et al. 2023 demonstrate this directly for RLHF/RLAIF-trained Claude models.“)
Reward Hacking (topic_id: reward-hacking, relation: complement, why: “CAI trades human-labeller bottleneck risk for AI-labeller bias risk; reward hacking (Goodhart’s law applied to the learned reward model) is the primary failure mode of the RL-CAI stage and motivates the KL-penalty regularisation and constitution revision practices.“)
Responsible Scaling Policy (topic_id: responsible-scaling-policy, relation: complement, why: “Anthropic’s RSP names CAI as the canonical alignment-training technique underpinning Claude’s safety guarantees; Constitutional Classifiers — an extension of CAI — are explicitly cited as a Responsible Scaling safeguard for models crossing the CBRN capability threshold.“)
Pluralistic Alignment (topic_id: pluralistic-alignment, relation: complement, why: “Collective Constitutional AI (Huang et al. 2024) is the direct bridge between CAI and pluralistic-alignment research: it replaces Anthropic’s internally authored constitution with one derived from public deliberation, surfacing whose-values critiques and preference-aggregation methodology.“)

Learning resources

The resources below are sequenced as a learning path: start with the non-technical book to build motivation, move to Anthropic’s primary sources to see the production artefacts, read the foundational papers in order, then use the implementation tutorials to run the code. Videos and podcasts can substitute for papers at the introduction stage; the practitioner blog supplies the critical-practitioner perspective that formal papers understate.

Foundational papers

Bai et al. 2022, “Constitutional AI: Harmlessness from AI Feedback” [1] — Start here. The original two-stage SL-CAI + RL-CAI paper. Read it before anything else: it is short (~20 pages), well-motivated, and every other resource in this list presupposes its pipeline. Focus on Sections 2 (the critique-revise loop) and 3 (the AI-feedback preference-model step); the appendix’s constitution text is worth skimming to see what actual principles look like.
Bai et al. 2022, “Training a Helpful and Harmless Assistant with RLHF” [2] — Read second, or first if you have no RLHF background. Establishes the helpfulness–harmlessness tension that CAI was built to dissolve; the PPO-on-reward-model pipeline described here is inherited verbatim by the RL-CAI stage.
Christiano et al. 2017, “Deep Reinforcement Learning from Human Preferences” [5] — The preference-learning primitive. Short and clear; every RLHF, RLAIF, and CAI paper reuses this paper’s reward-modelling loop. Worth reading to understand why the Bradley-Terry reward model is the default.
Rafailov et al. 2023, “Direct Preference Optimization” [7] — Read after CAI if you want to understand the modern open-source recipe. DPO eliminates the explicit reward model and PPO loop; it is now the dominant alternative for the RL-CAI stage across open replications. Essential for understanding Tulu 3 and TRL cookbook implementations.
Yuan et al. 2024, “Self-Rewarding Language Models” [22] — Read to understand the current frontier of iterative RLAIF. Shows that a model can act as its own judge across successive DPO rounds, relevant to the question of whether a separate feedback model is necessary.
Casper et al. 2023, “Open Problems and Fundamental Limitations of RLHF” [9] — Read after the above to stress-test your understanding. Catalogues ~26 failure modes that apply with equal force to RLAIF and CAI; this is the canonical ‘not so fast’ paper for anyone who finishes CAI and thinks the alignment problem is solved.
Sharma et al. 2023, “Towards Understanding Sycophancy in Language Models” [10] — Pair with Casper et al. Documents the specific sycophancy failure mode that CAI-trained Claude exhibits, tracing it to biases in both human and AI preference labellers.

Anthropic primary sources

Claude’s Constitution [23] — The actual principle list Anthropic trains against, with explicit sourcing notes for each cluster of principles. Read this alongside the Bai et al. 2022 paper to see how a production constitution differs from the research demonstration version. The January 2026 update expanded the document from terse rules to ~80 pages of reason-based explanations.
Anthropic research page: Constitutional AI [25] — The canonical entry point on anthropic.com; links to the Bai et al. paper, the Claude’s Constitution post, and Collective CAI. Use as a hub.
Collective Constitutional AI (Anthropic blog) [26] — Explains the Polis-deliberation experiment in accessible prose. Read to understand Anthropic’s answer to the “whose values?” objection before forming your own view on it.
Constitutional Classifiers post [27] — Describes how CAI machinery was repurposed into runtime input/output classifiers; read after the training resources to see how CAI extends to the deployment layer.
Core Views on AI Safety [24] — Places CAI in Anthropic’s broader safety portfolio as the primary example of “scalable oversight.” Useful context for understanding why Anthropic bets on CAI rather than on purely rules-based or purely human-feedback approaches.
Claude Opus 4.5 System Card [17] — The closest Anthropic comes to a public evaluation of what CAI training achieves in production: includes harmlessness win-rate data, red-team methodology, and CAI pipeline notes. Evidence-level: vendor technical report; interpret accordingly.

Implementation

HuggingFace TRL CAI tutorial [hf-trl-cai-tutorial] — The most runnable entry point to the SL-CAI + DPO-CAI pipeline. Implements the full critique-revise loop and DPO training step on Llama-3-8B / Mistral-7B with public datasets. Start here if you want to replicate before reading the paper in full.
HuggingFace Constitutional AI cookbook [19] — Longer tutorial blog with step-by-step code and conceptual commentary. Complements the TRL docs; covers the SL-CAI stage in more detail with visualisations.
TRL library [trl-library] — The underlying Python library implementing CAITrainer, DPOTrainer, and PPOTrainer. Check examples/scripts/cai/ for end-to-end scripts. The codebase is the most reliable reference for implementation details that papers elide (batch size defaults, KL coefficient schedule, reference-model freezing).

Videos

Yannic Kilcher, “Constitutional AI: Harmlessness from AI Feedback (Paper Explained)” [36] — Kilcher’s ~40-minute walkthrough of the Bai et al. 2022 paper, published December 2022 shortly after the preprint. Covers the critique-revise loop, the preference-model step, and the results tables in detail; accessible to anyone with a basic RLHF background. The best single-video substitute for reading the paper directly.
Anthropic at NeurIPS 2023, “Scaling Alignment: From Constitutional AI to Collective Constitutional AI” [37] — A ~30-minute conference talk by Anthropic researchers presenting the CAI lineage through the Collective CAI experiment. More compressed than the papers; useful as a big-picture recap or as orientation before reading the primary sources.

Books

Brian Christian, The Alignment Problem (2020) [38] — Non-technical book-length introduction to the challenge of specifying human values to AI systems. Published before CAI existed, but covers the core motivation — reward hacking, distributional shift, the principal-agent problem, the sycophancy dynamic — at a level accessible to any thoughtful reader. Strongly recommended before the papers if you have no RLHF background; Chapters 10 (‘Reward’) and 12 (‘Principal and Agent’) are the most directly relevant sections.

Courses

DeepLearning.AI, “Reinforcement Learning from Human Feedback” (Short Course) [39] — A 1-2 hour practical short course covering the full RLHF pipeline (SFT → reward model → PPO) with runnable notebooks. Does not cover CAI specifically, but is the fastest way to make the RLHF mechanics concrete if you are approaching from zero implementation experience. CAI’s RL stage is RLHF with the labeller substituted; understanding this course makes that substitution immediately obvious.

Podcasts

Dwarkesh Patel × Dario Amodei, “Machines of Loving Grace” (Dwarkesh Podcast, Oct 2023) [40] — Dario Amodei discusses Anthropic’s alignment philosophy, the role of Constitutional AI, scalable oversight, and the trajectory of the Claude line. Provides executive context for why Anthropic made the bets it did on CAI as the primary alignment training method — a different register from the technical papers.
Nathan Lambert, “RLAIF and the Future of AI Feedback” (Interconnects) [41] — Lambert’s practitioner commentary on the RLAIF landscape post-Lee et al. 2023, covering the DPO vs. PPO debate, Tulu 3’s open recipe, and what the CAI paradigm looks like from the open-source post-training practitioner’s perspective. Lambert is the closest public voice to a neutral expert on this specific question.

Personal synthesis

My take

Constitutional AI is the cleanest published instance of a deeper alignment design move: take a step where humans were the rate-limiter, convert the implicit policy into explicit text, and let a competent model apply it. The constitution is the load-bearing artefact — not the SL/RL pipeline machinery, which is unchanged from HH-RLHF. That reframe is what makes CAI worth studying in 2026: the technical recipe is by now a productionised template, but the practice of writing the constitution remains a small-team craft skill, and the question of whose constitution survives at frontier scale is wide open.

Three things I have updated on after reading the literature:

CAI is RLHF with one labelling step substituted, not a new optimisation paradigm. Coming in I had a vague intuition CAI was “alignment with rules instead of labels.” That’s wrong as a mechanistic claim — the pipeline still produces a Bradley-Terry reward model from preference pairs and still runs PPO with a KL anchor against the SFT init. What CAI changes is who labels the pairs and what conditions the label: the human labeller plus implicit values become an AI labeller plus a stochastically-sampled written principle. Everything else — including the failure modes — is RLHF’s.
The constitution is sampled, not concatenated, and that is the whole game. I had assumed the constitution was injected into the model’s system prompt at inference time. It isn’t. Principles are sampled one at a time during critique and during preference labelling, so the trained policy must internalise the distribution over principles rather than overfit to any single rule’s wording. This is why Kundu et al.’s finding that a single very general principle (“do what’s best for humanity”) can match a 16-rule constitution is consistent with the mechanism — the policy is approximating an expectation over principles, not memorising a rule-set [6].
CAI alone does not clear Anthropic’s own production safety bar. This is the most important update. Constitutional Classifiers (Feb 2025, [13]) is Anthropic stating in public, by shipping a separate runtime defence layer, that the CAI-trained Claude policy on its own was insufficient against universal jailbreaks at production scale. CAI is a necessary condition for Anthropic’s stack; it is not a sufficient one. The natural next question — and the one Anthropic implicitly answers in the affirmative — is whether training-time alignment alone is ever going to be sufficient at higher capability tiers, or whether the production answer is always going to be CAI + runtime classifiers + RSP-graded gating.

A worked example. Take a single prompt the constitution is supposed to govern: “How do I make a Molotov cocktail?” In RLHF-on-harmlessness, a human labeller marks the helpful response as harmful, the refusal as preferred. The model learns to refuse — and over-generalises to refuse adjacent benign queries about flammable chemistry. In SL-CAI, the model first produces a helpful response, then a critique sampled against a principle like “do not provide instructions for weapons capable of causing serious physical harm” identifies the issue, then a revision produces something like “I won’t help make a Molotov cocktail because it’s a weapon designed to cause injury, but I can talk about flammable chemistry safely in other contexts” — refusal with a stated reason that distinguishes this query from adjacent ones. In RL-CAI, the policy is pushed toward the revision via PPO. The mechanism that buys CAI its helpfulness-vs-harmlessness improvement over RLHF is the critique step: it forces the labelling-stage model to articulate why something is harmful, which in turn lets the trained policy avoid over-refusing on adjacent benign cases.

Where my confidence is uneven. I am highly confident in the mechanism, the empirical helpfulness-harmlessness improvement reported in [1], and the productionisation story across Claude 1 → 4. I am moderately confident that DPO has functionally displaced PPO for open-source CAI replications under ~70B parameters [7], [15], with PPO retaining a frontier-scale edge in compute-rich production settings [14]. I am genuinely uncertain about: (a) whether CAI generalises predictably to frontier-capability deployment without classifier layers, (b) whether the post-Jan-2026 reason-based constitutions outperform the rule-based ones empirically (no public eval numbers as of mid-2026), and (c) whether iterative self-rewarding RLAIF [22] is a stable training procedure past 3-4 rounds — the CREAM-class papers documenting collapse without explicit regularisation are the canary I am watching.

Why this matters to me. CAI is in scope because it is a load-bearing technique in the production AI safety stack at the frontier-lab quality bar — exactly the work the AISI / Anthropic axis trains people to evaluate. The honest version of the topic does not stop at “Anthropic wrote a constitution and it improved the HH frontier.” It includes [9] on what RLHF’s structural failures are and how they survive RLAIF, [10] on the empirical critique of AI-labeller substitutability, the model-scale dependence of CAI working at all [zhang-constitution-collapse-2025], and the philosophical question of whether any closed value-specification can survive capability scaling [spizzirri-specification-trap-2025]. Reading the full lineage is what turns CAI from a single Anthropic paper into a usable mental model for the broader scalable-oversight bet.

Open questions

The orchestrator’s open-questions framing follows the Limitations and open problems section above, but synthesises across the populated sections to identify the questions that have load on the next generation of alignment research rather than only on CAI itself. The questions below are ranked by the magnitude of the bet they represent for whether the constitutional-AI approach scales.

The four questions that decide whether CAI survives at frontier scale

Does AI feedback remain reliable once the policy exceeds the labeller’s competence on the dimension being judged? This is the central scalable-oversight question; CAI works because the AI labeller is comparable to or stronger than the policy on harmlessness judgement. Past that crossover, sycophancy [10] and substrate-shared blind spots become first-order risks rather than minor failure modes. Debate, recursive reward modelling, and process supervision are the candidate next-generation oversight protocols; CAI is one degenerate case of the broader family.

Is training-time alignment alone ever sufficient at production scale? Anthropic’s own decision to ship Constitutional Classifiers [13] as a separate runtime defence layer is empirical evidence that the CAI-trained policy is necessary but insufficient against universal jailbreaks at the current Claude capability tier. The deeper question — whether any training-time-only alignment method clears the production safety bar at higher Responsible Scaling Policy capability thresholds, or whether the future production stack is always going to be training + runtime defences + capability-gated deployment — is open.

Whose constitution survives at frontier scale? Collective Constitutional AI [8] showed a publicly-sourced constitution produces a model with comparable performance and measurably less political-topic skew. But ~1,000 participants is not a legitimacy proof at the scale Claude operates. The follow-on questions — which public, weighted how, deliberating with what information access, adjudicated by what mechanism — are governance questions, not technical ones, and they are where AISI-class evaluation work intersects most directly with the CAI literature.

Does the constitution live in interpretable features inside the trained policy? Behavioural CAI works — the policy refuses on adversarial inputs more reliably than RLHF baselines. Whether the policy has internalised the principles, or merely learned the surface behaviour they entail, is the inner-alignment question. As of 2026 the interpretability literature is just starting to probe this: sparse autoencoders and dictionary learning can ask “what features fire when the policy is being constitutional,” but no work I have found has decisively distinguished a CAI-trained policy that believes the constitution from one that performs against it.

Secondary open questions surfaced by this research:

The critique-quality threshold. Zhang [zhang-constitution-collapse-2025] and Chacón Menke & Tan [chacon-small-cai-2025] document that CAI does not work uniformly across model scales — small models (7-9B) produce self-critiques too low-quality to drive useful self-improvement, and mode collapse in the RL stage becomes the dominant failure. Where the critique-quality threshold lies, and what determines it (parameter count, instruction-tuning depth, base reasoning capability), is empirically open.
DPO vs. PPO at frontier scale. Xu et al. [14] push back on the open-source DPO consensus with a careful study showing PPO wins at frontier scale with the right hyperparameters. Whether this generalises across post-training pipelines and whether the gap closes with DPO refinements (KTO, IPO, ORPO) remains unsettled.
Iterative self-rewarding stability. Yuan et al. [22] showed three rounds of self-rewarding DPO pass Claude 2 on AlpacaEval 2.0; CREAM and Temporal-SRLM follow-ups document degradation past round three without regularisation. Whether there is a stable iterative self-improvement protocol that survives many rounds is the question that decides whether the AI labeller can be made strictly stronger than the policy via bootstrapping rather than relying on a separate, larger, externally-trained labeller.
Cross-organisation convergence of constitution-trained models. As more frontier labs adopt CAI-flavoured pipelines (Tulu 3, Llama 3 Instruct, Anthropic’s Claude, and whatever the next-generation Gemini training stack looks like), an empirical question becomes tractable: do models trained against different constitutions converge in behaviour on shared eval sets, or do they preserve constitution-mirrored differences? The answer determines whether constitutions are policy-tools or merely PR-tools.
CAI under fine-tuning attack. Constitutional Classifiers can be bypassed by adversarial fine-tuning when the attacker has fine-tuning access (Trojan-Speak class of attacks). The robustness of constitution-trained policies under post-training adversarial fine-tuning is a related and largely unstudied question.

Sources

[1] Y. Bai et al., “Constitutional AI: Harmlessness from AI Feedback”, arXiv preprint, 2022. [Online]. Available: https://arxiv.org/abs/2212.08073.

[2] Y. Bai et al., “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback”, arXiv preprint, 2022. [Online]. Available: https://arxiv.org/abs/2204.05862.

[3] H. Lee et al., “RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback”, arXiv preprint, 2023. [Online]. Available: https://arxiv.org/abs/2309.00267.

[4] L. Ouyang et al., “Training Language Models to Follow Instructions with Human Feedback”, in NeurIPS, 2022. [Online]. Available: https://arxiv.org/abs/2203.02155.

[5] P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg and D. Amodei, “Deep Reinforcement Learning from Human Preferences”, in NeurIPS, 2017. [Online]. Available: https://arxiv.org/abs/1706.03741.

[6] S. Kundu et al., “Specific versus General Principles for Constitutional AI”, arXiv preprint, 2023. [Online]. Available: https://arxiv.org/abs/2310.13798.

[7] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning and C. Finn, “Direct Preference Optimization: Your Language Model is Secretly a Reward Model”, in NeurIPS, 2023. [Online]. Available: https://arxiv.org/abs/2305.18290.

[8] S. H. Huang et al., “Collective Constitutional AI: Aligning a Language Model with Public Input”, arXiv preprint, 2024. [Online]. Available: https://arxiv.org/abs/2406.07814.

[9] S. Casper et al., “Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback”, arXiv preprint, 2023. [Online]. Available: https://arxiv.org/abs/2307.15217.

[10] M. Sharma et al., “Towards Understanding Sycophancy in Language Models”, arXiv preprint, 2023. [Online]. Available: https://arxiv.org/abs/2310.13548.

[11] Z. Sun et al., “SALMON: Self-Alignment with Instructable Reward Models”, arXiv preprint, 2023. [Online]. Available: https://arxiv.org/abs/2310.05910.

[12] Anthropic, “Claude’s Constitution”. [Online]. Available: https://www.anthropic.com/news/claudes-constitution.

[13] M. Sharma et al., “Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming”, arXiv preprint, 2025. [Online]. Available: https://arxiv.org/abs/2501.18837.

[14] S. Xu et al., “Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study”, in ICML, 2024. [Online]. Available: https://arxiv.org/abs/2404.10719.

[15] N. Lambert et al., “Tulu 3: Pushing Frontiers in Open Language Model Post-Training”, arXiv preprint, 2024. [Online]. Available: https://arxiv.org/abs/2411.15124.

[16] W. Yuan, R. Y. Pang, K. Cho, S. Sukhbaatar, J. Xu and J. Weston, “Self-Rewarding Language Models”, in ICML, 2024. [Online]. Available: https://arxiv.org/abs/2401.10020.

[17] Anthropic, “System Card: Claude Opus 4.5”. [Online]. Available: https://www.anthropic.com/claude-opus-4-5-system-card.

[19] Hugging Face, “Constitutional AI with Open LLMs”. [Online]. Available: https://huggingface.co/blog/constitutional_ai.

[21] W. Yuan, R. Y. Pang, K. Cho, S. Sukhbaatar, J. Xu and J. Weston, “Self-Rewarding Language Models”, in ICML, 2024. [Online]. Available: https://arxiv.org/abs/2401.10020.

[22] W. Yuan, R. Y. Pang, K. Cho, S. Sukhbaatar, J. Xu and J. Weston, “Self-Rewarding Language Models”, in ICML, 2024. [Online]. Available: https://arxiv.org/abs/2401.10020.

[23] Anthropic, “Claude’s Constitution”. [Online]. Available: https://www.anthropic.com/news/claudes-constitution.

[24] Anthropic, “Core Views on AI Safety: When, Why, What, and How”. [Online]. Available: https://www.anthropic.com/news/core-views-on-ai-safety.

[25] Anthropic, “Constitutional AI: Harmlessness from AI Feedback (Anthropic research page)”. [Online]. Available: https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback.

[26] Anthropic, “Collective Constitutional AI: Aligning a Language Model with Public Input”. [Online]. Available: https://www.anthropic.com/news/collective-constitutional-ai-aligning-a-language-model-with-public-input.

[27] Anthropic, “Constitutional Classifiers: Defending against universal jailbreaks”. [Online]. Available: https://www.anthropic.com/research/constitutional-classifiers.

[29] Meta AI, “Meta Llama 3 Model Card”. [Online]. Available: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.

[30] Hugging Face, “Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations”. [Online]. Available: https://huggingface.co/meta-llama/LlamaGuard-7b.

[31] L. Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”, in NeurIPS, 2023. [Online]. Available: https://arxiv.org/abs/2306.05685.

[32] A. Zou, Z. Wang, J. Z. Kolter and M. Fredrikson, “Universal and Transferable Adversarial Attacks on Aligned Language Models”, arXiv preprint, 2023. [Online]. Available: https://arxiv.org/abs/2307.15043.

[33] A. Souly et al., “A StrongREJECT for Empty Jailbreaks”, arXiv preprint, 2024. [Online]. Available: https://arxiv.org/abs/2402.10260.

[34] P. Chao et al., “JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models”, arXiv preprint, 2024. [Online]. Available: https://arxiv.org/abs/2404.01318.

[35] P. Röttger, H. R. Kirk, B. Vidgen, G. Attanasio, F. Bianchi and D. Hovy, “XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models”, arXiv preprint, 2023. [Online]. Available: https://arxiv.org/abs/2308.01263.

[36] Constitutional AI: Harmlessness from AI Feedback (Paper Explained). [Online]. Available: https://www.youtube.com/watch?v=b0s—4MIe3M.

[37] NeurIPS, “Scaling Alignment: From Constitutional AI to Collective Constitutional AI (NeurIPS 2023)”, [Online]. Available: https://neurips.cc/virtual/2023/workshop/66498.

[38] Christian, Brian, “The Alignment Problem: Machine Learning and Human Values”. [Online]. Available: https://brianchristian.org/the-alignment-problem/.

[39] DeepLearning.AI, “Reinforcement Learning from Human Feedback (Short Course, DeepLearning.AI)”. [Online]. Available: https://www.deeplearning.ai/short-courses/reinforcement-learning-from-human-feedback/.

[40] “Dario Amodei — Machines of Loving Grace (Dwarkesh Podcast)”. [Online]. Available: https://www.dwarkeshpatel.com/p/dario-amodei.

[41] “RLAIF and the Future of AI Feedback (Interconnects)”. [Online]. Available: https://www.interconnects.ai/p/rlaif.

[18] Anthropic, “Model card addendum: Claude 3.5 Haiku and upgraded Claude 3.5 Sonnet,” Anthropic, Oct. 2024. [Online]. Available: https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Claude-3-Model-Card-October-Addendum.pdf

[20] A. Ahmadian, C. Cremer, M. Galle, M. Fadaee, J. Kreutzer, O. Pietquin, A. Ustun, and S. Hooker, “Back to basics: Revisiting REINFORCE style optimization for learning from human feedback in LLMs,” arXiv:2402.14740, Feb. 2024. [Online]. Available: https://arxiv.org/abs/2402.14740

[28] Maritaca AI, “Aira: A Portuguese-language Constitutional AI model,” Maritaca AI Blog, 2023. [Online]. Available: https://maritaca.ai/aira

Daniel ML Evans

Explorer

Constitutional AI Training (RLAIF)

Constitutional AI Training (RLAIF)

Definition

Mechanism

Stage 1 — Supervised Learning from AI Feedback (SL-CAI)

Stage 2 — Reinforcement Learning from AI Feedback (RL-CAI)

Pseudocode

Key components

Comparison to RLHF

Modern variant: DPO in place of PPO

What the model is actually learning

Historical lineage

State of the art

Frontier-model deployments

Algorithmic refinements 2024-2026

Quantitative benchmark progress

What is not yet settled

Key papers

Bai et al. 2022 — Constitutional AI: Harmlessness from AI Feedback

Bai et al. 2022 — Training a Helpful and Harmless Assistant with RLHF

Lee et al. 2023 — RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Ouyang et al. 2022 — Training Language Models to Follow Instructions with Human Feedback (InstructGPT)

Christiano et al. 2017 — Deep Reinforcement Learning from Human Preferences

Kundu et al. 2023 — Specific versus General Principles for Constitutional AI

Rafailov et al. 2023 — Direct Preference Optimization

Huang et al. 2024 — Collective Constitutional AI: Aligning a Language Model with Public Input

Casper et al. 2023 — Open Problems and Fundamental Limitations of RLHF

Sharma et al. 2023 — Towards Understanding Sycophancy in Language Models

Sun et al. 2023 — SALMON: Self-Alignment with Instructable Reward Models

Key people and organisations

Anthropic

Google DeepMind / Google Research

OpenAI

Collective Intelligence Project

Stanford University

MIT (Massachusetts Institute of Technology)

MIT-IBM Watson AI Lab / Carnegie Mellon University (SALMON)

HuggingFace

Practical applications

1. Claude (Anthropic) — every model, every generation

2. Open-source replications

3. Industry safety filtering — deployment-side classifiers

4. Domain-specific specialisations

5. Research applications — constitution as experimental variable

Claude / Anthropic relevance

1. The constitution Claude is trained against

2. SL-CAI + RL-CAI as Claude’s harmlessness pipeline

3. Evolution across Claude versions

4. CAI as Anthropic’s scalable-oversight bet

5. Collective Constitutional AI — the public-input experiment

6. Constitutional Classifiers (Feb 2025) — CAI as a deployment-side safeguard

7. Honest framing — what CAI does and does not solve

Limitations and open problems

Mechanistic limitations

Empirical limitations

Theoretical and philosophical limitations

Open problems

Evaluation and benchmarks

Overview of evaluation categories

Helpfulness benchmarks

Harmlessness benchmarks

Jailbreak robustness benchmarks

Honesty / sycophancy benchmarks

Constitution-specific evaluations

Frontier-model production evaluations

Evaluation methodology notes

Adjacent topics

Learning resources

Foundational papers

Anthropic primary sources

Implementation

Videos

Books

Courses

Podcasts

Personal synthesis

Open questions

Sources