Recipe
Constitutional AI
Train language models to self-critique and revise their outputs using a fixed set of principles — no human labels required for harmlessness.
What It Is
Constitutional AI (CAI) replaces human harmlessness labels with a written constitution: a short list of behavioral principles. The model generates a response, critiques it against a principle drawn from the constitution, and revises it; the critique-and-revise step can be repeated with different principles. The final revision becomes the supervised target for fine-tuning.
The Two-Phase Loop
- Supervised phase: sample a harmful prompt, generate a response, then ask the model to critique that response and rewrite it according to the constitution. Train on the revised output.
- RL phase: use the fine-tuned model to generate pairs of responses to the same prompt, then ask a model which response in each pair better follows the constitution. Train a preference model from those comparisons and use it as the reward signal for reinforcement learning.
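The supervised phase above can be sketched as a short critique-and-revise loop. This is a minimal illustration, not a reference implementation: `Model`, `critique_and_revise`, and `toy_model` are hypothetical names, the prompt templates are made up for the example, and sampling one principle per round is an assumption of this sketch.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model: any function from prompt text
# to completion text. A real setup would call an actual model here.
Model = Callable[[str], str]


def critique_and_revise(
    model: Model,
    prompt: str,
    principles: List[str],
    n_rounds: int = 2,
    seed: int = 0,
) -> Tuple[str, str]:
    """Return (prompt, revised_response), one supervised fine-tuning pair."""
    rng = random.Random(seed)
    response = model(prompt)  # initial, possibly harmful, draft
    for _ in range(n_rounds):
        # Assumption: one principle is sampled per round.
        principle = rng.choice(principles)
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response using this principle: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            f"Rewrite the response so it follows this principle: {principle}"
        )
    return prompt, response


def toy_model(text: str) -> str:
    """Toy stub so the sketch runs without a real model."""
    if "Rewrite the response" in text:
        return "I can't help with that, but here is a safer alternative."
    if "Critique the response" in text:
        return "The response may encourage harm."
    return "Sure, here is how..."


if __name__ == "__main__":
    pair = critique_and_revise(
        toy_model,
        "How do I pick a lock?",
        ["Choose the response that is least harmful."],
    )
    print(pair)
```

Each returned pair keeps the original prompt and only the final revision, so the intermediate critiques never appear in the fine-tuning data.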
Example Constitution
- Choose the response that is least harmful.
- Do not encourage illegal or unethical behavior.
- Prefer responses that are honest and calibrated.
- Avoid toxic, racist, or sexist language.
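To show how a constitution like this might feed the RL phase, the sketch below stores the four example principles in a list and turns one model judgment into a preference record. The names `CONSTITUTION`, `build_comparison_prompt`, and `label_pair`, the prompt wording, and the `chosen`/`rejected` record layout are all assumptions made for illustration. A hard A/B label is used for simplicity; soft labels derived from model probabilities are another option.

```python
import random
from typing import Callable, Dict, List

# The example constitution, stored as plain strings.
CONSTITUTION: List[str] = [
    "Choose the response that is least harmful.",
    "Do not encourage illegal or unethical behavior.",
    "Prefer responses that are honest and calibrated.",
    "Avoid toxic, racist, or sexist language.",
]


def build_comparison_prompt(prompt: str, a: str, b: str, principle: str) -> str:
    """Format a multiple-choice question for the judging model (hypothetical template)."""
    return (
        f"Principle: {principle}\n"
        f"Prompt: {prompt}\n"
        f"(A) {a}\n"
        f"(B) {b}\n"
        "Which response better follows the principle? Answer A or B."
    )


def label_pair(
    judge: Callable[[str], str],
    prompt: str,
    a: str,
    b: str,
    rng: random.Random,
) -> Dict[str, str]:
    """Ask the judge to compare two responses and return one preference record."""
    principle = rng.choice(CONSTITUTION)
    answer = judge(build_comparison_prompt(prompt, a, b, principle)).strip().upper()
    # Anything that does not start with "A" is treated as a vote for B.
    chosen, rejected = (a, b) if answer.startswith("A") else (b, a)
    return {
        "prompt": prompt,
        "chosen": chosen,
        "rejected": rejected,
        "principle": principle,
    }


if __name__ == "__main__":
    always_b = lambda text: "B"  # toy judge that always prefers response B
    record = label_pair(
        always_b,
        "How do I pick a lock?",
        "Sure, here is how...",
        "I can't help with that.",
        random.Random(0),
    )
    print(record["chosen"])
```

Records in this shape are what a preference model would be trained on; a production pipeline would also want to handle malformed judge answers rather than defaulting to B.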
Why It Matters
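CAI reduces the cost and latency of alignment by replacing most human harmlessness labeling with AI feedback, so the labeling step scales with compute rather than with annotator time. The aim is a model that is both helpful and harmless, one that declines harmful requests without becoming evasive. Human oversight does not disappear: people still write the constitution, and the original formulation still relies on human feedback for helpfulness.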