Speculative Decoding
A latency-reduction technique that uses a fast draft model to propose several tokens, then lets the target model verify them in a single parallel pass. When the draft model agrees with the target often enough, decoding is typically 2–3× faster, with no change to the output distribution.
How it works
- 1. A small draft model (e.g. 150M params) autoregressively generates K candidate tokens.
- 2. The target model runs one forward pass over the prompt + candidates, producing logits for all positions simultaneously.
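- 3. A rejection sampler walks the candidates in order and compares the draft and target probabilities of each one. A candidate x is accepted with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution at that position. The first rejection discards the remaining candidates and triggers a resample from the normalized residual max(0, p - q), not from p itself.
- 4. The cycle repeats. The target model never decodes one token at a time: each of its forward passes verifies a block of candidates and also yields one token of its own (the resample after a rejection, or a bonus token when all K candidates are accepted). Every cycle therefore emits between 1 and K+1 tokens.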
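The verification step can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `speculative_step` and its argument layout are hypothetical, and it assumes the draft and target distributions have already been computed as probability arrays over a shared vocabulary.

```python
import numpy as np

def speculative_step(draft_probs, target_probs, draft_tokens, rng):
    """Verify K drafted tokens and return the tokens emitted this cycle.

    draft_probs:  (K, V) array, draft distribution q at each drafted position
    target_probs: (K + 1, V) array, target distribution p at each drafted
                  position plus the position after the last candidate
    draft_tokens: K token ids, each sampled from the matching row of draft_probs
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the candidate with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(int(tok))
            continue
        # First rejection: resample from the normalized residual max(0, p - q)
        # and drop every later candidate.
        residual = np.maximum(p - q, 0.0)
        residual /= residual.sum()
        emitted.append(int(rng.choice(len(p), p=residual)))
        return emitted
    # All K candidates accepted: take one bonus token from the target.
    emitted.append(int(rng.choice(target_probs.shape[1], p=target_probs[-1])))
    return emitted
```

Called with `rng = np.random.default_rng()`, the function returns between 1 and K+1 token ids. In a real system the two probability arrays would come from the draft model's K autoregressive steps and the target model's single forward pass.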
Why it matters
- Exact output distribution. The rejection sampler guarantees that emitted tokens are distributed exactly as if they had been sampled from the target model alone; there is no approximation.
- Memory-bound gains. Decoding large models at small batch sizes is memory-bandwidth–bound: each step is dominated by loading the weights, not by arithmetic. Verifying K tokens in one pass amortizes that weight-loading cost.
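- No retraining of the target. Works with an off-the-shelf draft model, provided it shares the target's tokenizer and vocabulary so the two distributions can be compared token by token. Common choices: a distilled variant, a shallower version of the target, or an n-gram model.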
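The exactness claim can be checked numerically for a single position. The sketch below assumes the standard accept-or-resample rule (accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(0, p - q)). The function name and the two example distributions are made up for illustration.

```python
import numpy as np

def one_step_output_distribution(p, q):
    """Distribution of the token emitted by one accept-or-resample step.

    p: target distribution, q: draft distribution (both sum to 1).
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # P(draft proposes x AND x is accepted) = q(x) * min(1, p(x)/q(x)) = min(p(x), q(x))
    accept_mass = np.minimum(p, q)
    # P(the proposed token is rejected)
    reject_prob = 1.0 - accept_mass.sum()
    # Resampling distribution used after a rejection
    residual = np.maximum(p - q, 0.0)
    if residual.sum() > 0:
        residual = residual / residual.sum()
    return accept_mass + reject_prob * residual

p = np.array([0.6, 0.3, 0.1])  # hypothetical target distribution
q = np.array([0.2, 0.5, 0.3])  # hypothetical draft distribution
assert np.allclose(one_step_output_distribution(p, q), p)
```

The assertion holds for any pair of valid distributions, not just this one. A poor draft model lowers the acceptance probability and therefore the speedup, but it does not change what is sampled.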
Key parameters
K
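Number of draft tokens per cycle. Typical range: 3–8. Larger K raises the maximum number of tokens emitted per cycle, but with diminishing returns: verification stops at the first rejection, so later candidates are less likely to be used and their drafting cost is more often wasted.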
Acceptance rate
Fraction of draft tokens kept. 70–90% is healthy. Drops when draft and target distributions diverge (code, math, rare tokens).
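The way K and the acceptance rate interact can be estimated with a simple model. The sketch below assumes each draft token is accepted independently with the same probability `alpha`, which is a simplification because real acceptance varies by position and content. Under that assumption a cycle emits 1 + alpha + alpha^2 + ... + alpha^K tokens on average, counting the resampled or bonus token from the target. The function name is hypothetical.

```python
def expected_tokens_per_cycle(alpha: float, k: int) -> float:
    """Expected tokens emitted per verify cycle under an i.i.d. acceptance model.

    alpha: per-token acceptance probability, 0 <= alpha <= 1
    k:     number of draft tokens per cycle
    """
    if alpha >= 1.0:
        # Every candidate is accepted, plus the bonus token.
        return float(k + 1)
    # Closed form of the geometric sum 1 + alpha + ... + alpha**k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

# Diminishing returns at an assumed 80% acceptance rate
for k in (2, 4, 8):
    print(k, round(expected_tokens_per_cycle(0.8, k), 2))
```

At `alpha = 0.8`, doubling K from 4 to 8 raises the expected yield from about 3.36 to about 4.33 tokens per cycle while doubling the drafting work, which is why moderate values of K are common.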
For a production implementation with tree attention and multi-sequence verification, see the advanced recipe.