Speculative Decoding

A latency-reduction technique that uses a fast draft model to propose multiple tokens, then lets the target model verify them in parallel. When most proposals are accepted, decoding runs roughly 2–3× faster with no change to the output distribution.

How it works

1. A small draft model (e.g. 150M params) autoregressively generates K candidate tokens.
2. The target model runs one forward pass over the prompt + candidates, producing logits for all positions simultaneously.
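3. A rejection sampler compares draft and target probabilities at each candidate position, accepting a token with probability min(1, p_target / p_draft). Accepted tokens are kept; at the first rejection, the remaining candidates are discarded and a replacement is resampled from the normalized residual max(0, p_target − p_draft). If all K candidates are accepted, one extra token is sampled from the target's last position.
4. The cycle repeats. The target model never decodes one token per forward pass; each pass verifies up to K candidates and emits at least one token.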
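
The steps above can be sketched with toy distributions in place of real models. This is a minimal illustration under stated assumptions, not a production implementation: `draft_probs` and `target_probs` are hypothetical callables that return a next-token probability list for a given context, and the target is called once per position here where a real system would use a single batched forward pass. The sketch assumes the standard acceptance rule: keep a drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(0, p − q).

```python
import random

def sample(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def residual(p, q):
    """Normalized max(0, p - q): the distribution to resample from after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # If p == q the residual is empty (rejection cannot happen); fall back to p.
    return [d / total for d in diff] if total > 0 else list(p)

def speculative_step(draft_probs, target_probs, context, k, rng):
    """Run one draft-then-verify cycle and return the newly emitted tokens."""
    # Step 1: draft k tokens autoregressively, remembering each draft distribution.
    drafted, q_dists = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # Step 2: target distributions for all k + 1 positions. A real model would
    # produce these in one forward pass; this toy loop only mimics the result.
    p_dists = [target_probs(list(context) + drafted[:i]) for i in range(k + 1)]

    # Step 3: accept each token with probability min(1, p/q); stop at first rejection.
    out = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
        else:
            out.append(sample(residual(p, q), rng))
            return out

    # All k drafts accepted: take one bonus token from the target's last position.
    out.append(sample(p_dists[k], rng))
    return out
```

When the draft and target distributions are identical, every candidate is accepted and the cycle emits K + 1 tokens; the more the two diverge, the earlier a cycle tends to stop.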

Why it matters

Exact output. The rejection sampler guarantees the final distribution matches the target model — no approximation.
Memory-bound gains. Decoding large models is memory-bandwidth–bound. Verifying K tokens in one pass amortizes the weight-loading cost.
No retraining. Works with an off-the-shelf draft model, provided it shares the target's tokenizer and vocabulary. Common choices: a distilled variant, a shallower version of the target, or an n-gram model.

Key parameters

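K

Number of draft tokens per cycle. Typical range: 3–8. Larger K raises the maximum number of tokens a cycle can emit, but later draft positions are accepted less often, so the acceptance rate drops and more draft work is wasted.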
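
The trade-off in choosing K can be estimated with a common simplification: assume each draft token is accepted independently with the same probability `alpha`. Under that assumption the expected number of tokens emitted per cycle is (1 − alpha^(K+1)) / (1 − alpha), counting the token the target contributes at the end. The function names and the `draft_cost_ratio` parameter (cost of one draft pass relative to one target pass) are hypothetical, and the speedup is a rough model that ignores any extra cost of verifying K positions at once.

```python
def expected_tokens_per_cycle(alpha: float, k: int) -> float:
    """Expected tokens emitted per cycle, assuming i.i.d. acceptance with rate alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Tokens per cycle divided by cycle cost, measured in target forward passes."""
    # One cycle costs k draft passes plus one target pass.
    cycle_cost = k * draft_cost_ratio + 1.0
    return expected_tokens_per_cycle(alpha, k) / cycle_cost

if __name__ == "__main__":
    # Sweep K at a fixed (assumed) acceptance rate and draft cost.
    for k in (2, 4, 8, 16):
        print(k, round(estimated_speedup(alpha=0.8, k=k, draft_cost_ratio=0.05), 2))
```

Because the numerator saturates at 1 / (1 − alpha) while the cycle cost keeps growing with K, the estimate peaks at a moderate K and then declines.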

Acceptance rate

Fraction of draft tokens kept. 70–90% is healthy. Drops when draft and target distributions diverge (code, math, rare tokens).
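
To make the link between divergence and acceptance concrete: under the min(1, p/q) acceptance rule, the chance that a single token drawn from the draft distribution q is accepted against the target distribution p equals the sum over the vocabulary of min(p, q), which is one minus the total variation distance. The sketch below computes that value and a measured rate from counters; both function names are hypothetical helpers, not part of any library.

```python
def acceptance_probability(p, q):
    """Chance that one token drawn from draft q passes verification against target p."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same vocabulary")
    # E_{x~q}[min(1, p(x)/q(x))] simplifies to sum_x min(p(x), q(x)).
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def measured_acceptance_rate(accepted: int, drafted: int) -> float:
    """Fraction of drafted tokens that were kept; 0.0 if nothing was drafted."""
    return accepted / drafted if drafted else 0.0
```

Identical distributions give a probability of 1.0, and distributions with no overlapping mass give 0.0, so tracking the measured rate per workload (for example, code versus chat) shows where the draft model is a poor match.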

For a production implementation with tree attention and multi-sequence verification, see the advanced recipe.