Recipe

Model Quantization

Shrink model size and inference latency without retraining — a practical guide to post-training quantization.

Quantization maps 32-bit floating-point weights to lower-precision integers (INT8, INT4) while keeping output quality close to the original. The result: models that run on edge devices, fit in browser memory, and serve at lower cost.
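To make the mapping concrete, here is a minimal sketch of symmetric INT8 quantization with a single shared scale. The function names and sample weights are illustrative assumptions, not part of any particular library.

```python
def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127] with one scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clip to the INT8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return [x * scale for x in q]


weights = [0.42, -1.27, 0.05, 0.9]  # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per value is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The only information lost is the rounding to the nearest step, which is why quality tends to hold up when the values are spread evenly across the range.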

Why quantize?

  • 4× smaller weights with INT8, 8× with INT4 (relative to FP32)
  • Often 2–4× faster inference on CPU and edge silicon with native low-precision support
  • Lower cloud GPU costs per token served
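The memory figures follow directly from the bit widths. A quick back-of-the-envelope sketch, assuming a hypothetical 7-billion-parameter model and counting weights only:

```python
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gb(num_params, dtype):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_WEIGHT[dtype] / 8 / 1e9


num_params = 7_000_000_000  # hypothetical model size, chosen for illustration

for dtype in BITS_PER_WEIGHT:
    print(f"{dtype}: {weight_memory_gb(num_params, dtype):.1f} GB")
```

Real quantized formats also store scales (and sometimes zero-points) alongside the integers, so the actual savings come out slightly below these ideal ratios.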

Quantization types

Dynamic

Weights are quantized ahead of time; activations are quantized on the fly at inference, using a range computed from each input. Easiest path: no calibration data needed.
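As a rough illustration of the dynamic approach, the sketch below quantizes weights once at construction time and derives the activation scale from each input at call time. `DynamicQuantLinear` is a hypothetical single-output layer written for this example, not a framework class.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` to `qmax`."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, qmax=127):
    """Round to integer steps and clip to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


class DynamicQuantLinear:
    """Dot-product layer: INT8 weights, activations quantized per call."""

    def __init__(self, weights):
        # Weights are quantized once, ahead of time.
        self.w_scale = symmetric_scale(weights)
        self.w_q = quantize(weights, self.w_scale)

    def __call__(self, x):
        # The activation scale is computed from this specific input.
        x_scale = symmetric_scale(x)
        x_q = quantize(x, x_scale)
        acc = sum(w * a for w, a in zip(self.w_q, x_q))  # integer accumulate
        return acc * self.w_scale * x_scale  # rescale back to float


weights = [0.5, -0.25, 1.0]
x = [2.0, 4.0, -1.0]

layer = DynamicQuantLinear(weights)
y_quant = layer(x)
y_float = sum(w * a for w, a in zip(weights, x))
print(y_quant, y_float)
```

Computing the scale per input costs a little extra work at inference, which is the trade-off for skipping calibration.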

Static

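Both weights and activations are quantized ahead of time, with activation ranges fixed in advance. Requires a calibration dataset but usually yields the lowest latency, since no ranges are computed at inference.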
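A minimal sketch of the calibration idea, assuming simple max-magnitude range estimation (production tools often use percentile or entropy-based methods instead). The calibration batches here are made-up numbers.

```python
def calibrate_scale(calibration_batches, qmax=127):
    """Derive one fixed activation scale from the largest magnitude observed."""
    max_abs = max(abs(v) for batch in calibration_batches for v in batch)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, qmax=127):
    """Round to integer steps and clip to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


# Hypothetical activations recorded while running representative inputs.
calibration_batches = [
    [0.1, -0.8, 0.3],
    [1.5, 0.2, -0.4],
    [-2.0, 0.9, 0.0],
]
act_scale = calibrate_scale(calibration_batches)  # fixed before deployment

# At inference the stored scale is reused; nothing is recomputed.
in_range = quantize([0.5, -2.0], act_scale)
# Values beyond the calibrated range saturate at the INT8 limit.
out_of_range = quantize([3.0], act_scale)
print(act_scale, in_range, out_of_range)
```

The clipping in the last line is why the calibration samples need to be representative: activations larger than anything seen during calibration are saturated.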

Recipe steps

  1. Profile your model. Identify layers with outlier weights or activations; these lose the most accuracy under quantization.
  2. Choose a scheme. Per-tensor (one scale per tensor) for speed, per-channel (one scale per output channel) for accuracy on CNNs.
  3. Calibrate (static only). Run 100–500 representative samples to compute activation ranges.
  4. Quantize and validate. Compare perplexity or accuracy against the FP32 baseline.
  5. Deploy. Use an INT8-capable runtime like ONNX Runtime or llama.cpp.
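Steps 1, 2, and 4 can be sketched together on a toy weight matrix: one row has much larger values than the other (an outlier channel), and the quantization error is compared against the float baseline for both schemes. The matrix values and helper names are assumptions made for illustration.

```python
def scale_for(values, qmax=127):
    """Symmetric scale mapping the largest magnitude to qmax."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize_dequantize(values, scale, qmax=127):
    """Quantize to INT8 steps, then map back to float to expose the error."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(original, restored):
    """Average absolute difference over all matrix entries."""
    diffs = [abs(a - b)
             for row_o, row_r in zip(original, restored)
             for a, b in zip(row_o, row_r)]
    return sum(diffs) / len(diffs)


weights = [
    [0.01, -0.02, 0.015],  # small-magnitude output channel
    [8.0, -6.5, 7.2],      # outlier channel that dominates the range
]

# Per-tensor: one scale shared by every row.
shared_scale = scale_for([v for row in weights for v in row])
per_tensor = [quantize_dequantize(row, shared_scale) for row in weights]

# Per-channel: each output row gets its own scale.
per_channel = [quantize_dequantize(row, scale_for(row)) for row in weights]

err_tensor = mean_abs_error(weights, per_tensor)
err_channel = mean_abs_error(weights, per_channel)
print(err_tensor, err_channel)
```

With a shared scale, the small row rounds entirely to zero because the outlier row sets the step size. Per-channel scales avoid that at the cost of storing one scale per row.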

Pro tip: Mixed-precision quantization keeps attention layers in FP16 while quantizing FFN blocks — often the sweet spot for LLMs.
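One way to express such a mixed-precision policy is a per-layer precision plan keyed on layer names. The sketch below assumes a hypothetical naming convention in which attention layers contain "attn"; real models name their layers differently, so the matching rule would need adjusting.

```python
def precision_plan(layer_names, keep_fp16=("attn",)):
    """Assign 'fp16' to layers whose name contains a protected tag, 'int8' otherwise."""
    plan = {}
    for name in layer_names:
        protected = any(tag in name for tag in keep_fp16)
        plan[name] = "fp16" if protected else "int8"
    return plan


# Hypothetical layer names for a single transformer block.
layers = [
    "blocks.0.attn.q_proj",
    "blocks.0.attn.out_proj",
    "blocks.0.ffn.up_proj",
    "blocks.0.ffn.down_proj",
]

plan = precision_plan(layers)
for name, precision in plan.items():
    print(f"{name}: {precision}")
```

A plan like this can then drive whichever quantization tool is in use, typically through its option for excluding or skipping specific layers.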