Recipe
Model Quantization
Shrink model size and inference latency without retraining — a practical guide to post-training quantization.
Quantization maps 32-bit floating-point weights (and, optionally, activations) to lower-precision integers such as INT8 or INT4, ideally with only a small loss in output quality. The result: models that run on edge devices, fit in browser memory, and serve at lower cost.
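To make the mapping concrete, here is a minimal sketch of symmetric quantization in plain Python. The function names `quantize` and `dequantize` are illustrative, not taken from any library, and real toolkits typically add zero-points, per-channel scales, and vectorized kernels.

```python
def quantize(values, num_bits=8):
    """Symmetric quantization: map floats to signed integers using one scale."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clip to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return [x * scale for x in q]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize(weights)
restored = dequantize(q, scale)

# The round-trip error of an in-range value is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("integers:", q)
print("max round-trip error:", max_error, "(half step:", scale / 2, ")")
```

The single `scale` is what ties the integers back to real values. Fewer bits mean a larger step and therefore a larger worst-case error.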
Why quantize?
- Roughly 4× smaller weights with INT8 and 8× with INT4, relative to FP32
- Often 2–4× faster inference on CPU and edge silicon, depending on hardware and runtime support
- Lower cloud GPU costs per token served
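The memory figures follow directly from bits per weight. The sketch below works through the arithmetic for a hypothetical 7-billion-parameter model. The parameter count is an assumption chosen for illustration, and the estimate ignores scale factors, zero-points, and activation memory.

```python
BITS = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_bytes(num_params, dtype):
    """Bytes needed to store the weights alone at the given precision."""
    return num_params * BITS[dtype] // 8


num_params = 7_000_000_000  # hypothetical model size

for dtype in BITS:
    gb = weight_bytes(num_params, dtype) / 1e9
    print(f"{dtype}: {gb:.1f} GB")

# FP32: 28.0 GB
# FP16: 14.0 GB
# INT8: 7.0 GB
# INT4: 3.5 GB
```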
Quantization types
Dynamic
Activations quantized on-the-fly. Weights pre-quantized. Easiest path — zero calibration data needed.
Static
Both weights and activations quantized ahead of time. Requires a calibration dataset but typically gives the lowest latency, since no activation ranges are computed at inference time.
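The difference between the two modes comes down to when the activation scale is chosen. The following is a simplified sketch, assuming symmetric INT8 and max-based ranges. `dynamic_quantize` and `StaticQuantizer` are hypothetical names, and production toolkits often use histogram or percentile observers rather than a plain maximum.

```python
QMAX = 127  # symmetric INT8


def scale_for(values):
    """Scale that maps the largest magnitude in `values` to QMAX."""
    max_abs = max(abs(v) for v in values)
    return max_abs / QMAX if max_abs > 0 else 1.0


def quantize_with(values, scale):
    return [max(-QMAX, min(QMAX, round(v / scale))) for v in values]


def dynamic_quantize(activations):
    """Dynamic: derive the scale from each incoming activation tensor."""
    scale = scale_for(activations)
    return quantize_with(activations, scale), scale


class StaticQuantizer:
    """Static: fix the scale once, from calibration batches seen ahead of time."""

    def __init__(self, calibration_batches):
        self.scale = max(scale_for(batch) for batch in calibration_batches)

    def __call__(self, activations):
        return quantize_with(activations, self.scale), self.scale


# Calibration data only covers magnitudes up to 2.0 ...
static = StaticQuantizer([[0.1, -2.0], [1.5, 0.3]])

# ... so an unexpectedly large activation saturates in static mode,
# while dynamic mode adapts its scale at the cost of extra work per call.
q_s, s_s = static([4.0, 1.0])
q_d, s_d = dynamic_quantize([4.0, 1.0])
print("static: ", q_s, s_s)
print("dynamic:", q_d, s_d)
```

This is why the calibration set needs to be representative: a static scale has no way to widen for inputs outside the range it saw during calibration.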
Recipe steps
- Profile your model. Identify layers with outlier weight or activation values; these tend to degrade most under quantization.
- Choose a scheme. Per-tensor for speed, per-channel for accuracy on CNNs.
- Calibrate (static only). Run 100–500 representative samples to compute activation ranges.
- Quantize and validate. Compare perplexity or accuracy against the FP32 baseline.
- Deploy. Use an INT8-capable runtime like ONNX Runtime or llama.cpp.
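The scheme choice and the validation step can be sketched together on a toy weight matrix. The example below assumes symmetric INT8 and treats each row as one output channel. The values are made up to exaggerate the effect, and mean absolute error stands in for the task-level metric (perplexity or accuracy) you would measure on a real model.

```python
def quantize_dequantize(row, scale, qmax=127):
    """Round-trip a row through INT8 so the error can be measured in float space."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in row]


def max_abs(values):
    return max(abs(v) for v in values)


def mean_abs_error(original, restored):
    pairs = [(a, b) for r0, r1 in zip(original, restored) for a, b in zip(r0, r1)]
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Toy weights: two channels (rows) with very different value ranges.
weights = [
    [0.01, -0.02, 0.015],
    [5.0, -3.0, 4.0],
]

# Per-tensor: one scale for the whole matrix, dominated by the large channel.
tensor_scale = max_abs([v for row in weights for v in row]) / 127
per_tensor = [quantize_dequantize(row, tensor_scale) for row in weights]

# Per-channel: one scale per row, so the small channel keeps its resolution.
per_channel = [quantize_dequantize(row, max_abs(row) / 127) for row in weights]

print("per-tensor error: ", mean_abs_error(weights, per_tensor))
print("per-channel error:", mean_abs_error(weights, per_channel))
```

With a single shared scale, the small-magnitude channel collapses to a handful of integer levels. Per-channel scales avoid that at the cost of storing and applying one scale per channel.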
Pro tip: Mixed-precision quantization keeps attention layers in FP16 while quantizing FFN blocks — often the sweet spot for LLMs.
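One way to express such a mixed-precision policy is as a per-layer precision plan. This is only a sketch: the layer names and the `precision_plan` helper are hypothetical, and real toolkits expose their own configuration for excluding modules from quantization.

```python
def precision_plan(layer_names, keep_fp16=("attn",)):
    """Assign FP16 to layers whose name contains a protected tag, INT8 otherwise."""
    plan = {}
    for name in layer_names:
        if any(tag in name for tag in keep_fp16):
            plan[name] = "FP16"
        else:
            plan[name] = "INT8"
    return plan


# Hypothetical layer names for a single transformer block.
layers = [
    "block0.attn.q_proj",
    "block0.attn.out_proj",
    "block0.ffn.up_proj",
    "block0.ffn.down_proj",
]

for name, dtype in precision_plan(layers).items():
    print(f"{name:24s} {dtype}")
```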