Recipe
QLoRA Primer
Fine-tune 70B models on a single GPU without going broke.
What is QLoRA?
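QLoRA combines 4-bit NormalFloat (NF4) quantization with Low-Rank Adapters (LoRA). The base model weights are frozen and quantized to 4 bits, while small trainable adapter matrices are injected into the attention and feed-forward layers. Gradients flow through the frozen 4-bit weights into the adapters, so only the adapters are updated. The QLoRA authors report that this matches full 16-bit fine-tuning quality at a fraction of the memory cost.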
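To make "small trainable adapter matrices" concrete, here is a minimal sketch of how few parameters one adapter adds next to the frozen layer it wraps. The layer size and rank are hypothetical values chosen for illustration, not settings from this recipe.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical sizes for a single projection layer (assumptions for illustration).
d_in = d_out = 4096
rank = 16

frozen = d_in * d_out                       # weights kept frozen and stored in 4 bits
trainable = lora_param_count(d_in, d_out, rank)

print(f"frozen: {frozen:,}  trainable: {trainable:,}  ratio: {trainable / frozen:.2%}")
```

With these numbers the adapter is under 1% of the size of the layer it wraps, which is why the optimizer state for the adapters stays small.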
Memory Breakdown
| Model | Full fine-tuning | QLoRA |
|---|---|---|
| Llama-2 7B | 56 GB | 6 GB |
| Llama-2 13B | 104 GB | 10 GB |
| Llama-2 70B | 560 GB | 48 GB |
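The arithmetic behind the table can be sketched roughly as follows. Two assumptions here are not stated in the table: the full fine-tuning column happens to equal 8 bytes per parameter, and the NF4 estimate covers only the frozen weights plus about 0.127 bits per parameter of quantization constants (double quantization with 64-weight blocks). Function names are illustrative.

```python
def full_ft_gb(params_billion, bytes_per_param=8.0):
    """Full fine-tuning footprint in GB; 8 bytes/param is inferred from the table."""
    return params_billion * bytes_per_param


def nf4_weights_gb(params_billion, overhead_bits=0.127):
    """Frozen 4-bit weights plus quantization constants, in GB.

    overhead_bits=0.127 assumes double quantization with 64-weight blocks.
    """
    return params_billion * (4.0 + overhead_bits) / 8.0


for name, size in [("Llama-2 7B", 7), ("Llama-2 13B", 13), ("Llama-2 70B", 70)]:
    print(
        f"{name}: full fine-tuning ~{full_ft_gb(size):.0f} GB, "
        f"NF4 weights ~{nf4_weights_gb(size):.1f} GB"
    )
```

The gap between the NF4 weight estimate and the QLoRA column is what remains for adapters, optimizer state, and activations, all of which depend on rank, batch size, and sequence length.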
Key Techniques
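- NF4 quantization: information-theoretically optimal for normally distributed weights. Outperforms INT4 because its 16 levels follow the quantiles of a normal distribution instead of being evenly spaced.
- Double quantization: quantizes the quantization constants themselves, saving about 0.37 bits per parameter (overhead drops from 0.5 to roughly 0.127 bits at a block size of 64).
- Paged optimizers: use NVIDIA unified memory to page optimizer states out to CPU RAM when the GPU runs short, avoiding OOM errors from memory spikes during gradient checkpointing, such as a mini-batch with unusually long sequences.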
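The first two techniques can be illustrated with a small standard-library sketch. It is a simplified quantile quantizer, not the exact NF4 codebook (the real one is asymmetric so that zero is represented exactly), and all function names are hypothetical. The overhead function reproduces the double-quantization arithmetic, assuming 64-weight blocks, 8-bit first-level constants, and one FP32 constant per 256 of them.

```python
import random
from statistics import NormalDist


def normal_quantile_levels(bits=4):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    nd = NormalDist()
    # Midpoints of n equal-probability bins; avoids the infinite 0 and 1 quantiles.
    raw = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(v) for v in raw)
    return [v / top for v in raw]


def quantize_block(block, levels):
    """Absmax-scale one block, then store the index of the nearest level."""
    scale = max(abs(w) for w in block) or 1.0
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / scale))
        for w in block
    ]
    return codes, scale


def dequantize_block(codes, scale, levels):
    """Map stored indices back to approximate weights."""
    return [levels[c] * scale for c in codes]


def quant_constant_overhead_bits(block_size=64, second_block_size=256, double_quant=True):
    """Bits per parameter spent on quantization constants."""
    if not double_quant:
        return 32 / block_size  # one FP32 constant per block
    # 8-bit constant per block, plus one FP32 constant per second_block_size constants.
    return 8 / block_size + 32 / (block_size * second_block_size)


random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(64)]  # synthetic weights
levels = normal_quantile_levels()
codes, scale = quantize_block(block, levels)
restored = dequantize_block(codes, scale, levels)
max_err = max(abs(a - b) for a, b in zip(block, restored))

saved = quant_constant_overhead_bits(double_quant=False) - quant_constant_overhead_bits()
print(f"max round-trip error: {max_err:.5f}")
print(f"double quantization saves {saved:.3f} bits per parameter")
```

Because the levels are denser near zero, where most normally distributed weights sit, small weights are reconstructed more precisely than they would be on an evenly spaced 16-level grid.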
Gotchas
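NF4 works best when weights are roughly normally distributed. If your base model's weights deviate strongly from that (for example, after non-standard initialization), fall back to FP4. Use bfloat16 as the compute dtype (bnb_4bit_compute_dtype=torch.bfloat16 in Hugging Face's BitsAndBytesConfig, plus bf16=True in the training arguments): NF4 storage with BF16 compute is the sweet spot. Gradient checkpointing is non-negotiable above 13B.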
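Put together, the settings above might look like the following sketch, assuming the Hugging Face transformers, peft, and bitsandbytes stack. The rank, alpha, dropout, and target module names are illustrative assumptions (the module names follow Llama-style checkpoints), and the function name is hypothetical.

```python
def build_qlora_configs():
    """Return (quantization config, LoRA config, trainer kwargs) for a QLoRA run."""
    # Imported lazily so this file can be loaded without the GPU stack installed.
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # use "fp4" as the fallback
        bnb_4bit_use_double_quant=True,         # quantize the quantization constants
        bnb_4bit_compute_dtype=torch.bfloat16,  # NF4 storage, BF16 compute
    )
    lora_config = LoraConfig(
        r=16,               # illustrative rank, not a recommendation
        lora_alpha=32,      # illustrative scaling
        lora_dropout=0.05,  # illustrative dropout
        # Attention and feed-forward projections (Llama-style names, an assumption).
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        bias="none",
        task_type="CAUSAL_LM",
    )
    # Keyword arguments intended for transformers.TrainingArguments.
    trainer_kwargs = dict(
        bf16=True,
        gradient_checkpointing=True,
        optim="paged_adamw_32bit",  # paged optimizer
    )
    return bnb_config, lora_config, trainer_kwargs
```

The quantization config would be passed as quantization_config to AutoModelForCausalLM.from_pretrained, and the LoRA config to peft.get_peft_model after calling peft.prepare_model_for_kbit_training on the loaded model.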