Recipe

QLoRA Primer

Fine-tune 70B models on a single GPU without going broke.

What is QLoRA?

QLoRA combines 4-bit NormalFloat quantization with Low-Rank Adapters. The base model weights are frozen and quantized to 4 bits, while tiny trainable adapter matrices are injected into attention and feed-forward layers. You get full 16-bit fine-tuning quality at a fraction of the memory cost.

Memory Breakdown

Model	Full FT	QLoRA
Llama-2 7B	56 GB	6 GB
Llama-2 13B	104 GB	10 GB
Llama-2 70B	560 GB	48 GB

Key Techniques

NF4 quantization — information-theoretically optimal for normally-distributed weights. Outperforms INT4 by preserving tail density.
Double quantization — quantizes the quantization constants themselves, saving ~0.4 bits per parameter.
Paged optimizers — offloads optimizer states to CPU RAM via unified memory paging, avoiding OOM spikes during checkpointing.

Gotchas

NF4 works best when weights are normally distributed. If your base model uses non-standard initialization, fall back to FP4. Always set bf16=True for compute dtype — FP32 accumulation with NF4 storage is the sweet spot. Gradient checkpointing is non-negotiable above 13B.

Implementation →All Recipes