← Back to Docs
Recipe

QLoRA Primer

Fine-tune 70B models on a single GPU without going broke.

What is QLoRA?

QLoRA combines 4-bit NormalFloat quantization with Low-Rank Adapters. The base model weights are frozen and quantized to 4 bits, while tiny trainable adapter matrices are injected into attention and feed-forward layers. You get full 16-bit fine-tuning quality at a fraction of the memory cost.

Memory Breakdown

ModelFull FTQLoRA
Llama-2 7B56 GB6 GB
Llama-2 13B104 GB10 GB
Llama-2 70B560 GB48 GB

Key Techniques

  • NF4 quantization — information-theoretically optimal for normally-distributed weights. Outperforms INT4 by preserving tail density.
  • Double quantization — quantizes the quantization constants themselves, saving ~0.4 bits per parameter.
  • Paged optimizers — offloads optimizer states to CPU RAM via unified memory paging, avoiding OOM spikes during checkpointing.

Gotchas

NF4 works best when weights are normally distributed. If your base model uses non-standard initialization, fall back to FP4. Always set bf16=True for compute dtype — FP32 accumulation with NF4 storage is the sweet spot. Gradient checkpointing is non-negotiable above 13B.