Back to docsRecipe

vLLM primer

vLLM is the open-source inference engine that makes serving large language models fast and memory-efficient. This recipe walks through the core concepts, a minimal deployment, and the knobs that matter most for throughput.

Why vLLM?

Traditional transformers generate one token per forward pass, leaving GPU compute idle. vLLM uses PagedAttention to manage KV cache in non-contiguous blocks, slashing memory waste and enabling continuous batching. The result: 10–20× higher throughput versus Hugging Face Transformers out of the box.

Minimal deployment

Spin up an OpenAI-compatible API server with a single command:

python -m vllm.entrypoints.openai.api_server
  --model mistralai/Mistral-7B-Instruct-v0.3
  --tensor-parallel-size 1
  --max-model-len 8192

That exposes /v1/chat/completions on port 8000. Point any OpenAI client at it and you are done.

Key tuning knobs

  • --gpu-memory-utilization (default 0.90) — how much VRAM vLLM reserves for KV cache. Lower it if you need headroom for other processes.
  • --max-num-seqs — cap on concurrent sequences. Higher values improve batching but increase latency under load.
  • --enforce-eager — disables CUDA graphs. Useful for debugging or when graph capture fails on exotic models.

Production notes

Run behind a reverse proxy with streaming support. Monitor queue depth via the built-in Prometheus metrics on port 8000. For multi-GPU setups, set --tensor-parallel-size to the number of GPUs and ensure NCCL is configured. Prefix caching (enabled by default) reuses KV blocks across requests that share a common system prompt — a free latency win.

Next step: vLLM throughput tuning — dial in prefix caching, chunked prefill, and speculative decoding.