Text Generation Inference

A production-grade recipe for serving open-weight LLMs with sub-100ms time-to-first-token using Hugging Face TGI. Covers quantization strategies, continuous batching, and token streaming patterns that keep your GPU saturated without blowing your latency budget.

4-bit: GPTQ & AWQ quant targets

8k: Max context window baseline

SSE: Server-Sent Events streaming
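
To make the 4-bit and 8k figures concrete, here is a rough memory-budget sketch. The parameter count and layer shapes are assumptions for a Mistral-7B-style model (about 7.25e9 parameters, 32 layers, 8 KV heads, head dimension 128), "8k" is taken to mean 8192 tokens, and the helper names are hypothetical. The weight figure ignores the per-group scales and zero points that GPTQ and AWQ store, so real checkpoints are somewhat larger.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Bytes needed to hold the weights alone at a given precision."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Assumed shapes for a Mistral-7B-style model (not taken from this recipe).
N_PARAMS = 7_250_000_000
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
CONTEXT_TOKENS = 8192  # assumption: "8k" means 8192 tokens

fp16_gib = weight_bytes(N_PARAMS, 16) / GIB
int4_gib = weight_bytes(N_PARAMS, 4) / GIB
kv_gib = kv_cache_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM) * CONTEXT_TOKENS / GIB

print(f"fp16 weights:  {fp16_gib:.2f} GiB")
print(f"4-bit weights: {int4_gib:.2f} GiB")
print(f"fp16 KV cache for one {CONTEXT_TOKENS}-token sequence: {kv_gib:.2f} GiB")
```

Under these assumptions the weights shrink by 4× when moving from fp16 to 4-bit, while each full-length sequence still costs about 1 GiB of fp16 KV cache, so at long contexts the cache rather than the weights tends to limit batch size.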

Quick Start

$ docker run --gpus all \
    -p 8080:80 \
    -v $PWD/models:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id mistralai/Mistral-7B-Instruct-v0.3
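
Once the container is up, a quick smoke test could look like the sketch below. It assumes the server is reachable on localhost port 8080 (the host port mapped above) and uses TGI's `/generate` route with an `inputs` string and a `parameters` object. The helper names `build_generate_request` and `generate` are hypothetical, and only the standard library is used.

```python
import json
import urllib.request

# Assumption: host port 8080 from the docker command, server on this machine.
TGI_URL = "http://localhost:8080"


def build_generate_request(prompt: str, max_new_tokens: int = 64,
                           base_url: str = TGI_URL) -> urllib.request.Request:
    """Build a POST request for the non-streaming /generate route."""
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, max_new_tokens: int = 64, timeout: float = 60.0) -> str:
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(prompt, max_new_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["generated_text"]
```

Calling `generate("Explain continuous batching in one sentence.")` should return a short completion once the model has finished loading. A connection error usually means the container is still downloading or loading weights.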

Architecture Notes

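  • Continuous batching with a PagedAttention KV cache: requests join and leave the running batch at token granularity, so sequences are not padded to max_seq_len.
  • FlashAttention-2 fused kernels for up to 2-3× throughput on Ampere and newer GPUs.
  • Token streaming over SSE with one JSON payload per token, so clients can render output incrementally.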
  • Watermarking logits processor for provenance tagging.
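
The SSE streaming item above can be illustrated from the client side. This is a minimal sketch, assuming each event arrives as a `data:` line whose JSON body carries a `token` object with `text` and `special` fields, which is the shape TGI's `/generate_stream` route emits. The function name `iter_stream_tokens` is hypothetical, and the sample events are hand-written rather than captured from a server.

```python
import json
from typing import Iterable, Iterator, Union


def iter_stream_tokens(lines: Iterable[Union[str, bytes]]) -> Iterator[str]:
    """Yield the text of each non-special token from a sequence of SSE lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        # SSE events are separated by blank lines; only "data:" lines carry payloads.
        if not line.startswith("data:"):
            continue
        event = json.loads(line[len("data:"):])
        token = event.get("token") or {}
        # Assumption: end-of-sequence and similar markers are flagged as special.
        if token.get("special"):
            continue
        yield token.get("text", "")


# Hand-written sample events for illustration only.
SAMPLE_EVENTS = [
    'data:{"index":1,"token":{"id":1,"text":"Hello","logprob":-0.1,"special":false},"generated_text":null}',
    "",
    b'data:{"index":2,"token":{"id":2,"text":" world","logprob":-0.2,"special":false},"generated_text":null}',
    "",
    'data:{"index":3,"token":{"id":3,"text":"</s>","logprob":-0.3,"special":true},"generated_text":"Hello world"}',
]

print("".join(iter_stream_tokens(SAMPLE_EVENTS)))  # prints "Hello world"
```

Against a live server, the same generator could be fed line by line from the body of a streaming POST to `/generate_stream`, printing each token as it arrives instead of joining them at the end.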

Next step

Deploy with our vLLM OpenAI-compatible endpoint recipe for a drop-in replacement for the OpenAI API.