Memory hierarchy primer
Modern inference workloads live or die by how well they exploit the memory hierarchy. This recipe walks through the layers a Meridian model sees on a typical Azure GPU node and shows where you can squeeze latency out without rewriting your stack.
1. The five layers that matter
Registers, L1/L2 SRAM, HBM, host DRAM, and NVMe form a five-tier pyramid. Each step away from the registers costs roughly 5x to 50x more cycles. The Meridian router places hot KV cache in HBM and cold prefix cache on NVMe so the GPU never blocks on a PCIe round trip on a hot path.
- Registers: ~1 cycle, ~256 KB per SM
- L1/L2 SRAM: ~30 cycles for an L1 hit, ~50 MB of L2 on H100
- HBM3: ~400 cycles, 80 GB at 3 TB/s
- Host DRAM: ~10k cycles via PCIe Gen5
- NVMe spill: ~100k cycles, effectively unbounded
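To sanity-check the 5x to 50x figure, the short Python sketch below computes the latency ratio between each pair of adjacent tiers. The cycle counts are the approximate values from the list above; the names TIER_LATENCY_CYCLES and step_ratios are illustrative only and not part of any Meridian API.

```python
# Approximate access latencies in GPU cycles, copied from the list above.
TIER_LATENCY_CYCLES = [
    ("Registers", 1),
    ("L1/L2 SRAM", 30),
    ("HBM3", 400),
    ("Host DRAM", 10_000),
    ("NVMe spill", 100_000),
]


def step_ratios(tiers):
    """Return (nearer_tier, farther_tier, latency_ratio) for each adjacent pair."""
    return [
        (near, far, far_cycles / near_cycles)
        for (near, near_cycles), (far, far_cycles) in zip(tiers, tiers[1:])
    ]


if __name__ == "__main__":
    for near, far, ratio in step_ratios(TIER_LATENCY_CYCLES):
        print(f"{near} -> {far}: ~{ratio:.0f}x slower")
```

With these numbers every step lands between 10x and 30x, which is inside the quoted range. Real ratios shift with clock speed, access pattern, and queue depth, so treat the output as an order-of-magnitude guide.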
2. Where the bottleneck actually lives
Most teams blame the model when the real culprit is a mis-sized KV cache that pushes tokens out of HBM mid-decode. The Meridian gateway returns a per-request x-meridian-hbm-pressure response header, so you can watch eviction pressure live without reaching for flame graphs.
curl -i https://gateway.meridian.dev/v1/chat/completions \
  -H "Authorization: Bearer $MERIDIAN_KEY" \
  -H "Content-Type: application/json" \
  -H "x-meridian-trace: hbm" \
  -d '{"model":"azure/model-router","messages":[...]}'
# response headers (-i prints them above the body)
# x-meridian-hbm-pressure: 0.71
# x-meridian-cache-tier: hbm
# x-meridian-evictions: 0
3. Practical tuning checklist
Start cheap, and climb the pyramid only when the data forces you to. The first three items below cover about 80% of the latency wins we see across Meridian customers in production.
- Pin prompt prefixes longer than 2k tokens to the prefix cache.
- Cap batch fan-out so the working set fits in L2 between attention heads.
- Route long-context jobs through the HBM-resident tier explicitly.
- Spill cold conversations to NVMe after 15 minutes of inactivity.
- Watch the x-meridian-hbm-pressure header and alert when it exceeds 0.85.
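The last checklist item can be automated with a few lines of Python. This is a minimal sketch, not Meridian tooling: it assumes the pressure header carries a plain decimal string as in the sample response in section 2, and the helper names parse_headers and should_alert are hypothetical.

```python
PRESSURE_ALERT_THRESHOLD = 0.85  # alert level taken from the checklist

# Header lines in the shape shown in the sample response (assumed format).
SAMPLE_HEADERS = """\
x-meridian-hbm-pressure: 0.71
x-meridian-cache-tier: hbm
x-meridian-evictions: 0
"""


def parse_headers(raw):
    """Parse 'name: value' lines into a dict with lowercase header names.

    Leading '# ' prefixes are tolerated so text pasted from the sample works.
    Lines without a colon (for example an HTTP status line) are skipped.
    """
    headers = {}
    for line in raw.splitlines():
        line = line.lstrip("# ").strip()
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def should_alert(headers, threshold=PRESSURE_ALERT_THRESHOLD):
    """Return True when the reported HBM pressure is strictly above threshold."""
    value = headers.get("x-meridian-hbm-pressure")
    if value is None:
        return False  # header absent, so there is nothing to alert on
    return float(value) > threshold


if __name__ == "__main__":
    print(should_alert(parse_headers(SAMPLE_HEADERS)))
```

In practice you would feed parse_headers the header block captured from each response and forward a True result to whatever alerting system you already run.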