GGUF/GGML Format Primer

GGUF (commonly expanded as GPT-Generated Unified Format) is the successor to the older GGML file formats. It is designed for fast loading, memory-mapped access to weights, self-describing metadata, and broad quantization support. Meridian uses GGUF under the hood when you call our local-LLM endpoints. This primer explains the file layout, the quantization tiers, and how to convert your own weights.

1. File layout

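A GGUF file is a single binary file made up of a header that begins with the magic bytes "GGUF", a key-value metadata table, tensor descriptors, and a tightly packed tensor data region. Files are little-endian by default (version 3 of the format also allows big-endian files). The tensor data region is aligned, to 32 bytes unless the general.alignment metadata key says otherwise, so the runtime can mmap() the file directly.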

[ magic: "GGUF" ]
[ version: u32 ]
[ tensor_count: u64 ]
[ metadata_kv_count: u64 ]
[ metadata_kv[]: key/type/value ]
[ tensor_info[]: name/dims/type/offset ]
[ alignment padding ]
[ tensor_data: raw bytes ]
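
The fixed-size part of the layout above can be read with a few lines of Python. This is a minimal sketch, assuming a little-endian file of version 2 or later (where both counts are u64) and a 32-byte default alignment. The names read_gguf_header and padding_for are illustrative and not part of any library.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor_count) + 8 (kv_count)


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header from the first 24 bytes of a file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes to read a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError(f"bad magic {data[:4]!r}, expected {GGUF_MAGIC!r}")
    # "<" selects little-endian with no padding: u32, u64, u64.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


def padding_for(offset: int, alignment: int = 32) -> int:
    """Bytes of padding needed to round offset up to the next aligned boundary."""
    return (alignment - offset % alignment) % alignment


if __name__ == "__main__":
    # Synthetic header for demonstration; a real file would be opened in "rb" mode.
    header = struct.pack("<4sIQQ", GGUF_MAGIC, 3, 2, 5)
    print(read_gguf_header(header))
    print(padding_for(len(header)))
```

The metadata and tensor_info entries that follow the header are variable-length, so a full reader has to walk them in order before it can locate the aligned start of tensor_data.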

2. Quantization tiers

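GGUF supports quantized types ranging from Q2_K up to Q8_0, alongside unquantized F16 and F32. The _K types are k-quants: weights are grouped into super-blocks of 256, and each sub-block carries its own scale (and, for some types, a minimum), which improves accuracy at low bitrates. For most chat workloads, Q4_K_M is the sweet spot: roughly 4.5 to 5 bits per weight, a perplexity loss vs F16 that is often around 1%, and throughput that can reach several times that of the unquantized model when inference is limited by memory bandwidth. Both figures vary with the model and the hardware, so measure them on your own workload.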
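
The bits-per-weight figures follow directly from how many bytes each quantized block occupies. The sketch below works through that arithmetic. The block sizes in BLOCK_LAYOUT are taken from my reading of ggml's block structs and should be treated as assumptions to verify against the llama.cpp source. The function names are illustrative.

```python
# (weights per block, bytes per block) for a few GGUF tensor types.
# Assumed from ggml's block layouts; verify against the source you build with.
BLOCK_LAYOUT = {
    "F32": (1, 4),
    "F16": (1, 2),
    "Q8_0": (32, 34),    # 32 int8 weights plus one fp16 scale
    "Q6_K": (256, 210),  # 256-weight super-block
    "Q4_K": (256, 144),  # 128 bytes of 4-bit weights, 12 bytes of packed
                         # scales/minima, and two fp16 super-block values
}


def bits_per_weight(qtype: str) -> float:
    """Average storage cost of one weight, in bits, for the given type."""
    weights, nbytes = BLOCK_LAYOUT[qtype]
    return nbytes * 8 / weights


def estimated_size_gib(n_params: float, qtype: str) -> float:
    """Rough tensor-data size if every weight used the same type."""
    return n_params * bits_per_weight(qtype) / 8 / 2**30


if __name__ == "__main__":
    for qtype in BLOCK_LAYOUT:
        size = estimated_size_gib(7e9, qtype)
        print(f"{qtype:5s} {bits_per_weight(qtype):7.4f} bpw  ~{size:5.2f} GiB for 7B params")
```

A mixed preset such as Q4_K_M stores some tensors in a higher-precision type, so its effective bits per weight comes out somewhat above the 4.5 computed here for pure Q4_K.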

3. Converting your weights

Use convert_hf_to_gguf.py from the llama.cpp repository to turn a Hugging Face checkpoint into an F16 GGUF file, then run llama-quantize to compress it to your target tier. Upload the resulting .gguf file to Meridian and reference it by model_id in your API call.
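
The two conversion steps can be scripted. The following is a sketch under stated assumptions: llama.cpp is checked out in a local llama.cpp directory, llama-quantize is on your PATH, and the directory and file names are hypothetical placeholders. The --outfile and --outtype flags and the positional arguments to llama-quantize reflect current llama.cpp usage, so check them against the version you have installed. The Meridian upload step is not shown.

```python
import subprocess
import sys
from pathlib import Path


def build_conversion_commands(hf_dir, out_dir, quant="Q4_K_M", llama_cpp_dir="llama.cpp"):
    """Return the two command lines: HF checkpoint -> F16 GGUF -> quantized GGUF."""
    out = Path(out_dir)
    f16_path = out / "model-f16.gguf"          # hypothetical output name
    quant_path = out / f"model-{quant}.gguf"   # hypothetical output name
    convert_cmd = [
        sys.executable,
        str(Path(llama_cpp_dir) / "convert_hf_to_gguf.py"),
        str(hf_dir),
        "--outfile", str(f16_path),
        "--outtype", "f16",
    ]
    # llama-quantize takes: input file, output file, target type.
    quantize_cmd = ["llama-quantize", str(f16_path), str(quant_path), quant]
    return [convert_cmd, quantize_cmd]


def run_conversion(hf_dir, out_dir, quant="Q4_K_M", llama_cpp_dir="llama.cpp"):
    """Run both steps in order, stopping at the first failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_conversion_commands(hf_dir, out_dir, quant, llama_cpp_dir):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands rather than running them, so they can be inspected first.
    for cmd in build_conversion_commands("my-hf-checkpoint", "gguf-out"):
        print(" ".join(cmd))
```

Keeping command construction separate from execution makes it easy to review the exact invocations before committing to a long-running conversion.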
