GGUF/GGML Format Primer
GGUF (GPT-Generated Unified Format) is the successor to GGML, designed for fast loading, memory-mapped inference, and broad quantization support. Meridian uses GGUF under the hood when you call our local-LLM endpoints. This primer explains the layout, the quant tiers, and how to convert your own weights.
1. File layout
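A GGUF file is a single binary made of a magic number and fixed header, a key-value metadata table, tensor descriptors, and a tightly packed tensor data region. Values are little-endian by default, and the tensor data region is padded to an alignment boundary (32 bytes unless the general.alignment metadata key overrides it) so the runtime can mmap() the file directly.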
[ magic: "GGUF" ]
[ version: u32 ]
[ tensor_count: u64 ]
[ metadata_kv_count: u64 ]
[ metadata_kv[]: key/type/value ]
[ tensor_info[]: name/dims/type/offset ]
[ alignment padding ]
[ tensor_data: raw bytes ]
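As a rough illustration of the layout above, the fixed-size prefix of the file can be decoded with Python's struct module. This is a minimal sketch, not the parser Meridian or llama.cpp uses: the names parse_gguf_header, GGUFHeader, and align_up are hypothetical, the u64 counts follow the layout shown above (which, to our understanding, applies to GGUF version 2 and later), and the 32-byte default alignment is assumed from the GGUF specification.

```python
import struct
from typing import NamedTuple

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"   # little-endian: magic, version u32, two u64 counts
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes, no implicit padding
DEFAULT_ALIGNMENT = 32     # assumption: used when general.alignment is absent


class GGUFHeader(NamedTuple):
    """Hypothetical container for the fixed header fields."""
    version: int
    tensor_count: int
    metadata_kv_count: int


def parse_gguf_header(buf: bytes) -> GGUFHeader:
    """Decode the fixed header at the start of a GGUF file."""
    if len(buf) < HEADER_SIZE:
        raise ValueError(f"need at least {HEADER_SIZE} bytes, got {len(buf)}")
    magic, version, tensor_count, kv_count = struct.unpack_from(HEADER_FORMAT, buf, 0)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic is {magic!r}")
    return GGUFHeader(version, tensor_count, kv_count)


def align_up(offset: int, alignment: int = DEFAULT_ALIGNMENT) -> int:
    """Round offset up to the next multiple of alignment.

    The gap between offset and the result is the alignment padding that
    precedes the tensor data region.
    """
    return (offset + alignment - 1) // alignment * alignment
```

To try it on a real file, read the first 24 bytes (for example, open(path, "rb").read(24)) and pass them to parse_gguf_header. The metadata and tensor descriptor tables that follow are variable-length and are not covered by this sketch.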
2. Quantization tiers
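GGUF supports quantized tensor types from Q2_K up to Q8_0, plus unquantized F16 and F32. The _K variants are k-quants, which store per-block scales (and, for some types, minima) to improve accuracy at low bitrates. For most chat workloads, Q4_K_M is the sweet spot: about 4.5 to 5 bits per weight (the Q4_K block type itself is 4.5; the _M mix keeps some tensors at higher precision), a perplexity increase that is typically around 1% or less vs F16, and up to roughly 4x the throughput of the unquantized model, depending on the model and hardware.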
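The 4.5 bits-per-weight figure can be reproduced from the Q4_K block layout, and the same number gives a quick size estimate. The sketch below assumes the llama.cpp Q4_K layout as we understand it (256-weight super-blocks, eight sub-blocks with 6-bit scales and minima, two fp16 super-block factors); the function names and the 7-billion-parameter count are hypothetical and used only for illustration.

```python
def q4_k_bits_per_weight() -> float:
    """Bits per weight for one Q4_K super-block (assumed llama.cpp layout)."""
    weights = 256                      # weights per super-block
    quant_bytes = weights * 4 // 8     # 4-bit quantized values: 128 bytes
    scale_bytes = 8 * (6 + 6) // 8     # 8 sub-blocks x (6-bit scale + 6-bit min): 12 bytes
    super_bytes = 2 * 2                # fp16 scale and fp16 min for the super-block
    total_bytes = quant_bytes + scale_bytes + super_bytes  # 144 bytes
    return total_bytes * 8 / weights


def estimate_tensor_bytes(n_params: int, bits_per_weight: float) -> float:
    """Approximate tensor data size; ignores metadata and alignment padding."""
    return n_params * bits_per_weight / 8


bpw = q4_k_bits_per_weight()                      # 4.5
n_params = 7_000_000_000                          # hypothetical model size
q4_bytes = estimate_tensor_bytes(n_params, bpw)   # about 3.9 GB
f16_bytes = estimate_tensor_bytes(n_params, 16)   # 14 GB
```

Real Q4_K_M files come out somewhat larger than this estimate, because the _M mix stores some tensors in higher-precision types.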
3. Converting your weights
Use convert_hf_to_gguf.py from the llama.cpp repository to turn a Hugging Face checkpoint into an F16 GGUF file, then run llama-quantize to compress it to your target tier. Upload the resulting .gguf to Meridian and reference it by model_id in your API call.
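The two conversion steps can be scripted. The sketch below only builds the command lines; the helper names, directory, and file names are hypothetical, and the --outfile and --outtype flags and the positional arguments to llama-quantize reflect recent llama.cpp versions as far as we know, so check each tool's --help output against your checkout. The Meridian upload step is not shown.

```python
import sys
from typing import List


def build_convert_cmd(
    hf_dir: str,
    out_f16: str,
    script: str = "llama.cpp/convert_hf_to_gguf.py",  # path assumed; adjust to your checkout
) -> List[str]:
    """Command that converts a Hugging Face checkpoint directory to F16 GGUF."""
    return [sys.executable, script, hf_dir, "--outfile", out_f16, "--outtype", "f16"]


def build_quantize_cmd(
    in_f16: str,
    out_quant: str,
    tier: str = "Q4_K_M",
    binary: str = "llama-quantize",  # assumed to be on PATH
) -> List[str]:
    """Command that quantizes an F16 GGUF file to the target tier."""
    return [binary, in_f16, out_quant, tier]
```

To run the pipeline, pass each list in order to subprocess.run(cmd, check=True), for example build_convert_cmd("my-model", "my-model-f16.gguf") followed by build_quantize_cmd("my-model-f16.gguf", "my-model-q4_k_m.gguf"). Building the commands as lists avoids shell quoting problems with paths that contain spaces.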