Ingestion

Document chunking

How Meridian splits source material into retrievable units — the single most impactful lever on answer quality.

Why it matters

A chunk that is too large dilutes retrieval precision. Too small, and context collapses. Meridian targets semantic self-containment — each chunk should answer exactly one atomic question without requiring its neighbors.

Fixed-size windows

The baseline strategy splits text every 512 tokens with a 64-token overlap. Fast, predictable, and surprisingly effective for prose. Meridian uses this as the fallback when no structural markers are detected.
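
A minimal sketch of the windowing described above, not Meridian's actual implementation. It assumes "512 tokens with a 64-token overlap" means 512-token windows that each share their first 64 tokens with the previous window. The function name `fixed_size_chunks` is hypothetical, and tokenization is left to the caller.

```python
from typing import List


def fixed_size_chunks(tokens: List[str], size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # distance between consecutive window starts
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Placeholder tokens stand in for the output of a real tokenizer (an assumption).
tokens = [f"t{i}" for i in range(1000)]
chunks = fixed_size_chunks(tokens)
print([len(c) for c in chunks])  # [512, 512, 104]
```

The final window is shorter than the rest. Whether to keep it, pad it, or merge it into its predecessor is a design choice the text above does not specify.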

Structure-aware splitting

When documents carry headings, code fences, or list boundaries, Meridian respects those edges. A Markdown file is never split mid-heading or inside a fenced block — chunk boundaries align to the nearest section break.
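
One way to picture this for Markdown is a line scanner that starts a new chunk at each heading but ignores heading-like lines inside a fenced block. The sketch below is illustrative only: `split_markdown_sections` is a hypothetical name, and fence handling is simplified to toggling on any fence line rather than tracking the fence character and length.

```python
import re
from typing import List

FENCES = ("`" * 3, "~" * 3)          # opening/closing markers of fenced code blocks
HEADING = re.compile(r"#{1,6}\s")    # ATX heading: 1-6 '#' followed by whitespace


def split_markdown_sections(text: str) -> List[str]:
    """Split Markdown at headings, never inside a fenced code block."""
    sections: List[List[str]] = []
    current: List[str] = []
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCES):
            in_fence = not in_fence          # entering or leaving a fenced block
        elif not in_fence and HEADING.match(line) and current:
            sections.append(current)         # close the previous section at the heading
            current = []
        current.append(line)
    if current:
        sections.append(current)
    return ["\n".join(section) for section in sections]


doc = "\n".join([
    "# Title", "intro",
    FENCES[0] + "python", "# not a heading", "x = 1", FENCES[0],
    "## Next", "body",
])
print(len(split_markdown_sections(doc)))  # 2
```

The comment line inside the fence looks like a heading but does not trigger a split, so the code block stays in one piece with the section that introduces it.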

Semantic re-chunking

For high-value knowledge bases, Meridian can re-chunk using embedding similarity. When adjacent sentences drift apart in vector space, that is, when the similarity between their embeddings drops, Meridian starts a new chunk. This produces chunks that are topically coherent even when the source has no explicit structure.
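
The idea can be sketched with cosine similarity between neighboring sentence embeddings. Everything specific here is an assumption made for illustration: the function names, the 0.5 threshold, and the toy two-dimensional vectors that stand in for the output of a real embedding model.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: List[str],
    embeddings: List[Sequence[float]],
    threshold: float = 0.5,  # hypothetical cut-off, not a documented Meridian default
) -> List[List[str]]:
    """Start a new chunk wherever adjacent sentences fall below `threshold` similarity."""
    if len(sentences) != len(embeddings):
        raise ValueError("need exactly one embedding per sentence")
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append([])  # topic drift detected: open a new chunk
        chunks[-1].append(sentences[i])
    return chunks


sentences = ["Cats purr.", "Cats nap often.", "GPUs run kernels.", "Kernels launch in grids."]
vectors = [(1.0, 0.0), (0.9, 0.1), (0.1, 0.9), (0.0, 1.0)]  # toy embeddings
print(semantic_chunks(sentences, vectors))
# [['Cats purr.', 'Cats nap often.'], ['GPUs run kernels.', 'Kernels launch in grids.']]
```

Comparing only adjacent sentences keeps the pass linear in document length. A lower threshold yields fewer, larger chunks, and a higher one splits more eagerly.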

Metadata anchoring

Every chunk carries its source path, heading breadcrumb, and position index. At retrieval time, Meridian can expand a hit into its surrounding context window — giving the LLM the full paragraph or section without polluting the vector index.
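
As a rough illustration, the record and the expansion step might look like the following. The `Chunk` fields mirror the three pieces of metadata named above, but the class, the `expand_hit` helper, and its `window` parameter are hypothetical and not taken from Meridian's API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Chunk:
    source_path: str              # file the chunk was cut from
    breadcrumb: Tuple[str, ...]   # heading trail, e.g. ("Ingestion", "Document chunking")
    position: int                 # index of the chunk within its source
    text: str


def expand_hit(hit: Chunk, store: List[Chunk], window: int = 1) -> str:
    """Return the hit plus up to `window` neighbors on each side from the same source."""
    neighbors = [
        c for c in store
        if c.source_path == hit.source_path and abs(c.position - hit.position) <= window
    ]
    neighbors.sort(key=lambda c: c.position)  # restore document order
    return "\n".join(c.text for c in neighbors)


store = [Chunk("guide.md", ("Guide",), i, f"paragraph {i}") for i in range(4)]
store.append(Chunk("other.md", ("Other",), 2, "unrelated"))
print(expand_hit(store[2], store))
# paragraph 1
# paragraph 2
# paragraph 3
```

Only the small chunks are embedded and indexed. The wider context is assembled after retrieval, so it never has to be stored as vectors.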
