Recipe

Document chunking strategy

Chunking is often the single biggest lever on retrieval quality. Too small and you lose context; too large and your embeddings drift toward an averaged mush. This recipe walks through the tradeoffs Meridian weighs in production and shows the SDK calls for picking a strategy that matches the shape of your corpus.

1. Start with the document shape

Before tuning chunk size, look at how the source material is structured. Markdown with clean headings rewards semantic splitting. Transcripts and chat logs do better with fixed windows plus generous overlap. PDFs with tables almost always need a custom extractor before chunking ever runs.

A 90-second eyeball of ten random documents will tell you more than an hour of parameter sweeps.
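
Once you have eyeballed a few documents, a rough script can help confirm the impression across the rest of the corpus. The sketch below is a hypothetical helper, not part of the Meridian SDK: `describeShape`, its regular expressions, and its thresholds are all assumptions chosen for illustration. It counts Markdown headings and transcript-style speaker turns and suggests which kind of document it is looking at.

```typescript
// Hypothetical helper: none of these names come from the Meridian SDK.
type Shape = 'structured' | 'conversational' | 'unstructured';

interface ShapeReport {
  headingsPer100Lines: number;
  speakerTurnRatio: number;
  shape: Shape;
}

function describeShape(text: string): ShapeReport {
  // Ignore blank lines so spacing does not skew the ratios.
  const lines = text.split(/\r?\n/).filter(line => line.trim().length > 0);
  const total = Math.max(lines.length, 1);

  // Markdown ATX headings: one to six '#' characters followed by whitespace.
  const headings = lines.filter(line => /^#{1,6}\s/.test(line)).length;

  // Transcript-style turns such as "Alice: hello" or "[00:12] Bob: hi".
  const turnPattern = /^(\[[^\]]+\]\s*)?[A-Z][\w .'-]{0,30}:\s/;
  const turns = lines.filter(line => turnPattern.test(line)).length;

  const headingsPer100Lines = (headings / total) * 100;
  const speakerTurnRatio = turns / total;

  // Thresholds are illustrative guesses, not tuned values.
  let shape: Shape = 'unstructured';
  if (speakerTurnRatio > 0.5) {
    shape = 'conversational';
  } else if (headingsPer100Lines >= 5) {
    shape = 'structured';
  }

  return { headingsPer100Lines, speakerTurnRatio, shape };
}
```

A `structured` result points toward semantic splitting, and `conversational` toward fixed windows with overlap. Treat the output as a prompt for a closer look rather than a decision: a line like "Note: see appendix" will be counted as a speaker turn.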

2. Pick fixed vs semantic

Fixed-size chunking is cheap, deterministic, and easy to debug. It is the right default when documents are unstructured or when you need uniform embedding cost per chunk. Semantic chunking respects paragraph and heading boundaries, which keeps related ideas glued together and tends to improve answer faithfulness on structured corpora.

// chunking.ts — naive vs semantic
import { chunk } from '@meridian/sdk';

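// Fail loudly on a bad response instead of chunking an error page.
const res = await fetch('/policy.md');
if (!res.ok) throw new Error(`Failed to load /policy.md: ${res.status}`);
const doc = await res.text();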

// Strategy A: fixed-size sliding window
const naive = chunk(doc, {
  strategy: 'fixed',
  size: 512,
  overlap: 64,
});

// Strategy B: semantic, paragraph-aware
const semantic = chunk(doc, {
  strategy: 'semantic',
  maxTokens: 800,
  minTokens: 200,
  splitOn: ['heading', 'paragraph'],
});

console.log(naive.length, semantic.length);
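
To make the `size` and `overlap` parameters of the fixed strategy concrete, here is a minimal standalone sketch of a sliding window. It is not how the Meridian SDK implements `strategy: 'fixed'`; the function name `fixedWindows` is hypothetical, and whitespace-separated words stand in for model tokens, which a real tokenizer would count differently.

```typescript
// Illustrative only: whitespace-separated words stand in for model tokens.
function fixedWindows(text: string, size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError('require size > 0 and 0 <= overlap < size');
  }

  const tokens = text.split(/\s+/).filter(t => t.length > 0);
  const step = size - overlap; // how far each window advances
  const chunks: string[] = [];

  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size).join(' '));
    // Stop once a window reaches the end, so the final chunk is never
    // made up only of tokens already covered by the previous one.
    if (start + size >= tokens.length) break;
  }

  return chunks;
}

// Ten tokens, windows of four, one token shared between neighbors.
const windows = fixedWindows('a b c d e f g h i j', 4, 1);
console.log(windows); // [ 'a b c d', 'd e f g', 'g h i j' ]
```

Each window starts `size - overlap` tokens after the previous one, so a larger overlap means more chunks and more embedding cost for the same document.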

3. Measure, then iterate

Wire up an eval set of fifty real user questions before you ship. Track recall@5 and answer faithfulness as you sweep chunk size from 256 to 1024 tokens. The curve usually peaks somewhere unexpected, and the only way to find it is to plot it.
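
The recall half of that measurement can be sketched in a few lines. The block below is an assumption-laden illustration, not a Meridian API: `EvalCase`, `Retriever`, and `recallAtK` are hypothetical names, and recall@k is defined here as the fraction of questions for which at least one relevant passage appears in the top k results (sometimes called hit rate). Because chunk ids change whenever chunk size changes, the relevance labels are assumed to refer to stable source-passage ids that each retrieved chunk can be mapped back to.

```typescript
interface EvalCase {
  question: string;
  relevantIds: string[]; // source-passage ids a human judged relevant
}

// Any retriever: returns ranked passage ids, best match first.
type Retriever = (question: string, k: number) => string[];

function recallAtK(cases: EvalCase[], retrieve: Retriever, k: number): number {
  if (cases.length === 0) return 0;

  let hits = 0;
  for (const c of cases) {
    // Truncate defensively in case the retriever returns more than k ids.
    const top = new Set(retrieve(c.question, k).slice(0, k));
    if (c.relevantIds.some(id => top.has(id))) hits += 1;
  }
  return hits / cases.length;
}

// Usage with a stub retriever that always returns the same ranking.
const stub: Retriever = () => ['a', 'b', 'c'];
const cases: EvalCase[] = [
  { question: 'q1', relevantIds: ['a'] },
  { question: 'q2', relevantIds: ['z'] },
];
console.log(recallAtK(cases, stub, 2)); // 0.5
```

In a sweep, you would rebuild the index at each chunk size, pass that index's retriever to `recallAtK` with `k = 5`, and plot the resulting values against chunk size.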

Re-run the sweep whenever your corpus shape changes. Chunking is not a set-and-forget decision.
