Recipe

Knowledge graph design

A practical pattern for modeling entities, relations, and provenance so your retrieval layer can answer multi-hop questions without hallucinating the connective tissue.

A knowledge graph is more than a database with edges. The design choices around node granularity, edge typing, and source attribution decide whether your agents will be able to reason cleanly or get stuck chasing ambiguous references. This recipe walks through the three decisions that matter most when you bootstrap a graph against an unstructured corpus.

1. Pick the right node granularity

Start with the smallest unit that has a stable identity in your domain. For product catalogs, that is a SKU; for biomedical text, a gene or compound. Resist the urge to model every noun phrase as a node: that path leads to a graph dominated by mention-level noise instead of entity-level signal.
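As a minimal sketch of entity-level granularity, the Python below collapses several surface mentions into one node keyed by a canonical ID. The alias table and the function names are hypothetical, made up for illustration; a real pipeline would likely use an entity linker rather than a hand-written dictionary.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Hypothetical alias table: lowercased surface form -> canonical entity id.
ALIASES = {
    "caffeine": "ent:compound:caffeine",
    "1,3,7-trimethylxanthine": "ent:compound:caffeine",
}


def resolve_mention(mention: str) -> Optional[str]:
    """Return the canonical entity id for a mention, or None if it is unknown."""
    return ALIASES.get(mention.strip().lower())


def group_mentions(mentions: List[str]) -> Dict[str, List[str]]:
    """Collapse mentions into entity-level nodes; unresolved mentions are skipped."""
    nodes = defaultdict(list)
    for mention in mentions:
        entity_id = resolve_mention(mention)
        if entity_id is not None:
            nodes[entity_id].append(mention)
    return dict(nodes)
```

Skipping unresolved mentions is one possible policy. Queuing them for review is another, and it keeps the graph clean without silently losing candidates.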

2. Type your edges aggressively

Avoid a single generic RELATED_TO edge. Define a closed vocabulary of relation types up front. With strongly typed edges, downstream queries can express business logic declaratively in the graph pattern itself, instead of fetching every neighbor and post-filtering the results.
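One way to enforce a closed vocabulary is an enum that rejects unknown labels at ingest time. This is a sketch under assumptions: the relation names, the entity IDs, and parse_relation are illustrative and are not a schema prescribed by this recipe.

```python
from enum import Enum


class Relation(Enum):
    # Illustrative relation types for a biomedical graph (assumed, not prescribed).
    METABOLIZED_BY = "METABOLIZED_BY"
    ANTAGONIZES = "ANTAGONIZES"
    BINDS_TO = "BINDS_TO"


def parse_relation(label: str) -> Relation:
    """Map an extractor label onto the closed vocabulary, rejecting anything else."""
    try:
        return Relation(label.strip().upper())
    except ValueError:
        raise ValueError(f"unknown relation type: {label!r}") from None


# Illustrative edges as (source, relation, target) triples.
edges = [
    ("ent:compound:caffeine", Relation.METABOLIZED_BY, "ent:enzyme:cyp1a2"),
    ("ent:compound:caffeine", Relation.ANTAGONIZES, "ent:protein:adenosine_a2a_receptor"),
]

# The relation type carries the query logic; no string matching or post-filtering.
metabolized = [src for src, rel, _ in edges if rel is Relation.METABOLIZED_BY]
```

Because RELATED_TO is not a member of the enum, an extractor that emits it fails loudly instead of leaking a generic edge into the graph.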

3. Carry provenance on every edge

Every edge should remember which document, paragraph, and extraction run it came from. Provenance is what lets you retract extractions when a model improves, attribute answers to sources, and debug retrieval failures without re-running the whole pipeline.
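The sketch below shows one way an edge might carry its provenance, and how that supports retraction and source attribution. The class and function names are hypothetical, and the citation format simply mirrors the "doc:42#p7" style used in the node schema in this recipe.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Provenance:
    document_id: str     # e.g. "doc:42"
    paragraph: int       # paragraph index within the document
    extraction_run: str  # identifier of the run that produced the edge


@dataclass(frozen=True)
class Edge:
    source: str
    relation: str
    target: str
    provenance: Provenance


def retract_run(edges: List[Edge], run_id: str) -> List[Edge]:
    """Return only the edges that did not come from the given extraction run."""
    return [e for e in edges if e.provenance.extraction_run != run_id]


def sources_for(edges: List[Edge], source: str, relation: str, target: str) -> List[str]:
    """List the document/paragraph citations that support one asserted fact."""
    return sorted(
        {
            f"{e.provenance.document_id}#p{e.provenance.paragraph}"
            for e in edges
            if (e.source, e.relation, e.target) == (source, relation, target)
        }
    )
```

Storing one edge per supporting extraction, rather than merging duplicates, is a design choice assumed here: it makes retracting a run a simple filter, at the cost of a deduplication step at query time.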

Example node schema

{
  "id": "ent:compound:caffeine",
  "type": "Compound",
  "canonical_name": "Caffeine",
  "aliases": ["1,3,7-trimethylxanthine"],
  "provenance": {
    "first_seen": "doc:42#p7",
    "extraction_run": "run_2026_06_27_a",
    "confidence": 0.94
  }
}
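
A small validator can catch malformed nodes before they reach the graph. The sketch below is an assumption-heavy illustration: the required-field set and the "ent:<type>:<name>" ID convention are inferred from the single example above, not from a published specification, and validate_node is a hypothetical helper.

```python
from typing import List

# Fields assumed to be required, inferred from the example node above.
REQUIRED_FIELDS = {"id", "type", "canonical_name", "aliases", "provenance"}


def validate_node(node: dict) -> List[str]:
    """Return a list of problems; an empty list means the node passed these checks."""
    missing = sorted(REQUIRED_FIELDS - node.keys())
    if missing:
        return [f"missing fields: {', '.join(missing)}"]

    errors = []
    # Assumed convention: the id embeds the lowercased node type.
    expected_prefix = f"ent:{node['type'].lower()}:"
    if not node["id"].startswith(expected_prefix):
        errors.append(f"id should start with {expected_prefix!r}")

    confidence = node["provenance"].get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        errors.append("provenance.confidence must be a number in [0, 1]")
    return errors
```

Returning a list of problems instead of raising on the first one lets a batch loader report every issue for a node in a single pass.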

Next up: once your schema is stable, wire an incremental extractor so new documents extend the graph instead of rebuilding it from scratch.