Recipe
Translation quality evaluator
Build a side-by-side comparison tool that scores translation fidelity using semantic similarity and human-readable metrics.
Overview
This recipe walks through constructing an evaluation pipeline that ingests source text, a candidate translation, and a reference translation, then emits a composite quality score. The output includes lexical overlap, embedding distance, and fluency heuristics, all surfaced in a dashboard panel.
Ingredients
- Source text corpus (plaintext or JSONL)
- Candidate translations from your model
- Reference translations (human or gold-standard)
- Sentence-transformers embedding model
- Scoring module: BLEU, chrF, cosine similarity
- Results table with sortable columns
Steps
- Load data. Parse source, candidate, and reference files into aligned records keyed by segment ID.
- Embed. Encode all three text columns with a multilingual sentence-transformer. Store the vectors in memory.
- Score. Compute BLEU and chrF against the reference. Derive cosine similarity between candidate and reference embeddings.
- Aggregate. Normalize each score to a 0–100 scale. Weight the lexical component (BLEU and chrF) and the semantic component (cosine similarity) equally in the composite metric.
- Render. Display a sortable table with per-segment scores and a summary bar chart of the distribution.
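The load, score, and aggregate steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the recipe's actual notebook: the JSONL field names "id" and "text", the helper names, the choice to average BLEU and chrF into one lexical component, and the clipping of negative cosine values to zero are all hypothetical. BLEU and chrF are assumed to arrive already on a 0–100 scale from whatever scoring library you use, and the embedding vectors are assumed to be plain lists of floats.

```python
import json
import math


def parse_jsonl(lines):
    """Map segment ID -> text. Assumes each record has 'id' and 'text' fields."""
    records = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        records[row["id"]] = row["text"]
    return records


def align_records(source, candidate, reference):
    """Keep only the segment IDs that appear in all three mappings."""
    shared = sorted(set(source) & set(candidate) & set(reference))
    return [
        {
            "id": sid,
            "source": source[sid],
            "candidate": candidate[sid],
            "reference": reference[sid],
        }
        for sid in shared
    ]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def composite_score(bleu, chrf, cosine):
    """Equal-weight blend of lexical and semantic components on a 0-100 scale.

    Assumptions: bleu and chrf are already 0-100; cosine is in [-1, 1] and
    negative values are clipped to 0 before rescaling.
    """
    lexical = (bleu + chrf) / 2.0
    semantic = 100.0 * min(max(cosine, 0.0), 1.0)
    return 0.5 * lexical + 0.5 * semantic
```

For example, a segment with BLEU 40, chrF 60, and cosine similarity 0.8 has a lexical component of 50 and a semantic component of 80, so its composite is 65. Segments missing from any of the three inputs are dropped by align_records rather than scored against an empty string; whether to drop or report them is a design choice the recipe leaves open.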
Expected output
A dashboard panel showing the mean composite score, a histogram of score buckets, and a searchable segment table. Rows for low-scoring translations are highlighted in pink for quick triage.
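The numbers behind such a panel could be computed along the following lines. This is a sketch only: the function name summarize, the bucket width of 10 points, and the cutoff of 50 for "low-scoring" are illustrative assumptions, since the recipe does not state them. The input is assumed to be a mapping from segment ID to a composite score on the 0–100 scale.

```python
from statistics import mean


def summarize(composites, bucket_width=10, low_threshold=50.0):
    """Summarize {segment_id: composite score} for a dashboard panel.

    bucket_width is assumed to divide 100 evenly; low_threshold is a
    hypothetical cutoff for rows that would be highlighted for triage.
    """
    n_buckets = 100 // bucket_width
    histogram = [0] * n_buckets
    for score in composites.values():
        # A score of exactly 100 falls into the last bucket.
        idx = min(max(int(score // bucket_width), 0), n_buckets - 1)
        histogram[idx] += 1
    flagged = sorted(sid for sid, s in composites.items() if s < low_threshold)
    return {
        "mean": mean(composites.values()) if composites else None,
        "histogram": histogram,
        "flagged": flagged,
    }


summary = summarize({"s1": 65.0, "s2": 42.5, "s3": 100.0})
print(summary["flagged"])  # segment IDs below the assumed threshold
```

The flagged list is what a table renderer would use to decide which rows to color; the histogram list holds one count per bucket, from lowest scores to highest.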
Need the full implementation with embedding calls and scoring math? Browse the recipes index for the complete notebook.