Recipe

Translation quality evaluator

Build a side-by-side comparison tool that scores translation fidelity using semantic similarity and human-readable metrics.

Overview

This recipe walks through building an evaluation pipeline that takes source text, a candidate translation, and a reference translation, and emits a composite quality score. The output reports lexical overlap, embedding similarity, and fluency heuristics, all shown in a single dashboard panel.

Ingredients

Source text corpus (plaintext or JSONL)
Candidate translations from your model
Reference translations (human or gold-standard)
Sentence-transformers embedding model
Scoring module: BLEU, chrF, cosine similarity
Results table with sortable columns

Steps

Load data. Parse source, candidate, and reference files into aligned records keyed by segment ID.
Embed. Encode all three text columns with a multilingual sentence-transformer. Keep the vectors in memory.
Score. Compute BLEU and chrF for each candidate against its reference. Compute the cosine similarity between the candidate and reference embeddings.
Aggregate. Normalize each score to a 0–100 scale. Weight the lexical (BLEU, chrF) and semantic (cosine similarity) components equally in the composite metric.
Render. Display a sortable table with per-segment scores and a summary bar chart of the distribution.
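
The load, score, and aggregate steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the recipe's actual implementation. The JSONL field names ("id", "text"), the helper names, the rule that BLEU and chrF are averaged into one lexical component, and the choice to clamp negative cosine values to zero are all assumptions made for the example. BLEU, chrF, and the embedding vectors are expected to come from whatever scoring and embedding libraries you use.

```python
import json
import math


def load_jsonl(lines, text_key="text", id_key="id"):
    """Map segment ID -> text for one JSONL input (field names are assumed)."""
    records = {}
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            row = json.loads(line)
            records[row[id_key]] = row[text_key]
    return records


def align(source, candidate, reference):
    """Keep only the segment IDs present in all three inputs."""
    shared = source.keys() & candidate.keys() & reference.keys()
    return {
        seg_id: {
            "source": source[seg_id],
            "candidate": candidate[seg_id],
            "reference": reference[seg_id],
        }
        for seg_id in sorted(shared)
    }


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def composite_score(bleu, chrf, cosine):
    """Equal-weight blend of lexical and semantic scores on a 0-100 scale.

    Assumes bleu and chrf are already on 0-100. Cosine is clamped to
    [0, 1] and scaled by 100 (an assumed normalization).
    """
    lexical = (bleu + chrf) / 2
    semantic = max(0.0, min(1.0, cosine)) * 100
    return 0.5 * lexical + 0.5 * semantic
```

In use, each input file would be passed through load_jsonl as an iterable of lines, the three resulting maps through align, and each aligned segment's BLEU, chrF, and cosine values through composite_score. Segments missing from any one input are dropped silently here; a real pipeline may prefer to report them.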

Expected output

A dashboard panel showing the mean composite score, a histogram of score buckets, and a searchable segment table. Rows with low-scoring translations are highlighted in pink for quick triage.
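
The figures behind such a panel can be derived from the per-segment composite scores. The sketch below is illustrative only: the function name, the bucket width of 10, and the low-score threshold of 50 are hypothetical choices, since the recipe does not state which cutoff triggers the highlight. It assumes scores lie in the 0–100 range and that the bucket width divides 100 evenly.

```python
from statistics import mean


def summarize(scores, bucket_width=10, low_threshold=50.0):
    """Return the mean, histogram bucket counts, and low-scoring segment IDs.

    scores maps segment ID -> composite score on 0-100.
    bucket_width and low_threshold are assumed defaults, not recipe values.
    """
    n_buckets = 100 // bucket_width
    histogram = [0] * n_buckets
    for value in scores.values():
        # A score of exactly 100 is folded into the top bucket.
        index = min(int(value // bucket_width), n_buckets - 1)
        histogram[index] += 1
    low = sorted(seg for seg, value in scores.items() if value < low_threshold)
    return {
        "mean": mean(scores.values()) if scores else None,
        "histogram": histogram,
        "low": low,
    }
```

The "low" list holds the segment IDs a table renderer would highlight, and "histogram" holds one count per bucket, starting from the 0–10 bucket.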
