Recipe

Recipe: Build a fine-tuning dataset

A step-by-step guide to curating high-quality training data for domain-specific model fine-tuning with Meridian.

1. Define your schema

Start with a JSONL schema that captures instruction-input-output triples. Keep fields consistent across every record.

{"instruction": "...", "input": "...", "output": "..."}

Pull from support tickets, internal docs, or chat logs. Aim for 500–2000 diverse examples covering edge cases.

Strip PII, normalize whitespace, and run fuzzy dedup at threshold 0.85. Remove examples shorter than 20 tokens.

Score each example on clarity, correctness, and completeness. Discard anything below a 3/5 aggregate.

Use the CLI or dashboard to upload your JSONL file. Meridian auto-splits into train/val and starts the fine-tune job.

Need help? Browse more recipes or reach out on Discord.