Recipe: Build a fine-tuning dataset
A step-by-step guide to curating high-quality training data for domain-specific model fine-tuning with Meridian.
1. Define your schema
Start with a JSONL schema that captures instruction-input-output triples. Keep fields consistent across every record.
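One way to keep fields consistent is to validate each line as it is loaded. The following is a minimal sketch, assuming every record carries exactly the three keys instruction, input, and output, and that all three values are strings; the helper name validate_line is hypothetical and not part of Meridian.

```python
import json

# Assumption: the schema is exactly these three string fields.
REQUIRED_FIELDS = ("instruction", "input", "output")


def validate_line(line):
    """Parse one JSONL line and return the record, or raise ValueError."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc

    if not isinstance(record, dict):
        raise ValueError("record must be a JSON object")

    # Reject both missing and unexpected keys so every record has the same shape.
    if set(record) != set(REQUIRED_FIELDS):
        raise ValueError(
            f"expected fields {sorted(REQUIRED_FIELDS)}, got {sorted(record)}"
        )

    for field in REQUIRED_FIELDS:
        if not isinstance(record[field], str):
            raise ValueError(f"field {field!r} must be a string")

    return record
```

Rejecting unexpected keys, rather than ignoring them, surfaces drift early when examples come from several sources.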
{"instruction": "...", "input": "...", "output": "..."}
2. Source raw examples
Pull from support tickets, internal docs, or chat logs. Aim for 500–2000 diverse examples covering edge cases.
3. Clean and deduplicate
Strip PII, normalize whitespace, and run fuzzy deduplication at a similarity threshold of 0.85. Remove examples shorter than 20 tokens.
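The cleaning pass might be sketched as follows. Several things here are assumptions rather than anything Meridian prescribes: PII stripping is reduced to masking email addresses with a simple regex, tokens are approximated by whitespace-separated words, the 20-token minimum is applied to the three fields combined, and similarity is measured with the standard library's difflib.SequenceMatcher. The function names are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Assumption: only email addresses are masked; real PII removal needs more rules.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
FIELDS = ("instruction", "input", "output")


def clean_text(text):
    """Mask emails, then collapse runs of whitespace to single spaces."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return " ".join(text.split())


def clean_and_dedup(records, threshold=0.85, min_tokens=20):
    """Return cleaned records, dropping short ones and near-duplicates."""
    kept = []
    kept_texts = []
    for record in records:
        cleaned = {field: clean_text(record[field]) for field in FIELDS}
        text = " ".join(cleaned[field] for field in FIELDS)

        # Whitespace word count stands in for a real tokenizer here.
        if len(text.split()) < min_tokens:
            continue

        # Compare against every record kept so far; the first copy wins.
        is_duplicate = any(
            SequenceMatcher(None, text, prev, autojunk=False).ratio() >= threshold
            for prev in kept_texts
        )
        if is_duplicate:
            continue

        kept.append(cleaned)
        kept_texts.append(text)
    return kept
```

The pairwise comparison is quadratic in the number of records, which is likely tolerable for a few thousand examples but would need an approximate method such as MinHash for much larger sets.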
4. Validate with a rubric
Score each example on clarity, correctness, and completeness. Discard any example whose aggregate score falls below 3 out of 5.
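The rubric filter can be expressed in a few lines. This sketch assumes each criterion is scored from 1 to 5 and that the aggregate is the unweighted mean of the three scores; the recipe does not say how the aggregate is computed, so swap in a weighted average if your rubric calls for one. The function names are hypothetical.

```python
from statistics import mean

CRITERIA = ("clarity", "correctness", "completeness")


def aggregate_score(scores):
    """Return the unweighted mean of the three rubric scores (assumed 1-5)."""
    for name in CRITERIA:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} score must be between 1 and 5")
    return mean(scores[name] for name in CRITERIA)


def passes_rubric(scores, minimum=3.0):
    """Keep an example only if its aggregate is at least the minimum."""
    return aggregate_score(scores) >= minimum
```

For example, an example scored 4, 2, and 2 averages below 3 and would be discarded, while one scored 3 on every criterion sits exactly at the cutoff and is kept.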
5. Upload to Meridian
Use the CLI or dashboard to upload your JSONL file. Meridian auto-splits into train/val and starts the fine-tune job.