Recipe: Fine-tune dataset builder
Build a clean, deduplicated JSONL dataset from raw conversation logs for LoRA fine-tuning.
1. Ingest raw logs
Point the pipeline at your raw conversation directory. Accepted formats: JSON, CSV, Parquet. Each record must contain a messages field holding an array of role/content pairs.
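A minimal sketch of the record check for JSON input, using only the standard library. The function names is_valid_record and load_json_records are hypothetical, not part of the pipeline. The sketch assumes each .json file holds either one record or a list of records, and it treats an empty messages array as invalid. CSV and Parquet would need their own readers.

```python
import json
from pathlib import Path


def is_valid_record(record):
    """Return True if record has a non-empty 'messages' list of role/content dicts."""
    messages = record.get("messages") if isinstance(record, dict) else None
    # Assumption: an empty messages array is rejected at ingest time.
    if not isinstance(messages, list) or not messages:
        return False
    return all(
        isinstance(m, dict)
        and isinstance(m.get("role"), str)
        and isinstance(m.get("content"), str)
        for m in messages
    )


def load_json_records(directory):
    """Yield valid records from every *.json file under directory."""
    for path in sorted(Path(directory).rglob("*.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Assumption: a file is either a single record or a list of records.
        records = data if isinstance(data, list) else [data]
        for record in records:
            if is_valid_record(record):
                yield record
```

Records that fail the check are skipped silently here; logging them instead makes it easier to spot a malformed export.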
2. Filter & deduplicate
Strip empty turns, drop system prompts shorter than 20 characters, and discard conversations with fewer than 2 exchanges. Then apply MinHash LSH with a Jaccard similarity threshold of 0.85 to remove near-duplicates.
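The filter and dedup step could look roughly like this in pure Python. This is a sketch under stated assumptions, not the pipeline's own implementation: one exchange is taken to mean one user turn plus one assistant turn, shingles are 5-character grams, and the signature length (128) and band count (16) are illustrative choices. Only the 0.85 threshold comes from the recipe. All function names are hypothetical.

```python
import hashlib
import random
import re

NUM_PERM = 128     # signature length (assumed)
BANDS = 16         # LSH bands, 8 rows each (assumed)
THRESHOLD = 0.85   # similarity threshold from the recipe

_PRIME = (1 << 61) - 1
_rng = random.Random(42)
# One (a, b) pair per hash function h(x) = (a*x + b) mod p.
_COEFFS = [(_rng.randrange(1, _PRIME), _rng.randrange(_PRIME)) for _ in range(NUM_PERM)]


def clean_conversation(messages):
    """Apply the turn filters; return the kept messages, or None to drop."""
    kept = []
    for m in messages:
        content = m["content"].strip()
        if not content:
            continue  # empty turn
        if m["role"] == "system" and len(content) < 20:
            continue  # short system prompt
        kept.append(m)
    users = sum(m["role"] == "user" for m in kept)
    assistants = sum(m["role"] == "assistant" for m in kept)
    # Assumption: one exchange = one user turn plus one assistant turn.
    return kept if min(users, assistants) >= 2 else None


def shingles(text, k=5):
    """Character k-grams over lowercased, whitespace-normalised text."""
    text = re.sub(r"\s+", " ", text.lower()).strip()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash(shingle_set):
    """MinHash signature: the minimum of each hash function over the set."""
    hashed = [
        int.from_bytes(hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest(), "big")
        for s in shingle_set
    ]
    return tuple(min((a * h + b) % _PRIME for h in hashed) for a, b in _COEFFS)


def deduplicate(conversations):
    """Keep the first conversation of each near-duplicate group."""
    rows = NUM_PERM // BANDS
    buckets = {}  # (band index, band values) -> signatures already kept
    kept = []
    for conv in conversations:
        sig = minhash(shingles("\n".join(m["content"] for m in conv)))
        bands = [(i, sig[i * rows:(i + 1) * rows]) for i in range(BANDS)]
        # LSH: only signatures sharing at least one band are compared.
        candidates = {s for key in bands for s in buckets.get(key, ())}
        is_dup = any(
            sum(a == b for a, b in zip(sig, s)) / NUM_PERM >= THRESHOLD
            for s in candidates
        )
        if is_dup:
            continue
        kept.append(conv)
        for key in bands:
            buckets.setdefault(key, []).append(sig)
    return kept
```

The fraction of matching signature positions is an estimate of Jaccard similarity, so pairs close to 0.85 can land on either side of the cut. For large corpora a dedicated library is likely a better fit than this sketch.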
3. Format as JSONL
Write one valid JSON object per line, each with a single top-level messages key. The output schema matches the OpenAI chat fine-tuning format, with no extra fields.
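A short sketch of the serialization step. The helper names to_jsonl_line and write_jsonl are hypothetical; the sketch assumes each conversation is a list of dicts carrying at least role and content, and it drops every other key.

```python
import json

ALLOWED_KEYS = ("role", "content")


def to_jsonl_line(messages):
    """Serialize one conversation as a single-line JSON object."""
    record = {"messages": [{k: m[k] for k in ALLOWED_KEYS} for m in messages]}
    # json.dumps escapes newlines inside strings, so the result stays on one line.
    return json.dumps(record, ensure_ascii=False)


def write_jsonl(conversations, path):
    """Write one conversation per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as fh:
        for messages in conversations:
            fh.write(to_jsonl_line(messages) + "\n")
```

Passing ensure_ascii=False keeps non-ASCII text readable in the output file instead of escaping it.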
4. Validate & export
Run a token-count histogram, role-sequence sanity checks, and UTF-8 validation. Export two splits: 80% train, 20% validation. The output is ready for axolotl or unsloth.
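The checks and the split might be sketched as follows. Several things here are assumptions: whitespace-separated words stand in for real tokenizer counts, a valid role sequence is taken to be an optional system message followed by strictly alternating user/assistant turns ending on assistant, and the split uses a seeded shuffle. All function names are hypothetical.

```python
import random
from collections import Counter


def role_sequence_ok(messages):
    """Optional system first, then user/assistant alternating, ending on assistant."""
    roles = [m["role"] for m in messages]
    if roles and roles[0] == "system":
        roles = roles[1:]
    if not roles or roles[-1] != "assistant":
        return False
    return roles == ["user", "assistant"] * (len(roles) // 2)


def is_valid_utf8(text):
    """False if the string cannot be encoded as UTF-8 (e.g. lone surrogates)."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def token_histogram(conversations, bin_size=256):
    """Histogram of per-conversation lengths, keyed by bin start.

    Assumption: whitespace-separated words are a rough proxy for tokens;
    swap in the target model's tokenizer for real counts.
    """
    counts = Counter()
    for messages in conversations:
        n = sum(len(m["content"].split()) for m in messages)
        counts[(n // bin_size) * bin_size] += 1
    return dict(sorted(counts.items()))


def split_dataset(conversations, train_frac=0.8, seed=0):
    """Shuffle with a fixed seed, then cut into train and validation lists."""
    items = list(conversations)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]
```

A fixed seed keeps the split reproducible across runs, which matters when comparing fine-tuning results on the same validation set.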
Pro tip: Run a dry pass on 1k samples before full ingestion. Check for PII leakage with the built-in regex scanner.
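For a sense of what a regex-based PII check on a dry pass does, here is a rough stand-in. It is not the built-in scanner: the two patterns (email, US-style phone number) and the name scan_for_pii are illustrative assumptions, and real PII detection needs broader coverage.

```python
import re
from itertools import islice

# Hypothetical patterns for illustration only.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def scan_for_pii(conversations, limit=1000):
    """Return (conversation index, message index, pattern name) for each hit.

    Only the first `limit` conversations are scanned, matching the 1k dry pass.
    """
    hits = []
    for ci, messages in enumerate(islice(conversations, limit)):
        for mi, m in enumerate(messages):
            for name, pattern in PII_PATTERNS.items():
                if pattern.search(m["content"]):
                    hits.append((ci, mi, name))
    return hits
```

Reviewing the hits by hand on the sample gives a quick read on false positives before the full run.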