Recipe: Fine-tune dataset builder

Build a clean, deduplicated JSONL dataset from raw conversation logs for LoRA fine-tuning.

1. Ingest raw logs

Point the pipeline at the directory that holds your raw conversation logs. Accepted formats are JSON, CSV, and Parquet. Each record must contain a messages field: an array of objects, each with a role and a content.
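
A minimal sketch of the record check described above, for JSON input only. The function names is_valid_record and load_json_records are hypothetical, not part of the pipeline, and the sketch assumes each JSON file holds a top-level list of records.

```python
import json


def is_valid_record(record):
    """Return True if record has a non-empty 'messages' list of role/content dicts."""
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return False
    return all(
        isinstance(m, dict)
        and isinstance(m.get("role"), str)
        and isinstance(m.get("content"), str)
        for m in messages
    )


def load_json_records(text):
    """Parse JSON text holding a list of records and keep only the valid ones.

    Assumption: the file is a top-level JSON array; adapt for other layouts.
    """
    data = json.loads(text)
    return [r for r in data if is_valid_record(r)]
```

Rejecting malformed records at ingest keeps the later filter and format steps free of type checks.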

2. Filter & deduplicate

Strip empty turns, system prompts shorter than 20 characters, and conversations with fewer than 2 exchanges. Then apply MinHash LSH with a similarity threshold of 0.85 to remove near-duplicates (MinHash estimates the Jaccard similarity between two texts).
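
The two stages above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the pipeline's implementation: an "exchange" is taken to mean one user turn plus one assistant turn, shingles are character 5-grams, and the dedupe step compares MinHash signatures pairwise rather than bucketing them with LSH bands, so it only suits small samples. All function names are hypothetical.

```python
import hashlib


def clean_conversation(messages, min_system_chars=20, min_exchanges=2):
    """Drop empty turns and short system prompts; return None if too short.

    Assumption: an exchange is one user turn plus one assistant turn.
    """
    kept = []
    for m in messages:
        content = (m.get("content") or "").strip()
        if not content:
            continue  # empty turn
        if m.get("role") == "system" and len(content) < min_system_chars:
            continue  # system prompt under the length floor
        kept.append(m)
    users = sum(1 for m in kept if m.get("role") == "user")
    assistants = sum(1 for m in kept if m.get("role") == "assistant")
    if min(users, assistants) < min_exchanges:
        return None
    return kept


def shingles(text, k=5):
    """Character k-grams of lowercased, whitespace-normalised text."""
    norm = " ".join(text.lower().split())
    if len(norm) <= k:
        return {norm}
    return {norm[i:i + k] for i in range(len(norm) - k + 1)}


def minhash_signature(items, num_perm=64):
    """Keep the minimum hash value per seeded hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a distinct salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedupe(texts, threshold=0.85, num_perm=64):
    """Return indices of texts to keep, first occurrence wins.

    Pairwise O(n^2) comparison; a real LSH index would avoid the full scan.
    """
    kept_idx, kept_sigs = [], []
    for i, text in enumerate(texts):
        sig = minhash_signature(shingles(text), num_perm)
        if any(estimated_jaccard(sig, s) >= threshold for s in kept_sigs):
            continue
        kept_idx.append(i)
        kept_sigs.append(sig)
    return kept_idx
```

For full-size corpora, a library with a banded LSH index is the usual choice, since the pairwise scan here grows quadratically with the number of conversations.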

3. Format as JSONL

Each line is a single valid JSON object with a messages array. The output schema matches the OpenAI chat fine-tuning format and carries no extra fields.
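
A short sketch of the serialisation step, assuming each conversation is already a list of dicts with role and content keys. The helper names to_jsonl_line and write_jsonl are hypothetical. Any other keys on a message are dropped so the output carries only the two expected fields.

```python
import json


def to_jsonl_line(messages):
    """Serialise one conversation as {"messages": [...]} with role and content only."""
    slim = [{"role": m["role"], "content": m["content"]} for m in messages]
    # ensure_ascii=False keeps non-ASCII text readable instead of \u escapes.
    return json.dumps({"messages": slim}, ensure_ascii=False)


def write_jsonl(conversations, path):
    """Write one JSON object per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as fh:
        for messages in conversations:
            fh.write(to_jsonl_line(messages) + "\n")
```

Because json.dumps escapes newlines inside strings, each conversation stays on exactly one line even when a message contains line breaks.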

4. Validate & export

Run a token-count histogram, role-sequence sanity checks, and UTF-8 validation. Export two splits: 80% train and 20% validation. The output is ready for axolotl or unsloth.
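
The checks and the split can be sketched as follows. Several choices here are assumptions rather than the pipeline's behaviour: token counts are approximated by whitespace splitting instead of a real tokenizer, a valid role sequence is taken to be an optional system message followed by alternating user and assistant turns that end on assistant, and the split uses a seeded shuffle. All function names are hypothetical.

```python
import random
from collections import Counter


def token_histogram(conversations, bucket=256):
    """Count conversations per bucket of approximate (whitespace) token length."""
    hist = Counter()
    for messages in conversations:
        n = sum(len(m["content"].split()) for m in messages)
        hist[(n // bucket) * bucket] += 1
    return dict(sorted(hist.items()))


def roles_ok(messages):
    """Optional leading system turn, then user/assistant alternation ending on assistant."""
    roles = [m["role"] for m in messages]
    if roles and roles[0] == "system":
        roles = roles[1:]
    if not roles or roles[-1] != "assistant":
        return False
    expected = ["user", "assistant"]
    return all(r == expected[i % 2] for i, r in enumerate(roles))


def utf8_ok(text):
    """Return False for strings that cannot be encoded as UTF-8 (e.g. lone surrogates)."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def split_train_val(items, train_frac=0.8, seed=0):
    """Shuffle with a fixed seed, then cut into train and validation lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```

A fixed seed keeps the split reproducible across runs, which makes validation metrics comparable between dataset revisions.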

Pro tip: Run a dry pass on 1k samples before full ingestion. Check for PII leakage with the built-in regex scanner.
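
To illustrate the idea of a dry pass with a regex-based PII check, here is a stand-in sketch. It is not the built-in scanner, whose patterns and interface are not documented here. The two patterns (email and phone-like digit runs) and the names PII_PATTERNS, scan_pii, and dry_run are hypothetical, and the patterns are deliberately loose, so expect false positives.

```python
import re
from itertools import islice

# Illustrative patterns only; a real scanner would cover many more PII types.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def scan_pii(text):
    """Return the sorted names of the patterns that match anywhere in text."""
    return sorted(name for name, pat in PII_PATTERNS.items() if pat.search(text))


def dry_run(conversations, limit=1000):
    """Scan only the first `limit` conversations; map index -> matched pattern names."""
    flagged = {}
    for i, messages in enumerate(islice(conversations, limit)):
        hits = set()
        for m in messages:
            hits.update(scan_pii(m["content"]))
        if hits:
            flagged[i] = sorted(hits)
    return flagged
```

Since islice stops after the limit, the dry pass works on generators and never reads the full corpus.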