Recipe: Output safety classifier pipeline
A production-ready pipeline that classifies LLM outputs for safety violations before they reach end users. It is built on Meridian's inference infrastructure and runs with sub-100 ms latency.
Architecture
User prompt + LLM response
            │
            ▼
┌─────────────────────────┐
│    Safety classifier    │
│  (fine-tuned DeBERTa)   │
└───────────┬─────────────┘
            │
     ┌──────┴──────┐
     ▼             ▼
  [safe]       [unsafe]
     │             │
     ▼             ▼
  Deliver    Rewrite / block
  to user    + log to SIEM
Classification categories
- Hate & harassment — slurs, threats, targeted attacks
- Sexual content involving minors — zero-tolerance, immediate block
- Self-harm — methods, encouragement, graphic depiction
- Violence & gore — extreme graphic descriptions
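The routing step in the Architecture diagram, combined with the categories above, can be sketched roughly as follows. This is an illustration and not Meridian's implementation: the `Category`, `Action`, `Decision`, and `decide` names are hypothetical, the rule that a category is flagged when its score meets or exceeds its threshold is an assumption, and the classifier itself is left out (the sketch starts from per-category scores). Logging to the SIEM is left to the caller.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Category(Enum):
    # Hypothetical identifiers for the four categories listed in the recipe.
    HATE = "hate_harassment"
    CSAM = "sexual_content_involving_minors"
    SELFHARM = "self_harm"
    VIOLENCE = "violence_gore"


class Action(Enum):
    DELIVER = "deliver"                    # [safe] branch
    REWRITE_OR_BLOCK = "rewrite_or_block"  # [unsafe] branch
    BLOCK = "block"                        # zero-tolerance: never rewritten


@dataclass(frozen=True)
class Decision:
    action: Action
    category: Optional[Category] = None  # category that triggered, if any
    score: float = 0.0


def decide(scores: Dict[Category, float],
           thresholds: Dict[Category, float]) -> Decision:
    """Map per-category classifier scores to one of the diagram's outcomes."""
    # Assumption: a category is flagged when score >= its threshold.
    flagged = {c: s for c, s in scores.items() if s >= thresholds[c]}
    if not flagged:
        return Decision(Action.DELIVER)
    # The zero-tolerance category takes precedence and is blocked outright.
    if Category.CSAM in flagged:
        return Decision(Action.BLOCK, Category.CSAM, flagged[Category.CSAM])
    # Otherwise report the flagged category with the highest score.
    worst = max(flagged, key=flagged.get)
    return Decision(Action.REWRITE_OR_BLOCK, worst, flagged[worst])


# Example inputs. The VIOLENCE threshold is a placeholder chosen for this sketch.
thresholds = {
    Category.HATE: 0.85,
    Category.CSAM: 0.99,
    Category.SELFHARM: 0.90,
    Category.VIOLENCE: 0.90,
}
scores = {
    Category.HATE: 0.12,
    Category.CSAM: 0.01,
    Category.SELFHARM: 0.95,
    Category.VIOLENCE: 0.30,
}
decision = decide(scores, thresholds)
print(decision.action.value, decision.category.value if decision.category else None)
```

Keeping the decision a pure function of scores and thresholds makes it straightforward to unit-test separately from the model and from the logging path.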
Deployment
Deploy the classifier as a sidecar container alongside your inference server. Meridian handles model caching, batching, and GPU scheduling automatically. Configure per-category thresholds via environment variables, for example:
MERIDIAN_SAFETY_THRESHOLD_HATE=0.85
MERIDIAN_SAFETY_THRESHOLD_CSAM=0.99
MERIDIAN_SAFETY_THRESHOLD_SELFHARM=0.90
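If you want to sanity-check these settings before the sidecar starts, a small validation script along these lines may help. It is a sketch under assumptions: the `load_thresholds` helper is hypothetical, the `MERIDIAN_SAFETY_THRESHOLD_` prefix is inferred from the three variables above, and treating each threshold as a probability in [0, 1] is a guess the recipe does not state.

```python
import os
from typing import Dict, Mapping

# Prefix inferred from the variable names shown in the recipe.
PREFIX = "MERIDIAN_SAFETY_THRESHOLD_"


def load_thresholds(env: Mapping[str, str] = os.environ) -> Dict[str, float]:
    """Collect per-category thresholds from environment-style key/value pairs."""
    thresholds: Dict[str, float] = {}
    for name, raw in env.items():
        if not name.startswith(PREFIX):
            continue  # unrelated variable
        category = name[len(PREFIX):]
        try:
            value = float(raw)
        except ValueError:
            raise ValueError(f"{name}={raw!r} is not a number") from None
        # Assumption: thresholds are probabilities, so they must lie in [0, 1].
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}={value} is outside [0, 1]")
        thresholds[category] = value
    return thresholds


# Usage with an explicit mapping, which keeps the check independent of the host.
example_env = {
    "MERIDIAN_SAFETY_THRESHOLD_HATE": "0.85",
    "MERIDIAN_SAFETY_THRESHOLD_CSAM": "0.99",
    "MERIDIAN_SAFETY_THRESHOLD_SELFHARM": "0.90",
    "PATH": "/usr/bin",
}
print(load_thresholds(example_env))
```

Failing fast on a malformed value is a deliberate choice here: a threshold that is silently ignored could leave a category running with whatever default applies.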