Recipe: Output safety classifier pipeline

A production-ready pipeline that classifies LLM outputs for safety violations before they reach end users. It is built on Meridian's inference infrastructure and classifies each response with sub-100 ms latency.

Architecture

User prompt + LLM response
        │
        ▼
┌─────────────────────────┐
│  Safety classifier      │
│  (fine-tuned DeBERTa)   │
└───────────┬─────────────┘
            │
     ┌──────┴──────┐
     ▼              ▼
  [safe]        [unsafe]
     │              │
     ▼              ▼
  Deliver       Rewrite / block
  to user       + log to SIEM
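
The routing step in the diagram can be sketched roughly as follows. This is a minimal illustration, not Meridian's API: `route_response`, `Decision`, the `classify` callable, and the `log_to_siem` callback are all hypothetical names, and the sketch assumes the classifier returns one score per category, where a score at or above the category's threshold counts as unsafe. It only blocks; the rewrite path is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

Scores = Dict[str, float]  # category name -> classifier score in [0, 1]


@dataclass
class Decision:
    deliver: bool                                      # True: pass response to user
    violations: Scores = field(default_factory=dict)   # categories that tripped


def route_response(
    prompt: str,
    response: str,
    classify: Callable[[str, str], Scores],   # hypothetical classifier call
    thresholds: Scores,
    log_to_siem: Callable[[dict], None],      # hypothetical SIEM logging hook
) -> Decision:
    scores = classify(prompt, response)
    # Assumption: a category is violated when its score meets or exceeds
    # its threshold. Categories without a configured threshold are skipped.
    violations = {
        category: score
        for category, score in scores.items()
        if category in thresholds and score >= thresholds[category]
    }
    if not violations:
        return Decision(deliver=True)
    # Unsafe branch: record the event, then block the response.
    log_to_siem({"event": "unsafe_output", "violations": violations})
    return Decision(deliver=False, violations=violations)
```

Passing the classifier and the logger in as callables keeps the routing logic testable without a GPU or a live SIEM endpoint. A stricter variant might treat a category with no configured threshold as an error rather than skipping it.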

Classification categories

  • Hate & harassment — slurs, threats, targeted attacks
  • Sexual content involving minors — zero-tolerance, immediate block
  • Self-harm — methods, encouragement, graphic depiction
  • Violence & gore — extreme graphic descriptions

Deployment

Deploy the classifier as a sidecar container alongside your inference server. Meridian handles model caching, batching, and GPU scheduling automatically. Configure thresholds per category via environment variables.

MERIDIAN_SAFETY_THRESHOLD_HATE=0.85
MERIDIAN_SAFETY_THRESHOLD_CSAM=0.99
MERIDIAN_SAFETY_THRESHOLD_SELFHARM=0.90
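
For illustration, a service that consumes these variables might parse them as shown below. This is a sketch under assumptions, not Meridian's own loader: the `load_thresholds` helper is hypothetical, the mapping of a variable suffix to a lowercase category key is a convention chosen here, and thresholds are assumed to be probabilities between 0 and 1.

```python
import os
from typing import Dict, Mapping

PREFIX = "MERIDIAN_SAFETY_THRESHOLD_"


def load_thresholds(env: Mapping[str, str] = os.environ) -> Dict[str, float]:
    """Collect per-category thresholds from variables that start with PREFIX."""
    thresholds: Dict[str, float] = {}
    for name, raw in env.items():
        if not name.startswith(PREFIX):
            continue
        # Hypothetical convention: the suffix, lowercased, is the category key,
        # e.g. MERIDIAN_SAFETY_THRESHOLD_HATE -> "hate".
        category = name[len(PREFIX):].lower()
        value = float(raw)  # raises ValueError on non-numeric input
        # Assumption: thresholds are probabilities in [0, 1].
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {raw!r}")
        thresholds[category] = value
    return thresholds
```

Failing at startup on a malformed or out-of-range value is deliberate in this sketch: a silently ignored threshold would leave that category unenforced.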