Content moderation

Pre-screen user inputs before they reach your application. The moderation endpoint returns per-category probability scores so you can decide what to flag, block, or review.

Quickstart

Send a POST request to /v1/moderations with a JSON body that sets model to text-moderation-stable and input to the text you want to screen.

curl https://api.meridian.sh/v1/moderations \
  -H "Authorization: Bearer $MERIDIAN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-moderation-stable",
    "input": "I want to hurt someone."
  }'
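
The same request can be sketched in Python with only the standard library. This is a minimal illustration, not an official client: the helper name build_moderation_request is hypothetical, while the URL, headers, and body fields are copied from the curl example above. The helper only constructs the request; sending it is left to the caller.

```python
import json
import urllib.request

# URL taken from the curl example above.
MODERATION_URL = "https://api.meridian.sh/v1/moderations"


def build_moderation_request(text, api_key, model="text-moderation-stable"):
    """Build (but do not send) a POST request for the moderation endpoint.

    Hypothetical helper for illustration; not part of any official SDK.
    """
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        MODERATION_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Usage (requires a valid key and network access):
#   req = build_moderation_request("I want to hurt someone.", api_key)
#   with urllib.request.urlopen(req) as resp:
#       result = json.load(resp)
```

Keeping request construction separate from sending makes the payload easy to inspect and unit-test without touching the network.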

Response

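Each result contains a flagged boolean, a categories object with one boolean per category, and a category_scores object with a score between 0 and 1 per category. Higher scores indicate a greater likelihood that the content violates the policy. The flagged boolean is true when any category exceeds the threshold.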

{
  "id": "modr-9gH7xK2LpQ",
  "model": "text-moderation-stable",
  "results": [
    {
      "flagged": true,
      "categories": {
        "harassment": false,
        "harassment/threatening": false,
        "hate": false,
        "hate/threatening": false,
        "self-harm": false,
        "self-harm/intent": false,
        "self-harm/instructions": false,
        "sexual": false,
        "sexual/minors": false,
        "violence": true,
        "violence/graphic": false
      },
      "category_scores": {
        "harassment": 0.0003,
        "harassment/threatening": 0.0001,
        "hate": 0.0002,
        "hate/threatening": 0.0001,
        "self-harm": 0.0001,
        "self-harm/intent": 0.0001,
        "self-harm/instructions": 0.0001,
        "sexual": 0.0001,
        "sexual/minors": 0.0001,
        "violence": 0.987,
        "violence/graphic": 0.002
      }
    }
  ]
}
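
As a rough sketch of how a client might read this response, the snippet below pulls out the flagged categories and the highest-scoring category from the first result. The function name summarize_result is hypothetical, and the sample dictionary is a trimmed copy of the response above with only three categories kept.

```python
def summarize_result(response):
    """Return (flagged, flagged_categories, top_category, top_score) for the first result."""
    result = response["results"][0]
    # Categories whose boolean is true, sorted for stable output.
    flagged_categories = sorted(
        name for name, hit in result["categories"].items() if hit
    )
    # Category with the highest score.
    top_category, top_score = max(
        result["category_scores"].items(), key=lambda item: item[1]
    )
    return result["flagged"], flagged_categories, top_category, top_score


# Trimmed copy of the sample response (three categories only).
sample = {
    "results": [
        {
            "flagged": True,
            "categories": {
                "harassment": False,
                "violence": True,
                "violence/graphic": False,
            },
            "category_scores": {
                "harassment": 0.0003,
                "violence": 0.987,
                "violence/graphic": 0.002,
            },
        }
    ]
}

print(summarize_result(sample))  # (True, ['violence'], 'violence', 0.987)
```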

Categories

Category key	Description
harassment	Harassment
harassment/threatening	Harassment / Threatening
hate	Hate
hate/threatening	Hate / Threatening
self-harm	Self-harm
self-harm/intent	Self-harm / Intent
self-harm/instructions	Self-harm / Instructions
sexual	Sexual
sexual/minors	Sexual / Minors
violence	Violence
violence/graphic	Violence / Graphic

Best practices

▸ Run moderation before persisting or displaying user-generated content.
▸ Use per-category scores to build graduated responses: warn on low-confidence flags and block on high-confidence ones.
▸ Combine moderation with a human-review queue for edge cases where scores fall in a middle band.
▸ The model is optimized for English. For other languages, test accuracy before deploying to production.
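
The graduated-response and human-review practices above could be combined along these lines. This is a sketch under stated assumptions: the function name decide and both threshold values are illustrative only, since no specific cutoffs are documented here, and real values should be tuned against your own traffic.

```python
# Illustrative thresholds only; no specific values are documented.
REVIEW_THRESHOLD = 0.4
BLOCK_THRESHOLD = 0.9


def decide(category_scores, review_at=REVIEW_THRESHOLD, block_at=BLOCK_THRESHOLD):
    """Map per-category scores to "allow", "review", or "block".

    The decision uses the single highest category score:
    at or above block_at -> block, at or above review_at -> human review.
    """
    top = max(category_scores.values(), default=0.0)
    if top >= block_at:
        return "block"
    if top >= review_at:
        return "review"
    return "allow"
```

For example, the category_scores from the sample response (violence at 0.987) would map to "block" with these thresholds, while a top score of 0.55 would land in the middle band and be routed to review.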
