Content moderation

Pre-screen user inputs before they reach your application. The moderation endpoint returns per-category probability scores so you can decide what to flag, block, or review.

Quickstart

Send a POST request to /v1/moderations with a JSON body that sets model to text-moderation-stable and input to the text you want to check.

curl https://api.meridian.sh/v1/moderations \
  -H "Authorization: Bearer $MERIDIAN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-moderation-stable",
    "input": "I want to hurt someone."
  }'
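
The same request can be sketched in Python using only the standard library. This is a minimal illustration rather than an official client: the function names build_moderation_request and moderate are hypothetical, and the 10-second timeout is an arbitrary choice. The URL, headers, and body fields mirror the curl example above.

```python
import json
import os
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.meridian.sh/v1/moderations"


def build_moderation_request(text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {"model": "text-moderation-stable", "input": text}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def moderate(text: str) -> dict:
    """Send the request and return the decoded JSON response.

    Reads the key from the MERIDIAN_API_KEY environment variable,
    as the curl example does. The timeout value is an assumption.
    """
    request = build_moderation_request(text, os.environ["MERIDIAN_API_KEY"])
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without a network call.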

Response

Each category returns a score between 0 and 1 in category_scores. Higher scores indicate a greater likelihood that the content violates the policy. The categories object holds a boolean per category, and the top-level flagged boolean is true when any category exceeds the threshold.

{
  "id": "modr-9gH7xK2LpQ",
  "model": "text-moderation-stable",
  "results": [
    {
      "flagged": true,
      "categories": {
        "harassment": false,
        "harassment/threatening": false,
        "hate": false,
        "hate/threatening": false,
        "self-harm": false,
        "self-harm/intent": false,
        "self-harm/instructions": false,
        "sexual": false,
        "sexual/minors": false,
        "violence": true,
        "violence/graphic": false
      },
      "category_scores": {
        "harassment": 0.0003,
        "harassment/threatening": 0.0001,
        "hate": 0.0002,
        "hate/threatening": 0.0001,
        "self-harm": 0.0001,
        "self-harm/intent": 0.0001,
        "self-harm/instructions": 0.0001,
        "sexual": 0.0001,
        "sexual/minors": 0.0001,
        "violence": 0.987,
        "violence/graphic": 0.002
      }
    }
  ]
}
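
As a rough sketch of how a client might read this response, the snippet below pulls out the flagged categories and the highest-scoring category. The helper name summarize_result is hypothetical, and the sample payload is the response above abbreviated to three categories for brevity.

```python
import json

# The response shown above, abbreviated to three categories.
SAMPLE_RESPONSE = json.loads("""
{
  "id": "modr-9gH7xK2LpQ",
  "model": "text-moderation-stable",
  "results": [
    {
      "flagged": true,
      "categories": {
        "harassment": false,
        "violence": true,
        "violence/graphic": false
      },
      "category_scores": {
        "harassment": 0.0003,
        "violence": 0.987,
        "violence/graphic": 0.002
      }
    }
  ]
}
""")


def summarize_result(response: dict) -> dict:
    """Summarize the first entry in results (hypothetical helper)."""
    result = response["results"][0]
    # Names of categories whose boolean is true.
    flagged_categories = sorted(
        name for name, hit in result["categories"].items() if hit
    )
    # Category with the highest score, flagged or not.
    top_category, top_score = max(
        result["category_scores"].items(), key=lambda item: item[1]
    )
    return {
        "flagged": result["flagged"],
        "flagged_categories": flagged_categories,
        "top_category": top_category,
        "top_score": top_score,
    }


summary = summarize_result(SAMPLE_RESPONSE)
print(summary["flagged_categories"], summary["top_category"], summary["top_score"])
```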

Categories

Category key              Description
harassment                Harassment
harassment/threatening    Harassment / Threatening
hate                      Hate
hate/threatening          Hate / Threatening
self-harm                 Self-harm
self-harm/intent          Self-harm / Intent
self-harm/instructions    Self-harm / Instructions
sexual                    Sexual
sexual/minors             Sexual / Minors
violence                  Violence
violence/graphic          Violence / Graphic

Best practices

  • Run moderation before persisting or displaying user-generated content.
  • Use per-category scores to build graduated responses: warn on low-confidence flags and block on high-confidence ones.
  • Combine moderation with a human-review queue for edge cases where scores fall in a middle band.
  • The model is optimized for English. For other languages, test accuracy before deploying to production.
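
The graduated-response and human-review practices above could be sketched as a simple banding function over category_scores. The function name decide_action and the threshold values (0.2, 0.5, 0.9) are illustrative assumptions, not values documented for this API; tune them against your own traffic.

```python
# Illustrative thresholds only; these are assumptions, not documented defaults.
WARN_AT = 0.2    # below this, allow the content
REVIEW_AT = 0.5  # middle band starts here: send to human review
BLOCK_AT = 0.9   # at or above this, block outright


def decide_action(category_scores: dict) -> str:
    """Map the highest category score to allow, warn, review, or block."""
    top_score = max(category_scores.values(), default=0.0)
    if top_score >= BLOCK_AT:
        return "block"
    if top_score >= REVIEW_AT:
        return "review"
    if top_score >= WARN_AT:
        return "warn"
    return "allow"


# Scores from the sample response: violence at 0.987 lands in the block band.
print(decide_action({"violence": 0.987, "violence/graphic": 0.002}))
```

Using a single set of thresholds for every category keeps the sketch short. In practice you might prefer per-category thresholds, for example a stricter one for sexual/minors.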