Back to DocsRecipe

LLM GuardrailsPrimer

A practical framework for constraining language model outputs to prevent prompt injection, data leakage, and unsafe generations in production applications.

Why Guardrails Matter

LLMs are non-deterministic by design. Without guardrails, a single adversarial prompt can extract system instructions, leak PII from context windows, or generate harmful content that bypasses your application logic. Guardrails act as a safety layer between the model and the user — validating inputs, constraining outputs, and enforcing policy before anything reaches the end user.

The Three-Layer Model

1

Input Filtering

Sanitize user prompts before they reach the model. Detect injection patterns, strip control tokens, and enforce length limits at the edge.

2

Context Hardening

Structure system prompts with delimiters and explicit boundaries. Never trust user-supplied data inside the context window without escaping.

3

Output Validation

Post-process every generation. Check for PII patterns, disallowed content, and structural integrity before returning to the client.

Implementation Patterns

  • 01Regex pre-screening. Run fast pattern matching on every user input before it touches the model. Catch known injection vectors like "ignore previous instructions" or role-switching delimiters.
  • 02XML-tagged system prompts. Wrap instructions in unambiguous tags so the model can distinguish trusted directives from user content even under adversarial conditions.
  • 03Secondary classifier. Route outputs through a smaller, faster model or rule engine that scores toxicity, PII presence, and policy compliance before release.

Common Failure Modes

FailureMitigation
Prompt injection via user data concatenationAlways use structured message arrays; never string-concat user input into system context.
PII leakage from training dataApply output scanners for email, phone, SSN patterns. Redact before logging.
Jailbreak via multi-turn manipulationReset context boundaries per session. Rate-limit consecutive refusals to detect probing.

Production note: Guardrails are not a one-time setup. Treat them as a continuous feedback loop — log every blocked input and flagged output, audit weekly, and update your patterns as attack techniques evolve. Start with regex, graduate to classifier-based approaches when your threat model demands it.