Back to Docs
Defense Recipe

Prompt Injection Defense

Hardened input boundaries that prevent user-supplied text from hijacking system prompts, tool calls, or agent reasoning chains.

The Threat Model

Prompt injection occurs when untrusted input is concatenated into a trusted prompt context. Attackers embed meta-instructions that override system behavior, exfiltrate data, or bypass safety guardrails. Every boundary where user text meets model context is a potential injection surface.

Defense Layers

  • 1.Strict delimiters. Wrap every user input in unambiguous boundary tokens (e.g. <user_input>...</user_input>) and instruct the model to treat only delimited content as user data.
  • 2.Input sanitization. Strip or escape control characters, markdown fences, XML tags, and known injection patterns before the input reaches the prompt template.
  • 3.Canary tokens. Embed unique, non-guessable strings in the system prompt. Monitor model outputs for canary leakage — any appearance signals a successful injection.
  • 4.Output validation. Post-process model responses. Reject outputs that contain system-prompt fragments, tool-call syntax outside expected schemas, or instruction-like phrasing that did not originate from the trusted prompt.
  • 5.Least privilege tool access. Scope tool permissions per-request. Never expose sensitive functions to contexts that contain untrusted input.

Implementation Checklist

StepAction
01Define a canonical prompt template with explicit user-input slots
02Add delimiter wrapping at every injection boundary
03Deploy input sanitizer with regex and structural filters
04Insert canary tokens and wire output monitoring
05Add response validation and tool-call schema enforcement