Defense Recipe

Prompt Injection Defense

Hardened input boundaries that prevent user-supplied text from hijacking system prompts, tool calls, or agent reasoning chains.

The Threat Model

Prompt injection occurs when untrusted input is concatenated into a trusted prompt context. Attackers embed meta-instructions that override system behavior, exfiltrate data, or bypass safety guardrails. Every boundary where user text meets model context is a potential injection surface.

Defense Layers

1.Strict delimiters. Wrap every user input in unambiguous boundary tokens (e.g. <user_input>...</user_input>) and instruct the model to treat only delimited content as user data.
2.Input sanitization. Strip or escape control characters, markdown fences, XML tags, and known injection patterns before the input reaches the prompt template.
3.Canary tokens. Embed unique, non-guessable strings in the system prompt. Monitor model outputs for canary leakage — any appearance signals a successful injection.
4.Output validation. Post-process model responses. Reject outputs that contain system-prompt fragments, tool-call syntax outside expected schemas, or instruction-like phrasing that did not originate from the trusted prompt.
5.Least privilege tool access. Scope tool permissions per-request. Never expose sensitive functions to contexts that contain untrusted input.

Implementation Checklist

Step	Action
01	Define a canonical prompt template with explicit user-input slots
02	Add delimiter wrapping at every injection boundary
03	Deploy input sanitizer with regex and structural filters
04	Insert canary tokens and wire output monitoring
05	Add response validation and tool-call schema enforcement