Defense Recipe
Prompt Injection Defense
Hardened input boundaries that prevent user-supplied text from hijacking system prompts, tool calls, or agent reasoning chains.
The Threat Model
Prompt injection occurs when untrusted input is concatenated into a trusted prompt context. Attackers embed meta-instructions that override system behavior, exfiltrate data, or bypass safety guardrails. Every boundary where user text meets model context is a potential injection surface.
Defense Layers
- 1.Strict delimiters. Wrap every user input in unambiguous boundary tokens (e.g.
<user_input>...</user_input>) and instruct the model to treat only delimited content as user data. - 2.Input sanitization. Strip or escape control characters, markdown fences, XML tags, and known injection patterns before the input reaches the prompt template.
- 3.Canary tokens. Embed unique, non-guessable strings in the system prompt. Monitor model outputs for canary leakage — any appearance signals a successful injection.
- 4.Output validation. Post-process model responses. Reject outputs that contain system-prompt fragments, tool-call syntax outside expected schemas, or instruction-like phrasing that did not originate from the trusted prompt.
- 5.Least privilege tool access. Scope tool permissions per-request. Never expose sensitive functions to contexts that contain untrusted input.
Implementation Checklist
| Step | Action |
|---|---|
| 01 | Define a canonical prompt template with explicit user-input slots |
| 02 | Add delimiter wrapping at every injection boundary |
| 03 | Deploy input sanitizer with regex and structural filters |
| 04 | Insert canary tokens and wire output monitoring |
| 05 | Add response validation and tool-call schema enforcement |