Architecting Unexploitable AI Agents: Beyond Prompt Engineering

Building Defense-in-Depth for AI Agents: A Practical Workshop

SoftwareDevs MVP Factory presents a workshop on securing AI agents through multi-layered architectural defense. Data from Anthropic’s Sonnet 4.6 shows that sandboxed coding agents achieve a 0% attack success rate compared to 50% for unbounded computer use.

Why This Matters

Prompt injection is fundamentally an architecture problem rather than a linguistic one; relying solely on system prompts is a failed security strategy. While a single clever prompt might fail 8% of the time, engineering a system with privilege separation and tool boundaries makes exploitation functionally impossible even when the model is tricked.

Key Insights

Anthropic’s Sonnet 4.6 system card (2026) reports that coding agents with sandboxed tools achieve a 0% attack success rate.
Least-Authority Tooling involves designing narrow, purpose-specific tools like search_faq rather than generic execute commands.
The ToolGate pattern prevents brute-force data enumeration by enforcing per-session rate limits on critical tool calls.
Context Isolation strips malicious instructions from RAG content by summarizing external text with low-privilege models.
Monitoring systems detect extraction attempts by flagging token usage spikes and abnormal tool call frequencies.

Working Examples

Input sanitization and classification layer to catch common injection patterns.

import re\nfrom dataclasses import dataclass\n@dataclass\nclass InputClassification:\n    original: str\n    flagged: bool\n    matched_patterns: list[str]\n    risk_score: float\nSUSPICIOUS_PATTERNS = [\n    r"ignore\\s+(all\\s+)?previous\\s+instructions",\n    r"you\\s+are\\s+now",\n    r"system\\*:\\*",\n    r"<\\|.*?\\|>",\n    r"IMPORTANT:\\s*new\\s+instructions",\n]\ndef classify_input(user_input: str, max_length: int = 500) -> InputClassification:\n    truncated = user_input[:max_length]\n    matches = [\n        p for p in SUSPICIOUS_PATTERNS\n        if re.search(p, truncated, re.IGNORECASE)\n    ]\n    return InputClassification(\n        original=truncated,\n        flagged=len(matches) > 0,\n        matched_patterns=matches,\n        risk_score=min(len(matches) / 3, 1.0),\n    )

Least-authority tool definitions with narrow scopes and enforced enums.

tools = [\n    {\n        "name": "search_faq",\n        "description": "Search the FAQ knowledge base",\n        "parameters": {\n            "query": {"type": "string", "maxLength": 100},\n            "category": {\n                "type": "string",\n                "enum": ["billing", "technical", "account", "general"],\n            },\n        },\n    },\n    {\n        "name": "get_user_orders",\n        "description": "Get orders for the currently authenticated user",\n        "parameters": {\n            "status_filter": {\n                "type": "string",\n                "enum": ["all", "active", "completed", "cancelled"],\n            },\n            "limit": {"type": "integer", "minimum": 1, "maximum": 10},\n        },\n    }\n]

Practical Applications

Customer Support Bots: Use narrow enums for categories. Pitfall: Broad tool access leads to internal data leakage.
Financial Transactions: Enforce human-in-the-loop validation. Pitfall: Automated payments via compromised prompts cause irreversible loss.
Order Retrieval Systems: Scope results to authenticated user IDs internally. Pitfall: Passing user_id as a parameter allows cross-user data access.

References:

https://dev.to/software_mvp-factory/building-defense-in-depth-for-ai-agents-a-practical-workshop-4874

On This Page

Building Defense-in-Depth for AI Agents: A Practical Workshop

Why This Matters

Key Insights

Working Examples

Practical Applications

Continue reading

Related Content

ServiceNow AI Agents Can Be Tricked Into Acting Against Each Other via Second-Order Prompts

Secure LLM Agents with Two-Stage Prompt Injection Detection

Continuously hardening ChatGPT Atlas against prompt injection attacks