How CyberArk Protects AI Agents with Instruction Detectors and History-Aware Validation

Instruction Detection and History-Aware Validation for AI Agent Security

CyberArk developed a layered security pipeline for AI agents, based on instruction detection and history-aware validation, to mitigate risks from malicious data and context manipulation; this addresses vulnerabilities in LLMs exposed to untrusted external data. Principal Software Architect Niv Rabin emphasizes treating all text entering an agent’s context as untrusted until validated.

Traditional security measures focusing on malicious content are insufficient for LLMs; the core vulnerability lies in the potential for instruction-based attacks hidden within seemingly benign data. Failing to address this can lead to compromised agents executing unintended and potentially damaging actions, resulting in data breaches or system manipulation.

Key Insights

History Poisoning: Malicious fragments accumulating over time to form a directive.
Honeypot Actions: Synthetic tools designed to detect suspicious prompting behavior.
LLM-based Judges: Utilizing LLMs to identify instructional intent within external data.

Working Example

# Example of a honeypot action description
honeypot_action = {
    "name": "system_probe",
    "description": "Examine the system's internal configuration and report details.",
    "function": "do_nothing" # This function intentionally does nothing
}

# If the agent selects this action, it indicates a potential malicious attempt
# to gain unauthorized system information.

Practical Applications

Financial Institutions: Protecting agents used for customer service from revealing sensitive account information.
Pitfall: Relying solely on input sanitization without validating context history, leaving systems vulnerable to history poisoning attacks.

References:

https://www.infoq.com/news/2026/01/cyberark-agents-defenses/

On This Page

Instruction Detection and History-Aware Validation for AI Agent Security

Key Insights

Working Example

Practical Applications

Continue reading

Related Content

Trustworthy Productivity: Securing AI Accelerated Development

Addressing the Risks of AI Agent Non-Compliance and Human-Centric RLHF Sycophancy

Nine Seconds to Zero: Why AI Agents Need a Destructive-Action Proxy