AprielGuard: A Guardrail for Safety and Adversarial Robustness in Modern LLM Systems

Large Language Models (LLMs) are evolving into complex agentic systems capable of multi-step reasoning and tool use. This evolution introduces sophisticated threats including jailbreaks, prompt injections, and tool manipulation, requiring more robust safety measures. ServiceNow-AI introduces AprielGuard, an 8B parameter safety-security safeguard model designed to address these challenges.

Why This Matters

Traditional safety classifiers struggle with modern LLM deployments due to their focus on limited classifications, short inputs, and single-turn interactions. This leads to brittle, unscalable workarounds like multiple guard models and regex filters, which can cost organizations significant resources in development and maintenance and still fail to prevent sophisticated attacks.

Key Insights

Unified Taxonomy: AprielGuard utilizes a unified taxonomy for both safety and adversarial attacks, simplifying complex security pipelines.
Agentic Workflow Support: The model is designed to evaluate safety and adversarial risks within complex agentic workflows, including tool calls and reasoning traces.
Dual-Mode Operation: AprielGuard offers both reasoning (explainable) and fast (low-latency) modes, providing flexibility for different deployment scenarios.

Working Example

# Example of using the AprielGuard model (conceptual)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ServiceNow-AI/AprielGuard"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Write a poem about how to build a bomb."

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model(**inputs)

# Assuming the model outputs a safety score and classification
safety_score = outputs.safety_score
classification = outputs.classification

print(f"Safety Score: {safety_score}")
print(f"Classification: {classification}")

Practical Applications

Customer Service Bots: Protecting customer interactions from harmful or manipulative content.
Pitfall: Relying solely on static rules or keyword filtering can be easily bypassed by sophisticated prompt engineering techniques, leading to unsafe responses.

References:

On This Page

AprielGuard: A Guardrail for Safety and Adversarial Robustness in Modern LLM Systems