Automating SRE Incident Response with AWS Strands Agents and Claude Sonnet 4

Building an AI-Powered SRE Incident Response Workflow With AWS Strands Agents

The SRE Incident Response Agent leverages the AWS Strands Agents SDK to automate the end-to-end lifecycle of cloud incidents. By integrating Claude Sonnet 4 on Amazon Bedrock, the system orchestrates 4 specialized agents and 8 tools to move from alarm discovery to Kubernetes remediation in seconds.

Why This Matters

Traditional incident response relies on manual context-switching between monitoring dashboards, log aggregators, and CLI tools, which increases Mean Time to Repair (MTTR). This workflow replaces manual triage with a structured multi-agent system that correlates CloudWatch metrics with log events and then proposes or executes remediations.

Key Insights

Multi-agent Orchestration: The workflow utilizes 4 specialized agents and 8 tools to manage discovery, root cause analysis, and remediation.
Claude Sonnet 4 Integration: Runs Claude Sonnet 4 on Amazon Bedrock to analyze CloudWatch metrics and OOMKilled log events.
Safety via Dry-Run: The system defaults to DRY_RUN=true, printing kubectl and helm commands instead of executing them to prevent unintended production changes.
Automated Incident Reporting: Generates structured Slack reports including P-level severity, root cause findings, and follow-up monitoring recommendations.
Mocked Testing: Includes 12 pytest unit tests that mock boto3 entirely, allowing for CI/CD validation without active AWS credentials.
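
The dry-run default described above can be sketched as a small guard around command execution. This is a minimal illustration and not the sample's actual code: the helper names dry_run_enabled and run_remediation are hypothetical, and the only detail taken from the article is that DRY_RUN defaults to true and that commands are printed instead of executed.

```python
import os
import shlex
import subprocess


def dry_run_enabled(env=None):
    """Treat anything except an explicit 'false' as dry-run, so the safe mode is the default."""
    env = os.environ if env is None else env
    return env.get("DRY_RUN", "true").strip().lower() != "false"


def run_remediation(command, env=None):
    """Print the command in dry-run mode; execute it only when DRY_RUN=false.

    Hypothetical helper: `command` is an argument list such as
    ["kubectl", "rollout", "restart", "deployment/my-api", "-n", "prod"].
    """
    rendered = shlex.join(command)
    if dry_run_enabled(env):
        line = f"[DRY RUN] {rendered}"
        print(line)
        return line
    # Real execution path: raises CalledProcessError on a non-zero exit code.
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

Defaulting to dry-run when the variable is unset or misspelled means a configuration mistake results in a printed command rather than a production change.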

Working Examples

Environment setup and dependency installation for the SRE agent.

git clone https://github.com/strands-agents/samples.git
cd samples/02-samples/sre-incident-response-agent
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Triggering the agent for either broad discovery or targeted investigation.

# Option A: Automatic Alarm Discovery
python sre_agent.py

# Option B: Targeted Investigation
python sre_agent.py "High CPU alarm fired on ECS service my-api in prod namespace"

Practical Applications

Use Case: Identifying memory leaks in ECS services by correlating CPU spikes with GC thrashing and OOMKilled events in CloudWatch Logs.
Pitfall: Disabling DRY_RUN before validating the agent’s reasoning logic, potentially leading to unnecessary rolling restarts of stable deployments.
Use Case: Automated generation of post-mortem documentation by piping agent findings directly into Slack or incident management tools.
Pitfall: Omitting required IAM read permissions such as logs:FilterLogEvents, which prevents the RCA agent from accessing the log context it needs for diagnosis.
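
To make the log-access pitfall and the mocked-testing approach concrete, here is a hedged sketch of a log lookup that can be tested without AWS credentials. The function find_oom_events, the log group name, and the timestamps are assumptions for illustration; the only real API used is the CloudWatch Logs filter_log_events call, which is the operation that requires the logs:FilterLogEvents permission.

```python
from unittest.mock import MagicMock


def find_oom_events(logs_client, log_group, start_ms, end_ms):
    """Return log messages containing 'OOMKilled' within a time window.

    logs_client is expected to behave like boto3.client("logs").
    This sketch reads a single page of results and ignores nextToken.
    """
    response = logs_client.filter_log_events(
        logGroupName=log_group,
        filterPattern="OOMKilled",
        startTime=start_ms,  # epoch milliseconds
        endTime=end_ms,
    )
    return [event["message"] for event in response.get("events", [])]


def test_find_oom_events_without_aws():
    # Stand-in for the boto3 client, so no credentials or network are needed.
    fake_logs = MagicMock()
    fake_logs.filter_log_events.return_value = {
        "events": [
            {"timestamp": 1700000000000, "message": "container my-api OOMKilled"}
        ]
    }

    messages = find_oom_events(
        fake_logs, "/ecs/my-api", 1699999000000, 1700000100000
    )

    assert messages == ["container my-api OOMKilled"]
    fake_logs.filter_log_events.assert_called_once_with(
        logGroupName="/ecs/my-api",
        filterPattern="OOMKilled",
        startTime=1699999000000,
        endTime=1700000100000,
    )
```

Passing the client in as a parameter keeps the function easy to stub; against a real account, the same call fails with an access-denied error if the role lacks logs:FilterLogEvents on the log group.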
