Mastering Incident Command: Non-Technical Skills for Production Outages

Incident Command: The Skills They Don’t Teach You

Dr. Samson Tanimawo outlines the operational requirements of running production incidents. He asserts that the majority of the skill required for effective incident command is non-technical.

Why This Matters

In high-pressure production environments, technical expertise alone often fails because time perception warps and communication breaks down. Idealized models suggest a linear path to root cause analysis, but in practice the commander must prioritize immediate mitigation (e.g., rolling back) over investigation to minimize downtime and prevent team burnout.

Key Insights

Operational Cadence (2026): Commanders must force a regular update cycle (‘Update in 5 minutes’) to prevent context fragmentation when engineers are deep in logs.
Mitigation vs. Investigation: Prioritize service restoration over root cause discovery; for example, roll back a deployment to stop the outage first, and leave root cause analysis for the post-mortem.
Stakeholder Communication: Build trust through honesty rather than certainty by stating ‘We don’t know the cause yet’ while outlining active investigation paths.
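
The update cycle described above can be made mechanical so the commander does not rely on a warped sense of time. What follows is a minimal sketch, not a tool from the talk: the class name `UpdateCadence`, its fields, and its methods are all hypothetical, and the 5-minute default simply mirrors the 'Update in 5 minutes' example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class UpdateCadence:
    """Hypothetical helper: tracks when the next status update is due."""

    started: datetime                            # when the incident was declared
    interval: timedelta = timedelta(minutes=5)   # assumed cadence ('Update in 5 minutes')
    updates: list = field(default_factory=list)  # (timestamp, owner, summary) tuples

    def record(self, at: datetime, owner: str, summary: str) -> None:
        """Log a status update; 'We don't know yet' is a valid summary."""
        self.updates.append((at, owner, summary))

    def next_due(self) -> datetime:
        """The next update is due one interval after the last one (or after the start)."""
        last = self.updates[-1][0] if self.updates else self.started
        return last + self.interval

    def is_overdue(self, now: datetime) -> bool:
        """True once the commander should interrupt and ask for a status."""
        return now >= self.next_due()
```

In use, the commander (or a bot acting for them) would call `is_overdue(datetime.now())` periodically and, when it returns true, ask the investigating engineer for a short status and pass it to `record`. Timestamps are passed in explicitly rather than read inside the class so the logic is easy to test.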

Practical Applications

Use Case: The Incident Commander interrupts investigating engineers for 30-second status updates so that decisions can be made faster.
Pitfall: Attempting to be the smartest technical person in the room, which distracts from the primary role of coordination and emotional labor.
