Google AI Releases Auto-Diagnose: LLM-Based System for Automated Integration Test Debugging

These articles are AI-generated summaries. Please check the original sources for full details.

Google AI Releases Auto-Diagnose: A Large Language Model (LLM)-Based System to Diagnose Integration Test Failures at Scale

Google researchers have introduced Auto-Diagnose, an LLM-powered system designed to automate the diagnosis of complex integration test failures. In a manual evaluation of 71 real-world failures across 39 teams, the tool correctly identified the root cause 90.14% of the time.

Why This Matters

Integration tests represent a significant debugging tax because failures often surface as generic symptoms like timeouts, while the actual error is buried deep within disparate component logs. At Google, a survey of 116 developers revealed that 38.4% of these failures take more than an hour to diagnose, and 8.9% take over a day, whereas unit tests rarely exceed an hour for diagnosis. Auto-Diagnose addresses this by aggregating logs across data centers and processes into a single timestamped stream for LLM analysis.
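The aggregation step described above can be illustrated with a minimal Python sketch. It assumes each component already emits its log lines in time order; the `LogLine` type, the `merge_streams` helper, and the sample messages are hypothetical and are not taken from Auto-Diagnose itself.

```python
import heapq
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple


class LogLine(NamedTuple):
    timestamp: datetime
    component: str  # hypothetical label for the emitting process
    message: str


def merge_streams(*streams: Iterable[LogLine]) -> Iterator[LogLine]:
    """Merge already time-sorted per-component streams into one ordered stream."""
    return heapq.merge(*streams, key=lambda line: line.timestamp)


# Invented sample data: the driver only sees a timeout, while the
# backend log holds the earlier, more specific error.
driver = [
    LogLine(datetime(2025, 1, 1, 12, 0, 0), "test_driver", "starting test"),
    LogLine(datetime(2025, 1, 1, 12, 0, 30), "test_driver", "timeout waiting for response"),
]
backend = [
    LogLine(datetime(2025, 1, 1, 12, 0, 5), "backend", "failed to load config"),
]

merged = list(merge_streams(driver, backend))
for line in merged:
    print(f"{line.timestamp.isoformat()} [{line.component}] {line.message}")
```

In the merged stream the backend error appears between the driver's start and its timeout, which is the kind of ordering that makes a buried root cause visible next to its generic symptom.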

Key Insights

  • Auto-Diagnose achieved a 90.14% root-cause accuracy rate using Gemini 2.5 Flash without any fine-tuning, relying instead on sophisticated prompt engineering and a temperature of 0.1 for near-deterministic results.
  • The system runs with a p50 latency of 56 seconds, so the 22,962 distinct developers it has served receive findings before they lose context on their code changes.
  • A survey of 6,059 developers at Google (EngSat) identified integration test failures as one of the top five productivity complaints across the organization.
  • The system uses hard negative constraints in its prompts, forcing the model to report ‘more information is needed’ rather than hallucinating when logs are incomplete.
  • Out of 517 feedback reports, 84.3% were ‘Please fix’ requests from reviewers, ranking Auto-Diagnose #14 in helpfulness out of 370 internal tools at Google.
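The hard negative constraint and low temperature mentioned in the list above might look roughly like the following sketch. The prompt wording, the `build_request` and `is_refusal` helpers, and the request dictionary shape are all assumptions made for illustration; the article does not publish the actual prompt, and no real model API is called here.

```python
REFUSAL_PHRASE = "more information is needed"

# Illustrative prompt only; not the prompt used by Auto-Diagnose.
PROMPT_TEMPLATE = """You are diagnosing an integration test failure.
Use only the log lines below as evidence.
If the logs do not contain enough evidence to identify a root cause,
reply exactly with: {refusal}
Do not guess.

Logs:
{logs}
"""


def build_request(logs: str) -> dict:
    """Build a hypothetical request payload with a low sampling temperature."""
    return {
        "prompt": PROMPT_TEMPLATE.format(refusal=REFUSAL_PHRASE, logs=logs),
        "temperature": 0.1,  # value reported in the article
    }


def is_refusal(response_text: str) -> bool:
    """Return True when the model declined to name a root cause."""
    return REFUSAL_PHRASE in response_text.lower()


request = build_request("12:00:05 [backend] failed to load config")
```

A caller could check `is_refusal` on the model's reply and, instead of posting a finding, flag the run for a look at whether the logs themselves are incomplete.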

Practical Applications

  • Use Case: Automated code review comments in Google’s Critique system provide markdown findings with clickable log links, allowing authors to act on root causes immediately.
  • Pitfall: Relying on test driver logs alone often masks the true error; Auto-Diagnose mitigates this by joining SUT component logs at level INFO and above into a unified stream.
  • Pitfall: Incomplete infrastructure logging can cause diagnostic failure; Auto-Diagnose’s refusal to guess has helped surface real infrastructure bugs in logging pipelines.
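As a rough illustration of keeping only records at level INFO and above before they are joined, here is a short Python sketch. It uses the standard `logging` module's numeric levels as a stand-in; Google's internal log levels and record format are not described in the article, so the tuple layout and the `at_or_above` helper are hypothetical.

```python
import logging

# Invented sample records as (level, component, message) tuples.
records = [
    (logging.DEBUG, "backend", "cache warm-up detail"),
    (logging.INFO, "backend", "serving on port 8080"),
    (logging.ERROR, "backend", "failed to load config"),
    (logging.WARNING, "test_driver", "retrying request"),
]


def at_or_above(records, threshold=logging.INFO):
    """Keep records whose severity is at least the threshold."""
    return [record for record in records if record[0] >= threshold]


kept = at_or_above(records)
for level, component, message in kept:
    print(f"{logging.getLevelName(level)} [{component}] {message}")
```

Dropping lower-severity lines keeps the unified stream smaller, at the cost of losing detail that only appears in debug output.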
