Google AI Releases Auto-Diagnose: LLM-Based System for Automated Integration Test Debugging

Google AI Releases Auto-Diagnose: A Large Language Model (LLM)-Based System to Diagnose Integration Test Failures at Scale

Google researchers have introduced Auto-Diagnose, an LLM-powered system designed to automate the diagnosis of complex integration test failures. In a manual evaluation of 71 real-world failures across 39 teams, the tool correctly identified the root cause in 90.14% of cases (64 of 71).

Why This Matters

Integration tests impose a significant debugging tax because failures often surface as generic symptoms such as timeouts, while the actual error is buried in the logs of separate components. At Google, a survey of 116 developers found that 38.4% of integration test failures take more than an hour to diagnose and 8.9% take more than a day, whereas unit test failures rarely take more than an hour. Auto-Diagnose addresses this by aggregating logs across data centers and processes into a single timestamped stream for LLM analysis.
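The aggregation step described above can be sketched in a few lines. This is a minimal illustration, not Google's implementation: the `LogLine` type, the component names, and the sample messages are all assumptions made up for the example, and each input stream is assumed to be already sorted by time.

```python
import heapq
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Iterator


@dataclass(frozen=True)
class LogLine:
    timestamp: datetime
    component: str  # hypothetical label, e.g. "test_driver" or "backend"
    message: str


def merge_streams(*streams: Iterable[LogLine]) -> Iterator[LogLine]:
    """Merge per-component logs (each already time-ordered) into one stream."""
    return heapq.merge(*streams, key=lambda line: line.timestamp)


def render(stream: Iterable[LogLine]) -> str:
    """Format the merged stream as one timestamped text block for a prompt."""
    return "\n".join(
        f"{line.timestamp.isoformat()} [{line.component}] {line.message}"
        for line in stream
    )


# Made-up sample data: the driver only sees a timeout, the backend has the error.
driver = [
    LogLine(datetime(2026, 1, 1, 12, 0, 0), "test_driver", "sending request"),
    LogLine(datetime(2026, 1, 1, 12, 0, 30), "test_driver", "timeout waiting for response"),
]
backend = [
    LogLine(datetime(2026, 1, 1, 12, 0, 1), "backend", "config load failed"),
]

merged = list(merge_streams(driver, backend))
print(render(merged))
```

In the merged view the backend error sits between the request and the timeout, which is the ordering a reader (or a model) needs in order to connect the symptom to its cause.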

Key Insights

Auto-Diagnose achieved a 90.14% root-cause accuracy rate using Gemini 2.5 Flash without any fine-tuning, relying instead on sophisticated prompt engineering and a temperature of 0.1 for near-deterministic results.
The system operates with a p50 latency of 56 seconds, so findings arrive before developers lose context on their code changes; it has served 22,962 distinct developers.
A survey of 6,059 developers at Google (EngSat) identified integration test failures as one of the top five productivity complaints across the organization.
The system uses hard negative constraints in its prompts, forcing the model to report ‘more information is needed’ rather than hallucinating when logs are incomplete.
Out of 517 feedback reports, 84.3% were ‘Please fix’ requests from reviewers, and Auto-Diagnose ranks #14 in helpfulness out of 370 internal tools at Google.
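The prompt-side ideas in the list above (a low temperature and a hard negative constraint) might look roughly like the following. The prompt wording, the `GENERATION_CONFIG` dictionary, and the `Diagnosis` type are hypothetical; the source does not publish the actual prompt, and no model API is called here.

```python
from dataclasses import dataclass

# Phrase the model is told to emit when the logs are insufficient.
NEEDS_MORE_INFO = "more information is needed"

# Hypothetical prompt wording, written only to illustrate a hard negative constraint.
PROMPT_TEMPLATE = (
    "You are diagnosing an integration test failure.\n"
    "Use only the log lines below as evidence.\n"
    "Do NOT guess. If the logs do not contain the failing component's error, "
    "reply exactly: 'more information is needed' and name the missing logs.\n"
    "\n"
    "LOGS:\n"
    "{logs}\n"
)

# Sampling settings to pass to whichever model client is used (assumed shape).
GENERATION_CONFIG = {"temperature": 0.1}


@dataclass(frozen=True)
class Diagnosis:
    conclusive: bool
    text: str


def build_prompt(log_stream: str) -> str:
    """Insert the merged log stream into the constrained prompt."""
    return PROMPT_TEMPLATE.format(logs=log_stream)


def interpret(response_text: str) -> Diagnosis:
    """Treat the refusal phrase as an explicit 'inconclusive' result."""
    text = response_text.strip()
    return Diagnosis(conclusive=NEEDS_MORE_INFO not in text.lower(), text=text)
```

Handling the refusal phrase as a distinct outcome keeps an inconclusive run from being presented as a root cause, and it leaves a record of which logs were missing.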

Practical Applications

Use Case: Automated code review comments in Google’s Critique system provide markdown findings with clickable log links, allowing authors to act on root causes immediately.
Pitfall: Relying on test driver logs alone often masks the true error; Auto-Diagnose mitigates this by joining logs from the system under test (SUT) components, at level INFO and above, into a unified stream.
Pitfall: Incomplete infrastructure logging can cause diagnostic failure; Auto-Diagnose’s refusal to guess has helped surface real infrastructure bugs in logging pipelines.
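As a rough sketch of the points above, the snippet below keeps only entries at INFO and above and renders a markdown finding with clickable log links. The `Entry` type, the use of Python's `logging` level numbers, and the `logs.example` URLs are assumptions for illustration; the source does not describe Critique's finding format in detail.

```python
import logging
from typing import Iterable, List, NamedTuple


class Entry(NamedTuple):
    level: int      # numeric level, using Python's logging constants as a stand-in
    component: str
    message: str
    url: str        # hypothetical deep link to the log line


def at_least(entries: Iterable[Entry], minimum: int = logging.INFO) -> List[Entry]:
    """Drop entries below the minimum level (DEBUG is excluded by default)."""
    return [e for e in entries if e.level >= minimum]


def to_markdown(root_cause: str, evidence: Iterable[Entry]) -> str:
    """Render a short finding with one clickable link per supporting log line."""
    lines = [f"**Root cause:** {root_cause}", "", "Evidence:"]
    lines += [f"- [{e.component}: {e.message}]({e.url})" for e in evidence]
    return "\n".join(lines)


# Made-up entries for the example.
entries = [
    Entry(logging.DEBUG, "backend", "cache warm-up", "https://logs.example/backend#L0"),
    Entry(logging.ERROR, "backend", "config load failed", "https://logs.example/backend#L1"),
    Entry(logging.INFO, "test_driver", "timeout waiting for response", "https://logs.example/driver#L7"),
]

kept = at_least(entries)
finding = to_markdown("backend failed to load its config", kept)
print(finding)
```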

References:

https://www.marktechpost.com/2026/04/17/google-ai-releases-auto-diagnose-an-large-language-model-llm-based-system-to-diagnose-integration-test-failures-at-scale/
