Predicting Buggy Files with commit-prophet and Git History
These articles are AI-generated summaries. Please check the original sources for full details.
commit-prophet: I Built a Tool That Predicts Buggy Files Using Git History
Lakshmi Sravya Vedantham developed commit-prophet to mine longitudinal data from git logs. The tool scans commit messages for keywords like ‘fix’ or ‘bug’ and combines that signal with churn and co-change data to calculate a 0–100 risk score for each file.
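The keyword scan can be illustrated with a small sketch. The function name `is_defect_commit` and the exact pattern are assumptions made for illustration; the summary only names ‘fix’ and ‘bug’ as example keywords and does not show how commit-prophet matches them.

```python
import re

# Hypothetical keyword pattern: the summary names only 'fix' and 'bug'.
# The leading \b keeps words such as "prefix" or "debug" from matching.
DEFECT_PATTERN = re.compile(r"\b(fix|bug)\w*", re.IGNORECASE)


def is_defect_commit(message: str) -> bool:
    """Return True when the commit message contains a defect keyword."""
    return DEFECT_PATTERN.search(message) is not None
```

Matching on a word boundary is one possible design choice; a plain substring test would be simpler but would also flag unrelated words that happen to contain "fix" or "bug".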
Why This Matters
Linters and review tools look at the current state of a diff or at style, and ignore how a file has behaved over time. The article cites a rule of thumb that roughly 10% of files account for 90% of bugs, which would make a file's history of instability a more accurate predictor of future failures than current code quality alone.
Key Insights
- Defect coupling is the strongest risk signal, weighted at 50% in the algorithm because frequent appearances in ‘fix’ commits indicate a file that attracts bugs.
- The 90/10 rule of code history suggests that a small fraction of files is responsible for the vast majority of defects, yet this data is often ignored.
- Co-change analysis can reveal hidden dependencies where two files (e.g., auth and payments) always change together despite having no direct code imports.
- High churn alone is only a yellow flag (40% weight), as frequent changes without bug-fix keywords may simply indicate evolving requirements rather than instability.
- The tool is built in Python using zero external git libraries, relying on subprocess calls to git log for high-performance data extraction.
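As a rough sketch of the subprocess approach mentioned in the last point, the following reads `git log` output and splits it into commits. The function names, the 0x01/0x02 separator bytes, and the returned dictionary shape are assumptions for illustration and are not taken from commit-prophet's source.

```python
import subprocess


def read_git_log(repo_path: str, since_days: int = 90) -> str:
    """Run git log and return its raw output.

    Each commit starts with byte 0x01; byte 0x02 separates hash and subject.
    """
    result = subprocess.run(
        [
            "git", "-C", repo_path, "log",
            f"--since={since_days} days ago",
            "--name-only",
            "--pretty=format:%x01%H%x02%s",
        ],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_git_log(raw: str) -> list[dict]:
    """Split raw git log output into dicts with hash, subject, and files."""
    commits = []
    for chunk in raw.split("\x01"):
        if not chunk.strip():
            continue  # skip the empty piece before the first commit
        header, _, rest = chunk.partition("\n")
        commit_hash, _, subject = header.partition("\x02")
        # Every non-blank line after the header is a changed file path.
        files = [line for line in rest.splitlines() if line.strip()]
        commits.append({"hash": commit_hash, "subject": subject, "files": files})
    return commits
```

Keeping the parser separate from the subprocess call makes it possible to test the parsing logic on a fixed string without a real repository.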
Working Examples
The weighted scoring algorithm used by commit-prophet to determine file risk.
churn_score = min(file_churn / max_churn, 1.0) * 40
defect_score = min(defect_commits / max_defects, 1.0) * 50
coupling_score = min(risky_cochanged_files / 10, 1.0) * 10
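The three components above can be combined into a single function. This is a minimal sketch: summing the components is an assumption based on the weights adding up to 100, and the `risk_score` name and the zero-maximum guard are illustrative additions rather than commit-prophet's actual code.

```python
def risk_score(file_churn, max_churn, defect_commits, max_defects,
               risky_cochanged_files):
    """Combine the churn, defect, and coupling components into a 0-100 score."""

    def ratio(value, maximum):
        # Guard against division by zero for repos with no churn or defects.
        return min(value / maximum, 1.0) if maximum > 0 else 0.0

    churn_score = ratio(file_churn, max_churn) * 40
    defect_score = ratio(defect_commits, max_defects) * 50
    coupling_score = ratio(risky_cochanged_files, 10) * 10
    # Assumption: the final score is the plain sum of the three components.
    return churn_score + defect_score + coupling_score
```

For example, a file at half the maximum churn, half the maximum defect count, and five risky co-changed files would score 20 + 25 + 5 = 50 under this sketch.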
Core implementation steps for parsing git history and calculating risk metrics.
from commit_prophet import get_commits, calculate_churn, calculate_defect_coupling
commits = get_commits("/path/to/repo", since_days=90)
churn = calculate_churn(commits)
defects = calculate_defect_coupling(commits)
Practical Applications
- Use case: Running ‘commit-prophet scan --days 90’ to generate a Hotspot Risk Report identifying critical files like billing processors before deployment.
- Pitfall: Treating high-churn files as inherently buggy; commit-prophet distinguishes between evolution and instability by weighting defect coupling more heavily than churn.
- Use case: Using co-change analysis to discover that an auth module and payment module are secretly coupled through shared environmental state.
- Pitfall: Ignoring historical commit data in PR reviews; commit-prophet provides the longitudinal context that standard diff tools lack.
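The co-change use case can be approximated by counting how often pairs of files appear in the same commit. This is a sketch under stated assumptions: the `cochange_pairs` name, the input shape (one list of file paths per commit), and the `min_count` threshold are hypothetical and not taken from commit-prophet.

```python
from collections import Counter
from itertools import combinations


def cochange_pairs(commits, min_count=2):
    """Count file pairs that change together in at least min_count commits.

    `commits` is an iterable of file-path lists, one list per commit.
    """
    pair_counts = Counter()
    for files in commits:
        # Sort so each pair has a stable order; set() drops duplicate paths.
        for pair in combinations(sorted(set(files)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}
```

Pairs that score highly here but share no imports are the candidates for the hidden coupling described above, such as an auth module and a payment module that rely on the same environmental state.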