Predicting Buggy Files with commit-prophet and Git History

These articles are AI-generated summaries. Please check the original sources for full details.

commit-prophet: I Built a Tool That Predicts Buggy Files Using Git History

Lakshmi Sravya Vedantham developed commit-prophet to mine longitudinal data from git logs. The tool scans commit messages for keywords such as ‘fix’ or ‘bug’, combines that signal with churn and co-change data, and assigns each file a 0–100 risk score.
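
The keyword scan can be illustrated with a minimal sketch. This is not commit-prophet's actual code: the function name is_defect_commit and the matching rule (keyword at the start of a word, case-insensitive) are assumptions, and the keyword list holds only the two words the article names.

```python
import re

# Assumed keyword list: the article names only "fix" and "bug".
DEFECT_KEYWORDS = ("fix", "bug")

# Match a keyword at the start of a word, so "fixes" and "bugfix" count
# but "prefix" and "debug" do not.
_DEFECT_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in DEFECT_KEYWORDS) + r")",
    re.IGNORECASE,
)


def is_defect_commit(message: str) -> bool:
    """Return True if the commit message looks like a bug fix (hypothetical helper)."""
    return _DEFECT_RE.search(message) is not None


if __name__ == "__main__":
    for msg in ("Fix null pointer in billing", "Add prefix option to debug logger"):
        print(msg, "->", is_defect_commit(msg))
```

A plain substring test would also flag words like "prefix", which is why this sketch anchors the match at a word boundary. The real tool may use a different rule.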

Why This Matters

Linters and review tools typically look at the current state of a diff or at code style, not at a file's history. The article cites a rule of thumb that roughly 10% of files account for 90% of bugs; if that holds, a file's past instability may predict future failures better than its current code quality alone.

Key Insights

  • Defect coupling is the strongest risk signal, weighted at 50% in the algorithm because frequent appearances in ‘fix’ commits indicate a file that attracts bugs.
  • The 90/10 rule of code history suggests that a small fraction of files are responsible for the vast majority of defects, yet this data is often ignored.
  • Co-change analysis can reveal hidden dependencies where two files (e.g., auth and payments) always change together despite having no direct code imports.
  • High churn alone is only a yellow flag (40% weight), as frequent changes without bug-fix keywords may simply indicate evolving requirements rather than instability.
  • The tool is written in Python and uses no external git libraries; it extracts data through subprocess calls to git log.
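
To show what extracting history through subprocess calls to git log can look like, here is a minimal sketch. The function names, the record layout, and the choice of separator bytes are assumptions for illustration, not commit-prophet's implementation. The git options used (-C, --since, --name-only, --pretty=format: with %H, %s and %xNN) are standard git log features.

```python
from __future__ import annotations

import subprocess

RECORD = "\x01"  # emitted before each commit so records can be split reliably
FIELD = "\x1f"   # separates the commit hash from the subject line


def parse_git_log(text: str) -> list[dict]:
    """Turn raw `git log --name-only` output into a list of commit records."""
    commits = []
    for chunk in text.split(RECORD):
        if not chunk.strip():
            continue  # skip the empty piece before the first marker
        header, _, body = chunk.partition("\n")
        sha, _, subject = header.partition(FIELD)
        files = [line for line in body.splitlines() if line.strip()]
        commits.append({"sha": sha, "subject": subject, "files": files})
    return commits


def read_git_log(repo_path: str, since_days: int = 90) -> list[dict]:
    """Run git log in repo_path and parse the result (hypothetical helper)."""
    result = subprocess.run(
        [
            "git", "-C", repo_path, "log",
            f"--since={since_days} days ago",
            "--name-only",
            "--pretty=format:%x01%H%x1f%s",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_git_log(result.stdout)
```

Keeping the parser separate from the subprocess call means it can be exercised on a sample string without a repository on disk.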

Working Examples

The weighted scoring algorithm used by commit-prophet to determine file risk.

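# Each ratio is capped at 1.0, then scaled by its weight (40 + 50 + 10 = 100).
churn_score = min(file_churn / max_churn, 1.0) * 40  # how often the file changes
defect_score = min(defect_commits / max_defects, 1.0) * 50  # appearances in fix/bug commits
coupling_score = min(risky_cochanged_files / 10, 1.0) * 10  # risky files that change with it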
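
The three components can be wrapped in one function, sketched below. Two things here are assumptions rather than statements from the article: that the final 0–100 score is the plain sum of the components (the weights do add up to 100), and the guard that treats a zero maximum as a zero component to avoid division by zero. The names risk_score and capped_ratio are hypothetical.

```python
def capped_ratio(value: float, maximum: float) -> float:
    """Return value / maximum capped at 1.0; 0.0 when maximum is not positive."""
    if maximum <= 0:
        return 0.0  # assumed guard: no data means no contribution
    return min(value / maximum, 1.0)


def risk_score(
    file_churn: float,
    max_churn: float,
    defect_commits: float,
    max_defects: float,
    risky_cochanged_files: float,
) -> float:
    """Combine the three weighted components into a 0-100 score (assumed sum)."""
    churn_score = capped_ratio(file_churn, max_churn) * 40
    defect_score = capped_ratio(defect_commits, max_defects) * 50
    coupling_score = capped_ratio(risky_cochanged_files, 10) * 10
    return churn_score + defect_score + coupling_score


if __name__ == "__main__":
    # A file with half the maximum churn and no fix commits scores 20.0.
    print(risk_score(5, 10, 0, 5, 0))
```

With these weights, a file that changes constantly but never appears in a fix commit tops out at 40, while a file at the defect maximum reaches 50 on that component alone.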

Core implementation steps for parsing git history and calculating risk metrics.

from commit_prophet import get_commits, calculate_churn, calculate_defect_coupling
commits = get_commits("/path/to/repo", since_days=90)
churn = calculate_churn(commits)
defects = calculate_defect_coupling(commits)
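
The article does not show what calculate_churn or calculate_defect_coupling do internally. As a rough illustration of the kind of per-file counting such steps might involve, the sketch below tallies how many commits touch each file and how many of those commits look like fixes. The function names, the commit record shape (a dict with "subject" and "files"), and the caller-supplied is_defect predicate are all hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable


def count_churn(commits: Iterable[dict]) -> Counter:
    """Count, per file, the number of commits that touched it."""
    churn = Counter()
    for commit in commits:
        churn.update(set(commit["files"]))  # set(): count a file once per commit
    return churn


def count_defect_commits(
    commits: Iterable[dict], is_defect: Callable[[str], bool]
) -> Counter:
    """Count, per file, the commits whose subject is_defect flags as a fix."""
    defects = Counter()
    for commit in commits:
        if is_defect(commit["subject"]):
            defects.update(set(commit["files"]))
    return defects


if __name__ == "__main__":
    history = [
        {"subject": "fix: rounding error", "files": ["billing.py"]},
        {"subject": "add export", "files": ["billing.py", "export.py"]},
    ]
    looks_like_fix = lambda s: s.lower().startswith("fix")  # crude stand-in predicate
    print(count_churn(history))
    print(count_defect_commits(history, looks_like_fix))
```

The maximum of each counter would supply the max_churn and max_defects values used in the scoring formulas.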

Practical Applications

  • Use case: Running ‘commit-prophet scan --days 90’ to generate a Hotspot Risk Report identifying critical files like billing processors before deployment.
  • Pitfall: Treating high churn files as inherently buggy; commit-prophet distinguishes between evolution and instability by weighing defect coupling more heavily than churn.
  • Use case: Using co-change analysis to discover that an auth module and payment module are secretly coupled through shared environmental state.
  • Pitfall: Ignoring historical commit data in PR reviews; commit-prophet provides the longitudinal context that standard diff tools lack.
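
Co-change analysis, mentioned above, can be sketched as counting how often each pair of files appears in the same commit. This is a generic illustration, not commit-prophet's implementation; the function names and the min_count threshold of 3 are assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable


def cochange_counts(commit_files: Iterable[Iterable[str]]) -> Counter:
    """Count how many commits each unordered pair of files shares."""
    pairs = Counter()
    for files in commit_files:
        # sorted(set(...)) gives each pair one canonical ordering per commit
        for a, b in combinations(sorted(set(files)), 2):
            pairs[(a, b)] += 1
    return pairs


def frequent_pairs(commit_files: Iterable[Iterable[str]], min_count: int = 3) -> list:
    """Return (pair, count) tuples seen together at least min_count times."""
    counts = cochange_counts(commit_files)
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    history = [
        ["auth.py", "payments.py"],
        ["auth.py", "payments.py", "README.md"],
        ["auth.py", "payments.py"],
        ["README.md"],
    ]
    print(frequent_pairs(history))
```

A pair such as auth.py and payments.py that keeps showing up together without any import between the two files is the kind of hidden dependency the article describes.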
