Why AI Detection Tools Fail: Vibe-Check Scores 0/100 on AI-Generated Codebase

These articles are AI-generated summaries. Please check the original sources for full details.

I Built a Vibe-Check Tool — Then Ran It on an AI-Built Codebase and It Scored 0/100

Lakshmi Sravya Vedantham developed vibe-check to identify AI-authored code by detecting patterns like over-commenting and placeholder naming. When tested on a 30,000-line full-stack application, the tool returned a 0/100 ‘Mostly Human’ score despite the codebase being approximately 50% AI-generated.

Why This Matters

Detection tools often rely on ‘style markers of careless AI usage,’ such as generic variable names or hallucinated imports. Modern models avoid these markers easily when given deep domain context. As AI output moves from sloppy boilerplate to expert-level code with uniform docstrings and comprehensive error handling, this ‘distribution shift’ makes AI code look better than average human code. Surface-level heuristics stop working, which leaves a large accuracy gap in security and auditing tools.

Key Insights

  • Fact: The vibe-check tool returned a 0/100 score on a repository containing 30,000 lines across React and FastAPI (2026).
  • Concept: ‘Consistency Scoring’ measures variance in style; absolute consistency at scale is a stronger AI signal than specific keywords like ‘helper’ or ‘manager’.
  • Tool: commit-prophet, a CLI tool built entirely by an AI agent in one session, scored only 2/100 on standard detection metrics.
  • Concept: The ‘Vocabulary Specificity Index’ reveals that AI given domain context produces more precise terminology than junior developers, defeating generic naming detectors.
  • Fact: Approximately 70% of the tested codebase (TypeScript/JavaScript) was invisible to the detector because it was limited to Python analysis.
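
The consistency-scoring idea can be illustrated with a small sketch. This is not vibe-check's implementation; it assumes one possible style metric (docstring length per function) and measures how much it varies using the coefficient of variation. The function names and the sample source are hypothetical.

```python
from __future__ import annotations

import ast
import statistics


def docstring_lengths(source: str) -> list[int]:
    """Return the docstring length (in characters) of every documented function."""
    tree = ast.parse(source)
    lengths = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                lengths.append(len(doc))
    return lengths


def coefficient_of_variation(values: list[int]) -> float | None:
    """Population standard deviation divided by the mean; None if undefined."""
    if len(values) < 2:
        return None
    mean = statistics.fmean(values)
    if mean == 0:
        return None
    return statistics.pstdev(values) / mean


# Hypothetical sample: two functions with equally long docstrings.
SAMPLE = '''
def load():
    """Load the data."""
    pass

def save():
    """Save the data."""
    pass
'''

lengths = docstring_lengths(SAMPLE)
cv = coefficient_of_variation(lengths)
```

A value near zero across a large codebase means the docstrings are almost perfectly uniform, which is the kind of consistency the article treats as a signal. Any cutoff for "suspicious" would have to be calibrated on real repositories.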

Working Examples

Example of domain-specific variable names that defeat generic AI detectors.

confidence_weighted_score = weighted_avg(model_outputs, confidence_weights)
normalized_feature_vector = standardize(raw_features, per_channel=True)
inter_class_variance = between_class / within_class
calibrated_threshold = baseline_mean + (2.5 * baseline_std)
rolling_accuracy = ema(correct_predictions, window=50)
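
To see why names like these slip past a lexical detector, here is a minimal sketch of a keyword-based check. It is an assumption about how such a detector might work, not the actual vibe-check logic; the token list is hypothetical, seeded with the markers the article mentions (‘helper’, ‘manager’, ‘process_data’).

```python
# Hypothetical list of generic tokens; a real detector would likely use a longer one.
GENERIC_TOKENS = {"helper", "manager", "process", "data", "temp", "util"}


def generic_name_hits(identifiers):
    """Return the identifiers that contain at least one generic token."""
    hits = []
    for name in identifiers:
        tokens = set(name.lower().split("_"))
        if tokens & GENERIC_TOKENS:
            hits.append(name)
    return hits


# The domain-specific names from the example above.
domain_names = [
    "confidence_weighted_score",
    "normalized_feature_vector",
    "inter_class_variance",
    "calibrated_threshold",
    "rolling_accuracy",
]

# Names of the kind a generic-naming detector is built to catch.
generic_names = ["process_data", "data_helper", "result_manager"]

domain_hits = generic_name_hits(domain_names)
generic_hits = generic_name_hits(generic_names)
```

Under this token list every generic name is flagged and none of the domain-specific names are, so the detector reports nothing for code written with precise terminology.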

A textbook AI signature: perfectly organized, multi-line import blocks added in a single session.

import { Switch, Route } from "wouter";
import { QueryClientProvider } from "@tanstack/react-query";
import Dashboard from "@/pages/dashboard";
import Analytics from "@/pages/analytics";
import Settings from "@/pages/settings";
// ... 20 more page imports

Practical Applications

  • Use case: Git history analysis. Identify AI generation by watching for ‘burst’ commits that add thousands of lines of documented code with zero fix-up cycles.
  • Pitfall: Lexical naming detectors. Relying on keywords like ‘process_data’ fails when the AI uses terms like ‘inter_class_variance’ drawn from technical literature.
  • Use case: Structural uniformity auditing. Measure the coefficient of variation for docstring length and test-to-source ratios to find suspiciously perfect coverage.
  • Pitfall: Single-language scanning. Ignoring the other languages in a polyglot stack makes AI-generated frontends or middleware completely invisible to the detector.
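
The Git history use case could be sketched roughly as follows. This is an illustration under stated assumptions, not a described feature of any tool: commit data is supplied as plain records (no Git calls), and the `Commit` type, the 1,000-line threshold, the two-commit look-ahead window, and the message prefixes treated as fix-ups are all hypothetical. It also ignores whether the added lines are documented.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    sha: str
    lines_added: int
    message: str


def is_fixup(commit):
    """Treat commits whose message starts with 'fix' or 'revert' as fix-ups (assumption)."""
    return commit.message.lower().startswith(("fix", "revert"))


def burst_commits(history, min_lines=1000, window=2):
    """Flag large commits not followed by a fix-up within `window` commits.

    `history` must be ordered oldest first.
    """
    flagged = []
    for i, commit in enumerate(history):
        if commit.lines_added < min_lines:
            continue
        followups = history[i + 1 : i + 1 + window]
        if not any(is_fixup(c) for c in followups):
            flagged.append(commit.sha)
    return flagged


# Hypothetical history, oldest first.
history = [
    Commit("a1", 4200, "Add analytics module"),
    Commit("b2", 40, "Update README"),
    Commit("c3", 1500, "Add settings page"),
    Commit("d4", 12, "Fix typo in settings form"),
]

flagged = burst_commits(history)
```

Here only `a1` is flagged: it adds thousands of lines and neither of the next two commits is a fix-up, whereas `c3` is followed directly by a fix. A flagged commit is a prompt for review, not proof of AI authorship.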
