Failure Modes of Real-World Data

In Q3 2023, a mid-size fintech company deployed a transaction risk model. Validation accuracy: 99.5%. The team celebrated. The model was promoted to production, where it scored every incoming wire transfer in real time.

Within eight weeks, fraud losses had doubled. The model was flagging almost nothing.

The post-mortem took three days. The root cause took three minutes to explain: the training data included a column called review_outcome, a field populated by the fraud investigation team after manual review. This column encoded the answer the model was supposed to predict. During training, the model learned to read the answer key instead of detecting fraud patterns. Validation accuracy was 99.5% because the answer key was available in the validation set too. In production, the field was not yet populated at the moment a transfer was scored, so the model had nothing to read.

The total cost: $2.1M in undetected fraud, plus the engineering time to rebuild the pipeline from scratch. The failure had a name — target leakage — and a three-line fix: drop the column before training. But nobody on the team knew to look for it.

This is not a story about incompetent engineers. The team had PhDs. They ran cross-validation. They checked feature importance. Target leakage does not announce itself. It hides in plain sight behind suspiciously good metrics.
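The drop-the-column fix, together with a cheap screen for this kind of leak, can be sketched as follows. This is a minimal illustration, not the team's actual pipeline: the toy DataFrame, the is_fraud label name, the helper names single_feature_auc and screen_for_leakage, and the 0.95 threshold are all assumptions made for the example; only review_outcome comes from the story above. The idea is that any single feature that separates the classes almost perfectly on its own deserves a manual look before training.

```python
import numpy as np
import pandas as pd


def single_feature_auc(feature: pd.Series, target: pd.Series) -> float:
    """AUC obtained by using one numeric feature alone as the score."""
    mask = feature.notna()
    x = feature[mask].astype(float)
    y = target[mask].astype(bool)
    n_pos, n_neg = int(y.sum()), int((~y).sum())
    if n_pos == 0 or n_neg == 0:
        return float("nan")
    # Rank-based (Mann-Whitney U) form of AUC; ties get average ranks.
    ranks = x.rank(method="average")
    u_stat = ranks[y].sum() - n_pos * (n_pos + 1) / 2
    auc = u_stat / (n_pos * n_neg)
    # A feature that separates the classes in either direction is suspicious.
    return float(max(auc, 1 - auc))


def screen_for_leakage(df: pd.DataFrame, target_col: str, threshold: float = 0.95) -> dict:
    """Return {column: auc} for numeric columns that predict the target alone."""
    suspects = {}
    for col in df.columns.drop(target_col):
        if pd.api.types.is_numeric_dtype(df[col]):
            auc = single_feature_auc(df[col], df[target_col])
            if auc >= threshold:  # NaN compares False and is skipped
                suspects[col] = auc
    return suspects


# Hypothetical toy data: 1 in 20 transfers is fraudulent.
rng = np.random.default_rng(0)
n = 1000
is_fraud = np.arange(n) % 20 == 0
df = pd.DataFrame({
    "amount": rng.lognormal(mean=5.0, sigma=1.0, size=n),
    "review_outcome": is_fraud.astype(int),  # filled in after manual review
    "is_fraud": is_fraud,
})

suspects = screen_for_leakage(df, target_col="is_fraud")
print(suspects)

# The fix itself: drop the label and every confirmed leaky column before training.
X = df.drop(columns=["is_fraud", *suspects])
y = df["is_fraud"]
```

A column flagged by this screen is a suspect, not a verdict. A legitimately strong predictor can also score high, so the deciding question is when the column gets populated relative to prediction time.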

The Four Failure Modes

Every dataset you work with in production has at least one of these problems. Most have two or three. The question is never “is my data clean?” — the question is “which failure mode is active, and how badly is it distorting my model?”

1. Missing Data Mechanisms — Data that is absent for reasons that correlate with the target variable. A health survey where sick patients skip questions. An e-commerce dataset where churned users have no recent activity. The pattern of missingness is itself a signal, and ignoring it with df.dropna() introduces bias that no amount of hyperparameter tuning will fix.

2. Target Leakage — Features that carry information about the target that would not be available at prediction time. The silent model killer. It inflates validation metrics to near-perfect scores and produces models that are completely useless in production. It comes in two forms: temporal leakage (using future data) and data leakage (using target-derived columns).

3. Outliers — Not all extreme values are created equal. True rare events (a legitimate $500K transaction), system errors (a sensor reporting −9999°C), and ETL bugs (a date parsed as a dollar amount) require entirely different handling strategies. Blindly clipping at the 99th percentile destroys the rare events your model most needs to learn.

4. Distribution Shift — The world moves. Your model does not. A recommendation model trained on pre-pandemic shopping data. A credit model calibrated during an economic boom. When the distribution of inputs or the relationship between inputs and outputs changes, your model degrades silently — accuracy metrics computed on stale test sets remain high while real-world performance collapses.
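For the fourth failure mode, one common way to quantify how far a feature's serving distribution has moved from its training distribution is the population stability index (PSI). The sketch below is one possible implementation under stated assumptions: the function name, the ten quantile bins, the eps smoothing constant, and the synthetic normal samples are all choices made for this example rather than anything prescribed by the text.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI of one numeric feature between a reference and a new sample."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior bin edges are quantiles of the reference (training) sample,
    # so the two outer bins are open-ended and catch out-of-range values.
    inner_edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1)[1:-1])
    exp_counts = np.bincount(
        np.searchsorted(inner_edges, expected, side="right"), minlength=n_bins
    )
    act_counts = np.bincount(
        np.searchsorted(inner_edges, actual, side="right"), minlength=n_bins
    )
    # Clip to eps so empty bins do not produce log(0) or division by zero.
    exp_frac = np.clip(exp_counts / expected.size, eps, None)
    act_frac = np.clip(act_counts / actual.size, eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))


# Synthetic example: one serving sample from the same distribution,
# one whose mean has moved by a full standard deviation.
rng = np.random.default_rng(42)
train = rng.normal(loc=0.0, scale=1.0, size=5000)
serve_same = rng.normal(loc=0.0, scale=1.0, size=5000)
serve_shifted = rng.normal(loc=1.0, scale=1.0, size=5000)

psi_same = population_stability_index(train, serve_same)
psi_shifted = population_stability_index(train, serve_shifted)
print(f"same distribution: {psi_same:.4f}")
print(f"shifted mean:      {psi_shifted:.4f}")
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift. Those cutoffs are conventions, not guarantees, and PSI on inputs only detects changes in feature distributions; a change in the relationship between inputs and outputs needs fresh labels to detect.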

Diagnostic Table

When you suspect something is wrong with your data, start here:

Symptom	Likely Failure Mode	Section
Model accuracy is suspiciously high (>99%)	Target leakage	§3.2
One feature dominates importance scores	Target leakage	§3.2
Validation accuracy is great, production accuracy is poor	Distribution shift or leakage	§3.2, §3.4
`dropna()` removes >20% of rows	Missing data mechanism (likely MAR/MNAR)	§3.1
Imputed features have lower variance than raw features	Improper imputation (variance distortion)	§3.1
Residuals have heavy tails or extreme values	Outlier contamination	§3.3
Model performance degrades steadily over weeks	Concept drift	§3.4
Model performance drops suddenly after a deploy	Covariate shift (upstream data change)	§3.4
Feature distributions differ between train and serve	Covariate shift	§3.4
Certain subgroups have much worse accuracy	Missing data mechanism or sampling bias	§3.1
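The missing-data rows in the table can be checked with a few lines of pandas. The sketch below assumes a hypothetical churn dataset (the column names recent_spend, tenure_months, and churned, the helper name missingness_report, and the missingness probabilities are all invented for illustration). It measures how many rows dropna() would remove, whether the target rate differs between rows with and without a value, and how much mean imputation shrinks variance.

```python
import numpy as np
import pandas as pd


def missingness_report(df: pd.DataFrame, target_col: str) -> pd.DataFrame:
    """Missing rate and target mean for missing vs. present rows, per column."""
    fields = ["column", "missing_rate",
              "target_mean_if_missing", "target_mean_if_present"]
    rows = []
    for col in df.columns.drop(target_col):
        is_missing = df[col].isna()
        if not is_missing.any():
            continue  # fully observed columns are left out of the report
        rows.append({
            "column": col,
            "missing_rate": is_missing.mean(),
            "target_mean_if_missing": df.loc[is_missing, target_col].mean(),
            "target_mean_if_present": df.loc[~is_missing, target_col].mean(),
        })
    return pd.DataFrame(rows, columns=fields).set_index("column")


# Hypothetical data: churned users are far more likely to have no recent spend.
rng = np.random.default_rng(1)
n = 2000
churned = rng.random(n) < 0.25
missing = rng.random(n) < np.where(churned, 0.8, 0.05)
df = pd.DataFrame({
    "recent_spend": np.where(missing, np.nan, rng.normal(50, 10, n)),
    "tenure_months": rng.integers(1, 60, n),
    "churned": churned.astype(int),
})

report = missingness_report(df, target_col="churned")
print(report)

# Symptom 1: how much of the data would dropna() throw away, and what
# happens to the churn rate in the rows that survive?
dropped_frac = 1 - len(df.dropna()) / len(df)
churn_rate_all = df["churned"].mean()
churn_rate_after_dropna = df.dropna()["churned"].mean()

# Symptom 2: mean imputation piles values onto the mean and shrinks variance.
raw_var = df["recent_spend"].var()  # NaNs are skipped
imputed_var = df["recent_spend"].fillna(df["recent_spend"].mean()).var()

# One mitigation among several: keep the missingness pattern as a feature.
df["recent_spend_missing"] = df["recent_spend"].isna().astype(int)
```

When the target mean differs sharply between the two groups, as it does in this synthetic data, the missingness is informative and dropping those rows changes the population the model sees.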

Why “Clean Data” Is a Myth

You will hear people say “garbage in, garbage out” as if data cleaning were a one-time preprocessing step, something you do before the real work begins. This framing is wrong in a way that causes real damage.

Data quality is not a phase. It is a continuous adversarial process. Your data sources change schema without notice. Your upstream ETL pipelines silently drop rows. Your users change behavior. Your vendors redefine column semantics. The data you trained on last month is not the data you are scoring today.

The four failure modes in this chapter are not bugs to be fixed. They are ongoing conditions to be monitored. A robust ML pipeline does not assume clean data — it instruments every stage to detect when data deviates from expectations and responds automatically.
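As a rough sketch of what instrumenting a stage might look like, the check below validates an incoming batch against declared expectations and returns a list of violations. Everything here is hypothetical: the ColumnExpectation class, the validate_batch function, the column names, and the limits are assumptions for illustration, and a production system would more likely rely on a dedicated validation library and route violations to alerting.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

import pandas as pd


@dataclass
class ColumnExpectation:
    """What a healthy batch is assumed to look like for one column."""
    max_null_rate: float
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def validate_batch(df: pd.DataFrame,
                   expectations: Dict[str, ColumnExpectation]) -> List[str]:
    """Return human-readable violations; an empty list means the batch passed."""
    problems = []
    for col, exp in expectations.items():
        if col not in df.columns:
            problems.append(f"{col}: expected column is missing")
            continue
        null_rate = df[col].isna().mean()
        if null_rate > exp.max_null_rate:
            problems.append(
                f"{col}: null rate {null_rate:.2%} exceeds {exp.max_null_rate:.2%}"
            )
        values = df[col].dropna()
        if exp.min_value is not None and (values < exp.min_value).any():
            problems.append(f"{col}: values below {exp.min_value}")
        if exp.max_value is not None and (values > exp.max_value).any():
            problems.append(f"{col}: values above {exp.max_value}")
    # Schema drift: columns that appeared without being declared.
    for col in sorted(set(df.columns) - set(expectations)):
        problems.append(f"{col}: unexpected column")
    return problems


expectations = {
    "amount": ColumnExpectation(max_null_rate=0.01, min_value=0.0),
    "temperature_c": ColumnExpectation(max_null_rate=0.05,
                                       min_value=-90.0, max_value=60.0),
}

good_batch = pd.DataFrame({
    "amount": [10.0, 250.0, 25.0, 99.0],
    "temperature_c": [21.5, 18.0, 19.0, 20.0],
})
bad_batch = pd.DataFrame({
    "amount": [10.0, None, 25.0, None],           # upstream started dropping values
    "temperature_c": [21.5, -9999.0, 19.0, 20.0],  # sentinel value from a sensor
})

good_problems = validate_batch(good_batch, expectations)
bad_problems = validate_batch(bad_batch, expectations)
for problem in bad_problems:
    print(problem)
```

Returning violations rather than raising keeps the policy decision (block the batch, fall back to a previous model, or page someone) with the caller, which is one reasonable design among several.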

In the sections that follow, we address each failure mode with detection techniques, code you can run against your own datasets, and decision frameworks for choosing the right mitigation strategy. We start with missing data and target leakage — the two failure modes that most frequently survive the entire development cycle undetected.

Data Failure Taxonomy

The taxonomy above shows how each failure mode connects to the stage of the ML pipeline it most affects. Missing data and outliers corrupt your features. Target leakage contaminates your features with label information and invalidates your evaluation. Distribution shift corrupts your deployment. All four are your responsibility.