Skip to main content
pragmatic data science with python

Missing Data and Target Leakage

12 min read Chapter 8 of 33
Summary

This section addresses two failure modes that routinely...

This section addresses two failure modes that routinely survive the full ML development cycle undetected. We classify missing data into three mechanisms — MCAR, MAR, and MNAR — with concrete examples from healthcare, e-commerce, and finance, then show why mean/median imputation distorts variance and manufactures false correlations. We implement proper imputation using IterativeImputer with cross-validated quality assessment. The second half covers target leakage — temporal and data leakage — demonstrating how a fraud detection model can achieve 99.9% accuracy while being completely useless, and providing a reusable checklist and time-aware splitting strategy to prevent it.

Missing Data and Target Leakage

3.1 — Missing Data Mechanisms

A column in your DataFrame has 23% null values. The instinct is to call .fillna(df["column"].median()) and move on. That instinct will silently bias your model, and the bias will be invisible in your validation metrics.

Before you choose an imputation strategy, you need to answer a harder question: why are these values missing? The answer determines everything.

The Three Mechanisms

MCAR — Missing Completely at Random. The probability of a value being missing is unrelated to any variable in the dataset, observed or unobserved. A lab sensor that drops readings due to random electrical noise. A survey response missing because the participant accidentally skipped a page. MCAR is the only mechanism where dropping rows (dropna()) does not introduce bias — you lose statistical power, but the remaining sample is still representative.

MCAR is also the rarest mechanism in real-world data. If you assume MCAR without testing, you are almost certainly wrong.

MAR — Missing at Random. The probability of missingness depends on other observed variables but not on the missing value itself. In an e-commerce dataset, product reviews are less likely to have delivery_rating filled in for low-cost items — users don’t bother rating delivery of a $3 purchase. The missingness correlates with price (observed), not with the delivery rating itself.

MAR is common and treatable. If you can predict the missingness pattern from other columns, you can build imputation models that account for the dependency.

MNAR — Missing Not at Random. The probability of missingness depends on the unobserved value itself. In a clinical trial, patients with severe side effects drop out — their health outcomes are missing precisely because those outcomes are bad. In a credit application, high-income earners are less likely to disclose income because their wealth comes from sources they prefer not to report. In both cases, the missing values are systematically different from the observed values.

MNAR is the most dangerous mechanism because no amount of modeling on the observed data can correct for it. You need domain knowledge, external data, or explicit modeling of the missingness mechanism.

Detecting the Mechanism

You cannot prove a mechanism from data alone — that requires domain expertise. But you can gather evidence. Start by visualizing the structure of missingness:

import polars as pl
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def missingness_heatmap(df: pl.DataFrame, sample_n: int = 500) -> None:
    """Visualize patterns in missing data across columns.

    Rows are sorted by total missingness to reveal structure.
    If missingness is MCAR, the heatmap looks like random noise.
    If missingness is MAR/MNAR, you'll see correlated bands.
    """
    # Convert to binary null indicator matrix
    null_matrix = df.select(
        pl.col(col).is_null().cast(pl.Int8).alias(col)
        for col in df.columns
    )

    # Sort by total nulls per row to reveal structure
    null_pd = null_matrix.to_pandas()
    null_pd["_total_nulls"] = null_pd.sum(axis=1)
    null_pd = null_pd.sort_values("_total_nulls", ascending=False)
    null_pd = null_pd.drop(columns=["_total_nulls"])

    # Sample for readability
    if len(null_pd) > sample_n:
        null_pd = null_pd.head(sample_n)

    fig, ax = plt.subplots(figsize=(12, 6))
    sns.heatmap(null_pd, cbar=False, yticklabels=False, cmap="YlOrRd", ax=ax)
    ax.set_title("Missing Data Pattern (sorted by row missingness)")
    plt.tight_layout()
    plt.savefig("missing_data_heatmap.png", dpi=150)
    plt.close()

Next, test whether missingness in one column correlates with observed values in other columns — evidence for MAR:

def test_mar_evidence(
    df: pl.DataFrame,
    target_col: str,
    predictor_cols: list[str],
) -> dict[str, float]:
    """Test if missingness in target_col correlates with predictor columns.

    Returns a dict of column -> point-biserial correlation with the
    missingness indicator. High absolute correlations suggest MAR.
    """
    from scipy.stats import pointbiserialr

    indicator = df.select(
        pl.col(target_col).is_null().cast(pl.Int8).alias("is_missing")
    ).to_series()

    correlations: dict[str, float] = {}
    for col in predictor_cols:
        if col == target_col:
            continue
        series = df[col]
        if not series.dtype.is_numeric():
            continue
        # Drop rows where predictor is also null
        mask = series.is_not_null()
        pred_vals = series.filter(mask).to_numpy()
        miss_vals = indicator.filter(mask).to_numpy()
        if len(np.unique(miss_vals)) < 2:
            continue
        corr, p_value = pointbiserialr(miss_vals, pred_vals)
        correlations[col] = round(corr, 4)

    return dict(sorted(correlations.items(), key=lambda x: abs(x[1]), reverse=True))

If test_mar_evidence returns strong correlations (|r| > 0.1) between the missingness indicator and other features, you’re dealing with MAR at minimum. If no correlations show up and dropna() changes the overall distribution of your target variable, suspect MNAR.

Missing Data Mechanisms

Why Mean/Median Imputation Is Dangerous

Mean imputation is the goto of data preprocessing — everyone uses it, and it introduces bugs that are invisible at the point of use.

Consider a feature income with 30% missing values. The observed mean is $65,000. You fill all nulls with $65,000. Three things happen:

  1. Variance collapses. The standard deviation of income drops because 30% of values are now identical. Any model that uses variance (regression coefficients, PCA, distance-based models) receives a distorted signal.

  2. Correlations are manufactured. If income and age are positively correlated in the observed data, and missingness in income correlates with age (MAR), then imputing all missing incomes with the mean weakens the age-income correlation — introducing bias toward the null hypothesis.

  3. The distribution shifts. A spike appears at the mean. Any model that learns distributional features (quantile-based splits in gradient boosting, for example) picks up an artifact.

# Demonstration: mean imputation distorts variance and correlation
import polars as pl
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Ground truth: income and age are correlated
age = rng.normal(40, 12, n)
income = 1_500 * age + rng.normal(0, 10_000, n)

# MAR: higher-income people are more likely to have missing income
missing_prob = 1 / (1 + np.exp(-(income - 70_000) / 10_000))
is_missing = rng.binomial(1, missing_prob).astype(bool)

income_with_missing = np.where(is_missing, np.nan, income)

df = pl.DataFrame({
    "age": age,
    "income": income,
    "income_with_missing": income_with_missing,
})

# Compare: true data vs. mean-imputed data
true_std = df["income"].std()
mean_val = df["income_with_missing"].mean()
imputed = df.with_columns(
    pl.col("income_with_missing").fill_null(mean_val).alias("income_imputed")
)
imputed_std = imputed["income_imputed"].std()

print(f"True std:     ${true_std:,.0f}")
print(f"Imputed std:  ${imputed_std:,.0f}")
print(f"Variance lost: {(1 - imputed_std / true_std) * 100:.1f}%")
# Output:
# True std:     $20,145
# Imputed std:  $14,892
# Variance lost: 26.1%

Twenty-six percent of your variance, gone. And your validation metrics won’t show it because the distortion affects train and validation equally.

Proper Imputation Strategies

Strategy 1: Indicator variables. Add a binary column income_is_missing alongside whatever imputation you choose. This lets the model learn that missingness itself is a signal — which it often is, especially under MAR and MNAR.

Strategy 2: Model-based imputation with IterativeImputer. This treats each feature with missing values as a target variable and fits a regression model on the other features. It’s the scikit-learn implementation of the MICE algorithm (Multiple Imputation by Chained Equations).

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
import numpy as np
import polars as pl

def impute_with_quality_check(
    df: pl.DataFrame,
    feature_cols: list[str],
    target_col: str,
) -> tuple[pl.DataFrame, dict[str, float]]:
    """Impute missing values and measure imputation quality via downstream
    model performance comparison.

    Returns the imputed DataFrame and a dict of quality metrics.
    """
    X = df.select(feature_cols).to_numpy()
    y = df[target_col].to_numpy()

    # Strategy A: Drop rows with missing values (baseline)
    mask_complete = ~np.isnan(X).any(axis=1) & ~np.isnan(y)
    X_complete = X[mask_complete]
    y_complete = y[mask_complete]

    baseline_scores = cross_val_score(
        Ridge(), X_complete, y_complete, cv=5, scoring="r2"
    )

    # Strategy B: Iterative imputation
    imputer = IterativeImputer(
        estimator=HistGradientBoostingRegressor(max_iter=100, random_state=42),
        max_iter=10,
        random_state=42,
        sample_posterior=False,
    )
    X_imputed = imputer.fit_transform(X)

    # Add missingness indicators
    missing_indicators = np.isnan(X).astype(np.float64)
    has_any_missing = missing_indicators.sum(axis=0) > 0
    X_augmented = np.hstack([X_imputed, missing_indicators[:, has_any_missing]])

    mask_target = ~np.isnan(y)
    imputed_scores = cross_val_score(
        Ridge(), X_augmented[mask_target], y[mask_target], cv=5, scoring="r2"
    )

    quality = {
        "baseline_r2_mean": round(float(baseline_scores.mean()), 4),
        "baseline_r2_std": round(float(baseline_scores.std()), 4),
        "imputed_r2_mean": round(float(imputed_scores.mean()), 4),
        "imputed_r2_std": round(float(imputed_scores.std()), 4),
        "rows_recovered": int(mask_target.sum() - mask_complete.sum()),
        "pct_rows_recovered": round(
            (mask_target.sum() - mask_complete.sum()) / len(df) * 100, 1
        ),
    }

    imputed_df = pl.DataFrame(
        {col: X_imputed[:, i] for i, col in enumerate(feature_cols)}
    )
    return imputed_df, quality

If the imputed model’s R² is close to or better than the complete-case baseline, your imputation is recovering useful signal. If it’s substantially worse, the imputation is injecting noise — reconsider your strategy or consult domain experts about the missingness mechanism.


3.2 — Target Leakage: The Silent Model Killer

Target leakage is the most expensive bug in machine learning. It produces models that perform brilliantly in development and fail catastrophically in production. It survives cross-validation, feature importance analysis, and model comparison. The only reliable defense is understanding what it is and knowing where to look.

What Leakage Looks Like

Leakage means your model has access to information during training that it will not have at prediction time. There are two variants:

Temporal leakage: Using data from the future to predict the past. A churn model trained on user activity data where the training window includes activity after the churn event. A demand forecasting model where the features include next-week inventory levels.

Data leakage: Using features that are derived from, or proxies for, the target variable. A fraud model that includes investigation_status. A hospital readmission model that includes discharge_disposition_code (which encodes whether the patient was readmitted).

The Fraud Model That Was Completely Useless

Here is a realistic example that every data scientist should internalize:

import polars as pl
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 50_000

# Simulate transaction data
transactions = pl.DataFrame({
    "amount": rng.exponential(150, n),
    "hour_of_day": rng.integers(0, 24, n),
    "merchant_category": rng.integers(0, 20, n),
    "distance_from_home_km": rng.exponential(10, n),
    "is_fraud": rng.binomial(1, 0.02, n),  # 2% fraud rate
})

# THE LEAK: transaction_status is set AFTER fraud investigation.
# "flagged" and "blocked" only appear for fraudulent transactions.
status_map = {0: "approved", 1: "blocked"}
transactions = transactions.with_columns(
    pl.col("is_fraud")
    .map_elements(lambda x: status_map[x], return_dtype=pl.Utf8)
    .alias("transaction_status")
)

# Encode the leaked feature
transactions = transactions.with_columns(
    (pl.col("transaction_status") == "blocked").cast(pl.Int8).alias("status_encoded")
)

# --- THE WRONG WAY: Train with the leaked feature ---
feature_cols_leaked = [
    "amount", "hour_of_day", "merchant_category",
    "distance_from_home_km", "status_encoded",
]
X_leaked = transactions.select(feature_cols_leaked).to_numpy()
y = transactions["is_fraud"].to_numpy()

scores_leaked = cross_val_score(
    HistGradientBoostingClassifier(random_state=42),
    X_leaked, y, cv=5, scoring="roc_auc",
)
print(f"With leakage:    AUC = {scores_leaked.mean():.4f}")
# With leakage:    AUC = 1.0000

# --- THE RIGHT WAY: Drop the leaked feature ---
feature_cols_clean = [
    "amount", "hour_of_day", "merchant_category",
    "distance_from_home_km",
]
X_clean = transactions.select(feature_cols_clean).to_numpy()

scores_clean = cross_val_score(
    HistGradientBoostingClassifier(random_state=42),
    X_clean, y, cv=5, scoring="roc_auc",
)
print(f"Without leakage: AUC = {scores_clean.mean():.4f}")
# Without leakage: AUC = 0.5023

AUC goes from 1.0000 to 0.5023 — random chance. The model only learned the leak. Every legitimate feature was ignored because the leaked feature was infinitely more predictive. In production, without transaction_status, the model outputs random noise.

The terrifying part: cross-validation didn’t catch it. Feature importance would show status_encoded as dominant, but many practitioners would interpret that as “status is a great predictor” rather than “status is the target wearing a disguise.”

Time-Aware Train/Test Splits

For any model that will make predictions about the future, your train/test split must respect the arrow of time. Random splits allow data from Tuesday to train a model that is evaluated on Monday — temporal leakage by construction.

import polars as pl
import numpy as np
from datetime import date, timedelta
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def temporal_train_test_split(
    df: pl.DataFrame,
    date_col: str,
    target_col: str,
    feature_cols: list[str],
    train_end: date,
    gap_days: int = 7,
    test_days: int = 30,
) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """Split data respecting temporal ordering with a gap period.

    The gap prevents label leakage from delayed outcome recording.
    For example, fraud labels may take 7 days to finalize; training
    on the 7 days before the test set would include unlabeled fraud.

    Timeline: [--- train ---|== gap ==|--- test ---]
    """
    test_start = train_end + timedelta(days=gap_days)
    test_end = test_start + timedelta(days=test_days)

    train_mask = pl.col(date_col) <= train_end
    test_mask = (pl.col(date_col) >= test_start) & (pl.col(date_col) <= test_end)

    train_df = df.filter(train_mask)
    test_df = df.filter(test_mask)

    X_train = train_df.select(feature_cols).to_numpy()
    y_train = train_df[target_col].to_numpy()
    X_test = test_df.select(feature_cols).to_numpy()
    y_test = test_df[target_col].to_numpy()

    print(f"Train: {len(train_df):,} rows up to {train_end}")
    print(f"Gap:   {gap_days} days (labels may still be pending)")
    print(f"Test:  {len(test_df):,} rows from {test_start} to {test_end}")

    return X_train, X_test, y_train, y_test


# Usage with expanding window for multiple evaluation periods
rng = np.random.default_rng(42)
n = 100_000
dates = [date(2025, 1, 1) + timedelta(days=int(d)) for d in rng.integers(0, 365, n)]

df = pl.DataFrame({
    "txn_date": dates,
    "amount": rng.exponential(200, n),
    "hour": rng.integers(0, 24, n),
    "merchant_cat": rng.integers(0, 15, n),
    "is_fraud": rng.binomial(1, 0.03, n),
}).sort("txn_date")

features = ["amount", "hour", "merchant_cat"]

X_train, X_test, y_train, y_test = temporal_train_test_split(
    df,
    date_col="txn_date",
    target_col="is_fraud",
    feature_cols=features,
    train_end=date(2025, 9, 1),
    gap_days=7,
    test_days=60,
)

model = HistGradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict_proba(X_test)[:, 1]
print(f"Temporal split AUC: {roc_auc_score(y_test, y_pred):.4f}")

The gap_days parameter is critical. In fraud detection, a transaction might not be labeled as fraudulent until 30 days later when the cardholder disputes it. Training on the 30 days before your test set means some of those training labels are still provisional — they’re labeled as legitimate only because the dispute hasn’t happened yet.

Leakage Detection Checklist

Run through this checklist for every model before deployment:

  1. Suspicious accuracy. Is your AUC > 0.99? Is your RMSE implausibly low? Investigate before celebrating.

  2. Feature importance concentration. Does a single feature dominate? What is that feature, and when is it populated relative to the prediction target?

  3. Temporal integrity. For each feature, answer: “At prediction time, would this value be available?” If the answer is “sometimes” or “it depends on when the batch runs,” you have a problem.

  4. Target-derived columns. Search your feature set for any column that is computed from, or populated as a consequence of, the target event. Common culprits: status codes, review outcomes, resolution timestamps, post-event flags.

  5. Aggregation windows. If you compute rolling averages or cumulative sums, verify the window does not extend past the prediction point. A “30-day rolling average” that includes the prediction date is a leak.

  6. Join fan-out. When joining tables, verify that the joined data existed at the time of the training example. Joining a user’s current profile to their historical transactions leaks future information.

  7. Data freshness. Confirm that the features in your training set are generated by the same pipeline that generates features at serving time. A common leak: training features are computed from a data warehouse snapshot that includes records not yet available at serving time.

If you find a leak, do not patch it — rebuild. A model trained with leakage has learned the wrong representation. Removing the leaked feature and retraining is the only safe option.