Outliers and Distribution Shift

3.3 — Outliers: Not All Extreme Values Are Equal

A transaction for $487,000 appears in your dataset. Is it a wealthy customer buying jewelry? A fat-finger error where someone typed 487000 instead of 487? A test transaction from a QA engineer? An ETL bug that multiplied two columns instead of one?

Your answer to this question determines whether you keep the row, clip it, or drop it. And there is no statistical test that can answer it for you — this is a domain decision dressed in statistical clothing.

The Three Types

True outliers (rare events). These are genuine observations from the tail of the distribution. A legitimate high-value transaction. A patient with an extremely rare genetic variant. A power grid sensor recording a once-per-decade voltage spike. These outliers are often the most important data points in your dataset — they represent exactly the scenarios your model most needs to handle correctly. Removing them because they’re “extreme” is removing the signal.

Anomalies (system errors). Sensor malfunctions that report −9999°C. Network timeouts that produce zero-byte responses logged as valid entries. GPS coordinates that place a delivery in the middle of the Pacific Ocean. These are measurements of your instrumentation’s failure modes, not measurements of the phenomenon you care about. They should be removed or flagged, never imputed.

Bad data (ETL bugs). A date field parsed as a numeric value (20250215 as a dollar amount). A currency conversion applied twice. A join that duplicated rows, inflating aggregates. These are pipeline bugs masquerading as data points. The fix is in the pipeline, not in the model.

Decision Framework

Question	If Yes →	If No →
Could this value physically/logically exist?	Possible true outlier	Bad data or system error → investigate pipeline
Is the measurement system reliable for this range?	Possible true outlier	Anomaly → flag and remove
Does removing it change a business-critical metric by >1%?	Keep it — it carries signal	Removal is low-risk
Does your model need to generalize to this region?	Keep it and ensure adequate representation	Consider winsorizing

The key principle: outlier handling is a domain decision, not a statistical one. A dermatology model should keep the rare skin condition. A pricing model should remove the test transaction. No algorithm can make this distinction for you.

Detection Methods: IQR vs. Z-Score vs. Isolation Forest

Each method encodes different assumptions about your data. Choosing the wrong one is worse than not checking at all.

import polars as pl
import numpy as np
from sklearn.ensemble import IsolationForest

def detect_outliers_comparison(
    df: pl.DataFrame,
    col: str,
    iqr_multiplier: float = 1.5,
    z_threshold: float = 3.0,
    contamination: float = 0.05,
) -> pl.DataFrame:
    """Compare three outlier detection methods on a single column.

    Returns the original DataFrame with three boolean indicator columns.
    """
    values = df[col]

    # --- Method 1: IQR (Interquartile Range) ---
    # Assumption: Data is roughly symmetric. Robust to moderate skew.
    # Best for: Initial screening of univariate distributions.
    # Fails when: Data is multimodal or heavily skewed (e.g., income, claim amounts).
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower_iqr = q1 - iqr_multiplier * iqr
    upper_iqr = q3 + iqr_multiplier * iqr

    # --- Method 2: Z-Score ---
    # Assumption: Data is normally distributed. Uses mean and std, which are
    # themselves sensitive to outliers — a circular problem.
    # Best for: Roughly Gaussian features after log transform.
    # Fails when: Distribution has heavy tails (the outliers inflate std,
    # making Z-scores smaller, hiding the very outliers you're looking for).
    mean_val = values.mean()
    std_val = values.std()

    # --- Method 3: Isolation Forest ---
    # Assumption: Outliers are "few and different." No distributional assumption.
    # Best for: Multivariate outlier detection across many features.
    # Fails when: Contamination parameter is poorly calibrated or inlier
    # clusters have very different densities.
    iso_forest = IsolationForest(
        contamination=contamination, random_state=42, n_jobs=-1
    )
    iso_labels = iso_forest.fit_predict(values.to_numpy().reshape(-1, 1))

    result = df.with_columns(
        ((pl.col(col) < lower_iqr) | (pl.col(col) > upper_iqr))
        .alias(f"{col}_outlier_iqr"),
        (((pl.col(col) - mean_val) / std_val).abs() > z_threshold)
        .alias(f"{col}_outlier_zscore"),
        pl.Series(f"{col}_outlier_iforest", iso_labels == -1),
    )

    # Summary
    n = len(df)
    for method in ["iqr", "zscore", "iforest"]:
        flagged = result[f"{col}_outlier_{method}"].sum()
        print(f"{method:>8}: {flagged:,} flagged ({flagged / n * 100:.1f}%)")

    return result


# Example: right-skewed financial data where methods disagree
rng = np.random.default_rng(42)
n = 10_000
amounts = np.concatenate([
    rng.exponential(100, n - 50),       # bulk of transactions
    rng.exponential(5_000, 30),          # legitimate high-value purchases
    np.array([999_999.0] * 10),          # ETL bug: duplicated values
    rng.normal(50, 5, 10),               # normal low-value transactions
])
rng.shuffle(amounts)

df = pl.DataFrame({"amount": amounts})
result = detect_outliers_comparison(df, "amount")

On right-skewed data like transaction amounts, IQR will flag legitimate high-value purchases as outliers. Z-score will under-flag because the extreme values inflate the standard deviation. Isolation Forest is the most appropriate here — but only if your contamination estimate is reasonable.

Robust Statistics

When outliers are present (and they always are), use estimators that don’t break:

import numpy as np
from scipy.stats import median_abs_deviation

def robust_summary(values: np.ndarray) -> dict[str, float]:
    """Compute robust location and scale estimates.

    - Median: unaffected by up to 50% contamination
    - MAD (Median Absolute Deviation): robust scale estimate
    - Trimmed mean: compromise between mean and median
    """
    clean = values[np.isfinite(values)]

    # Trimmed mean: drop top and bottom 5%
    sorted_vals = np.sort(clean)
    trim_n = int(len(sorted_vals) * 0.05)
    trimmed = sorted_vals[trim_n:-trim_n] if trim_n > 0 else sorted_vals

    return {
        "mean": float(np.mean(clean)),
        "median": float(np.median(clean)),
        "trimmed_mean_5pct": float(np.mean(trimmed)),
        "std": float(np.std(clean)),
        "mad": float(median_abs_deviation(clean)),
        "mad_scaled": float(median_abs_deviation(clean, scale="normal")),
    }


# With 0.1% contamination, mean shifts by 40%. Median doesn't move.
rng = np.random.default_rng(42)
clean_data = rng.normal(100, 15, 10_000)
contaminated = np.concatenate([clean_data, np.array([1_000_000] * 10)])

print("Clean data:")
for k, v in robust_summary(clean_data).items():
    print(f"  {k}: {v:.2f}")

print("\nContaminated data (10 values at 1,000,000):")
for k, v in robust_summary(contaminated).items():
    print(f"  {k}: {v:.2f}")

The MAD scaled by 1.4826 (the scale="normal" parameter) estimates the standard deviation of the underlying clean distribution, even when outliers are present. Use it wherever you would use std when you suspect contamination.

3.4 — Distribution Shift: When Your Model’s World Changes

Your model was trained on data from the world as it was. It makes predictions in the world as it is. These are not the same world.

A credit scoring model trained during 2019’s economic expansion encounters 2020’s pandemic. An e-commerce recommendation engine trained on desktop browsing behavior meets a user base that has migrated to mobile. A medical imaging model trained on data from one hospital is deployed at another with different scanner hardware.

In every case, the model’s assumptions — encoded implicitly in its learned parameters — no longer hold. Performance degrades, sometimes gradually, sometimes overnight.

Three Types of Shift

Covariate shift — the distribution of input features $P(X)$ changes, but the relationship $P(Y|X)$ stays the same. Your model knows how to map inputs to outputs, but the inputs it encounters in production look different from training. Example: a loan default model trained on a customer base aged 25–45 deployed to a new market with customers aged 18–65. The model has never seen an 18-year-old applicant.

Concept drift — the relationship $P(Y|X)$ itself changes. The same customer profile that predicted low default risk in 2019 predicts high risk in 2020 because an economic recession altered the causal mechanism. The inputs look the same; their meaning has changed.

Prior probability shift — the class distribution $P(Y)$ changes. A fraud detection model trained on 2% fraud rate deployed in a market with 8% fraud rate. The decision boundary was calibrated for the wrong base rate, producing miscalibrated confidence scores.

Detecting Covariate Shift with a Domain Classifier

The most practical test for covariate shift: train a classifier to distinguish between your training data and your production data. If it succeeds, the distributions differ.

import polars as pl
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def detect_covariate_shift(
    train_df: pl.DataFrame,
    prod_df: pl.DataFrame,
    feature_cols: list[str],
    threshold: float = 0.6,
) -> dict[str, float | bool | list[str]]:
    """Detect covariate shift by training a domain classifier.

    If a classifier can distinguish train from production data with
    AUC > threshold, significant covariate shift is present.

    Returns:
        - auc: cross-validated AUC of the domain classifier
        - shift_detected: boolean
        - top_shifted_features: features ranked by importance in
          distinguishing the two domains
    """
    # Label: 0 = training, 1 = production
    X_train = train_df.select(feature_cols).to_numpy()
    X_prod = prod_df.select(feature_cols).to_numpy()

    X = np.vstack([X_train, X_prod])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_prod))])

    # Shuffle
    rng = np.random.default_rng(42)
    idx = rng.permutation(len(X))
    X, y = X[idx], y[idx]

    clf = HistGradientBoostingClassifier(max_iter=100, random_state=42)
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    auc = float(scores.mean())

    # Fit on all data to extract feature importances
    clf.fit(X, y)

    # Feature importances tell you WHICH features shifted
    importances = clf.feature_importances_
    ranked_idx = np.argsort(importances)[::-1]
    top_features = [
        f"{feature_cols[i]} ({importances[i]:.3f})"
        for i in ranked_idx[:5]
    ]

    return {
        "auc": round(auc, 4),
        "shift_detected": auc > threshold,
        "top_shifted_features": top_features,
    }


# Example: age distribution shifts between training and production
rng = np.random.default_rng(42)

train_data = pl.DataFrame({
    "age": rng.normal(35, 8, 5_000),
    "income": rng.normal(60_000, 15_000, 5_000),
    "credit_score": rng.normal(720, 40, 5_000),
})

# Production: younger customer base, similar income and credit
prod_data = pl.DataFrame({
    "age": rng.normal(27, 10, 5_000),       # shifted!
    "income": rng.normal(58_000, 16_000, 5_000),
    "credit_score": rng.normal(710, 45, 5_000),
})

result = detect_covariate_shift(
    train_data, prod_data,
    feature_cols=["age", "income", "credit_score"],
)
print(f"Domain classifier AUC: {result['auc']}")
print(f"Shift detected: {result['shift_detected']}")
print(f"Top shifted features: {result['top_shifted_features']}")

An AUC of 0.5 means the classifier can’t tell the two datasets apart — no detectable shift. An AUC of 0.8+ is a red alert: your production data comes from a noticeably different distribution. The feature importances tell you which features shifted, guiding your investigation.

Monitoring with KS Test and Population Stability Index

For continuous monitoring, you need lightweight statistical tests that run on every batch of incoming data:

import numpy as np
from scipy.stats import ks_2samp

def compute_psi(
    reference: np.ndarray,
    current: np.ndarray,
    n_bins: int = 10,
) -> float:
    """Population Stability Index (PSI) between two distributions.

    PSI < 0.1: no significant shift
    PSI 0.1–0.25: moderate shift — investigate
    PSI > 0.25: significant shift — retrain or intervene
    """
    # Use reference distribution to define bin edges
    _, bin_edges = np.histogram(reference, bins=n_bins)

    ref_counts = np.histogram(reference, bins=bin_edges)[0]
    cur_counts = np.histogram(current, bins=bin_edges)[0]

    # Convert to proportions, clip to avoid log(0)
    ref_pct = np.clip(ref_counts / ref_counts.sum(), 1e-6, None)
    cur_pct = np.clip(cur_counts / cur_counts.sum(), 1e-6, None)

    psi = float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
    return round(psi, 4)


def monitor_feature_distributions(
    reference_df: dict[str, np.ndarray],
    current_df: dict[str, np.ndarray],
    ks_alpha: float = 0.05,
) -> list[dict[str, str | float | bool]]:
    """Run KS test and PSI on each feature, return monitoring report.

    Args:
        reference_df: Feature name → array from training/reference period
        current_df: Feature name → array from current production batch
        ks_alpha: Significance level for KS test

    Returns:
        List of per-feature monitoring results, sorted by severity.
    """
    results: list[dict[str, str | float | bool]] = []

    for feature in reference_df:
        ref = reference_df[feature]
        cur = current_df[feature]

        ks_stat, ks_pval = ks_2samp(ref, cur)
        psi = compute_psi(ref, cur)

        severity = "OK"
        if psi > 0.25 or ks_pval < ks_alpha:
            severity = "CRITICAL"
        elif psi > 0.1:
            severity = "WARNING"

        results.append({
            "feature": feature,
            "ks_statistic": round(ks_stat, 4),
            "ks_p_value": round(ks_pval, 6),
            "psi": psi,
            "severity": severity,
        })

    # Sort by PSI descending (most shifted first)
    results.sort(key=lambda x: x["psi"], reverse=True)
    return results


# Example: monitoring a production deployment
rng = np.random.default_rng(42)

reference = {
    "age": rng.normal(35, 8, 10_000),
    "income": rng.normal(60_000, 15_000, 10_000),
    "session_duration_s": rng.exponential(120, 10_000),
}

# Week 1: no shift
week1 = {
    "age": rng.normal(35, 8, 2_000),
    "income": rng.normal(60_000, 15_000, 2_000),
    "session_duration_s": rng.exponential(120, 2_000),
}

# Week 8: mobile app launches, session durations change dramatically
week8 = {
    "age": rng.normal(34, 9, 2_000),
    "income": rng.normal(59_000, 14_000, 2_000),
    "session_duration_s": rng.exponential(45, 2_000),  # shifted!
}

print("=== Week 1 (stable) ===")
for r in monitor_feature_distributions(reference, week1):
    print(f"  {r['feature']:>20}: PSI={r['psi']:.4f}  KS p={r['ks_p_value']:.4f}  [{r['severity']}]")

print("\n=== Week 8 (after mobile launch) ===")
for r in monitor_feature_distributions(reference, week8):
    print(f"  {r['feature']:>20}: PSI={r['psi']:.4f}  KS p={r['ks_p_value']:.4f}  [{r['severity']}]")

Mitigation Strategies

Once you detect shift, you need a response strategy. The right choice depends on whether the shift is gradual or sudden, and whether it affects inputs or the target relationship.

Retraining windows. The simplest strategy: retrain your model periodically on a sliding window of recent data. A 90-day window means your model always reflects the last three months of reality. The window size is a bias-variance tradeoff — shorter windows adapt faster but have less data; longer windows are more stable but slower to respond.

Triggered retraining. Instead of a fixed schedule, retrain when monitoring detects significant shift. This is more efficient — you don’t retrain when nothing has changed, and you retrain immediately when something does. Use PSI > 0.25 on any critical feature as a trigger.

Robust features. Some features resist shift better than others. Ratios (debt-to-income) shift less than raw values (income). Relative rankings shift less than absolute values. Engineered features that capture relationships rather than magnitudes are more portable across distribution shifts.

Calibration. For prior probability shift, recalibrating your model’s output probabilities on recent data can restore performance without full retraining. Platt scaling or isotonic regression on a small labeled sample from the new distribution adjusts the decision boundary.

Distribution Shift Types

The monitoring pipeline above should be the first thing you deploy alongside any production model. A model without distribution monitoring is a model that fails silently — and silent failures are the ones that cost $2M.