Metrics That Matter and Cross-Validation Done Right

8.1 — Choosing the Right Metric

Here is a question that will expose whether your evaluation is serious or performative: how much does each type of error cost?

Not in abstract terms. In dollars. In lives. In customer trust. Every classification error has a concrete consequence, and the magnitude of that consequence is almost never symmetric. Missing a cancer diagnosis (false negative) is not the same as ordering an unnecessary biopsy (false positive). Approving a fraudulent transaction is not the same as declining a legitimate one. Sending an irrelevant push notification is not the same as failing to surface a breaking news alert.

The metric you choose is the loss function of your evaluation. If that metric does not reflect the actual cost structure of your problem, you are optimizing for the wrong thing — and you will get exactly what you optimized for.

The Cost Matrix

Every binary classification problem has four outcomes, each with a business cost:

	Predicted Positive	Predicted Negative
Actually Positive	True Positive (benefit: +B)	False Negative (cost: -C_FN)
Actually Negative	False Positive (cost: -C_FP)	True Negative (benefit: +0)

The optimal decision threshold is not 0.5. It is the threshold that minimizes the total expected cost. This depends on three things: the cost ratio C_FN / C_FP, the class prevalence, and the model’s score distribution.

import numpy as np
from sklearn.metrics import confusion_matrix
from dataclasses import dataclass


@dataclass
class CostMatrix:
    """Business costs for each classification outcome."""
    cost_fp: float  # Cost of false positive (e.g., unnecessary biopsy: $2,000)
    cost_fn: float  # Cost of false negative (e.g., missed cancer: $500,000)
    benefit_tp: float = 0.0  # Benefit of true positive (e.g., early treatment)
    benefit_tn: float = 0.0  # Benefit of true negative


def optimize_threshold(
    y_true: np.ndarray,
    y_scores: np.ndarray,
    costs: CostMatrix,
    thresholds: np.ndarray | None = None,
) -> tuple[float, float]:
    """
    Find the decision threshold that minimizes total business cost.

    Returns (best_threshold, min_cost).
    """
    if thresholds is None:
        thresholds = np.linspace(0.01, 0.99, 200)

    best_threshold = 0.5
    min_cost = float("inf")

    for threshold in thresholds:
        y_pred = (y_scores >= threshold).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

        total_cost = (
            fp * costs.cost_fp
            + fn * costs.cost_fn
            - tp * costs.benefit_tp
            - tn * costs.benefit_tn
        )

        if total_cost < min_cost:
            min_cost = total_cost
            best_threshold = threshold

    return best_threshold, min_cost


# --- Fraud detection example ---
# Missing a fraud: $5,000 loss. Flagging legitimate: $20 review cost.
fraud_costs = CostMatrix(cost_fp=20.0, cost_fn=5_000.0, benefit_tp=5_000.0)

# Simulate a model's probability scores
rng = np.random.default_rng(42)
n_samples = 10_000
y_true = rng.binomial(1, 0.005, size=n_samples)  # 0.5% fraud rate
y_scores = np.where(
    y_true == 1,
    rng.beta(5, 2, size=n_samples),    # frauds score higher
    rng.beta(2, 5, size=n_samples),    # legit scores lower
)

best_thresh, min_cost = optimize_threshold(y_true, y_scores, fraud_costs)
print(f"Optimal threshold: {best_thresh:.3f}")
print(f"Minimum total cost: ${min_cost:,.0f}")

# Compare with naive 0.5 threshold
naive_pred = (y_scores >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, naive_pred).ravel()
naive_cost = fp * 20 + fn * 5_000 - tp * 5_000
print(f"Naive 0.5 threshold cost: ${naive_cost:,.0f}")

The optimal threshold will be well below 0.5. With a 250:1 cost ratio (FN/FP), the model should be aggressive about flagging potential fraud — the cost of missing one far exceeds the cost of reviewing many false alarms.

Matching Metrics to Problems

Stop defaulting to AUC-ROC. Use this decision framework:

AUC-ROC — Use when you care about rank-ordering across all thresholds and class balance is moderate. Do not use on heavily imbalanced data — a model that identifies 1 positive out of 1,000 negatives can still achieve high AUC by ranking that one positive highly, even though it is useless in practice.

PR-AUC (Average Precision) — Use on imbalanced data where the positive class is rare and you care about the precision-recall trade-off. Fraud detection, disease screening, rare event prediction. PR-AUC is sensitive to the performance on the minority class in a way that AUC-ROC is not.

F-beta score — Use when you have a specific preference for precision vs. recall. F1 weights them equally. F2 weights recall twice as much (you are more afraid of missing positives). F0.5 weights precision twice as much (you are more afraid of false alarms). Choose beta based on your cost ratio.

Log loss (cross-entropy) — Use when you need well-calibrated probabilities, not just rank ordering. Log loss penalizes confident wrong predictions severely. If your model says “99% positive” and the true label is negative, log loss makes it pay dearly.

Brier score — The mean squared difference between predicted probabilities and actual outcomes. Like log loss but on a more interpretable 0–1 scale. Use when you need calibrated probabilities and want a metric that decomposes into calibration, refinement, and resolution components.

For regression problems, the same principle applies — match the metric to the loss structure:

RMSE penalizes large errors quadratically. Use when big errors are disproportionately costly (energy demand forecasting — a 50 MW error is more than 5x worse than a 10 MW error).
MAE treats all errors linearly. Use when errors scale proportionally (delivery time estimation — being off by 10 minutes is exactly twice as bad as being off by 5).
MAPE measures percentage error. Use when relative error matters more than absolute error (revenue forecasting — being off by $1M matters more when revenue is $10M than when it is $1B).

Custom Scoring in Sklearn

You can plug your cost-driven metric directly into sklearn’s model selection tools:

from sklearn.metrics import make_scorer, fbeta_score
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier


def business_cost_score(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    cost_fp: float = 20.0,
    cost_fn: float = 5_000.0,
) -> float:
    """
    Custom scorer: negative total business cost.
    Negative because sklearn maximizes scorers.
    """
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    total_cost = fp * cost_fp + fn * cost_fn
    return -total_cost  # Negative: sklearn maximizes, we minimize cost


cost_scorer = make_scorer(business_cost_score, cost_fp=20.0, cost_fn=5_000.0)

model = GradientBoostingClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring=cost_scorer)
print(f"Mean business cost: ${-scores.mean():,.0f} ± ${scores.std():,.0f}")

This forces every model comparison to speak in dollars, not abstract scores. When your stakeholder asks “which model is better?”, you answer with the expected cost difference — a number they can act on.

8.2 — Cross-Validation Strategies

Cross-validation has one job: estimate how your model will perform on data it has never seen. If your CV strategy allows information from the test set to leak into training — even subtly — your estimate is worthless. Worse than worthless, because you will trust it.

The Cardinal Sin: Random K-Fold on Time-Series Data

Random k-fold shuffles your data, then splits it into folds. For time-series data, this means your model trains on data from March and July while predicting February and June. You are training on the future to predict the past. Every pattern that exists in the time-series structure — trends, seasonality, regime changes — will leak across the fold boundary.

The result: your CV estimate will be higher than real-world performance. Every time.

import pandas as pd
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score
from sklearn.ensemble import RandomForestRegressor

# Simulate daily stock feature data with a trend
n_days = 1_000
dates = pd.date_range("2022-01-01", periods=n_days, freq="D")
rng = np.random.default_rng(42)

trend = np.linspace(0, 3, n_days)
noise = rng.normal(0, 1, n_days)
y = trend + noise  # Target with upward trend

# Features: lagged values + some noise features
X = pd.DataFrame({
    "lag_1": np.roll(y, 1),
    "lag_7": np.roll(y, 7),
    "day_of_week": dates.dayofweek,
    "noise_1": rng.normal(0, 1, n_days),
    "noise_2": rng.normal(0, 1, n_days),
})
X.iloc[:7] = 0  # Zero out the rolled-over values

model = RandomForestRegressor(n_estimators=100, random_state=42)

# --- WRONG: Random K-Fold ---
naive_scores = cross_val_score(model, X, y, cv=KFold(5, shuffle=True, random_state=42))
print(f"Random K-Fold R²:     {naive_scores.mean():.4f} ± {naive_scores.std():.4f}")

# --- RIGHT: Time-Series Split ---
ts_scores = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5))
print(f"TimeSeries Split R²:  {ts_scores.mean():.4f} ± {ts_scores.std():.4f}")

Run this. The random k-fold R² will be substantially higher — not because the model is better, but because future data leaked into the training set. The time-series split gives you the honest number: the one that matches what you would see in production.

Time-Series CV with a Gap Period

In many forecasting problems, your features include lagged values. If you split right at the boundary with no gap, the last few training observations contain information about the first test observations through those lags. Insert a gap:

from sklearn.model_selection import BaseCrossValidator


class TimeSeriesSplitWithGap(BaseCrossValidator):
    """Time-series cross-validation with a gap between train and test."""

    def __init__(self, n_splits: int = 5, gap: int = 7, test_size: int | None = None):
        self.n_splits = n_splits
        self.gap = gap
        self.test_size = test_size

    def get_n_splits(self, X=None, y=None, groups=None) -> int:
        return self.n_splits

    def split(self, X, y=None, groups=None):
        n_samples = len(X)
        test_size = self.test_size or n_samples // (self.n_splits + 1)

        for i in range(self.n_splits):
            test_end = n_samples - (self.n_splits - 1 - i) * test_size
            test_start = test_end - test_size
            train_end = test_start - self.gap

            if train_end <= 0:
                continue

            train_idx = np.arange(0, train_end)
            test_idx = np.arange(test_start, test_end)
            yield train_idx, test_idx


ts_gap_cv = TimeSeriesSplitWithGap(n_splits=5, gap=14)
gap_scores = cross_val_score(model, X, y, cv=ts_gap_cv)
print(f"TimeSeries + 14-day gap R²: {gap_scores.mean():.4f} ± {gap_scores.std():.4f}")

The gap period should match your prediction horizon. If you forecast 7 days ahead, a 7-day gap ensures no leakage through lagged features.

Group Cross-Validation

When observations are not independent — multiple readings per patient, multiple transactions per customer, multiple images per object — random k-fold will split the same group across train and test. The model memorizes group-specific patterns and you get an inflated estimate of generalization.

from sklearn.model_selection import GroupKFold
from sklearn.linear_model import LogisticRegression

# Medical study: 200 patients, 10 readings each
n_patients = 200
readings_per_patient = 10
n_total = n_patients * readings_per_patient

rng = np.random.default_rng(42)

# Each patient has a baseline risk level — this is what leaks
patient_baselines = rng.normal(0, 1, n_patients)
patient_ids = np.repeat(np.arange(n_patients), readings_per_patient)

X_medical = np.column_stack([
    np.repeat(patient_baselines, readings_per_patient).reshape(-1, 1),
    rng.normal(0, 1, (n_total, 4)),  # Other features
])
y_medical = (
    np.repeat(patient_baselines, readings_per_patient) +
    rng.normal(0, 0.5, n_total)
) > 0.5
y_medical = y_medical.astype(int)

lr = LogisticRegression(random_state=42)

# --- WRONG: Random K-Fold leaks patient info ---
naive = cross_val_score(lr, X_medical, y_medical, cv=KFold(5, shuffle=True, random_state=42))
print(f"Random K-Fold AUC:  {naive.mean():.4f}")

# --- RIGHT: Group K-Fold keeps all patient data in one fold ---
from sklearn.metrics import roc_auc_score
group = cross_val_score(
    lr, X_medical, y_medical,
    cv=GroupKFold(n_splits=5),
    groups=patient_ids,
    scoring="roc_auc",
)
print(f"Group K-Fold AUC:   {group.mean():.4f}")

The difference will be stark. GroupKFold ensures that if patient #42 appears in training, none of patient #42’s readings appear in testing. This is the only honest estimate of how the model performs on new patients — which is what you actually care about.

Cross-Validation Strategies

Nested Cross-Validation

Here is a subtle trap: you use cross-validation to tune hyperparameters, then report the best CV score as your performance estimate. That estimate is biased upward. You searched over many configurations and reported the winner — the “best” score includes variance from the search process itself.

Nested CV solves this. The outer loop estimates generalization performance. The inner loop tunes hyperparameters. They use different data splits, so the outer estimate is unbiased.

import optuna
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

optuna.logging.set_verbosity(optuna.logging.WARNING)


def nested_cv_with_optuna(
    X: np.ndarray,
    y: np.ndarray,
    n_outer_splits: int = 5,
    n_inner_splits: int = 3,
    n_trials: int = 30,
) -> list[float]:
    """
    Nested cross-validation with Optuna hyperparameter search.
    Outer loop: unbiased performance estimation.
    Inner loop: hyperparameter optimization on training data only.
    """
    outer_cv = StratifiedKFold(n_splits=n_outer_splits, shuffle=True, random_state=42)
    outer_scores: list[float] = []

    for fold_idx, (train_idx, test_idx) in enumerate(outer_cv.split(X, y)):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        # Inner loop: tune hyperparameters on training set only
        def objective(trial: optuna.Trial) -> float:
            params = {
                "n_estimators": trial.suggest_int("n_estimators", 50, 300),
                "max_depth": trial.suggest_int("max_depth", 3, 10),
                "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
                "subsample": trial.suggest_float("subsample", 0.6, 1.0),
                "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 20),
            }
            inner_cv = StratifiedKFold(n_splits=n_inner_splits, shuffle=True, random_state=42)
            inner_scores = cross_val_score(
                GradientBoostingClassifier(**params, random_state=42),
                X_train, y_train,
                cv=inner_cv, scoring="roc_auc",
            )
            return inner_scores.mean()

        study = optuna.create_study(direction="maximize")
        study.optimize(objective, n_trials=n_trials, show_progress_bar=False)

        # Train best model on full training set, evaluate on held-out test
        best_model = GradientBoostingClassifier(
            **study.best_params, random_state=42
        )
        best_model.fit(X_train, y_train)
        test_score = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
        outer_scores.append(test_score)
        print(f"  Fold {fold_idx + 1}: AUC = {test_score:.4f}")

    return outer_scores


# Example usage with synthetic data
from sklearn.datasets import make_classification

X_cls, y_cls = make_classification(
    n_samples=2_000, n_features=20, n_informative=10,
    n_redundant=5, weights=[0.7, 0.3], random_state=42,
)

print("Nested CV results:")
scores = nested_cv_with_optuna(X_cls, y_cls, n_trials=20)
print(f"\nUnbiased AUC estimate: {np.mean(scores):.4f} ± {np.std(scores):.4f}")

The nested CV estimate will typically be lower than the inner CV’s best score. That difference is the optimism bias you eliminate. This is the number you report to your stakeholder — it is the honest one.

One caution: nested CV is computationally expensive. With 5 outer folds, 3 inner folds, and 30 Optuna trials, you train 450 models. For large datasets or slow models, you may need to reduce the search space or use a faster surrogate model in the inner loop.