Categorical Encoding and Time-Series Features

4.1 — High-Cardinality Categoricals

Your dataset has a zip_code column with 10,000 unique values. You call pd.get_dummies() (or its Polars equivalent) and your feature matrix explodes from 20 columns to 10,019: the original column is replaced by 10,000 indicator columns, each almost entirely zeros. Your model can now spend much of its capacity fitting noise in those sparse indicators, and training time and memory can grow by orders of magnitude.

One-hot encoding works for low-cardinality categoricals: gender (2–3 values), day of week (7 values), product category (maybe 20 values). The moment cardinality exceeds a few dozen, one-hot encoding becomes a liability. The feature matrix becomes sparse, memory-hungry, and filled with columns that contain almost no information — a single zip code that appears in 0.01% of rows gives the model one observation to learn from.
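To make the cost concrete, here is a back-of-the-envelope sketch of the storage involved. The row count of one million and the one-byte indicator type are assumptions chosen for illustration, not figures from a real dataset.

```python
# Rough storage cost of one-hot encoding a single high-cardinality column.
# n_rows and the int8 indicator dtype are illustrative assumptions.
n_rows = 1_000_000
cardinality = 10_000
bytes_per_cell = 1  # int8 indicator

# Dense one-hot: every row stores one cell per category.
dense_bytes = n_rows * cardinality * bytes_per_cell

# Each row has exactly one nonzero indicator, so only 1 / cardinality
# of the cells carry any information.
nonzero_fraction = 1 / cardinality

# A single dense numeric encoding (float64) needs one cell per row.
encoded_bytes = n_rows * 8

print(f"dense one-hot:         {dense_bytes / 1e9:.1f} GB")
print(f"nonzero cells:         {nonzero_fraction:.4%}")
print(f"single float64 column: {encoded_bytes / 1e6:.1f} MB")
```

Sparse matrix formats reduce the memory cost, but they do not change the fact that each rare category contributes only a handful of informative rows.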

The solution is to encode categoricals as dense, low-dimensional numeric features that capture the relationship between the category and the target variable. This is where target encoding enters — and where most implementations get it wrong.

Target Encoding: The Idea, the Math, the Danger

Target encoding replaces each category with the mean of the target variable for that category. If zip code 90210 has a 12% fraud rate in your training data, every row with zip_code == 90210 gets the value 0.12.

The idea is elegant: you compress a 10,000-dimensional one-hot vector into a single numeric feature that directly encodes each category’s relationship to the target.

The danger is severe: you are encoding the target into the features. If a zip code appears only 3 times in training and 2 of those are fraudulent, the encoding will be 0.667 — a dramatically overfit estimate that tells the model “this zip code is extremely high risk” based on almost no evidence.

Here is the naive implementation — do not use this in production:

import polars as pl
import numpy as np


def naive_target_encoding(
    df: pl.DataFrame,
    cat_col: str,
    target_col: str,
) -> pl.DataFrame:
    """DANGEROUS: Naive target encoding with no regularization.

    This leaks the target into the features and overfits on
    low-frequency categories. Shown here only to illustrate
    the problem.
    """
    means = df.group_by(cat_col).agg(
        pl.col(target_col).mean().alias(f"{cat_col}_encoded")
    )
    return df.join(means, on=cat_col, how="left")


# Why this fails: a zip code with 2 observations, both fraudulent, is
# encoded as 1.0. The model reads that as a near-certain fraud signal,
# but it is an estimate from two rows: overfitting noise.
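
The leakage can be demonstrated on data where the category carries no signal at all. The sketch below is a NumPy-only illustration, separate from the Polars function above; the row count, cardinality, and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_categories = 2_000, 1_000

# Categories and a binary target drawn independently: by construction,
# the category carries no information about the target.
cats = rng.integers(0, n_categories, size=n_rows)
y = rng.integers(0, 2, size=n_rows).astype(float)

# Naive target encoding: per-category mean of y, including each row's own y.
sums = np.bincount(cats, weights=y, minlength=n_categories)
counts = np.bincount(cats, minlength=n_categories)
naive_enc = sums[cats] / counts[cats]

# On the training rows the encoding looks predictive even though it is noise.
train_corr = np.corrcoef(naive_enc, y)[0, 1]

# Fresh targets for the same categories expose the illusion.
y_new = rng.integers(0, 2, size=n_rows).astype(float)
fresh_corr = np.corrcoef(naive_enc, y_new)[0, 1]

print(f"correlation with training target: {train_corr:.2f}")
print(f"correlation with fresh target:    {fresh_corr:.2f}")
```

The first correlation is large only because each row's own target is part of its encoding; the second, measured against targets the encoding never saw, should be close to zero.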

The fix requires two components: smoothing (regularization toward the global mean) and out-of-fold computation (never compute the encoding for a row using that row’s own target value).

Regularized Target Encoding with Smoothing

Smoothing blends the category mean with the global mean, weighted by the number of observations. Categories with many observations get encodings close to their actual mean. Categories with few observations get pulled toward the global mean.

The formula: $\text{encoded}_i = \frac{n_i \cdot \bar{y}_i + m \cdot \bar{y}_{\text{global}}}{n_i + m}$

Where $n_i$ is the count for category $i$, $\bar{y}_i$ is the mean target for category $i$, $\bar{y}_{\text{global}}$ is the global target mean, and $m$ is the smoothing parameter (higher $m$ = more regularization).
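A worked instance of the formula may help. The helper name smoothed_mean and the 5% global fraud rate are hypothetical, chosen only to show how the same category mean is treated at different sample sizes.

```python
def smoothed_mean(n: int, cat_mean: float, global_mean: float, m: float) -> float:
    """Blend a category mean with the global mean, weighted by n and m."""
    return (n * cat_mean + m * global_mean) / (n + m)


# Hypothetical numbers: 5% global fraud rate, smoothing m = 10.
global_rate = 0.05
m = 10.0

# Same observed rate (2/3), very different amounts of evidence.
rare = smoothed_mean(n=3, cat_mean=2 / 3, global_mean=global_rate, m=m)
common = smoothed_mean(n=3_000, cat_mean=2 / 3, global_mean=global_rate, m=m)

print(f"3 rows, 2 frauds:       {rare:.3f}")
print(f"3000 rows, 2000 frauds: {common:.3f}")
```

The three-row category is pulled from 0.667 down to about 0.19, while the 3,000-row category stays near 0.66.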

import polars as pl
import numpy as np
from sklearn.model_selection import KFold


def regularized_target_encoding(
    df: pl.DataFrame,
    cat_col: str,
    target_col: str,
    smoothing: float = 10.0,
    n_splits: int = 5,
    seed: int = 42,
) -> pl.DataFrame:
    """Cross-validated target encoding with Bayesian smoothing.

    Each fold's encoding is computed from the other folds only,
    preventing the target from leaking into its own encoding.
    The smoothing parameter controls regularization toward the
    global mean for low-frequency categories.
    """
    encoded_col = f"{cat_col}_target_enc"
    global_mean = df[target_col].mean()

    # Initialize output with NaN; each row is filled exactly once below
    result = np.full(len(df), np.nan)
    indices = np.arange(len(df))

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    target_arr = df[target_col].to_numpy()
    cat_arr = df[cat_col].to_numpy()

    for train_idx, val_idx in kf.split(indices):
        # Compute encoding from training fold only
        train_cats = cat_arr[train_idx]
        train_targets = target_arr[train_idx]

        # Prior for this fold: mean of the training-fold targets only, so the
        # held-out rows' own targets never enter their encoding (the full-data
        # global_mean would include them)
        fold_prior = float(train_targets.mean())

        # Category-level aggregations on training fold: category -> (sum, count)
        cat_stats: dict[str, tuple[float, int]] = {}
        for cat, tgt in zip(train_cats, train_targets):
            total, count = cat_stats.get(cat, (0.0, 0))
            cat_stats[cat] = (total + tgt, count + 1)

        # Apply smoothed encoding to validation fold
        for idx in val_idx:
            cat = cat_arr[idx]
            if cat in cat_stats:
                cat_sum, cat_count = cat_stats[cat]
                # n * mean is just the category sum
                result[idx] = (cat_sum + smoothing * fold_prior) / (
                    cat_count + smoothing
                )
            else:
                # Category unseen in the training folds: fall back to the prior
                result[idx] = fold_prior

    return df.with_columns(pl.Series(name=encoded_col, values=result))

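With smoothing set to 10.0, a category with exactly 10 observations gets an encoding halfway between its own mean and the global mean. Categories with fewer observations sit closer to the global mean, and categories with many more sit closer to their own mean. For high-stakes applications (fraud, medical), consider increasing this to 20–50.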

The Encoding Toolkit

Target encoding is not the only option. Different encoding strategies make different tradeoffs:

Frequency encoding replaces each category with its frequency in the training set. No target leakage risk, and it captures the intuition that common categories behave differently from rare ones. Works well when frequency itself is predictive (popular products, busy merchants).
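A minimal sketch of frequency encoding, using only the standard library. The function names and the choice of 0.0 for unseen categories are assumptions for illustration; a small positive fallback is an equally reasonable choice.

```python
from collections import Counter


def fit_frequency_encoding(train_values: list[str]) -> dict[str, float]:
    """Map each category to its relative frequency in the training set."""
    counts = Counter(train_values)
    total = len(train_values)
    return {cat: count / total for cat, count in counts.items()}


def apply_frequency_encoding(
    values: list[str],
    freq_map: dict[str, float],
    unseen: float = 0.0,
) -> list[float]:
    """Look up training frequencies; categories not seen in training get `unseen`."""
    return [freq_map.get(v, unseen) for v in values]


# Hypothetical merchant IDs
train_merchants = ["a", "a", "a", "b", "c", "c"]
test_merchants = ["a", "c", "z"]  # "z" never appears in training

freq_map = fit_frequency_encoding(train_merchants)
test_encoded = apply_frequency_encoding(test_merchants, freq_map)
print(test_encoded)
```

The map is fit on the training set only and then applied unchanged to the test set, which mirrors how the target-encoding statistics are handled.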

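Hash encoding maps each category to a fixed-size vector using a hash function. You choose the output dimensionality (e.g., 64 bins for 10,000 categories). Collisions are inevitable: with 10,000 categories and 64 bins, roughly 156 categories share each bin on average. How much that hurts depends on the model and on how much signal individual categories carry, so treat the number of bins as a hyperparameter to validate. The key advantage: the encoding is deterministic and does not need to be fit on the training set, so it handles unseen categories at inference time with no code changes.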
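One way to sketch this with the standard library is shown below. The helper names are hypothetical, and SHA-256 is used only because it is stable across processes; Python's built-in hash() is salted per process for strings, so it would not give reproducible buckets.

```python
import hashlib


def hash_bucket(value: str, n_bins: int = 64) -> int:
    """Deterministically map a category string to a bucket in [0, n_bins)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer
    return int.from_bytes(digest[:8], "big") % n_bins


def hash_encode(value: str, n_bins: int = 64) -> list[int]:
    """One-hot vector over hash buckets rather than over categories."""
    vec = [0] * n_bins
    vec[hash_bucket(value, n_bins)] = 1
    return vec


# Nothing is fit: a category never seen in training still gets a bucket.
print(hash_bucket("90210"), hash_bucket("a-zip-never-seen-before"))
```

Because nothing is learned from the data, the same function can run unchanged at training and inference time.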

Leave-one-out encoding computes the target mean for each category excluding the current row. This reduces overfitting compared to naive target encoding but does not eliminate it — with very small categories, removing one observation changes the mean dramatically. Combine with smoothing for production use.
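A sketch of leave-one-out encoding combined with smoothing, written in NumPy for brevity. The function name is hypothetical, and using a leave-one-out global mean as the prior is a design choice made here so that no row's own target enters its encoding; it assumes smoothing > 0 and at least two rows.

```python
import numpy as np


def leave_one_out_encoding(cats, y, smoothing: float = 10.0) -> np.ndarray:
    """Smoothed target mean per category, excluding each row's own target."""
    cats = np.asarray(cats)
    y = np.asarray(y, dtype=float)

    # Integer code per row so per-category totals can use bincount
    _, codes = np.unique(cats, return_inverse=True)
    sums = np.bincount(codes, weights=y)
    counts = np.bincount(codes)

    # Category sum and count with the current row removed
    loo_sum = sums[codes] - y
    loo_n = counts[codes] - 1

    # Global mean with the current row removed, used as the prior
    prior = (y.sum() - y) / (len(y) - 1)

    # Singletons have loo_n == 0 and therefore fall back to the prior
    return (loo_sum + smoothing * prior) / (loo_n + smoothing)


enc = leave_one_out_encoding(["a", "a", "a", "b"], [1, 0, 1, 1], smoothing=1.0)
print(enc)
```

With the smoothing term in the denominator, a category that appears once is encoded as the prior instead of dividing by zero.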

Encoding	Leakage Risk	Handles Unseen	Dimensionality	Best For
One-hot	None	No (crash or ignore)	= cardinality	Low cardinality (<30)
Target (regularized)	Low (with CV + smoothing)	Yes (global mean fallback)	1 per column	High cardinality with strong target correlation
Frequency	None	No (zero or fallback)	1 per column	When frequency is intrinsically predictive
Hash	None	Yes (deterministic)	Configurable (e.g., 64)	Extremely high cardinality, production systems
Leave-one-out	Medium	Yes (global mean fallback)	1 per column	Moderate cardinality, when regularized

Categorical Encoding Pipeline

In production, you rarely use a single encoding. Here is a complete pipeline that selects the encoding strategy based on cardinality:

import polars as pl
import numpy as np
from sklearn.model_selection import KFold


def categorical_encoding_pipeline(
    train_df: pl.DataFrame,
    test_df: pl.DataFrame,
    cat_cols: list[str],
    target_col: str,
    cardinality_threshold: int = 30,
    smoothing: float = 10.0,
    n_splits: int = 5,
    seed: int = 42,
) -> tuple[pl.DataFrame, pl.DataFrame]:
    """Apply appropriate encoding based on cardinality.

    Low-cardinality columns get one-hot encoding.
    High-cardinality columns get cross-validated target encoding.
    """
    train_result = train_df.clone()
    test_result = test_df.clone()
    global_mean = train_df[target_col].mean()

    for col in cat_cols:
        n_unique = train_df[col].n_unique()

        if n_unique <= cardinality_threshold:
            # One-hot encode: safe at low cardinality
            unique_vals = train_df[col].unique().to_list()
            for val in unique_vals:
                indicator_name = f"{col}_{val}"
                train_result = train_result.with_columns(
                    (pl.col(col) == val).cast(pl.Int8).alias(indicator_name)
                )
                test_result = test_result.with_columns(
                    (pl.col(col) == val).cast(pl.Int8).alias(indicator_name)
                )
        else:
            # Target encode with CV + smoothing
            enc_name = f"{col}_target_enc"

            # Train: out-of-fold encoding
            train_enc = np.full(len(train_result), np.nan)
            cat_arr = train_result[col].to_numpy()
            target_arr = train_result[target_col].to_numpy()
            kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)

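            for trn_idx, val_idx in kf.split(cat_arr):
                # Per-fold prior from the training folds only, so held-out
                # targets do not enter their own encoding
                fold_prior = float(target_arr[trn_idx].mean())

                # category -> [target sum, count] on the training folds
                stats: dict = {}
                for i in trn_idx:
                    c = cat_arr[i]
                    stats.setdefault(c, [0.0, 0])
                    stats[c][0] += target_arr[i]
                    stats[c][1] += 1

                for i in val_idx:
                    c = cat_arr[i]
                    if c in stats:
                        s, n = stats[c]
                        # n * mean is just the category sum s
                        train_enc[i] = (s + smoothing * fold_prior) / (n + smoothing)
                    else:
                        train_enc[i] = fold_prior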

            train_result = train_result.with_columns(
                pl.Series(name=enc_name, values=train_enc)
            )

            # Test: full training set statistics
            full_stats: dict = {}
            for c, t in zip(cat_arr, target_arr):
                full_stats.setdefault(c, [0.0, 0])
                full_stats[c][0] += t
                full_stats[c][1] += 1

            test_cat = test_result[col].to_numpy()
            # n * mean equals the category sum, so the smoothed encoding is
            # (sum + m * global_mean) / (n + m); unseen categories get global_mean
            test_enc = np.array([
                (full_stats[c][0] + smoothing * global_mean)
                / (full_stats[c][1] + smoothing)
                if c in full_stats else global_mean
                for c in test_cat
            ])
            test_result = test_result.with_columns(
                pl.Series(name=enc_name, values=test_enc)
            )

    return train_result, test_result

Notice that the test set is encoded using the full training set statistics, not its own data. This is critical — computing target statistics on the test set would be target leakage.

[Figure: Target Encoding]

The diagram above shows how cross-validated target encoding prevents information leakage: each fold’s encoding is computed exclusively from the remaining folds, and the test set is encoded from the full training set.

4.2 — Time-Series Features: Extracting Temporal Signal Without Lookahead

Time-series data is seductive. It is dense with patterns — trends, cycles, seasonality, autocorrelation — and the features practically write themselves: lag the target by one period, compute a rolling average, add day-of-week dummies. The danger is equally dense: every one of these operations can silently introduce lookahead bias, where your features use information from the future to predict the past.

The Cardinal Sin: Shuffled Splits on Temporal Data

When you call train_test_split(X, y, test_size=0.2, shuffle=True) on time-series data, you create a test set whose rows are scattered across your timeline. The training set then includes rows from after many of the test rows, rows that contain future information. Your model’s validation metrics reflect its ability to interpolate rather than extrapolate, and they will often be dramatically optimistic compared to real-world performance.

The rule is non-negotiable: temporal data requires temporal splits. Training data comes from before the test data. Always. A gap period between training and test handles label delay (fraud reports, outcomes that take days or weeks to materialize).
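A temporal split with a gap can be sketched in a few lines. The function name, the cutoff date, and the 14-day gap are hypothetical; the right gap depends on how long your labels take to arrive.

```python
from datetime import date, timedelta


def temporal_split(
    dates: list[date],
    cutoff: date,
    gap_days: int = 0,
) -> tuple[list[int], list[int]]:
    """Train on rows strictly before `cutoff`; test on rows at or after
    `cutoff + gap_days`. Rows inside the gap are dropped from both sets."""
    test_start = cutoff + timedelta(days=gap_days)
    train_idx = [i for i, d in enumerate(dates) if d < cutoff]
    test_idx = [i for i, d in enumerate(dates) if d >= test_start]
    return train_idx, test_idx


# 100 consecutive days starting 2025-01-01
days = [date(2025, 1, 1) + timedelta(days=i) for i in range(100)]
train_idx, test_idx = temporal_split(days, cutoff=date(2025, 3, 1), gap_days=14)

print(len(train_idx), "train rows;", len(test_idx), "test rows")
print("last train day:", days[train_idx[-1]], "| first test day:", days[test_idx[0]])
```

The rows that fall inside the gap are discarded on purpose: their labels would not yet have been known at the moment the model was trained.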

Lag Features and Rolling Statistics

Lag features encode the assumption that recent history is predictive of the present. A customer who made 5 purchases in the last 7 days behaves differently from one who made 5 purchases in the last 90 days. Rolling statistics — mean, standard deviation, min, max over a sliding window — capture the distribution of recent behavior rather than a single snapshot.

The critical detail: the window must be backward-looking only. A rolling mean such as pandas' df["value"].rolling(7).mean() uses a trailing window by default, which is the right direction. But a trailing window is only backward-looking if the rows are in chronological order: on an unsorted DataFrame, or one sorted on the wrong key, the window can include future values. A trailing window also still includes the current row, which the code below handles with shift(1).

import polars as pl
import numpy as np
from datetime import date, timedelta


def build_temporal_features(
    df: pl.DataFrame,
    date_col: str,
    value_col: str,
    group_col: str | None = None,
    lags: list[int] | None = None,
    rolling_windows: list[int] | None = None,
) -> pl.DataFrame:
    """Build time-series features with strict backward-looking semantics.

    All features are computed using only past data relative to each row.
    No future information leaks into any feature.

    Args:
        df: Input DataFrame, must be sorted by date_col (ascending).
        date_col: Name of the date/datetime column.
        value_col: Column to compute features from.
        group_col: Optional grouping column (e.g., customer_id).
        lags: List of lag periods (e.g., [1, 7, 14, 28]).
        rolling_windows: List of rolling window sizes (e.g., [7, 14, 30]).

    Returns:
        DataFrame with original columns plus temporal features.
    """
    if lags is None:
        lags = [1, 7, 14, 28]
    if rolling_windows is None:
        rolling_windows = [7, 14, 30]

    # Optional grouping: when set, lags and windows are computed per group
    over = [group_col] if group_col else []

    # Lag features: value at t-k. Sorting and feature computation happen
    # in a single with_columns call at the end of the function.
    lag_exprs = []
    for lag in lags:
        expr = pl.col(value_col).shift(lag)
        if over:
            expr = expr.over(over)
        lag_exprs.append(expr.alias(f"{value_col}_lag_{lag}"))

    rolling_exprs = []
    for window in rolling_windows:
        # Rolling mean — shift(1) ensures we exclude the current row
        mean_expr = pl.col(value_col).shift(1).rolling_mean(window_size=window)
        std_expr = pl.col(value_col).shift(1).rolling_std(window_size=window)
        min_expr = pl.col(value_col).shift(1).rolling_min(window_size=window)
        max_expr = pl.col(value_col).shift(1).rolling_max(window_size=window)

        if over:
            mean_expr = mean_expr.over(over)
            std_expr = std_expr.over(over)
            min_expr = min_expr.over(over)
            max_expr = max_expr.over(over)

        rolling_exprs.extend([
            mean_expr.alias(f"{value_col}_rolling_mean_{window}"),
            std_expr.alias(f"{value_col}_rolling_std_{window}"),
            min_expr.alias(f"{value_col}_rolling_min_{window}"),
            max_expr.alias(f"{value_col}_rolling_max_{window}"),
        ])

    # Exponential moving average via Polars' built-in ewm_mean;
    # shift(1) excludes the current row, as with the rolling windows
    for span in [7, 21]:
        ema_expr = pl.col(value_col).shift(1).ewm_mean(span=span)
        if over:
            ema_expr = ema_expr.over(over)
        rolling_exprs.append(
            ema_expr.alias(f"{value_col}_ema_{span}")
        )

    result = df.sort(date_col).with_columns(lag_exprs + rolling_exprs)
    return result


# Build features on synthetic daily sales data
rng = np.random.default_rng(42)
n_days = 365
dates = [date(2025, 1, 1) + timedelta(days=i) for i in range(n_days)]

sales_df = pl.DataFrame({
    "date": dates,
    "daily_sales": (
        100
        + 20 * np.sin(2 * np.pi * np.arange(n_days) / 7)      # weekly cycle
        + 5 * np.sin(2 * np.pi * np.arange(n_days) / 365.25)   # yearly cycle
        + rng.normal(0, 8, n_days)                               # noise
    ),
})

featured = build_temporal_features(
    sales_df,
    date_col="date",
    value_col="daily_sales",
    lags=[1, 7, 14],
    rolling_windows=[7, 14, 30],
)
print(featured.head(35).select(
    "date", "daily_sales",
    "daily_sales_lag_1", "daily_sales_lag_7",
    "daily_sales_rolling_mean_7", "daily_sales_ema_7",
))

The shift(1) before each rolling calculation is essential. Without it, the rolling mean at time $t$ includes the value at time $t$ — which means the feature encodes the current observation. If daily_sales is your target (or derived from your target), this is direct leakage.
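The effect is easy to see on a tiny series. The helper below is a plain NumPy stand-in for the Polars expressions above, written only to make the two window definitions explicit; the sales values are made up.

```python
import numpy as np


def trailing_mean(values, window: int, exclude_current: bool) -> np.ndarray:
    """Trailing mean over `window` rows; NaN until a full window is available."""
    values = np.asarray(values, dtype=float)
    out = np.full(len(values), np.nan)
    for t in range(len(values)):
        # Slice end is exclusive: t excludes row t, t + 1 includes it
        end = t if exclude_current else t + 1
        start = end - window
        if start >= 0:
            out[t] = values[start:end].mean()
    return out


# Three quiet days followed by a spike on the last day
sales = np.array([10.0, 10.0, 10.0, 100.0])

leaky = trailing_mean(sales, window=3, exclude_current=False)
safe = trailing_mean(sales, window=3, exclude_current=True)

print("includes current row:", leaky)
print("excludes current row:", safe)
```

On the spike day the window that includes the current row already reports 40.0, while the shifted window still reports 10.0: only the second could have been computed before the day's sales were known.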

Lookahead Bias: How Future Information Sneaks In

Lookahead bias is not limited to rolling windows. It appears in at least four forms:

Shuffled train/test splits. Covered above. Always split temporally.
Global statistics computed before splitting. If you standardize features using the full dataset’s mean and standard deviation before the temporal split, the training set’s standardization includes information from the test period.
Lag features across group boundaries. If you compute shift(1) without grouping by entity, the last row of customer A’s history “leaks” into the first row of customer B’s features. Always specify the group column.
Target encoding on time-series. Cross-validated target encoding (§4.1) uses random folds, which mixes future and past. For time-series data, you must use expanding-window target encoding: the encoding for each row uses only data from before that row’s timestamp.
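The expanding-window idea from the last item can be sketched as a single pass over chronologically ordered rows. This is an illustration under stated assumptions, not a production implementation: the function name and the default prior of 0.5 are hypothetical, rows are assumed to be strictly ordered in time with no tied timestamps, labels are assumed to be available immediately (no label delay), and smoothing must be greater than zero.

```python
import numpy as np


def expanding_target_encoding(
    cats,
    y,
    smoothing: float = 10.0,
    prior: float = 0.5,
) -> np.ndarray:
    """Encode each row from strictly earlier rows only.

    `cats` and `y` must already be in chronological order.
    """
    enc = np.empty(len(y), dtype=float)
    stats: dict = {}          # category -> (target sum, count) so far
    seen_sum, seen_n = 0.0, 0  # running totals over all earlier rows

    for i, (cat, target) in enumerate(zip(cats, y)):
        # Global mean of earlier rows; fall back to `prior` on the first row
        running_prior = seen_sum / seen_n if seen_n else prior
        s, n = stats.get(cat, (0.0, 0))

        # Encode BEFORE updating the statistics with this row's target
        enc[i] = (s + smoothing * running_prior) / (n + smoothing)

        stats[cat] = (s + target, n + 1)
        seen_sum += target
        seen_n += 1

    return enc


enc = expanding_target_encoding(["a", "a", "b", "a"], [1, 1, 0, 0], smoothing=1.0)
print(enc)
```

Because every encoding is written before the row's own target is added to the running totals, changing a later target cannot alter any earlier encoding.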

Seasonality Decomposition: STL

When your data contains strong seasonal patterns — weekly sales cycles, monthly billing patterns, annual weather effects — decomposing the series into trend, seasonal, and residual components gives your model cleaner inputs than the raw series.

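STL (Seasonal and Trend decomposition using Loess) is the standard tool. It is non-parametric (no distributional assumptions), robust to outliers when robust fitting is enabled, and lets the seasonal component change over time. A single STL fit models one seasonal period; for series with several (say, weekly and yearly), statsmodels provides MSTL.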

import polars as pl
import numpy as np
from datetime import date, timedelta
from statsmodels.tsa.seasonal import STL


def stl_decompose_features(
    df: pl.DataFrame,
    date_col: str,
    value_col: str,
    period: int = 7,
    robust: bool = True,
) -> pl.DataFrame:
    """Extract trend, seasonal, and residual components via STL.

    These three components replace the raw series in your feature matrix,
    giving the model cleaner inputs to work with.

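    Args:
        df: Input DataFrame; it is sorted by date_col before decomposition.
        date_col: Name of the date/datetime column used for ordering.
        value_col: The series to decompose.
        period: Seasonal period in observations (7 for a weekly cycle in
            daily data, 12 for a yearly cycle in monthly data, 365 for a
            yearly cycle in daily data).
        robust: If True, uses robust fitting that downweights outliers.
    """
    # STL assumes evenly spaced observations in time order
    df = df.sort(date_col)
    series = df[value_col].to_numpy()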

    stl = STL(series, period=period, robust=robust)
    result = stl.fit()

    return df.with_columns(
        pl.Series(name=f"{value_col}_trend", values=result.trend),
        pl.Series(name=f"{value_col}_seasonal", values=result.seasonal),
        pl.Series(name=f"{value_col}_residual", values=result.resid),
    )


# Decompose the sales data from above
decomposed = stl_decompose_features(
    sales_df, date_col="date", value_col="daily_sales", period=7
)

print("STL decomposition (first 14 days):")
print(decomposed.head(14).select(
    "date", "daily_sales",
    "daily_sales_trend", "daily_sales_seasonal", "daily_sales_residual"
))

A word of caution: STL decomposition on the full dataset (train + test) introduces lookahead bias, because the trend estimate at each point uses the entire series. For production use, fit STL on the training period only and apply the learned seasonal pattern to the test period, or use an incremental decomposition approach.

The features from this section — regularized target encodings, temporal lag features, rolling statistics, and seasonality components — form the backbone of most competitive tabular ML pipelines. In the next section, we move from structured data to unstructured text and address what to do when your feature space has grown beyond your model’s capacity to exploit it.