Skip to main content
pragmatic data science with python

Linear Baselines and Gradient Boosted Trees

15 min read Chapter 14 of 33
Summary

This section covers the two foundational model families...

This section covers the two foundational model families for tabular data. We start with linear baselines — logistic and linear regression with L1, L2, and ElasticNet regularization — and demonstrate their value as diagnostic tools for feature selection, calibration checking, and establishing performance floors. We then move to gradient-boosted trees: the sequential ensemble mechanics of XGBoost and LightGBM, the ten hyperparameters that drive performance (and the fifty you can ignore), proper tuning with Optuna and early stopping, native categorical handling in LightGBM, and reliable feature importance via SHAP. Every technique includes runnable code with type hints and explicit guidance on when it applies.

Linear Baselines and Gradient Boosted Trees

5.1 — Linear Baselines

A linear model is not a stepping stone you rush past on the way to XGBoost. It is a diagnostic instrument. When you train a logistic regression before anything else, you get three things for free: a performance floor that any complex model must beat to justify its existence, interpretable coefficients that reveal which features the data actually supports, and calibrated probability estimates that serve as a reference for evaluating tree model outputs.

If your XGBoost model achieves 0.82 AUC and your logistic regression achieves 0.80 AUC, that two-point gap should give you pause. It means the non-linear interactions and threshold effects that trees capture are contributing very little. Your features are doing most of the work, and a linear model captures most of the signal. In that case, the linear model might be the right production choice — it is faster to train, easier to debug, and its predictions are directly interpretable.

If your XGBoost achieves 0.92 AUC and your logistic regression achieves 0.72 AUC, the twenty-point gap tells you that your features have strong non-linear structure. Trees are earning their complexity. Ship the tree model.

Regularization: L1, L2, and ElasticNet

Regularization penalizes large coefficients to prevent overfitting. The three standard penalties serve different purposes:

L1 (Lasso) adds the sum of absolute coefficient values to the loss function. The key property: L1 drives coefficients to exactly zero, performing automatic feature selection. If you have 200 features and suspect only 20 matter, L1 will zero out the rest.

L2 (Ridge) adds the sum of squared coefficient values. L2 shrinks coefficients toward zero but never reaches it. Its strength is handling multicollinearity — when features are correlated, L2 distributes the coefficient weight across them instead of assigning it arbitrarily to one.

ElasticNet combines both: an L1 component for sparsity and an L2 component for stability. When features are both numerous and correlated, ElasticNet gives you the best of both worlds.

The choice is not aesthetic. If you need to identify which features matter, use L1. If you need stable predictions with correlated features, use L2. If you have both problems, use ElasticNet.

Feature Selection with L1 Regularization

Here is the practical application: you have 200 features and need to find the ones that carry signal. Train a logistic regression with L1 penalty and let the regularizer do the work.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score


def l1_feature_selection(
    n_informative: int = 5,
    n_features: int = 200,
    n_samples: int = 5_000,
    random_state: int = 42,
) -> dict[str, list[int] | float]:
    """Use L1-penalized logistic regression to identify informative features.

    Generates a synthetic dataset with `n_informative` real features
    buried among `n_features` total features, then recovers them
    using L1 regularization.
    """
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=n_informative,
        n_redundant=10,
        n_clusters_per_class=2,
        random_state=random_state,
    )

    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(
            penalty="l1",
            solver="saga",
            C=0.1,          # Lower C = stronger regularization
            max_iter=5_000,
            random_state=random_state,
        )),
    ])

    # Cross-validated performance
    scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
    pipe.fit(X, y)

    # Extract non-zero coefficients — these are the selected features
    coefs = pipe.named_steps["clf"].coef_[0]
    selected = np.where(np.abs(coefs) > 1e-6)[0].tolist()
    importance = sorted(
        zip(selected, coefs[selected]),
        key=lambda x: abs(x[1]),
        reverse=True,
    )

    print(f"Features selected: {len(selected)} out of {n_features}")
    print(f"True informative features: indices 0–{n_informative - 1}")
    print(f"AUC: {scores.mean():.4f} ± {scores.std():.4f}")
    print("\nTop features by |coefficient|:")
    for idx, coef in importance[:10]:
        marker = " ← informative" if idx < n_informative else ""
        print(f"  Feature {idx:3d}: {coef:+.4f}{marker}")

    return {
        "selected_features": selected,
        "auc_mean": float(scores.mean()),
    }


# Run it:
result = l1_feature_selection()
# Typical output: L1 selects ~15-25 features, with the 5 informative
# features having the largest coefficients.

The critical detail: you must scale your features before L1 regularization. L1 penalizes the absolute magnitude of coefficients. If one feature has values in the range [0, 1] and another in [0, 1000], the regularizer will shrink the large-scale feature’s coefficient disproportionately — not because the feature is uninformative, but because its coefficient is numerically larger. StandardScaler equalizes the playing field.

Calibrated Probabilities

Logistic regression has a property that tree models lack: its predicted probabilities are inherently calibrated. When a logistic regression says “this customer has a 30% churn probability,” that means roughly 30% of customers with similar predictions actually churn. This is a direct consequence of the logistic loss function, which is a proper scoring rule.

Tree models — random forests, XGBoost, LightGBM — produce probability estimates that are not calibrated out of the box. A random forest’s “probabilities” are the fraction of trees voting for each class, which tends to cluster away from 0 and 1. XGBoost’s sigmoid-transformed outputs are better but still not calibrated in the statistical sense.

This matters when your application uses predicted probabilities as inputs to downstream decisions — expected value calculations, risk scoring, bid optimization. If you need calibrated probabilities from a tree model, you must apply post-hoc calibration (Platt scaling or isotonic regression). But first, verify that your logistic regression baseline is well-calibrated, so you have a reference point.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import calibration_curve
from sklearn.model_selection import train_test_split


def compare_calibration(
    n_samples: int = 20_000,
    random_state: int = 42,
) -> dict[str, np.ndarray]:
    """Compare calibration curves of logistic regression vs. random forest.

    Returns fraction_of_positives and mean_predicted_value for each model,
    suitable for plotting calibration diagrams.
    """
    X, y = make_classification(
        n_samples=n_samples,
        n_features=30,
        n_informative=10,
        random_state=random_state,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=random_state, stratify=y,
    )

    lr = LogisticRegression(max_iter=1_000, random_state=random_state)
    rf = RandomForestClassifier(
        n_estimators=200, max_depth=10, random_state=random_state,
    )

    lr.fit(X_train, y_train)
    rf.fit(X_train, y_train)

    lr_probs = lr.predict_proba(X_test)[:, 1]
    rf_probs = rf.predict_proba(X_test)[:, 1]

    # Calibration curves: fraction_of_positives vs mean_predicted_value
    lr_frac, lr_mean = calibration_curve(y_test, lr_probs, n_bins=10)
    rf_frac, rf_mean = calibration_curve(y_test, rf_probs, n_bins=10)

    print("Calibration comparison (ideal: fraction ≈ mean predicted)")
    print("\nLogistic Regression:")
    for frac, mean in zip(lr_frac, lr_mean):
        delta = abs(frac - mean)
        print(f"  Predicted: {mean:.3f}  Actual: {frac:.3f}  Gap: {delta:.3f}")

    print("\nRandom Forest (uncalibrated):")
    for frac, mean in zip(rf_frac, rf_mean):
        delta = abs(frac - mean)
        print(f"  Predicted: {mean:.3f}  Actual: {frac:.3f}  Gap: {delta:.3f}")

    return {
        "lr_fraction": lr_frac, "lr_mean": lr_mean,
        "rf_fraction": rf_frac, "rf_mean": rf_mean,
    }


# The logistic regression calibration gaps will be small (< 0.02).
# The random forest gaps will be larger, especially near 0 and 1.
result = compare_calibration()

The logistic regression curve will hug the diagonal — its predicted probabilities match observed frequencies. The random forest curve will bow inward, underconfident at the extremes. When you see this pattern with your own data, you know that the tree model’s raw probabilities need post-hoc calibration before they can be used for decision-making.


5.2 — Gradient Boosted Trees

Gradient boosting builds an ensemble sequentially. Each new tree is trained not on the original targets, but on the residuals — the errors — of the ensemble so far. The first tree captures the dominant patterns. The second tree corrects where the first tree was wrong. The third tree corrects where the first two were wrong. After hundreds of iterations, the ensemble has incrementally carved the prediction surface into a highly accurate approximation of the true function.

This is fundamentally different from a random forest, which trains trees independently in parallel and averages their predictions. Boosting’s sequential nature means each tree is specialized: it focuses on the examples that the current ensemble gets wrong. This specialization is what gives boosted trees their edge — and their susceptibility to overfitting if you do not control the learning rate and tree depth.

XGBoost vs. LightGBM vs. CatBoost

All three implement gradient-boosted decision trees. The algorithmic core is the same. The differences that matter in practice:

FeatureXGBoostLightGBMCatBoost
Tree growthLevel-wise (balanced)Leaf-wise (can be unbalanced)Oblivious trees (symmetric)
Categorical handlingRequires manual encodingNative via categorical_featureNative, ordered target statistics
Training speedFastFastest (2–5× XGBoost on large data)Slowest, but less tuning needed
GPU supportYesYesYes
Missing value handlingBuilt-in (learns optimal direction)Built-inBuilt-in
Default behaviorNeeds tuningNeeds tuningStrong defaults, less tuning

When to choose which: LightGBM for large datasets where training speed matters and you have the patience to tune. CatBoost when you have many categorical features and want strong defaults with minimal tuning. XGBoost as the reliable default when you want the broadest ecosystem support and the most documentation.

In practice, the performance differences between the three are within noise on most datasets. Choose based on your workflow preferences and run a quick comparison on your specific data if the decision matters.

The 10 Hyperparameters That Matter

Gradient-boosted tree libraries expose 50+ hyperparameters. Most of them do not meaningfully affect performance. Here are the 10 that do:

ParameterXGBoost nameLightGBM nameWhat it controlsTypical range
Learning rateeta / learning_ratelearning_rateStep size per tree0.01–0.3
Number of treesn_estimatorsn_estimatorsEnsemble size (use early stopping)100–10,000
Max depthmax_depthmax_depthTree complexity3–10
Min child weightmin_child_weightmin_child_samplesMinimum samples per leaf1–100
Subsamplesubsamplebagging_fractionRow sampling per tree0.5–1.0
Column subsamplecolsample_bytreefeature_fractionFeature sampling per tree0.5–1.0
L1 regularizationreg_alphalambda_l1Coefficient sparsity0–10
L2 regularizationreg_lambdalambda_l2Coefficient shrinkage0–10
Gamma / min split lossgammamin_gain_to_splitMinimum loss reduction for split0–5
Scale pos weightscale_pos_weightscale_pos_weightClass imbalance correctionratio of neg/pos

Everything else — max_bin, grow_policy, tree_method, max_leaves — has second-order effects that you tune only after the big 10 are set.

XGBoost with Optuna: Proper Hyperparameter Tuning

Manual grid search is dead. Optuna’s Bayesian optimization explores the hyperparameter space more efficiently than grid or random search, and its pruning mechanism kills unpromising trials early.

import numpy as np
import optuna
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score


def tune_xgboost_with_optuna(
    n_trials: int = 50,
    n_samples: int = 10_000,
    random_state: int = 42,
) -> dict:
    """Tune XGBoost hyperparameters using Optuna with early stopping.

    Returns the best parameters and the corresponding cross-validated AUC.
    """
    X, y = make_classification(
        n_samples=n_samples,
        n_features=50,
        n_informative=15,
        n_redundant=10,
        weights=[0.7, 0.3],
        random_state=random_state,
    )

    def objective(trial: optuna.Trial) -> float:
        params = {
            "n_estimators": 2_000,  # High ceiling — early stopping will cut it
            "learning_rate": trial.suggest_float(
                "learning_rate", 0.01, 0.3, log=True,
            ),
            "max_depth": trial.suggest_int("max_depth", 3, 10),
            "min_child_weight": trial.suggest_int("min_child_weight", 1, 50),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
            "colsample_bytree": trial.suggest_float(
                "colsample_bytree", 0.5, 1.0,
            ),
            "reg_alpha": trial.suggest_float("reg_alpha", 1e-3, 10, log=True),
            "reg_lambda": trial.suggest_float(
                "reg_lambda", 1e-3, 10, log=True,
            ),
            "gamma": trial.suggest_float("gamma", 0, 5),
            "random_state": random_state,
            "eval_metric": "auc",
            "early_stopping_rounds": 50,
            "verbosity": 0,
        }

        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
        scores: list[float] = []

        for train_idx, val_idx in cv.split(X, y):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train, y_val = y[train_idx], y[val_idx]

            model = xgb.XGBClassifier(**params)
            model.fit(
                X_train, y_train,
                eval_set=[(X_val, y_val)],
                verbose=False,
            )
            # Use the best iteration, not the last one
            y_prob = model.predict_proba(X_val)[:, 1]
            from sklearn.metrics import roc_auc_score
            scores.append(roc_auc_score(y_val, y_prob))

        return float(np.mean(scores))

    # Suppress Optuna's verbose logging
    optuna.logging.set_verbosity(optuna.logging.WARNING)
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)

    print(f"Best AUC: {study.best_value:.4f}")
    print(f"Best params:")
    for key, val in study.best_params.items():
        print(f"  {key}: {val}")

    return {
        "best_auc": study.best_value,
        "best_params": study.best_params,
    }


result = tune_xgboost_with_optuna(n_trials=30)

Three details in this code are non-negotiable:

  1. Early stopping, not fixed n_estimators. Set n_estimators high (2,000+) and let early_stopping_rounds halt training when validation performance plateaus. This decouples the number of trees from the learning rate — a low learning rate with many trees is almost always better than a high learning rate with few trees.

  2. Stratified k-fold, not random split. StratifiedKFold preserves the class distribution in each fold. With imbalanced data, a random split can produce folds where the minority class is nearly absent.

  3. Log-scale sampling for learning rate and regularization. These parameters span orders of magnitude. Sampling uniformly between 0.01 and 0.3 wastes most of the budget in the 0.1–0.3 range. Log-scale sampling distributes trials evenly across the magnitude range.

LightGBM with Native Categorical Support

LightGBM handles categorical features natively — no one-hot encoding, no target encoding, no ordinal encoding. You pass the raw categorical column (as integers or strings) and tell LightGBM which columns are categorical. Internally, it finds the optimal split by partitioning category values into two groups, which is more expressive than the single-threshold splits that ordinal encoding allows.

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score


def lgbm_with_categoricals(
    n_samples: int = 10_000,
    random_state: int = 42,
) -> dict[str, float]:
    """Train LightGBM with native categorical handling.

    Generates synthetic data with both numeric and categorical features
    to demonstrate native categorical support without manual encoding.
    """
    rng = np.random.default_rng(random_state)

    # Synthetic features: 10 numeric + 2 categorical
    X_numeric = rng.standard_normal((n_samples, 10))

    # Categorical: city (50 values) and product_type (8 values)
    city = rng.integers(0, 50, size=n_samples)
    product_type = rng.integers(0, 8, size=n_samples)

    X = np.column_stack([X_numeric, city, product_type])
    feature_names = [f"num_{i}" for i in range(10)] + ["city", "product_type"]

    # Target depends on numeric features AND categorical interactions
    logit = (
        0.5 * X_numeric[:, 0]
        - 0.3 * X_numeric[:, 1]
        + 0.1 * (city % 5 == 0).astype(float)  # Some cities are high-risk
        + 0.2 * (product_type > 5).astype(float)
    )
    y = (rng.random(n_samples) < 1 / (1 + np.exp(-logit))).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state, stratify=y,
    )

    # Create LightGBM dataset with categorical feature specification
    train_data = lgb.Dataset(
        X_train,
        label=y_train,
        feature_name=feature_names,
        categorical_feature=["city", "product_type"],  # <-- Native handling
    )
    val_data = lgb.Dataset(
        X_test, label=y_test, reference=train_data,
    )

    params = {
        "objective": "binary",
        "metric": "auc",
        "learning_rate": 0.05,
        "num_leaves": 31,
        "max_depth": -1,           # No limit — controlled by num_leaves
        "min_child_samples": 20,
        "feature_fraction": 0.8,
        "bagging_fraction": 0.8,
        "bagging_freq": 5,
        "verbose": -1,
    }

    callbacks = [
        lgb.early_stopping(stopping_rounds=50),
        lgb.log_evaluation(period=0),
    ]

    model = lgb.train(
        params,
        train_data,
        num_boost_round=1_000,
        valid_sets=[val_data],
        callbacks=callbacks,
    )

    y_prob = model.predict(X_test)
    auc = roc_auc_score(y_test, y_prob)
    print(f"AUC with native categoricals: {auc:.4f}")
    print(f"Best iteration: {model.best_iteration}")

    return {"auc": auc, "best_iteration": model.best_iteration}


result = lgbm_with_categoricals()

The advantage over manual encoding is not performance (though it can be slightly better) — it is simplicity. You do not need to maintain an encoding pipeline, handle unseen categories at inference time, or worry about target leakage from naively computed target encodings. LightGBM handles all of this internally.

Feature Importance: Why SHAP Is the Only Reliable Method

Tree models offer built-in feature importance metrics. You should not trust them.

Gain-based importance measures the total reduction in loss contributed by splits on each feature. The problem: it is biased toward high-cardinality features. A feature with 1,000 unique values has more splitting opportunities than a binary feature, so it accumulates higher importance even if it is less predictive.

Split-count importance measures how often each feature is used in splits. Same bias — high-cardinality features are split on more often because they offer more candidate thresholds.

SHAP (SHapley Additive exPlanations) is the only feature importance method grounded in game theory with guaranteed consistency properties. It measures each feature’s marginal contribution to every prediction, averaged over all possible feature coalitions. SHAP values are additive (they sum to the model’s output), locally accurate (they explain individual predictions), and consistent (if a feature’s true contribution increases, its SHAP value increases).

import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split


def shap_importance_analysis(
    n_samples: int = 5_000,
    random_state: int = 42,
) -> None:
    """Compute and display SHAP values for an XGBoost model.

    Generates a summary plot showing global feature importance and
    a dependence plot showing how one feature's effect varies.
    """
    X, y = make_classification(
        n_samples=n_samples,
        n_features=20,
        n_informative=8,
        n_redundant=4,
        random_state=random_state,
    )
    feature_names = [f"feature_{i}" for i in range(20)]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state,
    )

    model = xgb.XGBClassifier(
        n_estimators=300,
        max_depth=6,
        learning_rate=0.1,
        random_state=random_state,
        eval_metric="logloss",
    )
    model.fit(X_train, y_train)

    # TreeExplainer is exact and fast for tree models
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)

    # Global importance: mean |SHAP| per feature
    mean_abs_shap = np.abs(shap_values).mean(axis=0)
    ranked = sorted(
        zip(feature_names, mean_abs_shap),
        key=lambda x: x[1],
        reverse=True,
    )
    print("SHAP-based feature importance (mean |SHAP value|):")
    for name, importance in ranked[:10]:
        bar = "█" * int(importance * 50)
        print(f"  {name:>12s}: {importance:.4f}  {bar}")

    # Summary plot: shows feature importance AND feature effect direction
    # Each dot is one observation: x-position = SHAP value, color = feature value
    shap.summary_plot(shap_values, X_test, feature_names=feature_names, show=False)
    print("\n(Summary plot rendered — save with plt.savefig() in production)")

    # Dependence plot: how one feature's SHAP value varies with its value
    top_feature_idx = ranked[0][1]
    shap.dependence_plot(
        0, shap_values, X_test,
        feature_names=feature_names,
        show=False,
    )
    print("(Dependence plot rendered for top feature)")


shap_importance_analysis()

The SHAP summary plot gives you two pieces of information that gain-based importance cannot: the direction of each feature’s effect (does higher value → higher or lower prediction?) and the distribution of effects across observations (is the feature important for all observations, or only for a subset?). These are the insights that drive feature engineering iteration — they tell you not only which features matter, but how they matter.

Gradient Boosting Mechanics