The “You Probably Just Need XGBoost” Chapter

Here is the uncomfortable truth that no one at a deep learning conference will tell you: for tabular data, gradient-boosted trees win. Not sometimes. Not in certain niches. In the overwhelming majority of structured data problems — fraud detection, churn prediction, credit scoring, demand forecasting, clinical risk stratification — XGBoost or LightGBM will match or beat whatever neural architecture you throw at the problem. And they will do it in a fraction of the training time, with a fraction of the infrastructure cost, and with interpretability that no transformer variant can touch.

This is not just an opinion; the evidence is extensive. The 2022 paper “Why do tree-based models still outperform deep learning on tabular data?” by Grinsztajn et al. ran systematic benchmarks across 45 medium-sized datasets (on the order of 10,000 training samples) and found that tree ensembles remained state of the art. Kaggle competition results tell the same story: gradient-boosted trees appear in the winning solution of the majority of tabular competitions, year after year. Deep learning on tabular data has a narrow window of superiority (very large datasets with high-cardinality categorical features and complex feature interactions), and even there the margin is slim.

The implication for your work is direct: unless your data is images, audio, text, or video, your default model should be a gradient-boosted tree. If someone proposes a neural network for a tabular problem, the burden of proof is on them to demonstrate it outperforms XGBoost on your specific dataset with your specific evaluation metric. That demonstration almost never survives rigorous testing.

The Modeling Hierarchy

This does not mean you should skip straight to XGBoost for every problem. There is a hierarchy, and each level serves a purpose:

Level 1: Linear Baseline. Always start here. A logistic regression or linear regression trains in seconds to minutes, typically gives you reasonably well-calibrated probabilities (for classification), and serves as an indispensable diagnostic tool. If your XGBoost model cannot beat logistic regression, your features are the problem, not your model. The linear baseline also provides interpretable coefficients (on standardized features) that serve as a sanity check for feature importance.

Level 2: Gradient-Boosted Trees. This is your production model for 90% of tabular problems. Choose XGBoost, LightGBM, or CatBoost; the differences between them are real but rarely decisive. This is where you invest your hyperparameter tuning budget and your feature engineering effort.

Level 3: Deep Learning. Reserve this for problems where the data modality demands it (images, text, sequences) or where you have tens of millions of rows with complex feature interactions that trees cannot capture. If you reach for a neural network on a 50,000-row tabular dataset, you are optimizing for résumé padding, not model performance.

Level 4: Ensembles of the above. Stacking a linear model, a tree model, and a neural network can squeeze out the last 0.1% of performance. This is Kaggle territory: high effort, diminishing returns, and rarely worth the production complexity.

The hierarchy is not a suggestion. It is a diagnostic tool. Each level tells you something about your problem. If linear models work well, your problem has strong linear signal and does not need model complexity. If trees dramatically outperform linear models, your features have non-linear interactions and threshold effects that trees capture naturally. If deep learning outperforms trees, your data likely has structure (sequential, spatial, semantic) that flat feature vectors destroy.

[Figure: Model Selection Hierarchy]

Why Trees Win on Tabular Data

The dominance of gradient-boosted trees on tabular data is not accidental — it reflects a structural advantage that neural networks cannot easily replicate.

Trees partition the feature space with axis-aligned splits. Each split is a threshold on a single feature: “if income > $75,000, go left; otherwise, go right.” This matches how tabular features work: they are separate measurements on heterogeneous scales, not spatially correlated pixels or sequentially ordered tokens. A tree does not need to learn that feature 3 and feature 17 have no spatial relationship; it evaluates each feature on its own, by construction, and because a split depends only on the ordering of values, the scale of each feature does not matter.
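To make the idea of an axis-aligned split concrete, here is a minimal pure-Python sketch of how a single split could be chosen: try every midpoint between consecutive values of every feature and keep the one with the lowest weighted Gini impurity. The function names and the toy income/age data are hypothetical, and real libraries use much faster histogram-based or sorted-scan variants of this search.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 for an empty list)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_impurity) of the best split."""
    n = len(rows)
    best = (None, None, gini(labels))  # start from the unsplit impurity
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best


# Hypothetical toy data: columns are (income in dollars, age in years).
# No scaling is applied even though the two columns differ by orders of magnitude.
rows = [(40_000, 25), (52_000, 61), (68_000, 33),
        (81_000, 58), (95_000, 29), (120_000, 47)]
labels = [0, 0, 0, 1, 1, 1]

feature, threshold, impurity = best_split(rows, labels)
print(feature, threshold, impurity)  # -> 0 74500.0 0.0
```

The search picks the income column with a threshold of 74,500 because that split separates the two classes perfectly, while no single threshold on age does.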

Neural networks, by contrast, learn continuous transformations of the input vector. They excel when features have spatial structure (convolutional networks exploit pixel adjacency), temporal structure (recurrent networks exploit sequential ordering), or semantic structure (transformers exploit token relationships). On tabular data, where feature 3 might be “age in years” and feature 17 might be “zip code population density,” there is no inherent structure for the network to exploit. The network must learn from scratch what trees get for free: that each feature should be evaluated independently via threshold comparisons.

Trees also handle mixed feature types natively. Numeric features, ordinal features, and categorical features (with proper handling) all coexist in the same model without normalization, scaling, or embedding layers. Neural networks require you to engineer the input representation: normalize numerics, embed categoricals, and impute missing values as a preprocessing step. Implementations such as XGBoost and LightGBM handle missing values internally by learning, at each split, a default direction for rows where the feature is missing.
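The missing-value mechanism can be sketched in a few lines. The example below is a simplified illustration, not any library's actual implementation: for one fixed threshold it tries sending the rows with a missing value to each side in turn and keeps the side with the lower squared error. The function names and the toy data are hypothetical.

```python
import math


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def default_direction(feature, target, threshold):
    """Pick the side ('left' or 'right') that missing values should follow.

    Rows with a present value go left if value <= threshold, else right.
    Rows with a missing value (NaN) are tried on both sides; the side that
    yields the lower total squared error wins.
    """
    pairs = list(zip(feature, target))
    left = [y for x, y in pairs if not math.isnan(x) and x <= threshold]
    right = [y for x, y in pairs if not math.isnan(x) and x > threshold]
    missing = [y for x, y in pairs if math.isnan(x)]
    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    return "left" if loss_if_left <= loss_if_right else "right"


nan = float("nan")
feature = [1.0, 2.0, 3.0, 8.0, 9.0, nan, nan]
target = [10.0, 11.0, 9.0, 30.0, 31.0, 29.0, 32.0]

direction = default_direction(feature, target, threshold=5.0)
print(direction)  # -> right
```

Here the rows with a missing feature have targets that resemble the high-value group, so routing them right gives the lower loss. No imputation step is needed before training.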

The one area where neural networks have a structural advantage on tabular data is feature interaction discovery. In a tree, an interaction requires one split per participating feature along a single root-to-leaf path, so a three-way interaction needs a tree at least three levels deep. A neural network can, in principle, represent arbitrary feature interactions in a single hidden layer. This advantage matters on datasets with complex, high-order interactions across many features, but those datasets are rare relative to the vast majority of tabular problems where two-way and three-way interactions suffice.

What This Chapter Covers

We work through the hierarchy in order. Section 5.1 builds the linear baseline — not as a throwaway step, but as a first-class diagnostic tool that reveals feature quality, multicollinearity, and calibration properties. Section 5.2 covers gradient-boosted tree mechanics — how boosting works, which hyperparameters matter, and how to tune them efficiently with Optuna and early stopping. We also address feature importance properly, because gain-based importance is unreliable and SHAP is the only method you should trust.

Section 5.3 introduces monotonic constraints — the mechanism for injecting domain knowledge into tree models when the directional relationship between a feature and the target is known with certainty. Section 5.4 tackles imbalanced data, the problem that derails more production models than any other. We show why SMOTE usually fails, what actually works (class weights, threshold tuning, proper metrics), and how to build a complete pipeline for extreme class imbalance.

The Uncomfortable Checklist

Before you reach for a neural network on tabular data, answer these questions honestly:

Question	If Yes	If No
Do you have > 10 million training rows?	Deep learning might help	Trees will almost certainly win
Do your features have spatial or sequential structure?	Consider CNNs or RNNs	Trees handle unstructured features better
Have you already tuned XGBoost with Optuna and early stopping?	Compare fairly against deep learning	Tune XGBoost first — it might be all you need
Is your team experienced with PyTorch training loops, learning rate schedules, and GPU debugging?	Proceed with caution	The engineering overhead will dwarf any accuracy gain
Does a 0.2% accuracy improvement justify 10× training cost and 5× serving latency?	Build the neural network	Ship the tree model

Most honest answers to this checklist lead to the same place: gradient-boosted trees. The remainder of this chapter ensures you use them at a professional level.

Section	Topics	Key Insight
§5.1 Linear Baselines	L1/L2 regularization, feature selection, calibration	Linear models are diagnostic tools, not just weak baselines
§5.2 Gradient Boosted Trees	XGBoost, LightGBM, Optuna tuning, SHAP	10 hyperparameters matter; ignore the other 50
§5.3 Monotonic Constraints	Domain-enforced feature directions	Constraints improve interpretability with negligible accuracy cost
§5.4 Imbalanced Data	Class weights, threshold tuning, PR-AUC	SMOTE is the wrong default; class weights and threshold tuning are the right ones

Every code example in this chapter is designed to run end-to-end on synthetic or publicly available data. Copy the code, run it, inspect the outputs. The patterns generalize directly to your production datasets.
