pragmatic data science with python

Deep Learning Where It Counts

Chapter 16 of 33
Summary

Deep learning is the most overapplied tool in modern data science. On tabular data, gradient-boosted trees win — the evidence from Chapter 5 is unambiguous. But the moment your data is images, text, audio, sequences, or any combination of modalities, trees are not a contender. Deep learning is not optional — it is the only viable approach. This chapter draws a clean decision boundary: when DL genuinely earns its complexity cost, and when you are paying GPU bills for no measurable gain. We cover PyTorch from the ground up (the 20% you use 80% of the time), confront the narrow cases where tabular deep learning edges out trees, build transfer learning pipelines that leverage billions of parameters you did not have to train, and close with the hardware realities that determine whether your model ships or sits in a notebook.

Deep Learning Where It Counts

Most data scientists reach for deep learning too early. There is an entire industry built on the assumption that neural networks are the default — GPU cloud providers, framework evangelists, conference keynote speakers whose careers depend on you believing that every problem needs a transformer. Chapter 5 showed you the reality for tabular data: XGBoost wins. Consistently, decisively, and at a fraction of the cost.

So why does this chapter exist?

Because the moment your data stops being a flat table of numeric and categorical columns, trees cannot help you. A random forest cannot look at a chest X-ray and identify pneumonia. XGBoost cannot read a legal contract and extract liability clauses. LightGBM cannot listen to an audio recording and detect speaker emotion. These are not edge cases — they are some of the highest-value problems in applied machine learning. And for every one of them, deep learning is not an option among alternatives. It is the only viable approach.

The distinction is clean: data modality determines model family. Tabular data → trees. Images, text, audio, video, sequences, multi-modal inputs → deep learning. The exceptions exist (tabular DL on very large datasets, tree-based image features), but they are narrow and we will quantify them precisely.

This framing matters because it protects you from two failure modes. The first: using deep learning on tabular data because it feels more sophisticated, then spending three weeks tuning a transformer that underperforms a 10-minute XGBoost run. The second: refusing to use deep learning on image data because “trees are always better,” then delivering a system that cannot compete with a fine-tuned ResNet that a junior engineer could deploy in an afternoon. Both failure modes stem from treating model selection as a preference rather than an engineering decision driven by data characteristics.

The Decision Framework

Before you write a single line of PyTorch, run through this checklist. If you cannot answer all four questions, you are not ready to start building. If the answers point to trees, use trees — no matter how much pressure you feel to do something more impressive.

1. What is your data modality?

  • Structured tabular data (rows and columns, mixed types): Start with XGBoost. Deep learning is a stretch goal, not a starting point.
  • Images, video: Deep learning. No alternative exists that competes.
  • Text (documents, queries, conversations): Deep learning. Bag-of-words features fed to trees work for simple classification, but any task requiring understanding needs a neural model.
  • Audio, speech: Deep learning. Spectrogram + CNN or end-to-end models like Whisper.
  • Sequences with temporal structure (sensor data, event logs): Deep learning has an edge when sequence length exceeds what manual feature engineering can capture.
  • Multi-modal (image + text, tabular + text): Deep learning. Trees cannot fuse modalities.

2. How much data do you have?

  • Under 10,000 samples: Transfer learning or do not use deep learning at all. Training from scratch with this little data is a recipe for overfitting.
  • 10,000–1,000,000 samples: Deep learning works if the modality demands it. For tabular data, trees still win here.
  • Over 1,000,000 samples: Deep learning starts to show advantages even on tabular data, particularly with entity embeddings and multi-task objectives.

3. What hardware is available?

  • No GPU: Stick to trees, or use pre-trained models for inference only, quantized so they run acceptably on CPU. Training a real neural network on CPU is not merely slow; it is non-viable.
  • Single GPU (8–24 GB VRAM): Sufficient for fine-tuning pre-trained models and training medium-scale models from scratch.
  • Multi-GPU or cloud instances: Required for training large models or running experiments at scale.

4. What is your iteration budget?

  • Need results this week: Trees. You can run hundreds of experiments in an afternoon on a laptop.
  • Have two to four weeks: Deep learning is feasible if the modality demands it. Budget time for data pipeline construction, hyperparameter search, and debugging training instability.
  • Research timeline (months): You can explore custom architectures, pre-training on domain data, and architecture search. Most production teams do not have this luxury.

[Figure: Deep Learning Decision Framework]
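The four questions above can be sketched as a small function. This is a minimal illustration, not a definitive rule engine: the `Project` fields, the `recommend` name, the returned strings, and the order in which the checks run are all assumptions made for the example. Only the numeric thresholds (10,000 samples, 1,000,000 samples, two weeks) come from the checklist.

```python
from dataclasses import dataclass

# Modalities for which the checklist defaults to deep learning.
DL_MODALITIES = {"image", "video", "text", "audio", "sequence", "multimodal"}


@dataclass
class Project:
    """Hypothetical container for the four checklist answers."""
    modality: str           # "tabular" or one of DL_MODALITIES
    n_samples: int          # labeled examples available
    has_gpu: bool           # is any GPU available for training?
    weeks_available: float  # iteration budget in weeks


def recommend(p: Project) -> str:
    """Return a coarse recommendation mirroring the checklist."""
    if p.modality == "tabular":
        # Trees are the default; tabular DL only becomes worth testing
        # with over a million rows, a GPU, and at least two weeks.
        if p.n_samples > 1_000_000 and p.has_gpu and p.weeks_available >= 2:
            return "trees first, tabular deep learning as a stretch goal"
        return "trees"
    if p.modality not in DL_MODALITIES:
        raise ValueError(f"unknown modality: {p.modality!r}")
    if not p.has_gpu:
        return "pre-trained model, CPU inference only"
    # Assumption: with little data or little time, fine-tuning a
    # pre-trained model is the only realistic deep learning route.
    if p.n_samples < 10_000 or p.weeks_available < 2:
        return "transfer learning"
    return "deep learning"


if __name__ == "__main__":
    print(recommend(Project("tabular", 50_000, False, 0.5)))
    print(recommend(Project("image", 5_000, True, 4)))
```

Encoding the checklist this way forces each answer to be explicit: a project whose modality or data size is unknown cannot even be constructed.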

Hardware Reality Check

A GPU is not a magic accelerator. It is a massively parallel processor optimized for the specific computation pattern that neural networks require: dense matrix multiplications on floating-point tensors. An NVIDIA A100 delivers up to ~312 TFLOPS of dense float16 Tensor Core throughput, roughly 100x what a modern CPU achieves. But that is a peak figure, and it only materializes when your computation pattern matches the hardware: large batch sizes, regular tensor shapes, and operations that saturate the GPU’s streaming multiprocessors.

The cost is real. An A100 instance on AWS (p4d.24xlarge, 8 × A100 40GB) runs $32.77/hour on-demand. A T4 instance (g4dn.xlarge, 1 GPU) is $0.526/hour. The roughly 60x gap in instance price reflects genuine performance differences, and also the fact that the larger instance carries eight GPUs. If your model fits on a T4, the A100 instance is not 60x faster for your workload. Matching hardware to workload is an engineering decision that directly affects your project budget.
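The arithmetic behind that comparison is worth writing down. The sketch below uses only the two on-demand prices quoted above. The 10-hour T4 job is a hypothetical workload chosen for illustration, and the helper name is an assumption.

```python
# On-demand prices quoted in the text (USD per instance-hour).
P4D_PRICE, P4D_GPUS = 32.77, 8     # p4d.24xlarge, eight A100 GPUs
G4DN_PRICE, G4DN_GPUS = 0.526, 1   # g4dn.xlarge, one T4 GPU


def price_per_gpu_hour(instance_price: float, n_gpus: int) -> float:
    """Instance price divided evenly across its GPUs."""
    return instance_price / n_gpus


# Ratio of instance prices (the "60x" in the text).
instance_ratio = P4D_PRICE / G4DN_PRICE

# Ratio per GPU, which is the fairer like-for-like comparison.
per_gpu_ratio = (price_per_gpu_hour(P4D_PRICE, P4D_GPUS)
                 / price_per_gpu_hour(G4DN_PRICE, G4DN_GPUS))

# Hypothetical job: 10 hours on the T4 instance.
t4_hours = 10.0
t4_cost = G4DN_PRICE * t4_hours

# The p4d instance is only cheaper if it finishes the same job
# in fewer hours than this break-even value.
break_even_hours = t4_cost / P4D_PRICE

if __name__ == "__main__":
    print(f"instance price ratio: {instance_ratio:.1f}x")
    print(f"per-GPU price ratio:  {per_gpu_ratio:.1f}x")
    print(f"T4 job cost:          ${t4_cost:.2f}")
    print(f"p4d break-even time:  {break_even_hours * 60:.1f} minutes")
```

Under these assumptions, the larger instance has to finish a 10-hour T4 job in under ten minutes just to match its cost, which a model that already fits on a T4 is unlikely to achieve.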

There is a second cost that rarely appears in blog posts: engineer time. A gradient-boosted tree trains in seconds on a laptop CPU. You can run 100 experiments in an afternoon, iterate on feature engineering, and ship by the end of the week. A deep learning model takes minutes to hours per training run on a GPU. Hyperparameter sweeps multiply that by 50–200x. Debugging a training run that diverges at epoch 37 costs half a day. The total cost of a deep learning solution is not just compute dollars — it is the weeks of engineering time that trees do not require. This overhead is justified when deep learning delivers a capability that trees cannot (understanding images, parsing language, processing audio). It is not justified when your data is a CSV with 40 columns.
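A rough way to see the iteration gap is to multiply it out. In the sketch below, the per-run times (30 seconds for a tree model, 30 minutes for a neural network) are hypothetical values chosen to match the "seconds" and "minutes to hours" ranges above, and `sweep_hours` is an illustrative helper, not part of any library.

```python
import math


def sweep_hours(minutes_per_run: float, n_runs: int,
                parallel_workers: int = 1) -> float:
    """Wall-clock hours for a sweep, with runs spread evenly over workers."""
    rounds = math.ceil(n_runs / parallel_workers)
    return rounds * minutes_per_run / 60


# Hypothetical per-run times, in minutes.
TREE_MINUTES = 0.5   # 30 seconds on a laptop CPU
DL_MINUTES = 30.0    # one training run on a single GPU

tree_sweep = sweep_hours(TREE_MINUTES, n_runs=100)
dl_sweep = sweep_hours(DL_MINUTES, n_runs=100)
dl_sweep_4gpu = sweep_hours(DL_MINUTES, n_runs=100, parallel_workers=4)

if __name__ == "__main__":
    print(f"100 tree runs:            {tree_sweep:.2f} h")
    print(f"100 DL runs, 1 GPU:       {dl_sweep:.1f} h")
    print(f"100 DL runs, 4 GPUs:      {dl_sweep_4gpu:.1f} h")
```

With these numbers, the tree sweep fits inside an hour while the same sweep for the neural network takes more than two days on one GPU, before any time spent debugging.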

The practical implication: before you provision a GPU instance, write a one-paragraph justification. State what your data modality is, why trees cannot solve the problem, what pre-trained model you plan to start from, and how many GPU-hours you estimate the project will require. If you cannot write that paragraph, you are not ready to use deep learning. If someone else is asking you to use deep learning and cannot answer those questions, push back. The cost of a wrong model choice is not a slow training run — it is a project that ships late, over budget, and underperforms the XGBoost baseline you should have used.

The PyTorch Ecosystem

This chapter uses PyTorch exclusively. The core library covers model definition (torch.nn), automatic differentiation (torch.autograd), optimization (torch.optim), and data loading (torch.utils.data). Beyond the core:

  • torchvision: Pre-trained image models (ResNet, EfficientNet, ViT), image transforms, and standard datasets. We use it for transfer learning in Section 6.3.
  • torchaudio: Audio processing and pre-trained speech models. Outside this chapter’s scope but follows the same patterns.
  • torch.amp: Automatic mixed precision for float16/bfloat16 training. Covered in Section 6.4.
  • Lightning / Fabric: Higher-level training loop abstractions. Useful for production codebases, but this chapter teaches the raw loop because you need to understand what the abstraction hides before you adopt it.
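To show how the four core modules fit together, here is a minimal raw training loop on synthetic data. It is a sketch under stated assumptions: the regression target, batch size, learning rate, and epoch count are arbitrary illustrative choices, and it runs on CPU.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # make the synthetic data and shuffling repeatable

# Synthetic regression data (illustrative only): y = 3*x0 - 2*x1 + noise.
X = torch.randn(512, 2)
y = X @ torch.tensor([3.0, -2.0]) + 0.1 * torch.randn(512)

# torch.utils.data: wrap tensors in a Dataset and batch them.
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

# torch.nn: model definition and loss.
model = nn.Linear(2, 1)
loss_fn = nn.MSELoss()

# torch.optim: parameter updates.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)


def run_epoch() -> float:
    """One pass over the data; returns the mean training loss."""
    total = 0.0
    for xb, yb in loader:
        optimizer.zero_grad()                     # clear old gradients
        loss = loss_fn(model(xb).squeeze(1), yb)  # forward pass
        loss.backward()                           # torch.autograd computes gradients
        optimizer.step()                          # apply the update
        total += loss.item() * len(xb)
    return total / len(loader.dataset)


losses = [run_epoch() for _ in range(5)]

if __name__ == "__main__":
    for epoch, value in enumerate(losses, start=1):
        print(f"epoch {epoch}: loss {value:.4f}")
```

Every higher-level abstraction wraps these same five steps per batch: zero the gradients, run the forward pass, compute the loss, backpropagate, and step the optimizer.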

What This Chapter Covers

This chapter is structured in two parts:

Sections 6.1–6.2: PyTorch Fundamentals and Tabular Deep Learning. The 20% of PyTorch you use 80% of the time: tensors, autograd, Datasets, DataLoaders, and the training loop. Then we confront tabular deep learning honestly: where entity embeddings earn their complexity cost, and where XGBoost remains the better choice.

Sections 6.3–6.4: Transfer Learning and Hardware Realities. How to leverage pre-trained models so you inherit billions of parameters without paying the training cost. Then the engineering constraints that determine whether your model runs: GPU memory management, mixed precision training, gradient accumulation, and cost estimation before you commit resources.

By the end of this chapter, you will have a clear decision framework for when to use deep learning, the PyTorch skills to implement models when that decision is “yes,” and the hardware engineering knowledge to ensure your models actually train within your budget and timeline. Chapter 7 extends these foundations to NLP and large language models — the domain where deep learning has had its most dramatic impact.