Setting Up for Production, Not Just Notebooks - pragmatic data science with python • Dev|Journal

The 2 AM Production Call You Could Have Prevented

It’s a Tuesday night. Your phone buzzes. The nightly retraining pipeline (the one that “worked on my machine” for three months) has been failing silently for six hours. The model is serving stale predictions. Revenue is leaking.

You SSH into the production server and start debugging. The stack trace points to pandas: production is running 2.1.0, but your code was written against 2.0.3, the version that happened to be installed in your notebook environment. Nobody pinned it. The requirements.txt said pandas>=2.0, and the last deploy pulled a minor version bump that changed the return type of a DataFrame.groupby().apply() call. Your downstream code expected a DataFrame; it got a Series. No type checker caught it because there was no type checker. No test caught it because the test notebook “took too long to run” and was skipped.

This scenario is not hypothetical. It is the median outcome for teams that treat Python environments as disposable and notebooks as deployment artifacts.
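
One cheap guard against this kind of drift is a startup check that compares installed package versions with the versions the code was developed against. The sketch below uses only the standard library's importlib.metadata. The EXPECTED_PINS mapping and both function names are hypothetical, and in a real project the pins would be read from a lockfile rather than hardcoded.

```python
from __future__ import annotations

from importlib import metadata
from typing import Callable

# Hypothetical pins for illustration; normally derived from a lockfile.
EXPECTED_PINS = {"pandas": "2.0.3"}


def find_version_drift(
    pins: dict[str, str],
    get_version: Callable[[str], str] = metadata.version,
) -> dict[str, tuple[str, str | None]]:
    """Return {package: (expected, installed)} for every mismatch.

    `installed` is None when the package is not installed at all.
    `get_version` is injectable so the check can be tested without
    touching the real environment.
    """
    drift: dict[str, tuple[str, str | None]] = {}
    for package, expected in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            drift[package] = (expected, installed)
    return drift


def assert_no_drift(
    pins: dict[str, str],
    get_version: Callable[[str], str] = metadata.version,
) -> None:
    """Fail loudly at startup instead of silently at inference time."""
    drift = find_version_drift(pins, get_version)
    if drift:
        details = ", ".join(
            f"{name}: expected {expected}, found {installed or 'not installed'}"
            for name, (expected, installed) in sorted(drift.items())
        )
        raise RuntimeError(f"Dependency drift detected: {details}")
```

Calling assert_no_drift(EXPECTED_PINS) at the top of a pipeline entrypoint turns a six-hour silent failure into an immediate, readable error. It is a backstop, not a substitute for a locked environment.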

The Four Pillars of Production Readiness

Production data science code must survive contact with reality: different machines, different Python versions, different data distributions, and different engineers maintaining it six months from now. Four pillars hold that weight.

Pillar	Notebook Approach	Production Approach	Failure Mode When Missing
Dependency Management	`pip install` into global env	Locked environments with `uv` or Poetry	Silent version drift breaks inference
Type Safety	Runtime surprises	Type hints + Pydantic validation	Bad data corrupts models undetected
Repository Structure	Monolithic `.ipynb` files	Modular `src/` with clear boundaries	Untestable, unreviewable, unreproducible
Data Version Control	Files on someone’s desktop	DVC-tracked datasets with lineage	Cannot reproduce past results

Each pillar reinforces the others. Locked dependencies mean nothing if your repo structure makes it impossible to run tests. Type safety is theater if your data files aren’t versioned alongside the code that processes them. You need all four.
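
To make the type-safety pillar concrete, here is a minimal sketch of validation at the ingestion boundary. It deliberately uses a standard-library dataclass as a stand-in, not Pydantic, so that it runs with no third-party packages. Pydantic adds type coercion, schema generation, and richer error messages on top of the same idea. The SalesRecord fields and the parse_record helper are hypothetical.

```python
from __future__ import annotations

import math
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SalesRecord:
    """One row of a hypothetical daily-sales feed."""

    store_id: str
    day: date
    units_sold: int
    unit_price: float

    def __post_init__(self) -> None:
        # Reject values that are well-typed but meaningless for the model.
        if not self.store_id:
            raise ValueError("store_id must be non-empty")
        if self.units_sold < 0:
            raise ValueError(f"units_sold must be >= 0, got {self.units_sold}")
        if not math.isfinite(self.unit_price) or self.unit_price <= 0:
            raise ValueError(f"unit_price must be finite and > 0, got {self.unit_price}")


def parse_record(raw: dict[str, str]) -> SalesRecord:
    """Convert one raw CSV-style row (all strings) into a validated record.

    int(), float(), and date.fromisoformat() raise ValueError on malformed
    input, so bad rows fail here rather than deep inside feature code.
    """
    return SalesRecord(
        store_id=raw["store_id"].strip(),
        day=date.fromisoformat(raw["day"]),
        units_sold=int(raw["units_sold"]),
        unit_price=float(raw["unit_price"]),
    )
```

The design point is where the failure happens: a negative unit count or a NaN price raises at the loader, with a message naming the offending field, instead of flowing into training as a plausible-looking number.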

The Modern Python Data Science Toolchain

Here is what a well-structured ML project looks like in 2025. Study this layout — the rest of the chapter explains every decision behind it.

ml-forecasting/
├── pyproject.toml              # Single source of truth for deps + metadata
├── uv.lock                     # Deterministic dependency resolution
├── dvc.yaml                    # Pipeline definition: data → features → model
├── dvc.lock                    # Exact data/model versions
├── .python-version             # Pin Python version (e.g., 3.12)
├── src/
│   └── forecasting/
│       ├── __init__.py
│       ├── data/
│       │   ├── loader.py       # Pydantic-validated data ingestion
│       │   └── transforms.py   # Feature engineering (pure functions)
│       ├── models/
│       │   ├── train.py        # Training entrypoint
│       │   └── evaluate.py     # Metrics computation
│       └── serving/
│           ├── predict.py      # Inference API
│           └── schemas.py      # Request/response Pydantic models
├── configs/
│   ├── train.yaml              # Hyperparameters (never hardcoded)
│   └── features.yaml           # Feature definitions
├── data/
│   ├── raw/                    # DVC-tracked, immutable
│   └── processed/              # DVC-tracked, reproducible
├── models/                     # DVC-tracked serialized models
├── notebooks/                  # Exploration ONLY — never production code
│   └── 01_eda.ipynb
└── tests/
    ├── test_loader.py
    ├── test_transforms.py
    └── test_model.py

Figure: ML project architecture

Notice what is not in this layout: there is no train_and_evaluate_and_deploy_v3_final_FINAL.ipynb. Notebooks live in a quarantine zone labeled “exploration only.” Production logic lives in importable Python modules with type hints, tests, and clear interfaces.
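
As an illustration of what "importable modules with type hints, tests, and clear interfaces" can look like, the sketch below shows the kind of pure function that might live in src/forecasting/data/transforms.py, together with the test that might live in tests/test_transforms.py. The rolling_mean function is a hypothetical example, not a function defined elsewhere in this chapter, and both pieces are shown in one block only for brevity.

```python
from __future__ import annotations

from collections.abc import Sequence


def rolling_mean(values: Sequence[float], window: int) -> list[float | None]:
    """Trailing rolling mean; positions without a full window are None.

    Pure function: no I/O, no global state, and the input is not mutated,
    so the same input always yields the same output.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    out: list[float | None] = []
    for i in range(len(values)):
        if i + 1 < window:
            # Not enough history yet for a full window.
            out.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


def test_rolling_mean_uses_only_current_and_past_values() -> None:
    # Each output depends only on the current and earlier inputs,
    # which guards against leaking future data into a feature.
    assert rolling_mean([1.0, 2.0, 3.0, 4.0], window=2) == [None, 1.5, 2.5, 3.5]
```

Because the function takes plain values and returns plain values, the test runs in milliseconds and needs no dataset on disk, which removes the "took too long to run" excuse for skipping it.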

What This Chapter Builds

By the end of this chapter, you will have:

A locked, reproducible Python environment using uv that installs in seconds, not minutes, and resolves to the same pinned package versions on every machine.
Type-safe data validation using Pydantic models that catch malformed data at ingestion time — before it silently corrupts your feature pipeline.
A repository structure that separates concerns, enables testing, and makes code review possible for data science work.
A DVC pipeline that versions your datasets and models alongside your code, so you can answer the question “what data and code produced this model?” for any model in your history.
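
The lineage question in the last item rests on one idea: identify data by a hash of its content and store that hash next to the code revision. DVC records a content hash for each tracked file. The standard-library sketch below illustrates that idea only and is not DVC's implementation or file format. The function names and the structure of the returned record are hypothetical.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file's bytes, read in chunks so large files are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_lineage(inputs: list[Path], code_revision: str) -> dict[str, object]:
    """Record which data (by content hash) and which code revision fed a run.

    `code_revision` is supplied by the caller, for example a Git commit hash.
    """
    return {
        "code_revision": code_revision,
        "inputs": {str(path): file_digest(path) for path in sorted(inputs)},
    }
```

Saving such a record alongside each serialized model means a changed byte in any input produces a different hash, so two models trained on "the same file" can be told apart even when the filename never changed.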

These are not aspirational best practices. They are the minimum viable infrastructure that separates a prototype from a system. The next two sections walk through each pillar with runnable code, realistic failure scenarios, and the specific tool configurations that prevent them.

A Note on Tool Choices

This book is opinionated. When two tools solve the same problem and one is measurably better for production data science, we pick it and explain why. We choose uv over pip because its maintainers benchmark it at 10–100x faster and it produces deterministic lockfiles. We choose Polars over pandas where applicable because it is written in Rust and multithreaded by default. We choose Pydantic over ad-hoc validation because it generates schemas, serializes cleanly, and reports errors with precise messages.

You are welcome to disagree — but you should disagree with data. Every recommendation in this book comes with a concrete failure scenario that motivates it. If your project genuinely doesn’t face that failure mode, use whatever you want. But know what you’re opting out of.

Let’s start with the foundation: making sure pip install never ruins your week again.