Monitoring and Continual Learning

Deploying a model is not the finish line — it is the starting line. Every model degrades over time as the world changes around it. User behavior evolves, market conditions shift, upstream data pipelines break silently. Without monitoring, you are flying blind. Without retraining triggers, you are accumulating technical debt with compound interest.

Consider what happens without monitoring. Your e-commerce recommendation model was trained on pre-pandemic shopping behavior. Lockdowns hit, and suddenly customers are buying home office equipment instead of travel accessories. The model keeps recommending luggage because nobody told it the world changed. Your click-through rate drops 40% over three weeks, but nobody notices until a VP asks why revenue is down. By then, you have served millions of bad recommendations.

This is not a contrived example. It is the default outcome for any deployed model that lacks a monitoring stack.

The Monitoring Stack

Production ML monitoring is not a single tool — it is four interlocking systems:

Artifact tracking. You need to know which model is in production right now, what data it was trained on, what hyperparameters were used, and how it performed on every evaluation metric. Without this, debugging a production issue means guessing. MLflow is one of the most widely used open-source tools for this job, and Section 11.1 shows how to use it properly.

Drift detection. Your model was trained on a specific data distribution. When the production data distribution diverges from the training distribution, predictions degrade — sometimes gradually, sometimes overnight. Section 11.2 covers three types of drift: data drift (feature distributions shift), concept drift (the relationship between features and targets changes), and feature degradation (an upstream pipeline breaks and a feature becomes constant or null).

Feedback loops. Eventually, you learn whether your predictions were correct. A customer clicks or does not click. A loan defaults or does not default. A patient recovers or does not recover. The delay between prediction and outcome ranges from seconds (ad clicks) to years (five-year default rates). Section 11.3 builds the pipeline that collects this feedback and uses it to trigger retraining.

Safe deployment. You have retrained a model and it looks better on your evaluation set. But your evaluation set is a sample, and production traffic is the population. Shadow deployments let you run the new model alongside the old one without affecting users. Canary releases route a small fraction of traffic to the new model and monitor for regressions. Automatic rollback cuts traffic to the new model if error rates spike. Section 11.4 covers all three.
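As a rough illustration of how a canary release and automatic rollback fit together, here is a minimal sketch using only the standard library. It is not the implementation from Section 11.4. The names route_to_canary and CanaryController, the hash-bucket routing scheme, and the thresholds (min_requests, max_error_ratio, the 0.001 floor) are all assumptions made for this example.

```python
import hashlib


def route_to_canary(request_id: str, canary_fraction: float) -> bool:
    """Deterministically assign a request to the canary via a hash bucket."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return bucket < canary_fraction


class CanaryController:
    """Hypothetical controller: counts errors per arm, disables a bad canary."""

    def __init__(self, canary_fraction=0.05, min_requests=100, max_error_ratio=1.5):
        self.canary_fraction = canary_fraction
        self.min_requests = min_requests        # canary requests needed before judging
        self.max_error_ratio = max_error_ratio  # allowed canary/stable error-rate ratio
        self.stats = {"stable": [0, 0], "canary": [0, 0]}  # [requests, errors]

    def choose(self, request_id: str) -> str:
        """Return which model arm should serve this request."""
        if route_to_canary(request_id, self.canary_fraction):
            return "canary"
        return "stable"

    def record(self, arm: str, is_error: bool) -> None:
        """Record one served request, then roll back if the canary regressed."""
        self.stats[arm][0] += 1
        self.stats[arm][1] += int(is_error)
        if self._should_roll_back():
            self.canary_fraction = 0.0  # rollback: all traffic returns to stable

    def _should_roll_back(self) -> bool:
        canary_requests, canary_errors = self.stats["canary"]
        stable_requests, stable_errors = self.stats["stable"]
        if canary_requests < self.min_requests or stable_requests == 0:
            return False  # not enough evidence yet
        canary_rate = canary_errors / canary_requests
        # Floor the stable rate so a zero-error baseline does not trip on one error.
        stable_rate = max(stable_errors / stable_requests, 1e-3)
        return canary_rate > self.max_error_ratio * stable_rate
```

Hashing the request ID, rather than drawing a random number, keeps the assignment stable: the same ID always lands on the same arm while the canary fraction is unchanged. A shadow deployment would differ in one respect: both models score every request, and only the stable model's answer is returned.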

MLOps Lifecycle

These four systems form a cycle. You train a model, register it in the artifact store, deploy it behind a canary, monitor for drift, collect feedback, and retrain when monitoring signals degrade. The cycle runs continuously. There is no “done.”
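The "collect feedback, retrain when signals degrade" part of the cycle can be sketched in a few lines. This is a simplified illustration, not the pipeline Section 11.3 builds. The FeedbackMonitor class, its parameters, and the rule "retrain when rolling accuracy falls more than tolerance below the baseline" are assumptions chosen for the example.

```python
from collections import deque


class FeedbackMonitor:
    """Hypothetical monitor: joins delayed outcomes to logged predictions."""

    def __init__(self, baseline_accuracy, tolerance=0.05, window=200, min_labeled=50):
        self.baseline_accuracy = baseline_accuracy  # accuracy measured at deploy time
        self.tolerance = tolerance                  # allowed drop before retraining
        self.min_labeled = min_labeled              # outcomes needed before judging
        self.pending = {}                           # prediction_id -> predicted label
        self.recent = deque(maxlen=window)          # 1 if correct, 0 if wrong

    def log_prediction(self, prediction_id, predicted):
        """Store a prediction until its outcome arrives."""
        self.pending[prediction_id] = predicted

    def log_outcome(self, prediction_id, actual):
        """Match an outcome to its prediction; ignore unknown IDs."""
        if prediction_id not in self.pending:
            return
        predicted = self.pending.pop(prediction_id)
        self.recent.append(int(predicted == actual))

    def rolling_accuracy(self):
        """Accuracy over the most recent labeled predictions, or None."""
        if not self.recent:
            return None
        return sum(self.recent) / len(self.recent)

    def should_retrain(self):
        """True once enough outcomes exist and accuracy has dropped too far."""
        if len(self.recent) < self.min_labeled:
            return False
        return self.rolling_accuracy() < self.baseline_accuracy - self.tolerance
```

In a real system the pending dictionary would live in a durable store, because outcomes can arrive long after the process that made the prediction has restarted.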

What Goes Wrong Without Each Layer

Missing Layer	What Happens	Time to Detection
No artifact tracking	You cannot reproduce results or roll back to a known-good model	Immediately on first incident
No drift detection	Model accuracy degrades silently until a business metric drops enough for someone to notice	Days to months
No feedback loops	You never learn from production outcomes; retraining uses the same stale distribution	Indefinitely
No safe deployment	A bad model update goes to 100% of traffic; rollback requires an emergency redeployment	Minutes, but damage is done

Each layer is cheap to build and expensive to skip. A basic drift detector can be written in about a hundred lines of Python. The cost of not having one is serving degraded predictions to your entire user base for weeks.
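To make the size of a drift detector concrete, here is one possible core for it: a Population Stability Index (PSI) check for a single numeric feature, written with only the standard library. PSI is one of several options and is not necessarily the test Section 11.2 uses. The function name psi, the equal-width binning, and the eps smoothing value are assumptions for this sketch.

```python
import math


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width and derived from the reference (training) sample.
    """
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample is constant; cannot build bins")
    width = (hi - lo) / n_bins

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values to edge bins
            counts[idx] += 1
        # Replace empty bins with eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))
```

A common rule of thumb, not a guarantee, reads a PSI below 0.1 as stable, 0.1 to 0.25 as worth investigating, and above 0.25 as a significant shift. Running this per feature on a schedule, with the training sample as reference and the latest window of production data as current, is the core of a basic detector.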

What Separates Monitoring from Observability

Traditional software observability — logs, metrics, traces — tells you that something is wrong. The API returned a 500 error. Latency spiked above the SLA. Memory usage hit 90%.

ML monitoring tells you what is wrong with the model itself. Prediction confidence has dropped. Feature age_bucket has a new value the model has never seen. The correlation between feature income and target default has flipped sign. These are problems that produce no errors, no latency spikes, no memory warnings. The API returns 200 OK with a valid JSON response. The prediction is technically a valid float. It is also wrong.
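Two of the silent failures just described can be checked directly. The sketch below is illustrative only: the helper names unseen_categories, pearson, and correlation_flipped are invented for this example, and the min_abs threshold, which ignores correlations too weak to have a meaningful sign, is an assumption.

```python
import math


def unseen_categories(training_values, production_values):
    """Return category values seen in production but never in training."""
    return set(production_values) - set(training_values)


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_flipped(train_x, train_y, prod_x, prod_y, min_abs=0.1):
    """True if the feature-target correlation changed sign between samples.

    Correlations weaker than min_abs in either sample are treated as noise.
    """
    r_train = pearson(train_x, train_y)
    r_prod = pearson(prod_x, prod_y)
    strong_enough = abs(r_train) >= min_abs and abs(r_prod) >= min_abs
    return strong_enough and r_train * r_prod < 0
```

For example, unseen_categories applied to the training and production values of age_bucket surfaces the new value, and correlation_flipped applied to income and default on both samples flags the sign change. Neither problem would show up in the service's logs, metrics, or traces.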

This is why you need ML-specific monitoring on top of standard application observability. Standard monitoring tells you the service is healthy. ML monitoring tells you the model is healthy. They are different questions with different answers.

Roadmap

Section 11.1 covers model registries with MLflow: logging experiments, registering model versions, loading models by stage, and documenting model limitations with model cards. Section 11.2 builds drift detection: statistical tests for data drift, methods for detecting concept drift, and monitoring for feature degradation from broken upstream pipelines. Section 11.3 addresses feedback loops: collecting prediction outcomes, choosing retraining triggers, avoiding the retraining trap of blindly appending data, and knowing when to involve human experts. Section 11.4 covers safe deployment: shadow mode for risk-free comparison, canary releases for gradual rollout, feature flags for instant toggling, and automatic rollback for self-healing systems.

Every code example in this chapter is designed to run alongside the deployment infrastructure you built in Chapter 10. You will not see monitoring systems that require a dedicated ML platform team. You will see patterns that a single data scientist can implement, operate, and extend as the system grows.

This is the final chapter because monitoring is the final capability a production ML system needs — and the first one most teams skip.