How Models Fail in Production
ML models don't crash — they degrade. A fraud detection model starts missing a new fraud pattern. A demand forecast model that worked perfectly through summer starts underperforming in Q4. A churn model trained on pre-pandemic behaviour quietly becomes wrong. None of these produce an error. They produce quietly wrong predictions that erode business outcomes until someone notices a metric moving in the wrong direction.
Data Drift vs. Concept Drift
There are two distinct drift problems, and confusing them leads to the wrong remediation:
- Data drift: The statistical distribution of your input features changes. Your model was trained on a customer population with average age 35; your current customers average age 28. The relationship the model learned may still hold, but the model is now scoring a population it saw little of during training, so its predictions there are less reliable.
- Concept drift: The underlying relationship between features and the target variable changes. Fraud patterns evolve. Customer behaviour shifts. The model has seen this type of input before, but the correct output for it has changed.
Both require monitoring, but concept drift is harder to detect because you need labelled ground truth to measure it, and ground truth often arrives with a lag.
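The difference between the two is easier to see on a toy example. The sketch below is an idealised illustration, not a real fraud model: the threshold of 100, the `old_rule` and `new_rule` functions, and the uniform transaction amounts are all invented for the demonstration. It freezes a "trained" model and then changes either the inputs or the labelling rule, one at a time.

```python
import random

rng = random.Random(42)

def model(amount):
    # Frozen "trained" model (hypothetical): flags amounts above 100 as fraud.
    return amount > 100

def old_rule(amount):
    # The relationship the model learned at training time.
    return amount > 100

def new_rule(amount):
    # Assumed new reality: fraud now happens at smaller amounts.
    return amount > 60

def accuracy(amounts, true_rule):
    hits = sum(model(a) == true_rule(a) for a in amounts)
    return hits / len(amounts)

def mean(values):
    return sum(values) / len(values)

train_like = [rng.uniform(0, 200) for _ in range(10_000)]
drifted_inputs = [rng.uniform(0, 120) for _ in range(10_000)]

# Data drift: the inputs moved, the relationship is intact.
acc_data_drift = accuracy(drifted_inputs, old_rule)
# Concept drift: the inputs look the same, the relationship moved.
acc_concept_drift = accuracy(train_like, new_rule)

print(f"mean amount: {mean(train_like):.0f} -> {mean(drifted_inputs):.0f}")
print(f"accuracy under data drift:    {acc_data_drift:.2f}")
print(f"accuracy under concept drift: {acc_concept_drift:.2f}")
```

Because the toy model matches the old rule exactly, data drift costs it nothing here; a real model fitted on sparse regions of the feature space would likely lose some accuracy. The concept drift case is the one an input-distribution test cannot see: the inputs are statistically unchanged, and only labelled outcomes reveal the drop.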
Monitoring Architecture That Works
A production ML monitoring stack needs three layers:
- Infrastructure monitoring: Is the model serving? Latency, throughput, error rates. Table stakes.
- Data drift monitoring: Statistical tests (PSI, KS test, Jensen-Shannon divergence) on input feature distributions, running continuously against a baseline.
- Performance monitoring: Prediction distribution tracking (an early proxy signal for concept drift that doesn't wait for labels), plus retrospective accuracy monitoring as labels arrive.
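As a concrete example of the data drift layer, here is a minimal sketch of a Population Stability Index (PSI) check using only the standard library. The bin count, the `eps` floor for empty bins, and the synthetic age samples are assumptions chosen for illustration; a production version would read the baseline from stored training statistics.

```python
import math
import random

def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bin edges are baseline quantiles, so each bin holds roughly an equal
    share of the baseline. eps avoids log(0) when a bin is empty.
    """
    ordered = sorted(baseline)
    # Interior bin edges taken at baseline quantiles.
    edges = [ordered[int(len(ordered) * i / n_bins)] for i in range(1, n_bins)]

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # The number of edges at or below v is the bin index (0..n_bins-1).
            idx = sum(1 for e in edges if v >= e)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    expected = bin_fractions(baseline)
    actual = bin_fractions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))

rng = random.Random(0)
# Hypothetical feature: customer age at training time vs. today.
train_age = [rng.gauss(35, 8) for _ in range(5000)]
same_age = [rng.gauss(35, 8) for _ in range(5000)]
shifted_age = [rng.gauss(28, 8) for _ in range(5000)]

psi_stable = psi(train_age, same_age)
psi_shifted = psi(train_age, shifted_age)
print(f"PSI, same population:    {psi_stable:.3f}")
print(f"PSI, shifted population: {psi_shifted:.3f}")
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.2 as a shift worth investigating, but those cut-offs are conventions rather than guarantees and should be tuned per feature. The same function can be pointed at the model's output scores to track the prediction distribution.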
The Alert That Actually Gets Acted On
Most drift monitoring implementations have too many alerts and no clear action mapping. The alert that gets acted on is specific, has a severity threshold calibrated to the business impact, and routes to someone who can do something about it. An email saying 'PSI score exceeded 0.2 for feature X' goes unread. An alert saying 'Fraud model prediction volume is 40% below baseline for the last 4 hours — potential drift detected, review recommended' gets investigated.
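One way to enforce that mapping is to make it impossible to raise an alert without a severity, an owner, and a readable message. The sketch below assumes a simple relative-change rule; the thresholds, the `SEVERITY_RULES` table, and the routing targets such as `fraud-ml-oncall` are hypothetical placeholders, not a recommended configuration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Alert:
    severity: str
    route_to: str
    message: str

# Hypothetical mapping: (minimum relative change, severity, owner).
# Ordered from most to least severe so the first match wins.
SEVERITY_RULES = [
    (0.40, "critical", "fraud-ml-oncall"),
    (0.20, "warning", "fraud-ml-team"),
]

def build_alert(model_name, metric, baseline, observed, window_hours) -> Optional[Alert]:
    """Return an actionable alert, or None if the change is below every threshold."""
    if baseline == 0:
        return None  # relative change is undefined without a baseline
    change = (observed - baseline) / baseline
    for threshold, severity, owner in SEVERITY_RULES:
        if abs(change) >= threshold:
            direction = "below" if change < 0 else "above"
            message = (
                f"{model_name} {metric} is {abs(change):.0%} {direction} baseline "
                f"for the last {window_hours} hours. "
                "Potential drift detected, review recommended."
            )
            return Alert(severity, owner, message)
    return None

alert = build_alert("Fraud model", "prediction volume", 1000, 600, 4)
if alert is not None:
    print(f"[{alert.severity}] -> {alert.route_to}: {alert.message}")
```

Changes that fall under every threshold return `None`, so small fluctuations never reach an inbox. In practice the thresholds would be calibrated against the business cost of a miss for each model.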
Retraining vs. Recalibration
Not all drift requires a full retrain. Recalibration, which adjusts the model's output probabilities without changing its weights, can correct shifts that leave the model's ranking of cases intact (for example, a change in the overall base rate) with far less effort. Full retraining is necessary when concept drift is confirmed. Having a documented decision tree for 'drift detected → is it data drift or concept drift → retrain or recalibrate' reduces the mean time to remediation significantly.
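To make recalibration concrete, the sketch below uses Platt scaling, one common recalibration technique (the text above does not prescribe a specific method). It fits two numbers, `a` and `b`, that remap the frozen model's scores. The synthetic data assumes the model still ranks cases correctly but overstates every probability by a fixed amount in logit space; the learning rate and epoch count are illustrative choices.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)  # keep the log finite
    return math.log(p / (1.0 - p))

def fit_platt(scores, labels, lr=0.5, epochs=300):
    """Fit a, b so that sigmoid(a * logit(score) + b) matches the labels."""
    a, b = 1.0, 0.0  # start from the identity mapping
    zs = [logit(s) for s in scores]
    n = len(zs)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for z, y in zip(zs, labels):
            err = sigmoid(a * z + b) - y  # gradient of log loss w.r.t. the logit
            grad_a += err * z
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def recalibrate(score, a, b):
    return sigmoid(a * logit(score) + b)

def log_loss(probs, labels, eps=1e-6):
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

rng = random.Random(7)
# Synthetic assumption: the ranking is still right, but the true positive
# rate has dropped, so every score is too high by 1 in logit space.
scores = [rng.uniform(0.05, 0.95) for _ in range(2000)]
labels = [1 if rng.random() < sigmoid(logit(s) - 1.0) else 0 for s in scores]

a, b = fit_platt(scores, labels)
loss_before = log_loss(scores, labels)
loss_after = log_loss([recalibrate(s, a, b) for s in scores], labels)
print(f"a={a:.2f}, b={b:.2f}, log loss {loss_before:.3f} -> {loss_after:.3f}")
```

The underlying model is never touched; only the two-parameter mapping on top of it changes. In a real system the mapping would be fitted on recently labelled data held out from training, and if the fitted mapping fails to restore performance, that is a hint the ranking itself has degraded and a retrain is the better path.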
Key Takeaways
- ML models degrade silently — you need monitoring to catch it before the business does.
- Distinguish data drift (input distribution change) from concept drift (relationship change) — remediation differs.
- Three monitoring layers: infrastructure, data drift, and performance monitoring.
- Prediction distribution shift is an early concept drift signal that doesn't require ground truth labels.
- Map every alert to a clear action — unmapped alerts get ignored.