Machine Learning Monitoring, Part 1: What It Is and How It Differs

Indeed, even the design of machine learning courses and the landscape of machine learning tools add to this perception. They extensively address data preparation, iterative model building, and (most recently) the deployment phase. Still, both in tutorials and practice, what happens after the model goes into production is often left up to chance.

The simple reason for this neglect is a lack of maturity. Aside from a few technical giants that live and breathe machine learning, most industries are only getting started. There is limited experience with real-life machine learning applications. Companies are overwhelmed with sorting many things out for the first time and rushing to deploy. Data scientists do everything from data cleaning to the A/B test setup. Model operations, maintenance, and support are often only an afterthought. One of the critical but often overlooked components of this machine learning afterlife is monitoring.

Why monitoring matters

"An ounce of prevention is worth a pound of cure." (Benjamin Franklin)

With the learning techniques we use these days, a model is never final. In training, it studies the past examples. Once released into the wild, it works with new data: this can be user clickstream, product sales, or credit applications. With time, this data deviates from what the model has seen in training. Sooner or later, even the most accurate and carefully tested solution starts to degrade. The recent pandemic illustrated this all too vividly. Some cases even made the headlines:

  • The accuracy of Instacart's model for predicting item availability at stores dropped from 93% to 61% due to a drastic shift in shopping habits.

  • Bankers question whether credit models trained on good times can adapt to the stress scenarios.

  • Trading algorithms misfired in response to market volatility. Some funds fell by 21%.

  • Image classification models had to learn the new normal: a family at home in front of laptops can now mean "work," not "leisure."

  • Even weather forecasts are less accurate since valuable data disappeared with the reduction of commercial flights.

A new concept of "office work" your image classification model might need to learn in 2020. (Image by Ketut Subiyanto, Pexels)

On top of this, all sorts of issues occur with live data. There are input errors and database outages. Data pipelines break. User demographics change. If a model receives wrong or unusual input, it will make an unreliable prediction. Or many, many of those.

Model failures and untreated decay cause damage. Sometimes this is just a minor inconvenience, like a silly product recommendation or a wrongly labeled photo. The effects go much further in high-stakes domains, such as hiring, grading, or credit decisions. Even in otherwise "low-risk" areas like marketing or supply chain, underperforming models can severely hit the bottom line when they operate at scale. Companies waste money on the wrong advertising channel, display incorrect prices, understock items, or harm the user experience.

This is where monitoring comes in. We don't just deploy our models once. We already know that they will break and degrade. To operate them successfully, we need a real-time view of their performance. Do they work as expected? What is causing the change? Is it time to intervene?

This sort of visibility is not a nice-to-have, but a critical part of the loop. Monitoring is built into the model development lifecycle, connecting production with modeling. If we detect a quality drop, we can trigger retraining or step back into the research phase to rebuild the model.
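As a rough illustration of the "detect a quality drop, then intervene" loop, here is a minimal sketch in Python. It assumes that ground-truth labels eventually arrive for each prediction, which is not always the case. The class name `RollingAccuracyMonitor`, the window size, and the 0.8 threshold are all hypothetical choices for this example, not part of any particular tool.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Track accuracy over the most recent predictions and flag a drop."""

    def __init__(self, window_size, min_accuracy):
        # deque with maxlen drops the oldest outcome once the window is full.
        self.outcomes = deque(maxlen=window_size)
        self.min_accuracy = min_accuracy

    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        # Wait until the window is full so a few early errors do not trigger an alert.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.min_accuracy

    def update(self, y_true, y_pred):
        """Record one labeled prediction and report whether to intervene."""
        self.outcomes.append(y_true == y_pred)
        return self.needs_attention()


# Hypothetical stream: 100 correct predictions, then the model starts failing.
monitor = RollingAccuracyMonitor(window_size=100, min_accuracy=0.8)
alerts = [monitor.update(1, 1) for _ in range(100)]
alerts += [monitor.update(1, 0) for _ in range(30)]

# Position in the stream where the monitor first asks for intervention.
first_alert = alerts.index(True)
```

In a real system, the alert would feed a decision such as retraining or a deeper investigation. What counts as an acceptable threshold depends on the use case.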

Let us propose a formal definition. Machine learning monitoring is the practice of tracking and analyzing production model performance to ensure acceptable quality as defined by the use case. It provides early warnings on performance issues and helps diagnose their root cause so they can be debugged and resolved.

How machine learning monitoring is different

One might think: we have been deploying software for ages, and monitoring is nothing new. Just do the same with your machine learning stuff. Why all the fuss? There is some truth to it. A deployed model is a software service, and we need to track the usual health metrics such as latency, memory utilization, and uptime. But in addition to that, a machine learning system has its unique issues to look after.

First of all, data adds an extra layer of complexity. It is not just the code we should worry about, but also data quality and its dependencies. More moving pieces mean more potential failure modes! Often, these data sources reside completely out of our control. And even if the pipelines are perfectly maintained, environmental change creeps in and leads to a performance drop. Is the world changing too fast? In machine learning monitoring, this abstract question becomes applied. We watch out for data shifts and quantify the degree of change. Quite a different task from, say, checking a server load.

To make things worse, models often fail silently. There are no "bad gateways" or "404"s. Despite the input data being odd, the system will likely return a response. The individual prediction might seemingly make sense while being harmful, biased, or wrong. Imagine we rely on machine learning to predict customer churn, and the model falls short. It might take weeks to learn the facts (such as whether an at-risk client eventually left) or notice the impact on the business KPI (such as a drop in quarterly renewals). Only then would we suspect the system needs a health check! You'd hardly miss a software outage for that long. In the land of unmonitored models, this invisible downtime is an alarming norm.

To save the day, you have to react early. This means assessing just the data that went in and how the model responded: a peculiar type of half-blind monitoring.
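One way to quantify a data shift without waiting for labels is to compare the distribution of an input feature in production against the training data. The sketch below uses the Population Stability Index (PSI), which is only one of several possible measures and is not prescribed by this article. The function name, the synthetic data, and the bin count are assumptions made for illustration.

```python
import math
import random


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """Compare two samples of one numeric feature using PSI.

    Bin edges come from quantiles of the reference sample, so each bin
    holds roughly the same share of reference values.
    """
    ref_sorted = sorted(reference)
    # Interior bin edges at the 1/n_bins, 2/n_bins, ... quantiles of the reference.
    edges = [ref_sorted[int(len(ref_sorted) * i / n_bins)] for i in range(1, n_bins)]

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # The bin index is the number of edges at or below v.
            idx = sum(1 for e in edges if v >= e)
            counts[idx] += 1
        # eps keeps the logarithm finite when a bin is empty.
        return [max(c / len(values), eps) for c in counts]

    ref_shares = bin_shares(reference)
    cur_shares = bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


# Synthetic example: one feature as seen in training and in two production periods.
rng = random.Random(0)
training = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same_world = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted_world = [rng.gauss(1.0, 1.0) for _ in range(5000)]

psi_stable = population_stability_index(training, same_world)
psi_shifted = population_stability_index(training, shifted_world)
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift. These cutoffs are conventions rather than a standard, so they should be tuned to the feature and the use case.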

The distinction between "good" and "bad" is not clear-cut. One accidental outlier does not mean the model went rogue and needs an urgent update. At the same time, stable accuracy can also be misleading. Hiding behind an aggregate number, a model can quietly fail on some critical data region.
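To see how an aggregate number can hide a failing data region, consider this small sketch with made-up data. The segment names, counts, and the helper `accuracy_by_segment` are hypothetical, and the function assumes a non-empty list of records.

```python
from collections import defaultdict


def accuracy_by_segment(records):
    """Return overall accuracy and accuracy per segment.

    Each record is a (segment, y_true, y_pred) tuple.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for segment, y_true, y_pred in records:
        totals[segment] += 1
        hits[segment] += int(y_true == y_pred)
    overall = sum(hits.values()) / sum(totals.values())
    per_segment = {s: hits[s] / totals[s] for s in totals}
    return overall, per_segment


# Made-up data: 950 "standard" customers and 50 "premium" ones.
records = (
    [("standard", 1, 1)] * 930 + [("standard", 1, 0)] * 20
    + [("premium", 1, 1)] * 20 + [("premium", 1, 0)] * 30
)
overall, per_segment = accuracy_by_segment(records)
```

Here the overall accuracy is 95%, while accuracy on the small premium segment is only 40%. Tracking the aggregate alone would not reveal the problem.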

Metrics are useless without context. Acceptable performance, model risks, and costs of errors vary across use cases. In lending models, we care about fair outcomes. In fraud detection, we barely tolerate false negatives. With stock replenishment, ordering more might be better than less. In marketing models, we would want to keep tabs on the premium segment performance. All these nuances inform our monitoring needs, specific metrics to keep an eye on, and the way we'll interpret them.

With this, machine learning monitoring falls somewhere in between traditional software and product analytics. We still look at "technical" performance metrics such as accuracy or mean absolute error. But what we primarily aim to check is the quality of the decision-making that machine learning enables: whether it is satisfactory, unbiased, and serves our business goal.
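To make the point about context concrete, the sketch below weighs the same error counts by use-case-specific costs. All numbers, including the cost of a false alarm versus a missed fraud case, are invented for illustration, and `total_error_cost` is a hypothetical helper.

```python
def total_error_cost(false_positives, false_negatives, cost_fp, cost_fn):
    """Total cost of errors, given per-error costs chosen for the use case."""
    return false_positives * cost_fp + false_negatives * cost_fn


# Two hypothetical models that make the same number of errors (100 each).
model_a = {"false_positives": 80, "false_negatives": 20}
model_b = {"false_positives": 20, "false_negatives": 80}

# Invented fraud-detection costs: a missed fraud case is far more
# expensive than a false alarm.
fraud_costs = {"cost_fp": 5, "cost_fn": 500}

cost_a = total_error_cost(**model_a, **fraud_costs)
cost_b = total_error_cost(**model_b, **fraud_costs)
```

Both models have the same error count, yet under these assumed costs model B is almost four times as expensive (40,100 versus 10,400). With stock replenishment or marketing, the cost structure and therefore the ranking could look quite different.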

In a nutshell

Looking only at software metrics is too little. Looking at the downstream product or business KPIs is too late. Machine learning monitoring is a distinct domain, and it requires appropriate practices, strategies, and tools.

Source: evidently ai