If you’ve been reading my posts so far, you’ve probably absorbed the idea that AI models become useful only when they leave the lab and become part of a deployed system, integrated into business software and processes. So, say you’ve done that. You built your model. You improved its accuracy and demonstrated that it’s high enough to drive business value. Then your software engineering team integrated your shiny new model into an enterprise application. You tested it and the users are delighted. You can relax, sip a latte, and enjoy your ROI, right?

Sorry, not so fast. David Talby, writing in Forbes, says that your challenges aren’t over. Indeed, “there’s a higher marginal cost to operating ML products compared to traditional software.” 

To make his point, Talby shares a case study that is typical of most AI projects. His team built a model that did a good job of predicting hospital readmissions: a well-studied problem with stable best practices and guidelines. Initially his customers were happy. But the system degraded over time. Reports Talby, “Within three months of deploying new ML software that passed traditional acceptance tests, our customers were unhappy because the system was predicting poorly.” And, “Models would change in different ways at different hospitals — or even buildings within the same hospital.”

What happened? As Elena Samuylova observes in Is there life after deployment?, “If a model receives wrong or unusual input, it will make an unreliable prediction. Or many, many of those.”

Most of the software engineering stack has unit tests, designed to quickly catch when a change breaks the performance of one of the components. But Machine Learning (ML) models are designed to change, and don’t really come with unit tests, because their accuracy depends on a match to the world, and that world changes. Dr. Lorien Pratt describes this problem in this MCubed keynote as a “double hinge”: most projects must surf the ever-changing relationship between product and market to maintain value. As if this weren’t hard enough, machine learning projects have a second moving target: the outside world is constantly changing, and systems that contain at their heart such a “world model” must keep up with it as well.

ML systems can be sensitive to small details in any of the other components of a system. So, changing the user interface a little bit, like changing a field name from “name” to “username”, merging first and last names into a “name” field, localizing the interface for different geographies, or changing a country designation from “US” to “USA” may completely break the performance of a recommender system that depended on these field names for its historical training data. I’ve found amongst the companies I advise that errors like these are pernicious and occur all too easily.
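
One way to catch this kind of breakage early is to compare each incoming record against the field names and values the model saw in training. What follows is a minimal sketch, not a production validator: `TRAINING_SCHEMA`, `check_record`, and the sample fields are hypothetical names chosen to mirror the examples above.

```python
# Hypothetical schema captured at training time.
# Maps field name -> set of allowed values (None means free text).
TRAINING_SCHEMA = {
    "name": None,
    "country": {"US", "CA", "MX"},
}


def check_record(record, schema=TRAINING_SCHEMA):
    """Return a list of human-readable problems found in one input record."""
    problems = []
    # Fields the model was trained on but that are absent from the record.
    for field in schema:
        if field not in record:
            problems.append(f"missing field: {field}")
    # Fields or values the model has never seen.
    for field, value in record.items():
        if field not in schema:
            problems.append(f"unexpected field: {field}")
        elif schema[field] is not None and value not in schema[field]:
            problems.append(f"unseen value for {field}: {value!r}")
    return problems
```

A record such as `{"username": "Ann", "country": "USA"}` would be flagged three times: the `name` field is missing, `username` is unexpected, and `"USA"` is not a country value seen in training. A check like this could run in front of the model and log or alert instead of silently passing bad input through.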

In What Can Go Wrong With The Data?, Samuylova’s colleague Emeli Dral describes two important categories of data issues:

  1. something goes wrong with the data itself; or
  2. the data changes because the environment does.

In the case of Talby’s hospital system, both problems occurred. First, medical data relies heavily on codes, and changes to outside providers like labs often cause the same results to be coded differently by different providers. To understand this, let’s assume there’s a lab result that’s highly predictive of readmission. Then a new lab codes that result differently, using a code that was not in the ML training data. Humans assessing readmission likelihood would of course correctly interpret the new code and correctly assess patient risk. But ML cannot apply meanings to codes. The previously working model simply found that a code value in one field had a high correlation with readmission. As long as the code values stay the same, ML makes good predictions. But when the coded value changes, ML can no longer use that field effectively. If that field is important for prediction, then ML accuracy will decrease.
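
A simple signal for this failure mode is the share of production values in a coded field that never appeared in training. The sketch below assumes you have kept the list of codes seen at training time. The function name `unseen_code_rate` and the sample codes are illustrative and not taken from Talby’s system.

```python
def unseen_code_rate(training_codes, production_codes):
    """Fraction of production values whose code never appeared in training."""
    known = set(training_codes)
    production_codes = list(production_codes)
    if not production_codes:
        return 0.0  # nothing to evaluate yet
    unseen = sum(1 for code in production_codes if code not in known)
    return unseen / len(production_codes)


# Illustrative, made-up codes: a new lab starts reporting "X9".
rate = unseen_code_rate(["A1", "B2"], ["A1", "X9", "X9", "B2"])
```

Here half of the production values use a code the model was never trained on. A rate that jumps after a provider change is a hint that the field needs remapping or the model needs retraining.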

Dral’s second problem manifests here in changes in the patient population. One way that happens derives from the fact that, when possible, people use hospitals and ERs that accept their insurance. Then, as insurance systems change with government policies, and hospitals change their contracts with insurance companies, patient populations change. This reduces the accuracy of ML models that depend, as they often do, on patient demographics.
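
A shift in patient mix like this can be measured directly by comparing the distribution of a feature in training data with its distribution in recent production data. One common choice is the population stability index (PSI), sketched below for a categorical feature. This is an illustration under stated assumptions rather than anything described by Dral or Talby: the function name, the `eps` smoothing value, and the insurance labels in the usage note are all hypothetical.

```python
import math
from collections import Counter


def population_stability_index(expected, actual, eps=1e-6):
    """PSI between two samples of a categorical feature.

    expected: values observed in the training data
    actual:   values observed in recent production data
    eps:      floor for proportions, so categories absent from one
              sample do not cause a division by zero or log(0)
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    exp_counts, act_counts = Counter(expected), Counter(actual)
    psi = 0.0
    for category in set(expected) | set(actual):
        p = max(exp_counts[category] / len(expected), eps)
        q = max(act_counts[category] / len(actual), eps)
        psi += (q - p) * math.log(q / p)
    return psi
```

If training data were split evenly between two insurance types, say `"plan_a"` and `"plan_b"`, and production data later became 90% `"plan_a"`, the index would rise well above zero, while identical distributions give exactly zero. The alert threshold is a judgment call that should be tuned per feature.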

This accuracy reduction can happen gradually. ML systems rarely just break: as production data drifts away from training data, they may continue to return scores and classifications, so the application doesn’t crash or generate errors. Instead, the model accuracy just quietly declines until it becomes apparent to the users.

New infrastructure and roles are required to create mature systems to prevent, catch, and mitigate these problems. In other words, AI systems need preventative maintenance. As Paul Barba, Chief Scientist at Lexalytics, puts it, “Unfortunately, failing to maintain your AI is a great way to end up totaling your project.” For Barba, maintenance means “monitoring, measuring and adjusting.”

Applied AI is an emerging discipline, and the field is still learning best practices for keeping AI systems healthy in production. However, a fairly simple program of measuring the readmission model’s performance on production data would have surfaced both the speed and extent of the accuracy decline for Talby’s model. Similar governance procedures can be planned for and put in place for your production system as well.
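
As a rough illustration of what such a measurement program might look like, the sketch below keeps a rolling window of recent predictions whose true outcomes are now known and flags the model when accuracy falls below the level measured at acceptance testing. The class name `AccuracyMonitor`, the window size, and the tolerance are assumptions made for the example, not part of Talby’s approach.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks accuracy over the most recent labeled predictions."""

    def __init__(self, baseline_accuracy, window=100, tolerance=0.05):
        self.baseline = baseline_accuracy  # accuracy at acceptance testing
        self.tolerance = tolerance         # allowed drop before alerting
        self.outcomes = deque(maxlen=window)  # oldest results fall off

    def record(self, predicted, actual):
        """Store whether one prediction matched the observed outcome."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Accuracy over the current window, or None if no data yet."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self):
        """True when windowed accuracy is below baseline minus tolerance."""
        acc = self.accuracy()
        return acc is not None and acc < self.baseline - self.tolerance
```

In a readmission setting, `record` would be called once the true outcome is known, for example after the readmission period has passed, and `degraded` could feed a dashboard or alert. Keeping one monitor per hospital or building would also surface the site-by-site differences Talby describes.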

There’s a more pernicious issue as well: if your required retraining cadence is too high, it might take your project out of the AI Goldilocks Zone.