When I talk with AI entrepreneurs, I often ask, “How good does the AI need to be, to be useful in your application? What actually is the quality bar?”[1] I’m sorry to say that, more often than not, the answer I hear isn’t very satisfying: “That’s a really good question; we should go and think about that.” But this “useful quality bar” question actually drives half of your product design, and is critical for the success or failure of your product.

We tend to think that AI will either work or won’t work, but it turns out that this isn’t a binary question. The quality level and the type of accuracy that you need for your use case (e.g. the ability to tolerate a false negative better than a false positive) depend on your desired business outcome(s). By way of illustration, in an earlier post I talked about how a model with high accuracy only in some cases (identifying certain types of normal mammograms) could save many hours of radiologist time. It’s not perfect, but it is of great value.
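To make the false-negative versus false-positive point concrete, here is a minimal sketch in Python. The two models, their error counts, and the 10-to-1 cost ratio are all made-up assumptions for illustration; the only point is that two systems with identical accuracy can carry very different business costs.

```python
from dataclasses import dataclass


@dataclass
class ErrorCounts:
    """Confusion-matrix counts for a binary classifier (hypothetical data)."""
    true_pos: int
    false_pos: int
    true_neg: int
    false_neg: int

    def accuracy(self) -> float:
        total = self.true_pos + self.false_pos + self.true_neg + self.false_neg
        return (self.true_pos + self.true_neg) / total

    def business_cost(self, cost_false_pos: float, cost_false_neg: float) -> float:
        # Weight each kind of mistake by what it costs the business.
        return self.false_pos * cost_false_pos + self.false_neg * cost_false_neg


# Two hypothetical models: same accuracy, different mix of errors.
model_a = ErrorCounts(true_pos=80, false_pos=5, true_neg=900, false_neg=15)
model_b = ErrorCounts(true_pos=90, false_pos=15, true_neg=890, false_neg=5)

# Assumption: a missed case (false negative) costs 10x a false alarm.
COST_FP, COST_FN = 1.0, 10.0

for name, m in (("A", model_a), ("B", model_b)):
    cost = m.business_cost(COST_FP, COST_FN)
    print(f"model {name}: accuracy={m.accuracy():.3f}, cost={cost:.0f}")
```

Both models score 98% accuracy, yet under these assumed costs model A is more than twice as expensive as model B. If your use case flips the cost ratio, the ranking flips too.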

The accuracy requirement differs from one application to the next. Other AI applications like self-driving vehicles require greater than 99% accuracy in every situation. It is essential, then, to have a well-defined and measurable definition of quality for your business use case, and it should be equivalent to—or at least aligned with—the customer’s requirements and preferences. We call this quality definition the objective function for the system, and it defines how it will be evaluated and optimized, and how you know you’re ready to deploy.

Sometimes, it’s hard to know exactly how good is good enough. Several years ago, we were working with a diabetes glucose level prediction system, intended to give 60 minutes’ advance warning of a blood sugar problem. The idea was that it would be great to let parents of a diabetic child sleep instead of waking to react to alarms of problems that happened 15 minutes ago. We knew we could get to 82% accuracy, meaning our system’s predictions were within plus or minus 9% of the true glucose levels. That’s a pretty big swing. The question on the table then was, “Is this high enough accuracy for a parent to sleep soundly?”

Here’s another example. We had a neat company building a wearable calorie counter for your wrist, which, in this time before Fitbit and Apple HealthKit, was a big deal and an unsolved challenge. Of course, the question we asked was: “Well, how good does it need to be?”

And the AI team didn’t know. They said: “Well, the FDA says that as long as 80% of the time it’s within 20% of the right calorie number, it counts as a calorie counter.”

But if you think about it, that means that 20% of the time a consumer can eat a steak and be told it’s 100 calories, or eat a salad and be told it’s a thousand calories. That kind of inaccuracy will implode any sense of customer trust: once customers saw errors like these, the company could never succeed in the market.
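Here is a rough sketch of why a rule like “80% of the time within 20%” can pass while the product still fails its users. The readings below are invented for illustration, and the function names are my own, not any regulator’s definition. The metric only counts how often you land inside the tolerance band; it says nothing about how bad the misses are.

```python
def within_tolerance_rate(predicted, actual, tolerance):
    """Fraction of predictions whose relative error is at most `tolerance`."""
    hits = sum(
        abs(p - a) <= tolerance * abs(a)
        for p, a in zip(predicted, actual)
    )
    return hits / len(actual)


def worst_relative_error(predicted, actual):
    """Largest relative error across all predictions."""
    return max(abs(p - a) / abs(a) for p, a in zip(predicted, actual))


# Hypothetical calorie readings: eight close estimates and two wild misses
# (an 800-calorie steak read as 100, a 120-calorie salad read as 1000).
actual = [700, 150, 500, 300, 650, 400, 250, 550, 800, 120]
predicted = [680, 160, 450, 330, 700, 380, 240, 600, 100, 1000]

rate = within_tolerance_rate(predicted, actual, tolerance=0.20)
worst = worst_relative_error(predicted, actual)

print(f"within 20%: {rate:.0%} of readings")
print(f"worst miss: {worst:.0%} relative error")
```

On this toy data the system meets the 80% bar exactly, while its worst reading is off by more than 700%. A quality bar that customers will trust probably needs a second number that bounds the tail, not just the hit rate.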

Measuring and reaching the quality bar

Once you decide what quality you need, the next question is: “How are you going to measure it?” Can you measure how good your system is today? How are you measuring it? Is it the right measurement?

If you haven’t yet reached your desired quality level, then ask, “How hard will it be to achieve the level of quality we need? What is our path to success?” Let’s assume, for instance, that your desired business outcomes require 95% accuracy. A typical machine learning path might get you to 70% accuracy in three weeks, and everyone feels great. But after you celebrate ML actually finding a pattern in your data, you are confronted with the 95% reality. And typically, you have no idea how much additional work, data, and tweaking is going to be required (or if it’s even possible) to get to that level of accuracy. Indeed, sometimes that last 15% of accuracy can cost 90% of your effort; this is often a nonlinear situation. Recent research suggests that, in practice, the computing power required scales with roughly the ninth power of the improvement in performance. Says this article, “This ninth power means that to halve the error rate, you can expect to need more than 500 times the computational resources.” So there’s either big uncertainty, or big known costs, in improving performance of many systems.
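The arithmetic behind that quote is easy to check. The sketch below assumes compute grows as the error-reduction factor raised to the ninth power, which is how I read the quoted claim. Applying it to the 70%-to-95% example is purely illustrative, since that scaling was reported for particular deep learning benchmarks and need not hold for your project.

```python
def compute_multiplier(error_reduction_factor: float, exponent: float = 9.0) -> float:
    """Relative compute needed to shrink the error rate by the given factor,
    assuming compute scales as that factor raised to `exponent`."""
    return error_reduction_factor ** exponent


# Halving the error rate: 2 to the ninth power.
halve = compute_multiplier(2.0)
print(f"halve the error rate: {halve:,.0f}x compute")

# Illustration only: going from 70% to 95% accuracy cuts the error rate
# from 30% to 5%, i.e. by a factor of 6.
six_fold = compute_multiplier(6.0)
print(f"cut the error rate 6x: {six_fold:,.0f}x compute")
```

Halving the error comes out at 512 times the compute, which matches the “more than 500 times” in the quote. Under the same assumption, a six-fold error reduction would call for roughly ten million times the compute, which is why the path from a 70% prototype to a 95% product deserves a hard look before you commit to it.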

Dun & Bradstreet example

Here’s how this situation can play out. The client of one of my machine learning companies was Dun & Bradstreet, which maintains one of the world’s largest high-quality databases of information about public and private companies. At the time of our project start, D&B obtained this information from a variety of sources, including yellow pages, mailing lists, company registrations, and phone calls. This was a time-consuming and often manual process, so we offered them the chance to supplement or replace their data with automatically extracted data from the web.

Our technology was based on both machine learning (ML) and natural language processing (NLP). We assembled a set of labeled training data for how the content we desired was typically laid out on example web pages, and then built systems that could learn to navigate, extract, and collate that information to create high-quality records from semi-structured content on the open web. The idea was to dramatically lower the customer’s cost of content acquisition while increasing the freshness and comprehensiveness of the data, and also adding new fields that are only available online.

The client needed very high accuracy, specifically what it called “99% accuracy at the record level”. This included correct joins between data from different sources. This was very hard to do in a general way by writing programs in scripting languages, and very expensive to do using people. To do it using automated means would require very advanced AI like ours, so we felt good about beating the competition and alternatives (so in the language of my Goldilocks post, the problem was not Too Soft).

We offered both to build the system and to maintain it. To do so, we needed to build a business case which included our costing, so that the client could evaluate the project’s profitability. The problem in this case was that, without building the actual system, including creating a large set of labeled training and test data, we didn’t know how much accuracy we would ultimately be able to achieve for any specific amount of time and effort. Our engineers built a prototype in three weeks that showed we could achieve 70% accuracy. But, as I described above, we had no idea how long it would take to get to 90% accuracy, and even once we achieved that, we wouldn’t know how long it would take to get to 99% accuracy, or if that was even possible.

Because of this uncertainty, we didn’t know if the problem was actually Too Hard or Just Right. In the end, we negotiated a deal that factored in this uncertainty as best we could, and we achieved a good outcome for both sides. But not knowing up front what kind of problem we faced made each stage of the deal difficult and risky for all involved.

This way of thinking should be more widespread than it is today, because this D&B project was no exception. All AI projects need to answer: “How accurate?”, “How do we measure?”, and “How do we achieve our required accuracy?”

In the next two posts I’ll dig deeper into strategies to address these questions, along with how to maintain AI quality in systems that use AI as a component.

1 Note that in this article, I use the word “quality” as synonymous with “accuracy”, as opposed to the more general usage of “quality” in software development, which can also include things like low defects, fit-for-purpose, and more.