Getting the right data—and getting the data right—is essential for good AI accuracy, whether that data is collected before or during system deployment. 

Obtaining Data

Obtaining data to train your AI system may seem like a straightforward task, but there are some hidden gotchas.

Consider a company I’ll call WebLeads. It set out to create the largest marketing database of executive leads, using web extraction technology. At the time, Dun & Bradstreet was the leading company in this area; it used human teams to laboriously build lead databases, an approach that was expensive but very comprehensive.

The WebLeads team ultimately discovered that the data it needed simply didn’t exist: much of it was not publicly available on the web to be extracted. No matter how good the company’s AI was, it could not succeed without that data.

This may seem obvious, but do your homework to confirm that the data you need will actually be available. Too many AI companies skip this step, possibly because obtaining data in the real world is generally off the radar of academic AI.

Data Quality

Data also needs to meet a quality bar. You don’t want your data to be too noisy, full of meaningless information, distortions, or outright corruption. That means the data must be cleaned and prepared before it can be used in an AI system.

That said, not all data is created equal. It is more important to clean the highly predictive fields than the ones your system will largely ignore.
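
As a minimal sketch of what targeted cleaning can look like (using pandas, with a made-up leads table and made-up column names, not WebLeads’ actual data), you might normalize the predictive fields carefully and simply drop the fields your model will ignore:

    import numpy as np
    import pandas as pd

    # Toy leads table; the column names and values are illustrative only.
    df = pd.DataFrame({
        "company_revenue": ["1,200,000", "N/A", "850000", "2.1M"],  # highly predictive
        "contact_title":   ["VP Sales", "vp, sales", None, "CFO"],  # highly predictive
        "fax_number":      ["555-0100", None, "n/a", ""],           # largely ignored downstream
    })

    def parse_revenue(value):
        """Normalize messy revenue strings to a float, or NaN if unparseable."""
        text = str(value).replace(",", "").strip()
        if text.lower() in {"", "none", "n/a", "na", "nan"}:
            return np.nan
        if text.lower().endswith("m"):
            return float(text[:-1]) * 1_000_000
        try:
            return float(text)
        except ValueError:
            return np.nan

    # Spend the cleaning effort on the predictive fields...
    df["company_revenue"] = df["company_revenue"].map(parse_revenue)
    df["contact_title"] = (
        df["contact_title"]
        .fillna("unknown")
        .str.lower()
        .str.replace(",", "", regex=False)
        .str.strip()
    )

    # ...and don't sink time into fields the model will largely ignore.
    df = df.drop(columns=["fax_number"])

    print(df)

The specific transformations matter less than the principle: cleaning effort is budgeted where it actually affects predictions.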

Preprocessing

It’s a widespread belief that preprocessing machine learning training data is always a good idea, leading to better accuracy, faster training, or both. But there are cases where preprocessing can actually reduce performance on these very metrics.

For instance, consider a startup I worked with through the Creative Destruction Lab (CDL): the company’s technology analyzed audio files containing human speech and classified speech segments into categories like “happy” or “talking about banking” directly from the audio. This is different from many speech systems, which use a two-step process: 1) convert the speech to text, then 2) classify the text into categories. As it turns out, the single-step approach can actually handle noise better and learn from less data. The reason is that step (1) can lose information: a transcript, for example, discards tone of voice, which matters a great deal for a category like “happy”.
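
To make the contrast concrete, here is a minimal sketch (in PyTorch, with made-up shapes and category labels; it is not the startup’s actual model) of a single-step classifier that maps raw audio straight to a category, with no transcription step in between:

    import torch
    import torch.nn as nn

    CATEGORIES = ["happy", "neutral", "talking about banking"]  # illustrative labels

    class AudioClassifier(nn.Module):
        """Single-step classifier: raw waveform in, category logits out."""
        def __init__(self, num_classes=len(CATEGORIES)):
            super().__init__()
            # 1D convolutions operate on the waveform itself, so cues like tone
            # and emphasis remain available; nothing is discarded by a
            # speech-to-text step.
            self.features = nn.Sequential(
                nn.Conv1d(1, 16, kernel_size=80, stride=4), nn.ReLU(),
                nn.Conv1d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),
            )
            self.classifier = nn.Linear(32, num_classes)

        def forward(self, waveform):              # waveform: (batch, 1, samples)
            x = self.features(waveform).squeeze(-1)
            return self.classifier(x)

    model = AudioClassifier()
    fake_clips = torch.randn(2, 1, 16000)         # two 1-second clips at 16 kHz
    print(model(fake_clips).shape)                # torch.Size([2, 3])

A real system would train this on labelled audio; the point here is simply that the model sees the raw signal rather than a transcript.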

Another example comes from the history of computer vision, the general discipline of labelling, identifying, and segmenting images (e.g. drawing a red box around a person, and a green box around a street, for a self-driving car). Before Convolutional Neural Networks (CNNs) became the dominant approach, standard practice in computer vision was also a two-step process: 1) convert the image into hand-crafted features like edges, circles, and so on; then 2) run the classification task on those features. One of the reasons CNNs are considerably more accurate than these older systems is that they learn features directly from the raw pixels, so they don’t incur the information loss of step (1).
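
As a rough sketch of the one-step approach (a toy PyTorch CNN, not any particular historical system), the network takes raw pixels and learns its own features rather than relying on a fixed edge-or-shape detector:

    import torch
    import torch.nn as nn

    class TinyCNN(nn.Module):
        """Toy image classifier: raw pixels in, class logits out."""
        def __init__(self, num_classes=10):
            super().__init__()
            # The convolutional layers learn their own features during training,
            # instead of a hand-crafted step (1) that might discard useful detail.
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.classifier = nn.Linear(32, num_classes)

        def forward(self, images):                # images: (batch, 3, height, width)
            x = self.features(images).flatten(1)
            return self.classifier(x)

    model = TinyCNN()
    fake_images = torch.randn(4, 3, 64, 64)       # four 64x64 RGB images
    print(model(fake_images).shape)               # torch.Size([4, 10])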

These are just a few examples of non-obvious elements of data preparation that you’ll want to keep in mind as you get ready for machine learning.