Supervised Machine Learning done right: getting the labels that you want

“Garbage in, garbage out” is an expression that Machine Learning practitioners are very familiar with. Despite having spent most of their classroom time learning various algorithms, newly hired engineers and data scientists soon find that no state-of-the-art model or clever hyperparameter tuning can make up for poor-quality training data.

Here we’ll discuss a few of the most common data issues and possible ways to address them.

It must be noted that the nature of the problems encountered differs between structured (i.e. tabular) and unstructured (e.g. image, video, or text) data. Of course, in both cases you want the training data to come from the same distribution as the data that inference will be performed on. Having a balanced dataset (one where the number of examples belonging to different classes does not differ too much) is also desirable, although not always feasible. Fortunately, there are algorithmic approaches that allow us to deal with imbalanced datasets. When it comes to data “quality”, however, there is not much that can be done on the engineering side, so special care must be taken to prepare the dataset before the modeling can even begin. And this is where the definitions of what constitutes a good structured vs. unstructured dataset begin to diverge.
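
To make the imbalance point concrete, here is a minimal sketch of one such algorithmic approach: reweighting classes inversely to their frequency via scikit-learn’s class_weight option. The synthetic dataset below is purely illustrative.

```python
# A minimal sketch (not the author's pipeline): handle class imbalance by
# reweighting classes inversely to their frequency with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate a deliberately imbalanced binary dataset (roughly 95% vs. 5%).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" scales each class by n_samples / (n_classes * class count),
# so errors on the rare class cost more during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```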

The ideal structured dataset has rows upon rows of entries that are clean (i.e. contain values that fall within expected types and/or ranges) and, well, not missing (a rarity in many real-life scenarios). In a supervised Machine Learning scenario, one of the columns is set aside as the “target”: this is the value that the model is trained to produce, given the rest of the row. An image or an NLP (Natural Language Processing) dataset should similarly be filled with examples that are representative of the real world, i.e. free of “noise”. However, it comes with the additional challenge of data annotation: targets have to be assigned to the training examples. These targets are generally assigned by hand, which, like any manual process, is prone to error and interpretation.
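
As a rough illustration of what “clean” means for a structured dataset, the short pandas sketch below runs the kinds of checks implied above; the column names, the valid range, and the “churn” target are hypothetical placeholders.

```python
# A minimal sketch of basic "cleanliness" checks on a tabular dataset.
# Column names, the valid age range, and the "churn" target are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "age": [34, 29, None, 151],
    "income": [52000, 61000, 48000, 47000],
    "churn": [0, 1, 0, 1],
})

# How many values are missing in each column?
print(df.isna().sum())

# Which rows have an age that is missing or outside a plausible range?
print(df[~df["age"].between(0, 120)])

# The "target" column that the model will be trained to reproduce.
y = df["churn"]
X = df.drop(columns=["churn"])
```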

Getting the labels that you want

In the beginning of a Machine Learning project, someone (typically, the engineer in charge of developing the model) decides how they want their data labeled. In computer vision, for example, the common choices are whether to apply the label to the whole image, to certain parts of it (e.g. bounding boxes), or at the pixel level (i.e. image segmentation). One also needs to come up with a list of possible class labels and decide whether they are mutually exclusive and/or nested. Once these requirements are set, the engineer might label a small subset of the data herself, but for a project that goes beyond a proof of concept, the bulk of the annotation effort is usually outsourced for cost-efficiency, whether to other employees in the company or to a third-party data labeling provider.
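
One way to pin these decisions down before the labeling starts is to write them into a small specification that everyone involved can refer to. The sketch below is only an illustration: the field names and classes are hypothetical and not tied to any particular labeling tool.

```python
# A hypothetical labeling specification, written down before annotation starts.
# Field names and classes are illustrative and not tied to any particular tool.
labeling_spec = {
    "annotation_type": "bounding_box",   # alternatives: "image_level", "segmentation"
    "classes": ["cat", "dog", "other_animal"],
    "mutually_exclusive": True,          # exactly one class per box
    "nested_classes": None,              # e.g. {"dog": ["labrador", "poodle"]}
}
```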


There are certain best practices that can be put in place to make sure the annotators’ labels follow the standards set by the Machine Learning engineer.

Provide annotators with clear instructions and examples

For an image-level classification task, what should the annotators do if objects from multiple classes appear in one photo? Should a bounding box include the parts of an object that are occluded by another object, or should the annotators draw a separate box for each visible part? Should two dogs be outlined and labeled separately, or as a group? Whichever choices you make, make sure the labeling team is aware of them and applies them consistently.

Consolidate multiple annotators’ labels

Even with the most precise instructions, errors are inevitable. To increase the accuracy of the annotations, it is common to have multiple people (say, three) label each training sample and then consolidate their labels. The resulting label could be assigned by a majority vote, or annotations that the labelers disagreed about could be set aside to be validated separately. Naturally, this increases the labeling costs threefold, but the higher accuracy can still be well worth the extra expense, especially when the annotations are crowdsourced.
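
A minimal sketch of such a consolidation step could look like the following, assuming three annotators per sample: labels with a strict majority are accepted, and everything else is set aside for separate validation. The file names and labels are made up.

```python
# A minimal sketch of consolidating three annotators' labels per sample.
# Sample IDs and labels are made up for illustration.
from collections import Counter

annotations = {
    "img_001.jpg": ["cat", "cat", "dog"],
    "img_002.jpg": ["cat", "dog", "bird"],
}

consolidated, needs_review = {}, []
for sample_id, labels in annotations.items():
    label, votes = Counter(labels).most_common(1)[0]
    if votes > len(labels) // 2:      # strict majority: accept the label
        consolidated[sample_id] = label
    else:                             # disagreement: validate separately
        needs_review.append(sample_id)

print(consolidated)   # {'img_001.jpg': 'cat'}
print(needs_review)   # ['img_002.jpg']
```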


Use ground truth labels for quality assurance

Remember the subset that was labeled by the engineer herself? Since these are presumably the best labels available (i.e. the ground truth), one way to assess the quality of the labels assigned by the annotators is to mix the images from the ground truth subset into the data sent out for annotation. A direct comparison between the ground truth labels and those coming from the annotation team can then be made for quality assurance purposes.
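
In code, this comparison can be as simple as the sketch below; the two dictionaries are hypothetical stand-ins for the engineer’s ground truth subset and the consolidated labels returned by the annotation team.

```python
# A minimal sketch of ground-truth QA on the mixed-in subset.
# Both dictionaries are hypothetical placeholders.
ground_truth = {"img_101.jpg": "cat", "img_102.jpg": "dog", "img_103.jpg": "cat"}
team_labels = {"img_101.jpg": "cat", "img_102.jpg": "cat", "img_103.jpg": "cat"}

matches = sum(team_labels.get(k) == v for k, v in ground_truth.items())
agreement = matches / len(ground_truth)
print(f"Agreement with ground truth: {agreement:.0%}")  # 67%
```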

In the event that no ground truth labels are available, quality assurance can be carried out after the fact. Select a small subset of the annotated training samples at random, and verify that the labels conform to what you had in mind. If multiple annotators are involved in the task, it is a good idea to do this separately for each of them: this way, any systematic inconsistencies in the labelers’ understanding of your instructions can be caught early.
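
A rough sketch of such a per-annotator spot check could look like this; the record format is a made-up stand-in for whatever your labeling tool exports.

```python
# A minimal sketch of per-annotator spot checks when no ground truth exists.
# The (sample_id, annotator, label) records are hypothetical.
import random
from collections import defaultdict

records = [
    ("img_201.jpg", "annotator_a", "cat"),
    ("img_202.jpg", "annotator_a", "dog"),
    ("img_203.jpg", "annotator_b", "dog"),
    ("img_204.jpg", "annotator_b", "cat"),
]

# Group the annotated samples by annotator.
by_annotator = defaultdict(list)
for sample_id, annotator, label in records:
    by_annotator[annotator].append((sample_id, label))

# Draw a small random sample per annotator and review those labels by hand.
for annotator, items in by_annotator.items():
    spot_check = random.sample(items, k=min(2, len(items)))
    print(annotator, "->", spot_check)
```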

The objective of the four tips above is to get you the labels that you want: these are the targets that your Machine Learning model will be optimized to reproduce for the training samples. Paired with a powerful GPU, these labels are the key ingredients for building a high-quality model. However, once the model ends up in production, you may sometimes find that this was just not enough. In the upcoming second part of this blog post series, I am going to list a few other things to watch out for when you define the data annotation task.
