Active Learning, part 1: the Theory

Active learning: when and why do we need it?

Active learning is still a relatively niche approach in the machine learning world, but that is bound to change. After all, active learning provides solutions to not one, but two challenging problems that have been poisoning the lives of many a data scientist. I am, of course, talking about the data: namely, its (a) quantity and (b) quality.

Let us start with the former. It is no secret that training modern machine learning (ML) models requires large quantities of training data. This is especially true for deep learning (ML carried out via deep artificial neural nets), where it is not uncommon for training sets to number in the hundreds of thousands and beyond. To make matters worse, many practical applications come in the form of supervised ML tasks: i.e., not only do we require all these training samples, but we also need a way to label them. Labeling data is time-consuming and expensive. It is bad enough if it involves human experts manually annotating text or images, but what if labeling entails invasive medical tests to confirm a patient's diagnosis? Or drilling down into the rock to test for oil? There are plenty of scenarios where unlabeled data may be easy to obtain, yet the labeling budget may impose severe limitations.

This is precisely where semi-supervised learning shines: leveraging unlabeled data to achieve a supervised ML task with fewer labels than would be needed otherwise. It is a well-established fact that the performance of semi-supervised models often depends strongly on which training samples happen to be labeled. Roughly speaking, the more "representative" or "informative" the labeled samples are, the better. If we have the choice of what samples to label, however, how can we determine which ones would be of most use for our task? Sometimes this can be determined by manual inspection (although combing through unlabeled data by hand does not scale particularly well), and other times it cannot. Would it not be nice if the model itself could tell us which datapoints it would prefer to know the labels for? Well, it can - in fact, that is precisely what active learning is all about.

Paraphrasing George Orwell, some training samples are more equal than others!

The issue of data quality does not only manifest itself in the semi-supervised setting. Even if all of your data is already labeled, more is not always better when it comes to training an ML model. On the contrary, outliers and other types of spammy data may lead your model astray. If these represent a significant portion of your training set, a model trained in an active learning regime, where unhelpful data is ranked down, may even outperform a fully supervised model that had access to the entire dataset from the start!

Active learning: what is it?

Active learning is part of the so-called Human-in-the-Loop (HITL) machine learning paradigm. The idea behind HITL is to combine human and artificial intelligence to solve various data-driven tasks. Depending on how you look at it, all of machine learning is at least somewhat HITL, but some areas more than others. In active learning, human participation is as explicit as it is iterative:

  1. The oracle (the source of the ground truth labels, e.g. the human expert) supplies the model with some labeled data.
  2. The model gets trained on those labeled samples, and any others it may have gotten previously (it is then, most likely, tested on the validation set to keep track of its performance).
  3. The model determines which unlabeled samples it would most like to have labeled next, and sends a request to the oracle.
  4. And on and on it goes. Ideally, you would repeat steps 1-3 until you are satisfied with the model's performance, but in the real world you are more likely to stop either when performance plateaus or when you run out of labeling budget. (A minimal code sketch of this loop follows below.)
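To make the loop concrete, here is a minimal sketch in Python. It uses scikit-learn on a synthetic dataset, a logistic-regression classifier, a least-confidence query strategy (one of several strategies discussed further down), and a labeling budget of 20 samples per round. All of these choices are illustrative assumptions, not requirements of active learning.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a large pool of data with a small labeling budget.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_pool, X_val, y_pool, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
labeled_idx = list(rng.choice(len(X_pool), size=20, replace=False))  # step 1: initial labels
budget_per_round, n_rounds = 20, 10

model = LogisticRegression(max_iter=1000)
for round_ in range(n_rounds):
    # Step 2: train on everything labeled so far and track validation performance.
    model.fit(X_pool[labeled_idx], y_pool[labeled_idx])
    print(f"round {round_}: val accuracy = {accuracy_score(y_val, model.predict(X_val)):.3f}")

    # Step 3: ask for the unlabeled samples the model is least certain about
    # (least-confidence querying; other strategies are discussed below).
    unlabeled_idx = np.setdiff1d(np.arange(len(X_pool)), labeled_idx)
    confidence = model.predict_proba(X_pool[unlabeled_idx]).max(axis=1)
    query = unlabeled_idx[np.argsort(confidence)[:budget_per_round]]

    # Step 4: the "oracle" labels the queried samples (here we simply reveal y_pool).
    labeled_idx.extend(query.tolist())
```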

The model in question can be any supervised ML algorithm; active learning puts no restrictions on you in that regard. For the sake of example, let us assume that we are dealing with image classification. Implementation-wise, you can think of active learning as something that is wrapped around your classifier. In fact, the classifier itself does not require any changes compared to its plain old passive learning version. Passive learning is basically the kind of machine learning that we are all used to, where labeled examples are sampled at random, rather than in accordance with the feedback received from the classifier. Standard supervised machine learning can be viewed as a special case of passive learning - one where you just happened to randomly sample all of your available training data!

The most nontrivial part of active learning lies in step 3 above: how does the model decide which samples are the most beneficial to label at the current stage? It turns out there are multiple ways of doing this.

Active learning: how is it done?

Chicken or egg?

This cat would have preferred an egg.

Let us first state the obvious: one has to start somewhere. Simply initializing your classifier to a random state and throwing unlabeled data at it will not get you very far. Better alternatives include:

  • Transfer learning: pre-train your classifier on another labeled dataset in the hopes that some of that knowledge will carry over to your data.
  • Random query: label some number of training samples at random, and use them to train the initial classifier (or fine-tune a pre-trained one, if you fancy combining the two strategies).

There are other, more sophisticated things that one can do, of course. You can, for example, try clustering your data first, and sample points from each cluster. This approach is not always possible, but it is particularly helpful when we do not know how many classes there are in the underlying data distribution (a sketch of this idea follows below).
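As a hedged illustration of the clustering idea, the sketch below seeds the initial labeled set by running k-means on the unlabeled pool and sending the sample closest to each cluster centre to the oracle. The synthetic data, the use of k-means, and the choice of five clusters are all assumptions made purely for the sake of the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled pool (synthetic); in practice this would be your raw, unlabeled data.
X_pool, _ = make_blobs(n_samples=1000, centers=5, n_features=10, random_state=0)

# Cluster the pool; the number of clusters (here 5) is a guess, not ground truth.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_pool)

# For each cluster, pick the sample nearest to its centre as part of the initial seed set.
seed_idx = []
for c, centre in enumerate(kmeans.cluster_centers_):
    members = np.where(kmeans.labels_ == c)[0]
    nearest = members[np.argmin(np.linalg.norm(X_pool[members] - centre, axis=1))]
    seed_idx.append(nearest)

print("indices to send to the oracle first:", seed_idx)
```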

However we choose to start our training, there are several decisions to be made along the way. One of them is:

Streaming or Pooling?

In a streaming scenario, the model is presented with training samples one at a time. The model will then either ignore the sample or query its label. In our classifier example, one could imagine there being a hyperparameter that is a probability threshold for the most likely class. If the sample's probability does not reach this threshold, the oracle gets a query; otherwise the sample goes to the ignore pile. The idea here is that if we are quite sure about the sample's label, there is a good chance that we are right, and confirming our suspicions is not the best use of our labeling budget.
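A minimal sketch of this stream-based setting might look as follows; the 0.85 confidence threshold, the `model` object with a `predict_proba` method, and the `oracle_label` function are all placeholders standing in for whatever classifier and labeling process you actually use.

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.85  # hyperparameter: how sure we must be to skip a query

def process_stream(model, sample_stream, oracle_label):
    """Stream-based active learning: query the oracle only for low-confidence samples.

    `model` is assumed to expose predict_proba(); `oracle_label(x)` is assumed to
    return the ground-truth label for a single sample. Both are placeholders.
    """
    queried = []
    for x in sample_stream:
        proba = model.predict_proba(x.reshape(1, -1))[0]
        if proba.max() < CONFIDENCE_THRESHOLD:
            # Not confident enough: spend labeling budget on this sample.
            queried.append((x, oracle_label(x)))
        # Otherwise ignore the sample and move on to the next one.
    return queried
```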

Another way to do active learning is via pooling. In this case the model evaluates the class probabilities for all of the unlabeled data, and selects some part of it to be labeled at the next iteration. The selection is based on a query strategy, and this is where things start to get really interesting:

Which query strategy?

With active learning being such a young and rapidly developing field, there is no shortage of query strategies floating around. The most common one, however, is uncertainty sampling, where the model wants to get the labels for the samples whose class assignment it is most uncertain about. This can take a few different forms:

  • Least confidence: choose the samples for which the predicted probability of the most likely class is lowest. One downside of this method is that it is prone to picking up outliers and other types of spam in the data.
  • Margin sampling: look for samples with the smallest difference between the probabilities of the most likely and the second most likely labels. These are likely to lie in between two classes in feature space, so annotating them helps in locating the decision boundary for the classifier.
  • Entropy sampling: a similar idea, except that the uncertainty score is the entropy of the full predicted class distribution, so all classes contribute rather than just the top one or two. (All three forms are illustrated in the sketch after this list.)
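The three flavours differ only in how they score a vector of predicted class probabilities, which the sketch below tries to make explicit. The probabilities are assumed to come from some classifier's `predict_proba`-style output; the example array at the bottom is made up purely for illustration.

```python
import numpy as np

def least_confidence(proba):
    """Lower max probability = more uncertain; returns a score to maximize."""
    return 1.0 - proba.max(axis=1)

def margin(proba):
    """Small gap between the top two classes = more uncertain."""
    top_two = np.sort(proba, axis=1)[:, -2:]
    return -(top_two[:, 1] - top_two[:, 0])  # negate so "higher = more uncertain"

def entropy(proba):
    """Entropy over all classes: high when probability mass is spread out."""
    return -np.sum(proba * np.log(proba + 1e-12), axis=1)

# Example: scores for three hypothetical samples in a 3-class problem.
proba = np.array([[0.90, 0.05, 0.05],   # confident
                  [0.50, 0.45, 0.05],   # two competing classes
                  [0.34, 0.33, 0.33]])  # maximally unsure
for fn in (least_confidence, margin, entropy):
    print(fn.__name__, np.round(fn(proba), 3))
```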

In addition to uncertainty sampling, there are other possible approaches, such as:

  • Query-by-committee: train multiple models and select the samples that these classifiers disagree about the most (sketched after this list).
  • Expected model change: choose the instance that would lead to the greatest change in the classifier if we were to find out its label. How can we tell? One way would be to compute the gradients for a loss function that is averaged over all label possibilities for the given instance. Not a bad strategy, but computationally expensive.
  • Expected error reduction: if you thought that expected model change was computationally expensive, wait till you see this one. Instead of looking for the greatest gradients, you compute the expected change in the validation error, again averaged over the possible class labels, which typically means retraining the model for every candidate sample and label.
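Of these, query-by-committee is perhaps the easiest to sketch in code. The example below trains a small committee of decision trees on bootstrap resamples of the labeled data and scores unlabeled samples by vote entropy, i.e. how evenly the committee's votes are split. The committee size, the choice of decision trees, and the synthetic data are all assumptions made for demonstration purposes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# One synthetic dataset: the first 200 samples are "labeled", the rest are the unlabeled pool.
X, y = make_classification(n_samples=700, n_features=10, random_state=0)
X_lab, y_lab, X_unlab = X[:200], y[:200], X[200:]

rng = np.random.default_rng(0)
committee = []
for seed in range(5):  # five committee members, each trained on a bootstrap resample
    idx = rng.choice(len(X_lab), size=len(X_lab), replace=True)
    committee.append(DecisionTreeClassifier(random_state=seed).fit(X_lab[idx], y_lab[idx]))

# Vote entropy: how evenly the committee's votes are split for each unlabeled sample.
votes = np.stack([m.predict(X_unlab) for m in committee])           # shape (members, samples)
vote_share = np.stack([(votes == c).mean(axis=0) for c in (0, 1)])  # shape (classes, samples)
disagreement = -np.sum(vote_share * np.log(vote_share + 1e-12), axis=0)

query = np.argsort(disagreement)[-20:]  # the 20 pool samples the committee disagrees on most
print("pool indices to query:", query)
```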

With a query strategy at hand, you are ready to start the iterative process of active learning. (For the actual code example, stay tuned for part 2!) But before we call it a day, let us look at some of the common issues that arise in real use cases and the ways to deal with them:

Active learning: what could go wrong?

The major potential issue was already mentioned when we discussed the uncertainty sampling approaches above. The least confidence method in particular (although it is not the only one) has the unfortunate property of gravitating towards outliers. In practice, one solution is to alternate between a few different query strategies, e.g. least confidence on one round, random sampling on the next, then margin sampling, and so on. Another reason to employ multiple query strategies (including the random one!) is to make sure that we explore more of the relevant feature space instead of focusing on a limited area.
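One hedged way to implement such a rotation is to keep a list of interchangeable scoring functions and cycle through them from one labeling round to the next. The particular schedule below (least confidence, then random, then margin) mirrors the example in the text and is just one possible choice; the scoring functions repeat the earlier sketch so that the snippet stays self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three interchangeable scoring functions (higher score = more worth labeling);
# least_confidence and margin repeat the earlier sketch for self-containedness.
def least_confidence(proba):
    return 1.0 - proba.max(axis=1)

def random_query(proba):
    return rng.random(len(proba))

def margin(proba):
    top_two = np.sort(proba, axis=1)[:, -2:]
    return -(top_two[:, 1] - top_two[:, 0])

strategy_cycle = [least_confidence, random_query, margin]

def select_queries(proba, iteration, batch_size=20):
    """Rotate through the strategies from one labeling round to the next."""
    strategy = strategy_cycle[iteration % len(strategy_cycle)]
    return np.argsort(strategy(proba))[-batch_size:]
```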

Speaking of feature space exploration, in the event that we do not know what classes are present in our problem, we face the difficult challenge of class discovery. This is not specific to active learning, but having a limited labeling budget (probably the reason we are using AL in the first place) certainly does not help. So how do you train a classifier when you do not know how many classes there are? Apart from unsupervised clustering (perhaps combined with encoding samples into a reduced-dimensionality latent space via a self-supervised autoencoder), you would likely have to rely on whatever you get via random selection.

Active learning may not be a one-size-fits-all solution, but it is without a doubt a heavily under-utilized technique that can, and will, bring a lot of value to your commercial machine learning projects.

For an example of active learning in action, jump to the second part: Active Learning, part 2: the Practice!
