Making Active Learning Practical - Batch Queries

In our last post, we learned about the active learning framework. In it, we have access to a large pool of easily gathered unlabeled data, and our learner helps reduce the labeling costs by asking for labels one at a time from the unlabeled pool. We saw a few possible query strategies the learner can use, and how they can significantly reduce the labeling costs.

This is all very nice, but who has time to sit and label examples one at a time, when we must wait for the learner to train a neural network before every single label request? Furthermore, we usually want to have a large batch of examples to label, so we can send them to many different annotators who can label them in parallel… So, to make the ideas from the last post applicable to modern machine learning tasks, we need to slightly tweak our framework.

In each iteration, the learner will train a model on the labeled examples like before, only now it will query a batch of examples instead of a single one. The size of this batch is a free parameter of our framework and depends on the way we do the labeling. Overall this seems like a rather minor change to our framework, but it adds some complexity.

The Effects of the Batch Size

There are a few effects that the batch size has on our learning process. On the plus side, we get to label examples in parallel and suffer less of the delay that comes from waiting for our neural network to train between queries. But this plus side has a cost.

On the down side, the learner has to select examples with less information at hand. If the learner has to query a batch of size 100, then the 100th example is chosen without seeing the labels of the first 99 examples. If we had those labels, the learner might have chosen a different 100th example, one which would have been better for our overall accuracy in the long run.

But there is an even worse down side to the batch setting. The learner can ask for a batch of 100 examples such that each of them is very informative, but the information they give us is too similar. Intuitively, if the learner has trouble confidently classifying the digit “0”, it should be enough to get a few relevant examples of difficult “0”s for it to improve. But if the learner asks for 100 examples of “0”s, it isn’t using its batch budget very well…

This sort of problem can make a greedy strategy that chooses the top-K examples according to a standard score perform worse than random sampling. So intuitively, a good query strategy for the batch setting should not only choose informative examples for the batch, but also choose examples that are different from each other.

Query Strategies for the Batch Setting

There were many papers published in recent years about active learning for neural networks, and we probably missed a few of them. Still, we’ll try to detail the most notable modern methods, and for those we also provide implementations and code to recreate our experiments. In this post we’ll focus on giving concise explanations and intuitions to these methods, and the next post will be all about comparing the methods to see which is best.

Greedy Query Strategies

These strategies are adaptations for neural networks of classic query strategies which only query one label at a time. They do not consider the mutual information between examples in a single batch, but rather simply choose the top-K examples under the relevant score.

Uncertainty-Based Strategy

We saw in the previous post that uncertainty sampling performs really well in the classical setting, and in general it is the most widely used query strategy due to its effectiveness and simplicity. It isn’t surprising then that this query strategy has been adapted to neural networks. The most straightforward adaptation of uncertainty sampling is to simply treat the softmax scores of the neural network as a probability distribution over the labels \(\hat{P}(y|x)\), and use the regular decision rules. This approach is reasonable, but it is questionable how much the softmax scores can be interpreted as probabilities… They look like probabilities, but the training process of the neural network just pushes the softmax scores of the training examples to be as close to one-hot as possible. If the model is able to get 100% training accuracy, it can simply ramp up the norm of the final weight layer to become arbitrarily confident about the training set. This makes the model overly confident about any input, so the resulting distribution is hard to trust without proper regularization of the model.
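
Before getting to the fixes, here is a minimal sketch of what the straightforward greedy softmax version could look like, assuming a PyTorch `model` that outputs logits and an `unlabeled_loader` over the unlabeled pool (both hypothetical names); the entropy of the softmax output serves as the uncertainty score.

```python
import torch
import torch.nn.functional as F

def softmax_entropy_scores(model, unlabeled_loader, device="cpu"):
    """Score each unlabeled example by the entropy of the model's softmax output.

    Higher entropy means the model is less certain about the label.
    """
    model.eval()
    scores = []
    with torch.no_grad():
        for x, _ in unlabeled_loader:              # labels in the pool are ignored
            probs = F.softmax(model(x.to(device)), dim=1)
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
            scores.append(entropy.cpu())
    return torch.cat(scores)

# greedy batch query: take the K highest-entropy examples
# query_indices = softmax_entropy_scores(model, unlabeled_loader).topk(100).indices
```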

To deal with this issue, many people are working on ways of making neural networks output results which more accurately represent their confidence on a given input. The most well-known work, and one which has been adapted to the active learning setting, is Yarin Gal’s work on Bayesian Neural Networks. In this work, called DBAL, the authors show that we can get an approximation of the posterior probability over the labels by running the input many times through the network with dropout turned on and then averaging the resulting softmax scores. This method can be easily implemented for most modern neural network architectures since they already use dropout for regularization. The resulting probability distribution over the labels after \(T\) iterations is then simply:

\[ \hat{P}(y|x) \approx \frac{1}{T} \sum_{t=1}^{T} \text{softmax}\left(f(x;\hat{\omega}_t)\right) \]

where \(\hat{\omega}_t\) are the network weights with the dropout mask sampled at iteration \(t\).

This new confidence estimation can now be plugged into any of the possible uncertainty decision rules to choose the top-K examples. In their paper the authors compare many possible decision rules, and the best two were the ones presented in the previous post for uncertainty sampling.
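
As a rough sketch of how this could look in code (not the authors' implementation), assuming a PyTorch `model` that contains dropout layers and a tensor `unlabeled_x` of pooled inputs (both hypothetical names), with entropy as the example decision rule:

```python
import torch
import torch.nn.functional as F

def mc_dropout_probs(model, x, T=20):
    """Approximate P(y|x) by averaging T stochastic forward passes with dropout on.

    Note: model.train() also switches batch-norm layers to training mode; in a
    real implementation you would enable only the dropout modules.
    """
    model.train()                                  # keep dropout active at inference
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=1) for _ in range(T)])
    return probs.mean(dim=0)                       # shape: (num_examples, num_classes)

def dbal_style_query(model, unlabeled_x, k=100, T=20):
    """Pick the k examples with the highest predictive entropy under MC dropout."""
    probs = mc_dropout_probs(model, unlabeled_x, T=T)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    return entropy.topk(k).indices                 # indices into unlabeled_x
```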

Margin-Based Strategy - Adversarial Active Learning

In the previous post we saw that for linear models there is an equivalence between choosing examples which have a high uncertainty and choosing examples which are closest to the decision boundary. This is because the predicted probability of an example is determined by its inner product with the weight vector (which is orthogonal to the decision boundary), in other words by its signed distance from that boundary.

In neural networks though, this is hardly the case. The uncertainty score in neural networks is calculated with respect to the final representation layer. This means that choosing according to the uncertainty score is equivalent to choosing examples whose representation is closest to the decision boundary. The thing is, because the network is deep and complex, small changes to the input image can cause wild changes to the representation of the image. This is best seen in adversarial examples.

Adversarial examples are a fascinating and quirky property of neural networks which stems from a very simple fact - in the same way we can calculate the derivative of the loss with respect to the weights of the network, we can calculate the derivative with respect to the input. So, instead of changing the weights in the opposite direction of the gradient, we can change the input image in the direction of the gradient and get a new image which slightly increases the loss. Doing this a few times can result in an image which is completely misclassified by the model, while looking basically the same as the original image. This property of neural networks and its implications have led to a thriving field of neural network security research.
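
For intuition, here is a minimal sketch of a single FGSM-style step, assuming a PyTorch `model`, an input batch `x` and its current labels `y` (all hypothetical names); it nudges the input in the direction of the loss gradient:

```python
import torch.nn.functional as F

def fgsm_step(model, x, y, epsilon=0.01):
    """Take one gradient-sign step on the input, slightly increasing the loss.

    Repeating this a few times often produces an image the model misclassifies
    while it still looks essentially the same to a human.
    """
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()
```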

So, coming back to active learning, we would like to be able to find the images which are closest to the decision boundary of our network, but the distance of the representation of the inputs to the softmax decision boundary is a bad proxy for that. The actual distance to the decision boundary, like many things in the world of deep learning, is intractable. So, what can we do?

The cool solution Melanie Ducoffe came up with, called DFAL, is to use adversarial examples to get a better proxy for the distance to the decision boundary. DFAL takes each image in the unlabeled set, gets its classification according to the network and then finds an adversarial example for it, such that the network makes a wrong classification on that image. It then calculates the distance between the original image and the adversarial example and chooses the top-K examples with the smallest distance to their adversarial example. Basically, it forces each image to cross the decision boundary and checks how far it had to go to cross it.

This doesn’t mean that the adversarial example is the closest possible to the decision boundary, but it does give a reasonable heuristic upper bound on the distance to the decision boundary, which DFAL attempts to minimize.
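
A sketch of this query strategy might look as follows. The paper builds on the DeepFool attack; here a simpler iterative gradient-sign attack stands in for it, and `model` and `unlabeled_images` are hypothetical names:

```python
import torch
import torch.nn.functional as F

def adversarial_distance(model, x, epsilon=0.01, max_steps=50):
    """Heuristic distance from image x (shape (1, C, H, W)) to the decision boundary.

    Repeatedly nudge the image in the gradient direction until the model's
    prediction flips, then return the L2 norm of the total perturbation.
    """
    model.eval()
    with torch.no_grad():
        original_pred = model(x).argmax(dim=1)
    x_adv = x.clone().detach()
    for _ in range(max_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), original_pred)
        loss.backward()
        x_adv = (x_adv + epsilon * x_adv.grad.sign()).detach()
        with torch.no_grad():
            if model(x_adv).argmax(dim=1).item() != original_pred.item():
                break                              # the prediction flipped
    return (x_adv - x).norm().item()

def dfal_style_query(model, unlabeled_images, k=100):
    """Choose the k images whose adversarial example is closest to them."""
    distances = torch.tensor([adversarial_distance(model, img.unsqueeze(0))
                              for img in unlabeled_images])
    return distances.topk(k, largest=False).indices
```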

Model Change Strategy - EGL

EGL was mentioned in passing in the previous post, so just to remind ourselves - the decision rule for EGL is to choose the examples that have the largest expected gradient length, where the expectation is over the probability distribution that the model predicts for the labels for each example. Intuitively, examples that we expect to change our model a lot have a lot of information in them.
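
A naive sketch of the EGL score for a classifier could look like this (one backward pass per possible label, so it is expensive); `model` and the single-example input `x` of shape (1, ...) are hypothetical names:

```python
import torch
import torch.nn.functional as F

def expected_gradient_length(model, x):
    """EGL score for a single example x: the expected norm of the gradient the
    example would induce, with the expectation taken over the model's own
    predicted label distribution."""
    with torch.no_grad():
        probs = F.softmax(model(x), dim=1).squeeze(0)
    score = 0.0
    for label, p in enumerate(probs):
        model.zero_grad()
        loss = F.cross_entropy(model(x), torch.tensor([label]))
        loss.backward()
        grad_norm = torch.sqrt(sum((prm.grad ** 2).sum()
                                   for prm in model.parameters()
                                   if prm.grad is not None))
        score += p.item() * grad_norm.item()
    return score

# query: compute the score for every unlabeled example and take the top-K largest
```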

A team from Baidu adapted this method for speech recognition, using it to choose from an unlabeled set of utterances and using a deep recurrent neural network for the architecture.

Batch-Aware Query Strategies

The Core Set Approach

This strategy takes the opposite approach to the previous ones. Instead of greedily choosing the examples that maximize some per-example score, this method drops the scoring approach altogether and instead tries to find a batch that is as diverse as possible and represents the complete data distribution. But how can we find a batch which represents the entire unlabeled set as well as possible?

The solution, proposed independently by Ran El-Yaniv’s group and Silvio Savarese’s group (with more impressive empirical results in Silvio’s paper), draws its inspiration from the concept of core sets. A core set is a subset of a dataset, such that when a model is trained on the subset it will produce a function that is close to a model resulting from training on the entire dataset. This concept seems to be exactly what we want - active learning is basically a core set selection problem. If we were able to query a batch that is a core set of the unlabeled set, we would quickly be able to gather labeled data such that the resulting model will be as good as the model trained on the entire distribution.

In Silvio’s paper, the authors show that under some assumptions on the model and the loss function, we can minimize a (not very tight) upper bound on the generalization error of the model by choosing a batch which covers the unlabeled set as much as possible, in an \(\epsilon\)-cover sense. This means that we should choose a batch such that, when added to the labeled set, the maximum distance between an unlabeled example and its nearest labeled example is minimized.

If this sounds like a hard thing to do, it’s because it is. Still, there is a simple greedy algorithm which guarantees a 2-approximate solution for this problem - repeatedly choose the example whose distance from its closest labeled (or already chosen) example is largest, until you have a full batch. This algorithm is fast to run and results in a batch which is guaranteed to be diverse and cover the unlabeled set relatively well.
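
A minimal sketch of this farthest-first greedy selection, assuming we already have embedding matrices `labeled_emb` and `unlabeled_emb` of shape (n, d) (hypothetical names; more on how to get them below):

```python
import numpy as np

def greedy_k_center(labeled_emb, unlabeled_emb, budget):
    """Greedy 2-approximation for the core-set (k-center) selection problem.

    Repeatedly picks the unlabeled point that is farthest from its nearest
    labeled or already-selected point. For large pools the pairwise distance
    matrix should be computed in chunks; this version keeps it simple.
    """
    # distance from every unlabeled point to its nearest labeled point
    min_dist = np.min(
        np.linalg.norm(unlabeled_emb[:, None, :] - labeled_emb[None, :, :], axis=2),
        axis=1,
    )
    selected = []
    for _ in range(budget):
        idx = int(np.argmax(min_dist))             # farthest-first choice
        selected.append(idx)
        # the chosen point now also "covers" the points near it
        dist_to_new = np.linalg.norm(unlabeled_emb - unlabeled_emb[idx], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected                                # indices into the unlabeled pool
```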

Still, the fun doesn’t end there because, while solving the original problem is NP-hard, that doesn’t mean we can’t try to improve on the approximate solution. The authors formulate the problem as a mixed integer program (MIP) which tries to find solutions that minimize the maximum distance, and give the approximate solution as an initial starting point for the MIP solver. Even better, they formulate the MIP so that it is robust to outliers - a constant number of examples are allowed to have a distance larger than the maximum distance. This addition is important since the original greedy algorithm is susceptible to outliers, as it keeps choosing the examples which are farthest away.

There are no guarantees that we will be able to improve on the greedy solution in a reasonable amount of time using the above formulation, but the relaxation of allowing outliers usually leads to a small improvement (and makes the chosen batch contain fewer outliers).

There is only one thing we need to sort out - the distances between images in pixel space have little to do with the semantic distance between them, so just using the Euclidean distance between the raw images isn’t the best idea. What this method does is use the image representation learned by the neural network that was trained on the labeled set as the space in which to calculate the distances between images. So, we simply need to train the network on the labeled set, get the network’s embedding of the entire dataset and then run the core set method on that embedding.
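
For example, a sketch of extracting such an embedding in PyTorch, assuming a hypothetical model whose `features` sub-module computes everything up to the final classification layer:

```python
import torch

def get_embeddings(model, loader, device="cpu"):
    """Collect the network's learned representation for every example in `loader`."""
    model.eval()
    embeddings = []
    with torch.no_grad():
        for x, _ in loader:
            # `model.features` is assumed to return the penultimate-layer activations
            embeddings.append(model.features(x.to(device)).flatten(1).cpu())
    return torch.cat(embeddings).numpy()

# labeled_emb = get_embeddings(model, labeled_loader)
# unlabeled_emb = get_embeddings(model, unlabeled_loader)
# batch_indices = greedy_k_center(labeled_emb, unlabeled_emb, budget=100)  # sketch above
```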

Flexibility of the Core Set Approach

This approach has a unique advantage which isn’t mentioned in the papers but is very relevant for practical purposes. So far, we have talked about active learning only in the setting of classification problems, but that doesn’t mean that we wouldn’t want to use it on other problems (regression problems, structured learning, segmentation and so on). On the contrary, these different machine learning problems are usually harder to label so we would like to use active learning especially on things which aren’t classification. All of the methods we’ve talked about besides core set assume we have access to a learned \(\hat{P}(y|x)\), but when our problem isn’t a classification problem, this isn’t true. While we could try to adapt the other methods to new scenarios, the core set approach simply assumes access to a learned representation of the input data, which is something that exists for a larger variety of scenarios when neural networks are involved. So, this method automatically extends to many types of machine learning problems, which is a real plus.

We can also think about extending this method to working on any type of data and problem by having the learned representation be an embedding learned by an autoencoder. In this extension, we can just train an autoencoder on the entire dataset (both the labeled and unlabeled set), and then iteratively choose batches which cover the embedding. We won’t even have to retrain the autoencoder between batches, since our objective is getting a good cover and not maximizing some individual example score.
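
As a rough sketch of this extension, here is a small fully connected autoencoder trained on the whole pool; the architecture, sizes and full-batch training loop are all illustrative choices, not anything prescribed by the papers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallAutoencoder(nn.Module):
    """A toy autoencoder for flat inputs of dimension `input_dim`."""
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def autoencoder_embedding(data, epochs=50, lr=1e-3):
    """Train on the full (labeled + unlabeled) pool and return the code vectors.

    `data` is an (n, input_dim) tensor; full-batch training keeps the sketch short.
    """
    model = SmallAutoencoder(input_dim=data.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.mse_loss(model(data), data)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model.encoder(data).numpy()         # feed this to the core set selection
```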

Coordinate Matching

This strategy isn’t practical for neural networks, as we will soon see, but it is worth a mention because of its unique approach. What Glencora Borradaile’s group suggest is that the classic methods for the single query setting are close enough to optimal that we can replace the objective of maximizing some example score with the objective of choosing a K-sized batch which is as close as possible to what we would have gotten after K iterations in the classic setting. If we successfully select a batch which is close to what the standard iterative methods would have given us, then we should get results similar to the ones from the classic setting, which were great.

So how can we try to choose a batch which is close to the next K choices of a classic active learning algorithm? This is where it gets impractical for deep learning…

This method uses Monte Carlo sampling to simulate many sequences of K choices of the classic algorithm, and then tries to combine the different sequences into one single batch of size K. Assume, for example, that the method we are trying to mimic is uncertainty sampling. What we’ll do is sample a single example using uncertainty sampling and then label the example with a label drawn from our learned distribution over labels. We will now pretend this is the correct label, fit a new model to the labeled examples and repeat this process until we have a sequence of K examples. We then start over and run the above process T times, until we have T possible sequences of K examples.

If we trust our model’s posterior over the labels, these simulated sequences should resemble the sequences the real sequential algorithm would have produced. Having many of these sequences should reduce the variance caused by the sampling process, and what’s left for us to do is extract the K examples from the sequences which best capture that behavior. In the paper there are two approaches suggested for dealing with this objective, one variational and the other based on a greedy optimization algorithm. We won’t detail them here since the method isn’t something we can implement with neural networks in any reasonable running time…

Summary

In this post we extended the basic active learning framework to a more practical setting for today’s models - batch active learning. We saw that this extension adds complexity to our problem, because now the mutual information between examples in a batch can affect our performance. Also, we saw that using a more complex model like neural networks means that it’s harder to accurately assess the model’s uncertainty and decision boundary.

We covered a few leading algorithms for batch active learning for neural networks - Bayesian uncertainty sampling (DBAL), adversarial active learning (DFAL) and the core set approach. These methods tackle the active learning problem from wildly different directions, and each one intuitively sounds like it should do a good job.

In the next post we’ll compare these methods and others and try to understand where each method shines. We’ll look at the effect of the batch size on the effectiveness of these methods, try to look at the batch composition of the methods and hopefully say something intelligent about applying these methods in practice.