“The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks” Paper Summary & Analysis
Objectives/ Goals of the Paper
What problem is the paper tackling?
In Machine Learning, pruning techniques that remove unnecessary parameters are used in neural networks to improve speed and decrease model size without materially affecting accuracy. Earlier work has demonstrated that removing up to 90% of model parameters is possible without reducing accuracy, dramatically speeding up inference once the model is trained. However, up to this point, these models have not been very accurate to train with. The paper’s main contribution is finding a sparse representation of dense neural networks, which improves computational efficiency and generalization. The paper proposes that, given a randomly initialized neural network, there is a subnetwork (with the same initialization) such that, when trained independently, the sparse network can match or improve upon the performance of the original network in a similar or smaller number of iterations.
More formally, the paper presents the following hypothesis: “A randomly-initialized, dense neural network contains a subnetwork that is initialized such that — when trained in isolation — it can match the test accuracy of the original network after training for at most the same number of iterations.” Calling it the lottery hypothesis, the authors then experimentally show this hypothesis is true with a series of experiments on convolutional neural networks trained for basic computer vision tasks.
The authors present an algorithm that can identify a “winning ticket” by pruning the weights with the smallest magnitudes, removing those nodes entirely, and then retraining the network again (iterative pruning). Successive iterations make the networks smaller, while maintaining the accuracy of the original larger model.
How are the paper’s contributions different from previous related works?
They used iterative pruning vs. one-shot pruning, and tested the results through various architectures, as well as its synergy with Dropout. Additionally, while previous work has only tested on random reinitialization of weights, this paper shows that with proper reinitialization, performance can be greatly enhanced.
The paper assessed its results based on how much the algorithm was able to cut out of the neural network while maintaining the accuracy of the model. For example, one implementation of the algorithm was able to cut out roughly 98% of the model’s nodes while maintaining a similar level of accuracy, creating a much more compact model.
By using this method, the authors showed that the reason for large neural networks performing well is due to the fact that they can have many different valid sub-networks which can fit the model with a smaller dimension — a fact that can be leveraged in order to extract smaller networks.
Paper Limitations, Further Research, & Potential Applications
A major limitation of the paper is that it only runs on relatively small datasets such as MNIST and CIFAR-10, and not on larger datasets like ImageNet. Further work from CSAIL extended this paper and ran it on ImageNet, improving on the methods first developed on this paper. Another problem with the current method is that iterative pruning is very computationally intensive, requiring us to train the network 15 or more times. Further work has worked on reducing this computational requirement.
A useful explanation in the experimental section could have involved why the training accuracy was also increasing as the percentage of weights remaining decreased from 100% to about 13% in figure 4. We should expect the network with more parameters to overfit more strongly to the training dataset, and hence the accuracy should monotonically decrease as the percentage of weights remaining decreases. For the test accuracy, an improvement in accuracy as more weights are annulled is intuitively explained as the winning ticket network might overfit less strongly, but we should expect the opposite for the training data (where more overfitting should result in higher accuracy, assuming decreasing the loss generally results in increased accuracy). This is likely explained by the fact that the loss surface for a network of smaller size likely contains significantly fewer local optima, but a more fleshed-out explanation would have been nice.
What potential directions can research in this area take?
In the future, the authors intend to further study the properties of working initializations together with the inductive biases of the pruned network architectures that make the networks particularly good at learning.
A useful investigation might also include the effect of winning-tickets for a fine-tuning task; i.e. if we take a network trained on ImageNet, find its winning ticket and then retrain the winning ticket on ImageNet, would that network perform better when transferred to a new domain (say satellite image classification) than the full dense network trained on ImageNet? More concretely, how would the winning ticket sub network for a given task generalize to a different domain?
Moreover, the authors weren’t able to find the winning tickets without learning rate warmup. Therefore, future work should investigate why the warmup is necessary and whether there are other alternatives to this.
How can the ideas introduced by this paper be used in applications?
This work can be used to take a model and introduce a strong inductive bias upon the model, similar to how the architecture or design of a model also encodes some inductive bias in terms of how the information is processed. Using methods such as this would allow machine learning practitioners to worry less about the training of the model but rather focus on finding the right mask to use for the model. This will make it less sensitive to a lot of hyperparameters but will cause the initialization to have a much greater impact than previously experienced.
Another potential benefit comes from the robustness of the model to the removal of extra parameters. Especially within the field of NLP, there is a large focus on using these large pre-trained language models, which perform very well on specialized tasks but are not able to handle adversarial perturbations or generalize to out of distribution data and may potentially impose unfairness that comes from the data. With a sparser, more generalized model the hope is that it will help to correct some of these issues and make a more robust model with an equivalent level of performance.