Do Better ImageNet Models Transfer Better? Paper Summary & Analysis
It is often implicitly assumed that models which perform well on ImageNet will also perform well on other CV tasks. This paper empirically investigates whether models trained on ImageNet transfer well because of the ImageNet pretraining itself, or simply because their architectures are well suited to general CV tasks. More broadly, the paper asks whether CV as a field is overfitting to the ImageNet dataset.
The background of the paper consists of basic fundamental knowledge of modern computer vision architectures and the prevalence of ImageNet as a dataset used to train state-of-the-art models. In addition, the reader should be familiar with transfer learning in computer vision and the two methods that are used in the paper’s experiments.
In transfer learning, two types of transfer are generally used:
- Fixed feature extraction: the final classification layer of the ImageNet-trained network is removed in favour of a linear classifier, which outputs predictions over the classes of the new (target) dataset; the pretrained weights themselves stay frozen.
- Fine-tuning: the weights of the ImageNet pretrained model are treated as an initialisation for the model trained on the new (target) dataset
Generally, fixed feature extraction is better suited to transfer tasks where data is scarce and the target distribution is similar to the original one, while fine-tuning will usually outperform fixed feature extraction given enough data.
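The two regimes can be sketched with a toy numpy model, where a single linear layer stands in for the ImageNet-trained backbone. All weights, data, and hyperparameters here are illustrative stand-ins, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "pretrained backbone": one linear layer standing in for an
# ImageNet-trained network (hypothetical weights for illustration).
W_backbone = rng.normal(size=(16, 8))

def features(X, W):
    # ReLU features produced by the backbone
    return np.maximum(X @ W, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, finetune, steps=200, lr=0.05):
    """Train a binary logistic head on top of the backbone.

    finetune=False -> fixed feature extraction: only the head is updated,
                      the backbone weights stay frozen.
    finetune=True  -> the backbone weights are updated too; the pretrained
                      weights act only as the initialisation.
    """
    W = W_backbone.copy()
    w_head = np.zeros(W.shape[1])
    for _ in range(steps):
        H = features(X, W)
        p = sigmoid(H @ w_head)
        g = p - y                          # dL/dlogits for logistic loss
        w_head -= lr * (H.T @ g) / len(y)  # head always trains
        if finetune:
            # Backpropagate through the ReLU into the backbone weights
            dH = np.outer(g, w_head) * (H > 0)
            W -= lr * (X.T @ dH) / len(y)
    return W, w_head

# Random stand-in for a small target-task dataset
X = rng.normal(size=(64, 16))
y = (rng.random(64) > 0.5).astype(float)
```

After training with `finetune=False` the returned backbone equals `W_backbone` exactly, while `finetune=True` drifts away from it, which is the whole distinction between the two regimes.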
The paper performs a rigorous statistical analysis of ImageNet transfer performance under both fixed feature extraction and fine-tuning. Using Spearman rank correlations, the authors test whether the relationship between ImageNet accuracy and transfer accuracy is statistically robust. Their main contribution is this analysis, which answers with a high degree of confidence that improved ImageNet performance is strongly correlated with improved transfer performance, and reassuringly suggests that computer vision, as a field, has not overfit to ImageNet as a dataset.
In the fixed feature extraction paradigm, ImageNet top-1 accuracy was highly correlated with accuracy on transfer tasks (r=0.99), and the same held for fine-tuning (r=0.96). However, this was only the case when all models were trained using the same techniques; with publicly available checkpoints, differences in regularization and training regime made a substantial difference. Consequently, the paper identified three key choices that reduced transfer performance:
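To make the rank-correlation analysis concrete, here is a minimal Spearman implementation applied to hypothetical accuracy numbers. The accuracy values below are illustrative stand-ins, not the paper's measurements:

```python
import numpy as np

def spearman(x, y):
    # Spearman rank correlation: the Pearson correlation of the ranks.
    # (argsort of argsort yields ranks; assumes no tied values.)
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

# Hypothetical top-1 accuracies for a handful of architectures
imagenet_top1 = np.array([0.71, 0.76, 0.77, 0.80, 0.82])
transfer_acc  = np.array([0.88, 0.91, 0.90, 0.93, 0.95])

print(spearman(imagenet_top1, transfer_acc))  # close to 1: ranks mostly agree
```

Because the statistic depends only on ranks, it is robust to the exact accuracy scale of each benchmark, which is why it suits a comparison across many heterogeneous transfer datasets.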
- the absence of a scale parameter (γ) for batch normalization layers
- the use of label smoothing
- the presence of an auxiliary classifier head
These decisions had negligible impact on ImageNet performance but drastically reduced transfer performance. The differences are even visible in t-SNE embeddings of the feature space.
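Of the choices above, label smoothing is the easiest to sketch: the one-hot training target is mixed with a uniform distribution over classes, which changes the features the penultimate layer learns. A minimal numpy version, with an illustrative smoothing value:

```python
import numpy as np

def smooth_labels(labels, num_classes, eps=0.1):
    """Label smoothing: target = (1 - eps) * one_hot + eps / K.

    The correct class keeps most of the probability mass; the remainder
    is spread uniformly over all K classes. eps=0.1 is a common choice,
    used here only for illustration.
    """
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - eps) * one_hot + eps / num_classes

targets = smooth_labels(np.array([2]), num_classes=4, eps=0.1)
# correct class gets 0.925, every other class gets 0.025
```

Each smoothed row still sums to 1, so it remains a valid probability distribution; the network is simply never pushed toward fully confident predictions, which helps calibration on ImageNet but, per the paper's findings, hurts the transferability of the learned features.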
An interesting part of their analysis is the inclusion of fine-grained datasets, which, while small, generally require expert-level knowledge for humans to classify. A good example is Stanford Cars, which has only 8,144 training images spread across 196 classes of cars. They tested their approaches on various datasets, including fine-grained ones such as Stanford Cars and FGVC Aircraft. On these, transfer learning did not necessarily improve final accuracy, but it did speed up convergence by a factor of up to 17.
Ultimately, because of this high correlation, CV can safely continue using ImageNet as a core benchmark for understanding model performance. However, it is not entirely clear why certain kinds of regularization reduce transfer performance rather than improving it, as one might expect. Nonetheless, this research suggests that it is generally better to start from an ImageNet-pretrained model for other CV tasks than to initialize from random weights: even when there are no accuracy gains, the drastically faster convergence easily pays for itself.