“MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications” Paper Summary & Analysis
To increase model accuracy, data scientists often keep increasing the size of their models. However, this comes at the expense of efficiency and raises the computational hardware requirements. As a result, these oversized models may become unsuitable for some tasks. Self-driving cars, for example, need to identify problems quickly, or else their systems may not have enough time to resolve them, which can be very dangerous. The goal of this paper is to develop smaller, thinner CNNs that perform comparably to larger models while remaining lightweight enough to run on devices with less computational power, such as mobile devices. Rather than pruning pre-existing networks to shrink them, the paper builds a lightweight network from scratch, and it favors thinner, deeper networks over shallower ones.
The MobileNet Architecture
This paper approaches the task of creating more lightweight CNNs by building its architecture from depthwise separable convolutions. A depthwise separable convolution is a form of factorized convolution that consists of a depthwise convolution followed by a 1x1 convolution, which is referred to as a pointwise convolution.
The key portion of the MobileNet model is the depthwise separable convolution. Recall that in a regular convolutional layer, each filter is spatially local but extends through the full depth of the input. The depthwise separable convolution factors this operation into a channel-by-channel depthwise convolution and a pixel-by-pixel pointwise convolution.
In the depthwise convolution, a single filter is applied to each input channel, so the output has the same number of channels as the input.
The pointwise convolution then applies a 1x1 convolution across the full depth at each spatial position, so a single pointwise filter produces an output of depth 1. We use as many pointwise filters as the desired output depth. Because each pointwise filter is randomly initialized, the filters learn to compute different values.
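To make the two steps concrete, here is a minimal sketch in plain Python. It assumes stride 1, no padding, and nested lists in place of tensors, and it leaves out the batchnorm and ReLU layers the paper places after each step. The function names are ours, not the paper's.

```python
def depthwise_conv(x, kernels):
    """Apply one K x K filter per input channel (stride 1, no padding).

    x:       M channels, each an H x W list of lists
    kernels: M filters, each a K x K list of lists
    returns: M channels, each (H-K+1) x (W-K+1)
    """
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        h_out = len(channel) - k + 1
        w_out = len(channel[0]) - k + 1
        out.append([
            [
                # As in most deep learning code, the kernel is not flipped.
                sum(channel[i + a][j + b] * kernel[a][b]
                    for a in range(k) for b in range(k))
                for j in range(w_out)
            ]
            for i in range(h_out)
        ])
    return out


def pointwise_conv(x, weights):
    """Apply N 1x1 filters, each mixing all M channels at one position.

    x:       M channels, each H x W
    weights: N filters, each a list of M numbers
    returns: N channels, each H x W
    """
    h, w = len(x[0]), len(x[0][0])
    return [
        [
            [sum(wt * ch[i][j] for wt, ch in zip(filt, x)) for j in range(w)]
            for i in range(h)
        ]
        for filt in weights
    ]


def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """Depthwise step followed by pointwise step."""
    return pointwise_conv(depthwise_conv(x, dw_kernels), pw_weights)


# Toy input: 2 channels of size 3x3, 2x2 depthwise filters, 3 output channels.
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
]
dw_kernels = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
pw_weights = [[1, 1], [1, -1], [0, 2]]
y = depthwise_separable_conv(x, dw_kernels, pw_weights)
print(len(y), len(y[0]), len(y[0][0]))  # prints: 3 2 2
```

The depthwise step keeps the channel count at 2, and only the pointwise step changes it to 3, which is the factorization described above.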
Counting the multiply-adds required by the depthwise separable layer and dividing by the count for a standard convolution gives a cost ratio of 1/N + 1/D_k^2, where N is the depth of the output layer and D_k is the size of the depthwise filter. As a result, if we compare a 3x3 convolutional layer with a 3x3 depthwise separable convolutional layer and keep the depth of the output layer constant, we need roughly 8 to 9x fewer computations.
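The computation count above can be checked with a few lines of Python. This is a sketch that counts multiply-adds for one layer. Besides N and D_k, it assumes an input depth M and a square feature map of side D_F, as in the paper's notation, and the layer sizes in the example are illustrative only.

```python
def standard_conv_mult_adds(d_k, m, n, d_f):
    """Multiply-adds of a standard D_k x D_k convolution, M -> N channels."""
    return d_k * d_k * m * n * d_f * d_f


def separable_conv_mult_adds(d_k, m, n, d_f):
    """Multiply-adds of the depthwise step plus the pointwise step."""
    depthwise = d_k * d_k * m * d_f * d_f   # one D_k x D_k filter per channel
    pointwise = m * n * d_f * d_f           # N filters of size 1 x 1 x M
    return depthwise + pointwise


# Illustrative layer: 3x3 filters, 512 -> 512 channels, 14x14 feature map.
std = standard_conv_mult_adds(3, 512, 512, 14)
sep = separable_conv_mult_adds(3, 512, 512, 14)
ratio = sep / std                           # equals 1/N + 1/D_k^2
print(round(std / sep, 2))                  # prints: 8.84
```

For a 3x3 filter the saving approaches 9x as N grows, because the 1/N term vanishes and only 1/9 remains.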
The depthwise separable convolution is therefore the heart of the MobileNet architecture; all convolutional layers except the first are depthwise separable.
The paper then proposes two additional hyperparameters that trade accuracy for lower computational cost. The first, the width multiplier alpha, thins the network uniformly at every layer; if a layer has an input depth of 10 and an output depth of 20, then with alpha = 0.5 the new input depth is 5 and the new output depth is 10.
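As a rough illustration of the alpha example above, the sketch below thins a single layer and compares its multiply-adds before and after. The helper names, the round-down rule, and the 8x8 feature map are assumptions made for this example.

```python
def thin(channels, alpha):
    """Scale a channel count by alpha, keeping at least one channel.

    Rounding down with int() is an assumption of this sketch.
    """
    return max(1, int(channels * alpha))


def separable_mult_adds(d_k, m, n, d_f):
    """Multiply-adds of one depthwise separable layer (depthwise + pointwise)."""
    return d_k * d_k * m * d_f * d_f + m * n * d_f * d_f


alpha = 0.5
m, n = 10, 20                          # depths from the example above
m_thin, n_thin = thin(m, alpha), thin(n, alpha)
print(m_thin, n_thin)                  # prints: 5 10

full = separable_mult_adds(3, m, n, 8)
thinned = separable_mult_adds(3, m_thin, n_thin, 8)
print(thinned / full)                  # between alpha**2 and alpha
```

The depthwise term shrinks by alpha and the pointwise term by alpha squared, so in layers dominated by the pointwise step the cost falls roughly quadratically in alpha.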
The second hyperparameter, the resolution multiplier rho, is set implicitly by changing the input resolution of the image. Since the cost of every layer scales with the spatial size of its feature maps, this reduces the computational cost by roughly rho^2, while the smaller image reduces the amount of information available to the model.
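The effect of the input resolution on cost can be sketched the same way. The function below is a hypothetical helper that applies both multipliers to one depthwise separable layer. It assumes an input depth M, an output depth N, a square feature map of side D_F, and round-down scaling.

```python
def mobilenet_layer_mult_adds(d_k, m, n, d_f, alpha=1.0, rho=1.0):
    """Multiply-adds of one depthwise separable layer with both multipliers.

    alpha scales the channel counts, rho scales the feature map side.
    Rounding down with int() is an assumption of this sketch.
    """
    m_s = max(1, int(alpha * m))
    n_s = max(1, int(alpha * n))
    f_s = max(1, int(rho * d_f))
    depthwise = d_k * d_k * m_s * f_s * f_s
    pointwise = m_s * n_s * f_s * f_s
    return depthwise + pointwise


# Illustrative layer: 3x3 filters, 512 -> 512 channels, 14x14 feature map.
base = mobilenet_layer_mult_adds(3, 512, 512, 14)
half_res = mobilenet_layer_mult_adds(3, 512, 512, 14, rho=0.5)
print(half_res / base)                 # prints: 0.25
```

Halving the resolution quarters the cost of this layer, since both the depthwise and the pointwise terms contain the factor D_F squared.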
Compared to standard, large vision networks, MobileNet uses far less computation while retaining most of the accuracy; this pattern held across the full range of experiments in the paper.
A Critique of the Paper
While the paper shows compelling evidence for the effectiveness of MobileNets on a variety of tasks, it does not explain the differences in performance. This is especially confusing in the case of the PlaNet MobileNet architecture, which managed to outperform the original PlaNet model at 2 scales despite having significantly fewer parameters.
The architecture itself raises some questions as well. The paper proposes that a standard convolution can be replaced with a depthwise convolution and a series of pointwise convolutions. In their example, which replaces a 3x3 convolution followed by a batchnorm and ReLU layer, the authors place a batchnorm layer and a ReLU layer after both the depthwise and the pointwise convolutions, without stating why these layers are used twice or how they arrived at this architecture as optimal. Other variations (such as having only one batchnorm and ReLU layer) were not explored. The authors also mention two hyperparameters, the width multiplier and resolution multiplier, that are used to reduce the computational cost of the model, but don’t offer results comparing different values on the same task. Despite the model being called MobileNet, there were also no benchmarks on mobile devices.
The principal question of why the depthwise separable convolution works remains unanswered. Possible experiments include using DS blocks in larger networks, comparing, for example, a ResNet-50 built from DS blocks and extended to the same number of parameters against a ResNet-50 with regular convolutional layers. This would help us understand the observed efficacy of depthwise separable convolutions. Further experimentation could include measurements of the inference speed-up compared to regular CNNs, as faster inference is a big motivation for the paper’s contributions.
Finally, it would be interesting to perform a more in-depth comparison of the performance of smaller neural networks such as MobileNets with other approaches to reducing deep neural network bloat, such as knowledge distillation, lottery tickets, pruning and compression. Is it better to train a MobileNet variant from scratch, or to take a larger, more complex model and use one of the aforementioned methods? Can these methods be combined to produce even more robust and generalizable models?
Because of their small size and efficiency, MobileNets let users trade off latency against accuracy by adjusting the hyperparameters to find the right-sized model for their use case. The paper tested MobileNets’ performance on a number of applications, including large-scale geolocalization, face attribute classification, object detection, and face embeddings. We see potential applications in contexts where a loss of accuracy is less costly (for example, in VR applications as opposed to automated vehicles) than failing to meet size constraints or giving users slower responses. In addition, a MobileNet might even be coupled with a more complex model: the MobileNet gives a timely response, which is then followed by a “confirmation” from the complex model, which is likely to have higher accuracy.