MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Abstract

Targeting mobile and embedded vision applications, this paper proposes an efficient model called MobileNets, a lightweight neural network built on depthwise separable convolutions. The model uses two hyperparameters to trade off accuracy against latency, and extensive experiments balancing the two were conducted on ImageNet, demonstrating strong performance compared with other models. Experiments also showcase MobileNets’ strengths across a wide range of applications, including object detection, fine-grained classification, face attributes, and large-scale geolocalization.

Introduction

Ever since AlexNet made deep CNNs popular, CNNs have become ubiquitous in computer vision, and the general trend has been to build deeper and more complex networks to achieve higher accuracy. However, these accuracy gains do not necessarily make networks more efficient in size or speed, and real-world applications such as robotics, self-driving cars, and AR need to run in real time on platforms with limited computation.

This paper proposes an efficient network architecture and a set of two hyperparameters for building models for the applications above. Section 2 reviews prior experience in building small models, Section 3 describes the MobileNet architecture and the two hyperparameters—the width multiplier and the resolution multiplier, Section 4 describes experiments on ImageNet and various applications, and finally Section 5 concludes.

Prior Work

Building efficient, small-footprint networks has become popular recently, and most approaches fall into two categories: compressing pretrained networks, or training small networks directly. The network architecture proposed in this paper lets the model builder choose a small network that meets the resource constraints of the application. MobileNet focuses primarily on optimizing latency while still producing small networks, whereas many other small-network approaches consider only size and not speed.

MobileNets make use of depthwise separable convolutions, which the Inception model also used to reduce computation in its first few layers. Flattened networks use fully factorized convolutions to build networks and demonstrate the potential of factorized networks. Factorized networks use similar convolution factorization and also leverage topological connections. Others include the Xception network, which scales up depthwise separable filters, and SqueezeNet, which uses bottlenecks. Some other networks that reduce computation include the structured transform network and deep fried convnets.

Other ways to obtain small networks: shrinking, factorizing, and compressing pretrained networks, where compressing includes product quantization, hashing, pruning, vector quantization, and Huffman coding. Factorization methods include references [14, 20]. Other approaches include distillation (training a small network using the outputs of a large network) and low bit networks.

MobileNet Architecture

This section first introduces the core depthwise separable filters, then describes the MobileNet network architecture, and finally closes with the two shrinking hyperparameters (the width multiplier and the resolution multiplier).

Depthwise Separable Convolutions

A depthwise separable convolution splits a standard convolution into two parts: a depthwise convolution and a 1x1 convolution. It also splits a single convolutional layer into two layers: filtering and combining. This factorization can drastically reduce both computation and model size. The comparison between a standard convolution and a depthwise separable convolution is shown below:

[Figure: the difference between a standard convolution and a depthwise separable convolution]

For a standard convolution, suppose its input dimensions are $D_{F}\times D_{F}\times M$, its output dimensions are $D_{G}\times D_{G}\times N$, and the convolution kernel dimensions are $D_{K}\times D_{K}\times M\times N$. Then, with stride=1 and padding, the standard convolution is computed by:

$$G_{k,l,n}=\sum_{i,j,m}K_{i,j,m,n}\cdot F_{k+i-1,l+j-1,m}$$
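Read literally, the sum above is a set of nested loops. The following is a minimal pure-Python sketch, not an optimized implementation. It assumes a square input `F` that has already been padded, and it uses 0-based indices, so the `-1` offsets in the formula disappear. The function name `standard_conv` and the nested-list layouts `F[row][col][m]` and `K[i][j][m][n]` are illustrative choices, not something taken from the paper.

```python
def standard_conv(F, K):
    """Stride-1 convolution of a pre-padded square input F with kernel K.

    F is indexed F[row][col][m] and K is indexed K[i][j][m][n].
    """
    d_k = len(K)                 # kernel height/width D_K
    m_in = len(K[0][0])          # number of input channels M
    n_out = len(K[0][0][0])      # number of output channels N
    d_g = len(F) - d_k + 1       # output height/width D_G

    G = [[[0.0] * n_out for _ in range(d_g)] for _ in range(d_g)]
    for k in range(d_g):
        for l in range(d_g):
            for n in range(n_out):
                total = 0.0
                # Sum over the kernel window (i, j) and all input channels m.
                for i in range(d_k):
                    for j in range(d_k):
                        for m in range(m_in):
                            total += K[i][j][m][n] * F[k + i][l + j][m]
                G[k][l][n] = total
    return G


# A 3x3 single-channel input and a 2x2 kernel of ones:
# each output value is the sum of one 2x2 window.
F = [[[1.0], [2.0], [3.0]],
     [[4.0], [5.0], [6.0]],
     [[7.0], [8.0], [9.0]]]
K = [[[[1.0]], [[1.0]]],
     [[[1.0]], [[1.0]]]]
G = standard_conv(F, K)
print(G)
```

The six nested loops also make the cost visible: the innermost multiply-add runs once for every combination of output position, output channel, kernel position, and input channel.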

The computational cost of the standard convolution is:

$$D_{K}\cdot D_{K}\cdot M\cdot N\cdot D_{F}\cdot D_{F}$$

while the computational cost of the depthwise separable convolution is:

$$D_{K}\cdot D_{K}\cdot M\cdot D_{F}\cdot D_{F}+M\cdot N\cdot D_{F}\cdot D_{F}$$

The former term is the cost of the depthwise convolution, and the latter is the cost of the 1x1 convolution. Comparing the two, the reduction in computation is:

$$\frac{D_{K}\cdot D_{K}\cdot M\cdot D_{F}\cdot D_{F}+M\cdot N\cdot D_{F}\cdot D_{F}}{D_{K}\cdot D_{K}\cdot M\cdot N\cdot D_{F}\cdot D_{F}}=\frac{1}{N}+\frac{1}{D_{K}^{2}}$$

If a 3x3 kernel is used, computation is reduced by 8–9x, with only a small drop in accuracy. Factorizing further [16, 31] does not reduce computation much, because the depthwise convolution already has very little computation.
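The two cost expressions and their ratio can be checked numerically. The sketch below is only an illustration: the function names are made up for this example, and the layer shape (3x3 kernel, 512 input and output channels, 14x14 feature map) is chosen to resemble one of the middle layers of the network.

```python
def standard_cost(d_k, m, n, d_f):
    """Mult-Adds of a standard convolution: D_K * D_K * M * N * D_F * D_F."""
    return d_k * d_k * m * n * d_f * d_f


def separable_cost(d_k, m, n, d_f):
    """Mult-Adds of a depthwise convolution followed by a 1x1 convolution."""
    depthwise = d_k * d_k * m * d_f * d_f
    pointwise = m * n * d_f * d_f
    return depthwise + pointwise


# Illustrative layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
d_k, m, n, d_f = 3, 512, 512, 14
ratio = separable_cost(d_k, m, n, d_f) / standard_cost(d_k, m, n, d_f)
closed_form = 1 / n + 1 / d_k ** 2

print(f"measured cost ratio: {ratio:.4f}")
print(f"1/N + 1/D_K^2:       {closed_form:.4f}")
print(f"reduction factor:    {1 / ratio:.2f}x")
```

For a 3x3 kernel the `1 / d_k ** 2` term is 1/9, and the `1 / n` term is small whenever the layer has many output channels, which is why the reduction lands between 8x and 9x.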

Network Structure and Training

Apart from the first layer, which is a standard convolution, the rest of MobileNet’s structure is built on depthwise separable convolutions. The full network architecture is shown below:

| Type / Stride | Filter Shape | Input Size |
| --- | --- | --- |
| Conv / s2 | 3×3×3×32 | 224×224×3 |
| Conv dw / s1 | 3×3×32 dw | 112×112×32 |
| Conv / s1 | 1×1×32×64 | 112×112×32 |
| Conv dw / s2 | 3×3×64 dw | 112×112×64 |
| Conv / s1 | 1×1×64×128 | 56×56×64 |
| Conv dw / s1 | 3×3×128 dw | 56×56×128 |
| Conv / s1 | 1×1×128×128 | 56×56×128 |
| Conv dw / s2 | 3×3×128 dw | 56×56×128 |
| Conv / s1 | 1×1×128×256 | 28×28×128 |
| Conv dw / s1 | 3×3×256 dw | 28×28×256 |
| Conv / s1 | 1×1×256×256 | 28×28×256 |
| Conv dw / s2 | 3×3×256 dw | 28×28×256 |
| Conv / s1 | 1×1×256×512 | 14×14×256 |
| 5× Conv dw / s1 | 3×3×512 dw | 14×14×512 |
| 5× Conv / s1 | 1×1×512×512 | 14×14×512 |
| Conv dw / s2 | 3×3×512 dw | 14×14×512 |
| Conv / s1 | 1×1×512×1024 | 7×7×512 |
| Conv dw / s2 | 3×3×1024 dw | 7×7×1024 |
| Conv / s1 | 1×1×1024×1024 | 7×7×1024 |
| Avg Pool / s1 | Pool 7×7 | 7×7×1024 |
| FC / s1 | 1024×1000 | 1×1×1024 |
| Softmax / s1 | Classifier | 1×1×1000 |

Table 1. MobileNet body architecture (the two 5× rows are repeated 5 times).

It is worth noting that a small number of Mult-Adds alone does not make a model efficient. It is equally important that these Mult-Add operations can be implemented efficiently. For example, unstructured sparse matrix operations are not necessarily faster than dense matrix operations unless the sparsity is very high. Our model puts nearly all computation into dense 1x1 convolutions, which can be implemented with a highly optimized general matrix multiply (GEMM), one of the most optimized numerical linear algebra algorithms. Convolutions implemented with GEMM usually require first reordering the input in memory with im2col, as is done in Caffe, for example. Our 1x1 convolutions, by contrast, require no such reordering and can be passed to GEMM directly. In MobileNet, 95% of the Mult-Add operations and 75% of the parameters come from 1x1 convolutions.
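The share of work done by 1x1 convolutions can be estimated directly from the layer shapes in Table 1. The sketch below is a rough tally under stated assumptions: it counts weights only (no biases or batch-norm terms), it takes each layer's output size from the input size of the row that follows it, and it treats the final depthwise layer as keeping the 7×7 resolution, since the next row's input is still 7×7×1024 even though the table lists stride 2. The tuple layout and the names `LAYERS`, `layer_params`, and `totals` are illustrative.

```python
# (kind, kernel size, in channels, out channels, output size, repeats)
# Shapes are transcribed from Table 1.
LAYERS = [
    ("conv", 3, 3, 32, 112, 1),
    ("dw", 3, 32, 32, 112, 1),
    ("pw", 1, 32, 64, 112, 1),
    ("dw", 3, 64, 64, 56, 1),
    ("pw", 1, 64, 128, 56, 1),
    ("dw", 3, 128, 128, 56, 1),
    ("pw", 1, 128, 128, 56, 1),
    ("dw", 3, 128, 128, 28, 1),
    ("pw", 1, 128, 256, 28, 1),
    ("dw", 3, 256, 256, 28, 1),
    ("pw", 1, 256, 256, 28, 1),
    ("dw", 3, 256, 256, 14, 1),
    ("pw", 1, 256, 512, 14, 1),
    ("dw", 3, 512, 512, 14, 5),
    ("pw", 1, 512, 512, 14, 5),
    ("dw", 3, 512, 512, 7, 1),
    ("pw", 1, 512, 1024, 7, 1),
    ("dw", 3, 1024, 1024, 7, 1),   # assumed to keep 7x7, see the note above
    ("pw", 1, 1024, 1024, 7, 1),
    ("fc", 1, 1024, 1000, 1, 1),
]


def layer_params(kind, k, c_in, c_out):
    """Weight count of one layer (biases and batch-norm terms ignored)."""
    if kind == "dw":
        return k * k * c_in          # one k x k filter per input channel
    return k * k * c_in * c_out      # conv, pw (k=1), and fc (k=1)


def totals(layers):
    """Return Mult-Adds and parameter counts, grouped by layer kind."""
    mult_adds, params = {}, {}
    for kind, k, c_in, c_out, out_size, repeats in layers:
        p = layer_params(kind, k, c_in, c_out) * repeats
        params[kind] = params.get(kind, 0) + p
        # Every weight is used once per output position.
        mult_adds[kind] = mult_adds.get(kind, 0) + p * out_size * out_size
    return mult_adds, params


mult_adds, params = totals(LAYERS)
total_ma = sum(mult_adds.values())
total_p = sum(params.values())
print(f"Mult-Adds: {total_ma / 1e6:.1f} M, 1x1 share {mult_adds['pw'] / total_ma:.1%}")
print(f"Params:    {total_p / 1e6:.2f} M, 1x1 share {params['pw'] / total_p:.1%}")
```

Under these assumptions the tally comes to about 569 million Mult-Adds and 4.2 million weights, with roughly 95% of the Mult-Adds and 75% of the weights in the 1x1 layers.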

Training details: TensorFlow + RMSprop + asynchronous gradient descent (similar to Inception V3) + less regularization and data augmentation (small models are less prone to overfitting) + little or no weight decay on the depthwise filters (since they already have very few parameters).

Width Multiplier: Thinner Models

Although the current MobileNet is already very small and fast, sometimes an even smaller model is needed. We introduce a hyperparameter $\alpha$ (the width multiplier) to build these smaller models. The goal of this parameter is to uniformly thin out the entire network at every layer. Given an $\alpha$, the number of input channels $M$ becomes $\alpha M$ and the number of output channels $N$ becomes $\alpha N$. Typical values of $\alpha$ are 1, 0.75, 0.5, and 0.25. After applying this parameter, the computation becomes:

$$D_{K}\cdot D_{K}\cdot \alpha M\cdot D_{F}\cdot D_{F}+\alpha M\cdot \alpha N\cdot D_{F}\cdot D_{F}$$

Both the computation and the number of parameters become roughly $\alpha^{2}$ of what they were.
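To see why the saving is only roughly quadratic, the cost of a single layer can be evaluated for the typical values of the width multiplier. This is a hedged sketch: the function name, the use of `round` to get integer channel counts, and the example layer shape are assumptions made for illustration.

```python
def separable_cost(d_k, m, n, d_f, alpha=1.0):
    """Mult-Adds of one depthwise separable layer thinned by alpha."""
    m_thin = round(alpha * m)    # thinned input channels, alpha * M
    n_thin = round(alpha * n)    # thinned output channels, alpha * N
    depthwise = d_k * d_k * m_thin * d_f * d_f
    pointwise = m_thin * n_thin * d_f * d_f
    return depthwise + pointwise


# Illustrative layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
base = separable_cost(3, 512, 512, 14)
ratios = {}
for alpha in (1.0, 0.75, 0.5, 0.25):
    ratios[alpha] = separable_cost(3, 512, 512, 14, alpha) / base
    print(f"alpha={alpha:<4}  cost ratio={ratios[alpha]:.4f}  alpha^2={alpha ** 2:.4f}")
```

The depthwise term shrinks linearly in `alpha` while the 1x1 term shrinks quadratically. Because the 1x1 term dominates, the overall ratio stays close to `alpha ** 2` but sits slightly above it.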

Resolution Multiplier: Reduced Representation

The second hyperparameter for reducing network computation is $\rho$ (the resolution multiplier). In practice it is set implicitly by choosing the input resolution; the resolution of every internal layer then shrinks by the same factor. With both hyperparameters $\alpha$ and $\rho$, the computation becomes:

$$D_{K}\cdot D_{K}\cdot \alpha M\cdot \rho D_{F}\cdot \rho D_{F}+\alpha M\cdot \alpha N\cdot \rho D_{F}\cdot \rho D_{F}$$

The input resolutions that result from typical settings of $\rho$ are 224, 192, 160, and 128. Note that $\rho$ reduces the computation by a factor of $\rho^{2}$, but the number of model parameters does not change.
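The effect of the resolution multiplier on one layer can be sketched the same way. In the example below, `rho` is derived from the input resolution relative to 224, and the feature-map size is rounded to an integer; both choices, along with the function names and the example layer shape, are assumptions for illustration.

```python
def separable_cost(d_k, m, n, d_f, alpha=1.0, rho=1.0):
    """Mult-Adds of one depthwise separable layer with both multipliers."""
    m_thin, n_thin = round(alpha * m), round(alpha * n)
    d = round(rho * d_f)         # reduced feature-map size, rho * D_F
    return d_k * d_k * m_thin * d * d + m_thin * n_thin * d * d


def separable_params(d_k, m, n, alpha=1.0):
    """Weights of the same layer; rho does not appear at all."""
    m_thin, n_thin = round(alpha * m), round(alpha * n)
    return d_k * d_k * m_thin + m_thin * n_thin


BASE_RESOLUTION = 224
base = separable_cost(3, 512, 512, 14)
ratios = {}
for resolution in (224, 192, 160, 128):
    rho = resolution / BASE_RESOLUTION
    ratios[resolution] = separable_cost(3, 512, 512, 14, rho=rho) / base
    print(f"{resolution}: rho={rho:.3f}  cost ratio={ratios[resolution]:.4f}  rho^2={rho ** 2:.4f}")

print("weights at any resolution:", separable_params(3, 512, 512))
```

Since the weight count depends only on the kernel size and the channel counts, changing the input resolution changes the cost but leaves the model size untouched.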

Experiments

This part mainly covers a number of experiments. First is the comparison between depthwise separable convolutions and standard convolutions, and between a thinner MobileNet and a shallower MobileNet. It then presents the experimental results for the two hyperparameters, including ImageNet accuracy, the number of Mult-Add operations, and the number of parameters. Finally, it presents experimental results for MobileNet on several different applications (fine-grained classification, large-scale geolocalization, face attributes, object detection, and face embeddings).

Model Choices

The experimental results show that a MobileNet built from full (standard) convolutions and one built from depthwise separable convolutions have comparable accuracy (71.7% vs. 70.6%), but the depthwise separable version has far fewer parameters and far less computation. Comparing a thinner MobileNet with a shallower MobileNet at comparable computation, the thinner MobileNet is about 3 percentage points more accurate (68.4% vs. 65.3%). The results are shown below:

| Model | ImageNet Accuracy | Million Mult-Adds | Million Parameters |
| --- | --- | --- | --- |
| Conv MobileNet | 71.7% | 4866 | 29.3 |
| MobileNet | 70.6% | 569 | 4.2 |

Table 4. Depthwise Separable vs Full Convolution MobileNet.

| Model | ImageNet Accuracy | Million Mult-Adds | Million Parameters |
| --- | --- | --- | --- |
| 0.75 MobileNet | 68.4% | 325 | 2.6 |
| Shallow MobileNet | 65.3% | 307 | 2.9 |

Table 5. Narrow vs Shallow MobileNet.

Model Shrinking Hyperparameters

This part covers tuning the two hyperparameters above, with results shown below:

| Width Multiplier | ImageNet Accuracy | Million Mult-Adds | Million Parameters |
| --- | --- | --- | --- |
| 1.0 MobileNet-224 | 70.6% | 569 | 4.2 |
| 0.75 MobileNet-224 | 68.4% | 325 | 2.6 |
| 0.5 MobileNet-224 | 63.7% | 149 | 1.3 |
| 0.25 MobileNet-224 | 50.6% | 41 | 0.5 |

Table 6. MobileNet width multiplier.

| Resolution | ImageNet Accuracy | Million Mult-Adds | Million Parameters |
| --- | --- | --- | --- |
| 1.0 MobileNet-224 | 70.6% | 569 | 4.2 |
| 1.0 MobileNet-192 | 69.1% | 418 | 4.2 |
| 1.0 MobileNet-160 | 67.2% | 290 | 4.2 |
| 1.0 MobileNet-128 | 64.4% | 186 | 4.2 |

Table 7. MobileNet resolution.

[Figures: comparison of hyperparameter tuning]

| Model | ImageNet Accuracy | Million Mult-Adds | Million Parameters |
| --- | --- | --- | --- |
| 1.0 MobileNet-224 | 70.6% | 569 | 4.2 |
| GoogleNet | 69.8% | 1550 | 6.8 |
| VGG 16 | 71.5% | 15300 | 138 |

Table 8. MobileNet comparison to popular models.

| Model | ImageNet Accuracy | Million Mult-Adds | Million Parameters |
| --- | --- | --- | --- |
| 0.50 MobileNet-160 | 60.2% | 76 | 1.32 |
| Squeezenet | 57.5% | 1700 | 1.25 |
| AlexNet | 57.2% | 720 | 60 |

Table 9. Smaller MobileNet comparison to popular models.
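The comparisons in Tables 8 and 9 are easier to read as ratios. The short sketch below only does arithmetic on the numbers copied from those two tables; the dictionary layout and the helper name `compare` are illustrative.

```python
# model: (ImageNet accuracy %, million Mult-Adds, million parameters),
# copied from Tables 8 and 9.
MODELS = {
    "1.0 MobileNet-224": (70.6, 569, 4.2),
    "GoogleNet": (69.8, 1550, 6.8),
    "VGG 16": (71.5, 15300, 138),
    "0.50 MobileNet-160": (60.2, 76, 1.32),
    "Squeezenet": (57.5, 1700, 1.25),
    "AlexNet": (57.2, 720, 60),
}


def compare(mobilenet, other):
    """Return (accuracy gap in points, compute ratio, size ratio) of other vs. mobilenet."""
    acc_m, ma_m, p_m = MODELS[mobilenet]
    acc_o, ma_o, p_o = MODELS[other]
    return acc_o - acc_m, ma_o / ma_m, p_o / p_m


pairs = [
    ("1.0 MobileNet-224", "GoogleNet"),
    ("1.0 MobileNet-224", "VGG 16"),
    ("0.50 MobileNet-160", "Squeezenet"),
    ("0.50 MobileNet-160", "AlexNet"),
]
for mobilenet, other in pairs:
    gap, compute, size = compare(mobilenet, other)
    print(f"{other} vs {mobilenet}: {gap:+.1f} points, "
          f"{compute:.1f}x the Mult-Adds, {size:.1f}x the parameters")
```

Read this way, VGG 16 is less than one point more accurate than 1.0 MobileNet-224 while needing about 27 times the Mult-Adds and about 33 times the parameters.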

Fine Grained Recognition

A fine-grained classification model was pretrained on noisy data collected from the web and then fine-tuned on the Stanford Dogs dataset, ultimately achieving results close to the state of the art while greatly reducing computation and model size. The results are shown below:

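| Model | Top-1 Accuracy | Million Mult-Adds | Million Parameters |
| --- | --- | --- | --- |
| Inception V3 | 84% | 5000 | 23.2 |
| 1.0 MobileNet-224 | 83.3% | 569 | 3.3 |
| 0.75 MobileNet-224 | 81.9% | 325 | 1.9 |
| 1.0 MobileNet-192 | 81.9% | 418 | 3.3 |
| 0.75 MobileNet-192 | 80.5% | 239 | 1.9 |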

Table 10. MobileNet for Stanford Dogs.

Large Scale Geolocalization

PlaNet solves the localization problem by recasting it as a classification problem. PlaNet has already successfully localized many photos, and its performance on this problem has surpassed Im2GPS. We retrained PlaNet on the same data using the MobileNet architecture, with results shown below:

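| Scale | Im2GPS | PlaNet | PlaNet MobileNet |
| --- | --- | --- | --- |
| Continent (2500 km) | 51.9% | 77.6% | 79.3% |
| Country (750 km) | 35.4% | 64.0% | 60.3% |
| Region (200 km) | 32.1% | 51.1% | 45.2% |
| City (25 km) | 21.9% | 31.7% | 31.7% |
| Street (1 km) | 2.5% | 11.0% | 11.4% |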

Table 11. Performance of PlaNet using the MobileNet architecture. Percentages are the fraction of the Im2GPS test dataset localized within a certain distance from ground truth.

Face Attributes

MobileNet can also be used to compress large systems whose training procedures are unknown or esoteric. For a face attribute classifier, MobileNet was combined with distillation; the resulting system not only required no regularization but also performed better, with results shown below:

| Width Multiplier / Resolution | Mean AP | Million Mult-Adds | Million Parameters |
| --- | --- | --- | --- |
| 1.0 MobileNet-224 | 88.7% | 568 | 3.2 |
| 0.5 MobileNet-224 | 88.1% | 149 | 0.8 |
| 0.25 MobileNet-224 | 87.2% | 45 | 0.2 |
| 1.0 MobileNet-128 | 88.1% | 185 | 3.2 |
| 0.5 MobileNet-128 | 87.7% | 48 | 0.8 |
| 0.25 MobileNet-128 | 86.4% | 15 | 0.2 |
| Baseline | 86.9% | 1600 | 7.5 |

Table 12. Face attribute classification using the MobileNet architecture. Each row corresponds to a different hyper-parameter setting (width multiplier α and image resolution).

Object Detection

This experiment compares MobileNet with VGG and Inception V2 as the base network under the SSD and Faster-RCNN frameworks, trained on the COCO dataset, with results shown below:

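| Framework Resolution | Model | mAP | Billion Mult-Adds | Million Parameters |
| --- | --- | --- | --- | --- |
| SSD 300 | deeplab-VGG | 21.1% | 34.9 | 33.1 |
| SSD 300 | Inception V2 | 22.0% | 3.8 | 13.7 |
| SSD 300 | MobileNet | 19.3% | 1.2 | 6.8 |
| Faster-RCNN 300 | VGG | 22.9% | 64.3 | 138.5 |
| Faster-RCNN 300 | Inception V2 | 15.4% | 118.2 | 13.3 |
| Faster-RCNN 300 | MobileNet | 16.4% | 25.2 | 6.1 |
| Faster-RCNN 600 | VGG | 25.7% | 149.6 | 138.5 |
| Faster-RCNN 600 | Inception V2 | 21.9% | 129.6 | 13.3 |
| Faster-RCNN 600 | MobileNet | 19.8% | 30.5 | 6.1 |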

Table 13. COCO object detection results comparison using different frameworks and network architectures. mAP is reported with the COCO primary challenge metric (AP at IoU=0.50:0.05:0.95).
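The notation IoU=0.50:0.05:0.95 in the caption means that average precision is computed at ten overlap thresholds and then averaged. The sketch below shows only the overlap test behind that metric, not a full AP computation; the `(x1, y1, x2, y2)` box format, the function name `iou`, and the example boxes are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds 0.50, 0.55, ..., 0.95 over which AP is averaged.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

predicted = (0, 0, 10, 10)       # hypothetical detection
ground_truth = (1, 1, 10, 10)    # hypothetical annotation
overlap = iou(predicted, ground_truth)
passed = [t for t in IOU_THRESHOLDS if overlap >= t]

print(f"IoU = {overlap:.2f}")
print(f"counts as a match at {len(passed)} of {len(IOU_THRESHOLDS)} thresholds")
```

A detection that overlaps loosely with the ground truth only counts at the lower thresholds, so averaging over the whole range rewards tighter localization.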

Face Embeddings

FaceNet is a state-of-the-art face recognition model that builds face embeddings. Here we likewise use distillation to train a mobile FaceNet model. The results are shown below:

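| Model | 1e-4 Accuracy | Million Mult-Adds | Million Parameters |
| --- | --- | --- | --- |
| FaceNet | 83% | 1600 | 7.5 |
| 1.0 MobileNet-160 | 79.4% | 286 | 4.9 |
| 1.0 MobileNet-128 | 78.3% | 185 | 5.5 |
| 0.75 MobileNet-128 | 75.2% | 166 | 3.4 |
| 0.75 MobileNet-128 | 72.5% | 108 | 3.8 |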

Table 14. MobileNet distilled from FaceNet.

Conclusion

We proposed a model architecture based on depthwise separable convolutions, and used two hyperparameters—the width multiplier and the resolution multiplier—to control model complexity. We compared MobileNet against other models in terms of model size, speed, and accuracy, demonstrating its efficiency across a variety of applications. The next step is to refine and further develop MobileNet.

Datasets: ImageNet (image classification), Stanford Dogs dataset (fine-grained classification), a multi-attribute dataset similar to YFCC100M (face attributes), COCO (object detection).

Related papers:

Datasets

“Imagenet large scale visual recognition challenge” (ImageNet, ILSVRC 2012)

“Novel dataset for fine-grained image categorization”, in First Workshop on Fine-Grained Visual Categorization (Stanford Dogs dataset)

“Yfcc100m: The new data in multimedia research” (YFCC100M)

Deeper, more complex, higher-accuracy neural networks

“Inception-v4, inception-resnet and the impact of residual connections on learning” (Inception V4)

“Rethinking the inception architecture for computer vision” (Inception V3, additional factorization in the spatial dimension)

“Deep residual learning for image recognition” (ResNet)

“Going deeper with convolutions” (GoogLeNet)

“Very deep convolutional networks for large-scale image recognition” (VGG16)

“Imagenet classification with deep convolutional neural networks” (AlexNet)

Neural network compression and acceleration

“Flattened convolutional neural networks for feedforward acceleration” (additional factorization in the spatial dimension)

“Factorized convolutional neural networks” (factorizing convolutions)

“Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size” (using bottlenecks to build small networks)

“Quantized convolutional neural networks for mobile devices” (compression based on product quantization)

“Xnornet: Imagenet classification using binary convolutional neural networks” (using low bit networks)

“Training deep neural networks with low precision multiplications” (using low bit networks)

“Quantized neural networks: Training neural networks with low precision weights and activations” (using low bit networks)

“Xception: Deep learning with depthwise separable convolutions” (scaling up depthwise separable filters)

“Structured transforms for small-footprint deep learning” (a network for reducing computation)

“Deep fried convnets” (a network for reducing computation)

“Compressing neural networks with the hashing trick” (using hashing to compress neural networks)

“Rigid-motion scattering for image classification” (originally proposed factorizing a standard convolution into a depthwise conv and a 1x1 conv)

“Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding” (using Huffman coding to compress networks)

“Speeding up convolutional neural networks with low rank expansions” (additional variable factorization)

“Speeding-up convolutional neural networks using fine-tuned cp-decomposition” (additional variable factorization)

“Distilling the knowledge in a neural network” (using distillation to train a small network from a large one, for compression)

BN

“Batch normalization: Accelerating deep network training by reducing internal covariate shift” (Inception V2 also derives from this)

Frameworks

“Caffe: Convolutional architecture for fast feature embedding”

“Tensorflow: Large-scale machine learning on heterogeneous systems”

Image localization

“IM2GPS: estimating geographic information from a single image” (proposed Im2GPS)

“Large-Scale Image Geolocalization” (about Im2GPS)

“PlaNet - Photo Geolocation with Convolutional Neural Networks” (PlaNet)

Fine-grained classification

“The unreasonable effectiveness of noisy data for fine-grained recognition”

Object detection

“Faster r-cnn: Towards real-time object detection with region proposal networks” (the Faster-RCNN framework)

“Ssd: Single shot multibox detector” (the SSD framework)

Face embeddings

“Facenet: A unified embedding for face recognition and clustering” (FaceNet, building face embeddings based on triplet loss)