MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Abstract

Targeting mobile and embedded vision applications, this paper proposes a class of efficient models called MobileNets: lightweight deep neural networks built on depthwise separable convolutions. Two global hyperparameters trade off accuracy against latency. Extensive experiments on ImageNet explore this trade-off and show strong performance compared with other popular models. Further experiments demonstrate MobileNets’ effectiveness across a wide range of applications, including object detection, fine-grained classification, face attributes, and large-scale geolocalization.

Introduction

Ever since AlexNet made deep CNNs popular, CNNs have become ubiquitous in computer vision, and the general trend has been to build deeper and more complex networks to achieve higher accuracy. However, these accuracy gains do not necessarily make networks more efficient in size or speed. Real-world applications such as robotics, self-driving cars, and AR need to run in real time on platforms with limited computation.

This paper proposes an efficient network architecture and a set of two hyperparameters for building small, low-latency models for the applications above. Section 2 reviews prior work on building small models. Section 3 describes the MobileNet architecture and the two hyperparameters, the width multiplier and the resolution multiplier. Section 4 describes experiments on ImageNet and various applications, and Section 5 concludes.

Prior Work

Building efficient, small-footprint networks has become popular recently, and many approaches can be classified into two categories: compressing pretrained networks, or training small networks directly. The network architecture proposed in this paper lets the builder choose a small network that meets resource constraints. MobileNet focuses primarily on optimizing latency while producing small networks; many networks consider only size and not speed.

MobileNets make use of depthwise separable convolutions, which the Inception model also used to reduce computation in its first few layers. Flattened networks use fully factorized convolutions to build networks and demonstrate the potential of factorized networks. Factorized networks use similar convolution factorization and also leverage topological connections. Others include the Xception network, which scales up depthwise separable filters, and SqueezeNet, which uses bottlenecks. Some other networks that reduce computation include the structured transform network and deep fried convnets.

Another way to obtain small networks is to shrink, factorize, or compress pretrained networks. Compression methods include product quantization, hashing, pruning, vector quantization, and Huffman coding. Factorization methods have also been proposed to speed up pretrained networks [14, 20]. Other approaches include distillation (training a small network using the outputs of a large network) and low-bit networks.

MobileNet Architecture

This section first introduces the core building block, depthwise separable filters, then describes the MobileNet network architecture, and finally closes with the two model-shrinking hyperparameters (the width multiplier and the resolution multiplier).

Depthwise Separable Convolutions

A depthwise separable convolution splits a standard convolution into two parts: a depthwise convolution and a 1x1 (pointwise) convolution. A standard convolution filters the inputs and combines them into new outputs in a single step; the depthwise separable version uses two layers, one for filtering (a single filter per input channel) and one for combining (the 1x1 convolution). This factorization can drastically reduce both computation and model size. The comparison between a standard convolution and a depthwise separable convolution is shown below:

Figure: The difference between a standard convolution and a depthwise separable convolution.

For a standard convolution, suppose the input feature map F has dimensions $D_{F}\times D_{F}\times M$, the output feature map G has dimensions $D_{G}\times D_{G}\times N$, and the convolution kernel K has dimensions $D_{K}\times D_{K}\times M\times N$. With stride 1 and padding (so that $D_{G}=D_{F}$), the standard convolution is computed by:

$$G_{k,l,n}=\sum_{i,j,m}K_{i,j,m,n}\cdot F_{k+i-1,l+j-1,m}$$
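As a rough illustration of the formula above, the following pure-Python sketch evaluates a standard convolution with zero padding and stride 1, and then a depthwise convolution followed by a 1x1 (pointwise) convolution. The helper names and the tiny random tensors are illustrative assumptions, not taken from the paper. When the full kernel happens to factor as K[i][j][m][n] = Kd[i][j][m] * Kp[m][n], both routes give the same output, which is the sense in which the depthwise separable form is a factorization; a general kernel does not factor this way.

```python
import random

def zeros(h, w, c):
    return [[[0.0] * c for _ in range(w)] for _ in range(h)]

def pad(F, p):
    """Zero-pad an H x W x C feature map by p pixels on every side."""
    h, w, c = len(F), len(F[0]), len(F[0][0])
    out = zeros(h + 2 * p, w + 2 * p, c)
    for y in range(h):
        for x in range(w):
            out[y + p][x + p] = list(F[y][x])
    return out

def standard_conv(F, K):
    """G[k][l][n] = sum over i, j, m of K[i][j][m][n] * Fp[k+i][l+j][m] (0-based, padded)."""
    dk, m_ch, n_ch = len(K), len(K[0][0]), len(K[0][0][0])
    h, w = len(F), len(F[0])
    Fp = pad(F, dk // 2)
    G = zeros(h, w, n_ch)
    for k in range(h):
        for l in range(w):
            for n in range(n_ch):
                G[k][l][n] = sum(
                    K[i][j][m][n] * Fp[k + i][l + j][m]
                    for i in range(dk) for j in range(dk) for m in range(m_ch)
                )
    return G

def depthwise_conv(F, Kd):
    """Filtering step: one D_K x D_K filter Kd[i][j][m] per input channel m."""
    dk, m_ch = len(Kd), len(Kd[0][0])
    h, w = len(F), len(F[0])
    Fp = pad(F, dk // 2)
    G = zeros(h, w, m_ch)
    for k in range(h):
        for l in range(w):
            for m in range(m_ch):
                G[k][l][m] = sum(
                    Kd[i][j][m] * Fp[k + i][l + j][m]
                    for i in range(dk) for j in range(dk)
                )
    return G

def pointwise_conv(F, Kp):
    """Combining step: a 1x1 convolution with weights Kp[m][n] applied at every pixel."""
    m_ch, n_ch = len(Kp), len(Kp[0])
    return [[[sum(Kp[m][n] * px[m] for m in range(m_ch)) for n in range(n_ch)]
             for px in row] for row in F]

random.seed(0)
D_F, D_K, M, N = 5, 3, 4, 6  # small sizes chosen only for the demo
F = [[[random.uniform(-1, 1) for _ in range(M)] for _ in range(D_F)] for _ in range(D_F)]
Kd = [[[random.uniform(-1, 1) for _ in range(M)] for _ in range(D_K)] for _ in range(D_K)]
Kp = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(M)]
# A full kernel that factors exactly into the depthwise and pointwise parts.
K = [[[[Kd[i][j][m] * Kp[m][n] for n in range(N)] for m in range(M)]
      for j in range(D_K)] for i in range(D_K)]

full = standard_conv(F, K)
separable = pointwise_conv(depthwise_conv(F, Kd), Kp)
max_diff = max(abs(full[k][l][n] - separable[k][l][n])
               for k in range(D_F) for l in range(D_F) for n in range(N))
print("largest absolute difference:", max_diff)
```

The loops are written for readability rather than speed; a real implementation would rely on an optimized library.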

The computational cost of the standard convolution is:

$$D_{K}\cdot D_{K}\cdot M\cdot N\cdot D_{F}\cdot D_{F}$$

while the computational cost of the depthwise separable convolution is:

$$D_{K}\cdot D_{K}\cdot M\cdot D_{F}\cdot D_{F}+M\cdot N\cdot D_{F}\cdot D_{F}$$

The first term is the cost of the depthwise convolution, and the second is the cost of the 1x1 convolution. The ratio of the depthwise separable cost to the standard cost is:

$$\frac{D_{K}\cdot D_{K}\cdot M\cdot D_{F}\cdot D_{F}+M\cdot N\cdot D_{F}\cdot D_{F}}{D_{K}\cdot D_{K}\cdot M\cdot N\cdot D_{F}\cdot D_{F}}=\frac{1}{N}+\frac{1}{D_{K}^{2}}$$

If a 3x3 kernel is used, computation is reduced by 8–9x, with only a small drop in accuracy. Factorizing further [16, 31] does not reduce computation much, because the depthwise convolution already has very little computation.
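The cost ratio can be checked numerically. The sketch below plugs in one layer shape that appears in the network (3x3 kernel, 512 input and output channels, 14x14 feature map); the function names are illustrative choices, not from the paper.

```python
def standard_cost(dk, m, n, df):
    """Mult-Adds of a standard convolution: D_K * D_K * M * N * D_F * D_F."""
    return dk * dk * m * n * df * df

def separable_cost(dk, m, n, df):
    """Mult-Adds of a depthwise convolution plus a 1x1 convolution."""
    depthwise = dk * dk * m * df * df
    pointwise = m * n * df * df
    return depthwise + pointwise

dk, m, n, df = 3, 512, 512, 14
ratio = separable_cost(dk, m, n, df) / standard_cost(dk, m, n, df)
closed_form = 1 / n + 1 / dk ** 2

print(f"measured ratio    : {ratio:.4f}")
print(f"1/N + 1/D_K^2     : {closed_form:.4f}")
print(f"reduction factor  : {1 / ratio:.2f}x")
```

Because 1/N is small for wide layers, the ratio is dominated by 1/D_K^2, which is why a 3x3 kernel gives a reduction of a little under 9x.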

Network Structure and Training

Apart from the first layer, which is a standard convolution, the rest of MobileNet’s structure is built on depthwise separable convolutions. The full network architecture is shown below:

Type / Stride	Filter Shape	Input Size
Conv / s2	3×3×3×32	224×224×3
Conv dw / s1	3×3×32 dw	112×112×32
Conv / s1	1×1×32×64	112×112×32
Conv dw / s2	3×3×64 dw	112×112×64
Conv / s1	1×1×64×128	56×56×64
Conv dw / s1	3×3×128 dw	56×56×128
Conv / s1	1×1×128×128	56×56×128
Conv dw / s2	3×3×128 dw	56×56×128
Conv / s1	1×1×128×256	28×28×128
Conv dw / s1	3×3×256 dw	28×28×256
Conv / s1	1×1×256×256	28×28×256
Conv dw / s2	3×3×256 dw	28×28×256
Conv / s1	1×1×256×512	14×14×256
5× Conv dw / s1	3×3×512 dw	14×14×512
5× Conv / s1	1×1×512×512	14×14×512
Conv dw / s2	3×3×512 dw	14×14×512
Conv / s1	1×1×512×1024	7×7×512
Conv dw / s2	3×3×1024 dw	7×7×1024
Conv / s1	1×1×1024×1024	7×7×1024
Avg Pool / s1	Pool 7×7	7×7×1024
FC / s1	1024×1000	1×1×1024
Softmax / s1	Classifier	1×1×1000

Table 1. MobileNet body architecture (the pair of rows marked 5× is repeated five times as a block).
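The totals implied by Table 1 can be tallied with a short script. This is a sketch under stated assumptions: each row's output size is read from the Input Size column of the following row (so the last depthwise row, listed as s2, is treated as keeping the 7×7 size the table shows), and biases and batch-norm parameters are ignored. The tuple layout and names are illustrative.

```python
# (kind, kernel size, input channels, output channels, output spatial size)
ROWS = (
    [("conv", 3, 3, 32, 112),
     ("dw", 3, 32, 32, 112), ("pw", 1, 32, 64, 112),
     ("dw", 3, 64, 64, 56), ("pw", 1, 64, 128, 56),
     ("dw", 3, 128, 128, 56), ("pw", 1, 128, 128, 56),
     ("dw", 3, 128, 128, 28), ("pw", 1, 128, 256, 28),
     ("dw", 3, 256, 256, 28), ("pw", 1, 256, 256, 28),
     ("dw", 3, 256, 256, 14), ("pw", 1, 256, 512, 14)]
    + [("dw", 3, 512, 512, 14), ("pw", 1, 512, 512, 14)] * 5
    + [("dw", 3, 512, 512, 7), ("pw", 1, 512, 1024, 7),
       ("dw", 3, 1024, 1024, 7), ("pw", 1, 1024, 1024, 7),
       ("fc", 1, 1024, 1000, 1)]
)

def row_cost(kind, k, c_in, c_out, out):
    """Return (mult_adds, weights) for one table row."""
    if kind == "dw":
        weights = k * k * c_in          # one k x k filter per channel
    else:
        weights = k * k * c_in * c_out  # conv, 1x1 conv, or fully connected
    return weights * out * out, weights

total_madds = sum(row_cost(*r)[0] for r in ROWS)
total_params = sum(row_cost(*r)[1] for r in ROWS)
pw_madds = sum(row_cost(*r)[0] for r in ROWS if r[0] == "pw")
pw_params = sum(row_cost(*r)[1] for r in ROWS if r[0] == "pw")

print(f"Mult-Adds : {total_madds / 1e6:.0f} million")
print(f"Parameters: {total_params / 1e6:.1f} million")
print(f"1x1 share : {pw_madds / total_madds:.1%} of Mult-Adds, "
      f"{pw_params / total_params:.1%} of parameters")
```

Under these assumptions the tally lands close to the 569 million Mult-Adds and 4.2 million parameters reported for the full model.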

It is worth noting that a small number of Mult-Adds alone does not make a model efficient; it is equally important that these operations can be implemented efficiently. For example, unstructured sparse matrix operations are typically not faster than dense matrix operations until the sparsity is very high. Our model puts nearly all of its computation into dense 1x1 convolutions, which can be implemented with a highly optimized general matrix multiply (GEMM), one of the most optimized numerical linear algebra algorithms. Convolutions implemented with GEMM usually require first reordering the input in memory with im2col, the approach used, for example, in Caffe. 1x1 convolutions require no such reordering and can be passed to GEMM directly. In MobileNet, about 95% of the Mult-Add operations and 75% of the parameters are in 1x1 convolutions.
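To make the GEMM point concrete, here is a minimal pure-Python sketch showing that a 1x1 convolution over a channels-last feature map is a single matrix product of the (H·W)×M pixel matrix with the M×N weight matrix. The plain `matmul` helper stands in for an optimized GEMM routine, and all names and sizes are illustrative assumptions.

```python
def matmul(a, b):
    """Dense matrix product of two row-major lists of rows (stand-in for GEMM)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def conv1x1_direct(F, weights):
    """1x1 convolution by explicit loops. F is H x W x M, weights is M x N."""
    m_ch, n_ch = len(weights), len(weights[0])
    return [[[sum(px[m] * weights[m][n] for m in range(m_ch)) for n in range(n_ch)]
             for px in row] for row in F]

def conv1x1_gemm(F, weights):
    """Same result via one matrix product; pixels are only regrouped, not copied into patches."""
    h, w = len(F), len(F[0])
    flat = [px for row in F for px in row]        # (H*W) x M
    out = matmul(flat, weights)                   # (H*W) x N
    return [out[y * w:(y + 1) * w] for y in range(h)]

H, W, M, N = 4, 5, 3, 2
F = [[[(y * W + x) * M + m for m in range(M)] for x in range(W)] for y in range(H)]
weights = [[m - n for n in range(N)] for m in range(M)]

same = conv1x1_direct(F, weights) == conv1x1_gemm(F, weights)
print("direct and GEMM results match:", same)

# For comparison, im2col for a 3x3 kernel would build rows of 3*3*M values per pixel.
print("im2col row length for 3x3:", 3 * 3 * M, "vs 1x1:", M)
```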

Training details: TensorFlow + RMSprop + asynchronous gradient descent (similar to Inception V3) + less regularization and data augmentation (small models are less prone to overfitting) + little or no weight decay on the depthwise filters (since they already have very few parameters).

Width Multiplier: Thinner Models

Although the base MobileNet is already small and fast, some use cases need an even smaller model. We introduce a hyperparameter $\alpha$ (the width multiplier) to build these smaller models. Its role is to thin the network uniformly at every layer: for a given $\alpha$, the number of input channels M becomes $\alpha M$ and the number of output channels N becomes $\alpha N$. Here $\alpha \in (0,1]$, with typical values of 1, 0.75, 0.5, and 0.25. With the width multiplier, the cost of a depthwise separable convolution becomes:

$$D_{K}\cdot D_{K}\cdot \alpha M\cdot D_{F}\cdot D_{F}+\alpha M\cdot \alpha N\cdot D_{F}\cdot D_{F}$$

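Both the computation and the number of parameters shrink to roughly $\alpha^{2}$ of their original values, since the dominant 1x1 term scales with $\alpha^{2}$.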

Resolution Multiplier: Reduced Representation

The second hyperparameter for reducing network computation is $\rho$ (the resolution multiplier). It is applied to the input image, and the internal representation of every layer then shrinks by the same factor. In practice, $\rho$ is set implicitly by choosing the input resolution. With both hyperparameters $\alpha$ and $\rho$, the computation becomes:

$$D_{K}\cdot D_{K}\cdot \alpha M\cdot \rho D_{F}\cdot \rho D_{F}+\alpha M\cdot \alpha N\cdot \rho D_{F}\cdot \rho D_{F}$$

Here $\rho \in (0,1]$, and the resulting input resolutions are typically 224, 192, 160, and 128. The resolution multiplier reduces computation by a factor of $\rho^{2}$ but leaves the number of model parameters unchanged.
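The effect of the two multipliers on a single depthwise separable layer can be sketched as follows. The layer shape (3x3 kernel, 512 channels, 14x14 feature map) and the function names are illustrative assumptions; `round` is used so that scaled channel counts and feature-map sizes stay integers.

```python
def separable_cost(dk, m, n, df, alpha=1.0, rho=1.0):
    """Mult-Adds of one depthwise separable layer under width and resolution multipliers."""
    m_thin, n_thin = round(alpha * m), round(alpha * n)
    df_small = round(rho * df)
    depthwise = dk * dk * m_thin * df_small * df_small
    pointwise = m_thin * n_thin * df_small * df_small
    return depthwise + pointwise

def separable_params(dk, m, n, alpha=1.0):
    """Weights of the same layer; note that rho does not appear."""
    m_thin, n_thin = round(alpha * m), round(alpha * n)
    return dk * dk * m_thin + m_thin * n_thin

base = separable_cost(3, 512, 512, 14)
thin = separable_cost(3, 512, 512, 14, alpha=0.5)
low_res = separable_cost(3, 512, 512, 14, rho=160 / 224)

print(f"alpha = 0.5     : cost ratio {thin / base:.4f} (alpha^2 = {0.5 ** 2:.4f})")
print(f"rho   = 160/224 : cost ratio {low_res / base:.4f} (rho^2 = {(160 / 224) ** 2:.4f})")
print(f"params, alpha 1 : {separable_params(3, 512, 512)}")
print(f"params, alpha .5: {separable_params(3, 512, 512, alpha=0.5)}")
```

The width multiplier gives slightly more than α² of the original cost because the depthwise term scales only linearly in α, while the resolution multiplier scales both terms by exactly ρ².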

Experiments

This section first compares depthwise separable convolutions with standard convolutions, and a thinner MobileNet with a shallower one. It then reports results for the two hyperparameters, including ImageNet accuracy, the number of Mult-Add operations, and the number of parameters. Finally, it presents results for MobileNet on several applications (fine-grained classification, large-scale geolocalization, face attributes, object detection, and face embeddings).

Model Choices

The experimental results show that the depthwise separable MobileNet is only about 1 percentage point less accurate on ImageNet than a MobileNet built from full convolutions (70.6% vs 71.7%), while using far fewer parameters and Mult-Adds. Comparing a thinner MobileNet with a shallower MobileNet at comparable computation and parameter count, the thinner MobileNet is about 3 percentage points more accurate (68.4% vs 65.3%). The results are shown below:

Model	ImageNet Accuracy	Million Mult-Adds	Million Parameters
Conv MobileNet	71.7%	4866	29.3
MobileNet	70.6%	569	4.2

Table 4. Depthwise Separable vs Full Convolution MobileNet.
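As a quick arithmetic check, the savings in Table 4 can be compared with the theoretical 8 to 9 times reduction for 3x3 kernels. The values below are copied from the table (in millions); the variable names are illustrative.

```python
conv_madds, conv_params, conv_acc = 4866, 29.3, 71.7   # Conv MobileNet
sep_madds, sep_params, sep_acc = 569, 4.2, 70.6        # MobileNet

madds_reduction = conv_madds / sep_madds
params_reduction = conv_params / sep_params
accuracy_drop = conv_acc - sep_acc

print(f"Mult-Adds reduction : {madds_reduction:.2f}x")
print(f"Parameter reduction : {params_reduction:.2f}x")
print(f"Accuracy drop       : {accuracy_drop:.1f} percentage points")
```

The whole-network Mult-Add reduction falls inside the per-layer 8 to 9 times range, even though the first layer and the classifier are identical in both models.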

Model	ImageNet Accuracy	Million Mult-Adds	Million Parameters
0.75 MobileNet	68.4%	325	2.6
Shallow MobileNet	65.3%	307	2.9

Table 5. Narrow vs Shallow MobileNet.

Model Shrinking Hyperparameters

This part covers tuning the two hyperparameters above, with results shown below:

Width Multiplier	ImageNet Accuracy	Million Mult-Adds	Million Parameters
1.0 MobileNet-224	70.6%	569	4.2
0.75 MobileNet-224	68.4%	325	2.6
0.5 MobileNet-224	63.7%	149	1.3
0.25 MobileNet-224	50.6%	41	0.5

Table 6. MobileNet width multiplier.

Resolution	ImageNet Accuracy	Million Mult-Adds	Million Parameters
1.0 MobileNet-224	70.6%	569	4.2
1.0 MobileNet-192	69.1%	418	4.2
1.0 MobileNet-160	67.2%	290	4.2
1.0 MobileNet-128	64.4%	186	4.2

Table 7. MobileNet resolution.
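The Mult-Add and parameter columns of Tables 6 and 7 can be approximately reproduced from the layer list in Table 1. The sketch below makes several assumptions: biases and batch-norm parameters are ignored, channel counts are scaled with `round`, and the final depthwise layer keeps its spatial size (as the size column of Table 1 indicates). The function and constant names are illustrative.

```python
# (output channels, stride) of the 13 depthwise separable blocks in Table 1.
BLOCKS = ([(64, 1), (128, 2), (128, 1), (256, 2), (256, 1), (512, 2)]
          + [(512, 1)] * 5
          + [(1024, 2), (1024, 1)])

def mobilenet_cost(alpha=1.0, resolution=224, classes=1000):
    """Return (mult_adds, params) for a given width multiplier and input resolution."""
    size = resolution // 2                 # the first convolution has stride 2
    channels = round(alpha * 32)
    params = 3 * 3 * 3 * channels          # first standard 3x3 convolution on RGB input
    madds = params * size * size
    for out_channels, stride in BLOCKS:
        size //= stride
        out_channels = round(alpha * out_channels)
        dw = 3 * 3 * channels              # depthwise 3x3 weights
        pw = channels * out_channels       # 1x1 convolution weights
        params += dw + pw
        madds += (dw + pw) * size * size
        channels = out_channels
    fc = channels * classes                # fully connected classifier
    return madds + fc, params + fc

for alpha, res in [(1.0, 224), (0.75, 224), (0.5, 224), (0.25, 224),
                   (1.0, 192), (1.0, 160), (1.0, 128)]:
    madds, params = mobilenet_cost(alpha, res)
    print(f"{alpha} MobileNet-{res}: {madds / 1e6:6.1f} M Mult-Adds, "
          f"{params / 1e6:.2f} M parameters")
```

Small differences from the reported figures are expected, since the tables are rounded and the sketch omits some parameters.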

MobileNet configurations are also compared with popular models:

Model	ImageNet Accuracy	Million Mult-Adds	Million Parameters
1.0 MobileNet-224	70.6%	569	4.2
GoogleNet	69.8%	1550	6.8
VGG 16	71.5%	15300	138

Table 8. MobileNet comparison to popular models.

Model	ImageNet Accuracy	Million Mult-Adds	Million Parameters
0.50 MobileNet-160	60.2%	76	1.32
Squeezenet	57.5%	1700	1.25
AlexNet	57.2%	720	60

Table 9. Smaller MobileNet comparison to popular models.

Fine Grained Recognition

A model for fine-grained classification was trained on the Stanford Dogs dataset and some noisy data from the web, with careful tuning, ultimately achieving results close to the state of the art while reducing computation and model size. The results are shown below:

Model	Top-1 Accuracy	Million Mult-Adds	Million Parameters
Inception V3	84%	5000	23.2
1.0 MobileNet-224	83.3%	569	3.3
0.75 MobileNet-224	81.9%	325	1.9
1.0 MobileNet-192	81.9%	418	3.3
0.75 MobileNet-192	80.5%	239	1.9

Table 10. MobileNet for Stanford Dogs.

Large Scale Geolocalization

PlaNet solves the localization problem by recasting it as a classification problem. PlaNet has already successfully localized many photos, and its performance on this problem has surpassed Im2GPS. We retrained PlaNet on the same data using the MobileNet architecture, with results shown below:

Scale	Im2GPS	PlaNet	PlaNet MobileNet
Continent (2500 km)	51.9%	77.6%	79.3%
Country (750 km)	35.4%	64.0%	60.3%
Region (200 km)	32.1%	51.1%	45.2%
City (25 km)	21.9%	31.7%	31.7%
Street (1 km)	2.5%	11.0%	11.4%

Table 11. Performance of PlaNet using the MobileNet architecture. Percentages are the fraction of the Im2GPS test dataset localized within a certain distance from ground truth.

Face Attributes

MobileNet can also be used to compress large systems with unknown or esoteric training procedures. A face attribute classifier was trained by combining MobileNet with distillation. The two proved complementary: the distilled system required no regularization, and several MobileNet settings match or exceed the baseline's mean AP with far fewer Mult-Adds. The results are shown below:

Width Multiplier / Resolution	Mean AP	Million Mult-Adds	Million Parameters
1.0 MobileNet-224	88.7%	568	3.2
0.5 MobileNet-224	88.1%	149	0.8
0.25 MobileNet-224	87.2%	45	0.2
1.0 MobileNet-128	88.1%	185	3.2
0.5 MobileNet-128	87.7%	48	0.8
0.25 MobileNet-128	86.4%	15	0.2
Baseline	86.9%	1600	7.5

Table 12. Face attribute classification using the MobileNet architecture. Each row corresponds to a different hyper-parameter setting (width multiplier α and image resolution).

Object Detection

In this experiment, VGG, Inception V2, and MobileNet serve as base networks within the SSD and Faster-RCNN frameworks, trained on the COCO dataset, with results shown below:

Framework Resolution	Model	mAP	Billion Mult-Adds	Million Parameters
SSD 300	deeplab-VGG	21.1%	34.9	33.1
SSD 300	Inception V2	22.0%	3.8	13.7
SSD 300	MobileNet	19.3%	1.2	6.8
Faster-RCNN 300	VGG	22.9%	64.3	138.5
Faster-RCNN 300	Inception V2	15.4%	118.2	13.3
Faster-RCNN 300	MobileNet	16.4%	25.2	6.1
Faster-RCNN 600	VGG	25.7%	149.6	138.5
Faster-RCNN 600	Inception V2	21.9%	129.6	13.3
Faster-RCNN 600	MobileNet	19.8%	30.5	6.1

Table 13. COCO object detection results comparison using different frameworks and network architectures. mAP is reported with the COCO primary challenge metric (AP at IoU=0.50:0.05:0.95).

Face Embeddings

FaceNet is a state-of-the-art face recognition model that builds face embeddings. Here we likewise use distillation to train a mobile FaceNet based on MobileNet. The results are shown below:

Model	1e-4 Accuracy	Million Mult-Adds	Million Parameters
FaceNet	83%	1600	7.5
1.0 MobileNet-160	79.4%	286	4.9
1.0 MobileNet-128	78.3%	185	5.5
0.75 MobileNet-128	75.2%	166	3.4
0.75 MobileNet-128	72.5%	108	3.8

Table 14. MobileNet distilled from FaceNet.

Conclusion

We proposed a model architecture based on depthwise separable convolutions, and used two hyperparameters—the width multiplier and the resolution multiplier—to control model complexity. We compared MobileNet against other models in terms of model size, speed, and accuracy, demonstrating its efficiency across a variety of applications. The next step is to refine and further develop MobileNet.

Datasets: ImageNet (image classification), Stanford Dogs dataset (fine-grained classification), YFCC100M (face attributes), COCO (object detection).