MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Abstract
Targeting mobile and embedded vision applications, this paper proposes an efficient model called MobileNets, a lightweight neural network built on depthwise separable convolutions. The model uses two hyperparameters to trade off accuracy against latency, and extensive experiments balancing the two were conducted on ImageNet, demonstrating strong performance compared with other models. Experiments also showcase MobileNets’ strengths across a wide range of applications, including object detection, fine-grained classification, face attributes, and large-scale geolocalization.
Introduction
Ever since AlexNet popularized deep CNNs, CNNs have become ubiquitous in computer vision, and the general trend has been toward deeper and more complex networks in pursuit of higher accuracy. However, these accuracy gains do not necessarily make networks more efficient in size or speed, whereas real-world applications such as robotics, self-driving cars, and AR need to run in real time on platforms with limited computation.
This paper proposes an efficient network architecture and a set of two hyperparameters for building models for the applications above. Section 2 reviews prior experience in building small models, Section 3 describes the MobileNet architecture and the two hyperparameters—the width multiplier and the resolution multiplier, Section 4 describes experiments on ImageNet and various applications, and finally Section 5 concludes.
Prior Work
Building efficient, small-footprint networks has become popular recently, and many approaches can be classified into two categories: compressing pretrained networks, or training small networks directly. The network architecture proposed in this paper lets the builder choose a small network that meets resource constraints. MobileNet focuses primarily on optimizing latency while producing small networks; many networks consider only size and not speed.
MobileNets make use of depthwise separable convolutions, which the Inception model also used to reduce computation in its first few layers. Flattened networks use fully factorized convolutions to build networks and demonstrate the potential of factorized networks. Factorized networks use similar convolution factorization and also leverage topological connections. Others include the Xception network, which scales up depthwise separable filters, and SqueezeNet, which uses bottlenecks. Some other networks that reduce computation include the structured transform network and deep fried convnets.
Other ways to obtain small networks: shrinking, factorizing, and compressing pretrained networks, where compressing includes product quantization, hashing, pruning, vector quantization, and Huffman coding. Factorization methods include references [14, 20]. Other approaches include distillation (training a small network using the outputs of a large network) and low bit networks.
MobileNet Architecture
This section first introduces the core depthwise separable filters, then describes the MobileNet network architecture, and finally closes with the two shrinking hyperparameters (the width multiplier and the resolution multiplier).
Depthwise Separable Convolutions
A depthwise separable convolution splits a standard convolution into two parts: a depthwise convolution and a 1x1 convolution. It also splits a single convolutional layer into two layers: filtering and combining. This factorization can drastically reduce both computation and model size. The comparison between a standard convolution and a depthwise separable convolution is shown below:
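The two steps (filtering, then combining) can be sketched in NumPy. This is a minimal illustration, assuming stride 1, zero "same" padding, an odd kernel size, and a channels-last (H, W, C) layout; the function names are chosen here and do not come from the paper. The final check shows that a depthwise separable layer matches a standard convolution whose kernel is the product of the depthwise and 1x1 filters. It is not an exact factorization of an arbitrary standard kernel.

```python
import numpy as np

def depthwise_conv(x, k):
    """Filtering step. x: (H, W, M); k: (Dk, Dk, M), one filter per channel."""
    h, w, m = x.shape
    dk = k.shape[0]
    p = dk // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))  # zero padding keeps H and W
    out = np.zeros((h, w, m))
    for i in range(dk):
        for j in range(dk):
            # Each channel is filtered independently (no mixing across channels).
            out += xp[i:i + h, j:j + w, :] * k[i, j, :]
    return out

def pointwise_conv(x, w):
    """Combining step (1x1 convolution). x: (H, W, M); w: (M, N)."""
    return x @ w

def standard_conv(x, k):
    """Reference standard convolution. x: (H, W, M); k: (Dk, Dk, M, N)."""
    h, w, m = x.shape
    dk, n = k.shape[0], k.shape[3]
    p = dk // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros((h, w, n))
    for i in range(dk):
        for j in range(dk):
            out += xp[i:i + h, j:j + w, :] @ k[i, j]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))    # 8x8 feature map, M = 4 channels
dw = rng.standard_normal((3, 3, 4))   # one 3x3 filter per input channel
pw = rng.standard_normal((4, 6))      # 1x1 convolution, M = 4 -> N = 6

separable = pointwise_conv(depthwise_conv(x, dw), pw)
full_kernel = dw[:, :, :, None] * pw[None, None, :, :]  # shape (3, 3, 4, 6)
reference = standard_conv(x, full_kernel)
print(separable.shape, np.allclose(separable, reference))
```

The separable layer stores 3·3·4 + 4·6 = 60 weights here, against 3·3·4·6 = 216 for the equivalent full kernel.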

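For a standard convolution, suppose the input feature map $F$ has dimensions $D_F \times D_F \times M$, the output feature map $G$ has dimensions $D_F \times D_F \times N$, and the convolution kernel $K$ has dimensions $D_K \times D_K \times M \times N$. Then, with stride 1 and padding, the standard convolution is computed by:
$$G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot F_{k+i-1,\,l+j-1,\,m}$$
The computational cost of the standard convolution is:
$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$$
while the computational cost of the depthwise separable convolution is:
$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$$
The former term is the cost of the depthwise convolution, and the latter is the cost of the 1x1 convolution. Comparing the two, the reduction in computation is:
$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}$$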
If a 3x3 kernel is used, computation is reduced by 8–9x, with only a small drop in accuracy. Factorizing further [16, 31] does not reduce computation much, because the depthwise convolution already has very little computation.
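As a quick arithmetic check of the 8–9x figure, the two cost formulas can be evaluated for one layer. The sketch below assumes a 3x3 kernel on a 14×14 feature map with 512 input and 512 output channels (one of the repeated layers in the network); the function and variable names are illustrative only.

```python
def standard_conv_cost(d_k, m, n, d_f):
    """Mult-adds of a standard conv: kernel d_k x d_k, m -> n channels, d_f x d_f map."""
    return d_k * d_k * m * n * d_f * d_f

def separable_conv_cost(d_k, m, n, d_f):
    """Mult-adds of a depthwise conv followed by a 1x1 (pointwise) conv."""
    depthwise = d_k * d_k * m * d_f * d_f
    pointwise = m * n * d_f * d_f
    return depthwise + pointwise

d_k, m, n, d_f = 3, 512, 512, 14
ratio = separable_conv_cost(d_k, m, n, d_f) / standard_conv_cost(d_k, m, n, d_f)
print(f"cost ratio = {ratio:.4f}, reduction = {1 / ratio:.2f}x")
```

The ratio equals 1/N + 1/D_K², so for a 3x3 kernel and a large number of output channels it approaches 1/9.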
Network Structure and Training
Apart from the first layer, which is a standard convolution, the rest of MobileNet’s structure is built on depthwise separable convolutions. The full network architecture is shown below:
| Type / Stride | Filter Shape | Input Size |
|---|---|---|
| Conv / s2 | 3×3×3×32 | 224×224×3 |
| Conv dw / s1 | 3×3×32 dw | 112×112×32 |
| Conv / s1 | 1×1×32×64 | 112×112×32 |
| Conv dw / s2 | 3×3×64 dw | 112×112×64 |
| Conv / s1 | 1×1×64×128 | 56×56×64 |
| Conv dw / s1 | 3×3×128 dw | 56×56×128 |
| Conv / s1 | 1×1×128×128 | 56×56×128 |
| Conv dw / s2 | 3×3×128 dw | 56×56×128 |
| Conv / s1 | 1×1×128×256 | 28×28×128 |
| Conv dw / s1 | 3×3×256 dw | 28×28×256 |
| Conv / s1 | 1×1×256×256 | 28×28×256 |
| Conv dw / s2 | 3×3×256 dw | 28×28×256 |
| Conv / s1 | 1×1×256×512 | 14×14×256 |
| 5× Conv dw / s1 | 3×3×512 dw | 14×14×512 |
| 5× Conv / s1 | 1×1×512×512 | 14×14×512 |
| Conv dw / s2 | 3×3×512 dw | 14×14×512 |
| Conv / s1 | 1×1×512×1024 | 7×7×512 |
| Conv dw / s2 | 3×3×1024 dw | 7×7×1024 |
| Conv / s1 | 1×1×1024×1024 | 7×7×1024 |
| Avg Pool / s1 | Pool 7×7 | 7×7×1024 |
| FC / s1 | 1024×1000 | 1×1×1024 |
| Softmax / s1 | Classifier | 1×1×1000 |
Table 1. MobileNet body architecture (the two 5× rows are repeated 5 times).
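The table can be turned into a rough cost tally. The sketch below is an illustration under stated assumptions, not the authors' accounting: it counts only convolution and fully connected weights (no biases or batch-norm parameters), assumes "same" padding so a stride-2 layer halves the feature map, and treats the last depthwise layer as stride 1 because its input and output sizes in the table are both 7×7. All names are made up for this example.

```python
# Depthwise separable blocks from Table 1 as
# (depthwise stride, input channels, output channels of the 1x1 conv).
# Assumption: the final depthwise layer is given stride 1 (7x7 in, 7x7 out).
BLOCKS = (
    [(1, 32, 64), (2, 64, 128), (1, 128, 128), (2, 128, 256),
     (1, 256, 256), (2, 256, 512)]
    + [(1, 512, 512)] * 5
    + [(2, 512, 1024), (1, 1024, 1024)]
)

def tally(resolution=224, num_classes=1000):
    """Return (mult_adds, params) dictionaries keyed by layer type."""
    size = -(-resolution // 2)  # first conv has stride 2 (ceil division)
    params = {"conv": 3 * 3 * 3 * 32, "dw": 0, "pw": 0, "fc": 1024 * num_classes}
    mult_adds = {"conv": params["conv"] * size * size, "dw": 0, "pw": 0,
                 "fc": params["fc"]}
    for stride, c_in, c_out in BLOCKS:
        size = -(-size // stride)          # output size of the depthwise layer
        dw_weights = 3 * 3 * c_in          # one 3x3 filter per channel
        pw_weights = c_in * c_out          # 1x1 convolution
        params["dw"] += dw_weights
        params["pw"] += pw_weights
        mult_adds["dw"] += dw_weights * size * size
        mult_adds["pw"] += pw_weights * size * size
    return mult_adds, params

mult_adds, params = tally()
total_ma, total_p = sum(mult_adds.values()), sum(params.values())
print(f"mult-adds: {total_ma / 1e6:.0f}M, params: {total_p / 1e6:.1f}M")
for kind in ("conv", "dw", "pw", "fc"):
    print(f"{kind:>4}: {100 * mult_adds[kind] / total_ma:5.2f}% of mult-adds, "
          f"{100 * params[kind] / total_p:5.2f}% of params")
```

Under these assumptions the totals come out at roughly 569 million mult-adds and 4.2 million parameters, with the 1x1 convolutions accounting for about 95% of the mult-adds and about 75% of the parameters.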
It is worth noting that a small number of Mult-Adds alone does not make a model efficient. It is equally important that these Mult-Add operations can be implemented efficiently. For example, unstructured sparse matrix operations are not necessarily faster than dense matrix operations unless the sparsity is very high. Our model puts nearly all of its computation into dense 1x1 convolutions, which can be implemented with a highly optimized general matrix multiply (GEMM). Convolutions implemented with GEMM usually require first reordering the input in memory using im2col; the popular Caffe package, for example, uses this approach. Our 1x1 convolutions, by contrast, require no such reordering and can be implemented directly with GEMM, one of the most optimized numerical linear algebra algorithms. In MobileNet, about 95% of the Mult-Add operations and about 75% of the parameters come from 1x1 convolutions.
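Why a 1x1 convolution needs no reordering can be seen in a small NumPy sketch. It assumes a channels-last (H, W, M) layout, so flattening the spatial dimensions is a plain reshape and the whole layer becomes a single matrix product; the function names are illustrative, and the naive loop version is included only as a reference.

```python
import numpy as np

def pointwise_conv_gemm(x, w):
    """1x1 conv as one matrix multiply. x: (H, W, M); w: (M, N)."""
    h, wd, m = x.shape
    flat = x.reshape(h * wd, m)          # every pixel becomes one row; no im2col
    return (flat @ w).reshape(h, wd, -1)

def pointwise_conv_naive(x, w):
    """Reference: per-pixel, per-output-channel dot products."""
    h, wd, m = x.shape
    n = w.shape[1]
    out = np.zeros((h, wd, n))
    for i in range(h):
        for j in range(wd):
            for c in range(n):
                out[i, j, c] = np.dot(x[i, j, :], w[:, c])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((14, 14, 16))
w = rng.standard_normal((16, 32))
same = np.allclose(pointwise_conv_gemm(x, w), pointwise_conv_naive(x, w))
print(same)
```

A 3x3 convolution, by contrast, would first have to gather each 3x3 neighbourhood into a row (the im2col step) before the same matrix multiply could be applied.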
Training details: TensorFlow + RMSprop + asynchronous gradient descent (similar to Inception V3) + less regularization and data augmentation (small models are less prone to overfitting) + little or no weight decay on the depthwise filters (since they already have very few parameters).
Width Multiplier: Thinner Models
Although the base MobileNet is already very small and fast, sometimes an even smaller model is needed. We introduce a hyperparameter $\alpha$ (the width multiplier) to build these smaller models. Its role is to thin the network uniformly at every layer. Given a width multiplier $\alpha$, the number of input channels $M$ becomes $\alpha M$ and the number of output channels $N$ becomes $\alpha N$. Typical values of $\alpha$ are 1, 0.75, 0.5, and 0.25. With kernel size $D_K$ and feature map size $D_F$, the cost of a depthwise separable convolution becomes:
$$D_K \cdot D_K \cdot \alpha M \cdot D_F \cdot D_F + \alpha M \cdot \alpha N \cdot D_F \cdot D_F$$
The computation becomes roughly $\alpha^2$ of what it was.
Resolution Multiplier: Reduced Representation
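The second hyperparameter for reducing network computation is $\rho$ (the resolution multiplier). In practice it is set implicitly by choosing the input resolution; the internal resolution of every layer then decreases by the same factor. With both the width multiplier $\alpha$ and the resolution multiplier $\rho$, the cost of a depthwise separable layer with kernel size $D_K$, feature map size $D_F$, $M$ input channels, and $N$ output channels becomes:
$$D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F$$
The input resolutions that result from typical settings of $\rho$ are 224, 192, 160, and 128. Note that $\rho$ reduces the computation (by roughly $\rho^2$) but does not change the number of model parameters.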
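The effect of both hyperparameters on the whole network can be estimated with a short script. This is a sketch under assumptions, not the authors' accounting: the layer list follows Table 1, channel counts are scaled by the width multiplier (called `alpha` here), the resolution multiplier is applied by passing the input resolution directly, only convolution and fully connected weights are counted, and the last depthwise layer is treated as stride 1. The function name `mobilenet_cost` is made up for this example.

```python
# (depthwise stride, input channels, output channels) for each separable block.
BLOCKS = (
    [(1, 32, 64), (2, 64, 128), (1, 128, 128), (2, 128, 256),
     (1, 256, 256), (2, 256, 512)]
    + [(1, 512, 512)] * 5
    + [(2, 512, 1024), (1, 1024, 1024)]   # assumption: last dw layer has stride 1
)

def mobilenet_cost(alpha=1.0, resolution=224, num_classes=1000):
    """Return (mult_adds, params) for a given width multiplier and input size."""
    def ch(c):
        return int(c * alpha)             # thinned channel count

    size = -(-resolution // 2)            # first 3x3 conv, stride 2
    params = 3 * 3 * 3 * ch(32)
    mult_adds = params * size * size
    for stride, c_in, c_out in BLOCKS:
        m, n = ch(c_in), ch(c_out)
        size = -(-size // stride)
        weights = 3 * 3 * m + m * n       # depthwise + 1x1 pointwise
        params += weights
        mult_adds += weights * size * size
    fc = ch(1024) * num_classes           # classifier after global average pooling
    return mult_adds + fc, params + fc

for alpha, res in [(1.0, 224), (0.75, 224), (0.5, 224), (0.25, 224),
                   (1.0, 192), (1.0, 160), (1.0, 128)]:
    ma, p = mobilenet_cost(alpha, res)
    print(f"{alpha:.2f} MobileNet-{res}: {ma / 1e6:6.1f}M mult-adds, {p / 1e6:.2f}M params")
```

Under these assumptions the estimates land close to the figures reported in Tables 6 and 7, and they make the two scaling rules visible: changing the width multiplier changes both mult-adds and parameters, while changing the resolution changes only the mult-adds.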
Experiments
This part mainly covers a number of experiments. First is the comparison between depthwise separable convolutions and standard convolutions, and between a thinner MobileNet and a shallower MobileNet. It then presents the experimental results for the two hyperparameters, including ImageNet accuracy, the number of Mult-Add operations, and the number of parameters. Finally, it presents experimental results for MobileNet on several different applications (fine-grained classification, large-scale geolocalization, face attributes, object detection, and face embeddings).
Model Choices
The experimental results show that a full-convolution MobileNet and a depthwise separable convolution MobileNet have comparable accuracy, but the depthwise separable convolution version has far fewer parameters and far less computation. Comparing a thinner MobileNet with a shallower MobileNet, at comparable computation the thinner MobileNet has somewhat higher accuracy. The results are shown below:
| Model | ImageNet Accuracy | Million Mult-Adds | Million Parameters |
|---|---|---|---|
| Conv MobileNet | 71.7% | 4866 | 29.3 |
| MobileNet | 70.6% | 569 | 4.2 |
Table 4. Depthwise Separable vs Full Convolution MobileNet.
| Model | ImageNet Accuracy | Million Mult-Adds | Million Parameters |
|---|---|---|---|
| 0.75 MobileNet | 68.4% | 325 | 2.6 |
| Shallow MobileNet | 65.3% | 307 | 2.9 |
Table 5. Narrow vs Shallow MobileNet.
Model Shrinking Hyperparameters
This part covers tuning the two hyperparameters above, with results shown below:
| Width Multiplier | ImageNet Accuracy | Million Mult-Adds | Million Parameters |
|---|---|---|---|
| 1.0 MobileNet-224 | 70.6% | 569 | 4.2 |
| 0.75 MobileNet-224 | 68.4% | 325 | 2.6 |
| 0.5 MobileNet-224 | 63.7% | 149 | 1.3 |
| 0.25 MobileNet-224 | 50.6% | 41 | 0.5 |
Table 6. MobileNet width multiplier.
| Resolution | ImageNet Accuracy | Million Mult-Adds | Million Parameters |
|---|---|---|---|
| 1.0 MobileNet-224 | 70.6% | 569 | 4.2 |
| 1.0 MobileNet-192 | 69.1% | 418 | 4.2 |
| 1.0 MobileNet-160 | 67.2% | 290 | 4.2 |
| 1.0 MobileNet-128 | 64.4% | 186 | 4.2 |
Table 7. MobileNet resolution.

| Model | ImageNet Accuracy | Million Mult-Adds | Million Parameters |
|---|---|---|---|
| 1.0 MobileNet-224 | 70.6% | 569 | 4.2 |
| GoogleNet | 69.8% | 1550 | 6.8 |
| VGG 16 | 71.5% | 15300 | 138 |
Table 8. MobileNet comparison to popular models.
| Model | ImageNet Accuracy | Million Mult-Adds | Million Parameters |
|---|---|---|---|
| 0.50 MobileNet-160 | 60.2% | 76 | 1.32 |
| Squeezenet | 57.5% | 1700 | 1.25 |
| AlexNet | 57.2% | 720 | 60 |
Table 9. Smaller MobileNet comparison to popular models.
Fine Grained Recognition
A fine-grained classification model was pre-trained on noisy web data and then fine-tuned on the Stanford Dogs dataset, ultimately achieving results close to the state of the art while greatly reducing computation and model size. The results are shown below:
| Model | Top-1 Accuracy | Million Mult-Adds | Million Parameters |
|---|---|---|---|
| Inception V3 | 84% | 5000 | 23.2 |
| 1.0 MobileNet-224 | 83.3% | 569 | 3.3 |
| 0.75 MobileNet-224 | 81.9% | 325 | 1.9 |
| 1.0 MobileNet-192 | 81.9% | 418 | 3.3 |
| 0.75 MobileNet-192 | 80.5% | 239 | 1.9 |
Table 10. MobileNet for Stanford Dogs.
Large Scale Geolocalization
PlaNet solves the localization problem by recasting it as a classification problem. PlaNet has already successfully localized many photos, and its performance on this problem has surpassed Im2GPS. We retrained PlaNet on the same data using the MobileNet architecture, with results shown below:
| Scale | Im2GPS | PlaNet | PlaNet MobileNet |
|---|---|---|---|
| Continent (2500 km) | 51.9% | 77.6% | 79.3% |
| Country (750 km) | 35.4% | 64.0% | 60.3% |
| Region (200 km) | 32.1% | 51.1% | 45.2% |
| City (25 km) | 21.9% | 31.7% | 31.7% |
| Street (1 km) | 2.5% | 11.0% | 11.4% |
Table 11. Performance of PlaNet using the MobileNet architecture. Percentages are the fraction of the Im2GPS test dataset localized within a certain distance from ground truth.
Face Attributes
MobileNet can also be used to compress large-scale systems with unknown training processes. A face attribute classification system used the synergy of MobileNet and distillation; after combining the two, the system not only required no regularization but also exhibited stronger performance, with results shown below:
| Width Multiplier / Resolution | Mean AP | Million Mult-Adds | Million Parameters |
|---|---|---|---|
| 1.0 MobileNet-224 | 88.7% | 568 | 3.2 |
| 0.5 MobileNet-224 | 88.1% | 149 | 0.8 |
| 0.25 MobileNet-224 | 87.2% | 45 | 0.2 |
| 1.0 MobileNet-128 | 88.1% | 185 | 3.2 |
| 0.5 MobileNet-128 | 87.7% | 48 | 0.8 |
| 0.25 MobileNet-128 | 86.4% | 15 | 0.2 |
| Baseline | 86.9% | 1600 | 7.5 |
Table 12. Face attribute classification using the MobileNet architecture. Each row corresponds to a different hyper-parameter setting (width multiplier α and image resolution).
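For readers unfamiliar with distillation, the idea of training a small network on the outputs of a large one can be sketched as a loss function. The example below is a generic softmax-based version in the style of "Distilling the knowledge in a neural network"; the temperature value, the function names, and the use of softmax rather than per-attribute outputs are assumptions made here, and the notes do not specify the exact loss used for the face attribute system.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax with temperature; higher temperature gives softer targets."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's soft targets and the student's predictions."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

rng = np.random.default_rng(0)
teacher = rng.standard_normal((4, 10))      # hypothetical logits of the large model
student = rng.standard_normal((4, 10))      # hypothetical logits of the small model
print(distillation_loss(student, teacher), distillation_loss(teacher, teacher))
```

The loss is smallest when the student reproduces the teacher's output distribution, which is why no ground-truth labels are needed for this term.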
Object Detection
This experiment used VGG, Inception, and MobileNet on SSD and Faster-RCNN to train on the COCO dataset, with results shown below:
| Framework Resolution | Model | mAP | Billion Mult-Adds | Million Parameters |
|---|---|---|---|---|
| SSD 300 | deeplab-VGG | 21.1% | 34.9 | 33.1 |
| SSD 300 | Inception V2 | 22.0% | 3.8 | 13.7 |
| SSD 300 | MobileNet | 19.3% | 1.2 | 6.8 |
| Faster-RCNN 300 | VGG | 22.9% | 64.3 | 138.5 |
| Faster-RCNN 300 | Inception V2 | 15.4% | 118.2 | 13.3 |
| Faster-RCNN 300 | MobileNet | 16.4% | 25.2 | 6.1 |
| Faster-RCNN 600 | VGG | 25.7% | 149.6 | 138.5 |
| Faster-RCNN 600 | Inception V2 | 21.9% | 129.6 | 13.3 |
| Faster-RCNN 600 | MobileNet | 19.8% | 30.5 | 6.1 |
Table 13. COCO object detection results comparison using different frameworks and network architectures. mAP is reported with the COCO primary challenge metric (AP at IoU=0.50:0.05:0.95).
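The IoU range in the metric can be made concrete with a small helper. This is only a sketch of the overlap test, assuming axis-aligned boxes given as (x1, y1, x2, y2); the full COCO AP computation also ranks detections by score and integrates precision over recall, which is not shown here.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# The ten IoU thresholds 0.50, 0.55, ..., 0.95 over which AP is averaged.
THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

def matched_at(pred_box, gt_box):
    """For each threshold, whether the prediction counts as a match."""
    overlap = iou(pred_box, gt_box)
    return [overlap >= t for t in THRESHOLDS]

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(matched_at((0, 0, 10, 10), (0, 0, 10, 8)))
```

A detection that overlaps its ground-truth box with IoU 0.8, as in the second call, counts as correct at the seven thresholds up to 0.80 and as a miss at the three stricter ones.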
Face Embeddings
FaceNet is a state-of-the-art face recognition model that builds face embeddings. Here we likewise use distillation to train a mobile FaceNet model. The results are shown below:
| Model | 1e-4 Accuracy | Million Mult-Adds | Million Parameters |
|---|---|---|---|
| FaceNet | 83% | 1600 | 7.5 |
| 1.0 MobileNet-160 | 79.4% | 286 | 4.9 |
| 1.0 MobileNet-128 | 78.3% | 185 | 5.5 |
| 0.75 MobileNet-128 | 75.2% | 166 | 3.4 |
| 0.75 MobileNet-128 | 72.5% | 108 | 3.8 |
Table 14. MobileNet distilled from FaceNet.
Conclusion
We proposed a model architecture based on depthwise separable convolutions, and used two hyperparameters—the width multiplier and the resolution multiplier—to control model complexity. We compared MobileNet against other models in terms of model size, speed, and accuracy, demonstrating its efficiency across a variety of applications. The next step is to refine and further develop MobileNet.
Some Other Related Notes
Datasets: ImageNet (image classification), Stanford Dogs dataset (fine-grained classification), YFCC100M (face attributes), COCO (object detection).
Related papers:
Datasets
“Imagenet large scale visual recognition challenge” (ImageNet, ILSVRC 2012)
“Novel dataset for fine-grained image categorization” (Stanford Dogs dataset; presented at the First Workshop on Fine-Grained Visual Categorization)
“Yfcc100m: The new data in multimedia research” (YFCC100M)
Deeper, more complex, higher-accuracy neural networks
“Inception-v4, inception-resnet and the impact of residual connections on learning” (Inception V4)
“Rethinking the inception architecture for computer vision” (Inception V3, additional factorization in the spatial dimension)
“Deep residual learning for image recognition” (ResNet)
“Going deeper with convolutions” (GoogLeNet)
“Very deep convolutional networks for large-scale image recognition” (VGG16)
“Imagenet classification with deep convolutional neural networks” (AlexNet)
Neural network compression and acceleration
“Flattened convolutional neural networks for feedforward acceleration” (additional factorization in the spatial dimension)
“Factorized convolutional neural networks” (factorizing convolutions)
“Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size” (using bottlenecks to build small networks)
“Quantized convolutional neural networks for mobile devices” (compression based on product quantization)
“Xnornet: Imagenet classification using binary convolutional neural networks” (using low bit networks)
“Training deep neural networks with low precision multiplications” (using low bit networks)
“Quantized neural networks: Training neural networks with low precision weights and activations” (using low bit networks)
“Xception: Deep learning with depthwise separable convolutions” (scaling up depthwise separable filters)
“Structured transforms for small-footprint deep learning” (a network for reducing computation)
“Deep fried convnets” (a network for reducing computation)
“Compressing neural networks with the hashing trick” (using hashing to compress neural networks)
“Rigid-motion scattering for image classification” (originally proposed factorizing a standard convolution into a depthwise conv and a 1x1 conv)
“Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding” (using Huffman coding to compress networks)
“Speeding up convolutional neural networks with low rank expansions” (additional variable factorization)
“Speeding-up convolutional neural networks using fine-tuned cp-decomposition” (additional variable factorization)
“Distilling the knowledge in a neural network” (using distillation to train a small network from a large one, for compression)
BN
“Batch normalization: Accelerating deep network training by reducing internal covariate shift” (Inception V2 also derives from this)
Frameworks
“Caffe: Convolutional architecture for fast feature embedding”
“Tensorflow: Large-scale machine learning on heterogeneous systems”
Image localization
“IM2GPS: estimating geographic information from a single image” (proposed Im2GPS)
“Large-Scale Image Geolocalization” (about Im2GPS)
“PlaNet - Photo Geolocation with Convolutional Neural Networks” (PlaNet)
Fine-grained classification
“The unreasonable effectiveness of noisy data for fine-grained recognition”
Object detection
“Faster r-cnn: Towards real-time object detection with region proposal networks” (the Faster-RCNN framework)
“Ssd: Single shot multibox detector” (the SSD framework)
Face embeddings
“Facenet: A unified embedding for face recognition and clustering” (FaceNet, building face embeddings based on triplet loss)