Notes on MobileNetV3 and the Lottery Ticket Hypothesis

MobileNetV3: Engineering Refinements on Top of NAS

After reading the MobileNetV3 paper, my overall impression is that it lacks the punch of the earlier classic works, but a model that performs well and actually deploys is still a good thing. Let’s start with the experimental results:

The overall idea is to obtain a seed architecture from MnasNet and then fine-tune it. The Large version directly reuses MnasNet’s search result, while the Small version re-runs the search with a modified objective: the paper changes the weight factor w from MnasNet’s -0.07 to -0.15, on the grounds that accuracy changes much more sharply with latency for small models.

The objective function used by MnasNet is:

$$ACC(m) \times \left[\frac{LAT(m)}{TAR}\right]^{w}$$

where ACC(m) is the accuracy of model m, LAT(m) is its measured latency, TAR is the target latency, and the magnitude of w controls the importance of latency.
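
To make the role of w concrete, here is a minimal sketch of the objective. The function name `mnas_reward` and the accuracy and latency numbers are made up for illustration. The default w = -0.07 is the value MnasNet reports, and the stricter w = -0.15 is included only for comparison.

```python
def mnas_reward(acc: float, latency_ms: float, target_ms: float, w: float = -0.07) -> float:
    """ACC(m) * (LAT(m) / TAR) ** w. With a negative w, slower models score lower."""
    return acc * (latency_ms / target_ms) ** w


# Hypothetical models with the same accuracy: one exactly on target, one twice as slow.
on_target = mnas_reward(acc=0.75, latency_ms=80.0, target_ms=80.0)
too_slow = mnas_reward(acc=0.75, latency_ms=160.0, target_ms=80.0)

# A more negative w penalizes the same slowdown more strongly.
too_slow_strict = mnas_reward(acc=0.75, latency_ms=160.0, target_ms=80.0, w=-0.15)
```

A model that hits the target exactly is scored by its accuracy alone, since the latency ratio is 1. The larger the magnitude of w, the more accuracy a slower model has to gain to break even.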

NetAdapt Fine-Tuning

The second step uses NetAdapt to fine-tune the seed architecture. NetAdapt’s main goal is to find the best accuracy subject to a latency constraint, taking a trained seed architecture as input:

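Starting from the seed architecture, NetAdapt generates a batch of candidate architectures, each of which reduces latency by at least delta. Each candidate inherits the seed’s weights, truncated where a layer shrank and randomly initialized where no matching weights exist, and is then fine-tuned. The original NetAdapt keeps the candidate with the best accuracy; MobileNetV3 instead keeps the one that maximizes the accuracy change per unit of latency change: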

$$\frac{\Delta Acc}{|\Delta latency|}$$

This process is repeated, with each round starting from the previous round’s winner, until the target latency is reached.
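
The selection loop can be sketched as follows. This is a toy illustration under stated assumptions, not the NetAdapt implementation: `Candidate`, `pick_candidate`, `shrink_to_target`, and `toy_propose` are hypothetical names, and the accuracy and latency numbers stand in for real fine-tuning and on-device measurement.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    name: str
    accuracy: float    # accuracy after fine-tuning (made-up numbers here)
    latency_ms: float  # measured latency (made-up numbers here)


def pick_candidate(current: Candidate, proposals: list[Candidate], delta_ms: float) -> Candidate:
    """Keep proposals that cut latency by at least delta_ms (> 0), then maximize dAcc / |dLatency|."""
    valid = [p for p in proposals if current.latency_ms - p.latency_ms >= delta_ms]
    if not valid:
        raise ValueError("no proposal reduces latency by at least delta_ms")
    return max(
        valid,
        key=lambda p: (p.accuracy - current.accuracy) / abs(p.latency_ms - current.latency_ms),
    )


def shrink_to_target(
    seed: Candidate,
    propose: Callable[[Candidate], list[Candidate]],
    target_ms: float,
    delta_ms: float,
) -> Candidate:
    """Repeat the selection step until the latency target is met."""
    current = seed
    while current.latency_ms > target_ms:
        current = pick_candidate(current, propose(current), delta_ms)
    return current


def toy_propose(c: Candidate) -> list[Candidate]:
    # Stand-in for "generate, populate, and fine-tune proposals".
    return [
        Candidate(c.name + "-thin", c.accuracy - 0.004, c.latency_ms - 4.0),
        Candidate(c.name + "-short", c.accuracy - 0.001, c.latency_ms - 3.0),
    ]


final = shrink_to_target(Candidate("seed", 0.750, 60.0), toy_propose, target_ms=50.0, delta_ms=2.0)
```

In this toy run the "-short" proposal wins every round because it loses less accuracy per millisecond saved, even though "-thin" saves more latency per step.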

Changes to the Head, Tail, and Activation Function

After obtaining the above model, some manual adjustments were made, mainly to the head, the tail, and the activation function.

Head change: halve the number of output channels of the initial convolution.

Tail change, as shown below:

The activation function is changed to:

$$\text{h-swish}[x] = x \, \frac{\text{ReLU6}(x+3)}{6}$$
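
As a quick check of the formula, here is a minimal scalar sketch in plain Python. The helper names are my own, and `swish` (x times sigmoid of x) is included only as the smooth function that h-swish approximates.

```python
import math


def relu6(x: float) -> float:
    """Clamp x to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def h_swish(x: float) -> float:
    """h-swish[x] = x * ReLU6(x + 3) / 6, a piecewise version of swish."""
    return x * relu6(x + 3.0) / 6.0


def swish(x: float) -> float:
    """swish(x) = x * sigmoid(x), shown for comparison."""
    return x / (1.0 + math.exp(-x))


# h-swish is exactly 0 for x <= -3 and exactly x for x >= 3.
samples = [(x, h_swish(x), swish(x)) for x in (-4.0, -1.0, 0.0, 1.0, 4.0)]
```

The appeal for mobile deployment is that the whole function needs only an add, a clamp, and a multiply, with no exponential.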

The resulting overall architecture is as follows:

The SE block is embedded into the block as follows:

The structure of the stride=2 block in MobileNetV2 is shown below. In V3, stride=2 and stride=1 are likewise distinguished by whether there is a residual connection, and the residual and SE modules do not conflict:
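
The two ideas in this section can be sketched in plain Python. This is a toy under stated assumptions: feature maps are lists of flat channels, the fully connected layers have no biases, the weights are made up, and the function names are hypothetical. The hard-sigmoid gate follows MobileNetV3's SE variant, and the extra channel check in `uses_residual` follows the MobileNetV2 convention.

```python
def hard_sigmoid(x: float) -> float:
    """ReLU6(x + 3) / 6, the piecewise-linear gate used in place of sigmoid."""
    return min(max(x + 3.0, 0.0), 6.0) / 6.0


def squeeze_excite(feature_maps, w_reduce, w_expand):
    """Reweight each channel by a gate computed from global channel statistics."""
    # Squeeze: global average pool, one number per channel.
    squeezed = [sum(ch) / len(ch) for ch in feature_maps]
    # Excite, step 1: bottleneck fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    # Excite, step 2: expand back to one gate per channel.
    gates = [hard_sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_expand]
    # Scale every activation in a channel by that channel's gate.
    return [[g * v for v in ch] for g, ch in zip(gates, feature_maps)]


def uses_residual(stride: int, in_channels: int, out_channels: int) -> bool:
    """The skip connection only makes sense when input and output shapes match."""
    return stride == 1 and in_channels == out_channels


# Two channels, two spatial positions each; the weights are chosen by hand.
scaled = squeeze_excite(
    feature_maps=[[1.0, 3.0], [2.0, 2.0]],
    w_reduce=[[0.5, 0.5]],
    w_expand=[[1.5], [-1.5]],
)
```

Because SE only rescales channels inside the block, the block's output shape is unchanged, which is why it can coexist with the residual add.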

The Lottery Ticket Hypothesis: Finding a Trainable Sparse Sub-Network

The core idea of this paper is that an iterative, unstructured pruning procedure can find a small network, and this iterative pruning process is itself a way of training the large network.

The resulting small network is roughly 10%–20% of the size of the full large network. If it starts from the same initial values that its weights had in the large network, it trains faster and matches the large network’s accuracy. This sub-network, together with its original initialization, is the so-called “winning ticket.”

The overall training-and-iteration process is shown below. The train-and-prune cycle is typically repeated many times, and the key point is that the surviving weights are reset to their original initial values theta_0 at the start of every round:
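
A minimal sketch of that loop, assuming weights are a flat list and pruning is by magnitude. `find_winning_ticket` and `toy_train` are hypothetical names, and `toy_train` is only a stand-in for real training so the example runs on its own.

```python
def find_winning_ticket(theta_0, train, rounds, prune_fraction):
    """Iterative magnitude pruning that rewinds to theta_0 before every round."""
    mask = [1] * len(theta_0)
    for _ in range(rounds):
        # Key step: start from the ORIGINAL initialization, not the trained weights.
        start = [w * m for w, m in zip(theta_0, mask)]
        trained = train(start, mask)
        # Rank the surviving weights by trained magnitude, smallest first.
        alive = sorted((i for i, m in enumerate(mask) if m), key=lambda i: abs(trained[i]))
        # Prune the smallest fraction of the survivors.
        for i in alive[: int(len(alive) * prune_fraction)]:
            mask[i] = 0
    ticket = [w * m for w, m in zip(theta_0, mask)]
    return mask, ticket


def toy_train(weights, mask):
    # Stand-in for training (an assumption): doubles every unmasked weight.
    return [2.0 * w * m for w, m in zip(weights, mask)]


theta_0 = [0.9, -0.1, 0.5, -0.7, 0.2, 0.3, -0.4, 0.8]
mask, ticket = find_winning_ticket(theta_0, toy_train, rounds=2, prune_fraction=0.5)
```

After two rounds at 50% per round, a quarter of the weights survive, and the returned ticket holds their theta_0 values rather than their trained values.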

Moreover, this effect becomes more pronounced as the degree of pruning increases: the more you prune, the harder the resulting small network is to train under random initialization, and the more important the initialization becomes — whereas keeping the same initialization as the pre-pruning network makes it easy to train.

Comparison with “Rethinking the Value of Network Pruning”

Another paper, “Rethinking the Value of Network Pruning,” argues that random initialization can in fact achieve very good performance, provided the learning rate is appropriate.

The two papers also differ in a few minor ways:

  1. That paper mainly uses structured pruning, with unstructured pruning only on smaller datasets — different from the lottery paper’s all-unstructured pruning.
  2. That paper mainly prunes large networks, whereas the lottery work is mainly done on small networks.
  3. The optimization approaches also differ; the optimization strategy used in the lottery paper is not exactly the mainstream configuration for classification tasks.

That paper finds that under structured pruning the effect reported in the lottery work does not appear: the pruned structures perform about the same whether they start from the original initialization or from a random one. Under unstructured pruning, the gap that does show up comes mainly from differences in learning rate.