Notes on Inception-V4
Abstract
In recent years, very deep convolutional neural networks have been the main driver of improvements in image recognition performance. The Inception architecture achieves strong performance at relatively low computational cost. Residual connections combined with a more conventional architecture produced the best results in the 2015 ILSVRC, with performance comparable to Inception-V3. This paper asks whether Inception networks benefit from residual connections. It gives clear empirical evidence that residual connections significantly accelerate the training of Inception networks, and some evidence that a residual Inception network slightly outperforms a non-residual one of nearly the same computational cost. The paper also proposes several new Inception networks, both with and without residual connections, which markedly improve single-frame classification performance on the ILSVRC 2012 task. Finally, it shows that appropriately scaling the activations stabilizes the training of very wide residual Inception networks.
Introduction
In 2012, AlexNet achieved excellent results on computer vision (CV) tasks, and deep CNNs have since been applied successfully across a wide range of CV domains. Because residual connections have been argued to be crucial for training very deep architectures, the paper combines them with deep Inception networks, aiming to gain the benefits of residual connections while retaining the computational efficiency of Inception.
Beyond this direct combination, the paper also explores whether Inception itself can be made deeper and wider to achieve better performance. To this end, it designs Inception-V4; thanks to TensorFlow’s distributed computing capabilities, the model no longer needs to be partitioned for training.
The paper also compares Inception-V3, Inception-V4, and residual Inception networks of comparable computational cost. The single-frame performance of Inception-V4 and Inception-ResNet-V2 on the ImageNet validation set is similar, and both exceed the previous state of the art. Finally, the paper finds that even an ensemble of these models does not yet reach the level of label noise in the dataset, so there is still room for improvement.
Related Work
CNNs have become very popular for large-scale image recognition tasks; Network in Network, VGGNet, and GoogLeNet (Inception-V1) are all important milestones. He et al. gave theoretical and empirical justification for the benefits of residual connections, particularly for applications in detection. They argue that residual connections are inherently necessary for training very deep CNN models, but the findings of this paper do not support that view, at least not for image recognition; settling the question probably requires more evidence and a deeper understanding of the role residual connections play in deep networks. The experimental section shows that very deep networks are not difficult to train even without residual connections, although residual connections substantially improve training speed, which is itself a strong argument for using them.
Starting from Inception-V1, the architecture has gone through several rounds of improvement: introducing batch normalization (BN) yielded V2, and additional factorization of the convolutions yielded V3.
Architectural Choices
This section describes the concrete architecture of the networks.
In the residual versions of the Inception network, simpler Inception blocks are used. Each Inception block is followed by a 1×1 convolution that expands the number of channels back to the depth of the block's input, compensating for the dimensionality reduction inside the block so that the output can be added to the shortcut. Another minor difference between the residual and non-residual versions is that, in the residual versions, BN is applied only on top of the conventional layers and not on top of the summations. This is a compromise that lets the network contain more Inception blocks while still being trainable on a single GPU; applying BN everywhere would likely be more beneficial.
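The channel-matching step can be illustrated with a small pure-Python sketch. It is not the paper's implementation: the helper names (conv1x1, residual_block, toy_branch), the toy weights, and the placement of the ReLU after the addition are all assumptions made for illustration. A 1×1 convolution is treated as a per-pixel linear map over channels.

```python
def conv1x1(pixels, weights):
    """Apply a 1x1 convolution (no activation) to a list of pixels.

    pixels:  list of per-pixel channel vectors, each of length c_in
    weights: c_out rows, each of length c_in (hypothetical toy weights)
    """
    return [[sum(w * v for w, v in zip(row, px)) for row in weights]
            for px in pixels]


def residual_block(pixels, branch, expand_weights):
    """Add an expanded branch output to the shortcut, then apply ReLU."""
    reduced = branch(pixels)                     # fewer channels than the input
    expanded = conv1x1(reduced, expand_weights)  # back to the input depth
    return [[max(0.0, s + r) for s, r in zip(px, ex)]
            for px, ex in zip(pixels, expanded)]


def toy_branch(pixels):
    """Stand-in for an Inception block: keeps only the first 2 channels."""
    return [px[:2] for px in pixels]


# Two pixels with 4 channels each; the branch reduces them to 2 channels.
x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0]]
expand = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # 2 -> 4 channels
y = residual_block(x, toy_branch, expand)
print(y)
```

Without the expansion the branch output would have 2 channels and could not be added elementwise to the 4-channel shortcut; the 1×1 convolution restores the matching depth.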
When the number of filters exceeds 1000, the residual variants become unstable and training fails early on; neither lowering the learning rate nor adding extra BN prevents this. He et al. proposed a warm-up phase with a low learning rate followed by training at a high learning rate to mitigate a similar problem, but this paper finds that scaling down the residuals before adding them to the previous layer's activations is more reliable: it stabilizes training without losing accuracy.
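A minimal sketch of residual scaling, assuming the scale is a fixed constant applied to the residual branch just before the addition. The function name and the default of 0.1 are illustrative choices, not values taken from these notes, and the trailing ReLU is likewise an assumption.

```python
def scaled_residual_add(shortcut, residual, scale=0.1):
    """Return relu(shortcut + scale * residual), elementwise.

    scale is a hypothetical constant in (0, 1]; scale=1.0 gives the
    ordinary unscaled residual connection.
    """
    if not 0.0 < scale <= 1.0:
        raise ValueError("scale must be in (0, 1]")
    return [max(0.0, s + scale * r) for s, r in zip(shortcut, residual)]


shortcut = [1.0, -1.0, 0.5]
residual = [10.0, 5.0, -20.0]   # large residual activations

unscaled = scaled_residual_add(shortcut, residual, scale=1.0)
scaled = scaled_residual_add(shortcut, residual, scale=0.1)
print(unscaled, scaled)
```

With the scale applied, the residual branch perturbs the shortcut only slightly, which is the intuition behind the more stable training of very wide networks.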
Training Methodology
This section summarizes the training details. The networks were trained with the TensorFlow distributed computing framework on Nvidia Kepler GPUs. The best model used RMSProp with decay = 0.9 and a learning rate of 0.045, decayed every two epochs by an exponential factor of 0.94.
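The schedule and optimizer settings can be written out as a short sketch. Two points are assumptions rather than statements from these notes: the decay is applied as a staircase (the rate changes only at every second epoch boundary), and the RMSProp epsilon defaults to 1.0 with epsilon placed inside the square root. The function names are hypothetical.

```python
import math


def learning_rate(epoch, base_lr=0.045, decay_rate=0.94, epochs_per_decay=2):
    """Staircase exponential decay: multiply by decay_rate every two epochs."""
    return base_lr * decay_rate ** (epoch // epochs_per_decay)


def rmsprop_step(weight, grad, mean_square, lr, decay=0.9, epsilon=1.0):
    """One scalar RMSProp update; returns (new_weight, new_mean_square).

    epsilon=1.0 and its position inside the square root are assumptions.
    """
    mean_square = decay * mean_square + (1.0 - decay) * grad * grad
    weight = weight - lr * grad / math.sqrt(mean_square + epsilon)
    return weight, mean_square


# Learning rate for the first six epochs (0-indexed).
for epoch in range(6):
    print(epoch, learning_rate(epoch))

# A single update on a toy scalar weight.
w, ms = rmsprop_step(weight=1.0, grad=2.0, mean_square=0.0,
                     lr=learning_rate(0))
```

Under the staircase assumption, epochs 0 and 1 use 0.045, epochs 2 and 3 use 0.045 × 0.94, and so on.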