A Close Reading of the ResNet Training Bag of Tricks
I recently read a very practical paper, “Bag of Tricks for Image Classification with Convolutional Neural Networks.” Its main idea is to stack a set of training tricks, lifting ResNet-50’s Top-1 validation accuracy on ImageNet from 75.3% to 79.29%. These notes follow the paper’s chapter structure and record each point in turn.
Standard Training Pipeline (Baseline)
Chapter 2 of the paper first lays out the standard training pipeline as a baseline for subsequent comparison.
Data preprocessing steps during training:
- Convert the image to 32-bit float, with pixel values in the [0, 255] range;
- Randomly crop a rectangular region with an aspect ratio sampled from [3/4, 4/3] and an area sampled from [8%, 100%] of the image, then resize the crop to 224×224;
- Randomly flip horizontally with probability 0.5;
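- Scale hue, saturation, and brightness by coefficients drawn uniformly from [0.6, 1.4];
- Add PCA noise, with the noise coefficient sampled from the normal distribution N(0, 0.1);
- Normalize the RGB channels by subtracting the per-channel mean and dividing by the per-channel standard deviation.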
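The random-crop step can be sketched in a few lines. This is only a rough illustration: the function name `sample_crop_box`, the uniform sampling of the aspect ratio, the retry count, and the whole-image fallback are my own assumptions, while the [3/4, 4/3] ratio range and the 8% to 100% area range come from the paper. The returned box would then be cropped and resized to 224×224 by whatever image library is in use.

```python
import math
import random


def sample_crop_box(width, height, rng=random, max_tries=10):
    """Return (left, top, crop_w, crop_h) for one random crop of a width x height image."""
    for _ in range(max_tries):
        # Target area: 8% to 100% of the image; aspect ratio (w / h) in [3/4, 4/3].
        area = rng.uniform(0.08, 1.0) * width * height
        ratio = rng.uniform(3 / 4, 4 / 3)
        crop_w = int(round(math.sqrt(area * ratio)))
        crop_h = int(round(math.sqrt(area / ratio)))
        # Accept the box only if it fits inside the image.
        if 0 < crop_w <= width and 0 < crop_h <= height:
            left = rng.randint(0, width - crop_w)
            top = rng.randint(0, height - crop_h)
            return left, top, crop_w, crop_h
    # Fallback (an assumption of this sketch): use the whole image.
    return 0, 0, width, height
```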
During validation, the short side is scaled to 256 while preserving the aspect ratio, then a 224×224 region is center-cropped.
All networks are initialized with Xavier initialization and optimized with Nesterov Accelerated Gradient (NAG). Training runs on 8×V100 for 120 epochs with a total batch size of 256. The initial learning rate is 0.1, and it is divided by 10 at epochs 30, 60, and 90.
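The baseline step schedule is easy to write down. A minimal sketch, assuming epochs are counted from 0 and that the drop takes effect at the start of each milestone epoch (the function name `step_lr` is mine):

```python
def step_lr(epoch, base_lr=0.1, milestones=(30, 60, 90), factor=0.1):
    """Learning rate for a given epoch under step decay."""
    # Count how many milestones have already been reached.
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** passed
```

With the defaults this gives 0.1 for epochs 0 to 29, 0.01 for 30 to 59, and so on.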
Training Tricks for New Hardware Platforms
Chapter 3 discusses engineering-level tricks oriented toward modern GPU hardware.
Large Batch and Linear Learning-Rate Scaling
In convex optimization, the convergence rate decreases as the batch size grows, and a similar phenomenon is observed in neural networks: for the same number of epochs, a smaller batch size often achieves better validation accuracy. Raising the batch size from 256 to 1024 costs on the order of a few tenths of a percentage point of Top-1 accuracy.
Increasing the batch size does not change the expectation of the stochastic gradient; it only reduces its variance, that is, the gradient noise. The learning rate can therefore be increased accordingly. Linear scaling is recommended: with 0.1 for a batch size of 256, use 0.2 for 512, and in general 0.1 × b/256 for a batch size of b.
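As a quick sanity check of the linear scaling rule, here is a one-line helper (the name `scaled_lr` is my own):

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch_size=256):
    """Scale the learning rate linearly with the batch size."""
    return base_lr * batch_size / base_batch_size
```

For example, `scaled_lr(1024)` gives a learning rate four times the 256-batch baseline.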
Warm-up
Starting training at a very large learning rate tends to be unstable, because the randomly initialized parameters are still far from a good solution. Linear scaling is therefore paired with warm-up: over the first n epochs of training (say 5), the learning rate is increased linearly from 0 to the initial learning rate.
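A minimal sketch of linear warm-up, counted in optimizer steps rather than epochs. The function name and the step-based bookkeeping are my own assumptions; `warmup_steps` would be the number of batches in the first n epochs.

```python
def warmup_lr(step, warmup_steps, target_lr):
    """Linearly ramp the learning rate from 0 to target_lr over warmup_steps."""
    if step >= warmup_steps:
        # Warm-up is over; the regular schedule takes it from here.
        return target_lr
    return target_lr * step / warmup_steps
```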
Where to Apply Weight Decay
Weight decay is best applied only to the weights of convolutional and fully connected layers. Biases and the parameters of BN layers (gamma, beta) should be left without weight decay.
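One common way to implement this split is by parameter shape. The sketch below is framework-neutral and rests on my own assumptions: it takes a hypothetical mapping from parameter name to shape, treats anything with two or more dimensions as a conv or FC weight, and uses 1e-4 purely as a placeholder decay value. In a real framework, the two groups would hold the parameter tensors themselves and be handed to the optimizer.

```python
def split_weight_decay(param_shapes, weight_decay=1e-4):
    """Split parameters into a decayed group and a non-decayed group.

    param_shapes: dict mapping parameter name -> shape tuple (hypothetical layout).
    """
    decay, no_decay = [], []
    for name, shape in param_shapes.items():
        # Assumption: conv / FC weights have >= 2 dims,
        # while biases and BN gamma / beta are 1-D vectors.
        if len(shape) >= 2:
            decay.append(name)
        else:
            no_decay.append(name)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```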
Low-Precision Training (FP16)
On the V100, switching the computation from FP32 to FP16 can speed up training by 2–3×: the V100’s FP32 throughput is about 14 TFLOPS, while its FP16 throughput is over 100 TFLOPS.
Using FP16 naively can disrupt training, because its narrow numeric range lets values overflow or underflow. One approach is to store all parameters and activations in FP16 and compute gradients in FP16, while keeping an FP32 copy of the parameters for the update step. Another practical method is to multiply the loss by a scalar so that the gradient values stay within the range representable by FP16. Neither gives up the FP16 speedup.
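To see why scaling the loss helps, here is a small demonstration using only the standard library. Scaling the loss by a constant scales every gradient by the same constant, so the sketch applies the scale to a gradient value directly. The gradient value 1e-8 and the scale 1024 are illustrative numbers I picked, not values from the paper, and `to_fp16` is a helper of my own that rounds through Python's half-precision `struct` format.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8          # a tiny gradient value (illustrative)
loss_scale = 1024.0  # hypothetical loss-scaling factor

# Stored directly in FP16, the gradient is below the smallest
# representable magnitude and underflows to zero.
naive = to_fp16(grad)

# Scaled first, it lands in FP16's representable range.
scaled = to_fp16(grad * loss_scale)

# The scale is divided out in higher precision before the weight update.
recovered = scaled / loss_scale
```

Here `naive` is exactly zero, while `recovered` stays close to the original gradient.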
Tweaks to the Network Architecture
Chapter 4 makes a few small structural changes directly to ResNet and proposes an improved variant. These are adjustments at the network-design level, which I won’t expand on here.
Additional Refinements to the Training Process
Chapter 5 introduces several training tricks that yield fairly noticeable accuracy gains.
Cosine Learning-Rate Decay
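The original ResNet uses step decay (a staircase-like vertical drop). Cosine decay instead lowers the learning rate smoothly from its initial value to 0 along half a cosine period, and switching to it can improve accuracy by nearly a full point.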
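The cosine schedule is η_t = ½ (1 + cos(tπ / T)) η, where η is the initial learning rate, t the current step, and T the total number of steps (warm-up excluded). A minimal sketch, with a function name of my own choosing:

```python
import math


def cosine_lr(step, total_steps, base_lr):
    """Cosine-decayed learning rate at a given step."""
    return 0.5 * (1.0 + math.cos(math.pi * step / total_steps)) * base_lr
```

The rate starts at `base_lr`, passes through half of it at the midpoint, and reaches 0 at the final step.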
Label Smoothing
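Label smoothing replaces the one-hot target with a slightly softened distribution. It brings a further accuracy gain (a few tenths of a point on top of cosine decay in the paper’s ablation), is simple to implement, and is reliably effective.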
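Concretely, label smoothing gives the true class probability 1 − ε and spreads ε evenly over the other K − 1 classes. A minimal sketch, assuming ε = 0.1 as a typical value (the function name is mine):

```python
def smooth_labels(true_class, num_classes, eps=0.1):
    """Return the smoothed target distribution as a list of probabilities."""
    off_value = eps / (num_classes - 1)  # mass given to each wrong class
    return [
        1.0 - eps if k == true_class else off_value
        for k in range(num_classes)
    ]
```

The result is then used as the target of the usual cross-entropy loss in place of the one-hot vector.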
Knowledge Distillation
With knowledge distillation (KD), a KD loss term is added on top of the classification loss computed from the softmax output. It measures the distance between the student’s output distribution and the teacher model’s, so the student gains additional supervisory signal from the teacher’s soft labels.
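The combined loss has the form ℓ(p, softmax(z)) + T² ℓ(softmax(r/T), softmax(z/T)), where p is the true label distribution, z and r are the student and teacher logits, T is a temperature, and ℓ is cross-entropy. Below is a rough pure-Python sketch for a single example. The helper names are my own, and the default temperature of 20 should be treated as an assumption to tune rather than a prescribed value.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def distillation_loss(one_hot, student_logits, teacher_logits, temperature=20.0):
    """Hard-label loss plus temperature-scaled soft-label loss."""
    hard = cross_entropy(one_hot, softmax(student_logits))
    soft = cross_entropy(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return hard + temperature ** 2 * soft
```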
Mixup
Mixup data augmentation, which trains on weighted blends of pairs of examples and their labels, is also effective. Note that after applying mixup you need to train for longer: in the paper, training was increased from 120 epochs to 200 epochs.
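Mixup draws λ from a Beta(α, α) distribution and forms x̂ = λ x_i + (1 − λ) x_j and ŷ = λ y_i + (1 − λ) y_j. A minimal sketch on plain Python lists, assuming α = 0.2 and one-hot label vectors (the function name and the list-based inputs are my own simplifications):

```python
import random


def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Blend two examples and their label vectors with a random weight."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam
```

Training then uses only the mixed pair (x̂, ŷ) as the input and target.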
Transfer Learning Results
Chapter 6 discusses how these tricks perform in transfer learning scenarios. I haven’t worked much in this area recently, so I’ll skip it for now.