A Survey of Generative Models and a Checklist of Image-Classification Training Tricks

Over the past couple of days I put together some deep-learning notes—part of them on the basic landscape of generative models, and part on the training tricks I’ve accumulated for image-classification tasks. I’m keeping the two together because some ideas in generative models—the regularization techniques and the training-stability concerns—actually share a lot with the training of discriminative classification models.

A Survey of Generative Models

Starting Point: Modeling the Data Distribution

The fundamental goal of a generative model is to model the probability distribution of the data, so that we can sample new examples consistent with the distribution of the training data.

Overview

- Unsupervised Learning
- Generative Models
  - PixelRNN and PixelCNN
  - Variational Autoencoders (VAE)
  - Generative Adversarial Networks (GAN)

The most intuitive starting point is the auto-encoder: compress a high-dimensional input into a low-dimensional latent space, then reconstruct it from that space. The encoder learns useful features, and the decoder learns to reconstruct.

To sample from the latent space and generate new examples, the latent space itself must be structured. Kernel density estimation (KDE) offers a non-parametric approach: it approximates the entire distribution by superposing kernel functions over the existing samples, but it is extremely inefficient in high-dimensional spaces.
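As a rough illustration of the superposition idea, the sketch below builds a one-dimensional Gaussian KDE in plain Python. The function name `gaussian_kde`, the toy samples, and the bandwidth of 0.5 are assumptions made for this example only.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density function built from 1-D samples with a Gaussian kernel."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Superpose one Gaussian bump per training sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

samples = [-1.2, -0.8, 0.1, 0.9, 1.1]  # toy data, made up for illustration
p = gaussian_kde(samples, bandwidth=0.5)
print(p(0.0), p(5.0))  # density near the data vs. far away from it
```

Every density query touches every training sample, and the number of samples needed to cover the space grows quickly with dimension, which is why the approach scales poorly to images.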

The figures above summarize several paths from parametric density estimation to latent-variable models; they are the starting point for understanding the various families of generative models.

PixelRNN and PixelCNN

Autoregressive models sidestep the question of “how to define the latent space” and instead factorize the joint probability via the chain rule:

$p(x) = \prod_{i} p(x_i \mid x_1, \ldots, x_{i-1})$

PixelRNN uses an LSTM to model the pixel sequence; it can capture long-range dependencies but is slow. PixelCNN uses masked convolutions to enforce causal dependencies within a local receptive field; it is faster to train because all pixel positions are computed in parallel, although sampling remains sequential. Both are early exemplars of autoregressive generation applied to images.
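To make the masked convolution concrete, here is a minimal sketch of the kernel mask for a single channel. The function name `causal_mask` is hypothetical, and the extra ordering between R, G and B channels used in the original paper is left out as a simplifying assumption.

```python
def causal_mask(kernel_size, mask_type):
    """Build a kernel_size x kernel_size mask for a masked convolution.

    mask_type "A" hides the centre pixel (first layer);
    mask_type "B" keeps it (later layers).
    """
    assert kernel_size % 2 == 1 and mask_type in ("A", "B")
    c = kernel_size // 2
    mask = [[0] * kernel_size for _ in range(kernel_size)]
    for r in range(kernel_size):
        for col in range(kernel_size):
            above = r < c                      # rows above the centre
            left = r == c and col < c          # same row, to the left
            centre = r == c and col == c and mask_type == "B"
            if above or left or centre:
                mask[r][col] = 1
    return mask

for row in causal_mask(3, "A"):
    print(row)
```

Multiplying the convolution weights by this mask before every forward pass means each output position only sees pixels that come earlier in raster-scan order.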


Pros:
- Can explicitly compute likelihood p(x)
- Explicit likelihood of training data gives good evaluation metric
- Good samples

Con:
- Sequential generation => slow

Improving PixelCNN performance
- Gated convolutional layers
- Short-cut connections
- Discretized logistic loss
- Multi-scale
- Training tricks
- Etc...

See:
- van den Oord et al., NIPS 2016
- Salimans et al., 2017 (PixelCNN++)

Autoregressive models have a clear training objective (maximizing likelihood) and avoid the adversarial instabilities that GANs suffer from; the downside is that generation samples one pixel at a time, cannot be parallelized across pixels, and is therefore expensive at inference.

Auto-Encoder and VAE

The variational auto-encoder (VAE) introduces a prior distribution over the latent variables (usually a standard normal) and, through variational inference, trains the encoder so that its approximate posterior is pulled toward that prior by a KL-divergence term. As a result the latent space as a whole becomes something we can sample from.
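The training objective behind this is usually written as the evidence lower bound (ELBO). The notation here is assumed for illustration: $q_\phi(z \mid x)$ is the encoder, $p_\theta(x \mid z)$ the decoder, and $p(z)$ the prior.

$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z))$

The first term is the reconstruction quality and the second is the regularizer that keeps the posterior close to the prior. If the encoder outputs a diagonal Gaussian with mean $\mu$ and variance $\sigma^2$ and the prior is a standard normal, the KL term has a closed form:

$D_{\mathrm{KL}} = \frac{1}{2} \sum_{j} (\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1)$

Sampling is made differentiable by writing $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ (the reparameterization trick).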

It’s worth noting that images generated by auto-encoder methods tend to look rather blurry compared with state-of-the-art GANs. This is commonly attributed to the pixel-level reconstruction loss, which is insensitive to high-frequency detail and rewards averaged predictions, rather than to a fundamental flaw in the VAE framework itself.

GAN: A Game-Theoretic Approach to Generation

GAN abandons explicit density estimation entirely and instead uses a discriminator to provide the training signal: the generator tries to fool the discriminator, while the discriminator tries to tell real samples from fake ones. It can be shown that, when the discriminator is optimal, the generator’s objective attains its global minimum exactly when the generated distribution equals the true distribution, which is the equilibrium of the two-player game. This is a statement about the objective; it does not guarantee that alternating gradient updates actually converge there.
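Written out, the game is the minimax objective from the original GAN formulation. The symbols are assumed here for illustration: $p_{\text{data}}$ is the true distribution, $p_z$ the noise prior, and $p_g$ the distribution of $G(z)$.

$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$

For a fixed generator the optimal discriminator is

$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$

and substituting it back gives

$V(D^*, G) = -\log 4 + 2 \, \mathrm{JSD}(p_{\text{data}} \,\|\, p_g)$

so the generator is effectively minimizing the Jensen-Shannon divergence, which is zero only when $p_g = p_{\text{data}}$.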

GAN’s generation quality is clearly superior to VAE’s in terms of visual sharpness, but training instability is a widely acknowledged difficulty.

Common practical tricks for stabilizing training include: using Batch Normalization or Spectral Normalization, replacing the JS divergence with the Wasserstein distance, updating the discriminator several times per generator update, and using a smaller learning rate together with Adam. Most of these tricks address the same underlying problem: how to avoid a severe imbalance in the training pace of the two networks.

A Practical Checklist of Image-Classification Tricks

Below is a collection of training tricks worth trying on benchmarks such as CIFAR-10 and ImageNet, organized roughly into three parts: the training process, data augmentation, and the validation process.

You can try these tricks to get higher accuracy on several image-classification benchmarks (e.g. CIFAR-10, ImageNet) and thus fool the reviewers into accepting your paper.

The Training Process

Use a cosine decay schedule for the learning rate: the learning rate decays smoothly from its initial value $\eta_0$ to 0 along a cosine curve, $\eta_t = \frac{1}{2} \eta_0 (1 + \cos(\pi t / T))$ for step $t$ out of $T$. The descent is continuous and slows toward the end of training, which leaves the model room for fine adjustment and tends to be more stable than step decay.
When increasing the batch size, increase the learning rate proportionally: when the batch size grows by a factor of $k$, multiply the learning rate by $k$ as well (the Linear Scaling Rule). The intuition comes from the statistics of SGD updates: a larger batch lowers the variance of the gradient estimate, so scaling up the learning rate in step keeps the overall update magnitude consistent.
Warm up at the beginning of training: at the start of training the weights are random and the gradients are noisy, so using the target learning rate immediately tends to cause oscillation or even divergence, especially in large-batch training. Warm-up linearly increases the learning rate from a very small value to the target value over the first few epochs.
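The three learning-rate tricks above compose naturally into one schedule. The sketch below is one possible way to combine them; the function name `learning_rate` and the defaults (a base rate of 0.1 for a reference batch of 256) are assumptions for illustration, not values from these notes.

```python
import math

def learning_rate(step, total_steps, warmup_steps,
                  base_lr=0.1, batch_size=256, base_batch_size=256):
    # Linear scaling rule: the peak rate grows with the batch size.
    peak_lr = base_lr * batch_size / base_batch_size
    if step < warmup_steps:
        # Linear warm-up from near zero to the peak rate.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from the peak rate down to 0 at total_steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Batch size 1024 is 4x the reference batch, so the peak rate is 4 * 0.1.
for s in (0, 99, 100, 550, 1000):
    print(s, learning_rate(s, total_steps=1000, warmup_steps=100, batch_size=1024))
```

Counting in optimizer steps rather than epochs is a design choice here; it keeps the curve smooth within an epoch.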
Do NOT use weight decay on BN layers: the scale (γ) and shift (β) parameters of a BN layer are responsible for feature scaling and shifting; applying weight decay to them weakens BN’s expressive power. Excluding BN parameters from weight decay usually brings a small but consistent accuracy gain.
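In practice this means building two parameter groups. The sketch below is framework-agnostic and works on (name, shape) pairs; the parameter names are hypothetical, and the rule "one-dimensional tensors get no decay" is an assumed heuristic that also exempts ordinary biases, which goes slightly beyond the BN-only rule stated above.

```python
def split_weight_decay(named_shapes, weight_decay):
    """Split parameters into a decayed group and a non-decayed group.

    named_shapes: iterable of (name, shape) pairs; shape is a tuple.
    """
    decay, no_decay = [], []
    for name, shape in named_shapes:
        # 1-D tensors: BN gamma/beta, and also ordinary biases.
        (no_decay if len(shape) <= 1 else decay).append(name)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Hypothetical parameter names and shapes for a conv + BN + linear stack.
params = [
    ("conv1.weight", (64, 3, 3, 3)),
    ("bn1.weight", (64,)),
    ("bn1.bias", (64,)),
    ("fc.weight", (10, 64)),
    ("fc.bias", (10,)),
]
groups = split_weight_decay(params, weight_decay=5e-4)
print(groups)
```

The list-of-dicts layout mirrors the per-group options that common optimizers accept, with the names standing in for the actual tensors.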
Auxiliary loss on shallow layers: attach a classification head to an intermediate shallow layer of the network, compute an auxiliary loss, and add it, weighted, to the main loss for joint backpropagation. In deep networks the gradient signal weakens as it propagates to shallow layers; an auxiliary loss supplies stronger supervision directly to those layers and helps keep their features from degenerating. The auxiliary head is used only during training and discarded at inference.
Add a knowledge distillation loss: have the student network learn not only from the one-hot labels but also from the soft targets (the temperature-softened class probabilities, or “dark knowledge”) produced by the teacher network. Soft targets carry inter-class similarity information and provide a richer supervisory signal than hard labels.
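A minimal sketch of such a combined loss for a single example follows, in the common form that mixes a hard-label cross-entropy with a temperature-scaled KL term. The temperature of 4 and the mixing weight `alpha` of 0.5 are illustrative assumptions, as are the function names.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [v / temperature for v in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    # Hard-label term: cross-entropy against the ground-truth class.
    hard = -math.log(softmax(student_logits)[label])
    # Soft-label term: KL(teacher || student) at temperature T.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0)
    # T^2 keeps the soft-term gradients on a comparable scale.
    return (1 - alpha) * hard + alpha * temperature ** 2 * soft

print(distillation_loss([2.0, 1.0, 0.1], [3.0, 2.5, 0.2], label=0))
```

A higher temperature flattens both distributions, so the student pays more attention to how the teacher ranks the wrong classes.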
Use label smoothing: smooth the one-hot label $y$ into $(1-\epsilon) \cdot y_i + \epsilon/K$, where $K$ is the number of classes (typically $\epsilon=0.1$), so that the model is not pushed toward overconfident predictions with ever-larger logits. This usually improves model calibration and often gives a modest accuracy gain on ImageNet.
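The smoothing formula is small enough to write out directly. In this sketch the function name `smooth_labels` is hypothetical; with $K=10$ and $\epsilon=0.1$ the true class ends up with probability 0.91 and every other class with 0.01.

```python
def smooth_labels(target, num_classes, epsilon=0.1):
    """Return the smoothed distribution (1 - eps) * one_hot + eps / K."""
    off = epsilon / num_classes
    dist = [off] * num_classes
    dist[target] = 1.0 - epsilon + off
    return dist

print(smooth_labels(target=3, num_classes=10))
```

The training loss is then the cross-entropy against this distribution instead of the one-hot vector.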
Drop path & dropout: Dropout randomly zeroes activations, typically in fully connected layers; Drop Path (Stochastic Depth) targets residual networks, dropping an entire residual branch with some probability and keeping only the identity mapping. The effect resembles an implicit ensemble, because each training step sees a network of a different effective depth.
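A sketch of drop path for one residual block, on plain Python lists, is given below. It uses the "inverted" convention that rescales the surviving branch during training, which is an assumption of this example; the original stochastic-depth formulation instead scales the branch at test time. The name `drop_path_block` is hypothetical.

```python
import random

def drop_path_block(x, residual_fn, drop_prob, training, rng=random):
    """Apply y = x + f(x), skipping f(x) entirely with probability drop_prob."""
    if not training or drop_prob == 0.0:
        return [a + b for a, b in zip(x, residual_fn(x))]
    if rng.random() < drop_prob:
        return list(x)                      # block dropped: identity only
    keep_prob = 1.0 - drop_prob
    # Rescale the surviving branch so its expected value matches inference.
    return [a + b / keep_prob for a, b in zip(x, residual_fn(x))]

double = lambda v: [2.0 * t for t in v]     # toy stand-in for a residual branch
print(drop_path_block([1.0, 2.0], double, drop_prob=0.5, training=True))
print(drop_path_block([1.0, 2.0], double, drop_prob=0.5, training=False))
```

Note that the whole branch is kept or dropped as a unit, unlike Dropout, which masks individual activations.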
Shake-shake: a regularization method for multi-branch residual networks. During training, the outputs of two parallel branches are combined as $\alpha \cdot b_1 + (1-\alpha) \cdot b_2$ with a random coefficient $\alpha \in [0, 1]$, and the backward pass uses a different, independently drawn coefficient. This adds training noise and prevents the network from over-relying on any single branch. At inference the coefficient is fixed at 0.5.
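The forward/backward asymmetry is easier to see with the gradient written by hand. The sketch below assumes two branches and plain Python lists; the function names and the manual backward pass are illustrative, not taken from any framework.

```python
import random

def shake_forward(branch1, branch2, training, rng=random):
    """Combine two branch outputs; returns (output, alpha)."""
    alpha = rng.random() if training else 0.5
    out = [alpha * a + (1.0 - alpha) * b for a, b in zip(branch1, branch2)]
    return out, alpha

def shake_backward(grad_out, training, rng=random):
    """Split the upstream gradient using a fresh coefficient beta."""
    beta = rng.random() if training else 0.5
    grad1 = [beta * g for g in grad_out]
    grad2 = [(1.0 - beta) * g for g in grad_out]
    return grad1, grad2

out, alpha = shake_forward([1.0, 1.0], [3.0, 3.0], training=True)
g1, g2 = shake_backward([1.0, 1.0], training=True)
print(out, alpha, g1, g2)
```

With an ordinary weighted sum the backward pass would reuse `alpha`; drawing `beta` independently is what makes the gradient noisy.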

Data Augmentation

Random flip, padding and random crop, etc.: the basic augmentations used in almost every experiment. Apply a random horizontal flip (with probability 0.5), then pad the image by a few pixels and randomly crop back to the original size, which exposes the network to small translations.
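A minimal sketch of this pipeline on a single-channel image stored as a list of rows is shown below. The padding of 4 pixels and the zero fill value are assumptions (the notes only say "a few pixels"), and `flip_pad_crop` is a hypothetical name.

```python
import random

def flip_pad_crop(image, pad=4, rng=random):
    """image: 2-D list (H x W) of pixel values; returns an H x W list."""
    h, w = len(image), len(image[0])
    if rng.random() < 0.5:
        image = [row[::-1] for row in image]          # horizontal flip
    # Zero-pad `pad` pixels on every side.
    padded = [[0] * (w + 2 * pad) for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in image]
    padded += [[0] * (w + 2 * pad) for _ in range(pad)]
    # Crop a random H x W window, i.e. a shift of up to `pad` pixels.
    top = rng.randint(0, 2 * pad)
    left = rng.randint(0, 2 * pad)
    return [row[left:left + w] for row in padded[top:top + h]]

image = [[r * 8 + c for c in range(8)] for r in range(8)]  # toy 8x8 image
augmented = flip_pad_crop(image, pad=2)
print(len(augmented), len(augmented[0]))
```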
Cutout: randomly select a square region in the image and set its pixels to zero. This forces the network not to rely on a few locally discriminative regions but to learn more global features from the whole image. It is analogous to Dropout, but acts in the input space.
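Here is a sketch of Cutout on a single-channel image, assuming the square is centred on a random pixel and simply clipped where it crosses the border. The mask size is a free hyperparameter, and an even size is assumed so the square has the stated side length.

```python
import random

def cutout(image, size, rng=random):
    """Zero a size x size square centred at a random pixel of a 2-D list."""
    h, w = len(image), len(image[0])
    cy, cx = rng.randrange(h), rng.randrange(w)
    top, bottom = max(0, cy - size // 2), min(h, cy + size // 2)
    left, right = max(0, cx - size // 2), min(w, cx + size // 2)
    out = [list(row) for row in image]                # leave the input intact
    for r in range(top, bottom):
        for c in range(left, right):
            out[r][c] = 0
    return out

image = [[1] * 8 for _ in range(8)]                   # toy 8x8 image of ones
masked = cutout(image, size=4)
print(sum(row.count(0) for row in masked), "pixels zeroed")
```

Because of the clipping, squares near the border erase fewer pixels than squares in the middle.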
Mixup: linearly interpolate between a pair of random training samples, $\tilde{x} = \lambda x_i + (1-\lambda) x_j$, with the labels interpolated in the same way, $\tilde{y} = \lambda y_i + (1-\lambda) y_j$, where $\lambda \in [0, 1]$ is drawn at random. This encourages the network to behave linearly between training examples and serves as an implicit form of regularization.
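The interpolation can be sketched in a few lines. Drawing $\lambda$ from a Beta($\alpha$, $\alpha$) distribution with $\alpha = 0.2$ is an assumption made for this example; the inputs are flat feature lists and one-hot label lists.

```python
import random

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Mix two samples; x_* are flat feature lists, y_* one-hot label lists."""
    lam = rng.betavariate(alpha, alpha)               # lambda ~ Beta(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam

x, y, lam = mixup([0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0])
print(lam, x, y)
```

In a training loop the second sample is usually taken from a shuffled copy of the same mini-batch, so no extra data loading is needed.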
Auto augmentation: turn the choice of augmentation strategy itself into a search problem, using reinforcement learning to search for a good policy in the combinatorial space of augmentation operations. The operations include ShearX, Rotate, Brightness, Contrast, and so on, and each operation has two further dimensions: its probability of application and its magnitude.

The Validation Process

10-crop and average: crop the test image at its 4 corners plus the center, then horizontally flip each of these 5 crops, for a total of 10 crops; run inference on each and take the average of the softmax probabilities as the final prediction. This is essentially an ensemble at inference time: averaging over different crop positions reduces the variance of any single crop. It’s a stable and reliable way to squeeze out extra points when chasing benchmarks, at roughly 10 times the inference cost.
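The crop geometry and the averaging step can be sketched as follows, again on a single-channel image stored as a list of rows. The function names are hypothetical, and the model call that turns each crop into class probabilities is left out.

```python
def ten_crop(image, crop_h, crop_w):
    """Return 10 crops of a 2-D list: 4 corners + centre, each also flipped."""
    h, w = len(image), len(image[0])
    origins = [
        (0, 0), (0, w - crop_w),                       # top-left, top-right
        (h - crop_h, 0), (h - crop_h, w - crop_w),     # bottom-left, bottom-right
        ((h - crop_h) // 2, (w - crop_w) // 2),        # centre
    ]
    crops = [[row[l:l + crop_w] for row in image[t:t + crop_h]]
             for t, l in origins]
    return crops + [[row[::-1] for row in c] for c in crops]

def average_predictions(prob_lists):
    """Average per-crop class probabilities into one prediction."""
    n = len(prob_lists)
    return [sum(col) / n for col in zip(*prob_lists)]

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 image
crops = ten_crop(image, 2, 2)
print(len(crops), crops[0], crops[5])
```

In use, each of the 10 crops is passed through the model and the resulting probability vectors are handed to `average_predictions`.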


2019 · 03 · 30