Feeding the GPU in Deep Learning
In the end, training VGG on CIFAR went from a full day down to one hour, and training a MobileNet model on ImageNet took only 2 minutes per epoch. (The code is at the end of the article.)
Diagnosing the Bottleneck on CIFAR
First, about training on CIFAR. If you just use torchvision’s dataloader with the most common pad/crop/flip data augmentation, training is very slow, roughly the speed shown below. 600 epochs take more than a day to finish, and the speed is erratic, sometimes fast and sometimes slow.
Train epoch: 5
Train [5][0/97] Time 18.988 (18.988) Data 16.504 (16.504) Loss 1.0435 Top-1 acc 59.375 Top-5 acc 98.047 lr 0.16000
...
Valid [5/600] Top-1 acc 63.310 Top-5 acc 95.000
Speed log of training CIFAR with the torchvision dataloader: the first batch of epoch 5 takes about 19 s, of which about 16.5 s is data loading. Overall, 600 epochs take more than a day.
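To put a number on how much of that batch is spent waiting for data, here is a small sketch that parses one log line. It assumes the `Data` field is the time spent waiting on the dataloader and that the value in parentheses is a running average, as in the common PyTorch example scripts; the post itself does not spell out the logger, and `data_fraction` is a name made up for this illustration.

```python
import re

# Matches the "Time cur (avg) Data cur (avg)" fields of the log line above.
# Assumption: "Data" is time blocked on the dataloader, "Time" is the whole batch.
LOG_RE = re.compile(
    r"Time\s+(?P<time>[\d.]+)\s+\((?P<time_avg>[\d.]+)\)\s+"
    r"Data\s+(?P<data>[\d.]+)\s+\((?P<data_avg>[\d.]+)\)"
)


def data_fraction(line: str) -> float:
    """Return the share of the batch time spent waiting for data."""
    match = LOG_RE.search(line)
    if match is None:
        raise ValueError("line does not contain Time/Data fields")
    return float(match.group("data")) / float(match.group("time"))


line = (
    "Train [5][0/97] Time 18.988 (18.988) Data 16.504 (16.504) "
    "Loss 1.0435 Top-1 acc 59.375 Top-5 acc 98.047 lr 0.16000"
)
frac = data_fraction(line)
print(f"{frac:.1%} of the batch time is data loading")  # prints 86.9% ...
```

When most of the batch time is data loading like this, the GPU is mostly idle and the model itself is not the thing to optimize.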
At first I assumed it was an IO problem, so I mounted a RAM disk, changed the path, and kept using torchvision’s dataloader. The speed barely changed…
Train epoch: 5
Train [5][0/97] Time 19.576 (19.576) Data 17.102 (17.102) Loss 1.0399 Top-1 acc 63.086 Top-5 acc 96.680 lr 0.16000
...
After mounting a RAM disk, the same batch still takes about 19.6 s (about 17.1 s of it data loading), which shows that disk IO is not the bottleneck
Then I opened the resource usage monitor and found that the CPU was nearly maxed out (I could only request 2 CPUs and one V100…), while the GPU utilization was very low. This basically confirmed that the bottleneck was the CPU processing speed.
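The same diagnosis can be made from inside the training loop by timing how long each iteration waits for data versus how long the step itself takes. The following is a minimal sketch of that idea, not the code used for the runs above: `profile_epoch`, `slow_loader`, and `fast_step` are hypothetical names, and the sleeps stand in for CPU augmentation and a GPU step.

```python
import time


def profile_epoch(loader, train_step):
    """Split wall-clock time into waiting-for-data and compute."""
    data_time = 0.0
    compute_time = 0.0
    end = time.perf_counter()
    for batch in loader:
        start = time.perf_counter()
        data_time += start - end      # time blocked on the loader
        train_step(batch)
        end = time.perf_counter()
        compute_time += end - start   # time spent in the training step
    return data_time, compute_time


# Hypothetical stand-ins: a slow loader and a fast training step.
def slow_loader(num_batches=5, delay=0.05):
    for i in range(num_batches):
        time.sleep(delay)             # pretend CPU augmentation is slow
        yield i


def fast_step(batch, delay=0.001):
    time.sleep(delay)                 # pretend the GPU step is quick


data_t, compute_t = profile_epoch(slow_loader(), fast_step)
print(f"data {data_t:.3f}s, compute {compute_t:.3f}s")
```

With a real PyTorch model on a GPU, CUDA kernels run asynchronously, so the step would need a `torch.cuda.synchronize()` before the second timestamp for the compute number to be meaningful.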

Using NVIDIA DALI for GPU Preprocessing
After some research I found that NVIDIA has a library called DALI that can do image preprocessing on the GPU as an entire pipeline, from input and decoding through to transforms. Looking it over, the common operations like pad/crop are fairly complete, and it supports PyTorch, Caffe, MXNet, and other frameworks.

Unfortunately I could not find a CIFAR pipeline in the official docs, so I wrote my own based on the ImageNet version. I hit a few pitfalls at first: to save effort I grabbed a JPEG version of CIFAR and decoded that, but the accuracy dropped a lot and I could not figure out why, so I had to read from CIFAR’s binary files instead. In the end I reached the same accuracy as with torchvision. Looking again at the speed and resource usage, the total training time dropped from a day to an hour, and GPU utilization was much higher.
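For readers who want to try the "read the binary files" route, here is a small sketch of a record reader. It is an assumption on my part that this means the CIFAR-10 binary version, where each record is 1 label byte followed by 3072 pixel bytes (1024 red, 1024 green, 1024 blue, each row-major); the pickled Python version of the dataset would need `pickle` instead. The function names are made up for illustration, and the data here is synthetic so the sketch runs without the dataset.

```python
RECORD_BYTES = 1 + 3 * 32 * 32  # 1 label byte + 3072 pixel bytes


def read_cifar10_records(raw: bytes):
    """Yield (label, pixels) pairs from CIFAR-10 binary-format bytes.

    pixels is a bytes object of length 3072 in channel-major order:
    1024 red values, then 1024 green, then 1024 blue, each row-major.
    """
    if len(raw) % RECORD_BYTES != 0:
        raise ValueError("input is not a whole number of records")
    for offset in range(0, len(raw), RECORD_BYTES):
        record = raw[offset:offset + RECORD_BYTES]
        yield record[0], record[1:]


def pixel(pixels: bytes, channel: int, row: int, col: int) -> int:
    """Look up one value: channel 0/1/2 is R/G/B, row and col are 0..31."""
    return pixels[channel * 1024 + row * 32 + col]


# Synthetic two-record buffer: an all-black image and an all-white image.
fake = bytes([3]) + bytes(3072) + bytes([7]) + bytes([255]) * 3072
records = list(read_cifar10_records(fake))
labels = [label for label, _ in records]
print(labels)  # [3, 7]
```

On real data the buffer would come from reading one of the dataset's `.bin` files in binary mode. Feeding these raw arrays to the pipeline skips JPEG decoding entirely.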
Train epoch: 7
Train [7][0/97] Time 0.076 (0.076) Data 0.005 (0.005) Loss 0.7536 Top-1 acc 74.023 Top-5 acc 97.656 lr 0.20000
...
WARNING:root:DALI iterator does not support resetting while epoch is not finished. Ignoring...
After switching to DALI, the logged Time for the first batch of an epoch drops from about 19 s to 0.076 s, and total training time drops from a day to an hour
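As a rough sanity check, the logged numbers can be turned into total run times. The 97 batches per epoch and 600 epochs are read off the log fields `[0/97]` and `[5/600]`; treating a single logged batch time as the average, and ignoring validation, are simplifying assumptions. The helper names are made up for this sketch.

```python
BATCHES_PER_EPOCH = 97  # from the "[0/97]" field in the training log
EPOCHS = 600            # from the "Valid [5/600]" line


def hours_for_run(batch_seconds: float) -> float:
    """Total training hours if every batch took batch_seconds."""
    return batch_seconds * BATCHES_PER_EPOCH * EPOCHS / 3600


def batch_seconds_for_run(hours: float) -> float:
    """Average seconds per batch implied by a total run time."""
    return hours * 3600 / (BATCHES_PER_EPOCH * EPOCHS)


dali_hours = hours_for_run(0.076)
torchvision_batch = batch_seconds_for_run(24)
first_batch_hours = hours_for_run(18.988)

print(f"DALI at 0.076 s/batch: {dali_hours:.2f} h")              # 1.23 h
print(f"one-day run implies {torchvision_batch:.2f} s/batch")    # 1.48 s
print(f"19 s for every batch would be {first_batch_hours:.0f} h")  # 307 h
```

The DALI figure lines up with the roughly one-hour total. A one-day torchvision run works out to about 1.5 s per batch on average, so the 19 s logged for batch 0 is presumably a slow first batch of the epoch rather than the typical cost.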

Multi-GPU Scaling on ImageNet
Now about accelerating ImageNet training. At first I again copied the entire dataset onto the mounted RAM disk (160 GB was about enough; copying and unpacking took a little under 10 minutes). As before, training with torchvision’s dataloader was very unstable, so I borrowed DALI’s official dataloader directly, and the speed took off the same way (I can no longer find the screenshots from those runs). Then, combined with Apex mixed precision and distributed training, GPU utilization held steady above 95% on 4 V100s and above 90% on 8 V100s. Going all the way up to 16 V100s and 32 CPUs, it held steady at around 85%; the resource monitor showed the CPUs were maxed out, and otherwise the GPUs could probably have stayed above 95% too. Training MobileNet on ImageNet with 16 V100s took only 2 minutes per epoch.
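To get a feel for what 2 minutes per epoch demands of the input pipeline, here is a back-of-the-envelope calculation. It assumes the full ILSVRC-2012 training set of 1,281,167 images and takes "2 minutes" as exactly 120 s; neither detail is stated above, and `images_per_second` is a name made up for this sketch.

```python
IMAGENET_TRAIN_IMAGES = 1_281_167  # ILSVRC-2012 training set size (assumed)
EPOCH_SECONDS = 120                # "2 minutes per epoch", taken as exact
NUM_GPUS = 16


def images_per_second(images: int, seconds: float, gpus: int = 1) -> float:
    """Throughput per GPU needed to finish `images` in `seconds`."""
    return images / seconds / gpus


total = images_per_second(IMAGENET_TRAIN_IMAGES, EPOCH_SECONDS)
per_gpu = images_per_second(IMAGENET_TRAIN_IMAGES, EPOCH_SECONDS, NUM_GPUS)

print(f"total:   {total:,.0f} images/s")    # 10,676 images/s
print(f"per GPU: {per_gpu:,.0f} images/s")  # 667 images/s
```

Under those assumptions the loader has to decode and augment more than ten thousand images per second, which helps explain why 32 CPUs were still maxed out even with DALI.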

The dataloader I wrote is up on GitHub. In my tests it reaches about the same accuracy as the torchvision version while being much faster. When I have time I will also write DALI versions of some other commonly used dataloaders and put them up.