Feeding the GPU in Deep Learning

In the end, training VGG on CIFAR went from a full day down to one hour, and training a MobileNet model on ImageNet took only 2 minutes per epoch. (The code is at the end of the article.)

Diagnosing the Bottleneck on CIFAR

First, training on CIFAR. If you just use torchvision’s dataloader with the most common pad/crop/flip data augmentation, training is very slow, at roughly the speed shown below. 600 epochs take more than a day to finish, and the speed is erratic, sometimes fast and sometimes slow.

Train epoch: 5
Train [5][0/97]   Time 18.988 (18.988)   Data 16.504 (16.504)   Loss 1.0435   Top-1 acc 59.375   Top-5 acc 98.047   lr 0.16000
...
Valid [5/600]     Top-1 acc 63.310   Top-5 acc 95.000

Speed log of training CIFAR with the torchvision dataloader; the per-batch Time/Data are both high and erratic, so 600 epochs take more than a day
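
The Time and Data columns in a log like this usually come from timing the wait for each batch separately from the whole iteration. Below is a minimal sketch of that bookkeeping using only the standard library. The names `AverageMeter`, `timed_epoch`, `data_fraction`, and `slow_loader` are hypothetical and chosen for illustration; this is not the actual training script. It also assumes each column prints the latest value followed by the running average in parentheses.

```python
import time


class AverageMeter:
    """Tracks the latest value and the running average of a quantity."""

    def __init__(self):
        self.val = 0.0
        self.sum = 0.0
        self.count = 0

    def update(self, val):
        self.val = val
        self.sum += val
        self.count += 1

    @property
    def avg(self):
        return self.sum / self.count if self.count else 0.0


def timed_epoch(loader, train_step):
    """Run one epoch, timing the wait for data separately from the full step."""
    data_time, batch_time = AverageMeter(), AverageMeter()
    end = time.perf_counter()
    for batch in loader:
        # Time spent blocked on the loader (decoding and augmentation).
        data_time.update(time.perf_counter() - end)
        train_step(batch)
        # Full iteration time: data wait plus the training step itself.
        batch_time.update(time.perf_counter() - end)
        end = time.perf_counter()
    return data_time, batch_time


def data_fraction(data_time, batch_time):
    """Share of total iteration time spent waiting for data (0.0 to 1.0)."""
    return data_time.sum / batch_time.sum if batch_time.sum else 0.0


def slow_loader(num_batches, delay):
    """Hypothetical stand-in for a CPU-bound dataloader."""
    for i in range(num_batches):
        time.sleep(delay)  # pretend to decode and augment a batch
        yield i
```

Calling `timed_epoch(slow_loader(3, 0.02), lambda batch: None)` and passing the two meters to `data_fraction` gives a value close to 1.0, the signature of a loader-bound run. In the log above, the first iteration of epoch 5 spent 16.504 of 18.988 seconds waiting for data, roughly 87%. When timing real GPU work, keep in mind that CUDA kernels run asynchronously, so a call such as `torch.cuda.synchronize()` before reading the clock is needed for accurate step times.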

At first I assumed it was an IO problem, so I mounted a RAM disk, changed the path, and kept using torchvision’s dataloader. The speed barely changed…

Train epoch: 5
Train [5][0/97]   Time 19.576 (19.576)   Data 17.102 (17.102)   Loss 1.0399   Top-1 acc 63.086   Top-5 acc 96.680   lr 0.16000
...

After mounting a RAM disk the per-batch Time is still around 19 s, which shows the bottleneck is not IO

Then I opened the resource usage monitor and found that the CPU was nearly maxed out (I could only request 2 CPUs and one V100…), while GPU utilization was very low. This basically confirmed that the bottleneck was CPU-side data preprocessing.

Figure: resource monitor showing the CPU nearly saturated while GPU utilization stays low; the bottleneck is the CPU.

Using NVIDIA DALI for GPU Preprocessing

After some research I found that NVIDIA has a library called DALI that can do image preprocessing on the GPU, covering the entire pipeline from input and decoding to transforms. Looking it over, the common operations like pad/crop are fairly complete, and it supports PyTorch, Caffe, MXNet, and other frameworks.

Figure: architecture of NVIDIA DALI’s GPU data preprocessing pipeline, covering the full flow from input and decoding to transforms.

Unfortunately I could not find a CIFAR pipeline in the official docs, so I wrote my own based on the ImageNet version. I hit a few pitfalls at first: to save effort I grabbed a JPEG version of CIFAR to decode, but the accuracy dropped a lot and I could not figure out why, so I had to read from CIFAR’s binary files instead. In the end I reached the same accuracy as with torchvision. Speed and resource usage improved sharply: total training time dropped from a day to an hour, and GPU utilization was much higher.

Train epoch: 7
Train [7][0/97]   Time 0.076 (0.076)   Data 0.005 (0.005)   Loss 0.7536   Top-1 acc 74.023   Top-5 acc 97.656   lr 0.20000
...
WARNING:root:DALI iterator does not support resetting while epoch is not finished. Ignoring...

After using DALI the per-batch Time drops from 19 s to 0.076 s, and total training time drops from a day to an hour

Figure: GPU resource utilization after switching to DALI; GPU utilization increased significantly.
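
As an illustration of reading raw pixels instead of decoding JPEGs, here is a minimal sketch of parsing CIFAR records. It assumes the CIFAR-10 "binary version" layout: each record is 1 label byte followed by 3072 pixel bytes, stored as a 32x32 red plane, then green, then blue. The function names are hypothetical, and this is not necessarily how the published dataloader reads the data.

```python
PLANE = 32 * 32                 # pixels per colour plane
RECORD_BYTES = 1 + 3 * PLANE    # 1 label byte + 3072 pixel bytes (assumed layout)


def parse_cifar10_records(raw):
    """Split a CIFAR-10 binary batch into (label, planar_pixels) pairs."""
    if len(raw) % RECORD_BYTES != 0:
        raise ValueError("buffer length is not a multiple of the record size")
    records = []
    for start in range(0, len(raw), RECORD_BYTES):
        label = raw[start]                              # int in 0..9
        pixels = raw[start + 1:start + RECORD_BYTES]    # R plane, G plane, B plane
        records.append((label, pixels))
    return records


def chw_to_hwc(pixels):
    """Interleave planar R, G, B bytes into per-pixel R,G,B triples (HWC order)."""
    out = bytearray(3 * PLANE)
    out[0::3] = pixels[0:PLANE]              # red
    out[1::3] = pixels[PLANE:2 * PLANE]      # green
    out[2::3] = pixels[2 * PLANE:3 * PLANE]  # blue
    return bytes(out)
```

A batch file would be read with `open(path, "rb").read()` and passed to `parse_cifar10_records`, where `path` points at one of the downloaded batch files. The HWC conversion is included because image pipelines commonly expect interleaved channels; whether a given pipeline needs it is something to check against its own documentation.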

Multi-GPU Scaling on ImageNet

Now about accelerating ImageNet training. At first I also copied the entire dataset onto the mounted RAM disk (160 GB was just about enough; copying and unpacking took a little under 10 minutes). Again, training with torchvision’s dataloader was very unstable, so I lifted DALI’s official dataloader directly, and the speed took off the same way, haha (I can no longer find the training screenshots from back then).

I then combined it with Apex mixed precision and distributed training. With 4 V100s, GPU utilization stayed steady above 95%; with 8 V100s, above 90%; and with 16 V100s and 32 CPUs, around 85%. In the 16-GPU case the resource monitor showed the CPUs maxed out; otherwise GPU utilization could probably have exceeded 95% there too. Training MobileNet on ImageNet with 16 V100s took only 2 minutes per epoch.

Figure: resource utilization when training ImageNet on 16 V100s with DALI and Apex; GPU utilization holds steady around 85%.
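
To put 2 minutes per epoch in perspective, the figure can be converted into images per second. This is a rough back-of-the-envelope sketch: it assumes an epoch covers the full ILSVRC-2012 training set of 1,281,167 images and treats "2 minutes" as exactly 120 seconds. The helper name `throughput` is hypothetical.

```python
IMAGENET_TRAIN_IMAGES = 1_281_167  # ILSVRC-2012 training set size (assumed epoch size)


def throughput(images, seconds, num_gpus):
    """Return (images/s overall, images/s per GPU) for one epoch."""
    total = images / seconds
    return total, total / num_gpus


total, per_gpu = throughput(IMAGENET_TRAIN_IMAGES, 2 * 60, 16)
print(f"{total:,.0f} images/s overall, {per_gpu:,.0f} images/s per GPU")
# prints "10,676 images/s overall, 667 images/s per GPU"
```

Under these assumptions the data pipeline has to decode and augment more than ten thousand images per second, which matches the observation that the 32 CPUs were saturated even with DALI in place.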

The dataloader I wrote is up on GitHub. In my tests its accuracy is about the same as the torchvision version, and it is much faster. When I have time I will also write DALI versions of some other commonly used dataloaders and put them up.

2019 · 08 · 12