Profiling Performance in Deep Learning Training and Inference
In deep learning engineering, performance bottlenecks are rarely something you can spot at a glance. In both training and inference, you need profiling tools to measure how much time each stage takes before you can optimize in a targeted way. This note records the profiling methods I use in each of the two phases.
Profiling the Training Process
The main goal of profiling during the training phase is to analyze how much time is spent on data loading, the forward pass, and the backward pass, so you can identify the time bottlenecks across the whole training pipeline.
In PyTorch, I use NVTX (NVIDIA Tools Extension) for annotation. In the training loop, I wrap each major stage (data loading, forward pass, loss computation, backward pass, optimizer update) in a named NVTX range by inserting a marker before and after it.
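A minimal sketch of this kind of annotation follows. It is not the exact code from my project: the toy linear model, the synthetic dataset, the range names, and the `nvtx_range` helper are all made up for illustration. The only NVTX calls it relies on are `torch.cuda.nvtx.range_push` and `torch.cuda.nvtx.range_pop`, and they are skipped when no CUDA device is available so the loop still runs on a CPU-only machine.

```python
import contextlib

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# NVTX markers are only emitted when a CUDA device is present.
USE_NVTX = torch.cuda.is_available()


@contextlib.contextmanager
def nvtx_range(name):
    """Hypothetical helper: wrap a block of code in a named NVTX range."""
    if USE_NVTX:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if USE_NVTX:
            torch.cuda.nvtx.range_pop()


def train(num_steps=3):
    """Run a few annotated steps on a toy model and return the losses."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Linear(16, 4).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Synthetic data standing in for a real dataset (assumption).
    dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 4, (64,)))
    data_iter = iter(DataLoader(dataset, batch_size=8))

    losses = []
    for _ in range(num_steps):
        # Fetching the batch explicitly lets data loading get its own range.
        with nvtx_range("data loading"):
            inputs, targets = next(data_iter)
            inputs, targets = inputs.to(device), targets.to(device)
        with nvtx_range("forward"):
            outputs = model(inputs)
        with nvtx_range("loss"):
            loss = criterion(outputs, targets)
        with nvtx_range("backward"):
            optimizer.zero_grad()
            loss.backward()
        with nvtx_range("optimizer step"):
            optimizer.step()
        losses.append(loss.item())
    return losses


if __name__ == "__main__":
    print(train())
```

Using a context manager keeps every push paired with a pop even if a stage raises an exception. Keep in mind that CUDA kernels launch asynchronously, so the ranges mark when work is issued on the CPU side; the GPU rows of the timeline show when that work actually executes.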
Once the code is annotated, you capture a run and open the result in Nsight Systems. The timeline is colored by NVTX range, so the time spent in each stage is immediately clear. I tested it myself, and the actual result is shown in the figure below. It really is quite useful: there is no more relying on printed timestamps for coarse-grained estimates.
Profiling the Inference Process
The situation in the inference phase is slightly different. The usual approach is to first convert the PyTorch-trained model to TensorRT, then use TensorRT for inference deployment. Profiling for this workflow needs to be done within the TensorRT context, which differs from the NVTX annotation approach used in the training phase. I plan to write up the specifics of this part in more detail later on.
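Until that write-up exists, here is a rough sketch of one mechanism TensorRT's Python API offers for this: subclassing `trt.IProfiler` and attaching the instance to an execution context, which then reports per-layer timings. This is not necessarily the approach I will end up describing. The class name `LayerTimeProfiler` and its `summary` method are hypothetical, and building the engine and execution context is assumed to happen elsewhere. The import is guarded so the aggregation logic can be tried without TensorRT installed.

```python
from collections import defaultdict

try:
    import tensorrt as trt
    _ProfilerBase = trt.IProfiler
except ImportError:
    # Fallback so the bookkeeping below can run without TensorRT.
    _ProfilerBase = object


class LayerTimeProfiler(_ProfilerBase):
    """Hypothetical profiler that accumulates per-layer execution time."""

    def __init__(self):
        _ProfilerBase.__init__(self)
        self.total_ms = defaultdict(float)
        self.calls = defaultdict(int)

    def report_layer_time(self, layer_name, ms):
        # TensorRT invokes this callback for each layer after an inference run.
        self.total_ms[layer_name] += ms
        self.calls[layer_name] += 1

    def summary(self, top_k=10):
        """Return (layer, total ms, mean ms) tuples, slowest layers first."""
        ranked = sorted(self.total_ms.items(), key=lambda kv: kv[1], reverse=True)
        return [(name, total, total / self.calls[name]) for name, total in ranked[:top_k]]


# Assumed usage, given an existing IExecutionContext named `context`:
#   profiler = LayerTimeProfiler()
#   context.profiler = profiler
#   ... run inference a number of times ...
#   for name, total, mean in profiler.summary():
#       print(f"{name}: total {total:.3f} ms, mean {mean:.3f} ms")
```

Aggregating over many runs rather than a single one smooths out warm-up effects and makes the slowest layers easier to pick out.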