Progressive Pruning: Breaking the Train-Prune-Finetune Paradigm

While the experiments keep running this morning, I’m taking the chance to organize the pruning papers I’ve been reading lately and to work through my own thoughts on the progressive-pruning direction.

Why Break Train-Prune-Finetune

The mainstream structured-pruning pipeline today is train-prune-finetune: first train a full model, then prune away redundant channels according to some importance metric, and finally finetune to recover accuracy. This pipeline is logically clean and is followed by most papers.

The idea behind progressive pruning is to skip this pipeline altogether and replace it with a train-prune-train loop: channels are pruned continuously during training, so the network is compressed as it learns and ends up as an extremely compressed model directly, with no separate finetune step. The method may look simple, but if it delivers strong results it is still worth pursuing.
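To make the loop concrete for myself, here is a minimal sketch of its control flow. Everything in it is an assumption for illustration: the network is reduced to a dict of per-layer channel counts, `train_fn` stands in for a real training phase, and `prune_widest` is a placeholder criterion, not the importance metric I actually intend to use.

```python
from typing import Callable, Dict

Channels = Dict[str, int]  # hypothetical stand-in for a model: layer name -> channel count


def prune_widest(channels: Channels, n: int) -> Channels:
    """Placeholder criterion: drop n channels, one at a time, from the widest layer."""
    out = dict(channels)
    for _ in range(n):
        name = max(out, key=out.get)
        if out[name] <= 1:
            break  # never remove a layer's last channel
        out[name] -= 1
    return out


def progressive_prune(
    channels: Channels,
    target_total: int,
    channels_per_step: int,
    train_fn: Callable[[Channels], None],
    prune_fn: Callable[[Channels, int], Channels] = prune_widest,
) -> Channels:
    """Alternate training and pruning until the total channel count hits the target."""
    channels = dict(channels)
    while sum(channels.values()) > target_total:
        train_fn(channels)  # train the current, already partly pruned, network
        excess = sum(channels.values()) - target_total
        pruned = prune_fn(channels, min(channels_per_step, excess))
        if sum(pruned.values()) >= sum(channels.values()):
            raise ValueError("prune_fn made no progress; target is unreachable")
        channels = pruned
    train_fn(channels)  # keep training at the final size; no separate finetune stage
    return channels
```

In a real experiment `train_fn` would run some number of epochs on the actual model and `prune_fn` would physically remove channels; the sketch only fixes the order of operations.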

Using Relative Cross-Layer Statistics to Guide Non-Uniform Pruning

Uniform pruning, which cuts the same fraction of channels in every layer, is the simplest baseline. But different convolutional layers contribute differently to the model’s capacity, so cutting a fixed fraction across the board is unlikely to be the best allocation.

The direction I’m considering now is to exploit relative statistics across the convolutional layers, not merely to find which channels are unimportant, but first to find which layers are relatively unimportant. Relatively unimportant layers can be pruned more heavily, while critical layers keep more of their channels. The result is a step-by-step pruning scheme: at each step the pruning budget is allocated according to relative importance across layers, rather than treating all layers identically.
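One way the per-step allocation could look is sketched below. It assumes each layer already has a positive importance score (how that score is computed is exactly the open question, so the numbers here are made up), and it uses one simple rule among many: a layer's share of the budget is proportional to its width divided by its importance. The function name and the `min_keep` parameter are hypothetical.

```python
from typing import Dict


def allocate_prune_budget(
    channels: Dict[str, int],
    importance: Dict[str, float],
    budget: int,
    min_keep: int = 1,
) -> Dict[str, int]:
    """Split `budget` pruned channels across layers; less important layers lose more."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    if any(importance[k] <= 0 for k in channels):
        raise ValueError("importance scores must be positive")
    # Each layer can lose at most (width - min_keep) channels.
    caps = {k: max(channels[k] - min_keep, 0) for k in channels}
    if budget > sum(caps.values()):
        raise ValueError("budget exceeds the number of prunable channels")
    if budget == 0:
        return {k: 0 for k in channels}

    # Assumed rule: share proportional to width / importance.
    weights = {k: channels[k] / importance[k] for k in channels}
    total_w = sum(weights.values())
    raw = {k: budget * weights[k] / total_w for k in channels}

    # Round down first, then hand out the remainder by largest fractional part.
    alloc = {k: min(int(raw[k]), caps[k]) for k in channels}
    leftover = budget - sum(alloc.values())
    order = sorted(channels, key=lambda k: raw[k] - int(raw[k]), reverse=True)
    while leftover > 0:
        for k in order:
            if leftover > 0 and alloc[k] < caps[k]:
                alloc[k] += 1
                leftover -= 1
    return alloc
```

With equal importance scores and equal widths this reduces to the uniform baseline, which makes the two schemes easy to compare under the same total budget.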

This scheme will ultimately need to be compared against uniform pruning in experiments. Intuitively it should perform better, especially at high compression ratios, where differences in the degree of redundancy across layers are amplified and the advantage of a non-uniform scheme should become more pronounced.

The Number of Pruned Channels as an Analog of the Learning Rate

There’s another angle I find quite interesting: treating the number of channels pruned at each step as an analog of the learning rate in gradient descent. The learning rate determines the magnitude of each parameter update; the pruning step size determines the magnitude of each structural change. Both need to strike a balance between “changing too drastically, causing instability” and “changing too little, causing low efficiency.” This analogy might let us borrow ideas from learning-rate scheduling to design a pruning-scheduling strategy.
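As a first illustration of what borrowing from learning-rate scheduling might mean, the sketch below shapes the per-step pruning count like a cosine learning-rate decay: large structural changes early, small ones near the end. The choice of cosine decay, the function name, and its parameters are all my own assumptions for the example; whether a decaying schedule is actually the right shape is something the experiments would have to show.

```python
import math
from typing import List


def cosine_prune_schedule(total_to_prune: int, num_steps: int) -> List[int]:
    """Channels to prune at each step, following a cosine-decay shape.

    The per-step counts always sum to `total_to_prune`.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if total_to_prune < 0:
        raise ValueError("total_to_prune must be non-negative")

    # Same shape as cosine learning-rate decay: weight 1 at t=0, near 0 at the end.
    weights = [0.5 * (1.0 + math.cos(math.pi * t / num_steps)) for t in range(num_steps)]
    total_w = sum(weights)

    schedule = []
    running = 0.0
    pruned_so_far = 0
    for t, w in enumerate(weights):
        running += w
        if t == num_steps - 1:
            target = total_to_prune  # force the last step to land exactly on the budget
        else:
            target = round(total_to_prune * running / total_w)
        # Rounding the cumulative target (not each step) keeps the total exact.
        schedule.append(target - pruned_so_far)
        pruned_so_far = target
    return schedule
```

Each entry of the returned list could then serve as the pruning budget for one pruning step, in the same way a scheduler hands the optimizer a learning rate for each epoch.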

The next step is to get the first experimental results, then run a systematic comparison against the uniform baseline to see how far this direction can go.