Notes on Three CVPR 2019 Neural Network Pruning Papers
I recently read three pruning-related papers from CVPR 2019. Here I record the core idea and experimental conclusions of each, along with the progress of my own experiments today.
Variational Convolutional Neural Network Pruning
The main argument of this paper is that pruning decisions based solely on deterministic weight values are unstable. Instead, one should estimate a probability distribution over the channel saliency that these values reflect, and then decide whether to prune based on that distribution.
Concretely, in a standard BN layer the shift term beta survives even when gamma is zero, so the approach rewrites the BN layer so that when gamma is small, the layer's whole output tends toward zero.
| Model | Accuracy | Channels | Channels Pruned | Parameters | Parameters Pruned | FLOPs | FLOPs Pruned |
|---|---|---|---|---|---|---|---|
| VGG-16 Base | 93.25% | 4224 | - | 14.71M | - | 313M | - |
| VGG-16 Pruned | 93.18% | 1599 | 62% | 3.92M | 73.34% | 190M | 39.10% |
Gamma then serves as a proxy for channel saliency, and variational inference is used to estimate a distribution for each gamma. Once these distributions are obtained, a threshold is set on their mean and variance, and the channels whose distributions fall below the threshold are removed.
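To make these two steps concrete for myself, here is a minimal pure-Python sketch. It rests on assumptions of mine rather than the paper's exact formulation: `scaled_bn` multiplies the shift by gamma as one way to get a zero output when gamma is zero, and `channels_to_keep` prunes a channel only when both the absolute mean and the standard deviation of its gamma distribution are below hypothetical thresholds. The posterior values are made up for illustration.

```python
def standard_bn(x_hat, gamma, beta):
    """Usual BN affine step on already-normalized activations."""
    return [gamma * v + beta for v in x_hat]

def scaled_bn(x_hat, gamma, beta):
    """Assumed rewrite: the shift is scaled by gamma, so gamma == 0 zeroes the output."""
    return [gamma * (v + beta) for v in x_hat]

def channels_to_keep(means, stds, mean_thresh=0.05, std_thresh=0.05):
    """Keep a channel unless its gamma distribution is concentrated near zero.

    The two-threshold rule is an assumption, not the paper's criterion.
    """
    keep = []
    for c, (mu, sigma) in enumerate(zip(means, stds)):
        concentrated_at_zero = abs(mu) < mean_thresh and sigma < std_thresh
        if not concentrated_at_zero:
            keep.append(c)
    return keep

# Hypothetical posterior parameters for four channels.
gamma_mean = [0.80, 0.01, -0.02, 0.03]
gamma_std = [0.10, 0.01, 0.02, 0.30]
kept = channels_to_keep(gamma_mean, gamma_std)
print(kept)
```

Channel 3 has a mean near zero but a wide distribution, so this rule keeps it. That is the point of looking at the distribution rather than a single value.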
The experiments are conducted on VGG-16 / CIFAR, and the results on their own are unremarkable. As a point of reference, when I previously did structured pruning with SNIP, I could still reach 93% accuracy at a 50x compression ratio (around 0.3M parameters); by comparison, the roughly 73% parameter reduction in this paper is not particularly striking.
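As a quick sanity check on this comparison, the reductions and compression ratios can be recomputed from the numbers in the table. This is only arithmetic on the rounded figures quoted above, so small differences from the reported percentages are presumably due to rounding.

```python
def reduction_pct(before, after):
    """Percentage of `before` that was removed."""
    return 100.0 * (before - after) / before

channels_pruned = reduction_pct(4224, 1599)      # about 62.1
params_pruned = reduction_pct(14.71e6, 3.92e6)   # about 73.4
flops_pruned = reduction_pct(313e6, 190e6)       # about 39.3

# Compression ratio of the pruned VGG-16 in the table.
paper_ratio = 14.71e6 / 3.92e6                   # about 3.75x

# Parameter count implied by a 50x compression of the same baseline.
params_at_50x = 14.71e6 / 50                     # about 0.29M

print(round(paper_ratio, 2), round(params_at_50x / 1e6, 2))
```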
Importance Estimation for Neural Network Pruning
This paper proposes a new importance metric: the change in loss caused by removing a given channel, that is, each filter's contribution to the final loss. Computing this directly is expensive because it requires re-evaluating the loss once per filter, so the authors approximate it with a Taylor expansion, which reduces the computational cost. The experiments are all done on ResNet; the input is a pretrained network, and the number of channels pruned per iteration is set manually.
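The sketch below contrasts the expensive definition with a first-order Taylor approximation on a toy linear model with a squared loss. Everything here is an assumption for illustration: the model, the data, and the choice to square both quantities. The first-order score for a filter is taken as the squared sum of gradient times weight over that filter's parameters.

```python
def predict(weights, x):
    """Toy model: sum of per-filter dot products."""
    return sum(w * xi for wf, xf in zip(weights, x) for w, xi in zip(wf, xf))

def loss(weights, samples):
    return sum(0.5 * (predict(weights, x) - y) ** 2 for x, y in samples) / len(samples)

def grads(weights, samples):
    """Analytic gradient of the loss with respect to every weight."""
    g = [[0.0] * len(wf) for wf in weights]
    for x, y in samples:
        err = predict(weights, x) - y
        for f, xf in enumerate(x):
            for j, xi in enumerate(xf):
                g[f][j] += err * xi / len(samples)
    return g

def taylor_importance(weights, g):
    """First-order estimate: (sum_j g_j * w_j)^2 per filter, one backward pass in total."""
    return [sum(gi * wi for gi, wi in zip(gf, wf)) ** 2 for gf, wf in zip(g, weights)]

def oracle_importance(weights, samples):
    """Exact squared loss change when a filter is zeroed, one loss evaluation per filter."""
    base = loss(weights, samples)
    scores = []
    for f in range(len(weights)):
        ablated = [[0.0] * len(wf) if i == f else wf for i, wf in enumerate(weights)]
        scores.append((loss(ablated, samples) - base) ** 2)
    return scores

# Hypothetical weights (three "filters" of two parameters each) and two samples.
weights = [[0.5, -0.3], [0.0, 0.0], [1.2, 0.4]]
samples = [
    ([[1.0, 2.0], [0.5, -1.0], [0.3, 0.1]], 1.0),
    ([[-1.0, 0.5], [2.0, 1.0], [0.2, -0.4]], -0.5),
]
taylor = taylor_importance(weights, grads(weights, samples))
oracle = oracle_importance(weights, samples)
print(taylor, oracle)
```

The Taylor scores need only the gradients that training already computes, whereas the oracle scores need one extra forward evaluation per filter.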
One interesting point in the paper: it takes the result of oracle pruning as an upper bound, compares the proposed importance metric against it, and shows a very high correlation between the two, which indirectly validates the soundness of this measurement approach. The paper also mentions sensitivity analysis, the details of which I still need to go back and review.
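One way to quantify this kind of agreement is a rank correlation between the oracle scores and the approximate scores. The sketch below implements Spearman's rank correlation for lists without ties; the scores are made up, and I am not claiming this is the exact statistic reported in the paper.

```python
def ranks(values):
    """Rank of each element (0 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result

def spearman(a, b):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical per-filter scores.
oracle_scores = [0.90, 0.10, 0.40, 0.05]
approx_scores = [0.70, 0.20, 0.50, 0.01]
print(spearman(oracle_scores, approx_scores))
```

A value near 1 means the approximation orders the filters almost the same way the oracle does, which is what matters when only the lowest-ranked filters are pruned.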
From my own perspective, if the importance metric were replaced with the gradient of the scaling factor (gamma) on the BN layer, it should in theory also work; I can try this later.
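A rough sketch of what I have in mind, under my own assumptions: for a BN output y = gamma * x_hat + beta, the gradient of the loss with respect to gamma for one channel is the sum of the upstream gradient times x_hat over the batch. The score below multiplies that gradient by gamma and squares it, by analogy with a first-order Taylor term; using the raw gradient magnitude instead would be another option. All inputs are hypothetical.

```python
def gamma_grads(x_hat, grad_out):
    """dL/dgamma per channel; x_hat and grad_out are [sample][channel] lists."""
    channels = len(x_hat[0])
    return [
        sum(x_hat[n][c] * grad_out[n][c] for n in range(len(x_hat)))
        for c in range(channels)
    ]

def gamma_importance(gamma, g):
    """Assumed score: (gamma * dL/dgamma)^2 per channel."""
    return [(ga * gr) ** 2 for ga, gr in zip(gamma, g)]

# Hypothetical normalized activations and upstream gradients: 2 samples, 2 channels.
x_hat = [[1.0, -1.0], [-1.0, 1.0]]
grad_out = [[0.5, 0.0], [-0.5, 0.0]]
gamma = [2.0, 3.0]

scores = gamma_importance(gamma, gamma_grads(x_hat, grad_out))
print(scores)  # [4.0, 0.0]
```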
Cascaded Projection: End-to-End Network Compression and Acceleration
The idea here is to project the inputs and outputs of consecutive layers, via low-rank mappings, into a shared low-dimensional space. In comparison experiments on ResNet + CIFAR, the method achieves higher accuracy than AMC.
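A minimal sketch of the shared-projection idea, assuming two consecutive linear layers without a nonlinearity (a simplification of mine, not the paper's setup). The same hypothetical projection matrix `p`, with orthonormal columns, shrinks the output side of the first layer and the input side of the second.

```python
def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def transpose(m):
    return [list(row) for row in zip(*m)]

def compress_pair(w1, w2, p):
    """Project layer 1's output and layer 2's input with the same matrix p."""
    return matmul(transpose(p), w1), matmul(w2, p)

def count_params(*mats):
    return sum(len(m) * len(m[0]) for m in mats)

# Hypothetical layers: input dim 2 -> hidden dim 3 -> output dim 2.
w1 = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]   # third hidden unit is unused
w2 = [[1.0, 0.0, 5.0], [0.0, 1.0, 6.0]]
p = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]    # keeps a 2-dimensional hidden subspace

w1_small, w2_small = compress_pair(w1, w2, p)
full = matmul(w2, w1)
small = matmul(w2_small, w1_small)
print(full == small, count_params(w1, w2), count_params(w1_small, w2_small))
```

Here the compressed pair reproduces the original mapping exactly because the dropped hidden dimension carried nothing; in general the projection is lossy and would be learned jointly with the weights.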
Current Experimental Progress (7.10)
Today I tried a method for estimating the relative redundancy between layers based on kernel statistics. The results so far are fairly average, but the direction is still worth pursuing; I’ll need to try some other algorithms.