Notes on a Survey of Model Pruning Algorithms

Today I organized a batch of papers on model pruning that I’ve been reading recently. The methods can broadly be divided into three categories: predefined structured pruning, automatic structured pruning, and unstructured pruning.

Let me start with a thought-provoking paper: it ran a large number of comparative experiments, and its conclusion was that—if the pruned network structure is predefined, then training this slim network from scratch already yields decent results, and there is no need to go through the cumbersome “pretrain → prune → fine-tune” pipeline. The experiments covered three directions: Predefined Structured Pruning, Automatic Structured Pruning, and Unstructured Magnitude-based Pruning.

Predefined Structured Pruning

What these methods have in common is that, before performing pruning, the proportion of filters/channels to remove from each layer is determined first—either manually or through sensitivity analysis—and the algorithm is only responsible for finding the “least important” ones within the given proportion.

L1 Pruning

Paper: Pruning Filters for Efficient ConvNets

The most straightforward approach: directly remove the convolutional kernels with the smallest L1 norm, which amounts to cutting down the channel count of the next layer. The pruning ratio of each layer is fixed and relies on sensitivity analysis done in advance—prune each layer individually and measure accuracy on the validation set; then retrain the pruned single layer and measure accuracy again, thereby determining each layer’s tolerance to pruning. Without a pretrained model, this pipeline has little point—you might as well retrain from scratch.

ThiNet

Paper: ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression

The goal is: after pruning several filters from layer L, keep the input of layer L+2 (i.e., the output of layer L+1) as unchanged as possible. After removing layer L’s filters, the corresponding input channel count and filter channel count of layer L+1 shrink accordingly.

The approach is to use a greedy algorithm to find, among the outputs of layer L+1, the channel with the smallest impact on the whole—ranking by the sum of absolute values over the feature-map dimension and the batch dimension, and progressively removing the one with the least impact. The formula to be optimized is as follows (m is resolution × number of channels × batch size):

\arg\min_{T} \sum_{i=1}^{m}\left(\sum_{j\in T}\hat{\mathbf{x}}_{i,j}\right)^2 \quad\text{s.t.}\quad |T| = C\times(1-r),\ T\subset\{1,2,\dots,C\}

The greedy algorithm used is as follows:

Algorithm 1: A greedy algorithm for minimizing Eq. 6.
Input:  Training set {(x̂_i, ŷ_i)}, and compression rate r.
Output: The subset of removed channels T.
 1: T ← ∅;  I ← {1, 2, ..., C};
 2: while |T| < C × (1 − r) do
 3:     min_value ← +∞;
 4:     for each item i ∈ I do
 5:         tmpT ← T ∪ {i};
 6:         compute value from Eq. 6 using tmpT;
 7:         if value < min_value then
 8:             min_value ← value;  min_i ← i;
 9:         end if
10:     end for
11:     move min_i from I into T;
12: end while

LASSO Regression Pruning

Paper: Channel Pruning for Accelerating Very Deep Neural Networks

The starting point is similar to ThiNet: it likewise aims to keep the downstream layer’s response as unchanged as possible after pruning certain channels, but it replaces the greedy search with the sparse regularization term of LASSO regression. A weighting coefficient β is introduced for each channel, and the goal is to make β as sparse as possible while keeping the reconstruction error minimal. N is the number of sampled convolutional input patches, and Y is the N×n feature matrix:

\arg\min_{\boldsymbol{\beta},\mathbf{W}} \frac{1}{2N}\left\lVert\mathbf{Y}-\sum_{i=1}^{c}\beta_i \mathbf{X}_i \mathbf{W}_i^{\top}\right\rVert_F^2 + \lambda\lVert\boldsymbol{\beta}\rVert_1 \quad\text{s.t.}\quad \lVert\boldsymbol{\beta}\rVert_0 \le c',\ \forall i\ \lVert\mathbf{W}_i\rVert_F = 1

The solution uses alternating optimization, in two steps:

Step 1: Fix w and optimize β, solving it directly as a LASSO regression problem:

\hat{\boldsymbol{\beta}}^{\mathrm{LASSO}}(\lambda) = \arg\min_{\boldsymbol{\beta}} \frac{1}{2N}\left\lVert\mathbf{Y}-\sum_{i=1}^{c}\beta_i \mathbf{Z}_i\right\rVert_F^2 + \lambda\lVert\boldsymbol{\beta}\rVert_1 \quad\text{s.t.}\quad \lVert\boldsymbol{\beta}\rVert_0 \le c'

Step 2: Fix β and optimize w:

\arg\min_{\mathbf{W}'}\left\lVert\mathbf{Y}-\mathbf{X}'(\mathbf{W}')^{\top}\right\rVert_F^2

The main difference from ThiNet lies in the solving algorithm: here it’s alternating optimization rather than greedy enumeration.

Automatic Structured Pruning

These methods don’t require manually specifying the pruning ratio for each layer; the training process itself decides which channels to remove. In a sense, this can be viewed as a lightweight form of NAS.

Network Slimming

Paper: Learning Efficient Convolutional Networks through Network Slimming

L1 regularization is added to the scaling parameter (γ) of the BN layer. Since the weights of the BN layer act independently along the channel dimension, a γ approaching zero for a given channel means that this channel’s output is almost suppressed and can be directly pruned.

The key difference from L1 filter pruning: Network Slimming’s threshold is derived after a global ranking of the γ values across all BN layers, whereas L1 pruning does a relative ranking within each layer. The former does not require manually specifying the compression ratio of each layer—the structure emerges automatically. L1 is non-differentiable, so in implementation you operate directly on the gradients (subgradients), and ordinary SGD is enough to converge.

Sparse Structure Selection

Paper: Data-Driven Sparse Structure Selection for Deep Neural Networks

The idea is similar to Network Slimming: through data-driven sparse regularization, the network automatically selects its structure.

Other Methods Worth Noting

AOFP (ICML 2019)

Paper: Approximated Oracle Filter Pruning for Destructive CNN Width Optimization

It uses binary search to locate the least important channels, estimating each channel’s impact on the next layer’s output through repeated random sampling, and then prunes the channels with smaller impact. Using sampling to approximate the impact magnitude and avoiding enumerating all combinations is an approach well worth borrowing.

NISP

Paper: NISP: Pruning Networks Using Neuron Importance Score Propagation

It likewise predefines the pruning ratio per layer and then prunes according to some importance score—a variant of the predefined category of methods.

SNIP

Paper: SNIP: Single-Shot Network Pruning Based on Connection Sensitivity

Unstructured pruning: it directly removes the connections whose gradient of the loss with respect to the mask is smallest—i.e., the weights to which the loss is least sensitive. Here s is the sensitivity, obtained via an approximate differential:

Algorithm 1: SNIP — Single-shot Network Pruning based on Connection Sensitivity.
Require: Loss function L, training dataset D, sparsity level κ.  (Refer Equation 3)
Ensure:  ||w*||_0 ≤ κ
1: w ← VarianceScalingInitialization                        (Refer Section 4.2)
2: D^b = {(x_i, y_i)}_{i=1}^{b} ∼ D                          (Sample a mini-batch of training data)
3: s_j ← |g_j(w; D^b)| / Σ_{k=1}^{m} |g_k(w; D^b)|,  ∀j ∈ {1...m}   (Connection sensitivity)
4: s̃ ← SortDescending(s)
5: c_j ← 1[ s_j − s̃_κ ≥ 0 ],  ∀j ∈ {1...m}                  (Pruning: choose top-κ connections)
6: w* ← arg min_{w∈ℝ^m} L(c ⊙ w; D)                         (Regular training)
7: w* ← c ⊙ w*

This way, a single backward pass suffices to compute the sensitivity of all weights, without needing multiple forward passes:

\Delta L_j(\mathbf{w};\mathcal{D}) \approx g_j(\mathbf{w};\mathcal{D}) = \left.\frac{\partial L(\mathbf{c}\odot\mathbf{w};\mathcal{D})}{\partial c_j}\right|_{\mathbf{c}=\mathbf{1}} = \left.\lim_{\delta\to 0}\frac{L(\mathbf{c}\odot\mathbf{w};\mathcal{D}) - L((\mathbf{c}-\delta\,\mathbf{e}_j)\odot\mathbf{w};\mathcal{D})}{\delta}\right|_{\mathbf{c}=\mathbf{1}}

Autopruner

Paper: Autopruner: An End-to-End Trainable Filter Pruning Method for Efficient Deep Model Inference

Instead of using a simple penalty term to drive sparsity, it uses an additional network structure to predict the mask of each filter. It modifies the sigmoid, gradually increasing α so that the activation values get closer and closer to a 0/1 encoding—essentially using neural-network activation values directly as the mask, so that the optimal encoding can be found end-to-end via backpropagation. The drawback is that the compression rate is hard to control and requires additional measures.

The activation function takes the following form:

The compression rate is controlled by adding a regularization term to the loss, where v is the encoding vector, C is the vector length, and r is the target retention ratio:

\min\ \mathcal{L}_{\text{classification}} + \lambda\left\lVert\frac{\lVert\mathbf{v}\rVert_1}{C} - r\right\rVert_2^2

Other Notes

ISTA-based (Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers): skipping for now—it looks tough going, I’ll come back to it later.
C-SGD (Centripetal SGD for Pruning Very Deep Convolutional Networks with Complicated Structure): designed for networks with complex skip-connection structures like ResNet.
AMC (AMC: AutoML for Model Compression and Acceleration on Mobile Devices): uses DDPG to search for each layer’s compression-ratio hyperparameter, handing the decision of “how much to prune” over to reinforcement learning as well.
SFP (Soft Filter Pruning for Accelerating Deep Convolutional Neural Networks), CFP, GDP, SSR-L2, DCP, etc.: noting these down for a closer read when I have time.
Influence Functions (Understanding Black-Box Predictions via Influence Functions): seems to also use differentiation to find sensitivity, sharing some common ground with SNIP’s approach.

Summary

Structured pruning (whether predefined or automatic) is friendlier for hardware acceleration, as it can directly shrink the network size; unstructured pruning, although theoretically offering a higher compression ratio, struggles to achieve real acceleration with sparse matrices on conventional GPUs, so its practical value is limited. Automatic structured pruning is getting closer and closer to NAS—Network Slimming uses sparse regularization and AMC uses reinforcement learning, representing two routes for searching structures, and these seem to be the more interesting directions for now.

Separately, today I also read the MixMatch paper and finished Chapter 3 of CSAPP—a fruitful day.

Paper Notes
Pruning
Model Compression

2019 · 05 · 23