Dynamic Network Inference: Design Thoughts on Early Exit and Dynamic Channel Pruning

Today I finished writing the code for sensitivity-based pruning under a parameter budget. Most of VGG’s parameters are concentrated in the FC layers, so pruning the convolutional layers yields limited gains in parameter count. With that in mind, I also read a few papers on dynamic inference networks, mainly to see how schemes like SkipNet are benchmarked: the typical practice is to report how many FLOPs are saved without a noticeable drop in accuracy.
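To put a number on the FC-heavy claim, here is a quick count. It is a sketch that assumes the standard VGG-16 layout (configuration D, 224x224 input, 1000 classes); the post only says "VGG", so the exact variant is my assumption, and the function name is made up for illustration.

```python
# Assumption: VGG-16 (configuration D), 224x224 ImageNet input, 1000 classes.
# "M" marks a 2x2 max-pool that halves the spatial size.
CONV_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
            512, 512, 512, "M", 512, 512, 512, "M"]
FC_CFG = [4096, 4096, 1000]


def vgg16_param_counts(in_channels=3, input_size=224):
    """Return (conv_params, fc_params), weights plus biases."""
    conv_params, channels, size = 0, in_channels, input_size
    for v in CONV_CFG:
        if v == "M":
            size //= 2
        else:
            # 3x3 kernel: in * out * 9 weights, plus one bias per output channel.
            conv_params += channels * v * 3 * 3 + v
            channels = v
    fc_params, features = 0, channels * size * size  # flattened feature map
    for out_features in FC_CFG:
        fc_params += features * out_features + out_features
        features = out_features
    return conv_params, fc_params


conv, fc = vgg16_param_counts()
print(f"conv: {conv:,}  fc: {fc:,}  fc share: {fc / (conv + fc):.1%}")
```

Under these assumptions the FC layers hold roughly 89% of all parameters, which is why a parameter budget is barely affected by pruning only the convolutions.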

Why FLOPs Are Unreliable

During my research, one number left me puzzled: EfficientNet-B3 has fewer FLOPs than ResNet-18, yet the measured difference in inference speed is nowhere near as dramatic as the FLOPs gap would suggest. This made me feel that FLOPs are a rather “mystical” metric: they are a proxy for the amount of computation, but there is a gulf between that and actual latency.

The reason is probably this: a FLOPs count accounts for neither memory-access overhead nor hardware parallelism. The depthwise separable convolutions that EfficientNet relies on heavily compute each channel independently in their depthwise step, so their arithmetic intensity (FLOPs per byte of memory traffic) is low, and the actual speedup on a GPU falls far short of the FLOPs reduction. This realization left me with some doubts about pruning schemes that use FLOPs as their optimization target.
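A back-of-the-envelope comparison makes the arithmetic-intensity point concrete. This is only a rough roofline-style estimate: it assumes stride 1, "same" padding, float32, and that every tensor is read or written from memory exactly once, which ignores caching and kernel fusion. The function name and the layer shape are illustrative choices of mine.

```python
def conv_intensity(h, w, c_in, c_out, k, depthwise=False, bytes_per_elem=4):
    """Return (flops, flops_per_byte) for one k x k conv layer.

    Assumes stride 1 and 'same' padding, so the output is h x w as well.
    """
    if depthwise:
        if c_in != c_out:
            raise ValueError("depthwise conv keeps the channel count")
        macs = h * w * c_in * k * k          # one k x k filter per channel
        weights = c_in * k * k
    else:
        macs = h * w * c_in * c_out * k * k  # every output sees every input
        weights = c_in * c_out * k * k
    flops = 2 * macs                         # one multiply plus one add per MAC
    traffic = (h * w * c_in + weights + h * w * c_out) * bytes_per_elem
    return flops, flops / traffic


std_flops, std_ai = conv_intensity(56, 56, 128, 128, 3)
dw_flops, dw_ai = conv_intensity(56, 56, 128, 128, 3, depthwise=True)
print(f"standard : {std_flops:,} FLOPs, {std_ai:.1f} FLOPs/byte")
print(f"depthwise: {dw_flops:,} FLOPs, {dw_ai:.1f} FLOPs/byte")
```

For this shape the depthwise layer does 128x fewer FLOPs but moves almost as many bytes, so it is likely to be memory-bound and its latency shrinks far less than its FLOPs.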

Abandoning Multi-Network Dynamics in Favor of Single-Network Dynamics

I previously had an idea for “dynamic multi-network” inference: routing among several networks of different scales according to the difficulty of the input, sending easy samples through small networks and hard ones through large networks. I can now basically abandon this approach, mainly because of the FLOPs problem: computing FLOPs for a multi-network scheme is more complicated, and the EfficientNet-B3 vs. ResNet-18 example shows that fewer FLOPs does not mean faster in practice. The engineering complexity that multiple networks introduce is hard to justify against uncertain speedup gains.

A more sensible direction is to build it as single-network dynamic inference.

Early Exit Strategy: A Classifier After Each Block

The concrete idea is this: attach a classifier (an FC layer) after the output of each block in the backbone network, and train only these FC layers while keeping the upstream feature-extraction part frozen. This way the backbone’s weights, and hence its representational capacity, are unaffected, and the FC layer at each exit is trained independently.
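A minimal sketch of that setup, assuming PyTorch (the post does not name a framework). The tiny three-block backbone, the class name `EarlyExitNet`, and the random data are all stand-ins for illustration, not the actual VGG code.

```python
import torch
import torch.nn as nn


class EarlyExitNet(nn.Module):
    """Toy backbone with one FC exit head after every block (hypothetical)."""

    def __init__(self, num_classes=10, widths=(16, 32, 64)):
        super().__init__()
        blocks, heads, in_ch = [], [], 3
        for out_ch in widths:
            blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
            ))
            # Exit head: global average pool, then a single FC layer.
            heads.append(nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(out_ch, num_classes),
            ))
            in_ch = out_ch
        self.blocks = nn.ModuleList(blocks)
        self.heads = nn.ModuleList(heads)

    def forward(self, x):
        logits = []
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            logits.append(head(x))
        return logits  # one logits tensor per exit, shallow to deep


model = EarlyExitNet()

# Freeze the backbone so only the exit heads receive gradient updates.
for p in model.blocks.parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD(model.heads.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# One training step on random stand-in data.
x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
backbone_before = [p.clone() for p in model.blocks.parameters()]

loss = sum(criterion(out, y) for out in model(x))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the backbone is frozen, the heads share no trainable parameters, so summing their losses gives each head the same gradients it would get if trained alone. If the real backbone contains BatchNorm, it would also need to be put in eval mode so its running statistics stay fixed.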

At inference time, entropy serves as the criterion: if the entropy of the softmax probability distribution at the current exit is low enough, it means the model is sufficiently confident in its prediction, so it outputs early at that point and stops the forward pass; high entropy means the model is still hesitating and needs deeper features to help with the judgment.

This is, in essence, an early-exit design: the core idea is to let simple samples produce an answer at a shallow layer, without having to run through the entire network every time.
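The entropy test itself is small enough to sketch in plain Python. The names here are hypothetical, and the threshold of 0.5 nats is an arbitrary value for illustration; in practice it would have to be tuned on a validation set. The per-exit logits are passed in as a generator so that deeper blocks are never evaluated once an exit fires.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def early_exit_predict(exit_logits, threshold):
    """Consume per-exit logits lazily, shallow to deep.

    Returns (predicted_class, exit_index). Stops at the first exit whose
    softmax entropy is below `threshold`; otherwise uses the last exit.
    """
    probs, idx = None, -1
    for idx, logits in enumerate(exit_logits):
        probs = softmax(logits)
        if entropy(probs) < threshold:
            break  # confident enough: deeper exits are never computed
    if probs is None:
        raise ValueError("exit_logits was empty")
    return probs.index(max(probs)), idx


def fake_network():
    """Stand-in for a real forward pass: yields logits exit by exit."""
    yield [0.1, 0.2, 0.0]  # shallow exit, nearly uniform (high entropy)
    yield [4.0, 0.0, 0.0]  # middle exit, confident
    yield [9.0, 0.0, 0.0]  # final exit, only reached if the others hesitate


print(early_exit_predict(fake_network(), threshold=0.5))
```

With this threshold the first exit is skipped (its entropy is close to ln 3) and the second one fires, so the third `yield` never runs.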

Dynamic Channel Selection: Combining with Pruning

Building on early exit, we can also introduce dynamic channel selection: within each block, decide at run time which channels participate in computation based on the input features, rather than always activating all channels. Combining the two yields an inference framework that is dynamic along both the depth and width dimensions:

Depth dimension: use entropy to decide whether to exit early at the current block;
Width dimension: within each block, dynamically select the subset of channels that actually participate in computation.

This way, simple samples can both exit early at shallow layers and use fewer channels per layer; only complex samples traverse the full width and depth of the network.
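To make the width dimension concrete, here is one possible form of it, written in plain Python. It uses hard top-k gating on per-channel scores, which is my assumption and not a settled design; how the scores are produced (the selector module) is left out entirely, and all names are hypothetical.

```python
import math


def select_channels(channel_scores, keep_ratio):
    """Return the indices of the top-scoring channels, in ascending order.

    Hard top-k selection: an assumed stand-in for the selector's output.
    """
    n = len(channel_scores)
    k = max(1, math.ceil(keep_ratio * n))
    ranked = sorted(range(n), key=lambda c: channel_scores[c], reverse=True)
    return sorted(ranked[:k])


def gated_pointwise_conv(x, weight, active_out):
    """1x1 conv at a single spatial position, computing only active outputs.

    x: list of C_in floats; weight: C_out rows of C_in floats.
    Skipped output channels stay at 0.0 and cost no multiply-adds.
    """
    out = [0.0] * len(weight)
    for o in active_out:
        out[o] = sum(w * v for w, v in zip(weight[o], x))
    return out


scores = [0.1, 0.9, 0.3, 0.7]            # one score per output channel
active = select_channels(scores, keep_ratio=0.5)
x = [1.0, 2.0]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
print(active, gated_pointwise_conv(x, weight, active))
```

The skipped channels are where the savings would come from, but hard top-k has no useful gradient with respect to the scores, so training a selector this way would need some relaxation or estimator on top.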

That said, I still need to do more research on how to implement dynamic channel selection, especially how to design the selector module and how to get a real inference speedup while keeping the selection differentiable. I need to consult Yuchao on this part. I also need to read a few more papers to see how others handle the early-exit side.

Other Progress Today

Besides the pruning research, today I also finished chapter four of compiler theory, as well as the closing chapter of the cache course, and knocked out the cache assignment in the evening.

Pruning
Dynamic Inference
Model Compression

2019 · 07 · 01