OctaveConv: Reducing Convolutional Redundancy via Frequency Decomposition

I’ve been reading papers on model compression lately and came across a really interesting one: “Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution”. Its angle isn’t the parameter count but the spatial redundancy in the feature maps themselves.

Core idea: treating a feature map as a superposition of frequencies

The authors’ starting point is this: an image can be decomposed into a low-frequency part and a high-frequency part, and the feature maps in the intermediate layers of a convolutional network can be understood the same way.

The low-frequency part corresponds to global, blurry, overall-contour information; the high-frequency part corresponds to local detail and edges. Since the low-frequency component is mostly global information with little detail, storing and computing it with a full-resolution tensor is actually wasteful. Based on this intuition, the authors designed a convolution operation called OctaveConv that decomposes the feature map along the channel dimension into a high-frequency group and a low-frequency group, and spatially downsamples the low-frequency group, thereby reducing memory and compute.

Decomposition scheme: a hyperparameter α controls the ratio

Concretely, the channels are split into two groups, and a hyperparameter α between 0 and 1 sets the proportion of channels that go to the low-frequency group.

High-frequency part (proportion 1 - α): keeps the original feature map size.

X^{H} \in \mathbb{R}^{(1-\alpha)c \times h \times w}

Low-frequency part (proportion α): both the height and width of the feature map are reduced to half the original.

X^{L} \in \mathbb{R}^{\alpha c \times \frac{h}{2} \times \frac{w}{2}}
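To make the split concrete, here is a minimal sketch that just computes the two tensor shapes for a given α. The function name `octave_shapes` is my own, and truncating α·c to an integer and assuming even h and w are my choices for illustration, not something the paper prescribes.

```python
def octave_shapes(c, h, w, alpha):
    """Return the (channels, height, width) shapes of X^H and X^L."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    c_low = int(alpha * c)   # low-frequency channels (truncation is an assumption)
    c_high = c - c_low       # the remaining channels stay high-frequency
    high = (c_high, h, w)            # full spatial resolution
    low = (c_low, h // 2, w // 2)    # half height and half width (assumes even h, w)
    return high, low


high, low = octave_shapes(c=64, h=56, w=56, alpha=0.25)
print(high, low)  # (48, 56, 56) (16, 28, 28)
```

With α = 0.25, a quarter of the channels live on a map with a quarter of the spatial positions, which is where the savings come from.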

The name “Octave” comes from exactly this—halving the spatial resolution corresponds to lowering the frequency by one octave in the audio domain.

How OctaveConv updates features

Because the high-frequency and low-frequency groups of feature maps have different spatial sizes, ordinary convolution cannot process them directly. OctaveConv defines each feature update as a mapping from one set of differently-sized feature maps to another set of differently-sized feature maps.

Each update takes the pair of X^H and X^L defined above as input and produces an output with the same two-part form:

Y = \{Y^{H}, Y^{L}\}

[Figure: the OctaveConv update scheme]

The right side of the figure labels the kernel size for each path. There are four paths in total: high-to-high, low-to-low, high-to-low (pool to downsample first, then convolve), and low-to-high (convolve first, then upsample). Pooling and upsampling appear only on the two cross-frequency paths, where the feature map size changes. OctaveConv is also compatible with group convolution: in that case the convolutions inside it are simply replaced by group convs.
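The four paths can be sketched in plain Python on nested lists. This is a toy under stated assumptions, not the authors' implementation: kernels are 1×1 (pointwise) to keep it short, downsampling is 2×2 average pooling, upsampling is nearest-neighbour, both groups are non-empty, and every function and weight name here is hypothetical.

```python
def conv1x1(x, weight):
    """Pointwise convolution. x: [c_in][h][w], weight: [c_out][c_in]."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(weight[o][i] * x[i][r][c] for i in range(len(x)))
              for c in range(w)] for r in range(h)]
            for o in range(len(weight))]


def avg_pool2(x):
    """2x2 average pooling with stride 2 (halves height and width)."""
    return [[[(ch[2 * r][2 * c] + ch[2 * r][2 * c + 1]
               + ch[2 * r + 1][2 * c] + ch[2 * r + 1][2 * c + 1]) / 4.0
              for c in range(len(ch[0]) // 2)] for r in range(len(ch) // 2)]
            for ch in x]


def upsample2(x):
    """Nearest-neighbour upsampling by 2 (doubles height and width)."""
    return [[[ch[r // 2][c // 2] for c in range(2 * len(ch[0]))]
             for r in range(2 * len(ch))] for ch in x]


def add(a, b):
    """Element-wise sum of two equally shaped [c][h][w] tensors."""
    return [[[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


def octave_conv1x1(x_h, x_l, w_hh, w_hl, w_lh, w_ll):
    """One OctaveConv-style update with 1x1 kernels; returns (y_h, y_l)."""
    y_hh = conv1x1(x_h, w_hh)              # high -> high, full resolution
    y_hl = conv1x1(avg_pool2(x_h), w_hl)   # high -> low: pool, then convolve
    y_lh = upsample2(conv1x1(x_l, w_lh))   # low -> high: convolve, then upsample
    y_ll = conv1x1(x_l, w_ll)              # low -> low, half resolution
    return add(y_hh, y_lh), add(y_ll, y_hl)


# Two high-frequency channels at 4x4, one low-frequency channel at 2x2.
x_h = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]
x_l = [[[2.0] * 2 for _ in range(2)] for _ in range(1)]
y_h, y_l = octave_conv1x1(x_h, x_l, w_hh=[[1.0, 1.0]], w_hl=[[0.5, 0.5]],
                          w_lh=[[1.0]], w_ll=[[1.0]])
print(len(y_h), len(y_h[0]), len(y_h[0][0]))  # 1 4 4
print(len(y_l), len(y_l[0]), len(y_l[0][0]))  # 1 2 2
```

Both cross-frequency convolutions run on the half-resolution map, so only the high-to-high path pays the full spatial cost.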

Parameter count unchanged, compute reduced

After changing the convolution operation, the parameter count does not change. The original kernel is simply partitioned into four blocks, one per path, and the size of a convolution kernel is independent of the feature map’s spatial size. Memory and compute, however, do drop, and by how much is determined by α: the larger α is, the more channels are allocated to the half-resolution low-frequency path, and the more the overall compute falls.
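A quick way to convince yourself of the parameter claim is to count the weights on each path and compare with a vanilla convolution. This is a sketch with hypothetical names; it ignores biases and, as an assumption, truncates α·c to an integer for both input and output channels.

```python
def octave_param_count(c_in, c_out, k, alpha):
    """Weights in the four OctaveConv kernels, keyed by path (biases ignored)."""
    lo_in, lo_out = int(alpha * c_in), int(alpha * c_out)
    hi_in, hi_out = c_in - lo_in, c_out - lo_out
    paths = {
        "H->H": hi_in * hi_out,
        "H->L": hi_in * lo_out,
        "L->H": lo_in * hi_out,
        "L->L": lo_in * lo_out,
    }
    # Every path uses a k x k kernel regardless of its spatial resolution.
    return {name: n * k * k for name, n in paths.items()}


counts = octave_param_count(c_in=64, c_out=128, k=3, alpha=0.25)
vanilla = 64 * 128 * 3 * 3
print(sum(counts.values()) == vanilla)  # True
```

The four blocks always add up to c_in · c_out · k², whatever α is, because the channel split only decides which block a weight belongs to.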

Ratio (α)	0	0.125	0.25	0.50	0.75	0.875	1.0
FLOPs cost	100%	82%	67%	44%	30%	26%	25%
Memory cost	100%	91%	81%	63%	44%	35%	25%
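The numbers in the table can be approximated with a back-of-the-envelope count. The sketch below is my own derivation, not code from the paper. It assumes the same α for input and output channels, counts only the convolutions (pooling and upsampling are treated as free), and uses the fact that a half-height, half-width map has a quarter of the spatial positions. The name `octave_costs` is hypothetical.

```python
def octave_costs(alpha):
    """Relative (FLOPs, activation memory) compared with a vanilla convolution."""
    high, low = 1.0 - alpha, alpha
    quarter = 0.25  # a half-height, half-width map has 1/4 of the positions
    # H->H runs at full resolution; H->L, L->H and L->L all convolve at half resolution.
    flops = high * high + quarter * (high * low + low * high + low * low)
    # High channels are stored at full size, low channels at a quarter of it.
    memory = high + quarter * low
    return flops, memory


for a in (0.0, 0.125, 0.25, 0.5, 0.75, 0.875, 1.0):
    flops, memory = octave_costs(a)
    print(f"alpha={a:<5}  FLOPs={flops:6.1%}  memory={memory:6.1%}")
```

The FLOPs expression simplifies to 1 - (3/4)·α·(2 - α) and, after rounding, reproduces the FLOPs row of the table. The memory expression, 1 - (3/4)·α, agrees with the memory row to within about one percentage point (it gives 34.4% at α = 0.875, where the table lists 35%).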

Replacement results

[Figure: performance comparison after replacing standard convolution with OctaveConv]

One fairly representative result in the paper: an OctaveConv-equipped ResNet-152 reaches 82.9% Top-1 accuracy on ImageNet with only 22.2 GFLOPs of compute. With a reasonable choice of α, this is a way to speed up a network with almost no loss in accuracy, so it is worth trying out in real projects.

Tags: Papers, Model Compression, Convolution

2019 · 04 · 16