A Side-by-Side Comparison of Attention Mechanisms in CV, with Single-Path NAS Notes

I recently read several papers on attention mechanisms in CV, along with a newly released NAS work. Let me organize and compare their respective ideas, and record some of my own thoughts along the way.

SENet: Weighting Channels

The core idea of Squeeze-and-Excitation Networks is to model the correlations between channels. It adds an SE module after the feature map: global average pooling first compresses each channel to a single value (squeeze), FC layers then learn a weight for each channel (excitation), and these weights finally rescale the original feature map channel by channel. In effect, it is an attention over the different channels.
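To make the squeeze / excitation / recalibration steps concrete, here is a minimal NumPy sketch for a single feature map. It is only an illustration: the function name `se_block`, the two-layer bottleneck with reduction ratio `r`, and the random weights are my own assumptions for the demo, not the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w1, b1, w2, b2):
    """Channel attention on one feature map x of shape (C, H, W).

    w1: (C // r, C) and w2: (C, C // r) are the two FC layers
    (hypothetical parameter names; r is an assumed reduction ratio).
    """
    s = x.mean(axis=(1, 2))            # squeeze: one scalar per channel, shape (C,)
    h = np.maximum(w1 @ s + b1, 0.0)   # first FC + ReLU, shape (C // r,)
    a = sigmoid(w2 @ h + b2)           # second FC + sigmoid: channel weights in (0, 1)
    return x * a[:, None, None], a     # recalibrate every channel by its weight

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
b1 = np.zeros(C // r)
w2 = rng.standard_normal((C, C // r)) * 0.1
b2 = np.zeros(C)
y, a = se_block(x, w1, b1, w2, b2)
```

The output `y` has the same shape as `x`; only the per-channel scale changes, which is why the module can be dropped in after an existing block.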

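SENet-154 has the lowest error rates (in %) in the ImageNet comparison below: 18.68 top-1 and 4.47 top-5 at 224×224, and 17.28 / 3.79 at the larger crop size, ahead of DPN-131 and ResNeXt-101 (64×4d).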

Model	224×224 top-1 err.	224×224 top-5 err.	320×320 / 299×299 top-1 err.	320×320 / 299×299 top-5 err.
ResNet-152 [13]	23.0	6.7	21.3	5.5
ResNet-200 [14]	21.7	5.8	20.1	4.8
Inception-v3 [20]	-	-	21.2	5.6
Inception-v4 [21]	-	-	20.0	5.0
Inception-ResNet-v2 [21]	-	-	19.9	4.9
ResNeXt-101 (64×4d) [19]	20.4	5.3	19.1	4.4
DenseNet-264 [17]	22.15	6.12	-	-
Attention-92 [60]	-	-	19.5	4.8
PyramidNet-200 [77]	20.1	5.4	19.2	4.7
DPN-131 [16]	19.93	5.12	18.55	4.16
SENet-154	18.68	4.47	17.28	3.79

Non-local NN: Self-Attention over Spatial Pixels

Non-local Neural Networks takes a different approach: it looks for correlations between pixels in the feature map. This is a self-attention mechanism that lets the feature at every position interact with the features at all other positions, capturing long-range dependencies. In practice this kind of attention does give fairly good results. After reading this paper, I wondered whether it could also be used in NAS, by adding this kind of attention operation to the search space as a candidate.
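As a rough illustration of "every position interacts with all other positions", the sketch below implements a softmax (dot-product) self-attention over the flattened spatial positions of one feature map. The projection matrices `w_theta`, `w_phi`, `w_g`, `w_out` stand in for 1×1 convolutions; their names and the random initialization are assumptions made for this example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(x, w_theta, w_phi, w_g, w_out):
    """x: (C, H, W); w_theta, w_phi, w_g: (C2, C); w_out: (C, C2)."""
    C, H, W = x.shape
    flat = x.reshape(C, H * W)                # (C, N) with N = H * W positions
    theta = w_theta @ flat                    # queries, (C2, N)
    phi = w_phi @ flat                        # keys,    (C2, N)
    g = w_g @ flat                            # values,  (C2, N)
    attn = softmax(theta.T @ phi, axis=-1)    # (N, N): row i weights all positions j
    y = g @ attn.T                            # y[:, i] = sum_j attn[i, j] * g[:, j]
    out = flat + w_out @ y                    # residual connection
    return out.reshape(C, H, W), attn

rng = np.random.default_rng(0)
C, C2, H, W = 6, 3, 4, 5
x = rng.standard_normal((C, H, W))
w_theta, w_phi, w_g = (rng.standard_normal((C2, C)) * 0.1 for _ in range(3))
w_out = rng.standard_normal((C, C2)) * 0.1
out, attn = non_local_block(x, w_theta, w_phi, w_g, w_out)
```

Note that the attention matrix is N×N in the number of positions, so the cost grows quadratically with spatial resolution. That is worth keeping in mind if such an operation were added to a NAS search space.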

CBAM: Chaining Channel and Spatial Attention

CBAM: Convolutional Block Attention Module can be seen as an extension of the SE module, split into two parts.

Channel attention part: Apply global average pooling and global max pooling separately, pass both pooled vectors through the same (shared) FC layers, add the two outputs, then apply a sigmoid to obtain a weight for each channel.

Spatial attention part: Apply average pooling and max pooling along the channel dimension, concatenate the two results, pass them through a convolutional layer plus sigmoid, and obtain an attention map over the spatial dimension.

The two parts are chained in sequence: channel attention first, then spatial attention.
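The two CBAM parts and their chaining can be sketched in NumPy as follows. This is a simplified illustration under my own assumptions: the shared FC layers are written as two bias-free matrices `w1` and `w2`, the spatial convolution is a single zero-padded `k×k` kernel over the two pooled maps, and all names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """x: (C, H, W) -> per-channel weights of shape (C,)."""
    avg = x.mean(axis=(1, 2))                       # global average pooling
    mx = x.max(axis=(1, 2))                         # global max pooling
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)    # the same FC layers for both paths
    return sigmoid(mlp(avg) + mlp(mx))

def spatial_attention(x, kernel):
    """x: (C, H, W), kernel: (2, k, k) with odd k -> attention map of shape (H, W)."""
    k = kernel.shape[-1]
    p = k // 2
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])           # pool along channels
    padded = np.pad(pooled, ((0, 0), (p, p), (p, p)))            # zero padding
    windows = sliding_window_view(padded, (k, k), axis=(1, 2))   # (2, H, W, k, k)
    return sigmoid(np.einsum("chwij,cij->hw", windows, kernel))  # conv + sigmoid

def cbam(x, w1, w2, kernel):
    x = x * channel_attention(x, w1, w2)[:, None, None]   # channel attention first
    return x * spatial_attention(x, kernel)[None, :, :]   # then spatial attention

rng = np.random.default_rng(0)
C, H, W, r, k = 8, 6, 6, 2, 3
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
kernel = rng.standard_normal((2, k, k)) * 0.1
ca = channel_attention(x, w1, w2)
sa = spatial_attention(x, kernel)
y = cbam(x, w1, w2, kernel)
```

Because the spatial branch runs on the output of the channel branch, the order of the two steps matters here.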

Dual Attention Network: Dual Attention for Segmentation

Dual Attention Network applies attention over both the channel and the spatial dimensions of the feature map simultaneously in semantic segmentation, with the two paths running in parallel and finally fused together. The design idea is similar to CBAM, except the application scenario is switched to segmentation, and the two paths are in parallel rather than chained.
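To contrast the parallel layout with CBAM's chained one, here is a deliberately simplified NumPy sketch. Both branches use self-attention directly on the input features (I omit any projection convolutions), the scalars `alpha` and `beta` are assumed learnable scales, and fusion is a plain element-wise sum. All of these are assumptions for illustration, not the reference implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def position_attention(x, alpha):
    """Spatial branch: every position attends to all positions."""
    C, H, W = x.shape
    flat = x.reshape(C, -1)                    # (C, N)
    attn = softmax(flat.T @ flat, axis=-1)     # (N, N) position-to-position weights
    return (alpha * (flat @ attn.T) + flat).reshape(C, H, W)

def channel_attention(x, beta):
    """Channel branch: every channel attends to all channels."""
    C, H, W = x.shape
    flat = x.reshape(C, -1)                    # (C, N)
    attn = softmax(flat @ flat.T, axis=-1)     # (C, C) channel-to-channel weights
    return (beta * (attn @ flat) + flat).reshape(C, H, W)

def dual_attention(x, alpha, beta):
    # Both branches read the same input and are fused at the end (parallel, not chained).
    return position_attention(x, alpha) + channel_attention(x, beta)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 5))
fused = dual_attention(x, alpha=0.5, beta=0.5)
```

With `alpha = beta = 0` each branch reduces to the identity, so the fused output is simply `2 * x`. This is a quick sanity check that the two paths really are independent.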

Single-Path NAS: Compressing Multi-Path Search into a Single Path

The main idea of Single-Path NAS: Designing Hardware-Efficient ConvNets in less than 4 Hours is to fold all the candidate operations of a multi-path search into a single over-parameterized convolutional kernel. This amounts to a fairly fine-grained form of weight sharing, so only one path needs to be maintained during training.

The search is not particularly fast either: it takes 30 TPU hours searching directly on ImageNet, and the resulting performance is mediocre. The search space is based on MobileNet-v2, mainly searching over the kernel size of the depthwise conv.

The final search results are as follows. It feels like it’s following the ProxylessNAS work, only replacing multi-path with a single path.

Search method details: An indicator function decides whether the kernel is only the inner 3×3 sub-kernel or the full 5×5 kernel.

\mathbf{w}_k = \mathbf{w}_{3\times3} + \mathbb{1}(\text{use } 5\times5)\cdot \mathbf{w}_{5\times5\setminus 3\times3} \tag{1}

where $\mathbb{1}(\cdot)$ is the indicator function that encodes the architectural (NAS) choice: if $\mathbb{1}(\cdot)=1$, then $\mathbf{w}_k = \mathbf{w}_{3\times3} + \mathbf{w}_{5\times5\setminus 3\times3} = \mathbf{w}_{5\times5}$; otherwise $\mathbb{1}(\cdot)=0$ and $\mathbf{w}_k = \mathbf{w}_{3\times3}$.
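Equation (1) can be read as a masking operation on one shared 5×5 weight tensor. The sketch below shows this for a single 2-D kernel; the helper names `split_superkernel` and `effective_kernel` are hypothetical, and the 3×3 kernel is represented as a 5×5 kernel whose outer ring is zero.

```python
import numpy as np

def split_superkernel(w5):
    """Split a 5x5 kernel into its inner 3x3 part and the outer shell (5x5 \\ 3x3)."""
    mask = np.zeros((5, 5))
    mask[1:4, 1:4] = 1.0          # the central 3x3 positions
    inner = w5 * mask             # w_{3x3}, zero-padded to 5x5
    outer = w5 * (1.0 - mask)     # w_{5x5 \ 3x3}
    return inner, outer

def effective_kernel(w5, use_5x5):
    """w_k = w_{3x3} + 1(use 5x5) * w_{5x5 \\ 3x3}."""
    inner, outer = split_superkernel(w5)
    return inner + float(use_5x5) * outer

w5 = np.arange(25, dtype=float).reshape(5, 5)
k3 = effective_kernel(w5, use_5x5=False)   # outer ring zeroed: behaves as a 3x3 kernel
k5 = effective_kernel(w5, use_5x5=True)    # full 5x5 kernel
```

Both choices read from the same weights, which is where the weight sharing comes from: there are no separate 3×3 and 5×5 branches to train.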

The indicator function compares its input against a learnable threshold $t_{k=5}$. The original indicator function is:

g(x,t) = \mathbb{1}(x > t)

To allow gradients to propagate back, the indicator function is relaxed into a sigmoid:

\hat{g}(x,t) = \sigma(x - t)

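Taking the squared norm of the outer-shell weights as the input $x$ and $t_{k=5}$ as the threshold, the kernel choice becomes the following (written with the hard indicator; the sigmoid stands in for it when gradients are computed):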

\mathbf{w}_k = \mathbf{w}_{3\times3} + \mathbb{1}\!\left(\left\lVert \mathbf{w}_{5\times5\setminus 3\times3}\right\rVert^2 > t_{k=5}\right)\cdot \mathbf{w}_{5\times5\setminus 3\times3}
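A small NumPy sketch of this decision rule follows. It assumes the hard indicator is used in the forward computation and that a sigmoid of `x - t` serves only as a surrogate when differentiating with respect to the threshold. The function name `kernel_decision` and the example values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kernel_decision(w5, t):
    """Choose between 3x3 and 5x5 from the squared norm of the outer-shell weights."""
    mask = np.zeros((5, 5))
    mask[1:4, 1:4] = 1.0
    inner, outer = w5 * mask, w5 * (1.0 - mask)
    x = np.sum(outer ** 2)               # ||w_{5x5 \ 3x3}||^2
    hard = float(x > t)                  # indicator g(x, t) = 1(x > t)
    soft = sigmoid(x - t)                # relaxed indicator (assumed gradient surrogate)
    grad_t = -soft * (1.0 - soft)        # d soft / d t: raising t makes 5x5 less likely
    w_k = inner + hard * outer
    return w_k, hard, grad_t

w5 = np.ones((5, 5))                     # outer shell has 16 entries, so x = 16
w_big, hard_big, grad_big = kernel_decision(w5, t=10.0)      # 16 > 10: keep 5x5
w_small, hard_small, grad_small = kernel_decision(w5, t=20.0)  # 16 < 20: fall back to 3x3
```

The choice is driven by the weights themselves: if training shrinks the outer shell below the threshold, the layer effectively becomes a 3×3 convolution.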

From this design, we can see that whether a given kernel is used is closely related to the weights themselves, mainly depending on the importance of the weights outside the sub-kernel. Overall, this method generalizes the entire MBConv layer into a searchable block.

The final experimental results also look mediocre.

2019 · 04 · 11