Auto-DeepLab Reading Notes: Rethinking the Essence of the NAS Search Space

Overview of Auto-DeepLab

Auto-DeepLab is the first paper I came across that carries NAS over from classification to semantic segmentation. Its most interesting innovation is in the search space: like DARTS, it searches the internal structure of a cell, but it also searches the macro network topology outside the cell at the same time. This macro search space covers the connection patterns of several classic segmentation architectures, such as DeepLabv3, Conv-Deconv, and Stacked Hourglass.

Search efficiency is not bad either: the search took only 3 days on a P100, because, like DARTS, it uses a gradient-based search.

The Search Mechanism

Updates at the cell level are essentially the same as in DARTS. Updates along the macro path are different: each step considers all possible next-hop paths (to half the scale, the same scale, or twice the scale) and places a distribution over them, which amounts to a continuous relaxation at the path level.

$$
\begin{aligned} {}^{s}H^{l} ={}& \beta^{l}_{\frac{s}{2}\to s}\,\mathrm{Cell}\!\left({}^{\frac{s}{2}}H^{l-1},\, {}^{s}H^{l-2};\, \alpha\right) \\ &+ \beta^{l}_{s\to s}\,\mathrm{Cell}\!\left({}^{s}H^{l-1},\, {}^{s}H^{l-2};\, \alpha\right) \\ &+ \beta^{l}_{2s\to s}\,\mathrm{Cell}\!\left({}^{2s}H^{l-1},\, {}^{s}H^{l-2};\, \alpha\right) \end{aligned}
$$
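The weighted sum above can be sketched in a few lines of plain Python. This is only a toy under stated assumptions: states are scalars rather than feature maps, `toy_cell` is a hypothetical stand-in for the real cell, and the helper names (`normalize_outgoing`, `relaxed_state`) are mine. The betas are normalized with a softmax over the paths leaving each source node.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_cell(h_prev, h_prev_prev, alpha):
    # Hypothetical stand-in for Cell(., .; alpha): a softmax(alpha)-weighted
    # blend of two scalar "ops" (identity and doubling) plus the l-2 input.
    w_id, w_double = softmax(alpha)
    return w_id * h_prev + w_double * (2.0 * h_prev) + h_prev_prev

def normalize_outgoing(beta_logits):
    # beta_logits[src][dst] is the logit of the path src -> dst.
    # The softmax runs over the paths that leave each source scale.
    beta = {}
    for src, out in beta_logits.items():
        dsts = sorted(out)
        probs = softmax([out[d] for d in dsts])
        beta[src] = dict(zip(dsts, probs))
    return beta

def relaxed_state(s, h_prev, h_prev_prev_s, beta, alpha):
    # h_prev[src] is the layer l-1 state at scale src; the result is the
    # beta-weighted sum of cell outputs over the incoming paths to scale s.
    total = 0.0
    for src in (s // 2, s, 2 * s):
        if src in h_prev:
            total += beta[src][s] * toy_cell(h_prev[src], h_prev_prev_s, alpha)
    return total

# Uniform logits; scale 32 is assumed to be the coarsest, so it has two exits.
beta_logits = {8: {4: 0.0, 8: 0.0, 16: 0.0},
               16: {8: 0.0, 16: 0.0, 32: 0.0},
               32: {16: 0.0, 32: 0.0}}
beta = normalize_outgoing(beta_logits)
state = relaxed_state(16, {8: 1.0, 16: 2.0, 32: 3.0}, 0.5, beta, alpha=[0.0, 0.0])
```

Note that the three weights entering a node need not sum to 1 in this sketch, because the normalization is done per source node rather than per destination.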

Specifically, the parameters fall into two groups: alpha controls the internal structure of the cell, while beta controls the macro topology. Each beta weights the output of an entire cell, which is itself parameterized by a whole set of alphas, so beta sits one level above alpha. During training, alpha and beta are updated together in the same step, alternating with the update of the network weights:

Update network weights $w$ by $\nabla_{w}\mathcal{L}_{trainA}(w, \alpha, \beta)$
Update architecture $\alpha, \beta$ by $\nabla_{\alpha,\beta}\mathcal{L}_{trainB}(w, \alpha, \beta)$
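The alternation between these two updates can be illustrated with a toy problem. This is a minimal sketch, not the paper's training code: the model is a scalar with a hypothetical two-op mix, a single vector `arch` stands in for both alpha and beta, and gradients are taken by central finite differences instead of backpropagation to keep the example dependency-free.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(x, w, arch):
    # Hypothetical mixed op: softmax(arch)-weighted blend of a linear
    # and a quadratic op, scaled by a single weight w.
    p_lin, p_quad = softmax(arch)
    return w * (p_lin * x + p_quad * x * x)

def loss(data, w, arch):
    return sum((predict(x, w, arch) - y) ** 2 for x, y in data) / len(data)

def num_grad(f, params, eps=1e-5):
    # Central finite differences, swapped in for backpropagation.
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

def search(train_a, train_b, steps=500, lr_w=0.05, lr_arch=0.1):
    w, arch = 0.5, [0.0, 0.0]
    for _ in range(steps):
        # Step 1: update the weight on split A with the architecture fixed.
        g_w = num_grad(lambda p: loss(train_a, p[0], arch), [w])[0]
        w -= lr_w * g_w
        # Step 2: update the architecture on split B with the weight fixed.
        g_arch = num_grad(lambda p: loss(train_b, w, p), arch)
        arch = [a - lr_arch * g for a, g in zip(arch, g_arch)]
    return w, arch

# The target is purely quadratic, so the quadratic op should gain weight.
train_a = [(x, 2.0 * x * x) for x in (0.5, 1.0, 1.5)]
train_b = [(x, 2.0 * x * x) for x in (0.25, 0.75, 1.25)]
w, arch = search(train_a, train_b)
```

Keeping the two splits disjoint is the point of the scheme: the architecture parameters are judged on data the weights were not fitted to.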

A Few Thoughts on the Essence of the NAS Search Space

After reading this paper, I started thinking about a more fundamental question: do today’s NAS methods find ever more accurate networks because the search algorithms themselves have gotten better?

I tend to believe that, to a large extent, it is the design of the search space that is doing the work. Take DARTS as an example: a network randomly sampled from its search space already achieves quite high accuracy on CIFAR. Behind this lies a heavy stack of human priors: the number of cells and how they connect, the number of nodes inside each cell and how they connect, and so on. The search space looks large, but the networks in it do not actually vary that much in structure. Moreover, a substantial part of the accuracy ultimately reported comes from the various tricks layered on during training, rather than entirely from a well-designed network structure.

A more direct controlled experiment would be: within the same time budget, randomly sample from the search space, take the structure with the largest parameter count and FLOPs, and then train and fine-tune it with exactly the same tricks. The result would not necessarily be worse.
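The sampling half of that experiment is easy to sketch. Everything here is an illustrative assumption: the op list is a small hypothetical subset, the per-op costs in `OP_COST` are made-up units rather than measured parameter counts, and an architecture is reduced to one op per edge. Training the selected network is out of scope.

```python
import random

# Hypothetical per-edge parameter costs in arbitrary units, not measurements.
OP_COST = {
    "skip": 0,
    "avg_pool_3x3": 0,
    "sep_conv_3x3": 9,
    "dil_conv_3x3": 9,
    "sep_conv_5x5": 25,
}
NUM_EDGES = 8  # assumed number of edges per cell

def sample_arch(rng):
    # One uniformly random op per edge.
    ops = list(OP_COST)
    return tuple(rng.choice(ops) for _ in range(NUM_EDGES))

def param_cost(arch):
    return sum(OP_COST[op] for op in arch)

def biggest_of_random(n_samples, seed=0):
    # Sample n_samples architectures and keep the most expensive one.
    rng = random.Random(seed)
    candidates = [sample_arch(rng) for _ in range(n_samples)]
    return max(candidates, key=param_cost)

best = biggest_of_random(50)
```

The baseline uses no search signal at all, only the prior that a larger network from this space tends to do at least as well.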

The reason behind this lies in the density of the search space. The space looks large, but the “distance” between networks is actually very small, so the whole space is extremely dense. For instance, choosing a 3×3 convolution over a 5×5 convolution has little effect on the final accuracy, yet by the definition of the search space the two count as entirely different structures. In such a dense space, only the slightest guidance is needed to find a decent structure. Whether it is the reward signal in reinforcement learning, the gradient in DARTS, or the probabilities of the corresponding operations in the paper I co-authored with my labmate Xiawu, in essence they all merely point in a rough direction within an already crowded space.

A Potential Research Direction

This brings to mind a piece of work worth doing: evaluating the search space itself. Two search spaces that contain the same number of networks can differ greatly in the spacing between networks. One can be dense and the other sparse, and the sparse one naturally covers a wider range.
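One possible way to put a number on this is the mean pairwise distance between networks in a space. The sketch below is a toy under loud assumptions: each op gets a made-up one-dimensional "behaviour" coordinate, and the distance between two networks is the summed per-edge difference. A real study would need a distance grounded in measured accuracy or feature similarity.

```python
import itertools

# Hypothetical 1-D "behaviour" coordinates per op; the values are made up.
DENSE_OPS = {"conv_3x3": 1.0, "conv_5x5": 1.1, "conv_7x7": 1.2}
SPARSE_OPS = {"skip": 0.0, "conv_3x3": 1.0, "max_pool_3x3": 3.0}

def enumerate_space(ops, num_edges):
    # Every assignment of one op per edge.
    return list(itertools.product(ops, repeat=num_edges))

def distance(a, b, ops):
    # Summed per-edge difference in the assumed behaviour coordinate.
    return sum(abs(ops[x] - ops[y]) for x, y in zip(a, b))

def mean_pairwise_distance(ops, num_edges):
    space = enumerate_space(ops, num_edges)
    pairs = list(itertools.combinations(space, 2))
    return sum(distance(a, b, ops) for a, b in pairs) / len(pairs)

dense_spacing = mean_pairwise_distance(DENSE_OPS, 3)
sparse_spacing = mean_pairwise_distance(SPARSE_OPS, 3)
```

Both toy spaces contain the same number of networks, yet the one built from near-interchangeable convolutions has a much smaller mean spacing.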

A more sensible search strategy might be: first find a relatively good rough region within a sparse space (that is, use search to find a good-enough prior, rather than relying entirely on manual human specification), and then perform a fine-grained search within the dense space near that region. This would both reduce reliance on human priors and make more effective use of search resources.
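The two-stage idea can be shown on a one-dimensional toy. This is only an illustration of the strategy: `score` is a hypothetical stand-in for validation accuracy, the "architecture" is a single number, and both stages use plain grid evaluation rather than any real NAS algorithm.

```python
def score(x):
    # Hypothetical stand-in for the validation accuracy of the
    # architecture at "position" x; the peak at 6.3 is arbitrary.
    return -(x - 6.3) ** 2

def coarse_to_fine(lo=0.0, hi=10.0, coarse_step=2.0, fine_step=0.1):
    # Stage 1: sparse space, a few widely spaced candidates.
    n_coarse = int(round((hi - lo) / coarse_step))
    coarse = [lo + i * coarse_step for i in range(n_coarse + 1)]
    center = max(coarse, key=score)
    # Stage 2: dense space, restricted to the neighbourhood of the winner.
    n_fine = int(round(2 * coarse_step / fine_step))
    fine = [center - coarse_step + i * fine_step for i in range(n_fine + 1)]
    fine = [x for x in fine if lo <= x <= hi]
    best = max(fine, key=score)
    return center, best, len(coarse) + len(fine)

center, best, n_evals = coarse_to_fine()
```

With these defaults the search evaluates 47 candidates, whereas a fine grid over the whole interval would need 101, and the coarse stage takes over the role that a hand-specified prior would otherwise play.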


2019 · 03 · 21