Darts: Differentiable Architecture Search

Abstract

This paper takes on architecture search by formulating the task in a differentiable form, instead of the traditional approach of using reinforcement learning over a discrete, non-differentiable space. The method is based on a continuous relaxation of the architecture representation, allowing efficient methods such as gradient descent to be used for architecture search. Subsequent experiments show that the algorithm performs well at discovering high-performance CNN architectures for image recognition and RNN architectures for language modeling, and is far faster than existing state-of-the-art non-differentiable architectures.

Introduction

Automatically searched network architectures have achieved very competitive performance on tasks such as image classification and object detection. However, the best architecture search algorithms today place extremely high demands on computational power. For example, using reinforcement learning to obtain a state-of-the-art architecture on CIFAR-10 and ImageNet requires 1800 GPU days. There are now some acceleration methods, such as setting specific architectures for the search space / weights / performance prediction, and sharing weights across architectures. But the fundamental tension over scalability still remains. The reason these mainstream methods (RL, MCTS, SMBO, Bayesian optimization) are relatively inefficient is that they treat the architecture search problem as a black-box optimization problem over a discrete space, which leads to a large number of architecture evaluations being required.

This work proposes another approach to the problem: darts (Differentiable Architecture Search), which relaxes the search space to make it continuous, so that the architecture can be optimized via gradient descent. This allows darts to achieve performance competitive with the state of the art while requiring far less computation. The method also surpasses the performance of another efficient network search method (ENAS). darts is much simpler than many other networks because it contains no controllers / hypernetworks / performance predictors. So far darts performs well at searching both CNNs and RNNs.

The idea of conducting architecture search in a continuous space is not new, but there are some key differences here: prior work mainly does fine-tuning, e.g. of filter shapes. darts is able to explore high-performance architectures within complex graph topologies with large search spaces, and can be applied simultaneously to CNNs and RNNs. Experiments have shown that darts achieves strong results on CIFAR-10, ImageNet, and PTB.

The main contributions of this paper can be summarized as follows:

It introduces a differentiable architecture search algorithm that can be applied to both CNNs and RNNs.
Through experiments on image classification and language modeling, it demonstrates that gradient-based architecture search achieves competitive results on CIFAR-10 and surpasses other state-of-the-art architectures on PTB. Interestingly, the best architecture search algorithms today rely on non-differentiable, RL- and evolution-based search techniques.
It achieves very good architecture search efficiency by using gradient-based optimization rather than non-differentiable search techniques.
The architectures darts learns on CIFAR-10 and PTB can be transferred to ImageNet and WikiText-2.

Differentiable Architecture Search

This section first gives an overall introduction to the search space, with the entire computation flow described as a directed acyclic graph. It then introduces the continuous relaxation of connections, as well as the joint optimization of weights and architecture. Finally, it introduces an approximation technique to make the computation feasible and efficient.

Search Space

The goal is to search for a computation cell to serve as the building block of the final architecture, which can be stacked into a CNN or recurrently connected into an RNN. A cell is a directed acyclic graph with N nodes, where each node $x^{(i)}$ corresponds to a feature map, and each directed edge $(i,j)$ corresponds to an operation $o^{(i,j)}$ that transforms $x^{(i)}$ . Each node is computed from the previous nodes:

x^{(i)}=\sum_{j<i}o^{(i,j)}(x^{(j)})

When there is no connection between two nodes, a null operation is introduced, denoted as the $zero$ operation.

Continuous Relaxation and Optimization

Let $O$ be the set of all possible operations, where each operation is denoted as $o(\cdot)$ . To make the search space continuous, the choice among the candidate operations is relaxed over all operations in a softmax-like manner:

\bar{o}^{(i,j)}(x)=\sum_{o\in O}\frac{exp(\alpha_{o}^{(i,j)})}{\sum_{o^{'}\in O}exp(\alpha_{o^{'}}^{(i,j)})}o(x)

The operation between two nodes $i,j$ is encoded as an $|O|$ -dimensional vector $\alpha^{(i,j)}$ . After relaxation, the task of searching for an architecture becomes searching for a set $\alpha=\{\alpha^{(i,j)}\}$ , which is also called the encoding of the architecture. Finally, the mixed operation $\bar{o}^{(i,j)}$ is replaced by the most likely operation; for instance, the operation corresponding to the largest weight in the vector can be chosen as the selected operation.

After relaxation, the next goal is to jointly optimize the weights $w$ and the network architecture $\alpha$ . As with RL and evolution, the accuracy on the validation set is used as the reward, except that here a gradient-based method is used for optimization rather than searching over a discrete space. With $L_{train},L_{val}$ denoting the losses on the training and validation sets respectively, and the loss being jointly determined by $w,\alpha$ , this gives rise to a bilevel optimization problem:

min_{\alpha}\ L_{val}(w^{*}(\alpha),\alpha)\\ s.t. \ w^{*}(\alpha)=argmin_{w}\ L_{train}(w,\alpha)

In fact, $\alpha$ can also be regarded as a hyperparameter, except that the dimension of this hyperparameter is far larger than those scalar hyperparameters.

Approximation

Directly solving such a bilevel optimization problem is very difficult, so an approximate algorithm is given here:

In essence, the network architecture is first fixed in the initial situation and the weights $w$ are optimized, then on the basis of the current weights the network architecture $\alpha$ is optimized, alternating back and forth until convergence, and finally the mixed operation is fixed to a specific operation. Note that in the above formula, when updating the network architecture $\alpha$ , the term $w-\xi \nabla_{w}L_{train}(w,\alpha)$ serves as a single-step gradient descent to approximate $w^{*}(\alpha)$ . A similar approach has been applied to meta-learning for model transfer. This iterative algorithm defines a Stackelberg game between $w,\alpha$ . However, there is currently no guarantee that the algorithm will converge, though in practice it can converge when $\xi$ is chosen appropriately. Also note that when momentum is used in the optimization of the weights, the single-step forward algorithm changes accordingly, and the earlier analysis still holds.

In the first step, the gradient with respect to $w$ is just the ordinary neural network backpropagation process. In the second step, to update the network architecture $\alpha$ , one needs to compute the gradient of $L_{val}(w-\xi\nabla_{w}L_{train}(w,\alpha),\alpha)$ with respect to $\alpha$ . Through computation, we obtain:

\nabla_{\alpha}L_{val}(w-\xi\nabla_{w}L_{train}(w,\alpha),\alpha)=\nabla_{\alpha}L_{val}(w',\alpha)-\xi\nabla_{\alpha,w}^{2}L_{train}(w,\alpha)\nabla_{w'}L_{val}(w',\alpha)

where $w'=w-\xi\nabla_{w}L_{train}(w,\alpha)$ is the result after the single-step forward process. The latter term in the formula above involves a matrix-vector product, an operation that requires a large amount of computation. However, it can be realized through an approximation. Let $\epsilon$ be an infinitesimal quantity:

\nabla_{\alpha,w}^{2}L_{train}(w,\alpha)\nabla_{w'}L_{val}(w',\alpha)\thickapprox\frac{\nabla_{\alpha}L_{train}(w^{+},\alpha)-\nabla_{\alpha}L_{train}(w^{-},\alpha)}{2\epsilon}

where $w^{+}=w+\epsilon\nabla_{w'}L_{val}(w',\alpha),w^{-}=w-\epsilon\nabla_{w'}L_{val}(w',\alpha)$ .

This reduces the computational complexity from $O(|\alpha||w|)$ to $O(|\alpha|+|w|)$ .

When $\xi=0$ , the second-order term in the formula above no longer takes effect, and the gradient of the entire loss function with respect to $\alpha$ becomes $\nabla_{\alpha}L_{val}(w,\alpha)$ . This amounts to removing the single-step forward process used to approximate the next $w$ , which improves speed but leads to worse performance. In what follows, the case of $\xi=0$ is called the first-order approximation, and the case of $\xi>0$ is called the second-order approximation.

Deriving Discrete Architectures

After training is complete, the corresponding discrete architecture needs to be generated. First, for each intermediate node, the k strongest predecessor nodes are selected; here, k=2 for CNNs and k=1 for RNNs. The strength of this edge is defined as: $max_{o\in O,o\neq zero}\frac{exp(\alpha_{o}^{(i,j)})}{\sum_{o'\in O}exp(\alpha_{o'}^{(i,j)})}$ .

Then each mixed operation is fixed to a specific operation by taking its largest possible value.

The convergence path obtained from an experiment under a simple L_val and L_train is shown in the figure below. By comparison with the analytical solution of the problem, correctly choosing $\xi$ in the iterative algorithm can converge to a better result.

Experiments and Results

The experiments conducted on CIFAR-10 and PTB consist of two stages: architecture search and architecture evaluation. In the first stage, candidate cells are searched and the best set of cells is chosen based on performance on the validation set. In the second stage, these cells are used to build a larger network, whose performance is then reported on the test set. Finally, the best cell is used to evaluate transferability. This summary focuses mainly on the CNN-related parts and does not cover the RNN content.

Architecture Search

Only the convolution-related parts are summarized here. The operation set $O$ includes: 3x3 and 5x5 separable conv, 3x3 and 5x5 dilated separable conv, 3x3 max pooling, 3x3 avg pooling, identity, zero. All operations have stride=1, and for the convolution operations the ReLU-Conv-BN order is used, with each separable conv applied twice.

Here the conv cell consists of 7 nodes, the output node is defined as the depthwise concat of the internal nodes, and the first and second nodes in cell k are equal to the outputs of cell k-1 and cell k-2, using a 1x1 conv when necessary. The cells at the 1/3 and 2/3 points of the network are reduction cells, in which all operations connected to the input nodes have stride=2. Thus the architecture encoding is split into $\alpha_{normal},\alpha_{reduce}$ .

Throughout the search process the architecture changes. In BN, batch-specific statistics are used instead of the global moving average, and learnable affine parameters are disabled in all BN.

Half of the CIFAR-10 training data is used as the validation set, and a cell=8 model is trained for 50 epochs, batch_size=64, initial_channels=16. These parameter settings ensure training can be done on a single GPU. The optimization of w uses momentum SGD, and the optimization of $\alpha$ uses Adam, taking a total of one day on a single GPU.

Architecture Evaluation

To select the architecture to be used for evaluation, darts was run 4 times with different random seeds, and the best cell was selected based on validation set performance. The weights were then randomly initialized and trained, and performance on the test set was observed.

After the cell was selected via the small network, a larger network was trained: 20 cells, 600 epochs, batch_size=96, with some additional augmentation measures: cutout, path dropout, auxiliary towers. Training took 1.5 days on a single GPU.

Results Analysis

The experimental results of darts on CIFAR-10 and ImageNet are shown in the figures below. On ImageNet the same cells as on CIFAR-10 were used:

Architecture	Test Error (%)	Params (M)	Search Cost (GPU days)	Search Method
DenseNet-BC (Huang et al., 2017)	3.46	25.6	–	manual
NASNet-A + cutout (Zoph et al., 2017)	2.65	3.3	1800	RL
NASNet-A + cutout (Zoph et al., 2017)†	2.83	3.1	3150	RL
AmoebaNet-A + cutout (Real et al., 2018)	3.34 ± 0.06	3.2	3150	evolution
AmoebaNet-A + cutout (Real et al., 2018)†	3.12	3.1	3150	evolution
AmoebaNet-B + cutout (Real et al., 2018)	2.55 ± 0.05	2.8	3150	evolution
Hierarchical Evo (Liu et al., 2017b)	3.75 ± 0.12	15.7	300	evolution
PNAS (Liu et al., 2017a)	3.41 ± 0.09	3.2	225	SMBO
ENAS + cutout (Pham et al., 2018b)	2.89	4.6	0.5	RL
Random + cutout	3.49	3.1	–	–
DARTS (first order) + cutout	2.94	2.9	1.5	gradient-based
DARTS (second order) + cutout	2.83 ± 0.06	3.4	4	gradient-based

CIFAR-10 training results

In the figures above, NASNet and AmoebaNet are state-of-the-art; the difference in results compared to them is small, but the search time is much shorter. The same effect holds for the ImageNet dataset as well.

Conclusion

darts is the first differentiable architecture search algorithm, applicable to both CNNs and RNNs. By searching within a continuous space, darts can match or even surpass the performance of state-of-the-art non-differentiable architecture search algorithms on image classification and language modeling tasks, and is significantly more efficient. In the future, darts will be used to explore direct architecture search on larger tasks.

Technology

2018 · 10 · 22