Randomly Wired Neural Networks: A New Take on Replacing NAS with Graph Theory

Kaiming He and Ross just put out a paper. In short, it swaps the way network structures are produced—from manual design or NAS—for a random graph generator. The generator directly produces a random computation graph, with no selection step at all: whatever comes out randomly is what you use.

Core Idea

This approach is fundamentally different from Random Search in NAS. Random Search randomly searches a batch of structures and then picks the best one among them; here, structures are generated at random with no selection whatsoever. Experiments show that under the same generator parameters, the performance variance across networks generated from different random seeds is very small—the results are all about the same. This means selection is simply unnecessary.

The author’s conclusion: rather than spending effort designing or searching for one specific network structure, it’s better to design a better network generator (i.e., the sampler). Making the structures the generator samples generally good matters more than optimizing a single structure or a search process. This direction is worth thinking about seriously.

Three Random Graph Generation Algorithms

The paper uses three classic random graph generation algorithms from graph theory: Erdős–Rényi (ER), Barabási–Albert (BA), and Watts–Strogatz (WS). Some sample generated graphs are shown below:

These random graphs are not necessarily directed acyclic graphs (DAGs), whereas a neural network’s computation graph must be a DAG—otherwise forward propagation would contain a cycle. The paper handles this by encoding the nodes with a topological ordering, converting the generated graph into a DAG.

Erdős–Rényi (ER)

Two parameters: the number of nodes N and the edge probability P.

Iterate over all pairs of nodes, and connect each pair with an edge independently with probability P. Any graph containing N nodes can potentially be generated by ER—there is no structural preference at all; it’s completely random.

Barabási–Albert (BA)

Two parameters: the number of nodes N and the number of new edges added per step M.

Initially there are M isolated nodes and no edges. Then, each time, a new node is inserted and connected with M edges, with the connection targets chosen at random in proportion to the in-degree of the existing nodes—nodes with higher degree are more likely to be selected. A total of (N − M) nodes are inserted, generating M(N − M) edges.

Watts–Strogatz (WS)

Three parameters: the number of nodes N, the initial number of neighbors K (even), and the rewiring probability P.

Initially, the N nodes are arranged in a ring, with each node connected to its K/2 neighbors on each side. Then, for each node, going clockwise, the not-yet-connected nodes (other than the node itself) are selected in order, and an additional edge is added with probability P. A graph generated by this algorithm has exactly N×K edges.

Overall Network Structure: How Much Human Prior Is Retained

The randomization applies only to the topology of the computation graph; the network as a whole still requires some human-designed prior knowledge.

The overall structure follows the classic multi-stage paradigm (the conv layers in the figure below). Different stages are connected by convolutions that perform downsampling, controlling the feature-map resolution. The random wiring happens within each stage.

stage	output	small regime	regular regime
conv $_1$	112×112	3×3 conv, $C/2$	3×3 conv, $C/2$
conv $_2$	56×56	3×3 conv, $C$	random wiring, $N/2,\ C$
conv $_3$	28×28	random wiring, $N,\ C$	random wiring, $N,\ 2C$
conv $_4$	14×14	random wiring, $N,\ 2C$	random wiring, $N,\ 4C$
conv $_5$	7×7	random wiring, $N,\ 4C$	random wiring, $N,\ 8C$
classifier	1×1	1×1 conv, 1280-d; global average pool, 1000-d $fc$ , softmax	1×1 conv, 1280-d; global average pool, 1000-d $fc$ , softmax

The operation at each node is fixed as a 3×3 depthwise separable convolution (sep conv).

A special node is added at both the head and the tail of each random subgraph:

Input node: connects to all nodes in the original graph with zero in-degree, copying the input tensor and distributing it to them.
Output node: connects to all nodes in the graph with zero out-degree, taking a weighted sum of these outputs to produce the stage’s final output.

Within the graph, when a node has multiple inputs, a weighted sum is likewise computed. The overall structure is shown below:

Very Few Hyperparameters

The hyperparameters of the whole network are mainly used to control its scale: the number of nodes in the random graph, the number of channels, and the input image resolution. Each generation algorithm has only one or two internal parameters. The author ran a systematic comparison on ImageNet across different algorithms and different hyperparameters:

Experimental Results

ImageNet

The experiments span three scales—large, medium, and small. Under the same compute and parameter budget, the randomly generated networks nearly reach the state-of-the-art level of the time:

network	top-1 acc.	top-5 acc.	FLOPs (M)	params (M)
MobileNet [15]	70.6	89.5	569	4.2
MobileNet v2 [40]	74.7	–	585	6.9
ShuffleNet [54]	70.9	89.8	524	~5
ShuffleNet v2 [30]	73.7	–	524	~5
NASNet-A [56]	74.0	91.6	564	5.3
NASNet-B [56]	72.8	91.3	488	5.3
NASNet-C [56]	72.5	91.0	558	4.9
Amoeba-A [34]	74.5	92.0	555	5.1
Amoeba-B [34]	74.0	91.5	555	5.3
Amoeba-C [34]	75.7	92.4	570	6.4
PNAS [26]	74.2	91.9	588	5.1
DARTS [27]	73.1	91.0	595	4.9
RandWire-WS	74.7±0.25	92.2±0.15	583±6.2	5.6±0.1

Table 2. ImageNet: small computation regime (i.e., <600M FLOPs). RandWire results are the mean accuracy (±std) of 5 random network instances, with WS(4, 0.75). Here we train for 250 epochs similar to [56, 34, 26, 27], for fair comparisons.

network	top-1 acc.	top-5 acc.	FLOPs (B)	params (M)
ResNet-50 [11]	77.1	93.5	4.1	25.6
ResNeXt-50 [52]	78.4	94.0	4.2	25.0
RandWire-WS, $C=109$	79.0±0.17	94.4±0.11	4.0±0.09	31.9±0.66
ResNet-101 [11]	78.8	94.4	7.8	44.6
ResNeXt-101 [52]	80.1	95.2	8.0	44.2
RandWire-WS, $C=154$	80.1±0.19	94.8±0.18	7.9±0.18	61.5±1.32

Table 3. ImageNet: regular computation regime with FLOPs comparable to ResNet-50 (top) and to ResNet-101 (bottom). ResNeXt is the 32×4 version [52]. RandWire is WS(4, 0.75).

network	test size	epochs	top-1 acc.	top-5 acc.	FLOPs (B)	params (M)
NASNet-A [56]	331²	>250	82.7	96.2	23.8	88.9
Amoeba-B [34]	331²	>250	82.3	96.1	22.3	84.0
Amoeba-A [34]	331²	>250	82.8	96.1	23.1	86.7
PNASNet-5 [26]	331²	>250	82.9	96.2	25.0	86.1
RandWire-WS	320²	100	81.6±0.13	95.6±0.07	16.0±0.36	61.5±1.32

Table 4. ImageNet: large computation regime. Our networks are the same as in Table 3 ( $C=154$ ), but we evaluate on 320×320 images instead of 224×224. Ours are only trained for 100 epochs.

COCO Object Detection

Replacing the backbone with a randomly generated network yields better detection accuracy than ResNet-50 and ResNeXt-50:

backbone	AP	AP $_{50}$	AP $_{75}$	AP $_S$	AP $_M$	AP $_L$
ResNet-50 [11]	37.1	58.8	39.7	21.9	40.8	47.6
ResNeXt-50 [52]	38.2	60.5	41.3	23.0	41.5	48.8
RandWire-WS, $C=109$	39.6	61.9	43.3	23.6	43.5	52.7
ResNet-101 [11]	39.8	61.7	43.3	23.7	43.9	51.7
ResNeXt-101 [52]	40.7	62.9	44.5	24.4	44.8	52.6
RandWire-WS, $C=154$	41.1	63.1	44.6	24.6	45.1	53.0

Table 5. COCO object detection results fine-tuned from the networks in Table 3, reported on the val2017 set. The backbone networks have comparable FLOPs to ResNet-50 or ResNet-101.

Robustness Experiment

After the network is trained, some nodes and edges are randomly removed to measure how much the accuracy drops. The robustness varies across the different generation algorithms:

Honestly, I find this set of experiments not very meaningful—why artificially remove nodes and edges from an already-trained network? The practical application scenario for this setup isn’t really clear.

Node Operation Ablation

Changing the node operation from a 3×3 sep conv to a regular 3×3 conv, or to pooling followed by a 1×1 conv—in the end, the 3×3 sep conv works best:

A Practical Concern

The FLOPs and parameter counts in the paper are fairly small, but I suspect the actual memory cost during training would be large. The reason is that the random graph involves a lot of tensor copy and distribution operations (the input node copies and distributes the tensor to multiple child nodes, and multi-branch merges also require storing intermediate activations). This runtime GPU memory overhead is not reflected in the FLOPs or parameter count.

Of course, it might not—if the framework treats the “copy” of an immutable tensor as merely passing a reference rather than a real copy, the extra overhead might not be that large. But this isn’t guaranteed under current frameworks, and it’s worth actually running it to verify.

Summary

The main contribution of this paper is proposing a new idea: the future direction of network design should shift from traditional structure design or NAS toward designing better network generators. If the structures a generator samples are generally high-performing with tiny variance, then optimizing the generator (the sampling distribution) is more fundamental than optimizing a single structure or a search process. Few hyperparameters and little prior knowledge are the clear advantages of this direction over NAS. This idea is worth following up on.

Paper Notes
NAS
Network Architecture

2019 · 04 · 07