Resource-Constrained NAS: Using Submodular Optimization to Justify Greedy Search

I recently came across a paper on NAS in the mobile setting. One eye-catching result is that it searched on ImageNet in only 8 days (which, honestly, isn’t all that fast), but the idea is well worth borrowing—at its core, it casts the search problem into a classic framework from combinatorial optimization, namely submodular optimization.

The Starting Point: Resource-Constrained NAS Is NP-hard

The authors first point out that searching for the highest-accuracy network under a given compute budget is an NP-hard problem. That being the case, what’s needed is an approximation method with theoretical guarantees, rather than relying purely on heuristic search.

Basic Concepts of Submodular Functions

A submodular function is a set function defined over the power set of a finite set, i.e., a mapping from sets to real numbers.

The function’s core property concerns marginal gains: if $A$ is a subset of $B$ , then for any element $x$ not in $B$ , it satisfies

$f(A \cup \{x\}) - f(A) \geq f(B \cup \{x\}) - f(B)$

Put plainly, adding the same element to a smaller set yields a gain no smaller than adding it to a larger set—this is precisely the characterization of diminishing marginal returns.

When the function additionally satisfies monotonicity, i.e., the larger the set, the larger the function value:

\mathcal{A} \subseteq \mathcal{B} \subseteq \mathcal{V},\ F(\mathcal{A}) \leq F(\mathcal{B}).

we obtain a monotone non-decreasing submodular function.

Mapping NAS onto the Submodular Framework

The paper defines a set function $F$ that maps the power set of a set of $N$ blocks onto a set of real numbers—that is, it directly maps a subset of blocks to the validation accuracy (val accuracy) of the corresponding network architecture.

In the ideal case where SGD can find the global optimum, $F$ is monotone—adding more blocks does not lower accuracy:

F(\mathcal{A}) \leq F(\mathcal{B}) \text{ for any } \mathcal{A} \subseteq \mathcal{B} \subseteq \mathcal{V}.

In reality, of course, SGD cannot always find the global optimum, but experience shows that even when $F$ is not strictly monotone, it is non-decreasing, so the monotonicity assumption is reasonable in practice.

After further casting the problem into the framework of monotone submodular functions, we have the following theorem:

Lemma 1. For any selected blocks $\mathcal{A} \subseteq \mathcal{B} \subseteq \mathcal{V}$ and block $v \in \mathcal{V} \setminus \mathcal{B}$ , it holds that:

\begin{aligned} F(\mathcal{A}) &\geq 0 \\ F(\mathcal{A} \cup \{v\}) - F(\mathcal{A}) &\geq 0 \\ F(\mathcal{A} \cup \{v\}) - F(\mathcal{A}) &\geq F(\mathcal{B} \cup \{v\}) - F(\mathcal{B}) \end{aligned}

This theorem says two things:

Adding a block to any architecture never decreases accuracy (monotonicity).
Adding a block to a small architecture improves accuracy by no less than adding the same block to a large architecture (submodularity).

The Greedy Algorithm and Its Approximation Guarantee

With monotone submodularity in hand, the solution emerges naturally: the greedy algorithm. Concretely, at each step it picks from the candidate set the block with the largest “marginal accuracy gain / compute cost” ratio, adds it to the current architecture, and continues until it hits the compute budget.

Algorithm 1: Cost-Effective Greedy CNN Search (CEG)

function CEG(V, F, B, c(·))
    S ← argmax_{v ∈ V} F(v) / c(v)
    while c(S) ≤ B do
        v* = argmax_{v ∈ V \ S} (F(S ∪ {v}) − F(S)) / c(v)
        S ← S ∪ {v*}
    end while
    return S
end function

Since $F$ is a monotone submodular function, classical approximation theory guarantees that the greedy strategy achieves a $(1 - 1/e) \approx 63\%$ approximation ratio relative to the optimal solution. For an NP-hard problem, this is a theoretically grounded, non-trivial guarantee.

Cutting the Compute Cost Further

Naive greedy requires re-evaluating the marginal gain of all remaining candidate blocks at every step, which is fairly expensive. The paper further optimizes the algorithm:

Algorithm 2: Lazy Cost-Effective Greedy Search

function LAZY-CEG(V, F, B_p, B_m, c(·))
    S ← ∅
    PriorityQueue Q ← PriorityQueue()
    for all v ∈ V do                                  ▷ First iteration
        if Param(v) ≤ B_p AND MAdds(v) ≤ B_m then
            Q.push({v, F(v) / c(v)})
        end if
    end for
    S ← S ∪ {Q.pop()}
    while ∃ v ∈ Q : Param(S ∪ {v}) ≤ B_p AND MAdds(S ∪ {v}) ≤ B_m do   ▷ Lazy update
        v* ← Q.pop()
        if v* ∈ V \ S then
            if (F(S ∪ {v*}) − F(S)) / c(v*) ≥ (F(S ∪ {Q.top()}) − F(S)) / c(v) then
                S ← S ∪ {v*}
            else
                Q.push({v*, (F(S ∪ {v*}) − F(S)) / c(v*)})
            end if
        end if
    end while
    return S
end function

Algorithm 3: Resource Constrained Architecture Search (RCAS)

function RCAS(V, F, B_p, B_m)
    S̃_UC  ← LAZY-CEG(V, F, B_p, B_m, const(·))
    S̃_APR ← LAZY-CEG(V, F, B_p, B_m, Param(·))
    S̃_AMR ← LAZY-CEG(V, F, B_p, B_m, MAdds(·))
    return argmax{S̃_UC, S̃_APR, S̃_AMR}
end function

Exploiting the properties of submodular functions, one can avoid exactly computing the marginal gain of all blocks in every round, thereby drastically reducing the number of actual evaluations—this is also one of the keys to completing the search on ImageNet within 8 days.

Overall, viewing resource-constrained NAS as constrained maximization of a submodular function not only provides a theoretical basis for the greedy algorithm but also makes handling the constraint more natural—there’s no need to manually tune a penalty weight; instead, the compute budget is handled directly within the combinatorial optimization framework. This perspective is uncommon in the NAS literature and worth recording.

Paper Notes
NAS
Optimization Algorithms

2019 · 04 · 10