NCNN Peak Memory Benchmark: A Layer-by-Layer Analysis of MobileNet

I’ve recently been studying the memory overhead of inference engines, trying to understand how much peak memory a model actually consumes during real inference, and how various optimization options (light_mode, fp16, int8) affect that peak. Using MobileNet as the primary subject, I ran a systematic set of tests with NCNN on the x86 platform, and along the way I also collected data for several common networks.

Layer-by-Layer Memory Derivation for MobileNet

Let’s start with the structure of MobileNet. Its parameters total about 17M bytes (at 4 bytes per fp32 value). Counting each layer as “input + output + convolution parameters,” the per-layer figures add up to roughly 57M bytes, and the largest single-layer figure, which is the theoretical maximum for any one layer, is about 4.8M bytes.

| Layer | Conv Size | Input Size | Memory Usage (bs=1, input + output + conv params) |
| --- | --- | --- | --- |
| conv/s2 | 3×3×3×32 | 224×224×3 | 2211200 bytes |
| conv dw/s1 | 3×3×32 | 112×112×32 | 3212416 bytes |
| conv/s1 | 1×1×32×64 | 112×112×32 | 4825088 bytes |
| conv dw/s2 | 3×3×64 | 112×112×64 | 4016384 bytes |
| conv/s1 | 1×1×64×128 | 56×56×64 | 2441216 bytes |
| conv dw/s1 | 3×3×128 | 56×56×128 | 3215872 bytes |
| conv/s1 | 1×1×128×128 | 56×56×128 | 3276800 bytes |
| conv dw/s2 | 3×3×128 | 56×56×128 | 2011648 bytes |
| conv/s1 | 1×1×128×256 | 28×28×128 | 1335296 bytes |
| conv dw/s1 | 3×3×256 | 28×28×256 | 1614848 bytes |
| conv/s1 | 1×1×256×256 | 28×28×256 | 1867776 bytes |
| conv dw/s2 | 3×3×256 | 28×28×256 | 1012736 bytes |
| conv/s1 | 1×1×256×512 | 14×14×256 | 1126400 bytes |
| conv dw/s1 | 3×3×512 | 14×14×512 | 821248 bytes |
| conv/s1 | 1×1×512×512 | 14×14×512 | 1851392 bytes |
| … ×5 | | | |
| conv dw/s2 | 3×3×512 | 14×14×512 | 520192 bytes |
| conv/s1 | 1×1×512×1024 | 7×7×512 | 2398208 bytes |
| conv dw/s2 | 3×3×1024 | 7×7×1024 | 438272 bytes |
| conv/s1 | 1×1×1024×1024 | 7×7×1024 | 4595712 bytes |
| avg pool | | 7×7×1024 | |
| fc | 1024×1000 | 1×1×1024 | 4104096 bytes |
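The per-layer figures in the table can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: 4 bytes per value (fp32), output shapes inferred from the next row's input size, the “… ×5” row read as the 14×14×512 depthwise/pointwise pair occurring five times in total, and the avg pool row skipped because the table gives no figure for it. The names `Layer`, `layer_bytes`, and `weight_bytes` are made up for this illustration.

```python
from math import prod
from typing import NamedTuple, Tuple

BYTES_PER_VALUE = 4  # assumption: fp32, which matches the table's numbers


class Layer(NamedTuple):
    name: str
    in_shape: Tuple[int, ...]      # "Input Size" column (H, W, C)
    out_shape: Tuple[int, ...]     # inferred from the next row's input size
    weight_shape: Tuple[int, ...]  # "Conv Size" column
    repeat: int = 1                # assumption: "... x5" means five occurrences


LAYERS = [
    Layer("conv/s2", (224, 224, 3), (112, 112, 32), (3, 3, 3, 32)),
    Layer("conv dw/s1", (112, 112, 32), (112, 112, 32), (3, 3, 32)),
    Layer("conv/s1", (112, 112, 32), (112, 112, 64), (1, 1, 32, 64)),
    Layer("conv dw/s2", (112, 112, 64), (56, 56, 64), (3, 3, 64)),
    Layer("conv/s1", (56, 56, 64), (56, 56, 128), (1, 1, 64, 128)),
    Layer("conv dw/s1", (56, 56, 128), (56, 56, 128), (3, 3, 128)),
    Layer("conv/s1", (56, 56, 128), (56, 56, 128), (1, 1, 128, 128)),
    Layer("conv dw/s2", (56, 56, 128), (28, 28, 128), (3, 3, 128)),
    Layer("conv/s1", (28, 28, 128), (28, 28, 256), (1, 1, 128, 256)),
    Layer("conv dw/s1", (28, 28, 256), (28, 28, 256), (3, 3, 256)),
    Layer("conv/s1", (28, 28, 256), (28, 28, 256), (1, 1, 256, 256)),
    Layer("conv dw/s2", (28, 28, 256), (14, 14, 256), (3, 3, 256)),
    Layer("conv/s1", (14, 14, 256), (14, 14, 512), (1, 1, 256, 512)),
    Layer("conv dw/s1", (14, 14, 512), (14, 14, 512), (3, 3, 512), repeat=5),
    Layer("conv/s1", (14, 14, 512), (14, 14, 512), (1, 1, 512, 512), repeat=5),
    Layer("conv dw/s2", (14, 14, 512), (7, 7, 512), (3, 3, 512)),
    Layer("conv/s1", (7, 7, 512), (7, 7, 1024), (1, 1, 512, 1024)),
    Layer("conv dw/s2", (7, 7, 1024), (7, 7, 1024), (3, 3, 1024)),
    Layer("conv/s1", (7, 7, 1024), (7, 7, 1024), (1, 1, 1024, 1024)),
    Layer("fc", (1, 1, 1024), (1, 1, 1000), (1024, 1000)),
]


def weight_bytes(layer: Layer) -> int:
    """Bytes occupied by the layer's weights alone (no bias, no BN)."""
    return prod(layer.weight_shape) * BYTES_PER_VALUE


def layer_bytes(layer: Layer) -> int:
    """Input + output + weights, the counting rule used in the table."""
    values = prod(layer.in_shape) + prod(layer.out_shape) + prod(layer.weight_shape)
    return values * BYTES_PER_VALUE


per_layer = [layer_bytes(layer) for layer in LAYERS]
peak_single_layer = max(per_layer)
sum_of_layers = sum(layer_bytes(layer) * layer.repeat for layer in LAYERS)
total_weights = sum(weight_bytes(layer) * layer.repeat for layer in LAYERS)

for layer, size in zip(LAYERS, per_layer):
    print(f"{layer.name:<12}{size:>10} bytes")
print("largest single layer:", peak_single_layer)
print("sum over all layers: ", sum_of_layers)
print("weights only:        ", total_weights)
```

Under these assumptions the script gives 2211200 bytes for the first row, 4825088 bytes for the largest layer, and totals that agree with the roughly 57M and 17M figures quoted above.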

I found a similar set of statistics in MCUNet, but they differ considerably from my own calculations, so I’ve emailed the authors to ask about it. Judging from the MCUNet code, its counting method takes only the largest input + output activation pair and leaves out the weights themselves. I recalculated using that approach as well, but the result still doesn’t match. Since weights also have to be in memory when they take part in the computation, counting them seems more reasonable to me.

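| | Cloud AI (NVIDIA V100) | Mobile AI (iPhone 11) | Tiny AI (STM32F746) | ResNet-50 | MobileNetV2 | MobileNetV2 (int8) |
| --- | --- | --- | --- | --- | --- | --- |
| Memory | 16 GB | 4 GB | 320 kB | 7.2 MB | 6.8 MB | 1.7 MB |
| Storage | TB~PB | >64 GB | 1 MB | 102 MB | 13.6 MB | 3.4 MB |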

Memory: Cloud→Mobile about 4×, Mobile→Tiny about 3100×; Storage: Cloud→Mobile about 1000×, Mobile→Tiny about 64000×. There is a huge gap between Tiny AI’s memory budget (320 kB) and the actual footprint of the three models on the right.

Experimental Setup and Cross-Network Comparison

The tests were run on an x86 Linux platform with an input image size of 224×224×3, using the NCNN framework, with models converted from ONNX. I hit a few snags here: some operations aren’t supported when converting from ONNX to NCNN, and the online ONNX simplifier is awkward to work with. I recommend cloning it and converting locally instead.

First I ran a baseline experiment: a process that loads the same runtime libraries but performs no inference peaks at about 16M bytes (stripping out some libraries could reduce this further). Running inference repeatedly in a loop does not increase peak memory. After loading the model, NCNN’s baseline peak memory was about 76M bytes.
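On Linux, the VmPeak figure used in these measurements is exposed in `/proc/<pid>/status`. The sketch below shows one way to read it from Python. It is not the tool used for the measurements in this post, the helper names `parse_vm_fields` and `read_self_status` are invented for the example, and the sample text contains made-up numbers purely to show the file format.

```python
import os
import re

# Lines of interest look like "VmPeak:\t   80000 kB".
_VM_LINE = re.compile(r"^(Vm\w+):\s+(\d+)\s+kB$")


def parse_vm_fields(status_text):
    """Return a dict such as {'VmPeak': 80000, ...} with values in kB."""
    fields = {}
    for line in status_text.splitlines():
        match = _VM_LINE.match(line.strip())
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def read_self_status(path="/proc/self/status"):
    """Parse the current process's status file; empty dict if unavailable."""
    if not os.path.exists(path):
        return {}
    with open(path) as handle:
        return parse_vm_fields(handle.read())


# Made-up sample, only to illustrate the format (not measured data).
SAMPLE_STATUS = (
    "Name:\tdemo\n"
    "VmPeak:\t   80000 kB\n"
    "VmSize:\t   79000 kB\n"
    "VmHWM:\t   60000 kB\n"
    "VmRSS:\t   59000 kB\n"
)

sample = parse_vm_fields(SAMPLE_STATUS)
print("sample VmPeak (kB):", sample["VmPeak"])
print("this process:", read_self_status())
```

VmPeak is the peak virtual memory size, while VmHWM is the peak resident set size. The two can differ substantially, so it matters which one a reported “peak memory” refers to. To inspect another process, such as a C++ inference binary, the same parser can be pointed at `/proc/<pid>/status` for that process.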

The table below shows the measured data for each network, as reported by VmPeak:

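| Neural Network | VGG16 | AlexNet | GoogleNet | ResNet18 | ResNet50 | DenseNet161 | ShuffleNetV2 | MobileNet | MobileNetV2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Model Size (onnx file) | 527MB | 233MB | 25.2MB | 44.5MB | 97.4MB | 110MB | 8.67MB | 16.1MB | 13.5MB |

Peak Memory (VmPeak): 1601.4MB, 549.7MB, 410.8MB, 473.3MB, 504.4MB, 48.2MB, 74.2MB, 73.4MB (only eight values are available for the nine networks, listed in their original order)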

How light_mode and fp16/int8 Affect Peak Memory

Taking MobileNetV2 as an example, I toggled NCNN’s options (light_mode, fp16, int8) one at a time and recorded the top output for each run, with the following results:

light_mode=false, peak about 86M:

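| PID | USER | PR | NI | VIRT | RES | SHR | S | %CPU | %MEM | TIME+ | COMMAND |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 18179 | root | 20 | 0 | 114668 | 94992 | 5468 | R | 49.5 | 4.7 | 0:09.71 | check_peak_memo |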

light_mode=true, peak about 76M:

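| PID | USER | PR | NI | VIRT | RES | SHR | S | %CPU | %MEM | TIME+ | COMMAND |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 18597 | root | 20 | 0 | 75140 | 55440 | 5508 | R | 49.3 | 2.7 | 0:10.27 | check_peak_memo |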

light_mode=true, with fp16 also enabled (use_fp16_packed, use_fp16_storage, and use_fp16_arithmetic all true), peak still about 76M:

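| PID | USER | PR | NI | VIRT | RES | SHR | S | %CPU | %MEM | TIME+ | COMMAND |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 18871 | root | 20 | 0 | 75140 | 55360 | 5436 | R | 49.3 | 2.7 | 0:15.72 | check_peak_memo |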

light_mode=true, fp16 fully enabled, plus int8 (use_int8_storage and use_int8_arithmetic also true), and the peak still doesn’t drop noticeably:

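| PID | USER | PR | NI | VIRT | RES | SHR | S | %CPU | %MEM | TIME+ | COMMAND |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 19148 | root | 20 | 0 | 75140 | 55296 | 5364 | R | 49.2 | 2.7 | 0:10.44 | check_peak_memo |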

This leads to the first conclusion: enabling fp16 on its own does not reduce peak memory, and adding fp16 or int8 on top of light_mode=true changes nothing noticeable either. The improvement of roughly 10M comes from light_mode=true, which releases intermediate feature maps promptly so that their memory can be reused. That saving has nothing to do with quantization precision.
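The effect of releasing intermediate feature maps promptly can be illustrated with a toy model. This sketch is not NCNN's allocator. It assumes a purely sequential chain of layers in which each blob is consumed only by the next layer, and the function name `peak_live_bytes` is invented for the example. The blob sizes are the fp32 activation sizes of the first few MobileNet layers.

```python
def peak_live_bytes(blob_sizes, release_consumed):
    """Peak bytes held while running a sequential chain of layers.

    blob_sizes[0] is the network input; blob_sizes[i] is the output of
    layer i. With release_consumed=True, a layer's input is freed as soon
    as its output has been produced (toy stand-in for memory reuse).
    """
    live = blob_sizes[0]
    peak = live
    for i in range(1, len(blob_sizes)):
        live += blob_sizes[i]           # allocate the output of layer i
        peak = max(peak, live)
        if release_consumed:
            live -= blob_sizes[i - 1]   # input of layer i is no longer needed
    return peak


# H * W * C * 4 bytes for the input and the first four layer outputs.
BLOBS = [
    224 * 224 * 3 * 4,
    112 * 112 * 32 * 4,
    112 * 112 * 32 * 4,
    112 * 112 * 64 * 4,
    56 * 56 * 64 * 4,
]

keep_all = peak_live_bytes(BLOBS, release_consumed=False)
reuse = peak_live_bytes(BLOBS, release_consumed=True)
print("keep every blob:  ", keep_all, "bytes")
print("release when done:", reuse, "bytes")
```

When every blob is kept, the peak is the sum of all activations. When consumed inputs are released, it drops to the largest input + output pair. Neither number depends on whether the weights are stored in fp16 or int8, which fits the observation that only light_mode moved the measured peak.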

Differences in How “Peak Memory” Is Defined

There’s a noteworthy issue here: when papers (especially work targeting embedded devices, such as MCUNet) compute peak memory, they count only the memory footprint of the input and output feature maps and leave out the operators’ own weights. In MCU scenarios the weights typically reside in Flash and are read directly during inference without occupying SRAM, so this definition has its own justification. But when measuring on a Linux-platform inference engine (such as NCNN), the weights stay resident in memory, so the numbers under the two conventions are naturally not on the same order of magnitude. This is something to watch out for in particular when comparing data from different sources.
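To make the gap between the two conventions concrete, here is a rough sketch using three MobileNet layers. It rests on assumptions: fp32 values, an “activations only” rule that takes the largest input + output pair, and a “resident weights” estimate that simply adds the roughly 17M bytes of weights quoted earlier to that activation peak. The names are invented for the example, and the result is an estimate rather than a measurement.

```python
from math import prod

BYTES_PER_VALUE = 4            # assumption: fp32
TOTAL_WEIGHT_BYTES = 17_000_000  # "about 17M bytes", as quoted earlier

# name -> (input shape, output shape, weight shape)
LAYERS = {
    "conv/s1 1x1x32x64": ((112, 112, 32), (112, 112, 64), (1, 1, 32, 64)),
    "conv/s1 1x1x1024x1024": ((7, 7, 1024), (7, 7, 1024), (1, 1, 1024, 1024)),
    "fc 1024x1000": ((1, 1, 1024), (1, 1, 1000), (1024, 1000)),
}


def activation_bytes(in_shape, out_shape):
    """Input + output feature maps only (weights assumed to sit in Flash)."""
    return (prod(in_shape) + prod(out_shape)) * BYTES_PER_VALUE


def activation_and_weight_bytes(in_shape, out_shape, weight_shape):
    """Input + output + this layer's weights."""
    return activation_bytes(in_shape, out_shape) + prod(weight_shape) * BYTES_PER_VALUE


acts = {name: activation_bytes(i, o) for name, (i, o, _) in LAYERS.items()}
full = {name: activation_and_weight_bytes(i, o, w) for name, (i, o, w) in LAYERS.items()}

for name in LAYERS:
    print(f"{name:<24} activations={acts[name]:>8}  with weights={full[name]:>8}")

sram_style_peak = max(acts.values())
resident_style_peak = TOTAL_WEIGHT_BYTES + sram_style_peak
print("activations-only peak:        ", sram_style_peak)
print("all weights resident + peak:  ", resident_style_peak)
```

For the fc layer the activations take only 8096 bytes while the weights push the figure to 4104096 bytes, so the convention changes the per-layer picture considerably. Once all weights are assumed resident, the estimate is dominated by the weights rather than the activations.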