Notes on a Survey of Deep Learning Interpretability

My group has been talking about neural network interpretability lately, so I did a close reading of Zhang Quanshi’s survey “Visual Interpretability for Deep Learning: a Survey,” and then followed it up with Interpretable CNN, Zhou Bolei’s Network Dissection, and a survey on black-box explanation methods. The field is a bit scattered, so while it’s still fresh I want to organize the threads into one piece.

What This Survey Covers

The survey roughly groups interpretability work into five aspects:

  • Visualization of CNN representations
  • Diagnosis of CNN representations
  • Disentangling the mixture patterns of filters inside a CNN
  • Directly constructing interpretable models
  • Semantic-level middle-to-end learning through human-computer interaction

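I’ll go through the first four below, followed by the evaluation metrics; the fifth topic, middle-to-end learning, is not covered in these notes.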

Visualization of CNN Representations

Visualization is the most direct category. The mainstream approach is gradient-based: compute the gradient of a chosen unit’s activation with respect to the input image, then use that gradient to find the image appearance or region that maximizes the activation, so you know where the node is “looking.”
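To make the gradient idea concrete, here is a minimal sketch of my own, not anything from the survey. It assumes a toy "unit" (a ReLU over a weighted sum of pixels) in place of a real CNN node, and it estimates the gradient with finite differences rather than backpropagation, so `unit_activation`, `saliency_map`, and the 8x8 random image are all illustrative names and data.

```python
import numpy as np

def unit_activation(image, weights):
    """Toy stand-in for one CNN unit: ReLU of a weighted sum of pixels."""
    return max(0.0, float(np.sum(weights * image)))

def saliency_map(score_fn, image, eps=1e-4):
    """Estimate |d score / d pixel| with central finite differences."""
    image = image.astype(float)
    grad = np.zeros_like(image)
    for idx in np.ndindex(image.shape):
        plus = image.copy()
        minus = image.copy()
        plus[idx] += eps
        minus[idx] -= eps
        grad[idx] = (score_fn(plus) - score_fn(minus)) / (2 * eps)
    return np.abs(grad)

rng = np.random.default_rng(0)
image = rng.uniform(0.1, 1.0, size=(8, 8))

# Assumption: the toy unit only "looks" at a 3x3 patch (rows 2-4, columns 3-5).
weights = np.zeros((8, 8))
weights[2:5, 3:6] = 1.0

saliency = saliency_map(lambda img: unit_activation(img, weights), image)
rows, cols = np.nonzero(saliency > 0.5)
print(rows.min(), rows.max(), cols.min(), cols.max())  # 2 4 3 5
```

The pixels with non-zero gradient are exactly the patch the unit depends on. With a real network you would get the same quantity from one backward pass instead of looping over pixels.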

Another approach is up-sampling (the up-convolutional net), which directly reconstructs an image backward from a convolutional layer’s feature map. The problem is that there’s no mathematical guarantee for the correctness of the reconstruction, so it’s less reliable than gradient methods.

A third approach is to precisely compute a filter’s receptive field. It’s worth noting that the theoretical receptive field computed from the filter sizes is larger than the effective receptive field in practice; the region that actually does the work is often smaller.
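For reference, the theoretical receptive field can be computed layer by layer with the standard recurrence: each layer grows the field by (kernel - 1) times the current "jump" (the spacing, in input pixels, between neighboring units). The sketch below assumes plain convolution or pooling layers described only by kernel size and stride; it ignores dilation and padding, and the function name is my own.

```python
def theoretical_receptive_field(layers):
    """layers: list of (kernel_size, stride), ordered from the input upward."""
    rf = 1    # receptive field size in input pixels
    jump = 1  # distance in input pixels between adjacent units of the current layer
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Three stacked 3x3 convolutions with stride 1
print(theoretical_receptive_field([(3, 1), (3, 1), (3, 1)]))          # 7

# conv3, conv3, 2x2 pooling with stride 2, conv3
print(theoretical_receptive_field([(3, 1), (3, 1), (2, 2), (3, 1)]))  # 10
```

This only gives the upper bound. The effective receptive field has to be measured empirically, for example from the gradient magnitudes of a unit with respect to the input.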

Diagnosis of CNN Representations

Diagnosis goes a step beyond plain visualization. The survey breaks it into five sub-directions:

  1. Globally analyzing a CNN’s features: exploring the semantic information of each filter, the transferability of filter representations, and the distribution of different attributes/categories in the feature space of a pre-trained CNN.
  2. Extracting the image regions that contribute directly to the output: some methods propagate the gradient of the final loss with respect to the feature maps back to the image plane to estimate these regions, while others directly extract the regions to which the output is most sensitive.
  3. Estimating the points in feature space that are easily attacked, i.e., the directions of adversarial attacks.
  4. Refining a network’s representation based on analysis of the feature space (details omitted).
  5. Finding latent biased representations inside a CNN: sometimes a single attribute is represented by multiple features at once, producing some erroneous representations.
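For the second sub-direction, the "which regions is the output sensitive to" idea can be sketched with a simple occlusion test: blank out one block of the image at a time and record how much the score drops. This is one common way to do it, not necessarily the method any of the cited papers uses. `toy_score` is a hypothetical model that I made up so the example runs without a trained network.

```python
import numpy as np

def occlusion_sensitivity(score_fn, image, patch=2, fill=0.0):
    """Score drop when each patch x patch block is replaced by `fill`."""
    base = score_fn(image)
    h, w = image.shape
    heat = np.zeros((h // patch, w // patch))
    for i in range(h // patch):
        for j in range(w // patch):
            occluded = image.copy()
            occluded[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = fill
            heat[i, j] = base - score_fn(occluded)
    return heat

def toy_score(img):
    """Hypothetical model: the score is the mean brightness of the top-left 4x4 corner."""
    return float(img[:4, :4].mean())

image = np.ones((8, 8))
heat = occlusion_sensitivity(toy_score, image)
print(heat)
```

Only the four blocks covering the top-left corner produce a score drop (0.25 each); every other block has a drop of zero, which is exactly the region the toy model depends on.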

Decomposing CNN Representations into Explanatory Graphs / Decision Trees

High-level filters usually learn a kind of mixture pattern, which can be represented with a graph structure, where each layer of the graph corresponds to one convolutional layer of the CNN:

Figure: Representing the mixture patterns of high-level filters as an explanatory graph

Another route is to use a decision tree to quantitatively explain each CNN layer: exactly which filter is making the decision and how large its contribution is, expressed through the tree structure.

Figure: Using a decision tree to quantitatively explain convolutional layers

Learning an Interpretable Representation Directly

Rather than explaining after the fact, why not make the network’s representation interpretable by design during training? There are already interpretable variants of several classic networks, such as Interpretable CNN, Interpretable R-CNN, capsule networks, InfoGAN, and so on.

Let me single out Interpretable CNN, because its idea is quite representative. Its goal is to make each filter in the high-level convolutional layers stably represent the same object part — that is, no matter which image is fed in, the filter always responds to the same region of the object. It has several very practical advantages:

  • It can be applied to many different networks
  • It needs no extra annotation, using the same information the original network was trained on
  • It doesn’t change the overall form of the loss
  • It loses a little accuracy, but not much

Concretely, the approach adds a loss to each high-level filter that encourages the filter to express just one part. The idea is to make the mutual information between “this part” and “this filter’s response” as large as possible, so the loss is defined as the negative of that mutual information.
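To show what "negative mutual information as a loss" means numerically, here is a small sketch over a discrete joint probability table. It is a simplification I made up for illustration: rows stand for a hypothetical set of part templates and columns for quantized filter responses, which is not how the paper actually parameterizes the distributions.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a joint table p(t, x); rows are t, columns are x."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    p_t = joint.sum(axis=1, keepdims=True)   # marginal over rows, shape (n, 1)
    p_x = joint.sum(axis=0, keepdims=True)   # marginal over columns, shape (1, m)
    nonzero = joint > 0                       # skip zero cells, whose contribution is 0
    ratio = joint[nonzero] / (p_t @ p_x)[nonzero]
    return float(np.sum(joint[nonzero] * np.log(ratio)))

def filter_loss(joint):
    """Loss to minimize: the negative mutual information."""
    return -mutual_information(joint)

# Response independent of the part: no information, loss is 0
print(filter_loss([[0.25, 0.25], [0.25, 0.25]]))

# Response fully determined by the part: loss is -log(2)
print(filter_loss([[0.5, 0.0], [0.0, 0.5]]))
```

Minimizing this loss pushes the filter toward the second case, where knowing the filter’s response tells you which part is present.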

How to Quantitatively Evaluate Interpretability

Just saying something is “interpretable” is too vague; you need metrics. The survey mainly mentions two pieces of work:

  • Filter interpretability: classify filters into six kinds of semantics (objects, parts, scenes, textures, materials, colors), obtain their receptive fields, then compare against pre-annotated pixel-level semantic information, using IoU to measure the degree of interpretability.
  • Location instability: measure whether the point on the original image corresponding to the location of maximum activation in the feature map is stably and accurately localized relative to a target object. This likewise requires prior annotation.
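Both metrics are easy to sketch. The code below is a rough illustration under my own assumptions: the IoU is computed for a single image with a fixed activation threshold, and location instability is taken as the standard deviation, across images, of the peak-to-landmark distance normalized by the image diagonal. The function names and the toy data are made up.

```python
import numpy as np

def filter_iou(activation, annotation, threshold):
    """IoU between the thresholded activation map and a pixel-level annotation mask."""
    mask = np.asarray(activation) > threshold
    annotation = np.asarray(annotation).astype(bool)
    union = np.logical_or(mask, annotation).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(mask, annotation).sum() / union)

def location_instability(peaks, landmarks, image_sizes):
    """Std. dev. across images of the diagonal-normalized peak-to-landmark distance."""
    peaks = np.asarray(peaks, dtype=float)
    landmarks = np.asarray(landmarks, dtype=float)
    sizes = np.asarray(image_sizes, dtype=float)
    diagonal = np.sqrt((sizes ** 2).sum(axis=1))
    distance = np.linalg.norm(peaks - landmarks, axis=1) / diagonal
    return float(distance.std())

# Toy activation map: the filter fires on a 2x2 block, the annotation covers 2x3
activation = np.zeros((4, 4))
activation[0:2, 0:2] = 1.0
annotation = np.zeros((4, 4), dtype=bool)
annotation[0:2, 0:3] = True
iou = filter_iou(activation, annotation, threshold=0.5)
print(iou)  # 4 overlapping pixels out of a union of 6

# The peak sits at the same offset from the landmark in both images
instability = location_instability(
    peaks=[(10, 10), (20, 30)],
    landmarks=[(13, 14), (23, 34)],
    image_sizes=[(100, 100), (100, 100)],
)
print(instability)
```

A filter that always fires at the same relative position to a landmark gets an instability near zero, which is what an interpretable "part" filter should look like.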

Zhou Bolei’s Interpretable AI Session at VALSE

The thread Zhou Bolei presented in VALSE’s Interpretable AI session lines up nicely with the above. Two of his earlier talks cover the same ground:

  • On the importance of single units (CVPR’18 tutorial talk, available on YouTube)

  • Interpretable representation learning for visual intelligence (MIT thesis defense, available on YouTube)

Stringing together a few of his representative works makes the thread very clear:

  • Disentangling semantic concepts: “Disentanglement of Visual Concepts from Classifying and Synthesizing Scenes,” disentangling internal representations from scene classification.

  • Network Dissection (CVPR 2017): feed images into the network, automatically find the semantically meaningful neurons, and attach semantic labels to the visualization results. This corresponds to the paper “Interpreting Deep Visual Representations via Network Dissection.” Its core contribution is to quantify the degree of interpretability: a unified framework measures the interpretability of any representation, rather than just producing a heatmap.

  • GAN Dissection (ICLR 2019): extends the interpretability analysis from classification networks to generative networks, examining what representations a GAN learns internally.

  • Feature inversion: he also has related work at SIGGRAPH.

Zhou Bolei’s PhD thesis follows this same thread. Its core is visualizing the internal representations of networks, using one framework to quantify any representation, characterizing the visual understanding behind a deep model’s decisions (highlighting the most informative regions), and extending from images to video. He groups the visualization and interpretation of network representations into a few categories: visualization of deep visual representations, attribute analysis, and unsupervised learning.

Where can interpretability be put to use next? Roughly: addressing overfitting, adversarial attacks, model compression, transfer learning, and GANs / RL.

Not Just Vision: Explaining Black-Box Models

Beyond vision models like CNNs, I also read a more general survey, “A Survey of Methods for Explaining Black Box Models,” whose perspective leans more toward “who is the explanation for, and what counts as a good explanation”:

  • Global vs. local interpretability
  • The semantics of an explanation should be easy to understand, and sometimes you have to let the user grasp it quickly
  • Semantic explanations tailored to a specific user’s knowledge background
  • Using an interpretable model to mimic the black box (a surrogate model)
  • Considerations around fairness / privacy
  • Interactive interpretability
  • You also have to consider the interpretability of the data itself — for example, tabular data is inherently easier for people to understand than matrices/vectors
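The surrogate-model item is the easiest one to try out. Here is a minimal sketch, assuming a made-up `black_box` function that we can only query and a plain linear model as the interpretable surrogate; the fidelity score is the R² of the surrogate against the black box’s own outputs, not against ground truth.

```python
import numpy as np

def black_box(X):
    """Hypothetical opaque model; pretend we can only call it, not inspect it."""
    return 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * np.sin(5.0 * X[:, 0])

def fit_linear_surrogate(X, y):
    """Least-squares linear fit; returns [w_0, w_1, ..., bias]."""
    design = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def fidelity_r2(X, y, coef):
    """How well the surrogate reproduces the black box's outputs (1.0 is perfect)."""
    pred = np.column_stack([X, np.ones(len(X))]) @ coef
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))
y = black_box(X)                      # labels come from the black box itself

coef = fit_linear_surrogate(X, y)
r2 = fidelity_r2(X, y, coef)
print(coef, r2)
```

The fitted weights are the "explanation": they say the black box leans positively on the first feature and negatively on the second. The fidelity score tells you how far to trust that reading.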

Summary

The field as a whole is still fairly scattered, but the skeleton is roughly clear: post-hoc explanation (visualization, diagnosis, decomposition into graphs/trees) and ante-hoc construction (directly learning interpretable representations) are the two routes, paired with a set of quantifiable evaluation metrics (Filter interpretability, Location instability, Network Dissection). On the vision side, the work of Zhou Bolei and Zhang Quanshi forms the main line, and pushing further outward gets you to the whole “make it human-readable” approach for black-box models. For anyone doing model compression, interpretability is actually quite useful — knowing which filters are doing work and which are redundant is itself a basis for pruning.