Peeking Inside Convolutional Neural Networks

This post uses a few tricks to peek inside a convolutional neural network and visualize what the individual units in a layer detect.

Some of the units are general shape detectors, responding to edges, circles, corners, cones or similar:

[Figure: Visualizations of (from top left) units 116, 125, 364, 422, 433 and 265 in convolutional layer 5 of VGG-S]

and some seem to detect textures, such as these detecting leopard fur and wood grain:

[Figure: Visualizations of (from left) units 27, 497 and 243 in convolutional layer 5 of VGG-S]

Not all unit visualizations are this easy to interpret. Take these, for example:

[Figure: Visualizations of (from left) units 252, 164 and 300 in convolutional layer 5 of VGG-S]

However, if we look at the images that maximally activate these units, we can see that they detect, respectively, grids and more abstract features: out-of-focus backgrounds and shallow-focus/macro images.

[Figure: Images maximally activating (from top) units 252, 164 and 300 in convolutional layer 5 of VGG-S]
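Finding the images that maximally activate a unit can be sketched roughly as follows. This is a minimal NumPy illustration, not the code used for the figures: it assumes the feature maps of one layer have already been computed for a set of images and stored in an array of shape (n_images, n_units, height, width). The function name `top_activating_images` and the toy data are hypothetical.

```python
import numpy as np

def top_activating_images(activations, unit, k=9):
    """Return indices of the k images with the highest peak activation for `unit`.

    activations: array of shape (n_images, n_units, height, width), assumed to
    hold one layer's feature maps for a collection of images.
    """
    # Peak response of the unit anywhere in each image's activation grid.
    peak = activations[:, unit].max(axis=(1, 2))
    # Sort in descending order of peak response and keep the first k.
    return np.argsort(-peak, kind="stable")[:k]

# Toy data standing in for real feature maps: 20 images, 8 units, 4x4 grid.
rng = np.random.default_rng(0)
acts = rng.random((20, 8, 4, 4))   # values in [0, 1)
acts[7, 3, 2, 1] = 5.0             # make image 7 respond strongly to unit 3
acts[12, 3, 0, 0] = 4.0            # and image 12 slightly less strongly
top = top_activating_images(acts, unit=3, k=2)
```

Here the score of an image is the maximum of the unit's activation grid. Ranking by the mean over the grid would be an equally plausible choice and would favor images where the feature covers most of the frame.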

Overall, these visualizations give us useful insight into what the units in VGG-S detect. However, VGG-S is a relatively shallow network by today's standards, with only 5 convolutional layers. What about visualizing units in deeper networks, such as VGG-16 or GoogLeNet? Unfortunately, this doesn't seem to work as well, though it gives us some interesting results. Here, for instance, is a visualization of some units in convolutional layer 4c from GoogLeNet:

[Figure: Visualizations of (from top left) units 411, 418, 223, 423, 390 and 340 in convolutional layer 4c of GoogLeNet]

You might recognize some of these as the "puppyslugs" from DeepDream. While these visualizations are more detailed than the ones we get from VGG-S, they also tend to look more psychedelic and unreal. It is not completely clear why this happens, but the center of the visualization generally seems to be a good representation of what the unit detects, while the edges give us lots of random details.

Similarly for VGG-16, the visualizations we get are much harder to interpret, though in some of these we can see that the units seem to detect, respectively, some kind of dog, a theater and a brass instrument (with players as blobs).

[Figure: Visualizations of (from top left) units 13, 10, 0, 4, 17 and 2 in convolutional layer 5 of VGG-16]

One hypothetical reason that these visualizations don't work as well for deeper networks has to do with the nature of convolutional networks. Each convolutional layer tries to detect specific features without being sensitive to irrelevant variations such as pose, lighting and partial occlusion. In this sense, each convolutional layer "compresses" information and throws away irrelevant details. This works great for detection, which is what the network is actually meant to do. However, when we try to run the network in reverse and generate feasible images, for each layer we have to "guess" the structural details that have been thrown away. As the choices made in one layer might not be coordinated with those made in other layers, this in effect introduces some amount of "structural noise" for each layer we have to run in reverse. This might be a minor issue for networks with few layers, such as VGG-S, but as we introduce more and more layers, the cumulative "structural noise" might simply overpower the generated structure in the image, making it look much less like what we would recognize as, say, a dog, and more like the "puppyslugs" seen in DeepDream.

More investigations might have to be done to tell whether this is actually the reason that visualization fails for deeper networks, but I wouldn't be surprised if this is part of the reason. Below I briefly describe the technical details of how I made these visualizations.

Technical details

To visualize the features, I'm using pretty much the same technique I described in an earlier blog post: starting from a randomly initialized image and doing gradient ascent on the image with respect to the activation of a specific unit. I also blur the image between gradient ascent iterations (which is equivalent to regularization via a smoothness prior), and gradually reduce the "width" of the blur during the ascent in order to get natural-looking images. Since units in intermediate layers actually output a grid of activations over the entire image, I optimize a single point in this grid, which gives a feature visualization corresponding to the unit's receptive field.
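The loop described above might look something like the following sketch. It is not the released code: to stay self-contained it replaces the network with a toy linear "unit" (a single 3x3 filter evaluated at one grid position), whereas in practice the gradient would come from backpropagation through the CNN. All function names, the step size, and the blur schedule are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur of a 2-D array, with reflected borders."""
    if sigma <= 0:
        return img
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    k /= k.sum()
    padded = np.pad(img, radius, mode="reflect")
    # Blur along rows, then along columns.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def unit_activation_and_grad(img, kernel, row, col):
    """Toy stand-in for a network unit: one filter response at one grid point.

    Returns the activation and its gradient with respect to the image.
    """
    kh, kw = kernel.shape
    activation = float((img[row:row + kh, col:col + kw] * kernel).sum())
    grad = np.zeros_like(img)
    grad[row:row + kh, col:col + kw] = kernel
    return activation, grad

def visualize_unit(kernel, size=16, steps=50, step_size=0.1,
                   sigma_start=1.5, sigma_end=0.0, seed=0):
    """Gradient ascent on the image, blurring between steps with a shrinking width."""
    rng = np.random.default_rng(seed)
    img = rng.normal(scale=0.01, size=(size, size))  # random initial image
    row = col = (size - kernel.shape[0]) // 2        # single point in the grid
    for i in range(steps):
        _, grad = unit_activation_and_grad(img, kernel, row, col)
        # Ascent step on a normalized gradient.
        img = img + step_size * grad / (np.abs(grad).max() + 1e-8)
        # Linearly anneal the blur width from sigma_start down to sigma_end.
        sigma = sigma_start + (sigma_end - sigma_start) * i / (steps - 1)
        img = gaussian_blur(img, sigma)
    return img, (row, col)

# A vertical-edge filter as the hypothetical unit to visualize.
edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)
result, (r, c) = visualize_unit(edge_kernel)
final_activation, _ = unit_activation_and_grad(result, edge_kernel, r, c)
```

The wide blur in the early iterations favors large, smooth structure, and the narrower blur later on lets finer detail appear, which is the role the shrinking blur plays in the description above.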

Another trick I used was to modify the network to use leaky ReLUs instead of regular ReLUs. Otherwise the gradient will usually be zero when we start from a blank image, which hinders the initial gradient ascent steps. Since this modification doesn't seem to have a significant effect on the predictions of the network, we can assume it doesn't have a major impact on the feature visualizations either.
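A small numerical illustration of why the leaky ReLU helps is given below. It assumes a single unit with weights `w` and a negative bias `b`, evaluated on an all-zero input. The weights, the bias and the leak slope of 0.01 are made-up values for the example, since the post does not state the slope that was used.

```python
import numpy as np

def relu_grad(pre_activation):
    """Derivative of ReLU: 1 where the input is positive, 0 elsewhere."""
    return (np.asarray(pre_activation) > 0).astype(float)

def leaky_relu_grad(pre_activation, slope=0.01):
    """Derivative of leaky ReLU: 1 where the input is positive, `slope` elsewhere."""
    return np.where(np.asarray(pre_activation) > 0, 1.0, slope)

# Hypothetical unit: pre-activation = w . x + b, with a blank (all-zero) input.
w = np.array([0.5, -1.0, 2.0])
b = -0.1
x = np.zeros(3)
pre = w @ x + b                         # equals b, which is negative

# Chain rule: d(activation)/dx = f'(pre) * w
grad_relu = relu_grad(pre) * w          # all zeros, so ascent cannot start
grad_leaky = leaky_relu_grad(pre) * w   # small but nonzero
```

With the regular ReLU the gradient with respect to the input vanishes whenever the pre-activation is not positive, so the image never changes. The leaky variant passes a scaled-down gradient through, which is enough to get the ascent moving.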

I've released the code I used to make these visualizations, so take a look if you want to know more details.

Similar work

There has been similar work on visualizing convolutional networks by e.g. Zeiler and Fergus and lately by Yosinski, Nguyen et al. In a recent work, Nguyen et al. manage to visualize features very well, based on a technique they call "mean-image initialization". Since I started writing this blog post, they've also published a new paper using Generative Adversarial Networks as priors for the visualizations, which leads to far better visualizations than the ones I've shown above. If you are interested, do take a look at their paper or the code they've released!

If you enjoyed this post, you should follow me on twitter!

Bio: Audun M. Øygard is a Data Scientist with a background in fine arts and statistics, creator of the JavaScript facetracking library clmtrackr, and is currently working with image recognition at Schibsted Media Group.

Original. Reposted with permission.