Capsule Networks Are Shaking Up AI – Here’s How to Use Them

If you follow AI you might have heard about the advent of the potentially revolutionary Capsule Networks. I will show you how you can start using them today.

By Nick Bourdakos, IBM Watson.

Geoffrey Hinton [Source]


Geoffrey Hinton is known as the father of “deep learning.” Back in the 50s the idea of deep neural networks began to surface and, in theory, they could solve a vast number of problems. However, nobody was able to figure out how to train them, and people started to give up. Hinton didn’t give up, and in 1986 he showed that backpropagation could train these deep nets. However, because of the lack of computational power at the time, it wasn’t until 2012, five years ago, that Hinton was able to demonstrate his breakthrough. This breakthrough set the stage for this decade’s progress in AI.

And now, on October 26, 2017, he has released a paper on a new groundbreaking concept, Capsule Networks.

Note: I won’t go into too much detail, because Hinton’s papers do a fabulous job of explaining all the technical information and can be found here and here.


Problems with Traditional Neural Networks

Up until now, Convolutional Neural Networks (CNNs) have been the state-of-the-art approach to classifying images.

CNNs work by accumulating sets of features at each layer. They start off by finding edges, then shapes, then actual objects. However, the spatial relationship information between all these features is lost.

This is a gross oversimplification, but you can think of a CNN like this:

if (2 eyes && 1 nose && 1 mouth) {
  It's a face!
}

You might be thinking that this sounds pretty good and makes sense, and it does. However, we can run into a few problems. Take this picture of Kim Kardashian, for example:

Yikes! There are definitely two eyes, a nose and a mouth, but something is wrong. Can you spot it? We can easily tell that an eye and her mouth are in the wrong place and that this isn’t what a person is supposed to look like. However, a well-trained CNN has difficulty with this concept:

In addition to being easily fooled by images with features in the wrong place, a CNN is also easily confused when viewing an image in a different orientation. One way to combat this is to train exhaustively on all possible angles, but this takes a lot of time and seems counterintuitive. We can see here the massive drop in performance caused by simply flipping Kim upside down:

Finally, convolutional neural networks can be susceptible to white box adversarial attacks, which essentially embed a secret pattern into an object to make it look like something else.

Fooling Neural Networks in the Physical World with 3D Adversarial Objects [Source]
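To make the idea of a white box attack a little more concrete, here is a minimal sketch. It is not taken from Hinton's papers or from the work pictured above. It assumes a toy logistic-regression "model" with made-up weights (WEIGHTS, BIAS and the four pixel values are all hypothetical) and uses the fast gradient sign method, one well-known attack of this kind, to nudge every pixel slightly in the direction that increases the model's loss. "White box" here just means the attacker can read the model's weights.

```python
import math

# Hypothetical toy "model": a fixed logistic-regression classifier over 4 pixel values.
# A real attack would target a trained CNN; these numbers are made up for illustration.
WEIGHTS = [2.0, -1.0, 0.5, -1.5]
BIAS = 0.1


def predict_proba(pixels):
    """Probability that the input belongs to class 1."""
    score = sum(w * p for w, p in zip(WEIGHTS, pixels)) + BIAS
    return 1.0 / (1.0 + math.exp(-score))


def loss_gradient(pixels, label):
    """Gradient of the cross-entropy loss with respect to the input pixels."""
    error = predict_proba(pixels) - label
    return [error * w for w in WEIGHTS]


def sign(value):
    """Return -1, 0 or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def fgsm(pixels, label, epsilon):
    """Shift every pixel by at most epsilon in the direction that increases the loss."""
    gradient = loss_gradient(pixels, label)
    return [p + epsilon * sign(g) for p, g in zip(pixels, gradient)]


image = [0.6, 0.4, 0.5, 0.3]          # correctly classified as class 1
adversarial = fgsm(image, label=1, epsilon=0.2)

print("original   :", round(predict_proba(image), 3))
print("adversarial:", round(predict_proba(adversarial), 3))
```

With these made-up numbers, the original input is scored above 0.5 and the perturbed one below 0.5, even though no pixel moved by more than 0.2.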

“Convolutional neural networks are doomed” — Geoffrey Hinton


Capsule Networks to the Rescue!


Architecture of CapsNet

The introduction of Capsule Networks gives us the ability to take full advantage of spatial relationships, so we can start to see things more like:

if (2 adjacent eyes && nose under eyes && mouth under nose) {
  It's a face!
}

You should be able to see that with this definition our neural net shouldn’t be as easily fooled by our misshapen Kardashian.
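If you want to play with the difference between the two definitions, here is a small runnable sketch. It is only an analogy, in the same spirit as the oversimplified snippets above, and not how a CNN or CapsNet actually computes anything. The part lists, their coordinates and the max_eye_tilt threshold are hypothetical values I made up for illustration.

```python
# Hypothetical detected parts: (label, x, y), with y growing downward as in image coordinates.
normal_face = [("eye", 30, 40), ("eye", 70, 40), ("nose", 50, 60), ("mouth", 50, 80)]
scrambled_face = [("eye", 30, 40), ("mouth", 70, 40), ("nose", 50, 60), ("eye", 50, 80)]


def count_parts(parts):
    """Count how many times each part label appears, ignoring position."""
    counts = {}
    for label, _x, _y in parts:
        counts[label] = counts.get(label, 0) + 1
    return counts


def is_face_by_counts(parts):
    """Presence-only check: mirrors `2 eyes && 1 nose && 1 mouth`."""
    counts = count_parts(parts)
    return counts.get("eye") == 2 and counts.get("nose") == 1 and counts.get("mouth") == 1


def is_face_by_layout(parts, max_eye_tilt=10):
    """Presence plus relative position: eyes side by side, nose under eyes, mouth under nose."""
    if not is_face_by_counts(parts):
        return False
    eyes = [(x, y) for label, x, y in parts if label == "eye"]
    nose_y = next(y for label, _x, y in parts if label == "nose")
    mouth_y = next(y for label, _x, y in parts if label == "mouth")
    eyes_adjacent = abs(eyes[0][1] - eyes[1][1]) <= max_eye_tilt
    nose_under_eyes = all(nose_y > eye_y for _x, eye_y in eyes)
    mouth_under_nose = mouth_y > nose_y
    return eyes_adjacent and nose_under_eyes and mouth_under_nose


for name, parts in [("normal", normal_face), ("scrambled", scrambled_face)]:
    print(name, is_face_by_counts(parts), is_face_by_layout(parts))
```

Both part lists pass the presence-only check, but only the normal layout passes the check that also looks at where the parts sit relative to each other.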

This new architecture also achieves significantly better accuracy on the following data set. This data set was carefully designed to be a pure shape recognition task that shows the ability to recognize the objects even from different points of view. It beat out the state-of-the-art CNN, reducing the number of errors by 45%.

CapsNet was able to identify that the bottom images were in the same category (animals, humans, airplanes, cars, trucks) as the corresponding top image far better than CNNs.

Furthermore, in their most recent paper, the authors found that capsules show far more resistance to white box adversarial attacks than a baseline convolutional neural network.


Training CapsNet

I have pieced together a repo that is an implementation of Hinton’s paper (many thanks to naturomics). In order to use the Capsule Network model you first need to train it.

The following guide will get you a model trained on the MNIST data set. For those of you who don’t know, MNIST is a data set of handwritten digits and is a good baseline for testing out machine learning algorithms.

Start by cloning the repo:

And install the requirements:

pip install -r requirements.txt

Start training!


The MNIST data set contains 60,000 training images. By default the model will be trained for 50 epochs at a batch size of 128. An epoch is one full run through the training set. Since the batch size is 128, it will do about 468 batches per epoch.
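The batch arithmetic is easy to check yourself. The sketch below only uses the three numbers quoted above. Whether the training script drops or keeps the final partial batch is an assumption on my part, so both counts are shown.

```python
import math

TRAIN_IMAGES = 60_000   # MNIST training set size quoted above
BATCH_SIZE = 128        # default batch size quoted above
EPOCHS = 50             # default number of epochs quoted above

full_batches = TRAIN_IMAGES // BATCH_SIZE                      # batches containing exactly 128 images
leftover_images = TRAIN_IMAGES % BATCH_SIZE                    # images that don't fill a whole batch
batches_with_partial = math.ceil(TRAIN_IMAGES / BATCH_SIZE)    # count if the partial batch is kept

# Assumption: the partial batch is dropped, which matches the "about 468" figure.
total_steps = full_batches * EPOCHS

print(full_batches, leftover_images, batches_with_partial, total_steps)
# prints: 468 96 469 23400
```

So a full default run works out to roughly 23,400 training steps, which is why a GPU makes such a difference.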

Note: Training might take a very long time if you don’t have a GPU. You can read this article on how to speed up training time.


Making Inferences

Once our model is fully trained we can test it by running the following command:

python --is_training False


Final Thoughts

Capsule Networks seem awesome, but they are still babies. We could see problems in the future when training on huge datasets, but I have faith.

P.S. Here is a great video that I recommend taking the time to watch.

Thanks for reading! If you have any questions, feel free to reach out, connect with me on LinkedIn, or follow me on Medium.

If you found this article helpful, it would mean a lot if you gave it some applause👏 and shared to help others find it!

Thanks to Josh Zheng.

Bio: Nick Bourdakos (@nick_bourdakos) is a computer vision addict at IBM Watson, specializing in Java, Swift, Node.js, and React. He also paints pretty pictures.

Original. Reposted with permission.