Stochastic Depth Networks Accelerate Deep Network Training

An overview of a new deep neural network training method, and a response to some of the strong reactions it brought about.

By Delip Rao, Joostware.

Stochastic Depth Networks will become the new normal... in deep learning that is.

Editor's note: This post originally appeared as 2 separate posts; as such, the second half of the post reflects reactions to the content of the first half.

Part 1

Every day a half dozen or so new deep learning papers come out on arXiv, but very few catch my eye. Recently, I read “Deep Networks with Stochastic Depth”. I think that, like dropout and batch normalization, this will be a game changer. For one, the results speak for themselves: in some cases up to a 40% reduction in training time while at the same time beating the state of the art.

Figure 1. Error rate vs. survival probability (explained later)

Why is that a big deal? The biggest impediment to applying deep learning (or, for that matter, any software engineering process) in product development is turnaround time. If I spend a week training my model and _then_ find it is a pile of shit, because I did not initialize something well or the architecture was missing something, that’s not good. For this reason, everyone I know wants the best GPUs or the biggest clusters: not just because they let them build more expressive networks, but simply because they’re super fast. So any technique that improves experiment turnaround time is welcome!

The idea is ridiculously simple (perhaps why it is effective?): randomly skip layers while training. As a result, you have a network whose expected depth is small, while its maximum depth can be on the order of thousands. In effect, like dropout training, this creates an ensemble model from the 2^L possible networks of an L-layer deep network.

Figure: active and inactive layers during training
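The skipping rule fits in a few lines of NumPy. This is a toy sketch, not the authors' implementation: `residual_block`, the weight list, and the survival probabilities are all made up for illustration. The one faithful detail is the test-time behavior: every block runs, scaled by its survival probability, mirroring dropout's rescaling.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, W):
    # Toy residual branch: a single ReLU layer (stand-in for conv-BN-ReLU).
    return np.maximum(x @ W, 0.0)

def forward(x, weights, survival_probs, training=True):
    # Stochastic depth: while training, each residual block is dropped
    # with probability 1 - p, leaving only the identity skip connection.
    # At test time every block runs, scaled by its survival probability.
    for W, p in zip(weights, survival_probs):
        if training and rng.random() >= p:
            continue  # block skipped: the skip connection alone carries x
        scale = 1.0 if training else p
        x = x + scale * residual_block(x, W)
    return x

d, L = 8, 10
weights = [rng.standard_normal((d, d)) * 0.1 for _ in range(L)]
probs = [0.8] * L
out = forward(rng.standard_normal((4, d)), weights, probs, training=True)
```

Each training pass thus samples one member of the 2^L-network ensemble; gradient updates only touch the blocks that survived that pass, which is where the speedup comes from.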

I also like that this new method adds only one hyperparameter to tune (two, if you count the decay scheme): the layer survival probability. From their experiments, this hyperparameter appears quite low maintenance. Most arbitrary values you pick seem to do well, unless you pick something really low (see Figure 1).
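The decay scheme sets per-block survival probabilities that fall linearly with depth, from 1 near the input down to a chosen value p_L at the last block. A minimal sketch (the function name is mine; the paper's recommended endpoint is p_L = 0.5):

```python
def linear_decay_survival(num_blocks, p_last=0.5):
    # p_l = 1 - (l / L) * (1 - p_L): early blocks almost always survive,
    # the deepest block survives with probability p_last.
    L = num_blocks
    return [1.0 - (l / L) * (1.0 - p_last) for l in range(1, L + 1)]

probs = linear_decay_survival(4)
# probs == [0.875, 0.75, 0.625, 0.5]
```

With p_last = 0.5, the expected number of active blocks per pass is roughly 3L/4, which is where much of the training-time saving comes from.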

Something odd you notice, also from Figure 1, is that training seems to do well (at least on CIFAR data) even when you keep the deepest layers only 20% of the time. Remember all the narratives we told about how depth learns hierarchical representations, and higher-level representations? Those higher-level representations don’t seem to matter so much after all.

Question: for really deep networks, can we ditch the model weights at the higher levels to keep the model footprint small enough to fit on mobile devices? (In addition to things like binarization, etc.)

Expect to see a flurry of papers showing results of Stochastic Depth applied to other network architectures pretty soon.

Part 2

Yesterday, I wrote (excitedly) about stochastic depth in neural networks. The reactions I saw to that paper ranged from “dang! I should’ve thought of that” to, umm, shall we say, annoyed?

U mad, bro?

This reaction is not surprising at all. The idea is one of those “frustratingly simple” ideas that worked. If you read the paper, there is no new theory or model there, nor do the authors spend much time on why things work, other than a hand-wavy explanation involving ensembles. Critics might question whether there was any “contribution to science” here; I’m sure some reviewers will.

The fact of the matter is that nobody knows *exactly* why this works. My guess is this: a lot of the regularization we are seeing probably comes from preventing layers from co-adapting with each other. Just as dropout discourages units from co-adapting with each other, stochastic depth may be discouraging entire subsets of layers from co-adapting with each other. No doubt there is an army of people out there ready to science the hell out of this and explain better what’s going on. Kudos to them; I look forward to those works.

But that is science. The realities of practice, however, are different. As a practitioner, if you are in the business of approximating functions, there is no escaping the methods now branded as deep learning, or (old and new) ensemble methods; in fact, top submissions at Kaggle, for instance, routinely use one or both of these. I care about the accuracy improvements I can get by squeezing in more parameters without losing generalization performance. But more importantly, I care about training/testing turnaround time. A lot. Fancy tree-structured models that take forever to converge for a marginal improvement in accuracy? No, thank you.

Whenever I see something that fits the bill, I don’t hesitate to co-opt it. We don’t understand some of these things well today; a lot of deep learning is like that. I have used my Twitter stream to call this out, but it hasn’t stopped me from using deep learning where I see fit. At the same time, I will be the first to criticize anyone claiming things like “deep learning is intelligence” or “deep learning will solve all problems”.

So, back to stochastic depth. I think this idea has a lot of promise, and I’m bullish on the time savings in training. While we don’t fully understand it, we will use it, and eventually figure out the *exact* reasons for its success. This reminds me of random projections, where the idea is to reduce dimensionality by simply multiplying with a random binary matrix. Sounds stupid? When it came out (Kaski, 1998), the paper, too, had only a hand-wavy explanation. It wasn’t until a few years later that connections were made to the Johnson–Lindenstrauss lemma, and the method was understood in depth.
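For concreteness, random projection itself fits in a couple of lines. This is a generic ±1 sketch rather than Kaski's exact construction: project onto a random sign matrix scaled by 1/√k, and pairwise distances are approximately preserved, which is what the Johnson–Lindenstrauss lemma later made precise.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_projection(X, k):
    # Project d-dimensional rows of X down to k dimensions with a random
    # +/-1 matrix, scaled by 1/sqrt(k) so pairwise distances are
    # preserved in expectation.
    d = X.shape[1]
    R = rng.choice([-1.0, 1.0], size=(d, k)) / np.sqrt(k)
    return X @ R

X = rng.standard_normal((5, 1000))
Y = random_projection(X, 200)
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
ratio = proj / orig  # close to 1 for k this large
```

No training, no structure exploited, just random mixing, and yet distances survive. The parallel to stochastic depth is the point: a method can work reliably years before the theory catches up.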

Until then, I will keep an open mind, be critical, and use whatever works.

Bio: Delip Rao is the founder of Joostware, a San Francisco-based AI consulting and product development studio. He can be reached for further questions at delip @

Originals: Part 1 & Part 2. Reposted with permission.