Silver BlogDeep Learning Can be Applied to Natural Language Processing

This post is a rebuttal to a recent article suggesting that neural networks cannot be applied to natural language given that language is not a produced as a result of continuous function. The post delves into some additional points on deep learning as well.

Image credit

There is an article going around the rounds at LinkedIn that attempts to make an argument against the use of Deep Learning in the domain of NLP. The article written by Riza Berkan “Is Google Hyping it? Why Deep Learning cannot be Applied to Natural Languages Easily” has several arguments about DL cannot possibly work and that Google is exaggerating its claims. The latter argument is of course borderline conspiracy theory.

Yannick Vesley has written a rebuttal “Neural Networks are Quite Neat: a Reply to Riza Berkan” where he makes his arguments on each point that Berkan makes. Vesley’s points are on the mark, however one can not ignore the feeling that DL theory has a few unexplained parts in it.

However, before I do get into that, I think it is very important for readers to understand that DL currently is an experimental science. That is, DL capabilities are actually discovered by researchers by surprise. There are certainly a lot of engineering that goes into the optimization and improvement of these machines. However, its capabilities are ‘unreasonably effective’, in short, we don’t have very good theories to explain its capabilities.

It is clear that there are gaps in understanding are in at least 3 open questions:

  1. How is DL able to search high dimensional discrete spaces?
  2. How is DL able to perform generalization if it appears to be performing rote memorization?
  3. How does (1) and (2) arise from simple components?

Berkan’s arguments exploit our current lack of a solid explanation with his own alternative approach. He is arguing that a symbolicist approach is the road to salvation. Unfortunately, no where in his arguments does he reveal the brittleness of the symbolicist approach, the lack of generalization and the lack of scalability. Has anyone created a rule based system that is able to classify images based on low level features that rivals DL? I don’t think so.

DL practitioners, however, aren’t stopping their work just because they don’t have air tight theoretical foundations. DL works and works surprisingly well. DL at is present state is an experimental science and it is absolutely clear that there is something going on underneath the covers that we don’t fully understand. A lack of understanding however does not invalidate the approach.

To understand the issues better, I wrote in an earlier article about “Architecture Ilities found in Deep Learning Systems”. I basically spell out the 3 capabilities in DL:

  • Expressibility — This quality describes how well a machine can approximate universal functions.
  • Trainability — How well and quickly a DL system can learn its problem.
  • Generalizability — How well machine can perform predictions on data that it has not been trained on.

There are of course other capabilities that also need to be considered in DL: Interpretability, modularity, transferability, latency, adversarial stability and security. But these are the main ones.

To get our bearing right about explaining all of these, we have to consider the latest experimental evidences. I’ve written about this here “Rethinking Generalization” which I summarize again:

The ICLR 2017 submission “Understanding Deep Learning required Rethinking Generalization“ is certainly going to disrupt our understanding of Deep Learning . Here is a summary of what the had discovered through experiments:

1. The effective capacity of neural networks is large enough for a brute-force memorization of the entire data set.

2. Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels.

3. Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.

The point here that surprises most Machine Learning practitioners is the ‘brute-force memorization’. See, ML has always been about curve fitting. In curve fitting you find a sparse set of parameters that describe your curve and you use that to fit the data. The generalization that comes into play relates to the ability to interpolate between points. The major disconnect here is that DL have exhibited impressive generalization, yet it cannot possibly work if we consider them as just memory stores.

However, if we consider them as holographic memory stores, then that problem of generalization has a decent explanation. In “Deep Learning are Holographic Memories” I point out the experimental evidence that:

The Swapout learning procedure which tells us that if you sample any subnetwork of the entire network the resulting prediction will be the similar to any other subnetwork you look sample. Just like holographic memory where you can slice of pieces and still recreate the whole.

As it turns out, the universe itself is driven by a similar theory called the Holographic Principle. In fact, this serves as a very good base camp to begin a more solid explanation of the capabilities of Deep Learning. I introduce the “The Holographic Principle: Why Deep Learning Works” where I introduce a technical approach of using Tensor Networks that performs a reduction of the high dimensional problem space into a space that is computable within acceptable response times.

So going back again to the question about wether NLP can be handled by Deep Learning approaches. We certainly know that it can work, afterall, are you not reading and comprehending this text?

There certainly is a lot of confusion in the ranks of expert data scientists and ML practitioners. I was aware of the existence of this “push back” when I wrote: “11 Arguments that Experts get Wrong about Deep Learning”. However, Deep Learning likely can be best explained by a simple intuition that can be explained to a five year old:

DE3p Larenn1g wrok smliair to hOw biarns wrok.

Tehse mahcnies wrok by s33nig f22Uy pa773rns and cnonc3t1ng t3Hm t0 fU22y cnoc3tps. T3hy wRok l4y3r by ly43r, j5ut lK1e A f1l73r, t4k1NG cmopl3x sc3n3s aNd br3k41ng tH3m dwon itno s1pmLe iD34s.

A symbolic system cannot read this, however a human can.

In 2015, Chris Manning, an NLP practitioner wrote about the concerns of the field regarding Deep Learning (see: Computational Linguistics and Deep Learning). It is very important to take note of his arguments since his arguments are not in conflict with the capabilities of Deep Learning. His two arguments why NLP experts need not worry are as follows:

(1) It just has to be wonderful for our field for the smartest and most influential people in machine learning to be saying that NLP is the problem area to focus on; and (2) Our field is the domain science of language technology; it’s not about the best method of machine learning — the central issue remains the domain problems.

The first argument isn’t a criticism of Deep Learning. The second argument explains that he doesn’t believe in one-size-fits-all generic machine learning that works for all domains. That is not in conflict with the above Holographic Principle approach that indicates the importance of the network structure.

To conclude, I hope this article puts an end to the discussion that DL is not applicable to NLP.

If perhaps you still aren’t convinced, then maybe Chris Manning himself should convince you himself:

Bio: Carlos Perez is a software developer presently writing a book on "Design Patterns for Deep Learning". This is where he sources his ideas for his blog posts.

Original. Reposted with permission.