KDnuggets Exclusive: Interview with Yann LeCun, Deep Learning Expert, Director of Facebook AI Lab

We discuss what enabled Deep Learning to achieve remarkable successes recently, his argument with Vapnik about (deep) neural nets vs kernel (support vector) machines, and what kind of AI can we expect from Facebook.

Yann LeCun Prof. Yann LeCun has been much in the news lately, as one of the leading experts in Deep Learning - a breakthrough advance in machine learning which has been achieving amazing successes, as a founding Director of NYU Center for Data Science, and as the newly appointed Director of the AI Research Lab at Facebook. See his bio at the end of this post and you can learn more about his work at yann.lecun.com/.

He is extremely busy, combining his new job at Facebook and his old job at NYU, so I am very pleased that he agreed to answer a few questions for KDnuggets readers.

Gregory Piatetsky: 1. Artificial Neural networks have been studied for 50 years, but only recently they have achieved remarkable successes, in such difficult tasks as speech and image recognition, with Deep Learning Networks. What factors enabled this success - big data, algorithms, hardware?

Yann LeCun: Despite a commonly-held belief, there have been numerous successful applications of neural nets since the late 80's.

Deep learning has come to designate any learning method that can train a system with more than 2 or 3 non-linear hidden layers.

Around 2003, Geoff Hinton, Yoshua Bengio and myself initiated a kind of "conspiracy" to revive the interest of the machine learning community in the problem of learning representations (as opposed to just learning simple classifiers). It took until 2006-2007 to get some traction, primarily through new results on unsupervised training (or unsupervised pre-training, followed by supervised fine-tuning), with work by Geoff Hinton, Yoshua Bengio, Andrew Ng and myself.

But much of the recent practical applications of deep learning use purely supervised learning based on back-propagation, altogether not very different from the neural nets of the late 80's and early 90's.

What's different is that we can run very large and very deep networks on fast GPUs (sometimes with billions of connections, and 12 layers) and train them on large datasets with millions of examples. We also have a few more tricks than in the past, such as a regularization method called "drop out", rectifying non-linearity for the units, different types of spatial pooling, etc.

Many successful applications, particularly for image recognition use the convolutional network architecture (ConvNet), a concept I developed at Bell Labs in the late 80s and early 90s. At Bell Labs in the mid 1990s we commercially deployed a number of ConvNet-based systems for reading the amount on bank check automatically (printed or handwritten).

At some point in the late 1990s, one of these systems was reading 10 to 20% of all the checks in the US. Interest in ConvNet was rekindled in the last 5 years or so, with nice work from my group, from Geoff Hinton, Andrew Ng, and Yoshua Bengio, as from Jurgen Schmidhuber's group at IDSIA in Switzerland, and from NEC Labs in California. ConvNets are now widely used by Facebook, Google, Microsoft, IBM, Baidu, NEC and others for image and speech recognition. [GP: A student of Yann Lecun recently won Dogs vs Cats competition on Kaggle using a version of ConvNet, achieving 98.9% accuracy.]

GP: 2. Deep learning is not an easy to use method. What tools, tutorials would you recommend to data scientists, who want to learn more and use it on their data? Your opinion of Pylearn2, Theano?

Yann LeCun: There are two main packages:
They have slightly different philosophies and relative advantages and disadvantages. Torch7 is an extension of the LuaJIT language that adds multi-dimensional arrays and numerical library. It also includes an object-oriented package for deep learning, computer vision, and such. The main advantage of Torch7 is that LuaJIT is extremely fast, in addition to being very flexible (it's a compiled version of the popular Lua language).

Theano+Pylearn2 has the advantage of using Python (it's widely used, and has lots of libraries for many things), and the disadvantage of using Python (it's slow).

GP: 3. You and I have met a while ago at a scientific advisory meeting of KXEN, where Vapnik's Statistical Learning Theory and SVM were a major topic. What is the relationship between Deep Learning and Support Vector Machines / Statistical Learning Theory?

Yann LeCun: Vapnik and I were in nearby office at Bell Labs in the early 1990s, in Larry Jackel's Adaptive Systems Research Department. Convolutional nets, Support Vector Machines, Tangent Distance, and several other influential methods were invented within a few meters of each other, and within a few years of each other. When AT&T spun off Lucent In 1995, I became the head of that department which became the Image Processing Research Department at AT&T Labs - Research. Machine Learning members included Yoshua Bengio, Leon Bottou, and Patrick Haffner, and Vladimir Vapnik. Visitors and interns included Bernhard Scholkopf, Jason Weston, Olivier Chapelle, and others.

Support Vectors Vapnik and I often had lively discussions about the relative merits of (deep) neural nets and kernel machines. Basically, I have always been interested in solving the problem of learning features or learning representations. I had only a moderate interest in kernel methods because they did nothing to address this problem. Naturally, SVMs are wonderful as a generic classification method with beautiful math behind them. But in the end, they are nothing more than simple two-layer systems. The first layer can be seen as a set of units (one per support vector) that measure a kind of similarity between the input vector and each support vector using the kernel function. The second layer linearly combines these similarities.

It's a two-layer system in which the first layer is trained with the simplest of all unsupervised learning method: simply store the training samples as prototypes in the units. Basically, varying the smoothness of the kernel function allows us to interpolate between two simple methods: linear classification, and template matching. I got in trouble about 10 years ago by saying that kernel methods were a form of glorified template matching. Vapnik, on the other hand, argued that SVMs had a very clear way of doing capacity control. An SVM with a "narrow" kernel function can always learn the training set perfectly, but its generalization error is controlled by the width of the kernel and the sparsity of the dual coefficients. Vapnik really believes in his bounds. He worried that neural nets didn't have similarly good ways to do capacity control (although neural nets do have generalization bounds, since they have finite VC dimension).

My counter argument was that the ability to do capacity control was somewhat secondary to the ability to compute highly complex function with a limited amount of computation. Performing image recognition with invariance to shifts, scale, rotation, lighting conditions, and background clutter was impossible (or extremely inefficient) for a kernel machine operating at the pixel level. But it was quite easy for deep architectures such as convolutional nets.

GP: 4. Congratulations on your recent appointment as the head of Facebook new AI Lab. What can you tell us about AI and Machine Learning advances we can expect from Facebook in the next couple of years?

Facebook Yann LeCun: Thank you! it's a very exciting opportunity. Basically, Facebook's main objective is to enable communication between people. But people today are bombarded with information from friends, news organizations, websites, etc. Facebook helps people sift through this mass of information. But that requires to know what people are interested in, what motivates them, what entertains them, what makes them learn new things. This requires an understanding of people that only AI can provide. Progress in AI will allow us to understand content, such as text, images, video, speech, audio, music, among other things.

Here is part 2 of the interview.

Bio: Yann LeCun is Director of AI Research at Facebook, and the founding director of the Center for Data Science at New York University. He is Silver Professor of Computer Science, Neural Science, and Electrical Engineering and NYU, affiliated with the Courant Institute of Mathematical Science, the Center for Neural Science, and the ECE Department.

He received the Electrical Engineer Diploma from Ecole Supérieure d'Ingénieurs en Electrotechnique et Electronique (ESIEE), Paris in 1983, and a PhD in Computer Science from Université Pierre et Marie Curie (Paris) in 1987. After a postdoc at the University of Toronto, he joined AT&T Bell Laboratories in Holmdel, NJ in 1988. He became head of the Image Processing Research Department at AT&T Labs-Research in 1996, and joined NYU as a professor in 2003, after a brief period as a Fellow of the NEC Research Institute in Princeton. He was named Director of AI Research at Facebook in late 2013 and retains a part-time position on the NYU faculty.

His current interests include machine learning, computer perception, mobile robotics, and computational neuroscience. He has published over 180 technical papers and book chapters on these topics as well as on neural networks, handwriting recognition, image processing and compression, and on dedicated circuits and architectures for computer perception. The character recognition technology he developed at Bell Labs is used by several banks around the world to read checks and was reading between 10 and 20% of all the checks in the US in the early 2000s. His image compression technology, called DjVu, is used by hundreds of web sites and publishers and millions of users to access scanned documents on the Web. A pattern recognition method he developed, called convolutional network, is the basis of products and services deployed by companies such as AT&T, Google, Microsoft, NEC, IBM, Baidu, and Facebook for document recognition, human-computer interaction, image tagging, speech recognition, and video analytics.

LeCun has been on the editorial board of IJCV, IEEE PAMI, and IEEE Trans. Neural Networks, was program chair of CVPR'06, and is chair of ICLR 2013 and 2014. He is on the science advisory board of Institute for Pure and Applied Mathematics, and has advised many large and small companies about machine learning technology, including several startups he co-founded. He is the recipient of the 2014 IEEE Neural Network Pioneer Award.