Inside Deep Learning: Computer Vision With Convolutional Neural Networks

Deep Learning-powered image recognition is now performing better than human vision on many tasks. We examine how human and computer vision extract features from raw pixels, and explain why deep convolutional neural networks work so well.

By Nikhil Buduma (MIT).

The human sense of vision is unbelievably advanced. Within fractions of a second, we can identify objects within our field of view, without thought or hesitation. Not only can we name the objects we are looking at, we can also perceive their depth, perfectly distinguish their contours, and separate the objects from their backgrounds. Somehow our eyes take in raw pixels of color data, but our brain transforms that information into more meaningful primitives (lines, curves, and shapes) that might indicate, for example, that we’re looking at a can of Coke.

[Figure: Coke Can - Human Vision]

Programming machines to replicate human vision has huge implications, from Facebook’s facial recognition algorithms to Google’s self-driving cars to futuristic biomedical diagnostics. But it turns out this is a pretty difficult problem. Why? Because where we automatically see lines, contours, and objects, computers just see large matrices of numbers.

[Figure: Vision Grey Square]

To tackle this problem of learning more complex features out of raw pixel values, we’re going to use a special kind of neural network called a convolutional network. Convolutional neural networks, popularized by Yann LeCun and others in 1998 with LeNet, are behind many of the recently reported successes of Deep Learning in image and speech recognition, and are a very hot topic.

Convolutional neural networks use two major constructs as primitives: 1) filters (also called feature detectors) and 2) feature maps. As we will see later, we can express these primitives as specialized groups of neurons, which enables us to build neural networks to represent them.

[Figure: Computer Vision Filters]

Let’s say we have an image (shown on the left in the diagram above), and our goal is to detect horizontal and vertical edges. To do so, we can create what’s called a filter (drawn in green). A filter is essentially a small matrix that represents a feature that we want to find in the original image.
The filter on the top attempts to discover the parts of the original image with vertical lines, while the filter on the bottom tries to discover parts of the image with horizontal lines.
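
To make the idea of a filter concrete, here is a minimal sketch in Python with NumPy. It slides a small matrix over an image and records how strongly each patch matches it. The 5x5 test image, the filter values, and the helper name `apply_filter` are assumptions chosen for illustration; they are not taken from the diagram.

```python
import numpy as np

def apply_filter(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return
    the grid of responses, one value per patch position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Elementwise product, then sum: large when the patch looks like the kernel
            out[i, j] = np.sum(patch * kernel)
    return out

# Hypothetical grayscale image: a single bright vertical line in column 2
image = np.zeros((5, 5))
image[:, 2] = 1.0

# Illustrative filter values (assumed): the vertical filter rewards a bright
# center column; its transpose rewards a bright center row.
vertical_filter = np.array([[-1.0, 2.0, -1.0],
                            [-1.0, 2.0, -1.0],
                            [-1.0, 2.0, -1.0]])
horizontal_filter = vertical_filter.T

vertical_response = apply_filter(image, vertical_filter)
horizontal_response = apply_filter(image, horizontal_filter)

print(vertical_response)    # strongest where the filter is centered on the line
print(horizontal_response)  # all zeros: this image has no horizontal line
```

On this toy image the vertical filter responds most strongly where it sits directly over the line, while the horizontal filter produces no response at all. The grid of responses produced by one filter is the kind of output a convolutional network treats as a feature map.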