Deep Learning, Language Understanding, and the Quest for Human Capacity Cognitive Computing

To develop cognitive computing at human capacity understanding, deep learning research must heed what certain aspects of human symbol processing reveal about the architecture of the human mind.

By Patrick Ehlen, Loop AI Labs.

The world of deep learning research now seems to be divided into two major areas of pursuit: visual scene understanding and natural language understanding. The latter can be divided into subtasks that include speech recognition and various forms of semantic interpretation, such as translation. If you survey the papers at the major deep learning conferences (e.g., NIPS, ICLR, ICML), you’ll find many more papers focusing on methods for vision processing, and only a fraction on language processing. Of course, there are conferences dedicated to language processing tasks that offer many of the latter. Still, vision processing seems to get a lot more attention in the deep learning community at large. Why is this so?


One answer is that maybe language processing tasks are not as interesting or as challenging as vision tasks. Or maybe language understanding is a problem that has mostly been solved. After all, can’t we now talk to our phones and cars and ask them to skip to the next song, or to tell us how far away the moon is?

Strangely, in nature we find a similar imbalance of focus. We need only look around at the solutions of natural selection to problems of species survival to see that vision has been widely adopted as a universal and highly variable solution, while language—or at least what we tend to think of as language—only appears in a single and rather odd species, informally known as humans.

I say ‘odd’ because language isn’t the only aspect that distinguishes humans from other species. It is one piece in a complex of unique capabilities related to symbolic reasoning. Even more puzzling, the evidence strongly suggests these capabilities did not evolve slowly over millions of years in concert with the physiological changes that shaped modern humans. Instead, they burst onto the planet with a mushrooming of symbolic activity among Cro-Magnon populations during the last Ice Age, only some 35,000 years ago, evincing a set of unique symbol-processing-related behaviors that Paleolithic archaeologist Alexander Marshack collectively dubbed “The Human Capacity.”

Sure, various forms of communication are spread throughout the animal kingdom, ranging from simple spatial representations exchanged by the diminutive nervous systems of ants and bees to the more complex and mysterious melodic productions of birds and whales. But humans somehow managed to devise something quite different. Marshack singles out the “variable use of a generic symbol over time, in a range of contexts” as the prime indicator of the human capacity. In fact, we can flatten this idea into three cornerstone abilities that define the base of that unique faculty.

In the first corner we note the exclusively human ability to communicate flexibly in multiple modalities, and to interchange symbols at will among those modalities. Not only are we able to port multivalent symbols across a range of external contexts, using the same token in vastly different circumstances of meaning, but we can also use them cross-modally, such that a visual symbol (like a written word or a gesture) can be substituted for a spoken one, or vice-versa, or the two can be entwined into a multimodal stream of complementary signals.

Humans alone are so good at multimodal communication that a person who has completely lost the ability to communicate through our primary communicative channel (speech and hearing) can still use a different channel to communicate the same range of utterances that could be communicated by anyone who employs full use of that primary channel. No other species can perform this multimodal juggling of symbols, and it exposes an underlying unified representation of the world (a “cognitive map”) in which signals and symbols that arise from different modalities, including symbolic abstractions of language, are all encoded in the same representational space. The work of psychologist David McNeill and his legacy of students at the University of Chicago is particularly enlightening on the richness of human fluency in this regard.

The second cornerstone of the human capacity is our ability to manipulate and transform symbols via a process of recursion. Humans appear to be the only species that embed our communicative symbols into functions and then feed the products of those functions back into themselves, allowing us to store, manipulate and communicate highly complex ideas that can span long sequences and employ sophisticated, hierarchical syntactic structures. Even small children show a facility with recursion that cannot be found in the communicative systems of other species, as demonstrated in popular nursery rhymes, like this classic:

This is the rat that ate the malt that lay in the house that Jack built
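
For readers who think in code, the structure of that rhyme can be sketched as a function that calls itself, embedding each clause inside the one before it. This is only an illustration of recursion over symbols, not a model of how the mind does it; the function name `nest` and the way the rhyme is split into clauses are my own assumptions.

```python
def nest(clauses):
    """Recursively embed each clause inside the one before it."""
    head, *rest = clauses
    if not rest:
        # Base case: the innermost clause stands on its own.
        return head
    # Recursive case: the product of the function is fed back into itself.
    return f"{head} that {nest(rest)}"


# Hypothetical clause split, chosen for illustration only.
rhyme = ["the rat", "ate the malt", "lay in the house", "Jack built"]
sentence = "This is " + nest(rhyme)
print(sentence)
```

Adding another clause to the front of the list yields a longer sentence without changing the function at all.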

Some argue this facility for symbol recursion is in fact the only uniquely human trait, Noam Chomsky the most notable among them. He also frequently touts the third keystone of the human capacity, which arises as a natural result of the second (recursion): our faculty for discrete infinity, or “the infinite use of finite means.” While the number of elementary symbols and syntactic categories humans have to work with is quite limited, we are nevertheless capable of conceiving and understanding an infinite number of possible productions, thanks to our facility with recursion. A good example can be found again by extending the aforementioned nursery rhyme:

This is the farmer sowing the corn, that kept the cock that crowed in the morn, that waked the priest all shaven and shorn, that married the man all tattered and torn, that kissed the maiden all forlorn, that milked the cow with the crumpled horn, that tossed the dog, that worried the cat, that killed the rat, that ate the malt that lay in the house that Jack built….
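
A rough way to see “the infinite use of finite means” is to count how many distinct sentences a fixed handful of clauses can produce as the allowed depth of embedding grows. The sketch below assumes a toy vocabulary of three clauses taken from the rhyme and a hypothetical sentence frame (“This is the thing that ...”); the productions are grammatical but not always sensible, and nothing here is meant as a linguistic model.

```python
from itertools import product

# Finite means: a fixed, tiny inventory of clauses (illustrative choice).
CLAUSES = ("killed the rat", "worried the cat", "tossed the dog")
TAIL = "ate the malt that lay in the house that Jack built"


def productions(depth):
    """Yield every sentence that stacks exactly `depth` clauses before the tail."""
    for combo in product(CLAUSES, repeat=depth):
        yield "This is the thing that " + " that ".join(combo + (TAIL,))


# The count of distinct sentences triples with every extra level of embedding.
counts = [sum(1 for _ in productions(depth)) for depth in range(4)]
print(counts)  # [1, 3, 9, 27]
```

Because the depth has no upper bound, neither does the number of possible sentences, even though the inventory of clauses never changes.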

These three stones—multimodal communication, recursion, and discrete infinity—form the triangular base of the human capacity, unseen in other species and, moreover, as far as evidence shows, unprecedented in the long history of life on the planet until a short while ago, by evolutionary time scales. Without these pieces, language and human society as we know it would not exist. Thus, the human capacity reveals a unique solution of nature to a mysterious problem that either did not appear or did not succeed as an adaptive trait in any other species until now.

So let’s get back to our original question: Why do we see more effort on vision processing than language understanding in the deep learning community? Perhaps the relative scarcity of language solutions in literature mirrors the scarcity of that solution in nature for roughly the same reason: While language is important to us as a species and as a society, the means and purpose of its mechanism as a force in nature is not immediately clear, and this obscurity of purpose hinders our ability to see clearly what problem is actually addressed by human capacity language understanding, and how that problem gets handled by our cognitive system. In other words, the problem is not well formulated, and we cannot simply say, “Let’s do language understanding!” and make much progress because the human capacity is itself a solution to a different problem—albeit one we haven’t figured out yet.

But the aforementioned cornerstones of the human capacity give us some clues about where we should focus our efforts. They may even form the basis of a proto-manifesto that describes the desiderata for a “human capacity cognitive computing” effort (should we decide such a thing would be necessary or useful).

First, to accomplish a faculty of multimodal communication similar to that of humans, it seems clear that signals and symbols from different signal source streams should be encoded into a single representational space. The past year has already seen a flurry of work on one such effort, with multimodal systems that embed the derived semantic features of both images and phrases into a common semantic tensor. This design choice created systems that use the resulting association between visual and linguistic semantics to perform linguistic search on images that was not possible before.
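
To make the idea of a single representational space concrete, here is a minimal sketch, assuming that image features and phrase features are each mapped by a linear projection into one shared vector space and then compared by cosine similarity. The projection matrices, feature vectors, and file names are all hypothetical and hand-set; a real system would learn the projections from paired images and captions, and this is not a description of any particular published model.

```python
import math


def project(matrix, vector):
    """Apply a linear map: multiply a (rows x cols) matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cosine(a, b):
    """Cosine similarity between two vectors in the shared space."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical hand-set projections (a real system would learn these).
IMAGE_TO_SHARED = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # 3-d image features -> 2-d
TEXT_TO_SHARED = [[0.0, 1.0], [1.0, 0.0]]             # 2-d text features  -> 2-d

# Made-up feature vectors standing in for the outputs of vision and text models.
images = {"dog.jpg": [0.9, 0.1, 0.0], "car.jpg": [0.1, 0.5, 0.4]}
queries = {"a dog": [0.0, 1.0], "a car": [1.0, 0.0]}


def search(query):
    """Return the image whose shared-space embedding is closest to the query's."""
    q = project(TEXT_TO_SHARED, queries[query])
    return max(images, key=lambda name: cosine(q, project(IMAGE_TO_SHARED, images[name])))


print(search("a dog"), search("a car"))
```

Once both modalities live in the same space, searching images with a phrase reduces to a nearest-neighbor lookup.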

Second, the use of recursion should be central to architecture designs. Recurrent neural networks already allow a spatial representation of time that is useful in many contexts, but current implementations do not go far enough to accomplish the level of recursive power we witness in human capacity processing. What’s more, the building blocks of that recursive architecture should be portable and embeddable in such a way as to facilitate discrete infinity, in the hope that we can one day break machine understanding from its current confines of ability.
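
The recurrence at the heart of a recurrent network can be sketched in a few lines: the state produced at one time step is fed back in as an input to the next. The example below is a scalar toy with fixed, hand-picked weights (`w_h` and `w_x` are assumptions, not trained values), meant only to show how the order of a sequence leaves its trace in the state.

```python
import math


def rnn_step(h_prev, x, w_h, w_x):
    """One recurrent update on a scalar state: h_t = tanh(w_h * h_prev + w_x * x)."""
    return math.tanh(w_h * h_prev + w_x * x)


def run(sequence, w_h=0.5, w_x=1.0):
    """Feed a sequence through the recurrence and collect the state at each step."""
    h = 0.0
    states = []
    for x in sequence:
        h = rnn_step(h, x, w_h, w_x)  # the previous state is fed back in
        states.append(h)
    return states


early = run([1.0, 0.0, 0.0])  # signal arrives first, then fades
late = run([0.0, 0.0, 1.0])   # same signal arrives last
print(early[-1], late[-1])
```

The two sequences contain the same inputs, yet their final states differ because the state carries a decaying record of when each input arrived. Note that this loop repeats one fixed update along a chain; it does not by itself build the nested, hierarchical structures discussed earlier.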

With these desiderata in hand, we can make steps toward a cognitive computing capability that aspires to the level of human capacity. And the wonderful thing about deep learning research is that it gives us a wide playing field in which to experiment with architectures—a pursuit where we’ve only begun to scratch the surface of what is possible.

Bio: Patrick Ehlen is Chief Scientist at Loop AI Labs, and has pursued the dream of artificial intelligence since reading Arthur C. Clarke’s 2001: A Space Odyssey in third grade. He received a PhD in Cognitive Psychology at the New School for Social Research, followed by a post-doc in the Computational Semantics Lab at Stanford University’s Center for the Study of Language and Information. He also worked as a research scientist on multimodal and natural language technologies at AT&T Labs and the AT&T Interactive R&D Group.