An Inside Update on Natural Language Processing

This article is an interview with computational linguist Jason Baldridge. It's a good read for data scientists, researchers, software developers, and professionals working in media, consumer insights, and market intelligence. It's for anyone who's interested in, or needs to know about, natural language processing (NLP).

Computational linguist Jason Baldridge.

Jason and NLP go way back. As a linguistics graduate student at the University of Edinburgh, in 2000, Jason co-created the OpenNLP text-processing framework, now part of Apache. He joined the University of Texas linguistics faculty in 2005 and, a few years back, helped build a text analytics system for social-media agency Converseon. Jason's Austin start-up, People Pattern, applies NLP and machine learning for social-audience insights; he co-founded the company in 2013 and serves as chief scientist. Finally, he'll keynote on "Personality and the Science of Sharing" and teach a tutorial at the 2016 Sentiment Analysis Symposium.

In sum, Jason is an all-around cool guy, and he deserves special recognition for providing the most thorough Q&A responses I have ever received to an interview request. The interview? This one, covering AI, neural networks, computational linguistics, Java vs. Scala, and accuracy evaluation, with a detour into Portuguese-English translation challenges. That is:

An Inside Update on Natural Language Processing

Seth Grimes> Let's jump in the deep end. What's the state of NLP, of natural language processing?

Jason Baldridge> There's work to be done.

The first thing to keep in mind is that many of the most interesting NLP tasks are AI-complete. That means we are likely to need representations and architectures that recognize, capture, and learn knowledge about people and the world in order to exhibit human-level competence in these tasks. Do we need to represent word senses, predicate-argument relations, discourse models, etc.? Almost certainly. An optimistic deep learning person might say "the network will learn all that," but I'm skeptical that a generic model structure will learn all these things from the data that is available to it.

Seth> So you're a neural-network skeptic.

Jason> No, they are a great set of tools and techniques that are providing large improvements for many tasks. But they aren't magic and they won't suddenly solve every problem we throw at them, out of the box. When it comes to language, the only competent device we know of for processing human language fully -- the human brain -- is the result of hundreds of millions of years of evolution. That process has afforded it a complex architecture that dwarfs the relatively puny networks that are used for language and vision tasks today.

Humans learn language from a surprisingly small amount of data, and they go through different phases in that process, moving from memorization to generalization (including overgeneralization, e.g., "Mommy goed to the store"). Having said that, I love the boldness and confidence of the neural optimists, but I think we will need to figure out the architectures and the reward mechanisms by which a very deep network processes, represents, stores, and generalizes information and how it relates to language. That will imply choices about how lexicons are stored, how morphological and syntactic regularities are captured, and so on.

Is there academic computational-linguistics work that you'd call out as interesting, surfaced in NLP software tools or not?

Two items: vectorization and reinforcement learning.

The vectorization of words and phrases is one of the big overall trends these days, with the use of those vectors as the inputs for NLP tasks. The good part is that vectors are learned on large, unlabeled corpora. This injects knowledge into supervised learning tasks that have much less data.

For example, "pope," "catholic," and "vatican" will have similar vectors, so training examples that have just one of these words will still contribute toward better learning of shared parameters. Without this, a classifier based on bags-of-words sees these words as being as separate as "apple," "hieroglyph," and "bucket." So the use of word vectors, and the fact that they can be learned with respect to a particular problem when using neural networks, has led to a standardization of sorts, of an important strategy for dealing with language inputs as continuous elements rather than as discrete, atomic symbols. The success of using convolutional neural networks for tasks like sentiment analysis concretely demonstrates the effectiveness of this strategy.
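The contrast between discrete symbols and continuous vectors can be made concrete with a minimal Python sketch. It rests on loud assumptions: the four-dimensional `toy_vectors` are hand-picked values invented purely for illustration, not embeddings learned from any corpus, and the helper names `cosine` and `one_hot` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["pope", "catholic", "vatican", "apple", "hieroglyph", "bucket"]

def one_hot(word):
    """Bag-of-words style representation: one dimension per vocabulary word."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Hypothetical dense vectors, hand-picked for illustration only.
# Real word vectors would be learned from a large unlabeled corpus.
toy_vectors = {
    "pope":       [0.90, 0.10, 0.00, 0.10],
    "catholic":   [0.80, 0.00, 0.20, 0.10],
    "vatican":    [0.85, 0.05, 0.10, 0.00],
    "apple":      [0.00, 1.00, 0.00, 0.10],
    "hieroglyph": [0.10, 0.00, 1.00, 0.00],
    "bucket":     [0.00, 0.10, 0.00, 1.00],
}

# As atomic symbols, any two distinct words are orthogonal.
symbolic = cosine(one_hot("pope"), one_hot("vatican"))

# As dense vectors, related words sit close together and unrelated ones do not.
related = cosine(toy_vectors["pope"], toy_vectors["vatican"])
unrelated = cosine(toy_vectors["pope"], toy_vectors["apple"])

print(symbolic, related, unrelated)
```

With one-hot inputs, "pope" and "vatican" share nothing, so a training example containing one says nothing about the other. With dense inputs, a parameter that responds to the first dimension is exercised by all three religion-related words at once.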

Second, reinforcement learning has been used by researchers on dialogue systems, and it's now the new (yet old) rage in machine learning. Witness, for example, the success of deep learning plus reinforcement learning in AlphaGo. I think it will be very interesting to see how reinforcement learning plus deep learning can be used to tackle many language tasks while reducing the number of modules and the amount of supervision needed to achieve strong performance.

With these developments plus ever faster computation and improved physical sensors, these are fascinating and exciting times for natural language processing and for artificial intelligence more generally.

Any other thoughts to share on neural networks and AI?

The worry for someone like me is that you then get people who think deep learning method X or Y is all you need to handle NLP problems, and all these interesting and possibly useful facts about language start getting swept away. Having said that, there are great papers coming out that take a neural network approach while using or building interesting representations for language. One example is Ballesteros, Dyer, and Smith's (2015) parser, which uses character-level word representations in combination with an LSTM that controls the actions of a shift-reduce dependency parser. Dyer, Kuncoro, Ballesteros, and Smith (2016) introduce Recurrent Neural Network Grammars and define parsing models for them that outperform all previous single-model parsers, and they do it with far less feature engineering and tweaking. The DeepMind group are working on memory-based recurrent networks with stacks and queues (Grefenstette et al., 2015). This is important for working toward general neural architectures for approaching language tasks that require stacked and/or arbitrarily long dependencies (e.g., parsing).
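For readers unfamiliar with shift-reduce dependency parsing, the following Python sketch may help show what "controlling the actions" means. It is a generic arc-standard-style transition system written for illustration, and the transition inventory of the cited parsers may differ. The function name `parse`, the action labels, and the example sentence are all assumptions, and the action sequence is hard-coded here where a neural parser would have its network choose each action.

```python
def parse(words, actions):
    """Apply a sequence of shift-reduce actions to a sentence.

    Returns the final stack and a set of (head_index, dependent_index) arcs.
    """
    stack = []
    buffer = list(range(len(words)))  # token indices, left to right
    arcs = set()
    for action in actions:
        if action == "SHIFT":
            # Move the next unread token onto the stack.
            stack.append(buffer.pop(0))
        elif action == "LEFT_ARC":
            # The second-from-top token becomes a dependent of the top token.
            dependent = stack.pop(-2)
            arcs.add((stack[-1], dependent))
        elif action == "RIGHT_ARC":
            # The top token becomes a dependent of the second-from-top token.
            dependent = stack.pop()
            arcs.add((stack[-1], dependent))
        else:
            raise ValueError(f"unknown action: {action}")
    return stack, arcs

# Hypothetical example sentence and a hand-written action sequence.
words = ["The", "girl", "went", "home"]
actions = ["SHIFT", "SHIFT", "LEFT_ARC",   # girl -> The
           "SHIFT", "LEFT_ARC",            # went -> girl
           "SHIFT", "RIGHT_ARC"]           # went -> home

remaining, arcs = parse(words, actions)
for head, dependent in sorted(arcs):
    print(f"{words[head]} -> {words[dependent]}")
```

The stack is what lets the parser postpone a decision about a word until its dependents have been attached, which is why stack-like memory matters for architectures that aim at long dependencies.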

What about to-be-done challenges in computational linguistics?

I think, and hope, that we'll see better computational analysis of discourse. Many models assume that the sentence is the core unit of analysis, but that's of course a massive simplification. Consider machine translation. We've seen tremendous progress in the past twenty years. If you translate news articles from Portuguese to English, the results are really good. However, many things break when you move to narrative texts. For example, Portuguese is a pro-drop language that allows you to say things like "went to the beach" and leave off the subject. Consider a sequence of sentences like "A garota estava feliz. Foi à praia. Falou com seus amigos." This sequence should translate as "The girl was happy. [She] went to the beach. [She] talked with her friends." However, Google Translate returns "The girl was happy. Went to the beach. He talked with his friends." Part of the reason for this is that there is a bias toward male pronouns in the English-language text that the models were trained on, so when the pronoun "ela" is missing from the sentence "falou com seus amigos," the system erroneously fills the subject position in the English sentence with "he." So, we'll likely see models coming soon that will work on both coreference and discourse structure to better translate texts like this. I actually worked on discourse structure and coreference a decade ago. We made progress, but our models were fragile and hard to integrate with each other. I'm hoping that some of the newer neural architectures can help with that.