An Inside View of Language Technologies at Google
Learn about language technologies at Google, including projects, technologies, and philosophy, from an interview with a Googler.
Do you get involved in productizing research innovations? Is there a typical path from research into products at Google?
Yes, we are responsible for bringing to production all the technologies that we develop. If research and production are handled separately, there are at least two common causes of failure.
If the research team is not close to production needs, its evaluations and datasets may not be fully representative of what the product actually requires. This is particularly problematic when a research team works on a product that is being constantly improved. Unless the team works directly on the product itself, the settings it is working under are likely to become obsolete quickly, and positive results will not translate into product improvements.
At the same time, if the people bringing research innovations to product are not the researchers themselves, they will likely not know enough about the new technologies to make the right decisions -- for example, when product needs require trading off some accuracy to reduce computation cost.
Your LT-Accelerate presentation, Language Technologies at Google, could occupy both conference days all by itself. But you're planning to focus on information extraction and a couple of other topics. You have written that information extraction has proved to be very hard. You cite challenges that include entity resolution and consistency problems of knowledge bases. Actually, first, what are definitions of "entity resolution" and "knowledge base"?
We call "entity resolution" the problem of finding, for a given mention of a topic in a text, the entry in the knowledge base that represents that topic. For example, if your knowledge base is Wikipedia, one may refer to this entry in English text as "Barack Obama", "Barack", "Obama", "the president of the US", etc. At the same time, "Obama" may refer to any other person with the same surname, so there is an ambiguity problem. In the literature, this problem also goes by other names, such as entity linking or entity disambiguation. Two years ago, some colleagues at Google released a large set of entity resolution annotations over a large web corpus -- 11 billion references to Freebase topics -- which has already been exploited by researchers worldwide working on information extraction.
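To make the ambiguity problem concrete, here is a minimal sketch of the lookup-and-disambiguate pattern most entity linkers follow: collect candidate entries from an alias table, then pick one. The alias table, counts, and entry names below are invented for illustration and are not Google's actual system.

```python
# Minimal entity-resolution sketch: map a textual mention to a knowledge-base
# entry via an alias table, then break ties with a popularity prior.
# The alias table and counts are invented for illustration only.
from typing import Optional

ALIAS_TABLE = {
    "barack obama": {"Barack_Obama": 9500},
    "obama": {"Barack_Obama": 8200, "Obama,_Fukui": 40},   # the president vs. the Japanese city
    "the president of the us": {"Barack_Obama": 300},
}

def resolve(mention: str) -> Optional[str]:
    """Return the most likely knowledge-base entry for a mention, or None if unknown."""
    candidates = ALIAS_TABLE.get(mention.lower())
    if not candidates:
        return None  # the "NIL" case in the entity-linking literature
    # Pick the candidate with the highest prior count; a real linker would also
    # score the surrounding context and enforce document-level coherence.
    return max(candidates, key=candidates.get)

print(resolve("Obama"))                    # Barack_Obama
print(resolve("the president of the US")) # Barack_Obama
```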
When we talk about knowledge bases, we refer to structured information about the real world (or imaginary worlds) on which one can ground language analysis of texts, amongst many other applications. These typically contain topics (concepts and entities), attributes, relations, type hierarchies, inference rules… There have been decades of work on knowledge representation and on manual and automatic acquisition of knowledge, but these are far from solved problems.
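As a rough, hypothetical picture of what that structured information can look like, the sketch below holds topics with attributes, relations, and a small type hierarchy in memory. The schema and all values are made up; real knowledge bases such as Freebase use far richer representations.

```python
from dataclasses import dataclass, field

# A toy in-memory knowledge base: topics with attributes, typed relations,
# and a small type hierarchy. All names and values are illustrative only.

@dataclass
class Topic:
    topic_id: str
    types: list = field(default_factory=list)        # e.g. ["politician"]
    attributes: dict = field(default_factory=dict)   # e.g. {"date_of_birth": "1961-08-04"}
    relations: list = field(default_factory=list)    # (predicate, target_topic_id) pairs

TYPE_HIERARCHY = {"politician": "person", "person": "agent"}  # child -> parent

kb = {
    "Barack_Obama": Topic(
        topic_id="Barack_Obama",
        types=["politician"],
        attributes={"date_of_birth": "1961-08-04"},
        relations=[("spouse", "Michelle_Obama"), ("member_of", "Democratic_Party")],
    ),
}

def has_type(topic: Topic, wanted: str) -> bool:
    """Check a type, walking up the hierarchy (a crude form of inference)."""
    for t in topic.types:
        while t is not None:
            if t == wanted:
                return True
            t = TYPE_HIERARCHY.get(t)
    return False

print(has_type(kb["Barack_Obama"], "person"))  # True, inferred via politician -> person
```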
So ambiguity, name matching, and pronouns and other anaphora -- all sorts of coreference -- are part of the challenge. Overall, what's the entity-resolution state of the art?
Coreference is indeed a related problem and I think it should be solved jointly with entity resolution.
Depending on the knowledge base and test set used, results vary, but mention-level annotation currently has an accuracy between 80% and 90%. Most of the knowledge bases, such as Wikipedia and Freebase, have been constructed in large part manually, without a concrete application in mind, and issues commonly turn up when one tries to use them for entity disambiguation.
Where do the knowledge-base consistency issues arise? In representation differences, incompatible definitions, capture of temporality, or simply facts that disagree? (It seems to me that human knowledge, in the wild, is inconsistent for all these reasons and more.) And how do inconsistencies affect Google's performance, from the user's point of view?
Different degrees of coverage of topics, and different levels of detail in different domains, are common problems. Depending on the application, one may want to tune the resolution system to be more biased toward resolving mentions as head entities or as tail entities, and some entities may be artificially boosted simply because they sit in a denser, more detailed portion of the knowledge base's network. On top of this, schemas are designed to be ontologically correct, but exceptions are common; many knowledge bases have been constructed by merging datasets with different levels of granularity, giving rise to reconciliation problems; and Wikipedia contains many "orphan nodes" that are not explicitly linked to other topics even though they are clearly related to them.
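One way to picture the head/tail trade-off he describes: many linkers combine an entity-popularity prior with a context score, and weighting the prior up or down shifts resolutions toward head or tail entities. The scoring function and numbers below are assumptions for illustration, not the actual system.

```python
# Illustrative only: a tunable exponent on the entity prior shifts a linker
# between head entities (popular, densely connected) and tail entities.

def score(prior: float, context_similarity: float, alpha: float) -> float:
    """Combine a popularity prior and a context score; alpha controls the prior's weight."""
    return (prior ** alpha) * context_similarity

# Hypothetical candidates for the mention "Obama":
head = {"prior": 0.95, "context_similarity": 0.40}   # Barack Obama
tail = {"prior": 0.05, "context_similarity": 0.90}   # Obama, Fukui (the Japanese city)

for alpha in (1.0, 0.5, 0.1):
    s_head = score(head["prior"], head["context_similarity"], alpha)
    s_tail = score(tail["prior"], tail["context_similarity"], alpha)
    winner = "head" if s_head > s_tail else "tail"
    print(f"alpha={alpha}: head={s_head:.3f} tail={s_tail:.3f} -> {winner}")
```

With a full-strength prior the head entity always wins; as alpha shrinks, context evidence can pull strongly supported tail entities back into contention.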
Is "curation" part of the answer -- along the lines of the approaches applied for IBM Watson and Wolfram Alpha, for instance -- or can the challenges be met algorithmically? Who's doing interesting work on these topics, outside Google, in academia and industry?
There is no doubt that manual curation is part of the answer. At the same time, if we want to cover the very long tail of facts, it would be impractical to try to enter all that information manually and to keep it permanently up to date. Automatically reconciling existing structured sources, like product databases, books, sports results, etc., is part of the solution as well. I believe it will eventually be possible to apply information extraction techniques over structured and unstructured sources, but that is not without challenges. I mentioned before that the accuracy of entity resolution systems is between 80% and 90%. That means that for any set of automatically extracted facts, at least 10% of them are going to be associated with the wrong entity -- an error that will accumulate on top of any errors from the fact extraction models. Aggregation can be helpful in reducing the error rate, but will not be so useful for the long tail.
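A back-of-the-envelope sketch of that compounding, under the simplifying assumptions that the probabilities below are representative and that errors are independent: a single extracted fact is only right if both the entity resolution and the fact extraction are right, and aggregating over repeated mentions helps mostly when there are many mentions, which is rarely the case in the long tail.

```python
# Hypothetical error compounding for automatically extracted facts.
# All probabilities are assumptions; errors are treated as independent.
from math import comb

p_entity = 0.90   # entity resolution correct ~80-90% of the time; take 90%
p_fact = 0.90     # assumed accuracy of the fact-extraction model itself

p_correct_fact = p_entity * p_fact
print(f"Single extraction correct: {p_correct_fact:.2f}")   # 0.81

def majority_correct(p: float, n: int) -> float:
    """Probability that a strict majority of n independent extractions is correct."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

# Head entities are mentioned many times, so aggregation helps; a fact seen
# only once (the long tail) gets no benefit.
for n in (1, 3, 9):
    print(f"n={n} mentions: {majority_correct(p_correct_fact, n):.3f}")
```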
On the bright side, the area is thriving -- it is enough to review the latest proceedings of ACL, EMNLP and related conferences to realize how fast progress is being made. Semantic parsing of queries to answer factoid questions over Freebase, integrating deep learning models into knowledge-base representation and reasoning tasks, and better combinations of global and local models for entity resolution are all problems in which important breakthroughs have happened in the last couple of years.
Finally, what's new and exciting on the NLP horizon?
On the one hand, the industry as a whole is quickly innovating in the personal assistant space: a tool that can interact with people through natural dialogue, understand their world, their interests, and their needs, answer their questions, help them plan and remember tasks, and control their appliances to make their lives more comfortable. There are still many improvements in NLP and other areas that need to happen to make this long-term vision a reality, but we are already starting to see how it can change our lives.
On the other hand, the relationship between language and embodiment will see further progress as robotics develops, and we will be able to ground our language analyses not just in virtual knowledge bases but in physical experience.
Thanks Enrique!
Google Research's Enrique Alfonseca will be speaking at the LT-Accelerate conference, 23-24 November in Brussels. The program features brand, agency, researcher and solution provider speakers on the application of language technologies - in particular, text, sentiment and social analytics - to a range of business and governmental challenges. Join us there!
Bio: Seth Grimes is an analytics strategy consultant with Washington, DC-based Alta Plana Corporation. He is founding chair of the Text Analytics Summit (2005-13), the Sentiment Analysis Symposium (next July 15-16, 2015 in New York), and the LT-Accelerate conference (November 25-26, 2015 in Brussels). He is the leading industry analyst covering text analytics and sentiment analysis.