Efficient Graph-based Word Sense Induction

This paper describes a set of algorithms for Natural Language Processing (NLP) that match or exceed the state of the art on several evaluation tasks, while also being much more computationally efficient.

Efficient Graph-based Word Sense Induction by Distributional Inclusion Vector Embeddings

By UMass Amherst graduate students Haw-Shiuan Chang, Amol Agrawal, Ananya Ganesh, Anirudha Desai and Vinayak Mathur; Andrew McCallum, Professor and Director of the Center for Data Science at UMass Amherst; and Alfred Hough, Lead AI Researcher at Lexalytics

The paper was first presented at TextGraphs-2018, a workshop series at The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) on June 6, 2018 in New Orleans. This new approach to word-sense induction comes from the work of the Lexalytics Magic Machines AI Labs, launched in 2017 in partnership with the University of Massachusetts Amherst’s Center for Data Science and Northwestern University’s Medill School of Journalism, Media and Integrated Marketing Communications to drive innovation in AI.  

Word sense induction (WSI) is a challenging task of natural language processing whose goal is to categorize and identify multiple senses of polysemous words from raw text without the help of predefined sense inventory like WordNet (Miller, 1995). The problem is sometimes also called unsupervised word sense disambiguation (Agirre et al., 2006; Pelevina et al., 2016).

An effective WSI has wide applications. For example, we can compare different induced senses in different documents to detect novel senses over time (Lau et al., 2012; Mitra et al., 2014) or analyze sense difference in multiple corpora (Mathew et al., 2017). WSI could also be used to group and diversify the documents retrieved from search engine (Navigli and Crisafulli, 2010; Di Marco and Navigli, 2013). After identifying senses, we can train an embedding for each sense of a word. Li and Jurafsky (2015) demonstrate that this multi-prototype word embedding is useful in several downstream applications including part-of-speech (POS) tagging, relation extraction, and sentence relatedness tasks. Sumanth and Inkpen (2015) also show that word sense disambiguation could be successfully applied to sentiment analysis.

Since word sense induction (WSI) methods are unsupervised, the senses are typically derived from the results of different clustering techniques. Like most of the clustering problems, it is usually challenging to predetermine the number of clusters/senses each word should have. In fact, for many words, the “correct” number of senses is not unique. Setting the number of clusters differently can capture different resolutions of senses. For instance, race in the car context could share the same sense with the race in the game context because they all mean contest, but the race in the car context actually refers to the specific contest of speed. Therefore, they can also be separated into two different senses, depending on the level of granularity we would like to model.

For graph-based clustering methods, it is easy and natural to model the multiple resolutions of senses in a consistent way by hierarchical clustering and defer the difficult problem of choosing the number of clusters to the end. This makes it easier to incorporate other information, such as users’ resolution preference on each hierarchical sense tree. The flexibility is one of the reasons why graph-based methods are widely studied and applied to many downstream applications (Mitra et al., 2014; Mathew et al., 2017; Navigli and Crisafulli, 2010; Di Marco and Navigli, 2013).

Nevertheless, graph-based WSI methods usually require a substantial amount of computational resources. For example, Pelevina et al. (2016) build the graph by finding the nearest neighbors of the target word in the word embedding space (i.e., ego network). Thus, constructing ego networks for all the words takes at least O(|V | 2 ) time, where |V | is the size of the vocabulary, unless some approximation is made (e.g., approximate nearest neighbor search such as k-d tree).1 Next, if our goals include finding less common senses, the method needs to construct a large graph by including more nearest neighbors. For each target word, computing the pairwise distances between nodes in the large graph is also computationally intensive.

To overcome the limitations and make graph-based WSI more practical, we propose a novel WSI algorithm that first groups words into a set of basis indexes (i.e., a set of topics) efficiently and then, constructs the graph where each node corresponds to a basis index (i.e., a topic) instead of a word. The motivation behind the approach is that different senses of a word usually appear in different topics. For example, food and technology will be at least two distinct topics in most of the topic models, so we can find senses by clustering corresponding basis indexes safely when the target word is apple. If one word could have distinct senses in one topic, humans will constantly face difficult word sense disambiguation tasks while reading a document.

Although the main idea is simple, improving the efficiency significantly without sacrificing the quality is difficult. One of the challenges is that similarity between two basis indexes changes given different target words. For example, a country topic should be clustered together with a city topic if the target word is place. However, if the query word is bank, it makes more sense to group the country topic with the money topic into one sense so that the bank mention in Bank of America will belong to the sense. This means we want to focus on the geographical meaning of country when the target word is more about geography, while focus on the economic meaning of country when the target word is more about economics.

In order to tackle the issue, we adopt a recently proposed approach called distributional inclusion vector embedding (DIVE) (Chang et al., 2018). DIVE compresses the sparse bag-of-words while preserving the co-occurrence frequency order, so DIVE is able to model not only the possibility of observing one target word in a topic as typical topic models but also the possibility of observing one topic of a sentence containing a target word mention. This allows us to efficiently identify the topics relevant to each target word, and only focus on an aspect of each of these topics composed of the words relevant to both the topic and the target word.

Experiments show that our method performs similarly compared with Pelevina et al. (2016), a state-of-the-art graph-based WSI method, without the need of expensive nearest neighbor search. Our method is even better for the words without a dominating sense.

Figure 1 below illustrates the flowchart of the proposed method. The blue boxes are processing steps, the orange boxes are input and output data of each step, and the gray areas indicate the sections describing the included steps.


Umass Figure1
Figure 2 below is a visualization of finding the senses of the word core. (a) The DIVE w[bj ] of the word core (only top 15 basis indexes are shown). The words in each row of the table are sorted by its embedding value in the basis index. (b) Weighted DIVE of six words on these 15 basis indexes relevant to core as the features for measuring the similarity between basis indexes. The two red boxes indicate two final clusters we discovered at the end within which the feature words tend to have similar embedding values. (c) The ego network constructed for the word core. Each blue box and the corresponding circle represent a basis index or topic (only 8 out of 15 basis indexes are plotted and their index numbers bj are shown in the blue boxes), which is a node in the network. Two basis indexes are more similar if more relevant words (i.e., close to core) occur frequently in both corresponding topics. For example, the topic 7 and 8 are more similar because of the frequent appearance of the relevant words such as computer in both topics. The larger similarity is represented by a thicker red line. The ego network is a complete graph but only a subset of edges are plotted in the figure. (d) The final clustering results when the number of clusters is set to be 4.


Umass Figure2 Embedding

For the complete paper with further explanation, please visit: arxiv.org/pdf/1804.03257.pdf