Math of Ideas: A Word is Worth a Thousand Vectors
Word vectors give us a simple and flexible platform for understanding text. The diverse examples below should help build your confidence in developing and deploying NLP systems, and in judging what problems they can solve.
While word vectorization is an elegant way to solve many practical text processing problems, it does have a few shortcomings and considerations:
Word vectorization requires a lot of text. You can download pretrained word vectors yourself, but if you have a highly specialized vocabulary then you'll need to train your own word vectors and have a lot of example text. Typically this means hundreds of millions of words, which is the equivalent of 1,000 books, 500,000 comments, or 4,000,000 tweets.
Cleaning the text. You'll need to strip punctuation and normalize Unicode characters11, which can take significant manual effort. Fortunately, there are a few tools that can help, like FTFY, SpaCy, NLTK, and Stanford CoreNLP. SpaCy even comes with word vector support built in.
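As a rough illustration of the kind of normalization involved, here is a minimal cleaning pass using only the Python standard library. It is a toy sketch, not a substitute for the tools above; real pipelines handle far more edge cases.

```python
import string
import unicodedata

def clean_text(text):
    """Toy cleaning pass: fold Unicode compatibility characters,
    lowercase, strip ASCII punctuation, and tokenize on whitespace."""
    # NFKC folds compatibility forms, e.g. the ligature "fi" into "fi"
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    # Replace punctuation with spaces rather than deleting it,
    # so hyphenated or slash-joined words still split apart
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table).split()

print(clean_text("The ﬁnal GOAL!!"))  # the ligature folds into plain "fi"
```

Libraries like FTFY go much further, repairing mojibake and mixed encodings that a simple normalization pass can't touch.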
Memory & performance. The training of vectors requires a high-memory and high-performance multicore machine. Training can take several hours to several days but shouldn't need frequent retraining. If you use pretrained vectors, then this isn't an issue.
Databases. Modern SQL systems aren't well suited to the vector addition, subtraction, and multiplication that searching in vector space requires. There are a few libraries that will help you quickly find the most similar items12: annoy, ball trees, locality-sensitive hashing (LSH), and FLANN.
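To make the search problem concrete, here is a brute-force nearest-neighbor lookup by cosine similarity in plain Python, with hypothetical 2-d "word vectors" purely for illustration. Libraries like annoy or LSH exist precisely to replace this linear scan when the vocabulary has hundreds of thousands of entries.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

def most_similar(query, vectors, topn=2):
    """Brute-force O(vocabulary) scan; approximate-nearest-neighbor
    libraries trade a little accuracy for sub-linear query time."""
    scores = [(word, cosine(query, vec)) for word, vec in vectors.items()]
    return sorted(scores, key=lambda t: -t[1])[:topn]

# Hypothetical toy vectors -- real word vectors have hundreds of dimensions
vectors = {"king": [0.9, 0.1], "queen": [0.85, 0.2], "apple": [0.1, 0.95]}
print(most_similar([0.88, 0.15], vectors))
```

The quadratic blow-up of comparing every item to every other item is what makes approximate methods attractive at scale.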
False-positives & exactness. Despite the impressive results that come with word vectorization, no NLP technique is perfect. Take care that your system is robust to results that a computer deems relevant but an expert human wouldn't.
The goal of this post was to convince you that word vectors give us a simple and flexible platform for understanding text. We've covered a few diverse examples that should help build your confidence in developing and deploying NLP systems, and in judging what problems they can solve. While most coverage of word vectors has been from a scientific angle, or has demonstrated toy examples, we at Stitch Fix think this technology is ripe for industrial application.
In fact, Stitch Fix is the perfect testbed for these kinds of new technologies: with expert stylists in the loop, we can move rapidly on new and prototypical algorithms without worrying too much about edge and corner cases. The creative world of fashion is one of the few domains left that computers don't understand. If you're interested in helping us break down that wall, apply!

Further reading
There are a few miscellaneous topics that we didn't have room to cover or were too peripheral:
Translating word-by-word English into Spanish is equivalent to matrix rotations. This means that all of the basic linear algebra operators (addition, subtraction, dot products, and matrix rotations) have meaningful functions on human language.
Word vectors can also be used to find the odd word out.
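One common recipe for odd-word-out (gensim exposes it as `doesnt_match`) is to average the group's vectors and return the word least similar to that mean. Here is a stdlib sketch of that idea, with hypothetical toy vectors standing in for trained ones.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

def odd_one_out(words, vectors):
    """Average the group's vectors, then return the word whose
    vector is least similar (by cosine) to that mean."""
    vecs = [vectors[w] for w in words]
    mean = [sum(col) / len(vecs) for col in zip(*vecs)]
    return min(words, key=lambda w: cosine(vectors[w], mean))

# Hypothetical 2-d vectors: three meals cluster together, cereal doesn't
vectors = {"breakfast": [1.0, 0.1], "lunch": [0.9, 0.2],
           "dinner": [0.95, 0.15], "cereal": [0.1, 1.0]}
print(odd_one_out(["breakfast", "lunch", "dinner", "cereal"], vectors))
```

The outlier drags the mean only slightly, so the clustered words stay close to it while the odd word falls away.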
Interestingly, the same skip-gram algorithm can be applied to a social graph instead of sentence structure. The authors equate a sequence of social network graph visits (a random walk) to a sequence of words (a sentence in word2vec) to generate a dense summary vector.
A brief but very visual overview of distributed representations is available here.
Intriguingly, the word2vec algorithm can be reinterpreted as a matrix factorization method using point-wise mutual information. This theoretical breakthrough cleanly connects older and faster but more memory-intensive techniques with word2vec's streaming algorithm approach.
1 See also the original papers, and the subsequently bombastic media frenzy, the race to understand why word2vec works so well, some academic drama on GloVe vs word2vec, and a nice introduction to the algorithms behind word2vec from my friend Radim Řehůřek. ←
2 Although see Omer Levy and Yoav Goldberg's post for an interesting approach that has the word2vec context defined by parsing the sentence structure. Doing this introduces a more functional similarity between words (see this demo). For example, Hogwarts in word2vec is similar to dementors and dumbledore, as they're all from Harry Potter, while parsing context gives sunnydale and colinwood as they're similarly prestigious schools. ←
3 This is describing the ‘skip-gram' mode of word2vec, where the target word is asked to predict the surrounding context. Interestingly, we can also get similar results by doing the reverse: using the surrounding text to predict a word in the middle! This model, called continuous bag-of-words (CBOW), loses word order, and so we lose a bit of grammatical information, since that's very sensitive to the position of a word in a sentence. This means CBOW-trained word vectors tend to do worse in a syntactic sense: the resulting vectors more poorly encode whether a word is an adjective, a verb, or a noun. ←
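The contrast between the two modes is easiest to see in how they carve the same window into training examples. This toy sketch generates the (input, output) pairs each mode would train on; the actual word2vec implementation works on indices and negative samples, not strings.

```python
def training_pairs(tokens, window=2, mode="skipgram"):
    """Turn a token stream into (input, output) training examples.
    skip-gram: the target word predicts each context word separately.
    CBOW: the whole context window jointly predicts the target word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if mode == "skipgram":
            pairs.extend((target, c) for c in context)
        else:  # cbow: note the bag of context words is unordered
            pairs.append((tuple(context), target))
    return pairs

sent = ["the", "quick", "brown", "fox"]
print(training_pairs(sent, window=1, mode="skipgram"))
print(training_pairs(sent, window=1, mode="cbow"))
```

Because CBOW pools the context into a single bag, the order information that skip-gram's separate pairs partially preserve is averaged away, which is the intuition behind its weaker syntactic performance.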
4 More generally, a linear combination of axes encodes the properties. We can attempt to rotate into the correct basis by using PCA (as long as we only include a few nearby words) or visualize that space using t-SNE (although we lose the concept of a single axis encoding structure). ←
5 Compare word vectors to sentiment analysis, which effectively distills everything into one dimension of ‘happy or sad', or document labeling efforts like Latent Dirichlet Allocations that sort words into a few types. In either case, we can only ask these simpler models to categorize new documents into a few predetermined groups. With word vectors we can encapsulate far more diversity without having to build a labeled training text (and thus with less effort.) ←
This is using an advanced visualization technique called t-SNE. This allows us to project down to 2D while still trying to maintain the local structure. This helps surface the several word clusters that sit near the given word. ←
9 We've used cosine similarity to find the nearest items, but we could've chosen the 3COSMUL method. This combines vectors multiplicatively instead of additively and seems to get better results (pdf warning!). This stays truer to cosine distance and in general prevents one word from dominating in any one dimension. ←
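To make the additive-versus-multiplicative distinction concrete, here is a sketch of both analogy scorers over a tiny hypothetical vocabulary. Following Levy and Goldberg, the cosines are shifted from [-1, 1] into [0, 1] before the multiplicative combination; the vectors are hand-picked toys, not trained ones.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

def analogy(a, a_star, b, vectors, method="3cosadd", eps=0.001):
    """Solve the analogy a : a_star :: b : ?
    3CosAdd ranks candidates by cos(x, b) - cos(x, a) + cos(x, a_star).
    3CosMul ranks by cos(x, b) * cos(x, a_star) / (cos(x, a) + eps),
    with each cosine shifted into [0, 1] first."""
    def score(word):
        s = lambda y: (cosine(vectors[word], vectors[y]) + 1) / 2
        if method == "3cosadd":
            return s(b) - s(a) + s(a_star)
        return s(b) * s(a_star) / (s(a) + eps)
    candidates = [w for w in vectors if w not in (a, a_star, b)]
    return max(candidates, key=score)

# Hypothetical 3-d vectors chosen only so the classic analogy works:
# first two axes encode gender, the third encodes royalty
vectors = {"man": [1.0, 0.0, 0.1], "woman": [0.0, 1.0, 0.1],
           "king": [1.0, 0.0, 0.9], "queen": [0.0, 1.0, 0.9],
           "apple": [0.3, 0.3, -0.9]}
print(analogy("man", "woman", "king", vectors, method="3cosmul"))
```

The division in 3CosMul means a candidate must do well on every term at once; a single large cosine can't buy its way past a poor one, which is the "no single dimension dominates" behavior the footnote describes.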
11 If you're using Python 2, this is a great reason to reduce Unicode headaches and switch to Python 3. ←
Reposted with permission from A Word is Worth a Thousand Vectors.
Bio: Chris E. Moody @chrisemoody, Caltech / Astrostats / high-perf supercomputing and now data labs at Stitch Fix. Currently enjoying coding up word2vec, Gaussian Processes, and t-SNE.