Math of Ideas: A Word is Worth a Thousand Vectors

Word vectors give us a simple and flexible platform for understanding text. This post walks through a few diverse examples that should help build your confidence in developing and deploying NLP systems, and in understanding what problems they can solve.

What we didn't mention

While word vectorization is an elegant way to solve many practical text processing problems, it does have a few shortcomings and considerations:

  1. Word vectorization requires a lot of text. You can download pretrained word vectors, but if you have a highly specialized vocabulary you'll need to train your own word vectors on a large body of example text. Typically this means hundreds of millions of words: the equivalent of 1,000 books, 500,000 comments, or 4,000,000 tweets.

  2. Cleaning the text. You'll need to strip punctuation and normalize Unicode [11] characters, which can take significant manual effort. Fortunately, there are a few tools that can help, like ftfy, spaCy, NLTK, and Stanford CoreNLP. spaCy even comes with word vector support built in.

  3. Memory & performance. Training vectors requires a high-memory, high-performance multicore machine. Training can take several hours to several days, but shouldn't need frequent retraining. If you use pretrained vectors, this isn't an issue.

  4. Databases. Modern SQL systems aren't well-suited to the vector addition, subtraction, and multiplication that searching in vector space requires. There are a few libraries that will help you quickly find the most similar items [12]: annoy, ball trees, locality-sensitive hashing (LSH), or FLANN.

  5. False-positives & exactness. Despite the impressive results that come with word vectorization, no NLP technique is perfect. Take care that your system is robust to results that a computer deems relevant but an expert human wouldn't.
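A minimal sketch of the cleaning step (2) above, using only the Python standard library. The `clean` helper is hypothetical and purely illustrative; ftfy, spaCy, and the other tools mentioned handle many more edge cases, such as repairing mojibake:

```python
import re
import unicodedata

def clean(text):
    """Normalize Unicode and strip punctuation from raw text."""
    # NFKC folds compatibility characters (e.g. the "fi" ligature)
    # into their canonical equivalents
    text = unicodedata.normalize("NFKC", text)
    # Lowercase, then replace anything that isn't a word character
    # or whitespace with a space
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # Collapse the runs of whitespace left behind
    return " ".join(text.split())

print(clean("Hello, world!"))  # hello world
print(clean("\ufb01le"))       # file  (the "fi" ligature is unfolded)
```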


The goal of this post was to convince you that word vectors give us a simple and flexible platform for understanding text. We've covered a few diverse examples that should help build your confidence in developing and deploying NLP systems, and in understanding what problems they can solve. While most coverage of word vectors has been from a scientific angle, or demonstrating toy examples, we at Stitch Fix think this technology is ripe for industrial application.

In fact, Stitch Fix is the perfect testbed for these kinds of new technologies: with expert stylists in the loop, we can move rapidly on new and prototypical algorithms without worrying too much about edge and corner cases. The creative world of fashion is one of the few domains left that computers don't understand. If you're interested in helping us break down that wall, apply!

Further reading

There are a few miscellaneous topics that we didn't have room to cover or were too peripheral:

  1. There's an excellent nuts-and-bolts explanation and derivation of the word2vec algorithm. There's a similarly useful IPython Notebook version too.

  2. Translating word-by-word English into Spanish is equivalent to matrix rotations. This means that all of the basic linear algebra operators (addition, subtraction, dot products, and matrix rotations) have meaningful functions on human language.

  3. Word vectors can also be used to find the odd word out.

  4. Interestingly, the same skip-gram algorithm can be applied to a social graph instead of sentence structure. The authors equate a sequence of social network graph visits (a random walk) to a sequence of words (a sentence in word2vec) to generate a dense summary vector.

  5. A brief but very visual overview of distributed representations is available here.

  6. Intriguingly, the word2vec algorithm can be reinterpreted as a matrix factorization method using point-wise mutual information. This theoretical breakthrough cleanly connects older and faster but more memory-intensive techniques with word2vec's streaming algorithm approach.
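A toy numpy sketch of that factorization view, using a hypothetical 4x4 word-context co-occurrence matrix: compute positive pointwise mutual information (PPMI) scores, then factor them with an SVD to get dense word vectors.

```python
import numpy as np

# Toy word-context co-occurrence counts (rows: words, cols: context words);
# in practice these come from scanning a large corpus with a sliding window
counts = np.array([
    [10., 2., 0., 1.],
    [ 3., 8., 1., 0.],
    [ 0., 1., 9., 4.],
    [ 1., 0., 5., 7.],
])

total = counts.sum()
p_word = counts.sum(axis=1, keepdims=True) / total
p_ctx = counts.sum(axis=0, keepdims=True) / total
p_joint = counts / total

# Pointwise mutual information: log p(w,c) / (p(w) p(c)), floored at zero
# ("positive" PMI); zero counts produce -inf, which the floor removes
with np.errstate(divide="ignore"):
    pmi = np.log(p_joint / (p_word * p_ctx))
ppmi = np.maximum(pmi, 0.0)

# Factorize the PPMI matrix with an SVD; the scaled left singular
# vectors act as dense word vectors
U, S, Vt = np.linalg.svd(ppmi)
word_vectors = U[:, :2] * S[:2]  # keep the top 2 dimensions
print(word_vectors.shape)  # (4, 2)
```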

1 See also the original papers, and the subsequently bombastic media frenzy, the race to understand why word2vec works so well, some academic drama on GloVe vs word2vec, and a nice introduction to the algorithms behind word2vec from my friend Radim Řehůřek.

2 Although see Omer Levy and Yoav Goldberg's post for an interesting approach that has the word2vec context defined by parsing the sentence structure. Doing this introduces a more functional similarity between words (see this demo). For example, Hogwarts in word2vec is similar to dementors and dumbledore, as they're all from Harry Potter, while parsing context gives sunnydale and colinwood as they're similarly prestigious schools.

3 This is describing the 'skip-gram' mode of word2vec, where the target word is asked to predict the surrounding context. Interestingly, we can also get similar results by doing the reverse: using the surrounding text to predict a word in the middle! This model, called continuous bag-of-words (CBOW), loses word order, and with it a bit of grammatical information, since grammar is very sensitive to a word's position in a sentence. This means CBOW-trained word vectors tend to do worse in a syntactic sense: the resulting vectors more poorly encode whether a word is an adjective, a verb, or a noun.

4 More generally, a linear combination of axes encodes the properties. We can attempt to rotate into the correct basis by using PCA (as long as we only include a few nearby words) or visualize that space using t-SNE (although we lose the concept of a single axis encoding structure).
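A small numpy sketch of that PCA rotation, using synthetic 50-dimensional "word vectors" whose variation is concentrated along two hidden directions (a stand-in for a tight cluster of nearby words):

```python
import numpy as np

rng = np.random.default_rng(2)

# Ten synthetic "nearby word" vectors in 50-D whose variation lies
# almost entirely along two hidden directions, plus a little noise
basis = rng.standard_normal((2, 50))
points = rng.standard_normal((10, 2)) @ basis + 0.01 * rng.standard_normal((10, 50))

# PCA via SVD of the centered matrix: rotate into the axes of greatest variance
centered = points - points.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ Vt[:2].T  # coordinates along the top two principal axes

# Nearly all the variance should live in those two axes
explained = (S[:2] ** 2).sum() / (S ** 2).sum()
print(projected.shape)  # (10, 2)
```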

5 Compare word vectors to sentiment analysis, which effectively distills everything into one dimension of 'happy or sad', or document-labeling efforts like Latent Dirichlet Allocation, which sorts words into a few topics. In either case, we can only ask these simpler models to categorize new documents into a few predetermined groups. With word vectors we can encapsulate far more diversity without having to build a labeled training text (and thus with less effort).

6 You can download this file freely from here.

7 This uses an advanced visualization technique called t-SNE, which projects down to 2D while still trying to maintain local structure. This helps reveal the several word clusters near the word vacation.

8 Check out this live demo with just Wikipedia words here.

9 We've used cosine similarity to find the nearest items, but we could've chosen the 3COSMUL method. This combines vectors multiplicatively instead of additively and seems to get better results (pdf warning!). It stays truer to cosine distance and generally prevents any one word from dominating in any one dimension.
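The additive and multiplicative scoring rules are easy to compare side by side. The sketch below uses hand-built toy vectors (hypothetical values, purely illustrative) for the analogy man : king :: woman : ?:

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

# Hand-built toy word vectors, normalized to unit length
vocab = {
    "king":  unit(np.array([0.9, 0.8, 0.1])),
    "queen": unit(np.array([0.9, 0.1, 0.8])),
    "man":   unit(np.array([0.2, 0.9, 0.1])),
    "woman": unit(np.array([0.2, 0.1, 0.9])),
    "apple": unit(np.array([0.1, 0.5, 0.2])),  # distractor word
}

def cos(a, b):
    return float(a @ b)

def additive(vocab, a, b, c):
    """3COSADD: argmax over d of cos(d, c + b - a)."""
    target = unit(vocab[c] + vocab[b] - vocab[a])
    return max((w for w in vocab if w not in (a, b, c)),
               key=lambda w: cos(vocab[w], target))

def cosmul(vocab, a, b, c, eps=1e-3):
    """3COSMUL: argmax over d of cos(d, c) * cos(d, b) / (cos(d, a) + eps)."""
    return max((w for w in vocab if w not in (a, b, c)),
               key=lambda w: cos(vocab[w], vocab[c]) * cos(vocab[w], vocab[b])
                             / (cos(vocab[w], vocab[a]) + eps))

# man is to king as woman is to ...?
print(additive(vocab, "man", "king", "woman"))  # queen
print(cosmul(vocab, "man", "king", "woman"))    # queen
```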

10 You can easily make a vector for a whole sentence by following the Doc2Vec tutorial (also called paragraph vector) in gensim, or by clustering words using the Chinese Restaurant Process.
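Doc2Vec learns a dedicated paragraph vector during training; a far cruder baseline, sketched below with hypothetical toy vectors, is simply averaging a sentence's word vectors. This ignores word order but is a surprisingly strong starting point.

```python
import numpy as np

# Toy word vectors (hypothetical values); in practice, load trained embeddings
word_vectors = {
    "the": np.array([0.1, 0.0, 0.1]),
    "cat": np.array([0.9, 0.2, 0.1]),
    "sat": np.array([0.3, 0.8, 0.2]),
}

def sentence_vector(tokens, word_vectors):
    """Crude baseline: the sentence vector is the mean of its word vectors."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:  # no known words: return the zero vector
        return np.zeros(3)
    return np.mean(vecs, axis=0)

v = sentence_vector("the cat sat".split(), word_vectors)
print(v.round(3))
```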

11 If you're using Python 2, this is a great reason to reduce Unicode headaches and switch to Python 3.

12 See a comparison of these techniques here. My recommendation is LSH if you need a pure-Python solution, and annoy if you need a memory-light solution.
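To make the LSH option concrete, here is a toy random-hyperplane index in numpy, with synthetic vectors standing in for real word vectors; annoy and FLANN are far more robust in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "word vectors": 1,000 words in 50 dimensions (synthetic stand-ins)
vectors = rng.standard_normal((1000, 50))

# 16 random hyperplanes; each contributes one sign bit to a word's hash,
# so vectors pointing in similar directions tend to share a bucket
planes = rng.standard_normal((50, 16))
bits = vectors @ planes > 0
buckets = {}
for i, row in enumerate(bits):
    buckets.setdefault(row.tobytes(), []).append(i)

# Query with the vector for word 42: hash it, then rank only its bucket
query = vectors[42]
candidates = buckets[(query @ planes > 0).tobytes()]
sims = [(i, query @ vectors[i] / (np.linalg.norm(query) * np.linalg.norm(vectors[i])))
        for i in candidates]
best = max(sims, key=lambda t: t[1])[0]
print(best)  # 42: the query's own bucket always contains it
```

A lookup scores only the handful of vectors in the query's bucket instead of the whole vocabulary, which is what makes the search fast.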

Reposted with permission from A Word is Worth a Thousand Vectors.

Bio: Chris E. Moody @chrisemoody, Caltech / Astrostats / high-perf supercomputing and now data labs at Stitch Fix. Currently enjoying coding up word2vec, Gaussian Processes, and t-SNE.