Math of Ideas: A Word is Worth a Thousand Vectors

Word vectors give us a simple and flexible platform for understanding text. The few diverse examples here should help build your confidence in developing and deploying NLP systems and show what problems they can solve.

By Chris Moody, Stitch Fix.

Standard natural language processing (NLP) is a messy and difficult affair. It requires teaching a computer about English-specific word ambiguities as well as the hierarchical, sparse nature of words in sentences. At Stitch Fix, word vectors help computers learn from the raw text in customer notes. Our systems, composed of machines and human experts, need to recommend the maternity line when she says she's in her 'third trimester', identify a medical professional when she writes that she 'used to wear scrubs to work', and distill 'taking a trip' into a Fix for vacation clothing.

While we're not totally "there" yet with the holy grail of NLP, word vectors (also referred to as distributed representations) are an amazing tool that sweeps away some of the issues of dealing with human language. The machines work in tandem with the stylists as a support mechanism to help identify and summarize textual information from the customers. The human experts make the final call on what actions will be taken. The goal of this post is to be a motivating introduction to word vectors and to demonstrate their real-world utility.

The following example set the natural language community afire[1] back in 2013:

king - man + woman = queen

In this example, a human posed a question to a computer: what is king - man + woman? This is similar to an SAT-style analogy (man is to woman as king is to what?). And a computer solved this equation and answered: queen. Under the hood, the machine gets that the biggest difference between the words for man and woman is gender. Add that gender difference to king, and you get queen.
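To make the arithmetic concrete, here is a minimal sketch using NumPy. The three-dimensional vectors below are hand-picked, hypothetical values chosen only so the example works out cleanly; they are not taken from a trained model, and the helper names `cosine` and `analogy` are our own. The idea is simply to add and subtract vectors, then pick the remaining word whose vector points in the most similar direction.

```python
import numpy as np

# Hypothetical toy vectors, hand-picked for illustration only.
# The dimensions loosely stand for (royalty, femininity, personhood).
vectors = {
    "king":  np.array([0.9, -0.8, 0.7]),
    "queen": np.array([0.9, 0.8, 0.7]),
    "man":   np.array([0.0, -0.8, 0.9]),
    "woman": np.array([0.0, 0.8, 0.9]),
    "apple": np.array([-0.5, 0.0, -0.9]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(positive, negative, vectors):
    """Add the positive words, subtract the negative ones, and return
    the closest remaining word by cosine similarity."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    exclude = set(positive) | set(negative)  # don't return an input word
    scores = {w: cosine(target, v) for w, v in vectors.items() if w not in exclude}
    return max(scores, key=scores.get)

# king - man + woman
answer = analogy(positive=["king", "woman"], negative=["man"], vectors=vectors)
print(answer)  # queen
```

With real word vectors the sum rarely lands exactly on another word, which is why the lookup asks for the nearest vector rather than an exact match.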

This is astonishing because we've never explicitly taught the machine anything about gender!

In fact, we've never handed the computer anything like a dictionary, a thesaurus, or a network of word relationships. We haven't even tried to break apart a sentence into its constituent parts of speech[2]. We've simply fed a mountain of text into an algorithm called word2vec and expected it to learn from context. Word by word, it tries to predict the other surrounding words in a sentence. Or rather, it internally represents words as vectors, and given a word vector, it tries to predict the other word vectors in the nearby text[3].
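As a rough illustration of what "predict the surrounding words" means in practice, the sketch below builds the (word, nearby word) training pairs that a skip-gram-style model would learn from. This is only the data preparation step, not word2vec itself: the function name `skipgram_pairs` and the fixed `window` parameter are our own assumptions, and real implementations add refinements that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: each word paired with every
    other word that sits within `window` positions of it."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "she used to wear scrubs to work".split()
pairs = skipgram_pairs(sentence, window=1)

for center, context in pairs:
    print(center, "->", context)
```

Each pair is one small prediction task: given the vector for the word on the left, make the word on the right more likely. Repeated over a mountain of text, words that show up in similar contexts end up with similar vectors.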

The algorithm eventually sees so many examples that it can infer the gender of a single word, that both The Times and The Sun are newspapers, that The Matrix is a sci-fi movie, and that the style of an article of clothing might be boho or edgy. That word vectors represent much of the information available in a dictionary definition is a convenient and almost miraculous side effect of trying to predict the context of a word.

Internally, high-dimensional vectors stand in for the words, and some of those dimensions encode gender properties. Each axis of a vector encodes a property, and the magnitude along that axis represents the relevance of that property to the word[4]. If the value along the gender axis is more positive, the word is more feminine; if more negative, more masculine.


Applied appropriately, word vectors are dramatically more meaningful and more flexible than current techniques[5] and let computers peer into text in a fundamentally new way. It's surprisingly easy to get started using libraries like gensim (in Python) or Spark (in Scala & Python) -- all you need to know is how to add, subtract, and multiply vectors!

Let's review the new abilities that word vectors grant us.

Similar words are nearby vectors

Similar words are nearby vectors in a vector space. This is a powerful convention since it lets us wipe away a lot of the noise and nuance in vocabulary. For example, let's use gensim to find a list of words similar to vacation using the freebase skipgram data[6]:

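# In current gensim releases the word2vec-format loader lives on KeyedVectors
from gensim.models import KeyedVectors

# Pretrained vectors in word2vec binary format, hence binary=True
fn = "freebase-vectors-skipgram1000-en.bin.gz"
model = KeyedVectors.load_word2vec_format(fn, binary=True)

# Look up the words whose vectors are closest to the vector for 'vacation'
model.most_similar('vacation')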

# [('trip', 0.7234684228897095),
#  ('honeymoon', 0.6447688341140747),
#  ('beach', 0.6249285936355591),
#  ('vacations', 0.5868890285491943),
#  ('wedding', 0.5541957020759583),
#  ('resort', 0.5231006145477295),
#  ('traveling', 0.5194448232650757),
#  ('vacation.', 0.5068142414093018),
#  ('vacationing', 0.5013546943664551)]

We've calculated the vectors most similar to the vector for vacation, and then looked up what words those vectors represent. As we read the list, we note that these words aren't just similar in vector space, but that they make sense intuitively too.
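For readers curious about what "most similar" means numerically, here is a minimal sketch of a nearest-neighbor lookup by cosine similarity in NumPy. The four words and their three-dimensional vectors are hypothetical values made up for illustration, and this `most_similar` function is our own stand-in rather than gensim's implementation.

```python
import numpy as np

# Hypothetical vocabulary and vectors, for illustration only.
words = ["vacation", "trip", "beach", "invoice"]
matrix = np.array([
    [0.9, 0.8, 0.1],   # vacation
    [0.8, 0.9, 0.2],   # trip
    [0.7, 0.4, 0.0],   # beach
    [-0.2, 0.1, 0.9],  # invoice
])

def most_similar(query, words, matrix, topn=3):
    """Rank every other word by cosine similarity to `query`."""
    # Scale each row to unit length so a dot product equals the cosine.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit[words.index(query)]
    order = np.argsort(-sims)  # highest similarity first
    ranked = [(words[i], float(sims[i])) for i in order if words[i] != query]
    return ranked[:topn]

neighbors = most_similar("vacation", words, matrix)
for word, score in neighbors:
    print(f"{word}: {score:.3f}")
```

In this toy setup trip and beach score close to 1, while invoice sits near 0, which mirrors the kind of ranking in the list above.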