Math of Ideas: A Word is Worth a Thousand Vectors


Word vectors give us a simple and flexible platform for understanding text. Here are a few diverse examples that should help build your confidence in developing and deploying NLP systems and show what problems they can solve.



In this case, we've looked for vectors near the word vacation by measuring each word's similarity (usually cosine similarity) to the root word and sorting by that score.
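To make the ranking step concrete, here is a minimal sketch of a nearest-neighbor lookup by cosine similarity. The tiny vocabulary and its 3-dimensional vectors are invented for illustration (real word vectors have hundreds of dimensions), and the helper name nearest_words is hypothetical rather than part of any library.

```python
import numpy as np

# Toy vectors, invented for illustration only.
toy_vectors = {
    "vacation":  np.array([0.9, 0.1, 0.0]),
    "belize":    np.array([0.8, 0.2, 0.1]),
    "honeymoon": np.array([0.7, 0.6, 0.0]),
    "invoice":   np.array([0.0, 0.1, 0.9]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_words(root, vectors, topn=3):
    # Score every other word against the root word, then sort best-first.
    root_vec = vectors[root]
    scored = [
        (word, cosine_similarity(root_vec, vec))
        for word, vec in vectors.items()
        if word != root
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

neighbors = nearest_words("vacation", toy_vectors)
for word, score in neighbors:
    print(word, round(score, 3))
```

With these toy vectors, belize ranks first and invoice last, which is the same sort-by-similarity idea applied to a four-word vocabulary.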

[Figure: words close to vacation]

Above is a screenshot of a visualization of the words nearest to "vacation". (An interactive version of the visualization is also available.)
The more similar a word is to its genre, the larger the radius of its marker. In the interactive version, hover over the bubbles to reveal the words they represent [7].

And these words aren't just nearby; they're also in several clusters. So we can determine that the words most similar to vacation come in a variety of flavors: one cluster might be wedding-related, but another might relate to destinations like Belize.

Of course our human stylists understand when a client says "I'm going to Belize in March" that she has an upcoming vacation. But the computer can potentially tag this as a 'vacation' fix because the word vector for Belize is similar to that for vacation. We can then make sure that the Fixes our customers get are vacation-appropriate!
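As a rough sketch of how such tagging might work, the snippet below tags a note when any of its words is close enough to the topic word. The toy vectors, the 0.8 threshold, and the helper name tag_note are all assumptions made for illustration; they are not the system described here.

```python
import numpy as np

# Toy vectors, invented for illustration only.
toy_vectors = {
    "vacation": np.array([0.9, 0.1, 0.0]),
    "belize":   np.array([0.8, 0.2, 0.1]),
    "march":    np.array([0.3, 0.3, 0.3]),
    "invoice":  np.array([0.0, 0.1, 0.9]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def tag_note(note, topic, vectors, threshold=0.8):
    """Return True if any known word in the note is similar enough to the topic."""
    topic_vec = vectors[topic]
    # Lowercase and strip surrounding punctuation; unknown words are ignored.
    words = [w.strip(".,!?'\"").lower() for w in note.split()]
    return any(
        cosine_similarity(vectors[w], topic_vec) >= threshold
        for w in words
        if w in vectors
    )

print(tag_note("I'm going to Belize in March", "vacation", toy_vectors))
print(tag_note("Please resend the invoice.", "vacation", toy_vectors))
```

The threshold is the main design choice: set it too low and unrelated notes get tagged, too high and only near-synonyms match.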

Ideas are words that can be added & subtracted

We can search semantically by adding and subtracting word vectors [8]. This lets us creatively combine and remove concepts and ideas. Let's start with a style we know a customer liked, item_3469:

[Image: item_3469]

Our customer recently became pregnant, so let's try to find something like item_3469 but along the pregnant dimension:

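# Add the two vectors by passing both as positive terms
# (a second positional argument would be treated as a negative term).
matches = model.most_similar(positive=['ITEM_3469', 'pregnant'])
# Keep only item labels and drop the similarity scores.
matches = [label for label, score in matches if 'ITEM_' in label]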

# ['ITEM_13792',
# 'ITEM_11275',
# 'ITEM_11868']


Of course the item IDs aren't immediately informative, but the pictures let us know that we've done well:

[Image: the three matching items]

The first two items have prominent black & white stripes like item_3469 but have the added property that they're great maternity wear. The last item changes the pattern away from stripes but is still a loose blouse that's great for an expectant mother. Here we've simply added the word vector for pregnant to the word vector for item_3469 and looked up the word vectors most similar to that result [9].
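The add-then-look-up step can be sketched in a few lines of NumPy. The item IDs and 3-dimensional vectors below are invented (the first axis loosely stands for "striped" and the third for "maternity"), and most_similar_to_sum is a hypothetical helper, not the library call used in the example.

```python
import numpy as np

# Toy vectors, invented for illustration only.
toy_vectors = {
    "ITEM_A":   np.array([1.0, 0.0, 0.0]),  # striped top (the query item)
    "ITEM_B":   np.array([0.9, 0.1, 0.8]),  # striped maternity top
    "ITEM_C":   np.array([0.1, 0.9, 0.7]),  # non-striped maternity blouse
    "ITEM_D":   np.array([0.0, 1.0, 0.0]),  # unrelated item
    "pregnant": np.array([0.0, 0.1, 1.0]),
}

def unit(v):
    # Scale a vector to length 1 so dot products equal cosine similarities.
    return v / np.linalg.norm(v)

def most_similar_to_sum(positive, vectors, topn=3):
    # Normalize each term before summing so neither one dominates the query.
    query = unit(sum(unit(vectors[w]) for w in positive))
    scored = [
        (w, float(np.dot(query, unit(v))))
        for w, v in vectors.items()
        if w not in positive
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

matches = most_similar_to_sum(["ITEM_A", "pregnant"], toy_vectors)
for label, score in matches:
    print(label, round(score, 3))
```

In this toy space the striped maternity top ranks first because it is close to both terms of the sum, while an item matching only one of the two concepts scores lower.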

Our stylists tailor each Fix to their clients, and this prototype system may free them to mix and match artistic concepts about style, size and fit to creatively search for new items.

Summarizing sentences & documents

At Stitch Fix, we work hard to craft a uniquely-styled Fix for each of our customers. At every stage of a Fix we collect feedback: what would you like in your next Fix? What did you think of the items we sent you? What worked? What didn't?

The range of responses is broad, but vectorizing those sentences [10] allows us to begin systematically categorizing those documents:

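from gensim.models import Doc2Vec

# Load the previously trained model from disk.
fn = "word_vectors_blog_post_v01_notes"
model = Doc2Vec.load(fn)
# Find the entries nearest to the word 'pregnant' ...
matches = model.most_similar('pregnant')
# ... and keep only sentence labels (those prefixed with 'SENT_').
matches = list(filter(lambda x: 'SENT_' in x[0], matches))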

# ['...I am currently 23 weeks pregnant...',
#  '...I'm now 10 weeks pregnant...',
#  '...not showing too much yet...',
#  '...15 weeks now. Baby bump...',
#  '...6 weeks post partum!...',
#  '...12 weeks postpartum and am nursing...',
#  '...I have my baby shower that...',
#  '...am still breastfeeding...',
#  '...I would love an outfit for a baby shower...']


In this example we calculate which sentences are closest to the word pregnant. The list skips over many literal matches of pregnant in order to demonstrate the more advanced capabilities, and we've censored the sentences to keep out personally identifying text. Note that the last sentence is a false positive: while it is similar to the word pregnant, the client is unlikely to be interested in maternity clothing.

This allows us not just to understand what words mean, but also to condense our client comments, notes, and requests in a quantifiable way. We can, for example, categorize our sentences by first calculating the similarity between a sentence and a word:

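import numpy as np

def get_vector(word):
    # Look up the unit-normalized vector for a word or sentence label.
    return model.syn0norm[model.vocab[word].index]

def calculate_similarity(sentence, word):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    vec_a = get_vector(sentence)
    vec_b = get_vector(word)
    sim = np.dot(vec_a, vec_b)
    return sim

calculate_similarity('SENT_47973', 'casual')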

# 0.308


We calculated the similarity between the sentence labeled SENT_47973 and the word casual. The sentence vector was trained from this customer text: 'I need some weekend wear. Comfy but stylish.' The similarity to casual is about 0.308, which is high compared with the same sentence's similarity to broken (0.096) or pregnant (0.191) in the table below.

Having built a function that computes the similarity between a sentence and a word, we can build a table of customer comments and their similarities to a given topic:

raw text snippets                                                  'broken'  'casual'  'pregnant'
'... unfortunately the lining ripped after wearing if twice ...'   0.281     0.082     0.062
'... I need some weekend wear. Comfy but stylish.'                 0.096     0.308     0.191
'... 12 weeks postpartum and am nursing ...'                       0.158     0.110     0.378


A table like this helps us quickly answer how many people are looking for comfortable clothes or finding defects in the clothing we send them.
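As a rough sketch of how such a table could be queried, the snippet below counts how many comments exceed a similarity cutoff for each topic. The scores are copied from the table above; the 0.25 cutoff and the variable names are assumptions chosen for illustration.

```python
topics = ["broken", "casual", "pregnant"]

# Similarity scores copied from the table, one row per comment.
scores = {
    "... unfortunately the lining ripped after wearing if twice ...": [0.281, 0.082, 0.062],
    "... I need some weekend wear. Comfy but stylish.": [0.096, 0.308, 0.191],
    "... 12 weeks postpartum and am nursing ...": [0.158, 0.110, 0.378],
}

CUTOFF = 0.25  # hypothetical threshold for "this comment is about the topic"

# Count the comments that clear the cutoff for each topic.
counts = {topic: 0 for topic in topics}
for snippet, row in scores.items():
    for topic, sim in zip(topics, row):
        if sim >= CUTOFF:
            counts[topic] += 1

print(counts)
```

On a full set of customer comments, the same loop would give a per-topic tally, though a suitable cutoff would need to be checked against hand-labeled examples.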