Math of Ideas: A Word is Worth a Thousand Vectors
Word vectors give us a simple and flexible platform for understanding text. Here are a few diverse examples that should help build your confidence in developing and deploying NLP systems and in knowing what problems they can solve.
In this case, we've looked for vectors that are near the word vacation by measuring each word's similarity (usually cosine similarity) to the root word and sorting by that score.
Above is a screenshot of a visualization of the words nearest to "vacation".
(Here is an interactive visualization.)
The more similar a word is to its genre, the larger the radius of the marker. Hover over the bubbles to reveal the words they represent [7].
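The nearest-neighbor search behind this kind of visualization (score every word by its cosine similarity to the root word, then sort) can be sketched in a few lines. This is a minimal, dependency-free illustration, not our production code: the three-dimensional `toy_vectors` and the helper names `cosine_similarity` and `nearest_words` are assumptions made up for the example, and real word vectors have hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional toy vectors, invented for illustration only.
toy_vectors = {
    "vacation": [0.9, 0.1, 0.0],
    "beach": [0.8, 0.2, 0.1],
    "honeymoon": [0.7, 0.6, 0.0],
    "spreadsheet": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_words(root, vectors, topn=3):
    """Rank every other word by its similarity to the root word."""
    root_vec = vectors[root]
    scored = [
        (word, cosine_similarity(root_vec, vec))
        for word, vec in vectors.items()
        if word != root
    ]
    # Most similar first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


neighbors = nearest_words("vacation", toy_vectors)
for word, score in neighbors:
    print(f"{word}: {score:.3f}")
```

With these toy values, "beach" ranks first and "spreadsheet" last, which mirrors how the marker sizes in the visualization are ordered.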
And these words aren't just nearby; they also fall into several clusters. So we can determine that the words most similar to vacation come in a variety of flavors: one cluster might be wedding-related, but another might relate to destinations like Belize.
Of course, our human stylists understand when a client says "I'm going to Belize in March" that she has an upcoming vacation. But the computer can potentially tag this as a 'vacation' fix because the word vector for Belize is similar to that for vacation. We can then make sure that the Fixes our customers get are vacation-appropriate!
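A rough sketch of that tagging step, under stated assumptions: the toy vectors, the 0.8 threshold, and the function name `suggests_topic` are all hypothetical, chosen only to illustrate the idea of flagging a note when any of its words sits close to the topic word.

```python
import math
import re

# Hypothetical toy vectors; a trained model would supply the real ones.
toy_vectors = {
    "vacation": [0.9, 0.1, 0.0],
    "belize": [0.8, 0.3, 0.1],
    "march": [0.3, 0.2, 0.4],
    "office": [0.0, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def suggests_topic(note, topic, vectors, threshold=0.8):
    """Return True if any known word in the note is close to the topic word.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    topic_vec = vectors[topic]
    words = re.findall(r"[a-z']+", note.lower())
    return any(
        cosine_similarity(vectors[word], topic_vec) >= threshold
        for word in words
        if word in vectors  # skip words the model has never seen
    )


print(suggests_topic("I'm going to Belize in March", "vacation", toy_vectors))
print(suggests_topic("I need clothes for the office", "vacation", toy_vectors))
```

In practice the threshold would need tuning against real notes, since a cutoff that is too low tags almost everything and one that is too high misses notes like the Belize example.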
We have the ability to search semantically by adding and subtracting word vectors [8]. This empowers us to creatively add and subtract concepts and ideas. Let's start with a style we know a customer liked, item_3469:
Our customer recently became pregnant, so let's try to find something like item_3469 but along the pregnant dimension:
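    # Add the two vectors and look up the nearest neighbors.
    # most_similar returns (label, similarity) pairs.
    matches = model.most_similar(positive=['ITEM_3469', 'pregnant'])
    # Keep only item labels, dropping ordinary words.
    matches = [label for label, _ in matches if label.startswith('ITEM_')]
    # ['ITEM_13792',
    #  'ITEM_11275',
    #  'ITEM_11868']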
Of course the item IDs aren't immediately informative, but the pictures let us know that we've done well:
The first two items have prominent black & white stripes like item_3469 but have the added property that they're great maternity-wear. The last item changes the pattern away from stripes but is still a loose blouse that's great for an expectant mother. Here we've simply added the word vector for pregnant to the word vector for item_3469, and looked up the word vectors most similar to that result [9].
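To make the arithmetic concrete, here is a minimal sketch of "add two vectors, then find the nearest items". Everything in it is an assumption for illustration: the item labels, the three toy dimensions (roughly stripes, maternity, denim), and the helper `most_similar_to_sum` are invented. It also sums the raw vectors directly, whereas gensim normalizes vectors before combining them, so real scores will differ.

```python
import math

# Hypothetical toy vectors; dimensions loosely mean (stripes, maternity, denim).
toy_vectors = {
    "ITEM_STRIPED_TOP": [1.0, 0.0, 0.0],
    "pregnant": [0.0, 1.0, 0.0],
    "ITEM_STRIPED_MATERNITY": [0.7, 0.7, 0.0],
    "ITEM_LOOSE_BLOUSE": [0.2, 0.9, 0.1],
    "ITEM_JEANS": [0.1, 0.0, 1.0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar_to_sum(labels, vectors, prefix="ITEM_", topn=2):
    """Sum the vectors for `labels`, then rank the remaining items."""
    dims = len(vectors[labels[0]])
    query = [sum(vectors[label][i] for label in labels) for i in range(dims)]
    scored = [
        (label, cosine_similarity(query, vec))
        for label, vec in vectors.items()
        # Exclude the query terms themselves and anything that is not an item.
        if label not in labels and label.startswith(prefix)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in scored[:topn]]


matches = most_similar_to_sum(["ITEM_STRIPED_TOP", "pregnant"], toy_vectors)
print(matches)
```

The striped maternity item ranks first because it shares both properties of the summed query, while the loose blouse shares only the maternity direction.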
Our stylists tailor each Fix to their clients, and this prototype system may free them to mix and match artistic concepts about style, size and fit to creatively search for new items.
Summarizing sentences & documents
At Stitch Fix, we work hard to craft a uniquely-styled Fix for each of our customers. At every stage of a Fix we collect feedback: what would you like in your next Fix? What did you think of the items we sent you? What worked? What didn't?
The responses are many and varied, but vectorizing those sentences [10] allows us to begin systematically categorizing those documents:
    from gensim.models import Doc2Vec

    fn = "word_vectors_blog_post_v01_notes"
    model = Doc2Vec.load(fn)
    # most_similar returns (label, similarity) pairs.
    matches = model.most_similar('pregnant')
    # Keep only sentence labels, dropping ordinary words.
    matches = [label for label, _ in matches if label.startswith('SENT_')]
    # Text of the matching sentences:
    # ['...I am currently 23 weeks pregnant...',
    #  '...I'm now 10 weeks pregnant...',
    #  '...not showing too much yet...',
    #  '...15 weeks now. Baby bump...',
    #  '...6 weeks post partum!...',
    #  '...12 weeks postpartum and am nursing...',
    #  '...I have my baby shower that...',
    #  '...am still breastfeeding...',
    #  '...I would love an outfit for a baby shower...']
In this example we calculate which sentences are closest to the word pregnant. This list also skips over many literal matches of pregnant in order to demonstrate the more advanced capabilities. We've also censored sentences to keep out personally identifying text. Also note that the last sentence is a false positive: while the sentence is similar to the word pregnant, the client who wrote it is unlikely to be interested in maternity clothing.
This allows us not just to understand what words mean, but also to condense our client comments, notes, and requests in a quantifiable way. We can, for example, categorize our sentences by first calculating the similarity between a sentence and a word:
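    import numpy as np

    def get_vector(word):
        # Look up the unit-normalized vector for a word or sentence label.
        return model.syn0norm[model.vocab[word].index]

    def calculate_similarity(sentence, word):
        vec_a = get_vector(sentence)
        vec_b = get_vector(word)
        # For unit vectors the dot product equals the cosine similarity.
        sim = np.dot(vec_a, vec_b)
        return sim

    calculate_similarity('SENT_47973', 'casual')
    # 0.308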
We calculated the similarity between the sentence labeled SENT_47973 and the word casual. The sentence vector was trained on this customer text: 'I need some weekend wear. Comfy but stylish.' The similarity to casual is about 0.308, which is high compared with the same sentence's similarity to an unrelated topic such as broken (0.096).
Having built a function that computes the similarity between a sentence and a word, we can build a table of customer comments and their similarities to a given topic:
|Raw text snippet|'broken'|'casual'|'pregnant'|
|---|---|---|---|
|'... unfortunately the lining ripped after wearing if twice ...'|0.281|0.082|0.062|
|'... I need some weekend wear. Comfy but stylish.'|0.096|0.308|0.191|
|'... 12 weeks postpartum and am nursing ...'|0.158|0.110|0.378|
A table like this helps us quickly answer how many people are looking for comfortable clothes or finding defects in the clothing we send them.
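As a rough sketch of how such a table could be assembled and then queried, the example below scores every sentence against every topic and counts the sentences that clear a cutoff. The sentence labels, toy vectors, helper names, and the 0.5 threshold are all assumptions invented for illustration; real similarities, like those in the table, are much smaller, so the cutoff would need to be chosen from the data.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product is a cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Hypothetical toy vectors standing in for trained sentence and word vectors.
toy_vectors = {
    label: normalize(vec)
    for label, vec in {
        "SENT_A": [0.9, 0.2, 0.1],  # stands in for a comment about a defect
        "SENT_B": [0.1, 0.9, 0.3],  # stands in for a request for weekend wear
        "SENT_C": [0.1, 0.2, 0.9],  # stands in for a postpartum note
        "broken": [1.0, 0.0, 0.0],
        "casual": [0.0, 1.0, 0.0],
        "pregnant": [0.0, 0.0, 1.0],
    }.items()
}


def similarity_table(sentences, topics, vectors):
    """Map each sentence label to its similarity with every topic word."""
    return {
        sent: {
            topic: sum(a * b for a, b in zip(vectors[sent], vectors[topic]))
            for topic in topics
        }
        for sent in sentences
    }


def count_matching(table, topic, threshold=0.5):
    """Count sentences whose similarity to the topic clears the threshold."""
    return sum(1 for scores in table.values() if scores[topic] >= threshold)


topics = ["broken", "casual", "pregnant"]
table = similarity_table(["SENT_A", "SENT_B", "SENT_C"], topics, toy_vectors)
best_topic = {sent: max(scores, key=scores.get) for sent, scores in table.items()}
print(best_topic)
print(count_matching(table, "broken"))
```

Counting per topic in this way is one simple route from a similarity table to an answer such as "how many clients reported a defect".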