Vita Markman is currently employed as a Staff Software Engineer at LinkedIn, where she works on various natural language processing applications such as performing sentiment analysis of customer feedback and extracting relevant information from job postings. Before joining LinkedIn, she was a Staff Research Engineer at Samsung Research America. Prior to Samsung Research, she was employed as a Computational Linguist at Disney Interactive.
In addition, she conducts independent research on mining the language of social media. Her primary interests are in extracting topics and sentiment from micro-text – the short, snippet-like pieces of text found on Twitter, Facebook, and various other social media sources.
Her educational background is in theoretical and computational linguistics (Rutgers, 2005). In addition to computational linguistics, she has a publication record in theoretical syntax and morphology, which was her primary area of research between 2002 and 2008.
I had the pleasure of attending her talk "Integrating Linguistic Features into Sentiment Models: Sentiment Mining in Social Media within an Industry Setting" at the Sentiment Analysis Innovation Summit 2014 in San Francisco, CA.
Here is the second and final part of my interview with her:
Anmol Rajpurohit: Q5. What are the most unexpected insights on domain and customer expectation that you discovered through your research?
Vita Markman: One striking example of customer expectation is related to delivery time frames. Namely, temporal expressions such as “within” + [some time frame] vs. “in over” + [some time frame] that customers use in their reviews strongly indicate positive vs. negative sentiment, respectively. Even when “within” is used with a really long time period such as “within a month”, it still indicates positive sentiment, while the use of “in over” with a much shorter time period such as “in over two days” indicates negative sentiment. Customers used these phrases in positive contexts such as “the book arrived within a month to Australia!” and in negative contexts such as “I paid for expedited shipping and the book arrived in over two days!” What is interesting is that it is not the absolute time frame itself – “two days” vs. “a month” – that matters for positivity vs. negativity, but the preposition with which it is used: “within” vs. “in over”. Importantly, the use of “within” vs. “in over” is highly domain specific: it relates to shipment time intervals and is unlikely to transfer over to another domain such as movies. For example, it is completely plausible to have negative sentiment expressed in a manner such as “this movie bored me within minutes”.
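The preposition-as-cue observation above can be captured with a very simple rule. The sketch below is my own illustration, not Markman's actual feature set; the patterns and function name are hypothetical, and, as she notes, such a rule would only be valid inside the shipping domain.

```python
import re

# Hypothetical rule sketch: treat the preposition attached to a delivery
# time frame as the sentiment cue, ignoring the interval's length itself.
WITHIN = re.compile(r"\bwithin\s+\w+", re.IGNORECASE)
IN_OVER = re.compile(r"\bin\s+over\s+\w+", re.IGNORECASE)

def temporal_cue(review: str) -> str:
    """Coarse polarity cue for the shipping domain based on the preposition."""
    if IN_OVER.search(review):
        return "negative"
    if WITHIN.search(review):
        return "positive"
    return "neutral"
```

Note that this rule would happily mislabel “this movie bored me within minutes” as positive, which is exactly the domain-specificity problem described above.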
The following additional insights, related specifically to the domain of transactions, are interesting: customers appear to really appreciate – and explicitly mention in their reviews – the use of bubble wrap and other protective wrapping in packages; customers expect delivery confirmation by email and complain overwhelmingly when it is absent, even when delivery happens on time! Customers also explicitly mention when a product is delivered in time for a major holiday or birthday (e.g. “arrived in time for xmas” or “arrived just before my wife’s birthday!”). Domain-specific complaints related to books include things such as water damage, mildew or mold on pages, or an unclean or torn cover.
The clearly negative mention of mold or mildew in the context of books is domain specific, as one can readily imagine a positive mention of “mold” or “mildew” in the context of cleaning supplies, such as “Mr. Clean fights mildew” or “Removes mold in seconds!” Finally, and perhaps most interestingly, people make a very clear distinction between book editions that are “old” or have “yellowed pages” vs. those that are “vintage”. This is interesting because “vintage” clearly implies “not new”, and a vintage edition is often in bad condition. Yet a beat-up condition is something customers will forgive in the case of vintage editions. This is surprising and also poses a challenge for sentiment analysis, as the phrases “vintage edition” and “old edition” are relatively close in lexical meaning.
AR: Q6. What trends do you perceive in the current research on sentiment analysis? What do you consider as some of the most innovative applications of insights from sentiment analysis?
VM: One of the dominant trends in current sentiment research is aspect-based sentiment analysis, which takes into account specific components or aspects of the product in opinion attribution. The more diverse the components of the product are, the more crucial aspect-based sentiment modeling becomes for gleaning actionable insights from the model.
For example, in restaurant reviews customers may give positive reviews to food, while giving negative reviews to ambiance or service.
Similarly, hotels have very varied aspects such as room quality, service, location, amenities, etc. that may elicit polar-opposite opinions from customers. On that note, one of the more innovative applications of insights from aspect-based sentiment analysis is “pivoting” hotel reviews based on the above-mentioned aspects of hotels, as opposed to showing positive/negative reviews of the hotel as a whole. Aspect-based sentiment attribution gives potential customers much more insight and allows them to choose a hotel based on the aspects that are key for them.
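The “pivoting” idea can be sketched very roughly: bucket review sentences by aspect keyword, then aggregate sentiment per aspect rather than per hotel. This is a toy illustration of mine under simple keyword-matching assumptions, not any production system; the aspect lexicon and function name are invented for the example.

```python
# Hypothetical aspect lexicon for hotel reviews (illustration only)
ASPECTS = {
    "room": ["room", "bed", "bathroom"],
    "service": ["staff", "service", "reception"],
    "location": ["location", "view", "downtown"],
}

def pivot_by_aspect(sentences_with_polarity):
    """Aggregate per-aspect polarity from (sentence, +1/-1) pairs."""
    scores = {aspect: [] for aspect in ASPECTS}
    for sentence, polarity in sentences_with_polarity:
        words = sentence.lower().split()
        for aspect, keywords in ASPECTS.items():
            if any(k in words for k in keywords):
                scores[aspect].append(polarity)
    # average polarity per aspect; None where the aspect is never mentioned
    return {a: (sum(v) / len(v) if v else None) for a, v in scores.items()}
```

A hotel pivoted this way might score high on location but low on service, which is exactly the distinction a whole-hotel star rating hides.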
AR: Q7. What soft skills do you think are the most important for practitioners in the field of Sentiment Mining?
VM: I will name three skills that most immediately come to mind:
Being aware that language is more than just a bag of words: it contains rich syntactic structure, and oftentimes sentiment crucially depends on this structure. For example, sentiment-bearing phrases such as “no problems”, “no delivery”, and “no problems with delivery” all mean very different things due to the scope of negation – i.e., what the negation is applied to – and not due to the mere presence of negation in the phrase. The shorter the review, the more important structural features of this kind become. A bag-of-words representation of language is a great over-simplification and only takes one so far, especially in models that require a nuanced understanding of sentence meaning, such as sentiment classification.
Being aware of the critical need to design proper code-books for annotation and to test inter-annotator agreement when preparing training sets for supervised learning. The learning algorithm will only be as good as the data it trains on. The more subtle the annotations are, the harder it is to design clear annotation instructions. For example, labeling something as positive or negative is already hard, but a three-way or even five-way sentiment assignment is exceedingly subjective and requires the clearest possible instructions to distinguish between “fine” vs. “good” and “good” vs. “excellent”. Ideally, researchers and practitioners should test their own instructions internally prior to crowdsourcing in order to ensure better quality of the resulting annotated data.
Being aware that domain specificity exists. While we want our models to be as general as possible, they simply won’t be in all cases. What is “good” in one domain may be neutral or even “bad” in another. One must be aware of which features are susceptible to this effect and which are more robust. The rule of thumb, unfortunately, is that only a very few obviously robust sentiment-bearing adjectives and verbs, such as “hate”, “terrible”, and “amazing”, unambiguously cut across virtually all domains.
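The negation-scope point in the first skill above can be made concrete. This is my own toy illustration, not her production system: a bag-of-words view sees only that “no” is present, while even a crude scope rule that attaches “no” to the following token recovers the crucial difference between the phrases.

```python
def bag_of_words(text):
    """Bag-of-words view: just the set of tokens, structure discarded."""
    return set(text.lower().split())

def negation_scope(text):
    """Toy scope rule: attach 'no' to the immediately following token,
    e.g. 'no problems' -> 'NOT_problems' (illustration only)."""
    tokens = text.lower().split()
    scoped = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "no" and i + 1 < len(tokens):
            scoped.append("NOT_" + tokens[i + 1])
            i += 2
        else:
            scoped.append(tokens[i])
            i += 1
    return scoped
```

Under this rule, “no problems with delivery” negates “problems” (good news) while “no delivery” negates “delivery” (bad news) – the same word “no” yields opposite polarities depending on its scope, which a bag of words cannot distinguish.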
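For the second skill, inter-annotator agreement on a labeling task is commonly measured with Cohen's kappa, which discounts the agreement two annotators would reach by chance. A minimal sketch (my addition, not from the interview; standard formula, hypothetical function name):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # fraction of items where the two annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # chance agreement from each annotator's label frequencies
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

A kappa near 1 signals a clear code-book; a low kappa on a five-way sentiment scheme is exactly the warning sign that the instructions distinguishing “fine” from “good” need tightening before crowdsourcing.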
AR: Q8. What was the last book that you read and liked? What do you like to do when you are not working?
VM: The last book I read and really liked is “Never Let Me Go” by Kazuo Ishiguro, a disquieting futuristic novel, akin to Huxley’s Brave New World, that traces the lives of unusual children in an elite and mysterious school (no more spoilers!). The novel is a brilliant allegory of human life and a beautifully written piece.
When I am not working, I dance Argentine tango. It is my greatest passion and hobby other than mining text. I have been dancing tango for over 10 years now.