Exploring Twitter Hashtags

Using a dataset of 29 million messages, Jan Poeschko explores relations among the hashtags with respect to co-occurrences. He classifies hashtags into five intuitive classes, using a machine-learning approach.

By Jan Pöschko, October 28, 2010

Twitter messages often contain so-called hashtags, which mark keywords related to the message. Using a dataset of 29 million messages, I explore how these hashtags relate to one another based on their co-occurrences. Furthermore, I present an attempt to classify hashtags into five intuitive classes, using a machine-learning approach. The overall outcome is an interactive Web application to explore Twitter hashtags.

Naturally, the language used in tweets is characterized by many abbreviations (e.g. 4 U) and emoticons (e.g. :)), as in SMS messages. However, there are also forms of annotation that are specific to Twitter, most notably so-called @-replies and hashtags, as in the following tweet:

@merazindagi Thanks! Will make more 4 U. Live performances in #boulder area will be on saxy.us :) #jazz #rock #funk #dance #livemusic

Hashtags are simply words that are preceded by a hash (#). They can be used both inside the text and at its end to annotate keywords for a tweet. Twitter displays each hashtag as a link to a page listing other tweets containing the hashtag; that is where the "tag" in "hashtags" comes from, as they serve a purpose similar to that of tags on websites like Flickr and Delicious.
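To make the definition concrete, here is a minimal Python sketch that pulls the hashtags out of the example tweet and counts which pairs of hashtags appear together in the same tweet. It is an illustration only, not the method used in the paper: the regular expression (a hash followed by word characters), the lowercasing of tags, and the function names extract_hashtags and count_cooccurrences are all assumptions made for this example, and Twitter's actual tokenization rules are more involved.

```python
import re
from collections import Counter
from itertools import combinations

# Assumed pattern: a hash that does not follow a word character,
# followed by one or more word characters. Not Twitter's official rule.
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")


def extract_hashtags(tweet):
    """Return the distinct hashtags of a tweet, lowercased, in order of appearance."""
    tags = []
    for tag in HASHTAG_RE.findall(tweet):
        tag = tag.lower()
        if tag not in tags:
            tags.append(tag)
    return tags


def count_cooccurrences(tweets):
    """Count how often each unordered pair of hashtags shares a tweet."""
    counts = Counter()
    for tweet in tweets:
        # Sorting gives every pair a canonical order, e.g. ("jazz", "rock").
        tags = sorted(extract_hashtags(tweet))
        counts.update(combinations(tags, 2))
    return counts


example = (
    "@merazindagi Thanks! Will make more 4 U. Live performances in "
    "#boulder area will be on saxy.us :) #jazz #rock #funk #dance #livemusic"
)

print(extract_hashtags(example))
# ['boulder', 'jazz', 'rock', 'funk', 'dance', 'livemusic']
```

For the single example tweet, count_cooccurrences([example]) yields 15 pairs, each with a count of 1; over a large collection of tweets, the counts would indicate which hashtags tend to be used together.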

The problem with many hashtags is that, just from their name, it is often impossible to tell what they are about (e.g. #tcot, #p2, #sgp). This problem might (at least partially) be solved by the two approaches described in this work: a dictionary built upon co-occurrences (section 3) and a machine-learnt classification into basic classes (section 4), plugged into an interactive Web application (section

Full paper at twex.poeschko.com/media/files/ExploringTwitterHashtags.pdf