Algorithmia Tested: Human vs Automated Tag Generation
Algorithmia, a marketplace for algorithms, provides a platform for hosting APIs for a wide range of text analytics and information-retrieval tasks. This case study uses automatic post tagging to demonstrate the platform's effectiveness and ease of use.
Conclusions – Tagging
The results are interesting. For the tags, it seems much of the categorization done when writing articles is automatically extracted by Algorithmia. For example, most articles in the news category on KDnuggets are given the news tag. Also, the automatically generated tags are sometimes very specific to the article. For example, the tag “treenet” was generated for these articles in May and January 2014, five months apart. I’d expect a human editor wouldn’t make these connections, and this shows one of the big benefits you can reap from machine-generated tags.
Table 1: Top Tags by Frequency
| Rank | Hand Generated | Machine Generated |
|------|----------------|-------------------|
| 1 | Big Data (2.0%) | Data (5.2%) |
| 2 | Machine Learning (1.8%) | Analytics (2.4%) |
| 3 | Hadoop (1.4%) | Data Analytics (1.9%) |
| 4 | Data Science (1.4%) | Big (1.7%) |
| 5 | Deep Learning (1.2%) | News (1.6%) |
| 6 | Interview (1.0%) | Big Data (1.5%) |
| 7 | R (1.0%) | Data News (1.3%) |
| 8 | Data Mining (0.9%) | Mining (1.2%) |
| 9 | Data Science (0.9%) | Science (1.2%) |
| 10 | Analytics (0.8%) | Data Mining (1.2%) |
The table above shows that the machine-generated and hand-generated tags diverge to some degree, though many of the top tags, such as data science, data mining, and big data, are similar.
The itemset-generated tags, however, tend to be distributed more heavily towards the most common tags: "Data" alone accounts for 5.2% of them. This makes some of the machine-generated tags somewhat less useful, though it is in the same vein as using common tags as high-level classifications of posts. Note that "KDnuggets" was an extremely common machine-generated tag, but it was removed because tagging posts from KDnuggets with the name of the site is redundant.
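The skew described above is easy to measure. The sketch below computes each tag's share of all tag occurrences across a corpus; the per-article tag lists are made up for illustration and are not the case study's actual data.

```python
from collections import Counter

def tag_frequencies(tag_lists):
    """Flatten per-article tag lists and return each tag's share of all tags."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}

# Hypothetical per-article machine tags, for illustration only.
machine_tags = [["data", "analytics"], ["data", "news"], ["data", "mining"]]
freqs = tag_frequencies(machine_tags)
# "data" appears in every article, so it dominates the distribution,
# mirroring how "Data" tops the machine-generated column in the table.
```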
One weakness of the automatic tag generation is that irrelevant tags not found in most stop-word lists aren't caught. The most obvious examples are link shorteners such as shrd.by, which are common in top Twitter posts. On top of this, every permutation of big, data, science, mining, analytics, and social can show up, which is again partially attributable to data cleaning. To make the automatic tags more palatable, a more site-specific list of stop words would be required.
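Such a site-specific stop list could be a straightforward post-processing pass over the generated tags. The entries below (the site's own name, a link shortener, overly generic topic words) are illustrative assumptions, not the case study's actual list.

```python
# Hypothetical site-specific stop list: tags that are meaningless in the
# context of this particular site, beyond any general stop-word list.
SITE_STOP_TAGS = {"kdnuggets", "shrd.by", "big", "data", "news"}

def filter_tags(tags, stop_tags=SITE_STOP_TAGS):
    """Keep only tags whose lowercased form is not in the site stop list."""
    return [t for t in tags if t.lower() not in stop_tags]

filter_tags(["Big Data", "data", "TreeNet", "shrd.by", "Analytics"])
# Only exact (case-insensitive) matches are dropped, so a multi-word tag
# like "Big Data" survives even though "big" and "data" alone do not.
```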
With this in mind, I think the optimal workflow would involve some degree of human tagging augmented by machine tagging after the fact, perhaps rerunning the tagging algorithm offline every month on the whole corpus to draw connections between articles that might not occur to a human tagger at publication time.
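The hybrid workflow suggested above could merge the two sources with a simple priority rule: keep the human tags and append machine tags the editor didn't already use. The function name and cap on machine tags are my own assumptions for the sketch.

```python
def merged_tags(hand_tags, machine_tags, max_machine=3):
    """Hand-written tags take priority; append up to `max_machine`
    machine-generated tags not already used (case-insensitive)."""
    seen = {t.lower() for t in hand_tags}
    extra = [t for t in machine_tags if t.lower() not in seen][:max_machine]
    return list(hand_tags) + extra

# An editor's tags augmented with novel machine suggestions such as
# the "treenet" example from the article.
merged_tags(["Machine Learning", "R"], ["machine learning", "treenet", "data"])
```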
Conclusions – Algorithmia
The online editors work as you'd expect for each language, with the Python editor being quick and offering good autocomplete. Accessing the API created around your algorithm is easy, and it's nice that much of the sample generation is done for you. One part of the process wasn't entirely clear – turning an algorithm from private to public – but it was easy to figure out after trying to publish a new minor revision and seeing the option.
It will be interesting to watch this platform develop. I could imagine researchers publishing an online version of a paper's algorithm as a sort of demo; this would be great to see and would impress me as a reader. I could also see companies using Algorithmia to offload some work to the cloud. For example, if an algorithm needed to be run intermittently (so as not to necessitate its own hardware) but might need to be called on demand, Algorithmia would be a good option. Working with Algorithmia has been a pleasure and I'd definitely recommend it to friends.