Algorithmia Tested: Human vs Automated Tag Generation
Algorithmia, the marketplace for algorithms, can be a platform for hosting APIs to do a plethora of text analytics and information retrieval tasks. Automatic post tagging is done in this case study to demonstrate the effectiveness and ease-of-use of the platform.
Algorithmia is a new platform that acts as a sort of “marketplace for algorithms”. We’ve looked at Algorithmia from a high level previously, so today we’re going to get hands-on and see how Algorithmia actually works from an applications perspective.
Our goal is to compare KDnuggets tags on articles from 2014 (a subset of all KDnuggets tags) to the tags generated by an automatic tagging algorithm hosted on Algorithmia. We’ll first use Algorithmia as a client, sending scraped articles to the Algorithmia API and receiving generated tags. Then, we will use Algorithmia as a developer, writing an algorithm for frequent itemset generation to get multi-word tags. Then we will use this algorithm and see how the platform works end-to-end. Finally, we’ll look at the resulting tags and see how machine generated tags compare to our handmade tags.
Fig 1: Word Cloud from KDnuggets 2014 Hand-Generated Tags
Scraping KDnuggets and Accessing Algorithmia
Before we can generate tags, we need documents to tag. Using
beautifulsoup, this is a fairly straightforward task in Python. The gist for the code used to do this can be found here.
This code first scrapes the KDnuggets articles, and then passes the article to the AutoTagUrl algorithm on Algorithmia using
urllib2. The same tools you use to write API-aware applications in Python are the same things you’d use to target Algorithmia, making this whole process very easy if you have any experience writing these kinds of applications. Once these tags are generated, a TSV file is outputted with the links, hand-made tags, and machine-generated tags.
Writing on Algorithmia
The resulting tags looked good, but there was one glaring problem – they consisted only of single words. The
AutoTagUrlalgorithm depends on Mallet’s implementation of LDA, meaning that the output will take the form of the most representative word from each topic. What’s nice about Algorithmia is the emphasis on open source, which let me see that this is how the algorithm was performed.
Fig 2: Tag Cloud from Machine-Generated TagsNever fear, by using a frequent itemset algorithm, we can group together co-occurring terms to recover multi-word tags like “Data Mining”. To do this task, I simply implemented a naïve version of the Apriori algorithm in Python on Algorithmia. I used their online editor, which proved to be much more refined than most others that I’ve used previously. The way the algorithms act like
pipprojects makes handling dependencies easy as well. You can see my resulting algorithm implemented on Algorithmia here.
Fig 3: Tag Cloud from Machine-Generated Tags with Frequent ItemsetsAdding Itemsets
With our freshly minted algorithm written and published on Algorithmia, we now must access the algorithm to add the newer tags. This works in the same way as accessing the API provided by
AutoTagUrl. The script for taking the output from our scraping, passing it into the Apriori algorithm, and getting the new itemset-based tags, can be found here.