REST APIs and crawling offer two different ways to gather big data presentations from SlideShare, but they provide different results and lead to a very different view of the data. We examine why and find a useful data science lesson.
By Grant Marshall, Jan 2015.
In a previous article, Most Popular Slideshare Presentations on Big Data, I analyzed the top presentations about Big Data on the site. This was achieved by using Python to gather data directly from SlideShare’s API. Since then, however, it’s been pointed out that some clearly important slideshares about big data were excluded.
For example, this slideshare, What is Big Data, by Bernard Marr, was not included in the previous post despite having ten times the number of views of any of the slideshares that were included! So clearly, using the API to retrieve the top slideshares was missing something. But just how much do these results differ? Here are the top 15 big data presentations on SlideShare, found using the Google search “big data site:slideshare.net”:
Compared to the results returned by the SlideShare API, it is immediately clear that these presentations have far more views. In fact, the presentation returned from the SlideShare API with the most views/day has fewer views/day than the least popular presentation returned from the Google search! There is definitely something strange going on here.
Figure 1: Big Data Presentation Views/Week vs. Publication Date
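To make the views/day comparison concrete, here is a minimal sketch of how such a rate could be computed from a total view count and a publication date. The function name and the numbers are hypothetical and only for illustration; they are not taken from the original scripts or from real presentations.

```python
from datetime import date


def views_per_day(total_views, published, as_of):
    """Average daily views between the publication date and the as_of date."""
    # Count at least one day so a presentation published today
    # does not cause a division by zero.
    days_online = max((as_of - published).days, 1)
    return total_views / days_online


# Hypothetical numbers, for illustration only.
rate = views_per_day(120_000, date(2014, 1, 1), date(2015, 1, 1))
print(round(rate, 1))
```

Normalizing by time online matters here because older presentations have had longer to accumulate views, so raw totals alone would favor them.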
To understand what’s causing this difference, we must first dive into the SlideShare API. The results from the previous post were obtained using these scripts (information on running them is in the comments). The scripts contact the SlideShare API, request presentations whose tags match the provided query string (in this case, “big data”), and return the most viewed ones. The Google results, on the other hand, come from Google’s site-specific search.
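The behavior I expected from a tag search can be sketched locally, without calling any API: keep the presentations carrying the tag, then rank them by views. This is an assumed model of the expected behavior, not SlideShare’s actual implementation; the function name, the record layout, and the sample data are all hypothetical.

```python
def top_by_tag(presentations, tag, limit=15):
    """Return the most viewed presentations that carry the given tag."""
    wanted = tag.strip().lower()
    # Compare tags case-insensitively so "Big Data" matches "big data".
    tagged = [
        p for p in presentations
        if wanted in {t.strip().lower() for t in p["tags"]}
    ]
    return sorted(tagged, key=lambda p: p["views"], reverse=True)[:limit]


# Hypothetical records, for illustration only.
sample = [
    {"title": "A", "tags": ["Big Data", "analytics"], "views": 500},
    {"title": "B", "tags": ["marketing"], "views": 9000},
    {"title": "C", "tags": ["big data"], "views": 7000},
]
print([p["title"] for p in top_by_tag(sample, "big data")])  # ['C', 'A']
```

Under this model, any presentation tagged “big data” with enough views should appear in the results, which is exactly the assumption the rest of this post tests.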
So if the API is searching by tag, then a search for “big data” should return any post that has “big data” as a tag, right? Maybe these posts simply weren’t tagged properly? Nope! If you use SlideShare’s own API tool to look up the details of, say, “Big Data and Big Profits” by McKinsey on Marketing & Sales with “detailed” enabled, you will find that this presentation does have the tag “big data”.
With this in mind, it seems the SlideShare search API doesn’t return the results I’d expect. On top of this, if you choose to use the API to search by text, the results are very noisy and only tangentially related to big data. This serves as a lesson: with the increasing availability of REST APIs to access data on the web, it’s becoming more convenient to gather data. But before you choose to use a site’s API (versus a more old-school crawler), make sure the API works as expected. Otherwise, you may miss out on data you’re interested in.
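One simple way to check whether an API works as expected is to compare its results against a small reference list gathered another way, such as a site-specific search or a crawl. The sketch below assumes both sources have already been reduced to lists of titles. The helper names and the API titles are hypothetical; the two reference titles are the presentations mentioned in this post.

```python
def missing_from_api(api_titles, reference_titles):
    """Return the reference titles that do not appear in the API results."""
    seen = {t.strip().lower() for t in api_titles}
    return [t for t in reference_titles if t.strip().lower() not in seen]


def coverage(api_titles, reference_titles):
    """Fraction of reference titles that the API results contain."""
    if not reference_titles:
        return 1.0
    missing = missing_from_api(api_titles, reference_titles)
    return 1 - len(missing) / len(reference_titles)


# Reference titles come from this post; the API titles are made up.
reference = ["What is Big Data", "Big Data and Big Profits"]
api_results = ["Some Other Big Data Deck", "big data and big profits"]

print(missing_from_api(api_results, reference))
print(coverage(api_results, reference))
```

A low coverage score on even a handful of well-known items is a cheap early warning that the API is filtering or ranking in a way you did not anticipate.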