Top SlideShare Presentations on Big Data, updated

REST APIs and crawling offer two different ways to gather big data presentations from SlideShare, but they provide different results and lead to a very different view of the data. We examine why and find a useful data science lesson.

By Grant Marshall, Jan 2015.

In a previous article, Most Popular SlideShare Presentations on Big Data, I analyzed the top presentations about big data on the site, using Python to gather the data directly from SlideShare’s API. Since then, however, it has been pointed out that some clearly important big data presentations were excluded.

For example, this presentation, What is Big Data, by Bernard Marr was not included in the previous post despite having ten times the number of views of any of the presentations included before! So clearly, using the API to retrieve the top presentations was missing something. But just how much do these results differ? Here are the top 15 big data presentations on SlideShare, found using the Google search “big data”:

Title | Username | Date | Views | Downloads | Favorites
What is Big Data? | Bernard Marr | 2/28/2014 | 205,756 | 4,175 | 222
Big Data and Big Profits | McKinsey on Marketing & Sales | 11/26/2013 | 187,828 | 3,289 | 360
Big Data - The 5 Vs Everyone Must Know | Bernard Marr | 2/28/2014 | 157,846 | 3,570 | 251
Big Data in Real-Time at Twitter | nkallen | 4/19/2010 | 146,303 | 6,132 | 665
Big Data Analytics with Hadoop | Philippe Julio | 10/19/2009 | 114,477 | 2,399 | 552
Big Data - 25 Amazing Facts Everyone Should Know | Bernard Marr | 9/24/2014 | 98,858 | 2,196 | 178
Big data landscape v 3.0 - Matt Turck (FirstMark) | Matt Turck | 5/11/2014 | 72,695 | 1,786 | 99
Big Data and Advanced Analytics | McKinsey on Marketing & Sales | 7/10/2013 | 72,682 | 0 | 243
Big Data: The 6 Key Skills Every Business Needs | Bernard Marr | 11/25/2014 | 67,693 | 738 | 54
Big Data: The 4 Layers Everyone Must Know | Bernard Marr | 9/17/2014 | 65,122 | 1,644 | 116
Big Data and advanced analytics | McKinsey on Marketing & Sales | 12/19/2012 | 62,903 | 2 | 190
Big Data Trends | David Feinleib | 7/24/2012 | 59,276 | 4,132 | 219
Big data 2020 | HP Software Solutions | 8/12/2014 | 58,328 | 86 | 106
A Primer on Big Data for Business | Leslie Bradshaw | 3/2/2013 | 48,356 | 362 | 59
Big Data in Retail - Examples in Action | David Pittman | 1/24/2013 | 41,147 | 1,010 | 30

Compared to the results returned by the SlideShare API, these presentations immediately stand out for having far more views. In fact, the presentation returned from the SlideShare API with the most views/day has fewer views/day than the least popular presentation returned from the Google search! There is definitely something strange going on here.
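To make the views/day comparison concrete, here is a minimal Python sketch of how such a rate could be computed from the table above. The snapshot date is an assumption (the article only says Jan 2015), and the function name `views_per_day` is made up for illustration, so the exact rates are indicative only.

```python
from datetime import date

# Assumed snapshot date: the article is dated Jan 2015, but the exact day the
# view counts were collected is not stated, so this value is a placeholder.
SNAPSHOT = date(2015, 1, 15)


def views_per_day(views, published, snapshot=SNAPSHOT):
    """Average daily views between publication and the snapshot date."""
    days_online = (snapshot - published).days
    if days_online <= 0:
        raise ValueError("snapshot must be after the publication date")
    return views / days_online


# Three rows copied from the table above: (title, published, views).
rows = [
    ("What is Big Data?", date(2014, 2, 28), 205756),
    ("Big Data Analytics with Hadoop", date(2009, 10, 19), 114477),
    ("Big Data in Retail - Examples in Action", date(2013, 1, 24), 41147),
]

rates = {title: views_per_day(views, published) for title, published, views in rows}

# Print the presentations from highest to lowest daily rate.
for title, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{rate:8.1f} views/day  {title}")
```

Normalizing by time online matters because raw view counts favor older presentations; a rate puts a 2009 upload and a 2014 upload on the same footing.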

Figure 1: Big Data Presentation Views/Week vs. Publication Date

To understand what’s causing this difference, we must first dive into the SlideShare API. The results from the previous post were reached using these scripts (information on running them in the comments). The scripts contact the SlideShare API, request presentations whose tags match the provided query string (in this case, “big data”), and return the most viewed presentations. The Google results, on the other hand, are returned from Google’s site-specific search.
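The tag search described above might look roughly like the following Python sketch. It is not the original script. The endpoint, the parameter names (`what`, `sort`, `items_per_page`) and the SHA-1 signing scheme are recalled from SlideShare's API documentation and should be treated as assumptions to verify; the key, secret, and function name are placeholders. The sketch only builds the request URL and makes no network call.

```python
import hashlib
import time
from urllib.parse import urlencode

# Assumption: endpoint recalled from SlideShare's API docs; verify before use.
SEARCH_URL = "https://www.slideshare.net/api/2/search_slideshows"


def build_tag_search_url(api_key, shared_secret, query, timestamp=None):
    """Build a signed URL for a tag search sorted by most viewed.

    Parameter names are assumptions based on the API documentation.
    """
    ts = str(int(time.time()) if timestamp is None else timestamp)
    # Assumed signing scheme: SHA-1 hex digest of shared secret + timestamp.
    signature = hashlib.sha1((shared_secret + ts).encode("utf-8")).hexdigest()
    params = {
        "api_key": api_key,
        "ts": ts,
        "hash": signature,
        "q": query,
        "what": "tag",           # search tags rather than full text
        "sort": "mostviewed",    # order results by view count
        "items_per_page": 15,
    }
    return SEARCH_URL + "?" + urlencode(params)


# Placeholder credentials and a fixed timestamp so the output is reproducible.
url = build_tag_search_url(
    "YOUR_API_KEY", "YOUR_SHARED_SECRET", "big data", timestamp=1420070400
)
print(url)
```

Switching `what` from `tag` to `text` would, under the same assumptions, request a full-text search instead.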

So if the API is searching by tag, and we search for “big data” and a presentation has “big data” as a tag, we should get it, right? Maybe these presentations simply weren’t tagged properly? Nope! If you use SlideShare’s own API tool and look up the details of, say, “Big Data and Big Profits” by McKinsey on Marketing & Sales with “detailed” enabled, you will find that this presentation does have the tag “big data”.

With this in mind, it seems the SlideShare search API doesn’t return the results I’d expect. On top of this, if you choose to use the API to search by text, the results are very noisy, with many only tangentially related to big data. This serves as a lesson: with the increasing availability of REST APIs to access data on the web, it’s becoming more convenient to gather data. But before you choose to use a site’s API (versus a more old-school crawler), make sure the API works as expected. Otherwise, you may miss out on data you’re interested in.
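One simple way to check that an API "works as expected" is to compare its results against a second source, such as a crawl or a site-specific search, and list what the API missed. The Python sketch below illustrates the idea. The function names are made up for illustration, the crawl titles come from the table above, and the API result list is a made-up placeholder, not real API output.

```python
def normalize(title):
    """Lowercase and collapse whitespace so near-identical titles match."""
    return " ".join(title.lower().split())


def missing_from_api(api_titles, crawl_titles):
    """Return crawl titles that do not appear in the API results."""
    seen = {normalize(title) for title in api_titles}
    return [title for title in crawl_titles if normalize(title) not in seen]


# Titles taken from the Google-search table above.
crawl_titles = [
    "What is Big Data?",
    "Big Data and Big Profits",
    "Big Data Trends",
]

# Made-up placeholder standing in for whatever the API returned.
api_titles = [
    "Hypothetical API result A",
    "big data  trends",
]

missing = missing_from_api(api_titles, crawl_titles)
for title in missing:
    print("Missing from API results:", title)
```

A long list of missing items, especially popular ones, is a signal to investigate the API's search behavior before trusting it as the only data source.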