Top SlideShare Presentations on Big Data, updated
REST APIs and crawling offer two different ways to gather big data presentations from SlideShare, but they provide different results and lead to a very different view of the data. We examine why and find a useful data science lesson.
In a previous article, Most Popular Slideshare Presentations on Big Data, I analyzed the top Big Data presentations on the site, using Python to gather the data directly from SlideShare’s API. Since then, however, it has been pointed out that some clearly important slideshares about big data were excluded.
For example, this slideshare, What is Big Data, by Bernard Marr was not included in the previous post, despite having ten times as many views as any of the slideshares that were included! So clearly, using the API to retrieve the top slideshares was missing something. But just how much do the results differ? Here are the top 15 big data presentations on SlideShare, found using the Google search “big data site:slideshare.net”:
| Title | Author | Published | Views | | |
|---|---|---|---|---|---|
| What is Big Data? | Bernard Marr | 2/28/2014 | 205,756 | 4175 | 222 |
| Big Data and Big Profits | McKinsey on Marketing & Sales | 11/26/2013 | 187,828 | 3289 | 360 |
| Big Data - The 5 Vs Everyone Must Know | Bernard Marr | 2/28/2014 | 157,846 | 3570 | 251 |
| Big Data in Real-Time at Twitter | nkallen | 4/19/2010 | 146,303 | 6132 | 665 |
| Big Data Analytics with Hadoop | Philippe Julio | 10/19/2009 | 114,477 | 2399 | 552 |
| Big Data - 25 Amazing Facts Everyone Should Know | Bernard Marr | 9/24/2014 | 98,858 | 2196 | 178 |
| Big data landscape v 3.0 - Matt Turck (FirstMark) | Matt Turck | 5/11/2014 | 72,695 | 1786 | 99 |
| Big Data and Advanced Analytics | McKinsey on Marketing & Sales | 7/10/2013 | 72,682 | 0 | 243 |
| Big Data: The 6 Key Skills Every Business Needs | Bernard Marr | 11/25/2014 | 67,693 | 738 | 54 |
| Big Data: The 4 Layers Everyone Must Know | Bernard Marr | 9/17/2014 | 65,122 | 1644 | 116 |
| Big Data and advanced analytics | McKinsey on Marketing & Sales | 12/19/2012 | 62,903 | 2 | 190 |
| Big Data Trends | David Feinleib | 7/24/2012 | 59,276 | 4132 | 219 |
| Big data 2020 | HP Software Solutions | 8/12/2014 | 58,328 | 86 | 106 |
| A Primer on Big Data for Business | Leslie Bradshaw | 3/2/2013 | 48,356 | 362 | 59 |
| Big Data in Retail - Examples in Action | David Pittman | 1/24/2013 | 41,147 | 1,010 | 30 |
Compared with the results returned by the SlideShare API, these presentations immediately stand out for having far more views. In fact, the API result with the most views/day has fewer views/day than the least popular presentation returned from the Google search! There is definitely something strange going on here.
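The views/day figure used in this comparison is simple arithmetic: total views divided by the number of days a presentation has been online. Here is a minimal sketch using three rows from the table above. The date on which the view counts were collected is not stated in this post, so `AS_OF` is a placeholder assumption, and the helper name `views_per_day` is my own.

```python
from datetime import date


def views_per_day(views: int, published: date, as_of: date) -> float:
    """Average daily views between the publication date and as_of."""
    days_online = (as_of - published).days
    if days_online <= 0:
        raise ValueError("as_of must be later than the publication date")
    return views / days_online


# Three rows from the table above: (title, publication date, total views).
google_rows = [
    ("What is Big Data?", date(2014, 2, 28), 205_756),
    ("Big Data in Real-Time at Twitter", date(2010, 4, 19), 146_303),
    ("Big Data in Retail - Examples in Action", date(2013, 1, 24), 41_147),
]

# ASSUMPTION: placeholder collection date; the real one is not given here.
AS_OF = date(2015, 2, 15)

rates = {title: views_per_day(v, pub, AS_OF) for title, pub, v in google_rows}

# The slowest of these rows is the bar an API result would have to clear.
slowest = min(rates, key=rates.get)
for title, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{rate:8.1f} views/day  {title}")
```

Running the same function over the API results with the same `AS_OF` date gives the two sets of rates that the comparison above relies on.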
Figure 1: Big Data Presentation Views/Week vs. Publication Date
To understand what’s causing this difference, we must first dive into the SlideShare API. The results from the previous post were produced using these scripts (information on running them is in the comments). The scripts contact the SlideShare API, request presentations whose tags match the provided query string (in this case, “big data”), and return the most viewed ones. The Google results, on the other hand, come from Google’s site-specific search.
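The scripts themselves are linked rather than reproduced, so the following is not the original code. It is a rough sketch of the kind of tag search described, and it only builds the request URL without sending it. The endpoint, the `q`/`what`/`sort` parameter names, and the SHA-1 signing scheme reflect my understanding of the SlideShare API and should be checked against its documentation; `build_search_url` and the credentials are hypothetical.

```python
import hashlib
import time
from urllib.parse import urlencode

# ASSUMPTION: search endpoint as I understand it from the SlideShare API docs.
SEARCH_ENDPOINT = "https://www.slideshare.net/api/2/search_slideshows"


def build_search_url(api_key, shared_secret, query,
                     what="tag", sort="mostviewed", timestamp=None):
    """Build a signed search URL (hypothetical helper; sends nothing)."""
    ts = str(int(time.time()) if timestamp is None else timestamp)
    # ASSUMPTION: requests are signed with SHA-1 of shared secret + timestamp.
    signature = hashlib.sha1((shared_secret + ts).encode("utf-8")).hexdigest()
    params = {
        "api_key": api_key,
        "ts": ts,
        "hash": signature,
        "q": query,    # the query string, e.g. "big data"
        "what": what,  # "tag" to match tags, "text" for full-text search
        "sort": sort,  # ask for the most viewed presentations first
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# Placeholder credentials, for illustration only.
url = build_search_url("YOUR_API_KEY", "YOUR_SHARED_SECRET", "big data")
print(url)
```

Switching `what` from `"tag"` to `"text"` is the only change needed to try the text search instead of the tag search.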
So if the API searches by tag, then searching for “big data” should return any post that has “big data” as a tag, right? Maybe these posts simply weren’t tagged properly? Nope! If you use SlideShare’s own API tool and look up the details of, say, “Big Data and Big Profits” by McKinsey on Marketing & Sales with “detailed” enabled, you will find that this presentation does have the tag “big data”.
With this in mind, it seems the SlideShare search API doesn’t return the results I’d expect. On top of this, if you choose to use the API to search by text, the results are very noisy and only tangentially related to big data. This serves as a lesson: with the increasing availability of REST APIs to access data on the web, it’s becoming more convenient to gather data. But before you choose to use a site’s API (versus a more old-school crawler), make sure the API works as expected. Otherwise, you may miss out on data you’re interested in.
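One cheap way to check whether an API “works as expected” is to compare its results against a small reference list gathered another way, such as the Google search used above. Here is a minimal sketch, assuming both sources have been reduced to lists of titles. The helper names are my own, and the API titles shown are made-up placeholders, not real results.

```python
def normalize(title: str) -> str:
    """Lower-case a title and collapse whitespace so trivial differences don't count."""
    return " ".join(title.lower().split())


def missing_from_api(api_titles, reference_titles):
    """Return the reference titles that the API results do not contain."""
    seen = {normalize(t) for t in api_titles}
    return [t for t in reference_titles if normalize(t) not in seen]


# Reference titles taken from the Google search table above.
reference = [
    "What is Big Data?",
    "Big Data and Big Profits",
    "Big Data in Real-Time at Twitter",
]

# ASSUMPTION: placeholder titles standing in for whatever the API returned.
api_results = ["Placeholder Deck One", "Placeholder Deck Two"]

gaps = missing_from_api(api_results, reference)
print(f"{len(gaps)} of {len(reference)} reference titles are missing from the API results")
```

If well-known presentations show up in `gaps`, that is a signal to inspect the API’s search behavior before trusting its output.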