Top SlideShare Presentations on Big Data, updated
REST APIs and crawling offer two different ways to gather big data presentations from SlideShare, but they provide different results and lead to a very different view of the data. We examine why and find a useful data science lesson.
By Grant Marshall, Jan 2015.
In a previous article, Most Popular Slideshare Presentations on Big Data, I analyzed the top presentations about big data on the site, using Python to gather the data directly from SlideShare’s API. Since then, however, it has been pointed out that some clearly important slideshares about big data were excluded.
For example, the slideshare What is Big Data? by Bernard Marr was not included in the previous post, despite having ten times as many views as any of the slideshares that were included! So clearly, using the API to retrieve the top slideshares was missing something. But just how much do the results differ? Here are the top 15 big data presentations on SlideShare, found using the Google search “big data site:slideshare.net”:
Title | Username | Date | Views | Downloads | Favorites |
What is Big Data? | Bernard Marr | 2/28/2014 | 205,756 | 4,175 | 222 |
Big Data and Big Profits | McKinsey on Marketing & Sales | 11/26/2013 | 187,828 | 3,289 | 360 |
Big Data - The 5 Vs Everyone Must Know | Bernard Marr | 2/28/2014 | 157,846 | 3,570 | 251 |
Big Data in Real-Time at Twitter | nkallen | 4/19/2010 | 146,303 | 6,132 | 665 |
Big Data Analytics with Hadoop | Philippe Julio | 10/19/2009 | 114,477 | 2,399 | 552 |
Big Data - 25 Amazing Facts Everyone Should Know | Bernard Marr | 9/24/2014 | 98,858 | 2,196 | 178 |
Big data landscape v 3.0 - Matt Turck (FirstMark) | Matt Turck | 5/11/2014 | 72,695 | 1,786 | 99 |
Big Data and Advanced Analytics | McKinsey on Marketing & Sales | 7/10/2013 | 72,682 | 0 | 243 |
Big Data: The 6 Key Skills Every Business Needs | Bernard Marr | 11/25/2014 | 67,693 | 738 | 54 |
Big Data: The 4 Layers Everyone Must Know | Bernard Marr | 9/17/2014 | 65,122 | 1,644 | 116 |
Big Data and advanced analytics | McKinsey on Marketing & Sales | 12/19/2012 | 62,903 | 2 | 190 |
Big Data Trends | David Feinleib | 7/24/2012 | 59,276 | 4,132 | 219 |
Big data 2020 | HP Software Solutions | 8/12/2014 | 58,328 | 86 | 106 |
A Primer on Big Data for Business | Leslie Bradshaw | 3/2/2013 | 48,356 | 362 | 59 |
Big Data in Retail - Examples in Action | David Pittman | 1/24/2013 | 41,147 | 1,010 | 30 |
Compared to the results returned by the SlideShare API, these presentations immediately stand out for having far more views. In fact, the API result with the most views/day has fewer views/day than the least popular presentation returned by the Google search! Something strange is definitely going on here.
Figure 1: Big Data Presentation Views/Week vs. Publication Date
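To make the views/day comparison concrete, here is a minimal sketch that normalizes a view count by how long a presentation has been online. The function name `views_per_day` and the snapshot date are assumptions for illustration: the article does not state when the counts were collected, so the publication date is used as a stand-in.

```python
from datetime import date

# Assumed snapshot date: the article does not say when the counts were
# collected, so its publication date is used purely for illustration.
SNAPSHOT = date(2015, 1, 18)


def views_per_day(views: int, uploaded: date, snapshot: date = SNAPSHOT) -> float:
    """Average daily views between the upload date and the snapshot date."""
    days_online = (snapshot - uploaded).days
    if days_online <= 0:
        raise ValueError("snapshot must be later than the upload date")
    return views / days_online


# Two rows from the table: the highest and the lowest total view counts.
most_viewed = views_per_day(205_756, date(2014, 2, 28))   # "What is Big Data?"
least_viewed = views_per_day(41_147, date(2013, 1, 24))   # "Big Data in Retail - Examples in Action"

print(f"{most_viewed:.0f} vs {least_viewed:.0f} views/day")
```

Dividing by age matters because an older presentation has had more time to accumulate views, so raw totals alone can mislead.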
To understand what’s causing this difference, we must first dive into the SlideShare API. The results from the previous post were reached using these scripts (information on running them in the comments). They contact the SlideShare API, request posts matching the provided query string by tag (in this case, “big data”), and return the most viewed presentations. The Google results, on the other hand, are returned from Google’s site-specific search.
So if the API is searching by tag, and we search for “big data” and a post has “big data” as a tag, we should get it, right? Maybe these posts simply weren’t tagged properly? Nope! If you use SlideShare’s own API tool and search for the details of, say, “Big Data and Big Profits” by McKinsey on Marketing & Sales with “detailed” enabled, you will find this presentation does have the tag “big data”.
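The tag-based request described above might be assembled roughly as follows. This is a sketch, not the scripts from the previous post: the endpoint, the parameter names (`what`, `sort`, `detailed`), and the SHA-1 signing scheme are written from my recollection of SlideShare's v2 API documentation and should be checked against the official reference. The helper name `build_tag_search_url` and the placeholder credentials are hypothetical. No network request is sent.

```python
import hashlib
import time
from urllib.parse import urlencode

# Assumed endpoint, recalled from SlideShare's v2 API documentation.
SEARCH_URL = "https://www.slideshare.net/api/2/search_slideshows"


def build_tag_search_url(api_key, shared_secret, query, ts=None):
    """Build a tag search URL sorted by view count (no request is sent)."""
    ts = int(time.time()) if ts is None else int(ts)
    # Assumed signing scheme: SHA-1 of the shared secret followed by the timestamp.
    signature = hashlib.sha1(f"{shared_secret}{ts}".encode("utf-8")).hexdigest()
    params = {
        "api_key": api_key,
        "ts": ts,
        "hash": signature,
        "q": query,
        "what": "tag",         # search tags rather than full text
        "sort": "mostviewed",  # most viewed presentations first
        "detailed": 1,         # ask for tags and counts in the response
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


# Placeholder credentials and a fixed timestamp, for illustration only.
url = build_tag_search_url("YOUR_API_KEY", "YOUR_SHARED_SECRET", "big data", ts=1421539200)
print(url)
```

Switching `what` from `"tag"` to `"text"` would correspond to the text search mentioned in the article.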
With this in mind, it seems the SlideShare search API doesn’t exactly return the results I’d expect. On top of this, if you use the API to search by text instead, the results are very noisy and only tangentially related to big data. This serves as a lesson: with the increasing availability of REST APIs to access data on the web, it’s becoming more convenient to gather data. But before you choose to use a site’s API (versus a more old-school crawler), make sure the API works as expected. Otherwise, you may miss out on data you’re interested in.
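One simple way to check that an API "works as expected" is to compare its results against a small set of items you already know should appear, such as titles found through a site-specific web search. The sketch below assumes both result sets are available as plain lists of titles. The function name `missing_from_api` and the sample API results are hypothetical.

```python
def missing_from_api(api_titles, reference_titles):
    """Return the reference titles that do not appear in the API results."""
    def normalize(title):
        # Compare case-insensitively and ignore extra whitespace.
        return " ".join(title.lower().split())

    seen = {normalize(title) for title in api_titles}
    return [title for title in reference_titles if normalize(title) not in seen]


# Reference titles are taken from the table in this article;
# the API results are made up for illustration.
reference = ["What is Big Data?", "Big Data and Big Profits", "Big Data Trends"]
api_results = ["big data trends", "Some Other Big Data Deck"]

missing = missing_from_api(api_results, reference)
print(missing)
```

A non-empty result is a signal to investigate before trusting the API as the sole data source.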
Related:
- Most Popular Slideshare Presentations on Big Data
- Debunking Big Data Myths. Again
- Data Visualization of Census Data with R
Published on January 18, 2015 by