Data Science Tools – Are Proprietary Vendors Still Relevant?
We examine and quantify the dramatic impact of open source tools like R and Python on SAS, IBM, Microsoft, and other proprietary Data Science vendors. We also investigate how open source tools are faring against each other, which are growing and which are falling, and look at the R versus Python debate.
Python versus R: Tools democratization in action?
Python is a general purpose programming language. Growth in Python search volume, in and of itself, is not indicative of increased Python usage within data science teams. Search data for Pandas, NumPy, and scikit-learn, popular Python data analytics and machine learning add-on packages, more accurately reflects adoption. And here, the growth has been spectacular: a 4-year CAGR for Pandas of 45%, and a 2-year CAGR for scikit-learn of 58%.
While search volume for R grew at a faster pace than Python’s over the full period from 2008 (a CAGR of 8.8% versus 5.5% for Python), this momentum appears to have tapered by comparison: over the two years to 2015, R’s CAGR was 5.9% versus Python’s 10.5%. As noted above, the growth in interest in Python’s analytics packages far outpaced the growth in Python itself.
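The CAGR figures quoted above follow the standard compound annual growth rate formula. As a minimal sketch, assuming you have a search-interest index at the start and end of a window (the function name and the sample values below are hypothetical and are not taken from the study's data), it could be computed like this:

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two observations `years` apart.

    Returns a fraction, e.g. 0.10 for 10% per year.
    """
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Hypothetical index values for illustration only: an index that rises
# from 100 to 121 over two years has grown at 10% per year, compounded.
two_year_growth = cagr(100.0, 121.0, 2)
print(f"{two_year_growth:.1%}")
```

Comparing a long-window CAGR with a short-window CAGR for the same series, as the article does for R and Python, is one simple way to see whether growth is accelerating or tapering.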
The growth in use of R- and Python-related Stack Overflow tags reveals a similar pattern to that of the Google Trends data. These trends may reflect a democratization of advanced analytics tools beyond data scientists. Software engineering teams who do not use R are increasingly using powerful, high-quality tools for data analysis and machine learning.
Spark ignites and Scala follows
While Apache Spark did not feature in Google Trends data prior to 2013, year-on-year growth for 2014-2015 was a phenomenal 121%. Scala, tied closely to the success of Spark, has seen accelerated adoption over the period: from a 4-year CAGR of 8.1% to a year-over-year growth rate of 12.4% for 2014-2015.
Proprietary versus open source: correlation with causation?
When looking at the Google Trends charts, we noticed an interesting relationship between search volume for proprietary and open source tools. In 2010, searches for tools from the three largest advanced analytics vendors in our study appear to have reached an inflection point. So did search volume for R, and Pandas followed a year later.
We tested the time series data for correlation, and found a strong inverse relationship between R and the proprietary vendors.
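The article does not say which correlation measure was used. As a rough sketch, assuming a plain Pearson coefficient over two equally spaced series (the function name and the sample series below are hypothetical, chosen only to show what a perfectly inverse relationship looks like), such a test might be written as:

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical search-interest series, not the study's data:
# one rises steadily while the other falls steadily.
open_source = [10, 20, 30, 40, 50]
proprietary = [50, 40, 30, 20, 10]
r = pearson_r(open_source, proprietary)
print(round(r, 3))
```

A value near -1 indicates a strong inverse relationship. Two series that merely trend in opposite directions over time will often show this, which is one reason such a result cannot establish causation on its own.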
While this result does not imply causation, the actions of proprietary vendors over the last few years provide insight into the impact of open source tools on the advanced analytics market.
Some vendors have invested heavily in supporting open source tools. Microsoft acquired Revolution Analytics, the developers of a high-performance distribution of R, while others, SAS included, have integrated their products with R and Python. This coopetition extends to the cloud and SaaS services discussed above. Many of the analytics services available on Microsoft Azure support both Python and R.
While Microsoft Azure analytics services and IBM Watson are seeing some usage, it’s probable that this reflects adoption by existing customers and engineering teams, rather than by data science teams. The jury is still out on whether these products will see widespread adoption by data scientists.
For a more detailed account of our findings, visit the Data Science Blog. All Jupyter Python notebooks, data extraction scripts, and raw data may be found in this Domino project.
Bio: Daniel Chalef is a marketer at Domino Data Lab, a central hub for enterprise data science teams. Prior to Domino, he co-founded and built two enterprise marketing technology companies. He’s also a lapsed software engineer and builds robots for fun.