Integrating Python and R into a Data Analysis Pipeline, Part 1
The first in a series of blog posts that outlines the basic strategy for integrating Python and R, runs through the different steps involved in this process, and gives a real example of how and why you would want to do this.
By Chris Musselle and Kate Ross-Smith, (Mango Solutions)
For a conference dedicated to the R language, EARL London 2015 saw a surprising number of discussions about Python. I like to think that at least some of this was down to the fact that, the day before the conference, we ran a 3-hour workshop outlining various strategies for integrating Python and R.
This is the first in a series of three blog posts that:
- outline the basic strategy for integrating Python and R;
- run through the different steps involved in this process; and
- give a real example of how and why you would want to do this.
This post kicks everything off by:
- covering the reasons why you may want to include both languages in a pipeline;
- introducing ways of running R and Python from the command line; and
- showing how you can accept inputs as arguments and write outputs to various file formats.
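As a taste of what is to come, here is a minimal Python sketch of that last point: a script that accepts its input path as a command-line argument and writes its results to either CSV or JSON depending on the output filename. The argument names and file layout are illustrative, not a fixed convention.

```python
import argparse
import csv
import json
import os

def parse_args(argv=None):
    """Parse the command-line arguments for this pipeline step."""
    parser = argparse.ArgumentParser(description="Example pipeline step")
    parser.add_argument("input", help="path to the input data file")
    parser.add_argument("--output", default="out.csv",
                        help="path for the results (.csv or .json)")
    return parser.parse_args(argv)

def write_output(records, path):
    """Write a list of dicts to CSV or JSON, chosen by file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".json":
        with open(path, "w") as f:
            json.dump(records, f, indent=2)
    elif ext == ".csv":
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
            writer.writeheader()
            writer.writerows(records)
    else:
        raise ValueError("unsupported output format: %s" % ext)
```

Run from a shell this would look something like `python step.py data.csv --output results.json`; the equivalent on the R side can be built with `commandArgs()` or the optparse package.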
Why “And” not “Or”?
A quick internet search for articles about “R Python” shows that only 2 of the top 10 results discuss the merits of using both R and Python rather than pitting them against each other. This is understandable; from their inception, both have had very distinctive strengths and weaknesses. Historically, though, the split has been one of educational background: statisticians have preferred the approach that R takes, whereas programmers have made Python their language of choice. However, with the growing breed of data scientists, this distinction is blurring:
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician. — twitter @josh_wills
With the wealth of distinct library resources provided by each language, there is a growing need for data scientists to be able to leverage their relative strengths. For example:
Python tends to outperform R in such areas as:
- Web scraping and crawling: though rvest has simplified web scraping and crawling within R, Python’s Beautiful Soup and Scrapy are more mature and deliver more functionality.
- Database connections: though R has a large number of options for connecting to databases, Python’s SQLAlchemy offers this in a single package and is widely used in production environments.
Whereas R outperforms Python in such areas as:
- Statistical analysis options: though Python’s combination of SciPy, pandas and statsmodels offers a great set of statistical analysis tools, R is built specifically around statistical analysis applications and so provides a much larger collection of such tools.
- Interactive graphics/dashboards: Bokeh, Plotly and Intuitics have all recently extended the use of Python graphics onto web browsers, but getting an example up and running using Shiny and shinydashboard in R is faster, and often requires less code.
Further, as data science teams now have a relatively wide range of skills, the language of choice for any application may come down to prior knowledge and experience. For some applications – especially in prototyping and development – it is faster for people to use the tool that they already know.
Flat File “Air Gap” Strategy
In this series of posts we are going to consider the simplest strategy for integrating the two languages, and step through it with some examples. Using a flat file as an air gap between the two languages involves the following steps.
- Refactor your R and Python scripts to be executable from the command line and accept command line arguments.
- Output the shared data to a common file format.
- Execute one language from the other, passing in arguments as required.
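The third step above can be sketched in Python with the standard-library subprocess module. This assumes `Rscript` is available on the PATH; the script name `analysis.R` and its arguments are hypothetical placeholders.

```python
import subprocess

def build_r_command(script_path, args):
    """Construct the command line for invoking an R script via Rscript.

    Assumes Rscript is on the PATH; script_path and args are whatever
    the R side has been refactored to accept from the command line.
    """
    return ["Rscript", script_path] + [str(a) for a in args]

def run_r_script(script_path, args):
    """Execute the R script as a child process, capturing its output.

    check=True raises CalledProcessError if the R script exits non-zero,
    so failures in the R half of the pipeline surface in Python.
    """
    cmd = build_r_command(script_path, args)
    return subprocess.run(cmd, capture_output=True, text=True, check=True)
```

The mirror image from R would use `system2("python", c("script.py", args))`. In both directions, the arguments passed are typically just the paths of the intermediate files.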
Pros of this approach:
- It is the simplest method, so commonly the quickest.
- You can view the intermediate outputs easily.
- Parsers already exist for many common file formats: CSV, JSON, YAML.
Cons:
- You need to agree upfront on a common schema or file format.
- It can become cumbersome to manage intermediate outputs and paths if the pipeline grows.
- Reading and writing to disk can become a bottleneck if the data becomes large.
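To make the schema-agreement point concrete, here is one hedged way the Python side might enforce an upfront contract before handing a JSON file across the air gap. The key set `{"id", "value"}` is purely illustrative; the point is that both sides validate against the same agreed schema.

```python
import json

# Illustrative contract: the keys both languages have agreed on upfront.
SCHEMA_KEYS = {"id", "value"}

def write_intermediate(records, path):
    """Validate records against the agreed schema, then write them as JSON."""
    for rec in records:
        if set(rec) != SCHEMA_KEYS:
            raise ValueError("record does not match agreed schema: %r" % rec)
    with open(path, "w") as f:
        json.dump(records, f)

def read_intermediate(path):
    """Read the intermediate file back; R would do the mirror read
    with a JSON parser such as the jsonlite package."""
    with open(path) as f:
        return json.load(f)
```

Catching a schema mismatch at write time, rather than deep inside the other language's script, keeps the air-gap boundary easy to debug.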