Integrating Python and R into a Data Analysis Pipeline, Part 1
The first in a series of blog posts that outline the basic strategy for integrating Python and R, run through the different steps involved in this process, and give a real example of how and why you would want to do this.
Command Line Scripting
Running scripts from the command line in a Windows or Linux terminal works in much the same way for both R and Python. The command to be run is broken down into the following parts:
<command_to_run> <path_to_script> <any_additional_arguments>
- <command_to_run> is the executable to run (Rscript for R code and python for Python code),
- <path_to_script> is the full or relative file path to the script being executed. Note that if there are any spaces in the path name, the whole file path must be enclosed in double quotes.
- <any_additional_arguments> is a list of space-delimited arguments passed to the script itself. Note that these will be passed in as strings.
So for example, an R script is executed by opening up a terminal environment and running the following:
Rscript path/to/myscript.R arg1 arg2 arg3
A Few Gotchas
- For the commands Rscript and python to be found, these executables must already be on your PATH. Otherwise the full path to their location on your file system must be supplied.
- Path names with spaces create problems, especially on Windows, and so must be enclosed in double quotes so they are recognised as a single file path.
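When a pipeline launches these commands from code rather than by hand, both gotchas can be checked up front. The following is a minimal Python sketch rather than part of the workflow described here: the helper names is_on_path and build_command are hypothetical, and shlex.join (Python 3.8 or later) applies POSIX shell quoting, so the printed form uses single quotes instead of the double quotes mentioned above.

```python
import shlex
import shutil


def is_on_path(executable):
    """Return True if the executable can be found on the PATH."""
    return shutil.which(executable) is not None


def build_command(executable, script_path, *args):
    """Build the command as a list: executable, script path, then arguments.

    Every argument is converted to a string, because that is how the
    script will receive it anyway.
    """
    return [executable, script_path] + [str(arg) for arg in args]


# A script path containing a space stays a single list element.
command = build_command("Rscript", "path/to my/myscript.R", "arg1", 2, 3.5)

# Show the shell-quoted form of the command (POSIX quoting rules).
print(shlex.join(command))

# Report whether each interpreter can be found on this machine.
for name in ("Rscript", "python"):
    print(name, "found on PATH:", is_on_path(name))
```

Keeping the command as a list means the path with a space never has to be quoted by hand; quoting only matters when the command is written out as a single string for a shell.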
Accessing Command Line Arguments in R
In the above example, where arg1, arg2 and arg3 are the arguments passed to the R script being executed, these are accessible using the commandArgs function.
# Fetch command line arguments
myArgs <- commandArgs(trailingOnly = TRUE)
# myArgs is a character vector of all arguments
By setting trailingOnly = TRUE, the vector myArgs contains only the arguments that you added on the command line. If left as FALSE (the default), the vector also includes other arguments, such as the path to the script that was just executed.
Accessing Command Line Arguments in Python
For a Python script executed by running the following on the command line
python path/to/myscript.py arg1 arg2 arg3
the arguments arg1, arg2 and arg3 can be accessed from within the Python script by first importing the sys module. This module holds parameters and functions that are system specific; however, we are only interested here in the argv attribute, which is a list of all the arguments passed to the script currently being executed. The first element in this list is always the path to the script being executed, as it was given on the command line.
# Fetch command line arguments
import sys
my_args = sys.argv
# my_args is a list where the first element is the script that was executed.
If you only wish to keep the arguments passed to the script, you can use list slicing to select all but the first element.
# Using a slice, selects all but the first element
my_args = sys.argv[1:]
As with the above example for R, recall that all arguments are passed in as strings, and so will need converting to the expected types as necessary.
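To make the type conversion concrete, here is a small sketch. It assumes a hypothetical script that expects exactly three arguments (an input file path, a row count and a threshold); neither the argument names nor the helper convert_arguments come from the examples above.

```python
def convert_arguments(raw_args):
    """Convert [input_path, n_rows, threshold] from strings to typed values."""
    if len(raw_args) != 3:
        raise ValueError("expected 3 arguments, got %d" % len(raw_args))
    input_path, n_rows, threshold = raw_args
    # int() and float() raise ValueError if the text is not a valid number.
    return input_path, int(n_rows), float(threshold)


# In a real script this list would be sys.argv[1:].
example_args = ["data/input.csv", "100", "0.05"]
input_path, n_rows, threshold = convert_arguments(example_args)
print(input_path, n_rows, threshold)
```

For anything beyond a handful of positional arguments, the argparse module in the standard library can do the same conversions and produce usage messages as well.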
Writing Outputs to a File
You have a few options when sharing data between R and Python via an intermediate file. In general for flat files, CSVs are a good format for tabular data, while JSON or YAML are better suited to less structured data (or metadata), which may contain a variable number of fields or nested data structures.
All these are very common data serialisation formats, and parsers already exist in both languages. In R the following packages are recommended for each format:
And in Python:
- csv for CSV files,
- json for JSON files,
- PyYAML for YAML files.
The csv and json modules are part of the Python standard library, distributed with Python itself, whereas PyYAML will need installing separately. All R packages will also need installing in the usual way.
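As an illustration of the Python side, the sketch below writes a small table to CSV and some metadata to JSON using only the standard library, then reads both back. The file names, column names and values are made up for the example, and a temporary directory stands in for wherever the pipeline would keep its intermediate files.

```python
import csv
import json
import os
import tempfile

# Hypothetical tabular data and accompanying metadata.
rows = [
    {"id": 1, "name": "alpha", "score": 0.5},
    {"id": 2, "name": "beta", "score": 1.25},
]
metadata = {"source": "example", "columns": ["id", "name", "score"], "n_rows": len(rows)}

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "results.csv")
json_path = os.path.join(out_dir, "metadata.json")

# Write the table as CSV with a header row.
with open(csv_path, "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["id", "name", "score"])
    writer.writeheader()
    writer.writerows(rows)

# Write the nested metadata as JSON.
with open(json_path, "w") as handle:
    json.dump(metadata, handle, indent=2)

# Read both files back, as the receiving script would.
with open(csv_path, newline="") as handle:
    rows_read = list(csv.DictReader(handle))

with open(json_path) as handle:
    metadata_read = json.load(handle)

print(rows_read[0])
print(metadata_read["n_rows"])
```

Note that the csv module returns every field as a string, so numeric columns need converting after reading, whereas JSON preserves numbers, lists and nesting.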
So data can be passed between R and Python (in either direction) within a single pipeline by:
- using the command line to transfer arguments, and
- transferring data through a commonly-structured flat file.
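The two steps above can be sketched end to end. So that the example runs without R installed, a small Python script stands in for the downstream R script here; this is an assumption made purely for illustration. In a real pipeline the command list would start with Rscript and the path to an R script instead. The file names and the summary fields are likewise hypothetical.

```python
import csv
import json
import os
import subprocess
import sys
import tempfile
import textwrap

# Stand-in for the downstream script. In a real pipeline this would be an
# R script; a Python script is used here only so the sketch runs anywhere.
CHILD_SCRIPT = textwrap.dedent(
    """
    import csv
    import json
    import sys

    input_path, output_path, column = sys.argv[1:]
    with open(input_path, newline="") as handle:
        values = [float(row[column]) for row in csv.DictReader(handle)]
    with open(output_path, "w") as handle:
        json.dump({"column": column, "n": len(values), "total": sum(values)}, handle)
    """
)

work_dir = tempfile.mkdtemp()
script_path = os.path.join(work_dir, "summarise.py")
input_path = os.path.join(work_dir, "input.csv")
output_path = os.path.join(work_dir, "summary.json")

with open(script_path, "w") as handle:
    handle.write(CHILD_SCRIPT)

# Step 1: write the data to a flat file that both sides can read.
with open(input_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["id", "score"])
    writer.writerows([[1, 0.5], [2, 1.5], [3, 2.0]])

# Step 2: pass the file locations and options as command line arguments.
command = [sys.executable, script_path, input_path, output_path, "score"]
subprocess.run(command, check=True)

# Step 3: read back the file written by the downstream script.
with open(output_path) as handle:
    summary = json.load(handle)

print(summary)
```

Only file paths and a column name travel on the command line; the data itself goes through the CSV and JSON files, which is exactly the round trip that makes this approach simple but potentially slow.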
However, in some instances, having to use a flat file as an intermediate data store can be both cumbersome and detrimental to performance. In the next post, we will look at how R and Python can directly call each other and return the output in memory.