R vs Python (Again): A Human Factor Perspective
This post attempts to explain, through the "human factor" (the typical Python user versus the typical R user), the widespread opinion that Python is better suited than R for developing production-quality code.
Image by the author based on the image by OpenClipart-Vectors on pixabay
I often hear or read statements that say, in essence, "R is good for quick and dirty analyses, but if you want to do serious work you should use Python". I completely disagree with this statement: it is absolutely possible to write efficient, reliable, robust, production-quality code in R - I have done it, so others can do it, too. When comparing two languages, you can think about their intrinsic qualities, like language constructs, syntax, functions, and available libraries. But you can also take a more empirical approach and mentally run through all the examples of code written in either language that you have encountered. From that angle, I am willing to admit that if you rephrase the above statement as "the average Python code is of higher quality than the average R code", there might be some truth in it, and that it can be explained by the "human factor".
This opinion is obviously not the result of a rigorous scientific approach: it is not based on objective data, because such data is not (and I think cannot be) available. We don't have access to all the Python/R code that has ever been created (or even to all the code created, say, last year), nor to a representative sample of it. We also don't have access to the educational background and professional profile of all Python/R users, nor to a representative sample of them. And even if we had such data, it would be extremely difficult to define an average piece of Python/R code and to assess its quality. So this opinion rests solely on my subjective but honest assessment of the situation.
Read on to find out more.
About Code Quality
Although there is no universal definition or measure of code quality, it is generally accepted that good code should be readable (well documented, consistent in style, …), modular, reusable, reliable, testable, and, of course, do what it is intended to do. It should also make efficient use of computing resources (memory, CPU) and of time, i.e., be reasonably fast.
Code is only one part of software; there is also the algorithm that the code implements. Whatever programming language you use, and even if you meet the above code quality requirements, software that is algorithmically inefficient is not good software.
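To illustrate the difference between code quality and algorithmic quality, here is a minimal Python sketch. Both functions are hypothetical examples written for this post, not taken from any library. They are equally readable and documented, yet the first compares every pair of elements while the second remembers what it has already seen. The second version assumes the values are hashable.

```python
def has_duplicates_quadratic(values):
    """Return True if any value occurs twice, comparing every pair: O(n^2)."""
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] == values[j]:
                return True
    return False


def has_duplicates_linear(values):
    """Return True if any value occurs twice, using a set: O(n) expected time.

    Assumption: the values are hashable (numbers, strings, tuples, ...).
    """
    seen = set()
    for value in values:
        if value in seen:
            return True
        seen.add(value)
    return False
```

Both functions give the same answers; only the amount of work grows differently as the input gets larger. The same contrast can be written in R or any other language.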
Typical R User Profile
Most R users come from a scientific background. They are often self-taught R programmers (Google and YouTube are the most important R tutors!), and even if they had formal training in R, it was probably quite minimal and most likely embedded in a statistics class (professional statisticians and data scientists being the exception). They were also unlikely to have been exposed to general computer science training and concepts, and are therefore often unfamiliar with good coding practices.
R is generally used as an auxiliary tool to accomplish their main job - doing research. What counts is getting the job done: analyzing the data and producing tables and charts for a Ph.D. thesis, a journal or conference paper, or a project report for a funding agency. As a consequence, the code developed is single-use, throwaway software. Documentation and coding style are not a concern, spaghetti-style single scripts are more frequent than modular code, and hard-coded file paths and other parameters destroy any hope of reusing the code. The code is generally run interactively by its author, and if an error occurs it is fixed on the spot, so there is no provision for proper error handling. The attitude is: if the code runs today, on my computer and on my data set, then I'm happy and I have done my job! But we shouldn't blame these users (at least not too much) - their job is to produce scientific results, not efficient, high-quality computer code. Admittedly, writing quality code is not that difficult once you get used to it, and even code intended for single use often contains parts that could be reused on other occasions; low code quality, however, makes such reuse almost impossible. All things considered, we can argue that the "average" R code will most often not meet high-quality coding standards.
The notable exception is, again, professional statisticians and data scientists, for whom producing code is part of their main task: developing statistical and machine-learning methodology and tools. They therefore often produce good-quality code, made available as R packages on CRAN and other repositories.
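The gap between a throwaway script and reusable code is often small. The sketch below is written in Python only to keep a single language for the examples in this post; the same idea applies unchanged to R. The file path, the column name `yield`, and the function `column_mean` are all hypothetical, invented for illustration.

```python
import csv
from pathlib import Path
from statistics import mean

# Throwaway style, shown as a comment only. The (hypothetical) path and
# column name are hard-coded, so it runs on one computer and one data set:
#
#   rows = list(csv.DictReader(open("C:/Users/me/thesis/data_final_v2.csv")))
#   print(mean(float(r["yield"]) for r in rows))


def column_mean(csv_path, column):
    """Return the mean of a numeric column in a CSV file with a header row.

    The file and the column are parameters, so the same function can be
    reused on another data set or another machine.
    """
    with Path(csv_path).open(newline="") as handle:
        values = [float(row[column]) for row in csv.DictReader(handle)]
    return mean(values)
```

A call such as `column_mean("trial_2023.csv", "yield")` (again, a made-up file name) would then replace the copy-pasted script. Turning the hard-coded values into arguments is a few minutes of work, and it is usually the first step from a one-off analysis toward something that can be reused.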
Typical Python User Profile
Most Python users come from a computing background. Python is often used at universities as a tool to introduce computer science concepts (https://www.edx.org/learn/python). Even if they didn't graduate in computer science, Python users probably had some formal training in Python programming, as it was a requirement for getting their job. They likely also had some formal training in general computer science beyond Python programming, and are familiar with good coding practices.
Python is generally used by software developers - people whose main task is to produce code. That code is intended to be run many times, on different computers (possibly even on different operating systems), and on many different data sets. It therefore tends to be robust and reliable, to cope with exceptional situations, and to handle errors gracefully. It is also expected that the code will have to be maintained (bug fixes, adaptation, evolution), possibly by other people, so documentation, style, readability, and reuse naturally become part of the development process. All things considered, we can argue that the "average" Python code will most often meet high-quality coding standards.
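As a rough illustration of what "copes with exceptional situations" can mean in practice, here is a minimal sketch of a function that anticipates the usual failure modes of reading a data file. The function name `read_numeric_column` and its error messages are hypothetical, chosen for this example; it relies only on Python's standard `csv` and `pathlib` modules.

```python
import csv
from pathlib import Path


def read_numeric_column(csv_path, column):
    """Read a numeric column from a CSV file, failing with explicit errors."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"Input file not found: {path}")

    with path.open(newline="") as handle:
        reader = csv.DictReader(handle)
        # fieldnames is None for an empty file (no header row at all).
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"Column {column!r} not found in {path.name}")

        values = []
        for row in reader:
            try:
                values.append(float(row[column]))
            except (TypeError, ValueError):
                # reader.line_num is the number of the last line read.
                raise ValueError(
                    f"{path.name}, line {reader.line_num}: "
                    f"cannot convert {row[column]!r} to a number"
                ) from None
    return values
```

When the code is run unattended, on someone else's machine and data, a message that names the file, the line, and the offending value is what makes the difference between a quick fix and an afternoon of debugging. Nothing in this approach is specific to Python; R offers `stop()` and `tryCatch()` for the same purpose.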
In this post, I have tried to explore a hypothesis about the origins of the belief that Python is more suitable than R for writing production-quality code. I don't think that it comes from the intrinsic characteristics of the two languages and their ecosystems. I suppose that it comes instead from the fact that average R users are probably less concerned with code quality than their Python counterparts, which can be explained by their backgrounds and the place programming occupies in their respective jobs.
And, of course, the final conclusion is that the belief that Python is superior to R for writing quality code is wrong!
Zivan Karaman is a freelance data science & software engineering consultant, passionate about using mathematics, statistics, and computing to transform data into actionable insights that can help solve real-world problems. Particularly fond of the R ecosystem, he has been using it since before it was created (having used R's ancestor S/S+ since the early 90s). Connect with him on LinkedIn.