Guide to Data Science Cheat Sheets

Selection of the most useful Data Science cheat sheets, covering SQL, Python (including NumPy, SciPy and Pandas), R (including Regression, Time Series, Data Mining), MATLAB, and more.



By Ajay Ohri, May 2014.

Over the past few years, as the buzz around data science, and apparently the demand for data scientists, has continued to grow, people are eager to learn how to join, advance and thrive in this seemingly lucrative profession. As someone who writes on analytics and occasionally teaches it, I am often asked: how do I become a data scientist?

Adding to the complexity of my answer is that data science seems to be a multi-disciplinary field, while the university departments of statistics, computer science and management deal with data quite differently.

But to cut through the marketing-created jargon, a data scientist is simply a person who can write code in a few languages (primarily R, Python and SQL) for data querying, manipulation, aggregation, and visualization, using enough statistical knowledge to give back actionable insights to the business for making decisions.

This rather practical definition of a data scientist is reinforced by the wording that accompanies “data scientist” listings on job websites. So here are some tools for learning the primary languages in data science: Python, R and SQL. A cheat sheet or reference card is a compilation of the most commonly used commands, meant to help you learn a language’s syntax faster.

The inclusion of SQL may surprise some (isn’t this the NoSQL era?), but it is there for a logical reason. Both Pig and Hive Query Language are closely associated with SQL, the original Structured Query Language. In addition, one can rely solely on the sqldf package within R (or the less widely used python-sql or python-sqlparse libraries for Pythonic data scientists), or even the Proc SQL commands within the old champion language SAS, and do most of what a data scientist is expected to do (at least in data munging).
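To make the point about SQL and data munging concrete, here is a minimal sketch of a typical aggregation done entirely in SQL from within Python. It uses the standard-library sqlite3 module rather than any of the packages named above, and the sales table, its columns and its rows are hypothetical, invented purely for illustration.

```python
import sqlite3

# Hypothetical sales records (region, product, amount), invented for illustration.
rows = [
    ("north", "widget", 120.0),
    ("north", "gadget", 80.0),
    ("south", "widget", 200.0),
    ("south", "gadget", 50.0),
    ("south", "widget", 25.0),
]


def revenue_by_region(records):
    """Load records into an in-memory SQLite table and aggregate them with SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
        # Querying, aggregation and ordering all happen in one SQL statement.
        cursor = conn.execute(
            "SELECT region, COUNT(*) AS n_orders, SUM(amount) AS revenue "
            "FROM sales GROUP BY region ORDER BY revenue DESC"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for region, n_orders, revenue in revenue_by_region(rows):
        print(region, n_orders, revenue)
```

The same SELECT ... GROUP BY pattern should carry over with little change to sqldf in R, Proc SQL in SAS, or Hive, which is why a SQL cheat sheet earns its place next to the Python and R ones.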

Python Cheat Sheets

For Python, this is a rather partial list, given that Python, the most general-purpose language in the data scientist’s quiver, can be used for many things. But for the data scientist, the packages numpy, scipy, pandas and scikit-learn seem the most pertinent.

Are all the thousands of R packages of interest to the aspiring data scientist? No.

Accordingly, we chose the appropriate cheat sheets for you. Note that this is a curated list of lists. If there is anything that can be assumed in the field of data science, it should be the null hypothesis that the data scientist is intelligent enough to make his own decisions based on data and its context. Three printouts are all it takes to speed up the aspiring data scientist’s journey.

Please add additional cheat sheets in the comments below.

Cheat Sheets for Python

Cheat Sheets for R

Cross Reference between R, Python (and Matlab)

Cheat Sheets for SQL

Additional

Ajay Ohri is a popular writer and blogger on Analytics and Data Mining and is the author of the book R for Business Analytics (Springer, 2012).