From Science to Data Science, a Comprehensive Guide for Transition
An in-depth, multifaceted, and all-around very helpful roadmap for making the switch from 'science' to 'data science,' yet generally useful for data science beginners or anyone looking to get into data science.
Most likely, practical programming is the main skill you are missing. For general data science, the standard tools are Python and R. Knowing other languages helps, but you should still learn one of these two.
But… Python or R? There are some crazy fights, right?
- Choosing R or Python for data analysis? An infographic
- Should you teach Python or R for data science? - more detailed use cases
- Python, Machine Learning, and Language Wars - Sebastian Raschka - also mentions R, and why MATLAB is too old (and Julia - too young)
tl;dr: both are good choices. Pick one you prefer for any reason; two really good ones are:
- This thing is great! I want to apply it to [some other data]. Oh, it is in [a language]!
- Having a community of people from whom you can learn.
I mean, there are use cases where one is better than the other. But for the majority of tasks both are fine. And (some may disagree) they are tools, not religions: no need to fight, and no need to use only one.
I won’t point to general tutorials: there are tons of them and personal preferences vary (MOOCs, interactive courses, websites, textbooks, …), and I tried to link only to things I recommend myself. When I provide links, they are usually web materials rather than classical books. And it is for a reason:
- things change fast; a 2-year-old book on a programming language may well be out of date,
- it is important how much you use in practice; dry-reading won’t teach you a thing.
R is a tool for statistics turned into a language. The standard way of using it is via RStudio (though you can also use Jupyter). Be sure to learn the basics of dplyr and ggplot2 (I almost always load them by default; especially dplyr, which makes operations on dataframes much easier, faster and more readable). Everything else depends on the problems you are solving.
If you go the R way, at least:
- Do your “data janitor work” like a boss with dplyr
- Data Wrangling with dplyr and tidyr Cheat Sheet
- Quick Introduction to ggplot2 - Edwin Chen
- Graphics, ggplot2
- Getting Started with R: Kaggle’s Titanic Competition
Some R pearls:
- R Markdown - dynamic documents, presentations, and reports from R
- Shiny - turn your analyses into interactive web applications
Python is a much better general-purpose language (with the pros and cons of not being statistics-oriented).
For Python, I would suggest installing it (Python 3) through Anaconda, and using Jupyter Notebook. The main packages are NumPy, SciPy (numerics), Pandas (like R dataframes), matplotlib (plots, but not as nice as ggplot2) and scikit-learn (for machine learning). Learn to be comfortable with Python (installing packages, loading, saving and transforming data, etc.) - links below may help:
- Overview of Python Visualization Tools
- Pandas Visualization
- Web Scraping - It’s Your Civic Duty
- Scipy Lecture Notes - One document to learn numerics, science, and data with Python
- A gallery of interesting IPython Notebooks
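To give a feeling for what "loading, saving and transforming data" looks like with Pandas, here is a minimal sketch. The column names and values are made up for illustration, and the CSV is kept in memory so the snippet runs on its own; with real data you would pass a file path instead.

```python
import io

import pandas as pd

# Hypothetical measurements; with real data, pass a file path to read_csv.
raw_csv = io.StringIO(
    "sample,temperature_c,signal\n"
    "a,20.0,1.2\n"
    "a,25.0,1.5\n"
    "b,20.0,0.9\n"
    "b,25.0,\n"  # the signal is missing in this row
)

# Load: empty fields become NaN.
df = pd.read_csv(raw_csv)

# Transform: drop rows without a signal and add a derived column.
clean = df.dropna(subset=["signal"]).copy()
clean["temperature_k"] = clean["temperature_c"] + 273.15

# Aggregate: mean signal per sample.
mean_signal = clean.groupby("sample")["signal"].mean()
print(mean_signal)

# Save: to_csv returns text here; pass a path to write a file instead.
csv_text = clean.to_csv(index=False)
```

If you come from R, `dropna` plus `groupby(...).mean()` plays roughly the role of a `filter` followed by `group_by` and `summarise` in dplyr.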
Statistics and Machine Learning
You need some basic linear algebra (vectors, matrices, SVD, …), calculus (exp, log, differentiation, integration, …) and probability (independence, conditional probability, …), but if you come from a natural science background, you already know that. It does not mean that you know it all - it just means that right now you have mathematical skills sufficient to be an employable data scientist and you are able to read about other methods, algorithms, etc.
- A Visual Introduction to Machine Learning
- Machine Learning at Coursera by Andrew Ng
- Dive into Machine Learning with Jupyter notebook, Python, and scikit-learn
- PyCon 2015 Scikit-learn Tutorial
If you need to get a real dataset suitable for working with a given machine learning algorithm, there is a wonderful collection:
For statistics, screw learning by heart various statistical distributions and tests - you can easily look them up later. What is crucial is to understand the ideas behind tests, cross-validation, bootstrapping and Bayesian inference. For the latter I recommend:
- David MacKay, Information Theory, Inference, and Learning Algorithms - doing the Bayesian Inference and Machine Learning track
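Of the ideas mentioned above, bootstrapping is perhaps the easiest to see in code: resample your data with replacement many times and look at how the statistic varies. Below is a minimal sketch of a percentile bootstrap confidence interval for the mean, using only the standard library. The function name, its parameters and the measurements are made up for illustration; it is one simple variant, not the only way to bootstrap.

```python
import random
import statistics


def bootstrap_mean_ci(data, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `data`."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Draw n points from the data, with replacement.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    means.sort()
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements, for illustration only.
measurements = [4.8, 5.1, 5.0, 4.7, 5.3, 5.2, 4.9, 5.0, 5.4, 4.6]
low, high = bootstrap_mean_ci(measurements)
print(f"sample mean = {statistics.fmean(measurements):.2f}")
print(f"95% bootstrap CI for the mean: [{low:.2f}, {high:.2f}]")
```

No distribution was assumed anywhere, which is the point: the same loop works for a median, a correlation or any other statistic you can compute from a sample.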
It’s a fast-changing field - I am constantly tracking new libraries and updates to the ones I am using. I read a lot of academic papers - not just to stretch my intellectual muscles, but to solve a particular problem.
Other software skills
Often you will need to install something, collaborate with others and do other tasks. The crucial point is to know what is possible - especially so as not to reinvent the wheel.
- basics of bash/shell
- CSV, JSON
- git for version control - see Why use version control systems for writing a paper?
- basics of SQL
- An Introductory SQL Tutorial: How to Write Simple Queries - Rachel Sprung
- Learn SQL in stages - SQLZOO
- Stack Exchange Data - see examples, write your own queries
- working with REST APIs
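To show how a few of these pieces fit together, here is a minimal sketch that reads CSV, runs a simple SQL query and writes the result as JSON, using only Python's standard library (`csv`, `sqlite3`, `json`). The table name, column names and values are made up for illustration, and the database lives in memory rather than in a file.

```python
import csv
import io
import json
import sqlite3

# Hypothetical CSV content; a real file would be opened with open(path, newline="").
csv_text = """experiment,run,value
alpha,1,0.5
alpha,2,0.7
beta,1,1.0
"""

# CSV: every field is read as a string, so convert types explicitly.
rows = [
    (r["experiment"], int(r["run"]), float(r["value"]))
    for r in csv.DictReader(io.StringIO(csv_text))
]

# SQL: load the rows into an in-memory SQLite database and aggregate them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (experiment TEXT, run INTEGER, value REAL)")
conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)

query = """
    SELECT experiment, COUNT(*) AS n_runs, AVG(value) AS mean_value
    FROM results
    GROUP BY experiment
    ORDER BY experiment
"""
summary = [
    {"experiment": name, "n_runs": n_runs, "mean_value": mean_value}
    for name, n_runs, mean_value in conn.execute(query)
]
conn.close()

# JSON: serialize the summary, e.g. to send it to a web service.
summary_json = json.dumps(summary, indent=2)
print(summary_json)
```

The `SELECT ... GROUP BY` pattern is the bread and butter of SQL for analysis, and it carries over almost unchanged to other databases.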
Don’t be afraid of learning new technologies (e.g. this data is in MongoDB, a NoSQL database; can you fetch it?) - often you can get the basics in a day. Most technologies, from the user’s perspective, are easy (at least compared to algebraic geometry or quantum field theory).