From Science to Data Science, a Comprehensive Guide for Transition

An in-depth, multifaceted, and all-around very helpful roadmap for making the switch from 'science' to 'data science,' yet generally useful for data science beginners or anyone looking to get into data science.



Programming languages


Most likely, practical programming is the main skill you are missing. For general data science, the standard tools are Python and R. Knowing some other language will help, but still learn one of these two.

But… Python or R? There are some crazy fights, right?

tl;dr: both are good choices. Pick one you prefer for any reason; two really good ones are:

  • This thing is great! I want to apply it to [some other data]. Oh, it is in [a language]!
  • Having a community of people from whom you can learn.

I mean, there are use cases where one is better than the other. But for the majority of tasks both are fine. And well (some may disagree), they are tools, not religions (no need to fight, no need to use one exclusively).

I won’t point to general tutorials - there are tons of them and personal preferences vary (MOOCs, interactive courses, websites, textbooks, …) and I tried to link only to things I recommend myself. When I provide links, they are usually to web materials rather than classical books. And it is for a reason:

  • things change fast; a 2-year-old book on a programming language may well be out of date,
  • it is important how much you use in practice; dry-reading won’t teach you a thing.

R

R is a tool for statistics turned into a language. The standard way of using it is via RStudio (though you can also use Jupyter). Be sure to learn the basics of dplyr and ggplot2 (I almost always load them by default; especially dplyr, which makes operations on dataframes much easier, faster and more readable). Then everything else depends on the problems you are solving.

If you go the R way, at least:

Some R pearls:

  • R Markdown - dynamic documents, presentations, and reports from R
  • Shiny - turn your analyses into interactive web applications

Python

Python is a much better general-purpose language (with the pros and cons of not being statistics-oriented).

For Python, I would suggest installing it (Python 3) through Anaconda, and using Jupyter Notebook. The main packages are NumPy and SciPy (numerics), Pandas (like R dataframes), matplotlib (plots, but not as nice as ggplot2) and scikit-learn (for machine learning). Learn to be comfortable with Python (installing packages, loading, saving and transforming data, etc.) - links below may help:
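To give a flavor of what "loading, saving and transforming data" looks like with Pandas, here is a minimal sketch. The column names and values are made up for illustration, and the data is read from an in-memory string so the example is self-contained; in practice you would pass a file path to `read_csv` and `to_csv`.

```python
import io

import pandas as pd

# Hypothetical measurements; normally this would be a CSV file on disk.
raw_csv = io.StringIO(
    "sample,temperature_K,signal\n"
    "a,290.0,1.2\n"
    "a,300.0,1.5\n"
    "b,290.0,0.7\n"
    "b,300.0,\n"  # missing signal value
)

# Load: empty fields become NaN.
df = pd.read_csv(raw_csv)

# Transform: drop incomplete rows and add a derived column.
df = df.dropna(subset=["signal"])
df["temperature_C"] = df["temperature_K"] - 273.15

# Aggregate: mean signal per sample.
summary = df.groupby("sample")["signal"].mean().reset_index()

# Save: written to a string buffer here instead of a file.
out = io.StringIO()
summary.to_csv(out, index=False)
print(out.getvalue())
```

If steps like these feel routine, you are probably comfortable enough to move on to the actual problem.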


Statistics and Machine Learning


You need some basic linear algebra (vectors, matrices, SVD, …), calculus (exp, log, differentiation, integration, …) and probability (independence, conditional probability, …), but if you come from a natural science background, you already know that. It does not mean that you know everything - it just means that right now you have mathematical skills sufficient to be an employable data scientist and you are able to read about other methods, algorithms, etc.

If you need to get a real dataset suitable for working with a given machine learning algorithm, there is a wonderful collection:

For statistics, screw learning by heart various statistical distributions and tests - you can easily look them up later. What is crucial is to understand the ideas behind tests, cross-validation, bootstrapping and Bayesian inference. For the latter I recommend:

It’s a fast-changing field - I am constantly tracking new libraries and updates to the ones I am using. I read a lot of academic papers - not just to stretch my intellectual muscles, but to solve particular problems.
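Of the ideas mentioned above, bootstrapping is perhaps the quickest to show in code. Below is a minimal sketch of a percentile bootstrap confidence interval for the mean. The function name `bootstrap_ci`, its parameters and the sample values are all made up for illustration, and it uses only the standard library to stay self-contained; with real data you would more likely reach for NumPy.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` of `data`."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(data)
    # Resample with replacement, recompute the statistic each time, then sort.
    estimates = sorted(
        stat(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    # Read the interval off the empirical quantiles of the estimates.
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements.
measurements = [4.9, 5.1, 5.0, 5.3, 4.8, 5.2, 5.0, 4.7, 5.4, 5.1]
low, high = bootstrap_ci(measurements)
print(f"mean = {statistics.mean(measurements):.2f}, 95% CI: [{low:.2f}, {high:.2f}]")
```

The point is that no distributional formula is needed: the spread of the statistic is estimated directly from resampled copies of the data, which is why understanding the idea matters more than memorizing tests.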

Other software skills


Often you will need to install something, collaborate with others and do other tasks. The crucial point is to know what is possible - especially so as not to reinvent the wheel.

Don’t be afraid of learning new technologies (e.g. this data is in MongoDB, a NoSQL database; can you fetch it?) - often you can get the basics in a day. Most technologies, from the user’s perspective, are easy (at least compared to algebraic geometry or quantum field theory).