
From Science to Data Science, a Comprehensive Guide for Transition


An in-depth, multifaceted, and all-around very helpful roadmap for making the switch from 'science' to 'data science,' yet generally useful for data science beginners or anyone looking to get into data science.



Piotr Migdał, deepsense.io.

After posting "What I do or: science to data science", I got a lot of emails asking how to make this transition.

In this post I try to summarize my advice. I don’t intend to write a complete walkthrough, but to provide a starting point, with links to further materials. I target it at people with an academic, quantitative background (e.g. physics, mathematics, statistics), regardless of whether they are undergraduate students, PhDs or after a few postdocs. Some points may be valid for other backgrounds [1] (but then, use it at your own risk).

Here and everywhere else: please don’t take the approach of "learn book[s], then play". Start with playing!

Data science

My story


In short:

  • I had a strong background in physics and an interest in complex systems; I did a lot of academic programming and none of the practical kind.
  • After the 1st year of my PhD studies I started learning Python (for web scraping and plotting) on my own time.
  • 9 months later I participated in a 1-month data science school (Big Dive in Turin).
  • 8 months later I went to a summer internship in data science in San Francisco (for 4 months).
  • I started part-time freelancing (as I was finishing my PhD).
  • After finishing my PhD, I made it my main activity.

All projects required me to learn something new - be it a library, a machine learning model or a software tool.

What is data science?


Analyzing real, and often dirty, data using a mixture of programming and statistics. Or, as Josh Wills put it:

Data scientist is a person who is better at statistics than any programmer and better at programming than any statistician.

From my perspective, the whole process looks like this:

  • ask a question that is relevant to the project
  • get data (CSV, SQL, plain text)
  • process it (joining, cleaning, supplementing it)
  • run analysis (statistical tests or machine learning)
  • interpret and use results (being able to understand the above)
  • present results (a report, plot, interactive data visualization)
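As a minimal sketch of the middle steps above (get data, process it, run analysis, present results), here is a tiny example using only the Python standard library. The inline CSV, the column names and the function names are all made up for illustration; a real project would more likely read files or a database and use a library such as pandas.

```python
import csv
import io
import statistics

# Hypothetical inline data standing in for a real CSV file; note the dirty rows.
RAW_CSV = """city,temperature_c
Turin,21.5
Turin,
Warsaw,18.0
Warsaw,not available
Warsaw,20.0
"""


def load_rows(text):
    """Get data: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Process it: keep only rows whose temperature parses as a number."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append((row["city"], float(row["temperature_c"])))
        except (TypeError, ValueError):
            continue  # skip missing or malformed values
    return cleaned


def mean_by_city(pairs):
    """Run analysis: compute the mean temperature for each city."""
    groups = {}
    for city, value in pairs:
        groups.setdefault(city, []).append(value)
    return {city: statistics.mean(values) for city, values in groups.items()}


def report(means):
    """Present results: format one line per city, sorted by name."""
    return "\n".join(f"{city}: {mean:.1f} C" for city, mean in sorted(means.items()))


if __name__ == "__main__":
    print(report(mean_by_city(clean(load_rows(RAW_CSV)))))
```

Keeping each step in its own small function is one simple way to make the work easy to rerun and to hand over to others.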

And everything needs to be done in a reproducible way, so that others can interact with your code, or even run it on a server. Depending on the job, there may be more emphasis on one part or another. Or look at this tweet: while humorous [2], it shows a balanced list of typical skills and activities of a data scientist:

A data scientist should

If you want to learn more about what data science is, look at the following links:

On the transition


When you have an academic title, no-one will question your intelligence. But they are justified in questioning your practical skills. In my experience, you need to fulfill two requirements:

  • have minimal skills so that you are useful starting from day 1 (e.g. you can get data and present summary statistics; they don’t want to start by teaching you Python and Git),
  • be able and eager to learn (both in general and their technologies in particular; be self-driven to discover and solve new problems even without explicit guidance).

Most data science tasks are simple, and once you are able to use R or Python you can start working, gradually increasing your knowledge and experience. That is, after a few months of learning you should be ready to start an entry-level job.

Initially, I was afraid that my lack of 10+ years of experience with C++ and Java would be a problem. How could I compete with serious software engineers who had majored in computer science? But it turned out that most of my commercial projects are for IT companies: they have wonderful programmers but often no-one proficient at dealing with real data. So (quoting From Academia to Industry, linked below):

While having a strong coding ability is important, data science isn’t all about software engineering (in fact, have a good familiarity with Python and you’re good to go). Data scientists live at the intersection of coding, statistics, and critical thinking.

See also:

Priorities

In academia, you are allowed to cherry-pick an artificial problem and work on it for 2 years. The result needs to be novel, and you need to research previous and similar solutions. The solution needs to be perfect, even if not on time.

In industry, you should solve a given problem end-to-end. Things need to work, and there is little difference if it is based on an academic paper, usage of an existing library, your own code or an impromptu hack. The solution needs to be on time, even if just good enough and based on shady and poorly understood assumptions.

So, contrary to its name, it’s rarely science [3]. That is, in data science the emphasis is on practical results (as in engineering), not on the proofs, mathematical purity or rigor characteristic of academic science.

Resume vs academic CV

In the software industry, a resume plays a different role than a CV in academia. Rather than being a complete record of all positions, awards and publications, it is a short (typically 1 page) summary of the main skills and the most important positions and accomplishments. It is used to screen candidates, not as the final judgement. To see the difference, compare and contrast my data science resume with my academic CV.

Interviews

Applying for a job involves being asked technical questions - on the phone or Skype. For software engineering it involves both conceptual questions and whiteboard coding; for data science it may vary. In any case, take a look at:

If you need to learn basic algorithms and data structures, I recommend:

If you get no technical questions, it may be a red flag. If you get only software engineering questions, it may be a sign that they want to hire a programmer, not a data scientist (no matter what their job posting says). Given your background, you want to be a Type A data scientist (i.e. more a statistician than a regular programmer), according to this taxonomy.