
Learning git is not enough: becoming a data scientist after a science PhD


Here is useful advice about moving from academia into data science after completing a PhD in a natural science.



By Mike Lee Williams.

If you’re thinking of leaving post-PhD science for data science then doubtless people have told you to learn version control.

They’re absolutely right. You should. But learning git is not enough.

So, in the spirit of A PhD is Not Enough, a great book about careers in science, here’s some advice about moving from academia into data science after completing a PhD in a natural science.

Unlike A PhD is Not Enough, however, this post is not a complete guide to a career. It’s just a collection of (hopefully non-obvious) things that have occurred to me since I made the move myself three years ago.

And to be clear: none of what I say here applies to you if you have a PhD in computer science, mathematics, statistics or the humanities.

Advice about advice

Data science is a young discipline, and it’s only in the past couple of years that tech firms have recognized the particular value of an advanced natural science education.

This means that the people who’ve been doing data science for long enough to give advice from a position of experience are very different from the average science PhD considering leaving academia.

Conversely, data scientists who recently completed a science PhD simply don’t know what they’re talking about because they haven’t been working (or hiring) for long enough. I include myself in this category.

People are going to seem like they know what they’re talking about, but take their advice with a big pinch of salt.

Why tech companies shouldn’t hire you

Before I get into the practical advice, I think it’s important to know why you and your peers are of interest to a hiring manager in tech.

This is a perfectly reasonable list of things a data scientist should know. On average, science PhDs know this material no better, and often far worse, than anyone who can solve FizzBuzz.
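For readers who haven’t met it, FizzBuzz is a common screening exercise for programmers: print the numbers from 1 to n, replacing multiples of 3 with “Fizz”, multiples of 5 with “Buzz”, and multiples of both with “FizzBuzz”. A minimal Python sketch of one common solution (the function name and the choice to return a list are mine, not part of any standard statement of the problem):

```python
def fizzbuzz(n: int) -> list[str]:
    """Return the FizzBuzz sequence for 1..n as a list of strings."""
    out = []
    for i in range(1, n + 1):
        if i % 15 == 0:        # divisible by both 3 and 5
            out.append("FizzBuzz")
        elif i % 3 == 0:
            out.append("Fizz")
        elif i % 5 == 0:
            out.append("Buzz")
        else:
            out.append(str(i))
    return out


print(" ".join(fizzbuzz(15)))
```

That is the whole bar: a loop, a conditional and the modulo operator.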

Tech firms are by now all too aware of this. They know that, left alone, a typical science PhD cannot build robust, complex software systems. More fundamentally, science PhDs are often ignorant about the basic tools and conventions of collaborative software development. I certainly was (and compared to an undergraduate CS major, I probably still am).

And yes, most science PhDs are comfortable with some pretty sophisticated ideas from mathematics and statistics. But they rarely have the breadth of a statistics or a machine learning PhD. They often lack knowledge of the particular areas of statistics that come up in industrial data science.

And there’s another problem with science PhDs: depending on their thesis adviser they may have acquired a tendency to treat the word “data” as a plural. The good news is this absurd habit can be unlearned. The other problems are more serious.

Why tech companies should hire you

So why do tech firms hire science PhDs? Science is not perfect, but it’s been pretty successful. And the intellectual posture and methods it uses seem likely to be partly responsible for that success.

It’s difficult to finish a science PhD without acquiring two things:

  • A deeply ingrained attitude of skepticism toward claims made about data, including your own.
  • The ability to conduct undirected research programs whose job is to determine whether that attitude is warranted.

Now, it’s possible to get a science PhD without picking these up, and it’s possible to learn them without doing a science PhD. But a random person who has a science PhD is more likely to have learned them than someone without.

I’ve met machine learning PhDs, the kind of people who get hired at NIPS and start on $300,000, who neither know nor care about the most fundamental concerns about experimental data such as sample bias and censoring.

I don’t think it’s a coincidence that the recently publicized article about relative merits of men and women in tech was written by a Google engineer who quit grad school before he had to do any research. It was glib, intellectually lazy and arrogant. These are things that a modern graduate research education in a natural science seeks to beat out of you (admittedly not always successfully).

Salaries

OK, that was kind of philosophical and, coming from someone still trying to rationalize the fact that he spent 10 years on a science PhD and subsequent postdocs, probably a little self-serving.

Here’s some practical advice.

Unless you were a very successful academic, you’ve probably never negotiated over money. How do you know if an offer is fair? Data!

There are two useful sources of information. The first is the O’Reilly Data Science Salary Survey (US 2016 edition, European 2017 edition). This survey gives you a formula you can use to estimate what people like you get paid. In my case it’s right to within about 10%.

The second useful source is the H-1B salary database. The salaries of people on H-1B visas are matters of public record. You can search by title and employer. It’s not impossible you’ll find the salaries of people who work where you’ve applied, doing exactly the job you’ve applied for. There’s nothing like knowing the salary of your interviewer to level the playing field during negotiations.
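If you export search results from such a database, the lookup by title and employer amounts to a filter followed by a summary statistic. Here is a rough sketch, assuming the results are available as a list of dictionaries. The field names (`employer`, `title`, `base_salary`), the helper `matching_salaries`, and every row below are made-up placeholders for illustration, not real filings or the schema of any actual database.

```python
from statistics import median

# Made-up placeholder rows, NOT real filings. Field names are hypothetical.
records = [
    {"employer": "ExampleCorp", "title": "Data Scientist", "base_salary": 110_000},
    {"employer": "ExampleCorp", "title": "Data Scientist", "base_salary": 125_000},
    {"employer": "ExampleCorp", "title": "Software Engineer", "base_salary": 130_000},
    {"employer": "OtherCo", "title": "Data Scientist", "base_salary": 98_000},
]


def matching_salaries(rows, employer, title):
    """Base salaries for rows matching employer and title (case-insensitive)."""
    return [
        r["base_salary"]
        for r in rows
        if r["employer"].lower() == employer.lower()
        and r["title"].lower() == title.lower()
    ]


salaries = matching_salaries(records, "examplecorp", "data scientist")
if salaries:
    # The median is less sensitive than the mean to one unusual offer.
    print(f"{len(salaries)} matches, median base salary: {median(salaries):,.0f}")
```

With only a handful of matching rows, treat the result as a rough anchor for negotiation rather than an estimate of the going rate.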

Both data sources are flawed. Naively applied, the O’Reilly formula tells you that you should subtract $5000-6000 from your salary if your gender is female. Correlation does not imply causation though, so you should not lower your expectations or demands based on this (or any other) tendency in their data.

In aggregate, the H-1B data is perhaps even more problematic than the O’Reilly data. It’s dodgy data about a biased sample.

The H-1B sample is flawed because H-1B visa holders are atypical. In some situations they are hired precisely because they have lower salary expectations than US residents (paying them less than the prevailing rate violates the terms of their Labor Condition Application, but it happens). In other situations the employer puts up with the expense and delay of the visa application because the applicant has unique skills that also make them more valuable.

The H-1B data is flawed because the salary information is only recorded at the time the offer was made, since which it has presumably increased, and it does not include bonuses or non-cash compensation.

But some information is better than none.