
How to Become a Data Scientist – Part 1 (2016 Silver Blog)

Check out this excellent (and exhaustive) article on becoming a data scientist, written by someone who spends their day recruiting data scientists. Do yourself a favor and read the whole way through. You won't regret it!

By Alec Smith, Data Science Recruiter.

This is Part One in a Three-Part Series

Part Two: Learning | Part Three: The Job Market

I am a recruiter specialising in the field of data science. The idea for this project arose because one of the most common questions I am asked is: “how do I obtain a position as a data scientist?” It is not just the regularity of this question that got my attention, but also the diverse backgrounds of the people asking it. To name a few, I have had this conversation with: software engineers, database developers, data architects, actuaries, mathematicians, academics (of various disciplines), biologists, astronomers, theoretical physicists – I could go on. And through these conversations, it has become apparent that there is a huge amount of misinformation out there, which has left people confused about what they need to do in order to break into this field.

I decided, therefore, that I would investigate this subject to cut through the BS and provide a useful resource for anyone looking to move into commercial data science – whether you are just starting out, or already possess all the necessary skills but have no industry experience. And so I set out with the aim of answering two very broad questions:

  • What skills are required for data science, and how should you go about picking these up? (Chapters One, Two and Three)
  • From a job market perspective, what steps can you take to maximise your chances of gaining employment in data science? (Chapter Four)

Why am I qualified to write this? Well, I speak with data scientists every day and to be an effective recruiter, I need to understand career paths, what makes a good data scientist, and what employers look for when hiring. So I already possess some knowledge on the matter. But I also wanted to find out directly from those who have trodden this path, so I began speaking with data scientists of different backgrounds to see what I could unearth. And this took me on a journey through ex-software engineers, an ex-astrophysicist and even an ex-particle physicist, who – to my excitement – had been involved in one of the biggest scientific breakthroughs of the 21st century.

Different Types of Data Science

So you have made the decision to become a data scientist. Great, you are on your way. But now you have another choice, which is: what kind of data scientist do you want to become? Because – it is important to acknowledge – while data science as a profession has been recognised for a number of years now, there still isn’t a commonly accepted definition of what it actually is.

In reality, the term ‘data scientist’ is regarded as a broad job title and so it comes in many forms, with the specific demands dependent on the industry, the business, and the purpose/output of the role in question. As a result, certain skillsets suit certain positions better than others, and this is why the path to data science is not uniform and can run through a diverse range of fields such as statistics, computer science and other scientific disciplines.

The purpose is the biggest factor that dictates what form data science takes, and this is related to the Type A-Type B classification that has emerged (see here: What is Data Science?). Broadly speaking, the categorisation can be summarised as:

  • Data science for people (Type A), i.e. analytics to support evidence-based decision making
  • Data science for software (Type B), for example: recommender systems as we see in Netflix and Spotify

We may see further evolution of these definitions as the field matures, and this is where we will introduce our first expert into the mix: Yanir Seroussi (remember the names, as we will be returning to them throughout). Yanir is currently Head of Data Science at Car Next Door (a start-up enabling car sharing), and he wrote about this very topic in his blog: Is Data Scientist a Useless Job Title? If you enjoy this, check out Yanir’s other posts – he is a regular and eloquent writer on a variety of topics around data science.

Owning Up To The Title

Before we delve any deeper, it is worth taking a moment to reflect on the ‘science’ in ‘data science’, because – in a sense – all scientists are data scientists, as they all work with data in one form or another. But to take what is generally considered to be data science in industry, what actually makes it a science? Great question! The answer should be: ‘the scientific method’. Given the multi-disciplinary nature of science, the scientific method is the one thing that binds the fields together. If you got this right, full marks to you.

However, job titles tend to be applied very loosely in industry and so not all data scientists are true scientists. Ask yourself though: can you justify calling yourself a scientist if your role does not involve actual science? Personally, I do not see what is wrong with alternatives like ‘analyst’, or whatever best fits the position in question. But maybe this is just me, and perhaps I would be better off calling myself a recruitment scientist.

For an excellent discussion on this, I thoroughly recommend this post by Sean McClure: Data Scientist: Owning Up To The Title (yes, I admit it – I plagiarised the heading).

And with that out of the way, we will continue this exploration by considering what areas of expertise you will need to master (if you haven’t already).

1. Problem Solving

If this is not top of your list, amend that list. Immediately. At the core of all scientific disciplines is problem solving: a great data scientist is a great problem solver; it is as simple as that. Need further proof? How about the fact that every single person I met for this project, irrespective of background or current working situation, mentioned this as THE most important factor in data science?

Clearly, you need to possess the tools to solve the problems, but they are just that: tools. In this sense, even the statistical/machine learning techniques can be thought of as the tools by which you solve problems. New techniques arise, technology evolves; the one constant is problem solving.

To an extent, your ability as a problem solver is dictated by your nature, but at the same time, there is only one way to improve: experience, experience, experience. We will revisit this in Chapter Three, so at this point, just remember this important lesson: you can only master something through doing.

Before we move on, I would like to direct you to another great post from Sean McClure: The Only Skill You Should Be Concerned With (just to be clear, I am not receiving any payment for these pointers, but I am totally open to it. Sean – if you are reading this, you can send me money anytime).

2. Statistics / Machine Learning

Ok, having read the above, it might seem like I have trivialised statistics and machine learning. But we are not talking about a power tool here; these are complex and, to an extent, esoteric fields, and if you do not possess expert knowledge, you will not be solving data science problems any time soon.

To provide some much-needed clarification on these terms, machine learning can be viewed as a multi-disciplinary field that grew out of both artificial intelligence/computer science and statistics. It is often seen as a subfield of AI, and while this is true, it is important to recognise that there is no machine learning without statistics (ML is heavily dependent on statistical algorithms in order to work). For a long time statisticians were unconvinced by machine learning, and collaboration between the two fields is a relatively recent development (see statistical learning theory). It is interesting to note that high-dimensional statistical learning only took off once statisticians embraced ML results (thanks to Bhavani Rascutti, Advanced Analytics Domain Lead at Teradata, for this input).

For technical readers who are interested in a more detailed account, check out this classic paper published in 2001 by Leo Breiman: Statistical Modeling: The Two Cultures.