How to Become a Data Scientist – Part 2

Check out part 2 of this excellent series of articles on becoming a data scientist, written by someone who spends their day recruiting data scientists. This installment focuses on learning.

By Alec Smith, Data Science Recruiter.

This is Part Two in a Three-Part Series

Part One: What is Data Science? | Part Three: The Job Market


Having read Chapters One and Two (i.e. Part One), you should now have a good comprehension of what commercial data science entails, the different forms it takes, and what is required to be a success in the profession. And having thought deeply about your motivations, you should have a clear picture of your goals, and ultimately – the type of data scientist you want to become. So give yourself a pat on the back, because you are now ready to begin the real fun: learning.

In this chapter, we will explore the options at your disposal, but first we will discuss an important notion that concerns data science and learning.

Continual Learning

Just like a doctor has to stay abreast of medical developments, learning never stops for a data scientist. The field (and the technology) evolves so quickly; what you learn now might not be relevant in the years to come. Look at the rise of deep learning, to take just one example. This is what Sean McClure was alluding to in his post emphasising the importance of problem solving (highlighted in Chapter One).

Quite simply, if you are not passionate about the field and do not enjoy learning, then data science is not for you. Attending conferences and networking with the data science community are effective ways of keeping on top of the latest developments, and it is advisable to regularly read books and papers. On the latter: if you do not have a background in research, it is worth familiarising yourself with academic papers so you can get the most out of them (I haven’t specifically researched the best way to go about this, but after this post was featured on Hacker News, the user ‘Obi_Juan_Kenobi’ came up with an interesting answer to this question – if you have the patience to scroll through this thread: https://news.ycombinator.com/item?id=12243377).

Play. Build. Experiment.

Going back to the message we touched on in Chapter One, there is only one way to develop your capability as a data scientist: experience, experience, experience. I could launch into a lengthy discussion on this, but I happened to come across two excellent posts that cover the key points, so have a read of Brandon Rohrer: A One-Step Program for Becoming a Data Scientist and Rossella Blatt Vital: The Scary Rise of the 'Fake Data Scientists'.

This is what you should take from these: data science is an expert field, it takes a long time to master, and you will only do so through practical experience. As James Petterson summarised:

Nothing beats experience. You can read as much as you want, you can do all the Coursera courses, but unless you get your hands dirty, you won’t learn.

The good news is there are some great avenues to gain practical experience, and we will turn our attention to these now.


Kaggle / Open-Source / Freelancing

If you haven’t heard of Kaggle, Google it... NOW! Kaggle is an incredible platform where you can play around, develop your expertise and learn, of course. James put it this way:

If I hadn’t competed in Kaggle competitions, I would have finished my PhD without knowing the tools that people use in industry. For example, a lot of the methods used in industry are based on ensembles or decision trees, like random forests. They are really powerful and are my first choice in both competitions and industry, but I wasn't exposed to them during my PhD.

There you have it: you can improve your skills while learning the techniques that are commonly applied in industry. And if you start doing well in the competitions, it provides evidence of your capability, as we will see in Chapter Four.
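To make the ensemble idea James mentions a little more concrete, here is a minimal sketch in plain Python. It is not a full random forest: it bags one-split decision "stumps" trained on bootstrap samples and lets them vote, which is a deliberate simplification. The function names, the toy data and the parameters (n_trees, seed) are all illustrative assumptions rather than anything from James or from a particular library.

```python
import random
from collections import Counter


def fit_stump(points, labels):
    """Pick the single-feature threshold split with the fewest training errors."""
    best = None
    for feature in range(len(points[0])):
        for threshold in sorted({p[feature] for p in points}):
            left = [y for p, y in zip(points, labels) if p[feature] <= threshold]
            right = [y for p, y in zip(points, labels) if p[feature] > threshold]
            # Majority label on each side; an empty side falls back to the overall majority.
            left_label = Counter(left or labels).most_common(1)[0][0]
            right_label = Counter(right or labels).most_common(1)[0][0]
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_label, right_label)
    return best[1:]


def predict_stump(stump, point):
    feature, threshold, left_label, right_label = stump
    return left_label if point[feature] <= threshold else right_label


def fit_forest(points, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(points)) for _ in points]
        forest.append(fit_stump([points[i] for i in idx], [labels[i] for i in idx]))
    return forest


def predict_forest(forest, point):
    """Majority vote across all stumps in the ensemble."""
    votes = Counter(predict_stump(stump, point) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: the first feature separates the classes, the second is noise.
POINTS = [(1, 5), (2, 3), (3, 8), (2, 7), (7, 4), (8, 6), (9, 2), (8, 9)]
LABELS = ["a", "a", "a", "a", "b", "b", "b", "b"]

forest = fit_forest(POINTS, LABELS)
print(predict_forest(forest, (0, 5)), predict_forest(forest, (10, 5)))
```

In practice you would more likely reach for an established library implementation than write your own. The point of the sketch is only that many weak models, each trained on a slightly different sample, can be combined by voting.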

Outside of Kaggle, another option is to contribute to open-source projects. A simple search on GitHub should reveal some projects you can start to sink your teeth into, and gain practical experience while doing so.

Finally, if you can get freelancing work, this is a great way to build a track record and demonstrate that you can operate in a commercial environment. And rather conveniently, you could even utilise the Experfy platform for that purpose.

To PhD or not to PhD

Do you need a PhD to be a data scientist? Not necessarily, but there are many advantages, as Sean Farrell noted:

The process of obtaining a PhD is a filter for creative problem solving skills [and it] shows you can master a particular field in a short space of time and become a world expert, which proves you’ll be able to do it again and again.

And apart from anything else, it provides you with the time to study and to develop your skills. Furthermore, if you are interested in specialising within a specific area like image processing or natural language processing, then PhD research is certainly worth considering.

But going down this path is not the only way to data science. James did a PhD in Machine Learning (focused on researching a very specific type of method) and he feels that a lot of PhD research is not always applicable to industry, i.e. if your job is to apply machine learning rather than research it, you don’t necessarily need a PhD. As such, I asked him whether he thinks people should choose a PhD based on its relevance to industry and he said:

If possible, but that’s really hard because most of what we do in industry is not state of the art, we use methods that have been around for years and apply them to different problems. There are exceptions of course: you might work at Google in research, for example. But most of the knowledge I use day-to-day, I learnt working [at Commonwealth Bank] and by competing in Kaggle. Of course, doing a PhD, you learn about the whole process, spend a lot of time doing experiments and learning how to do them properly, and that is valuable. But I wonder if you could learn that from other means?

Given the right motivations and armed with an informative guide on how to become a data scientist (where could you find one of those I wonder?), I have no doubt it is possible to learn by yourself. But it is worth making the point again: there are no shortcuts; it requires a lot of self-study and getting your hands dirty – whatever path you take.

There is also the employability aspect to consider: are you more employable as a PhD graduate than you would be after spending the same time on self-study? I do not have sufficient proof to comment, but either way, what matters more is whether you have truly spent the time building up expert capability (and how you can evidence this). PhDs are certainly valuable, but there are great data scientists with PhDs and great ones without.