How to Become a Data Scientist – Part 1
Check out this excellent (and exhaustive) article on becoming a data scientist, written by someone who spends their day recruiting data scientists. Do yourself a favor and read the whole way through. You won't regret it!
We only need to touch on programming briefly because it should be obvious: this is an absolute must. What is the point of knowing the mathematical underpinnings of machine learning if you cannot apply the theory, and do so with speed?
b. Distributed Computing
Not all businesses have massive datasets but, considering the modern world, it is advisable to develop the ability to work with BIG DATA (!). In short: the main memory of a single computer is not going to cut it, and if you want to train models simultaneously across hundreds of virtual machines, you need to get to grips with distributed computation and parallel algorithms.
Why the exclamation mark? Personally, I find the misnomer that is “big data” farcical. The term is continually confused and often used as an umbrella term for all analytics. Furthermore, massive data volumes (and the technologies to store and manage them) are no longer the novelty they once were, so it is only a matter of time before the term expires from our lexicon. For an expanded discussion on this, there is yet another sensible post from Sean McClure: Data Science and Big Data: Two Very Different Beasts (this is getting ridiculous now – I swear I have never even talked to the guy).
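To make the idea of a parallel algorithm a little more concrete, here is a minimal sketch of the split/summarise/combine pattern that distributed frameworks rely on. It is an illustration only: the function names (`partial_stats`, `combine`, `distributed_mean`) are hypothetical, and a thread pool stands in for what would be separate machines in a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(chunk):
    """'Map' step: summarise one partition without seeing the others."""
    return sum(chunk), len(chunk)


def combine(partials):
    """'Reduce' step: merge the per-partition summaries into a global mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


def distributed_mean(values, n_partitions=4):
    # Split the data into partitions; in a real system each one would live
    # on a different machine and never be loaded into a single memory.
    partitions = [values[i::n_partitions] for i in range(n_partitions)]
    partitions = [p for p in partitions if p]  # skip empty partitions

    # Assumption: threads are only a stand-in for remote workers here.
    # They do not speed up CPU-bound work in CPython.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(partial_stats, partitions))

    return combine(partials)


if __name__ == "__main__":
    print(distributed_mean(list(range(1, 101))))
```

The point of the sketch is that each worker only returns a small summary (a sum and a count), so the full dataset never has to fit on one machine. Not every algorithm decomposes this neatly, which is why parallel algorithms are worth studying in their own right.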
c. Software Engineering
For Type A data science, let me be clear: engineering is a separate discipline. So if this is the type of data scientist you want to become, you do not need to be an engineer. However, if you want to put machine learning algorithms into production (i.e. Type B), you will need a strong foundation in software engineering.
4. Data Wrangling
Data cleaning/preparation is a crucial and intrinsic part of data science, and it will take up the majority of your time. If you fail to remove the noise from your dataset (e.g. wrong/missing values, non-standardised categories, etc.), the accuracy of the model will suffer, which will ultimately lead to incorrect conclusions. Therefore, if you are not prepared to spend the time and attention on this step, your advanced technical know-how is rendered irrelevant.
It is also important to note that data quality is a persistent issue in commercial organisations and many businesses have complicated infrastructures when it comes to data storage. So if you are not prepared for this environment and you want to work with nice clean datasets, unfortunately commercial data science is not for you.
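As a small illustration of the kinds of noise mentioned in this section (wrong values, missing values and non-standardised categories), here is a sketch in plain Python. The sample rows, the alias table and the 0 to 120 age range are all made up for the example; real cleaning rules depend entirely on the dataset and the business context.

```python
# Hypothetical raw records, as they might arrive from a messy export.
RAW_ROWS = [
    {"customer": "a01", "country": "UK", "age": "34"},
    {"customer": "a02", "country": "united kingdom", "age": ""},
    {"customer": "a03", "country": " U.K. ", "age": "-5"},
    {"customer": "a04", "country": "France", "age": "41"},
]

# Assumed mapping from the spellings seen in the data to one standard label.
COUNTRY_ALIASES = {
    "uk": "United Kingdom",
    "u.k.": "United Kingdom",
    "united kingdom": "United Kingdom",
    "france": "France",
}


def clean_country(value):
    """Standardise a category; return None if the spelling is unrecognised."""
    return COUNTRY_ALIASES.get(value.strip().lower())


def clean_age(value):
    """Parse an age; treat blanks and implausible values as missing (None)."""
    try:
        age = int(value)
    except (TypeError, ValueError):
        return None
    return age if 0 <= age <= 120 else None


def clean_rows(rows):
    return [
        {
            "customer": row["customer"],
            "country": clean_country(row["country"]),
            "age": clean_age(row["age"]),
        }
        for row in rows
    ]


if __name__ == "__main__":
    for row in clean_rows(RAW_ROWS):
        print(row)
```

Note that the sketch marks bad values as missing rather than guessing a replacement. Whether to drop, flag or impute such values is a modelling decision in its own right, and in practice you would usually do this kind of work with a data-frame library rather than by hand.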
5. Tools and Technology
As you should have realised by now, developing your ability as a problem-solving data scientist should take precedence over everything else: technologies change constantly and can be learnt in a relatively short timeframe. But we shouldn’t ignore them altogether, so it is useful to be aware of the most widespread tools in use today.
Starting with programming languages, R and Python are the most common; so if you have a choice, perhaps use one of these when you are experimenting.
Particularly in Type A data science, having the ability to visualise data in intuitive dashboards is very powerful for communicating with non-technical business stakeholders. You might have the best model and the best insights, but if you cannot present/explain the findings effectively, what use are they? It really doesn’t matter what tool you use for visualisation – it could be R, or Tableau (which seems to be the most prevalent at the moment), but honestly – the tool is unimportant.
Finally, SQL is significant, as it is the most common language used to interact with databases in industry, whether we are talking about relational databases or the derivatives of SQL used with big data technologies. It is also the bread and butter of data wrangling – at least when working at larger scales (i.e. not in memory). In summary: it really is worth investing your time in.
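For anyone who has not yet written any SQL, here is a minimal sketch of the filter-and-aggregate style of query that makes up much of day-to-day data wrangling. It uses Python's built-in `sqlite3` module with an in-memory database purely so that it is self-contained; the `orders` table, its columns and its rows are invented for the example.

```python
import sqlite3


def revenue_by_region():
    # An in-memory SQLite database stands in for a real company database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            ("a01", "north", 120.0),
            ("a02", "south", 80.0),
            ("a01", "north", 30.0),
            ("a03", "south", None),  # a missing amount, to be filtered out
        ],
    )

    # Filter out incomplete rows, then aggregate per region.
    query = """
        SELECT region, COUNT(*) AS n_orders, SUM(amount) AS revenue
        FROM orders
        WHERE amount IS NOT NULL
        GROUP BY region
        ORDER BY revenue DESC
    """
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for region, n_orders, revenue in revenue_by_region():
        print(region, n_orders, revenue)
```

The same SELECT / WHERE / GROUP BY pattern carries over, with minor dialect differences, to most relational databases and SQL-like query layers, which is a large part of why the language is worth learning.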
6. Communication / Business Acumen
This should not be understated. Unless you are going into something very specific, perhaps pure research (although let’s face it, there aren’t many of these positions around in industry), the vast majority of data science positions involve business interaction, often with individuals who are not analytically literate.
Having the ability to conceptualise business problems and the environment in which they occur is critical. And translating statistical insights into recommended actions and implications for a lay audience is absolutely crucial, particularly for Type A data science. I was chatting to Yanir about this, and this is how he put it:
“I find it weird how some technical people don't pay attention to how non-technical people's eyes glaze over when they start using jargon. It's really important to put yourself in the listener's/reader's shoes”
It probably isn’t clear: I have used this heading ironically. No – data scientists are not rock stars, ninjas, unicorns or any other mythical creature. If you are planning on referring to yourself like this, perhaps take a long look in the mirror. But I digress. The point I want to make here is this: there are some data scientists who possess expert-level ability in all of the above, and perhaps more. They are rare and extremely valuable. If you have the natural ability and desire to become one of these, then great – you are going to be hot property. But if not, remember: you can specialise in certain areas of data science, and quite often, good teams are composed of data scientists with different specialities. Deciding what to focus on goes back to your interests and capability, and this leads us nicely to the next chapter in our journey.