The Art of Data Science: The Skills You Need and How to Get Them
Learn how to turn the deluge of data into gold through algorithms, feature engineering, reasoning about business value and, ultimately, building a data-driven organization.
By Joseph Blue, MapR.
The meteoric growth of available data has precipitated the need for data scientists to leverage that surplus of information. This spotlight has caused many industrious people to wonder, “Can I be a data scientist, and what are the skills I would need?” The answer to the first question is yes – regardless of your prior experience and education, this role is accessible to motivated individuals looking to meet this challenge. As for the second question, the necessary skills (some formal and some more… artful) and how to acquire them, based on my experience as a data scientist, are enumerated below.
1. Knowing The Algorithms
To be a data scientist, you need to know how and when to apply an appropriate machine-learning algorithm. Period. Here’s where to develop and hone those skills:
- The Basics – one of my favorite books on this subject is Machine Learning in Action. It covers the main categories of algorithms (classification, regression, clustering, etc.) and provides Python code to experiment with the examples for yourself. I recommend working through a similar book first to get a firm foundation.
- Advanced – the best place to find out what’s hot in data science is the website Kaggle. The competitions result in the democratization of data science – all that matters is which method achieves the best results. Problems and methods are discussed in forums and the winners share their approaches in blogs. If you set up a free account and follow the competitions for 6-8 months, you will get loads of real-life machine learning experience.
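To make "classification" concrete, here is a minimal sketch of one of the simplest classifiers, k-nearest neighbors, written with only the Python standard library. It is not taken from the book or from any Kaggle solution; the function name `knn_predict` and the toy data are made up for illustration.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (invented): two well-separated groups of 2-D points.
points = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (1.1, 1.0)))
print(knn_predict(points, labels, (5.0, 5.0)))
```

Knowing "when" to apply it matters as much as the code: a distance-based method like this one depends heavily on how the inputs are scaled, which leads directly into feature engineering.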
2. Extracting Good Features
Applying the appropriate algorithm doesn’t guarantee performance. You need to provide the method with the right inputs (note: in some situations the raw data will suffice). This is commonly referred to as “feature engineering”. You need to be ready for any potential scenario – and practice makes perfect – but here are some common variable types you’ll need to be aware of:
- The Basics – you’ll learn a lot about feature engineering from the Kaggle competitions. But you will want to keep your eye on social media for posts about good feature development. Two Twitter feeds I recommend are @DataScienceCtrl and @analyticbridge. You can also read about other data scientists giving their takes on acquiring the skills for free.
- Risk tables – translating non-numeric data into risk can be effective when you’re dealing with many categories of varying size. Here’s a blog on building smoothed risk tables.
- Text – when you’re dealing with a text field supplied by the user, anything goes (spelling, formatting, morphology, etc.) – as an example, think about how many variations of your address will result in you getting your mail. Maybe 50? There are entire books written on how to deal with text (here’s one of my favorites). Techniques are generally classified under the heading of NLP (Natural Language Processing). Common methods for translating raw text into features include TF-IDF, token analyzers (e.g. Lucene) and one-hot encoding.
- Composite Features – data science borrows heavily from other fields, often crafting features from the principles of statistics, information theory, biodiversity, etc. A very handy tool to have in your arsenal is the Log Likelihood Ratio. It’s almost like building a mini-model into one feature – rather than have a method discover this from the raw data, present a calculated test statistic which encapsulates the behavior.
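Two of the feature types above can be sketched in a few lines of standard-library Python. The first function builds a smoothed risk table by shrinking each category's outcome rate toward the overall rate; the smoothing formula and the `prior_weight` parameter are one common choice and an assumption here, not necessarily the approach in the blog mentioned above. The second computes a log-likelihood ratio (the G² statistic) for a 2x2 table of counts, so a single number can summarize how strongly two events are associated.

```python
from collections import defaultdict
import math


def smoothed_risk_table(categories, outcomes, prior_weight=10.0):
    """Map each category to an outcome rate shrunk toward the global rate.

    `outcomes` are 0/1 labels. `prior_weight` (an assumed tuning knob) acts
    like that many pseudo-observations at the global rate, so small
    categories are pulled harder toward the overall average.
    """
    global_rate = sum(outcomes) / len(outcomes)
    counts = defaultdict(int)
    positives = defaultdict(int)
    for category, outcome in zip(categories, outcomes):
        counts[category] += 1
        positives[category] += outcome
    return {
        category: (positives[category] + prior_weight * global_rate)
        / (counts[category] + prior_weight)
        for category in counts
    }


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of counts."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for observed, r, c in cells:
        if observed > 0:  # an empty cell contributes nothing to the sum
            expected = rows[r] * cols[c] / n
            total += observed * math.log(observed / expected)
    return 2.0 * total


# Toy data (invented): category "a" is rare and all positive, "b" is common.
cats = ["a"] * 2 + ["b"] * 8
ys = [1, 1] + [0] * 8
print(smoothed_risk_table(cats, ys))
print(log_likelihood_ratio(20, 0, 0, 20))
```

In the toy data, the raw rate for category "a" is 100% on only two observations; the smoothed value sits much closer to the overall rate, which is the point of smoothing. The log-likelihood ratio is zero when the two events are independent and grows as the counts depart from independence.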
3. Demonstrating the Value
Sites like Kaggle are invaluable for loading your data science toolbox with effective methods. And new tools are becoming available every day, like Google’s TensorFlow. But the real art of data science is in applying these tools to address a business challenge – otherwise they’re just impressive equations on a chalkboard. The only real way to learn this art is practice, practice and more practice. Here are a couple of things to keep an eye on while you’re practicing:
- Reason codes – how will your solution be used? It doesn’t happen in every case, but sometimes interpreting the results of your model is critical to realizing the value. For instance, if a high score gets put into a queue for human review, the reviewer might need to know why this particular transaction scored high. Crafting good features and selecting an algorithm that yields reason codes are critical for interpreting the results.
- Operational considerations – the evaluation metric must be chosen carefully when estimating the value of your solution. Generally, there is a cost associated with taking action on a high score. Is there a limit to how many actions you can take (such as the maximum number of units a human team can manually review)? Is there a cost associated with making a bad decision? False positives can be a problem and should be factored into the metrics. The situation presented by these scenarios may make the standard model metrics obsolete.
- Generalizing the results – a model that is trained to great acclaim is worthless if performance doesn’t hold up in real life. For each of your features, ask yourself, “Will this information be available when my model needs it?” If not, the impressive performance you attained in the lab won’t be achieved when it counts. For example, if you’re trying to estimate hospital readmission today and you trained a model on user notes from the previous day, make sure those notes will be available quickly enough that the features could get a value.
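As one illustration of reason codes, consider a linear scoring model, where each feature's contribution to a score can be read off directly. The sketch below ranks features by how far they pushed a particular case above a baseline (for example, the population average). The function name, feature names, weights and baseline values are all hypothetical, and other model families need other techniques to produce comparable explanations.

```python
def reason_codes(weights, baseline, features, top_n=2):
    """Return up to `top_n` features that raised this score the most.

    For a linear model, a feature's contribution relative to the baseline
    is weight * (value - baseline value).
    """
    contributions = {
        name: weights[name] * (features[name] - baseline[name])
        for name in weights
    }
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    # Only features that pushed the score upward count as reasons.
    return [name for name, value in ranked[:top_n] if value > 0]


# Hypothetical fraud-style model: weights, baseline values, one transaction.
weights = {"amount_zscore": 1.5, "new_device": 2.0, "account_age_years": -0.3}
baseline = {"amount_zscore": 0.0, "new_device": 0.1, "account_age_years": 4.0}
transaction = {"amount_zscore": 3.0, "new_device": 1.0, "account_age_years": 0.5}

print(reason_codes(weights, baseline, transaction))
```

A reviewer working the queue would see the named reasons next to the score, which is only useful if the features themselves were crafted to be interpretable.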
4. Championing the Solution
A solution built with great features and the right algorithm that demonstrates tangible value from a real-life business problem is just the beginning of the data scientist’s job. For that value to be realized, you must manage your solution through its journey in the organization. Here are a few of the internal roles you must be prepared to engage:
- Technical experts – since these are the people who will be using your solution, you’ll want to make sure they’re using it correctly. It’s best to leverage their input and experience when developing the features, so they’ll understand how to interpret the model (if not exactly how it works) later.
- Business stakeholders – these are the people who need to understand how the constraints of the model go into threshold selection. A common scenario you need to address is, “If I increase the headcount of the team that handles the queue, what is the increase in profit?” Analysis of the threshold with these individuals can optimize the value of the model long after it’s built.
- Upper management – at a very high level, these individuals want to understand enough of how the model works to feel comfortable that the results are dependable. In some cases, machine learning represents a change to the way in which business decisions are made and that can lead to discomfort among the team. Gaining support for your model at the top can make for a smooth transition to profitability.
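The headcount question that business stakeholders ask can be explored with a simple calculation: rank the cases by score, review only as many as the team can handle, and subtract the cost of each review from the value recovered. The sketch below assumes a fixed cost per review and a known value for each case (zero for a false positive); the function name and all numbers are invented for illustration.

```python
def profit_at_capacity(scored_cases, capacity, review_cost):
    """Net profit from reviewing the `capacity` highest-scoring cases.

    Each case is (score, value_if_reviewed); the value is 0 for a false
    positive, so reviewing it only incurs the review cost.
    """
    ranked = sorted(scored_cases, key=lambda case: case[0], reverse=True)
    reviewed = ranked[:capacity]
    recovered = sum(value for _, value in reviewed)
    return recovered - review_cost * len(reviewed)


# Invented scored cases: (model score, value recovered if reviewed).
cases = [(0.95, 500), (0.90, 0), (0.85, 300), (0.70, 0), (0.60, 40), (0.40, 0)]

for capacity in (2, 4, 6):
    print(capacity, profit_at_capacity(cases, capacity, review_cost=50))
```

With these made-up numbers, profit rises when capacity grows from 2 to 4 and then falls at 6, because the extra reviews cost more than the low-scoring cases return. Walking stakeholders through a curve like this is one way to turn a model threshold into a staffing decision.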
I presented these four categories of data science skills in the typical chronological order in which they are obtained and practiced. However, to perform your job as a data scientist effectively, you should really work from the bottom up. Once you’ve consulted the stakeholders, understood the constraints of the problem, considered the pros and cons of the available methods and outlined the potential features, you are almost assured of success before you’ve even written one line of code.
Bio: Joseph Blue is a Data Scientist at MapR, helping customers to solve their big data problems. Previously, Joe was the Chief Scientist for Optum (a division of UnitedHealth) and the principal innovator in analytics for healthcare.