
How to Become a (Type A) Data Scientist


This post outlines the difference between a Type A and a Type B data scientist, and prescribes a learning path for becoming a Type A.



Top KDnuggets Blogger for August 2016.
There is an interesting Quora post by Michael Hochster, answering "What is Data Science?", which describes two types of Data Scientists: Type A Data Scientists and Type B Data Scientists (emphasis mine):


Data Scientists are people with some mix of coding and statistical skills who work on making data useful in various ways. In my world, there are two main types:

Type A Data Scientist: The A is for Analysis. This type is primarily concerned with making sense of data or working with it in a fairly static way. The Type A Data Scientist is very similar to a statistician (and may be one) but knows all the practical details of working with data that aren't taught in the statistics curriculum:  data cleaning, methods for dealing with very large data sets, visualization, deep knowledge of a particular domain, writing well about data, and so on.

The Type A Data Scientist can code well enough to work with data but is not necessarily an expert. The Type A data scientist may be an expert in experimental design, forecasting, modelling, statistical inference, or other things typically taught in statistics departments. Generally speaking though, the work product of a data scientist is not "p-values and confidence intervals" as academic statistics sometimes seems to suggest (and as it sometimes is for traditional statisticians working in the pharmaceutical industry, for example). At Google, Type A Data Scientists are known variously as Statistician, Quantitative Analyst, Decision Support Engineering Analyst, or Data Scientist, and probably a few more.

Type B Data Scientist: The B is for Building. Type B Data Scientists share some statistical background with Type A, but they are also very strong coders and may be trained software engineers.  The Type B Data Scientist is mainly interested in using data "in production."  They build models which interact with users, often serving recommendations (products, people you may know, ads, movies, search results).

In this post, we discuss strategies to transition your career to the role of a (type A) Data Scientist.

The ideas I discuss here are based on my work with the participants of the Data Science for IoT course.

When I first read this idea of Type A and Type B Data Scientists, I found it incredibly liberating.

It makes you realize that the ‘unicorn’ Data Scientist (who knows it all!) is, like its equine counterpart, largely mythical.

Having acknowledged that, you can then start to make practical progress towards a strategy for transitioning your career towards Data Science.

Here, I discuss twelve uncommon strategies for transitioning to a Type A Data Scientist role, based on my work with my course participants.

But first, let us discuss one common theme. Yes, you must build an app to learn Data Science. But building alone is not enough. When you start with limited resources and try to build a serious app for Data Science, you find quite quickly that the biggest limitation is the lack of serious data. So, like everyone else, you end up building your app against the UCI datasets (or similar). Hence, you need to think of wider strategies in addition to building. Here are some of the strategies we follow in our teaching:

Notes:

  • The audience here is someone who is exploring Data Science on their own.
  • I use the Type A classification to mean someone who uses Data Science to solve many complex problems but is not responsible for running a high-performance model in production.
  1. Start with what you know: This seems obvious but is often ignored. For example, imagine you have spent the better part of your career with Oracle and you want to be a Data Scientist. Why not start with Oracle? There is a whole suite of Oracle BI tools. These cover the whole range from visualization to advanced analytics, and they use PL/SQL or R. That gets you started a lot faster than learning something entirely new. I would use the same strategy for Microsoft (Azure and Power BI) and Amazon (AWS IoT); both of these platforms are advanced.
  2. Focus on Statistics: There is a tendency to confuse Data Science (analytics and the application of algorithms) with Big Data (Hadoop-based distributed infrastructure). To master the former, you must have some background in Statistics and understand the maths behind the predictions. I do not think you can really master both, especially in the early stages. But people still try, and they do not get far in either. So, my advice in my teaching is to focus on the use of Statistics (algorithms) to solve problems. It is also worth reading this insightful post: Why Big Data is in trouble? Because they forgot about applied statistics. Also remember that some of the best-known Data Scientists today, like Gregory Piatetsky-Shapiro and Kirk D Borne, come from a strong maths/stats background.
  3. Focus on solving problems with high volumes of Data: This is not necessarily the same as Big Data (because it may not involve distributed data processing), but working with very large datasets gives you a richer experience. This holds even if the dataset is not directly related to the application you may work on in future. In other words, the experience of handling large datasets is valuable in itself. That explains why many Data Scientists, like Kirk D Borne, have a background with NASA (or space research): because of the abundance of data. Many good data scientists also come from a weather prediction background (for the same reason). In our business (Data Science for IoT), data volumes matter. You can do IoT analytics using a simple Raspberry Pi/Arduino, or you can analyse data for an entire airliner or an oil rig. The two are entirely different. One of my favourite books on Data Science is Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data by Željko Ivezić, Andrew J. Connolly, Jacob T. VanderPlas and Alexander Gray. Even at 550-odd pages it is very readable. More importantly, it works with very large datasets.
  4. Use the Zulu Principle: The Zulu Principle is based on the idea of becoming an expert in a clearly defined, narrow subsection of a theme. I have used this myself with Data Science for IoT, i.e. identify a niche and work on it extensively to demonstrate expertise.
  5. Systems Thinking - solve an industry-wide problem: The opposite of the micro-niche strategy is the Systems Thinking approach. This requires wide understanding and experience to see interconnections or gaps in the industry. Again, this is a strategy I have used, working with Jean Jacques Bernard to create an open methodology for IoT analytics.
  6. Kaggle algorithms: If you follow Kaggle contests, they tend to use some algorithms more than others, e.g. XGBoost. It is better to keep these in mind from the early stages, since you may end up using them soon.
  7. Focus on SQL and NoSQL: There is a lot that can be achieved using SQL and NoSQL. See these steps to get started with SQL and NoSQL.
  8. Knowledge of analytics in a vertical: This is self-explanatory but often overlooked. Business knowledge will be increasingly important and in every vertical there are specific metrics to be predicted or analysed.
  9. AI and deep learning: AI and Deep Learning are definitely applicable in many domains of Data Science and should be a key focus, but approach them with caution, as the post from Yann LeCun shows.
  10. A focus on tools: There are many tools that make your life a lot easier as a Data Scientist. In most cases, they have interfaces to programming languages like R. Full lifecycle tools which we have used in the course are: H2O.ai, Dataiku, TIBCO Spotfire and Pentaho.
  11. Focus on R: Whatever your view on the R vs. Python debate, R is getting a lot of traction at a corporate level, i.e. among vendors targeting enterprises: Oracle, Microsoft, HPE (Vertica), SAP (HANA), Hitachi (Pentaho) etc.
  12. Finally, a focus on emerging domains: IoT is still an emerging domain for Data Science. IoT has unique characteristics, such as a greater emphasis on time series, sensor fusion etc. Even within IoT, new domains emerge; for example, we are currently exploring IoT analytics for Apache NiFi. Blockchain is also currently an emerging domain for analytics.
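
As a small illustration of strategy 7 (how much can be done with plain SQL), here is a minimal sketch that summarizes sensor-style readings per device using Python's built-in sqlite3 module. The table name, column names, device names and readings are all hypothetical, invented for this example; they do not come from the course or from any real dataset.

```python
import sqlite3

# Hypothetical sensor readings: (device_id, hour, temperature in deg C).
READINGS = [
    ("pump-1", 0, 20.0),
    ("pump-1", 1, 22.0),
    ("pump-1", 2, 24.0),
    ("pump-2", 0, 30.0),
    ("pump-2", 1, 34.0),
]


def summarize(readings):
    """Return (device_id, n_readings, mean_temp, max_temp) for each device."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute(
            "CREATE TABLE readings (device_id TEXT, hour INTEGER, temp REAL)"
        )
        conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", readings)
        # One GROUP BY query yields the count, mean and maximum per device.
        rows = conn.execute(
            """
            SELECT device_id, COUNT(*), AVG(temp), MAX(temp)
            FROM readings
            GROUP BY device_id
            ORDER BY device_id
            """
        ).fetchall()
    finally:
        conn.close()
    return rows


if __name__ == "__main__":
    for row in summarize(READINGS):
        print(row)
```

The same GROUP BY pattern carries over to most SQL databases, which is one reason SQL is a practical place to start before moving on to heavier tooling.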

To summarize:

  1. Not trying to be a ‘Unicorn’ can be liberating. It can help you to focus on practical steps for mastering Data Science.
  2. There are many possible avenues. If you want to be a data scientist, ideally you would look to excel in one area and know a lot of other things well enough.
  3. You must know coding – but you need not be a narrow expert if you want to start with being a Type A data scientist
  4. You must build things – there is no substitute for building services – but you need some thought on what exactly you want to build as above
  5. Understand and apply statistics
  6. Remember that Data Scientist is a generalist role
  7. Data Science is a very rapidly expanding field and many opportunities are yet to be explored in verticals, new domains, new tools etc.
  8. It may be relatively easy to start as a Type A data scientist and then progress to Type B (production).

I explore many of these ideas in the Data Science for IoT course.
