5 Career Paths in Big Data and Data Science, Explained

KDnuggets Gold Blog, Feb 2017


Sexiest job... massive shortage... blah blah blah. Are you looking to get a real handle on the career paths available in "Data Science" and "Big Data"? Read this article for insight on where to look to sharpen the required entry-level skills.



I have recently had a lot of folks reach out, mainly on LinkedIn, looking for advice on getting started in "Data Science" and/or "Big Data." These people are generally interested in breaking into "the field" and need some direction on how to go about doing so.

A common theme in these requests, however (and I say this with the utmost respect), is a general lack of understanding of what it is they are actually asking. And that's fine; everyone needs to start somewhere, no matter what it is they are learning. Instead of answering these similar requests one by one, this post will serve to lay out some very basic concepts related to "Data Science" and/or "Big Data" career paths, and hopefully provide some advice on how to get one's feet wet in this convoluted field.

Essential Preliminary Reading

 
Before going any further, read the following articles. I mean it. Read. These. Articles.

  1. Data Science Puzzle, Explained
  2. Data Science Puzzle, Revisited
  3. Data Science and Big Data, Explained
  4. Predictive Science vs Data Science

The first article provides a general overview of some of the dominant concepts in data science, with the second being an update to these concepts from earlier this year. The third article provides a deeper treatment of the concepts of data science and Big Data. The fourth and final article is a quick discussion touching on some of the complexities and nuances surrounding the use of the term "data science" versus a number of other terms.

I have broken up the various professional possibilities into an easily manageable set of 5 career paths. While there may be mass outcry and widespread panic over this particular division of roles, the categories serve to group skills and professional responsibilities at a high level. I believe the following is quite useful for orienting newcomers to the myriad opportunities in this professional realm, which are often easily conflated and confused.


Back of the envelope analysis of analytics careers.

Data Management Professional

 
This is essentially an IT role, akin to the database administrator. The data management professional is concerned with managing data and the infrastructure which supports it. There is little to no data analysis that takes place in such a role, and the use of languages such as Python and R is likely not necessary. SQL may be of use, as well as Hadoop-related query languages such as Hive or Pig.

Key technologies and skills to focus on:

  • Apache Hadoop & its ecosystem
  • Apache Spark & its ecosystem
  • SQL & relational databases
  • NoSQL databases

Further reading:

Data Engineer

 
This is the big Big Data non-analytic career path. The data infrastructure mentioned in the previous career path? Well, it needs to be designed and implemented, and the data engineer does that. If the data management professional is the car mechanic, the data engineer is the automotive engineer. But don't get it twisted; both of these roles are crucial to the delivery and continued functioning of your car, and are of equal importance when you are driving from point A to point B.

Truth be told, the technologies and skills required for data engineering and data management are similar; however, they each use and understand these concepts at different levels. I won't repeat the information shared in the role above (all of which is important to the data engineer), and will instead add some further reading specific to the data engineer.

Further reading:

Business Analyst

 
I'm using business analyst in this context to refer to roles related strictly to the analysis and presentation of data. This includes reporting, dashboards, and anything referred to as "business intelligence." The role often requires interaction with (or querying of) databases, both relational and non-relational, as well as with Big Data frameworks.

While the previous pair of roles were related to designing the infrastructure to manage the data, as well as actually managing the data, business analysts are chiefly concerned with pulling from the data, more or less as it currently exists. This can be contrasted with the following 2 roles (machine learning researcher/practitioner and the data-oriented professional), both of which focus on eliciting insight from data above and beyond what it already tells us at face value. As such, business analysts require a unique set of skills among the roles presented.
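To make "pulling from the data, more or less as it currently exists" concrete, here is a minimal sketch of an ad hoc report in Python using the standard library's sqlite3 module. The `sales` table, its columns, the sample rows, and the `revenue_by_region` helper are all hypothetical, invented purely for illustration; a real business analyst would run a similar query against an existing database or warehouse.

```python
import sqlite3

# Hypothetical in-memory table standing in for an existing sales database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [
        ("East", 120.0),
        ("West", 80.0),
        ("East", 60.0),
        ("West", 40.0),
        ("North", 100.0),
    ],
)


def revenue_by_region(connection):
    """Return (region, total, order_count) rows, largest total first."""
    query = """
        SELECT region, SUM(amount) AS total, COUNT(*) AS orders
        FROM sales
        GROUP BY region
        ORDER BY total DESC, region ASC
    """
    return connection.execute(query).fetchall()


report = revenue_by_region(conn)
for region, total, orders in report:
    # Simple fixed-width layout, the kind of thing a dashboard tool would render.
    print(f"{region:<6} {total:>8.2f} {orders:>3}")
```

Note that nothing here predicts or infers anything; the query only summarizes facts already sitting in the table, which is the distinction drawn above.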

Key technologies and skills to focus on:

  • SQL & relational databases
  • NoSQL databases
  • Often requires commercial reporting and dashboard package know-how
  • Reporting can often be ad hoc in nature, and mastery of tools for quickly adapting is key
  • Data warehousing

Further reading:

Machine Learning Researcher/Practitioner

 
Machine learning researchers and practitioners are those crafting and using the predictive and correlative tools used to leverage data. Machine learning algorithms allow for the application of statistical analysis at high speeds, and those who wield these algorithms are not content with letting the data speak for itself in its current form. Interrogation of the data is the modus operandi of the machine learning aficionado, but with enough of a statistical understanding to know when one has pushed far enough, and when the answers provided are not to be trusted.

Statistics and programming are the biggest assets to the machine learning researcher and practitioner.

Key technologies and skills to focus on:

  • Statistics!
  • Algebra & calculus (intermediate level for practitioners, advanced for researchers)
  • Programming skills: Python, C++, or some other general-purpose language
  • Learning theory (intermediate level for practitioners, advanced for researchers)
  • An understanding of the inner workings of an arsenal of machine learning algorithms (the more algorithms the better, and the deeper the understanding the better!)

Further reading:

Deep learning? While it is a form of machine learning, I have included a separate list of suggested readings for clarity:

Data-oriented Professional

 
This is the best description I could come up with for what could otherwise be referred to as the "real" data scientist. You know, the unicorns. Except, there are no unicorns, and anyone who says differently is lying.

The data management professional and data engineer are concerned with the infrastructure which houses the data. The business analyst is concerned with pulling facts from the data as it exists. The machine learning researcher and practitioner are concerned with advancing and employing the tools available to leverage data for predictive and correlative capabilities, with both roles being algorithm-based (either developing, or utilizing, or both). The data-oriented professional is concerned primarily with the data, and the stories it can tell, regardless of what technologies or tools are needed to carry out that task.

The data-oriented professional may use any of the technologies listed in any of the roles above, depending on their exact role. And this is one of the biggest problems related to "data science": the term means nothing specific, but everything in general. This role is the Jack Of All Trades of the data world, knowing (perhaps) how to get a Hadoop ecosystem up and running; how to execute queries against the data stored within; how to extract data and house it in a non-relational database; how to take that non-relational data and extract it to a flat file; how to wrangle that data in R or Python; how to engineer features after some initial exploratory descriptive analysis; how to select an appropriate machine learning algorithm to perform some predictive analytics on the data; how to statistically analyze the results of said predictive task; how to visualize the results for easy consumption by non-technical folks; and how to tell a compelling story to executives with the end result of the data processing pipeline just described.

And this is but one possible set of skills a data scientist may possess. Regardless, however, the emphasis in this role is on the data, and what can be gleaned from it. Domain knowledge is often a very large component of such a role as well, which is obviously not something that can be taught here.
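As a toy illustration of the tail end of the pipeline described above (flat file, wrangling, a simple predictive model, a statistical check, and a one-line "story"), here is a minimal sketch using only the Python standard library. The data, the column names `ad_spend` and `sales`, and all function names are hypothetical, and ordinary least squares stands in for whatever algorithm a real project would call for.

```python
import csv
import io
from statistics import mean

# Hypothetical flat-file extract: ad spend vs. sales. One row is missing a value.
RAW = """ad_spend,sales
10,24
20,47
30,63
,50
40,86
"""


def load_clean_rows(text):
    """Wrangle: parse the CSV text and drop rows with missing values."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        if record["ad_spend"] and record["sales"]:
            rows.append((float(record["ad_spend"]), float(record["sales"])))
    return rows


def fit_line(points):
    """Model: ordinary least squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in points)
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


def r_squared(points, slope, intercept):
    """Evaluate: fraction of the variance in y explained by the fitted line."""
    y_bar = mean(y for _, y in points)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - y_bar) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot


rows = load_clean_rows(RAW)
slope, intercept = fit_line(rows)
fit_quality = r_squared(rows, slope, intercept)

# Story: one sentence a non-technical reader can act on.
print(
    f"Each extra unit of ad spend is associated with about {slope:.2f} more sales "
    f"(R^2 = {fit_quality:.3f}, n = {len(rows)})."
)
```

With four usable rows, the fit statistic says very little, and knowing that is part of the job: the emphasis stays on what the data can and cannot support, not on the tooling.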

Key technologies and skills to focus on:

  • Statistics!
  • Programming languages: Python, R, SQL
  • Data visualization
  • Communication skills

Further reading:

 
As this is an introductory article, I have intentionally left out any mention of the Internet of Things (IoT). This is for 2 reasons: first, I don't want to add further confusion for anyone trying to absorb all of this new material, and second, IoT is but a special case of data, and each of these roles can apply to IoT data with, perhaps, some modifications. But the core truths remain.

I hope this overview has been of use to people who are looking to start off on a "Data Science" or "Big Data" career path but aren't quite sure where or how to begin. Keep in mind that this is in no way an exhaustive curriculum for taking on any of the roles mentioned herein. It is, however, a good place to start for individuals with little understanding of data professions.

If you are interested in a different take on the topic, read Zachary Lipton's Will the Real Data Scientists Please Stand Up?
