Unicorn Data Scientists vs Data Science Teams
A recent post has generated an intense discussion about whether it is possible to find "unicorn" data scientists who combine all the needed skills, or whether that skillset is best filled by a team. Here are the highlights, including a proposal for how to train well-rounded data scientists.
By Gregory Piatetsky, Dec 30, 2013.
The guest post by Michael Mout, What is Wrong with the Definition of Data Science, has generated an intense discussion about whether it is possible to find data scientists who have all the needed skills or whether those roles are best filled by a team.
Michael Mout divided Data Science into 3 areas:
- Advanced Analysis - Math, Stats, Pattern Recognition/Learning, Uncertainty, Data Mining, Visualization
- Computer Systems - Advanced Computing, High Performance Computing, Data Mining, Visualization (?)
- Data Bases - Data Engineering, Data Warehousing.
I note that the above Data Science Venn Diagram differs from the famous Data Science Venn Diagram proposed by Drew Conway in 2010, whose 3 main components are Hacking Skills, Math & Statistics Knowledge, and Substantive Expertise (Domain Knowledge).
Mout is missing the "Domain Knowledge" part, as Vincent Granville pointed out in the comments below.
Here are highlights from the comments on What is Wrong with the Definition of Data Science:
This Venn diagram misses the most important circle: domain expertise / business acumen. You can be a data scientist without computer science, statistics or data base (though it would be very difficult). You can't be a data scientist without deep domain expertise and horizontal business knowledge.
Thomas Speidel to Vincent Granville
Yes you can. The point of the article is that one cannot be all of those things at the same time. Domain expertise is a crucial component (perhaps forgotten by the Venn diagram), but it need not be held by the same person doing data science. The Venn Diagram describes the composition of a team, of which the domain expert ought to be one component.
I disagree ... that specialization is the road ahead. Any respectable modern CS graduate degree will give you exposure to all these areas. Actually, we have more issues getting our Ph.D. statisticians to program than our M.A. CS graduates to pick up common machine learning algorithms, statistics refresher courses, and linear optimization details.
Thomas Speidel to linuxster
Superficial exposure is one thing. In-depth knowledge is another. Data Science should not be about programming, [and] ... is not just ML. There's so much more to data science than just CS: experimental design, interpretation, validity, replication, effect size, Bayesian statistics, calibration and so on. This is just on the statistics side. Then you have database design/management, warehousing, visualizations, and so on. It is simply not possible to know such a vast subject in the detail needed to make sound, informed decisions.
I disagree with the assumption that combining these different areas is impossible, and I think in the future you will see more skills overlap within individuals, not less. It will initially be hard to find new graduates with these skill sets, since this is a new discipline and it doesn't line up directly with existing degree programs. Over time, that should improve as it did with software engineering and computer science.
In traditional organizations, many of the skills mentioned in the post were possessed by distinct individuals or teams. This creates friction in the product development process vs. having statistical engineers or data scientists who can both build algorithms and implement them in production. Software engineering is a much higher leverage activity than it was 10 years ago, and a scientist or statistician who can write high performance code to run on large datasets is a valuable asset.
The Data Scientist role itself was a reaction to limits of traditional roles and organizational silos, so a higher degree of specialization is not a great path. Within a team, you will always have individuals who are stronger in certain areas, but we should aim to develop some basic competency across all these skills. I'd actually add more requirements to that list above, not less. Regarding hiring, I wouldn't expect junior candidates to be experts in everything, but that is where training and development come in.
John Rauser of Pinterest has a great talk on the different paths towards becoming a data scientist here: www.youtube.com/watch?v=0tuEEnL61HM
I think a realistic goal to shoot for is (1) a set of core skills, (2) deep expertise in 1 or more focus areas with some basic competency in the rest, and (3) some amount of domain expertise acquired over time through applied work.
These might look something like this:
- Basic CS, Software Development, Tools
- Data Engineering (Distributed Computing, etc.)
- Scientific Training, Mathematics, Modeling, Theory
- Machine Learning
- Business Analytics
- Graph Mining / Network Intelligence
- Text Mining / Information Retrieval
- Data Visualization
- Consumer Internet
- Oil & Gas