Battle of the Data Science Venn Diagrams

First came Drew Conway's data science Venn diagram. Then came all the rest. Read this comparative overview of data science Venn diagrams for both the insight into the profession and the humor that comes along for free.

By David Taylor, Biotechnologist.

Data science is a rather fuzzily defined field; some of the definitions I've heard are:

  • "Work that takes more programming skills than most statisticians have, and more statistics skills than a programmer has."
  • "Applied statistics, but in San Francisco."
  • "The field of people who decide to print 'Data Scientist' on their business cards and get a salary bump."

Personally, I've recently decided to avoid the controversy by calling myself a data spelunker. (Data miners are out of vogue anyway.)

As a field in search of a definition, it's unsurprising that you can find a lot of different attempts to define it.

As a field full of data nerds with a penchant for visualization, it's also unsurprising that a lot of them use Venn diagrams. (Fun fact: John Venn, who invented the eponymous diagrams, and his son filed a patent in 1909 for a lawn bowling machine.)

1. It all started with Drew Conway in 2010 (catching fire when he blogged it in 2013):

For Conway, the center of the diagram is Data Science. There's some controversy over what the bottom circle means (I'll address it farther down); all I can say is that if Conway meant something other than what I would call domain knowledge (e.g. physics), he chose the name Substantive Expertise very poorly. So assuming domain knowledge is at least part of what he meant, the idea is that a physicist, say, would have expertise in physics and math/stats knowledge, but lack hacking knowledge (I've met many physicists and I think that's less true than it used to be). Machine Learning experts tend to apply algorithms without an understanding of the domain they're analyzing (that sure as heck was my case when I first started building models in an industry that was totally new to me; I had to play a lot of catchup). And then people who can program and know their field but have no way to tell a statistically significant result from one arising from sheer coincidence are dangerous; they can arrive at some drastically wrong solutions and, for example, lose their companies lots of money.

Note that this isn't how a Venn diagram works. Hacking Skills, for example, should apply to that entire circle, and the part that doesn't intersect with anything should be labeled, e.g. "hackers". But that's a fairly minor point; it's obvious what he's getting across.
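For the set-minded, here is a minimal sketch of that distinction using Python sets. The people and their skills are invented purely for illustration (nothing here comes from Conway's post); the point is only that each circle is a whole set, and the labels Conway prints on the diagram really belong to regions computed from those sets.

```python
# Hypothetical people, invented purely for illustration.
# Each set is one whole circle of the diagram.
hacking = {"ana", "ben", "cho", "dev"}
math_stats = {"ben", "cho", "eli", "fay"}
substantive = {"cho", "dev", "fay", "gus"}

# The part of the hacking circle that overlaps nothing else:
# in a strict Venn diagram, this is what would be labeled "hackers".
hackers_only = hacking - math_stats - substantive

# Conway's labels describe the overlapping regions instead.
machine_learning = (hacking & math_stats) - substantive
traditional_research = (math_stats & substantive) - hacking
danger_zone = (hacking & substantive) - math_stats
data_science = hacking & math_stats & substantive

print(hackers_only, machine_learning, traditional_research,
      danger_zone, data_science)
```

With this toy data, every region ends up with exactly one person in it, which makes it easy to see who lands where.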

2. After Conway's was made but before it was blogged, Brendan Tierney made a diagram in 2012 that's kinda Venn-ish.

It... sure is busy. KDD stands for Knowledge Discovery and Data Mining, by the way. Despite that, Data Mining also has its own circle. I do appreciate what he did here, though, implying what makes data science worthy of its own field is the breadth of its required skills. Apparently one of those skills is Neurocomputing, which seems a little... specific.

3. Quick on Conway's heels, Ulrich Matter blogged his riff on it later the same month in 2013:

He's flipped it on the diagonal, specified the substantive expertise as Social Sciences (his field), changed hacking to computer science (you can see why someone would object to being characterized as a hacker, although I for one embrace it), and for some reason changed Math & Stats to Quantitative Methods. More importantly, he's moved Data Science where Machine Learning was in Conway's -- that's an interesting distinction, and one I've seen in the field. There are data scientists who specialize in one domain, and then there are generalists (who usually started out in one field but branched out, like me: I started in chemistry and now I'm in insurance). Also, he's apparently not comfortable with Danger Zone, changing it to... a question mark. But apparently what matters to Matter (so to speak) is in the center of the diagram: Data-driven Computational [Social] Science.

A... bit wordy, shall we say? He also made sure to insert Empirical into Traditional Research.

4. After the Edward Snowden news broke, Joel Grus supplied this tongue-in-cheek (or is it?) version. Now we're getting into more rarefied Venn territory, with four circles, the fourth being "evil".

5. In September 2013, Harlan Harris adapted this diagram to deal with data products instead of science.

The slices are no longer comparable to Conway because we've changed from science to products, but the categorizations are noteworthy (and they follow true Venn methodology, not being slices in themselves). Domain Knowledge remains, Computer Science/Hacking remains as Software Engineering, and crucially, Harris has added Predictive Analytics and Visualization to the Statistics circle. But not the actual tools they use; those are in the intersection with Software Engineering. Okay.

6. In January 2014, Steven Geringer provided a tweak that, instead of putting Data Science in the middle three-way intersection like Conway, calls all of it data science and calls the intersection Unicorn (i.e. a mythical beast with magical powers who's rumored to exist but is never actually seen in the wild).

This is... a little weird, Venn-diagrammatically speaking. I think I know what he's getting at. When I first heard people referred to as data scientists, I often heard the riposte, "Aren't all scientists, by definition, data scientists?" True, there are no sciences that do not deal in data (insert psychiatry joke here), but still, data science, while quite nebulous, isn't just an umbrella term.

Plus, I'm sorry, but you can see the screengrab of his mouse arrow in his diagram.

Edit: An earlier version of this post omitted to give Geringer credit where credit is definitely due: he was the first to remove the Danger Zone! (Great, now that song is going to be in my head all day.) Now people with subject matter expertise and computer skills can make Traditional Software without blowing the world up, or whatever. (My apologies to Mr. Geringer, and my thanks for his correction.)