The Infinity Stones of Data Science
Do you love data science 3000? Don't want to be embarrassed in front of the other analytics wizards? Aspire to be one of Earth's mightiest heroes, like Kevin Bacon? Help make data science a snap with these simple insights.
There is an ongoing worldwide pop culture phenomenon which has recently engulfed the entire world, and of course you know what I'm talking about: data science!
Actually, it's Avengers Endgame, the culmination of over a decade of storytelling in the Marvel Cinematic Universe (MCU). But you probably knew that.
While the story in Endgame revolves more or less around the Infinity Stones — as has the entire MCU for quite a while — and their role in saving nothing less than the entire universe (or half of it, anyways), the practice of data science actually also has something to learn from their powers. I know you don't believe me, but let's take a look.
Don't forget, Thanos was a bit of a data scientist himself. He identified a problem and its solution, though you might take issue with his conclusions, not to mention the complete lack of scientific process is his line of thinking. But that's beside the point.
Here is how the Infinity Stones map to the practice of data science, and the lessons they can teach us about our own practice.
The Reality Stone grants the user power to manipulate matter.
Let's look at this first one metaphorically. Presumably, in order to manipulate reality, we have to understand it. This seems to be a good (enough) analogy for the importance of domain knowledge.
Setting off on a data science project without an understanding of the project's domain is not only a bad idea, the end result likely won't be rooted in reality. We don't allow otherwise competent athletes who have never learned the rules of baseball to play professional baseball; likewise, we should not expect to be able to perform competent data science within a domain we do not understand, regardless of our statistical, analytical, technical, and related skills otherwise.
What exactly constitutes sufficient domain knowledge? It's relative. Are you doing a shallow descriptive analysis of some run-of-the-mill dating app? Or are you undertaking some in-depth predictive analytics project in finance for an organization which specializes in some obscure securities investment strategy? The required knowledge of the "dating" domain to perform the first feat is likely negligible, but any useful insights into the second will certainly demand a solid financial understanding.
The Space Stone gives the user power over space.
Power over space, huh? How about power over data space? And how would one get power over their data space? Intimate knowledge, by way of exploratory data exploration.
But how much data exploration, and exactly what kind of exploration are we talking? This is going to surprise you, but... it's relative. If we are interested in descriptive analysis — that is, no prediction, along the lines of a straightforward data analysis — the more intimately we are familiar with the data the better. The end is the means in this case, and so the quality of describing, visualizing, and sharing the data, a la a data analyst, is highly correlated with exploration intimacy.
When it comes to predictive analytics, and machine learning undertakings, there are differing opinions on how much exploratory data exploration is helpful. There are also differing opinions on the level of exploratory analysis of datasets which are not being used for training (i.e. validation and testing sets). This aside, in order to ensure maximum power over your data space is achieved, be sure to guard against the potential pitfalls of poor exploratory data analysis or shoddy visualizations, such as the correlation fallacy, Simpson's paradox, and the ecological fallacy.
When performed properly, exploratory data analysis will provide an understanding of your data in a way which allows successful data science to follow.
The Time Stone grants its owner the power to re-wind or fast-forward time.
If you have studied algorithm complexity, you know that the choice of algorithm can severely impact the time it takes to complete a given computation task, even with the same data, and it is for this reason that algorithm and method selection is our equivalent of being able to fast-forward time.
This will apply to both the outright selection of algorithms, as well as the setting of hyperparameters which also have an impact on run time. Neural network architectures can be incredibly complex, but a pair of equivalently simple neural nets can have vastly different convergence times when using vastly different learning rates.
You know about the bias-variance tradeoff, but there is also a space-time tradeoff, as well as a complexity-speed tradeoff that can be made. One logistic regression model might not perform quite as well as a random forest of thousands of trees, yet that hit to performance may be worth it to you in exchange for speed, to say nothing of the increase in explainability the logistic regression model may provide over the random forest (if that's your thing).
This isn't to say that you must choose a faster (or less complex, or less compute intensive, or more explainable) algorithm, but you need to keep in mind that it is one of the tradeoffs you are making, and one of the best ways we have of controlling the flow of time.
At least, it is one of the best was we have of controlling the flow of time without the real Time Stone.
The Power Stone bestows upon its holder a lot of energy—the sort of energy that you could use to destroy an entire planet.
That sounds like a lot of energy. Where do we find this sort of energy in the data science world? Computational power!
Computational power (or "compute") is the collective computational resources we have to throw at a particular problem. Unlimited compute was once thought to be the be all and end all of computing, and for good reason. Consider how little compute there was one, two, or three decades ago, in comparison to today. Imagine scientists sitting around and thinking about problems they could solve, if only they had more than a handful of MHz worth of compute at their disposal. The sky would be the limit!
Of course, that's not exactly how things have turned out. Sure, we have a lot more compute at our disposal now than we ever have in the past in the form of supercomputers, the cloud, publicly available APIs backed by heaping amounts of compute, and even our own notebooks and smartphones, comparatively. All sorts of problems we could never have imagined would have enough compute to solve are now tractable, and that's a great development. We need to keep in mind, however, that "clever" is a great counterbalance to compute, and lots of advancement in data science and its supportive technologies have been made possible by brain over brawn.
Ideally, a perfect balance of brain and brawn could be used to attack every problem out there, with clever being used to setup a perfect algorithmic approach, and the necessary compute being available to back it up. Perhaps this is an area data science will one day prove to help itself.
Until then, rest assured that there is compute available for even the less than perfect approaches to problem solving.
It’s unclear what the Soul Stone’s powers are in the film universe. In the comics, the gem allows the holder to capture and control others’ souls.
"Capture and control others' souls" sounds ominous, and more than just a little bit devious. But if we take a more positive spin on the concept of the Soul Stone, we could force an equivalence between it and the power of prediction. We are training models to control the innermost essence of unlabeled data — its soul — by making informed predictions as to what it truly holds.
That's not a stretch at all. Right?
The Soul Stone, then, is analogous to the power of prediction, which is to say it lies at the absolute core of data science. What are data scientists trying to accomplish? They are trying to answer interesting questions with available data in order to make predictions which align as closely as possible with reality. That prediction piece seems pretty crucial.
And, given that it is so crucial, it should be evident that the outcomes of data science should be handled with the utmost care. Not to oversell it, but the soul of our work is the value it can create (whether that be to business, to charities, to government, to society in general), and taking this lightly, or just treating the predictive outcomes as another of the steps in the data science process, is ill advised.
Data science is a fight for the soul, coupled with the upcoming battle for the mind.
The Mind Stone allows the user to control the minds of others.
Which of the Infinity Stones allow for the control of others' minds? That's the Mind Stone, of course, and in the world of data science nothing better helps control the minds of others than a well-crafted data presentation, including a compelling story and effective visualizations.
The modeling is complete. The predictions have been made. The insights are... insightful. It's now time to inform the project stakeholders of the outcomes. But the non-data scientists among us don't have the same interests in, or understandings of, data and the data science process, so we need to be effective in presenting our findings to them in a way they will appreciate.
This isn't some form of condescension or "they just don't get it"; this is an acknowledgement of the reality that everyone has different skills, interests, and roles, and those of the individuals who could most benefit from our work don't necessarily share those of the data scientist. So tailor your presentation accordingly.
Remember, if your insights don't translate to being useful, then your work isn't complete. It's up to you to convince others of the value of your work. Once convinced, they can take action, and
change through action is the real pay-off of any data science project.
Infinity Stone descriptions came from this article.
All comic personalities mentioned herein, and images used, are the sole and exclusive property of Marvel Comics.