The Dirty Little Secret Every Data Scientist Knows (but won’t admit)

Most people don’t realize, but the actual “fancy” machine learning algorithm is like the last mile of the marathon. There is so much that must be done before you get there!



By Scott W. Strong, CloudZero.

Original. Reposted with permission.

In May of last year, I needed surgery to fix my left knee in hopes that I could get back to running, one of my favorite activities. Now, you may be thinking "what the heck does this have to do with Data Science?", but during my recovery I found more parallels than you might think. I started to make the connection a few months into rehab, with all of the squats, icing, balancing drills, and massages. Getting back into running shape was so similar to the tasks I had to do on a daily basis to build a strong data science culture. Even more, I came to realize that data science today is much like preparing for and running a marathon (something few do, and even fewer do well).

Many herald machine learning and artificial intelligence as the next industrial revolution, a technology that is driving innovation, and the most important general-purpose technology of our era.

Throughout my eight years of experience in this industry, I have personally observed the benefits of machine learning, and I don’t deny that its impacts can be revolutionary.

Despite these accolades, over the years there has been a healthy skepticism about whether or not machine learning could actually deliver business value. And this criticism is not without merit. In the ten years that ML/AI has been widely known, only in the past few years have C-level executives stated that they are realizing gains from it. Were the concepts of machine learning just ahead of the computing power that was available ten years ago, with hardware only now catching up? Or is there something else at play?

Let's explore what it takes to prepare for this "data science marathon", and uncover the realities behind this much talked about field.

What is Data Science?

Data science, surprisingly perhaps, is not about designing the most advanced machine learning algorithms and training them on all of the data (and then ending up with Skynet). It’s about finding the right data, becoming a quasi-expert on the process, system, or event you are trying to model, and crafting features that will help quirky and sometimes frail statistical algorithms make accurate predictions. Very little time is actually spent on the algorithm itself. Of course, there are exceptions to this, but even in those areas most of the research today is done by a few people (and/or companies) like Google, Facebook, etc. who are advancing the technical aspects of the field. Everyone else only starts using the new algorithms after they are neatly packaged in open source software. A recent study of industry data scientists shows that 79% of their time is spent collecting, organizing, and cleaning data (60% cleaning and organizing, and 19% collecting data), empirically confirming that there is a lot of marathon preparation going on. But when does the actual race start? This presents a very different view of what is going on behind the scenes, and points to a possible reason for the delay in value mentioned earlier.

For a better understanding of why this happens, let's take a look at what a typical data scientist encounters in their job.

A Day in the Life…

When you walk into a company that has never thought about using machine learning, some challenges immediately become apparent. After asking a few questions, you quickly find out that it will take a herculean effort to get the right information in the right place so it can be fed to a machine learning algorithm. This was especially true ten years ago, but is still very common today.

With disparate databases managed by the traditional silos of an organization and no ability to link data points across the organization, you start to hear things like, “oh… You need to talk to Jeff about the sales pipeline data” or “Beth, in DevOps… she’s the only one who has that information”. Once you start to learn more about the business, inevitably you discover that some critical pieces of information are not being tracked or stored. Or even worse, you need to develop a way to gather data that previously wasn’t available. This may sound dramatic, but it is the reality for many data scientists when they start at a company that's just getting into machine learning. It’s no coincidence that the same study mentioned earlier shows that 78% of data scientists say their least favorite part of the job is collecting, organizing, and cleaning data (57% cleaning and organizing, and 21% collecting data).

For many, this means starting to put data collection (and retention) policies in place, or getting approvals throughout the organization to store client data that's needed to do the job they were asked to do.
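To make the silo problem concrete, here is a minimal sketch in Python of what "linking data points across the organization" can look like once two teams agree on a shared key. Everything in it is an assumption made for illustration: the record layouts, the field names (customer_id, stage, outages), and the helper link_records are hypothetical and do not describe any particular company's data.

```python
# Hypothetical records from two silos. Field names are illustrative only.
sales_pipeline = [
    {"customer_id": "c1", "stage": "negotiation"},
    {"customer_id": "c2", "stage": "closed"},
    {"customer_id": "c4", "stage": "prospect"},
]
devops_incidents = [
    {"customer_id": "c1", "outages": 3},
    {"customer_id": "c3", "outages": 1},
]


def link_records(left, right, key):
    """Join two lists of dicts on `key` and report keys that fail to match.

    Assumes `key` is present in every row and unique within `right`.
    """
    right_by_key = {row[key]: row for row in right}
    left_keys = {row[key] for row in left}
    right_keys = set(right_by_key)

    # Rows that exist in both silos are merged into a single record.
    linked = [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]
    # Keys seen in only one silo are the gaps someone has to chase down.
    only_left = sorted(left_keys - right_keys)
    only_right = sorted(right_keys - left_keys)
    return linked, only_left, only_right


linked, only_sales, only_devops = link_records(
    sales_pipeline, devops_incidents, "customer_id"
)
print(linked)       # records present in both silos
print(only_sales)   # customers sales knows about but DevOps does not
print(only_devops)  # customers DevOps knows about but sales does not
```

The join itself is the easy part. The two lists of unmatched keys are where the "talk to Jeff" and "ask Beth" conversations tend to begin.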

Data Scientists as Change Agents

By understanding the environment and constraints that most data scientists work under, it becomes more and more evident that machine learning isn’t actually the innovation that makes the companies using it successful. Of course, there are industries that can benefit, and are benefiting, greatly from machine learning (think self-driving cars, natural language processing, voice recognition, and the like). However, for the vast majority of companies, machine learning is an attractive hammer, and everyone thinks they have a shiny nail. Those companies might perceive their success as coming from machine learning magic, but it could be derived from another source.

So why are people claiming success in their machine learning endeavors and what is driving that success? I don’t believe they are making false claims of success, and I do believe machine learning is having an impact, but for the majority of companies, I believe success is being fueled another way.

What companies don’t see is that by having data scientists and machine learning experts in-house, they are fundamentally changing how the company operates. These experts act as a forcing function on companies to organize their data!

Because machine learning doesn’t work until all (or most) of the company’s data “ducks” are in a row, these individuals lead a company to take action that would otherwise not be taken. In addition, data scientists begin thinking critically about how both products and business decisions can be informed in a data-driven way. This is a dramatic shift in the way most companies operate, and we haven’t even mentioned machine learning yet! I believe that this shift (organizing a company’s data, thinking critically about what data they have or could have, and making that data actionable and available) is the real innovation that machine learning has catalyzed.

Most people don’t realize, but the actual “fancy” machine learning algorithm is like the last mile of the marathon. There is so much that must be done before you get there! All of the miles of training and preparing, and then you have to run the first 25.2 miles of the race. No big deal… right?

It actually is a big deal just to make it to the start line of a marathon, much less get that far into the race.

Challenges of Marathon Preparation

Getting to the start line of an organization's "data science marathon" and embodying these data-driven principles is quite difficult to do in practice. Let's take a look at some of the roadblocks our team at CloudZero has encountered on the way to building a data-driven product and company.

At CloudZero, the team is putting forth a ton of effort to make sense of disparate data sources from Amazon Web Services (AWS) to provide a holistic view of our clients’ web deployments. Being able to visualize and predict website reliability, our core value, isn’t an easy task and requires a lot of preparation. We are still in the training phase of our marathon, but fortunately we are investing heavily in getting the right data in the right place, a task that is cited as most critical to machine learning success.

Adopting this mentality means developing a canonical representation of events and resources within a cloud environment and making them searchable by machines and humans. It also means figuring out the true actor for events that occur in client environments, tracing activities back through chains of “AssumeRole” events (If you’re familiar with AWS, you’ll know the pain). Finally, it means being able to connect the dots between resources and how they communicate with one another, demonstrating dependencies and areas of potential risk.

We recognize that these critical pieces of information are required to be able to contextualize the reliability of any cloud deployment and must be coordinated with our machine learning and data science efforts. Although these tasks are not easy, we are taking them head-on with the knowledge that a data-first approach will provide us with many opportunities to innovate in the industry.
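As a rough illustration of what tracing activity back through a chain of “AssumeRole” events involves, here is a minimal sketch in Python. It is not CloudZero's implementation and it does not use the real CloudTrail event schema: the simplified records, the field names assumed_session and caller, and the helper trace_actor_chain are all hypothetical, chosen only to show the idea of walking the chain back to whoever started it.

```python
# Simplified, hypothetical records: each one says which principal assumed
# which role session. Real CloudTrail events carry far more fields.
assume_role_events = [
    {"assumed_session": "role/deployer", "caller": "role/ci-runner"},
    {"assumed_session": "role/ci-runner", "caller": "user/alice"},
]


def trace_actor_chain(principal, events):
    """Follow assume-role links from `principal` back to the original caller.

    Returns the chain of principals, starting with `principal` and ending
    with the first one that was not itself assumed by anyone in `events`.
    Assumes each session appears at most once in `events`.
    """
    assumed_by = {e["assumed_session"]: e["caller"] for e in events}
    chain = [principal]
    seen = {principal}
    while chain[-1] in assumed_by:
        caller = assumed_by[chain[-1]]
        if caller in seen:
            # Guard against malformed data that loops back on itself.
            raise ValueError("cycle detected in assume-role chain")
        chain.append(caller)
        seen.add(caller)
    return chain


chain = trace_actor_chain("role/deployer", assume_role_events)
print(" <- ".join(chain))
print("true actor:", chain[-1])
```

In this toy data the deployer role was assumed by a CI role, which was in turn assumed by a user, so the last entry in the chain is the actor an event should be attributed to. Real logs add complications this sketch ignores, such as the same role being assumed many times by different callers.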

Ready for the Race?

Machine learning truly is a revolutionary concept that is permeating our society, from the way we interact with our devices and communicate with one another to the way we conduct business around the world. But when you dissect it a little bit more, you begin to see all of the things that enable its success. It may sound blasphemous to say, but it’s important for companies to recognize that with some basic statistics and meaningful organization of their data, they're already 95% of the way there. Machine learning undeniably adds a little extra flair (and accuracy) to the final product, but here's the dirty little secret: it's not necessary to be successful.

In no way does this trivialize the jobs of data scientists or minimize the efforts of machine learning researchers. In fact, I believe understanding these realities increases the importance of their role as change agents within an organization. Without their constant thirst for clean and meaningful data, and the vision to do something groundbreaking with it, companies would likely never make moves to organize their data beyond the status quo.

My hope is that this article helps to break down some of the barriers holding people (or companies) back from harnessing the power of making data-driven decisions and moves us all closer (even a little bit) to the start line of a better future!

Does this ring true at your company? How are data-driven decisions making a difference for you?

Bio: Scott Strong is the Director of Research at CloudZero where he is bringing together the perfect storm of machine learning, modeling and simulation, and hacking for CloudZero's products. Before CloudZero, Scott worked as a machine learning researcher and data scientist at Pindrop, an Atlanta-based startup working to prevent fraud over the phone.
