Great Data Scientists Don’t Just Think Outside the Box, They Redefine the Box

The best data scientists have strong imaginative skills for not just “thinking outside the box” – but actually redefining the box – in trying to find variables and metrics that might be better predictors of performance.

Special thanks to Michael Shepherd, AI Research Strategist, Dell EMC Services, for his co-authorship. Learn more about Michael at the bottom of this post.

Imagine you wanted to determine how much solar energy could be generated from adding solar cells to a particular house. This is what Google’s Project Sunroof does with Deep Learning. Enter an address and Google uses a Deep Learning framework to estimate how much money you could save in energy costs with solar cells over 20 years (see Figure 1).

Figure 1: Google Project Sunroof

It’s a very cool application of Deep Learning. But let’s assume there “might” be an even better way to estimate solar energy savings. For example, suppose you want to use Deep Learning to estimate how much solar energy could be generated by solar panels on the Golden Gate Bridge (that probably wouldn’t be a very popular decision in San Francisco). The obvious approach would be to analyze several photos of the Golden Gate Bridge and estimate how clear the skies are based upon cloud coverage.

However, instead of estimating the potential solar energy generation based upon “cloud coverage,” what if we used “sunlight reflection” to generate the solar energy estimate (see Figure 2)?

Figure 2: Determining Best Predictive Variables for the Golden Gate Bridge

Or maybe you want to test another metric based upon the “sharpness of the shadows” generated by the bridge? Or another metric based upon how many people in the photo are wearing sunglasses? Or yet another metric based upon…

How do you know which of these variables – clouds or reflection or shadows or sunglasses or anything else – is the better predictor of solar energy generation? You try them all!
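
To make “try them all” a little more concrete, here is a minimal sketch of scoring several candidate variables against the outcome you care about and ranking them. Everything in it is an assumption for illustration: the variable names, the synthetic numbers (deliberately constructed so that one variable tracks the target closely and another is pure noise), and the choice of absolute Pearson correlation as the score. A real comparison would more likely use each variable’s predictive accuracy on held-out data.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rank_predictors(candidates, target):
    """Rank candidate variables by |correlation| with the target, best first."""
    scores = {name: abs(pearson(values, target))
              for name, values in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Synthetic, made-up data purely for illustration (not real measurements).
rng = random.Random(0)
energy = [rng.uniform(0.0, 10.0) for _ in range(200)]  # hypothetical solar output
candidates = {
    "cloud_coverage": [-e + rng.gauss(0, 3.0) for e in energy],   # noisy signal
    "shadow_sharpness": [e + rng.gauss(0, 1.0) for e in energy],  # cleaner signal
    "sunglasses_count": [rng.gauss(0, 1.0) for _ in energy],      # unrelated noise
}

ranking = rank_predictors(candidates, energy)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

The point is not the scoring function but the loop: adding one more “might be a better predictor” idea costs one more entry in `candidates`.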

This thought process highlights an important behavioral trait of the best data scientists: they have strong imaginative skills for not just “thinking outside the box” – but actually redefining the box – in trying to find variables and metrics that might be better predictors of performance.

The word “might” is a powerful enabler. “Might” is used to say or indicate that something is possible. It’s a data scientist’s most important concept, because “might” gives the data scientist the license to explore, be wrong, learn and try again.


“It Can’t Be Done” Is Not a Data Scientist Term

Andrew Ng, artificial intelligence visionary and fearless leader for many of us, recently wrote an article titled “What Artificial Intelligence Can and Can’t Do Right Now.” In the article, Andrew states the following:

“Surprisingly, despite AI’s breadth of impact, the types of it being deployed are still extremely limited. Almost all of AI’s recent progress is through one type, in which some input data (A) is used to quickly generate some simple response (B). For example:”

Figure 3: What Machine Learning Can Do

While the use cases are limited today, the creativity with which data scientists are leveraging Big Data and existing Machine Learning and Deep Learning technologies is staggering. Let me give you one example of how data scientists from one of our Services teams at Dell EMC are thinking outside the box to uncover new ways to help our customers avoid issues in their IT environment and create a more effortless support experience.


Predicting Hard Drive Failures

Let’s say that you are capturing more than 260 different pieces of telemetry data several times a minute for the life of a device. Most of these 260+ variables have incomplete or sparse data, the collection timing doesn’t always line up neatly, and getting time continuity across the devices is a major challenge.

If you were using a traditional Machine Learning algorithm, the data science team would have to spend an overwhelming amount of time 1) feature engineering new variables based on domain knowledge, and 2) using trial-and-error to determine which combinations of variables should even be included in the Machine Learning model.

Instead, our Dell EMC Services data scientists used a patent-pending approach to Deep Learning to “pixelate” the data. They turned the 260+ variables into device performance “images.” Once they had created these “images,” the team leveraged a recurrent neural network to find “shapes” and repeatable patterns in what look like random pixels (see Figure 4).

Figure 4: Pixelating Telemetry Data
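
The details of the team’s patent-pending approach aren’t described here, so the following is only a rough sketch of what “pixelating” telemetry could look like under my own assumptions, not the Dell EMC method. It buckets irregular readings into fixed time windows, averages each bucket, and scales each variable to a 1–255 intensity, reserving 0 for cells with no reading. The function name `pixelate`, the variable names, and the sample readings are all hypothetical.

```python
def pixelate(readings, variables, t_start, bucket_seconds, n_buckets, missing=0):
    """Turn irregular telemetry readings into a grid of 0-255 "pixels".

    readings: iterable of (timestamp_seconds, variable_name, value).
    Returns one row per variable, each row holding n_buckets pixel values.
    """
    # Collect the values that fall into each (variable, time-bucket) cell.
    cells = {name: [[] for _ in range(n_buckets)] for name in variables}
    for ts, name, value in readings:
        bucket = int((ts - t_start) // bucket_seconds)
        if name in cells and 0 <= bucket < n_buckets:
            cells[name][bucket].append(value)

    image = []
    for name in variables:
        means = [sum(c) / len(c) if c else None for c in cells[name]]
        present = [m for m in means if m is not None]
        lo, hi = (min(present), max(present)) if present else (0.0, 0.0)
        span = hi - lo
        row = []
        for m in means:
            if m is None:
                row.append(missing)   # sparse cell: no reading in this window
            elif span == 0:
                row.append(128)       # constant signal: mid-gray
            else:
                row.append(1 + round(254 * (m - lo) / span))  # scale to 1..255
        image.append(row)
    return image

# Made-up readings with misaligned timestamps and gaps (illustration only).
readings = [
    (0, "temp_c", 30.0), (65, "temp_c", 40.0), (130, "temp_c", 50.0),
    (10, "read_errors", 0), (200, "read_errors", 4),
]
image = pixelate(readings, ["temp_c", "read_errors"],
                 t_start=0, bucket_seconds=60, n_buckets=4)
print(image)
```

Bucketing is one simple way to cope with collection times that don’t line up, and the reserved “missing” value keeps sparse data visible to the model instead of hiding it behind an imputed number.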

A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed cycle. RNNs can use their internal memory to process arbitrary sequences of inputs, which typically makes RNNs ideal for handwriting or speech recognition. In this case, however, instead of trying to decipher handwriting into words, the data science team used the RNN to decipher the seemingly random pixels into a prediction of the state of the device (see Figure 5).

Figure 5: Using RNNs to Identify Shapes and Patterns Buried in the Telemetry Data
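
For readers who haven’t seen the “internal memory” idea in code, here is a bare-bones sketch of a simple (Elman-style) recurrent network reading a pixelated “image” one time step at a time. It is an assumption-laden toy, not the team’s model: the weights are random and untrained, the layer sizes and the example image are made up, and a production system would use a deep learning framework and learn the weights from labeled device histories.

```python
import math
import random

def rnn_forward(columns, w_in, w_rec, w_out, bias_out=0.0):
    """Run a simple recurrent network over a sequence of input vectors.

    columns: list of input vectors, one per time step.
    Returns a score in (0, 1) computed from the final hidden state.
    """
    hidden = [0.0] * len(w_rec)
    for x in columns:
        new_hidden = []
        for j in range(len(hidden)):
            total = sum(w_in[j][i] * x[i] for i in range(len(x)))
            # The recurrent term feeds the previous hidden state back in.
            total += sum(w_rec[j][k] * hidden[k] for k in range(len(hidden)))
            new_hidden.append(math.tanh(total))
        hidden = new_hidden  # the "memory" carried to the next time step
    logit = sum(w * h for w, h in zip(w_out, hidden)) + bias_out
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid squashes to (0, 1)

# Random, untrained weights (hypothetical sizes: 2 variables, 3 hidden units).
rng = random.Random(0)
n_inputs, n_hidden = 2, 3
w_in = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(n_hidden)]
w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_hidden)]
w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]

# Made-up image: rows are variables, columns are time steps, values are 0-255.
image = [[1, 128, 255, 0], [1, 0, 0, 255]]
columns = [[row[t] / 255 for row in image] for t in range(len(image[0]))]

score = rnn_forward(columns, w_in, w_rec, w_out)
print(f"untrained score: {score:.3f}")
```

Because the weights are untrained, the score itself means nothing; the sketch only shows how each column of pixels updates a hidden state that summarizes everything seen so far.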

I love this example because the team didn’t feel constrained to try to fit the square peg into the round “Machine Learning” hole. Instead, they used Deep Learning in a different context to decipher seemingly random pixels into a prediction of the health of a device. The data scientists didn’t wait until someone developed a better Machine Learning algorithm. Instead, they looked at the wide variety of Machine Learning and Deep Learning tools and algorithms available to them, and applied them to a different, but related use case. If we can predict the health of a device and the potential problems that could occur with that device, then we can also help customers prevent those problems, significantly enhancing their support experience and positively impacting their environment.



One of a data scientist’s most important characteristics is that they refuse to take “it can’t be done” as an answer. They are willing to try different variables and metrics, and different types of advanced analytic algorithms, to see if there is another way to predict performance.

By the way, I included this image just because I thought it was cool. This graphic measures the activity between different IT systems. As in data science, it shows there’s no lack of variables to consider when building your Machine Learning and Deep Learning models!

Want more information on how Dell EMC Services uses data science?

Check out the “Decoding Customer DNA with Data Science” blog by Doug Schmitt, President, Dell EMC Global Services, and watch for the upcoming podcasts “A Conversation with Two Data Geeks” to hear directly from the data scientists behind our transformative technologies.

Co-author

I would like to thank my co-author Michael Shepherd, AI Research Strategist, Dell EMC Services. Michael holds U.S. patents in both hardware and software and is a Technical Evangelist who provides vision through transformational AI data science. With experience in supply chain, manufacturing and services, he enjoys demonstrating real scenarios with the SupportAssist Intelligence Engine showing how predictive and proactive AI platforms running at the “speed of thought” are feasible in every industry.

Original. Reposted with permission.