How to Balance the Five Analytic Dimensions
When developing a solution, one has to consider data complexity, speed, analytic complexity, accuracy & precision, and data size. It is not possible to be the best in all categories, but it is necessary to understand the trade-offs.
By Damian R. Mingle (WPC Healthcare).
Many Data Scientists select an analytic technique in hopes of achieving a magical solution, but in the end the solution may simply not be possible because of other limiting factors. It is important for organizations working with analytic capabilities to understand the implementation constraints that most real-world applications will encounter. When developing a solution, one has to consider data complexity, speed, analytic complexity, accuracy & precision, and data size. Neither Data Scientists nor the organizations they work for will be able to be the best in every category simultaneously; however, it is necessary to understand the trade-offs among them.
It is important to know as much as possible about the data. Practically, this means understanding the data types, formal complexity measures, measures of overlap and linear separability, the number of dimensions/columns, and the linkages between data sets. For example, one must be able to link healthcare remittances to paid claims that come in all flavors (fully paid, partially paid, and denied) over long periods of time. These linkages can be extremely complex.
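To make the linkage example concrete, here is a minimal sketch of tying remittances back to claims and labeling each claim by how much was paid. The records, field names (`claim_id`, `billed`, `paid`), and the `payment_status` helper are all hypothetical, and the rule that "nothing paid means denied" is a simplifying assumption; real remittance data carries adjustment and denial codes that this ignores.

```python
from collections import defaultdict

# Hypothetical toy records; field names are assumptions for illustration only.
claims = [
    {"claim_id": "C1", "billed": 200.0},
    {"claim_id": "C2", "billed": 150.0},
    {"claim_id": "C3", "billed": 90.0},
]
# A single claim may receive several remittances over time.
remittances = [
    {"claim_id": "C1", "paid": 120.0},
    {"claim_id": "C1", "paid": 80.0},
    {"claim_id": "C2", "paid": 50.0},
    {"claim_id": "C3", "paid": 0.0},
]


def payment_status(claims, remittances):
    """Link remittances to claims and label each claim by total amount paid."""
    # Sum every remittance that points at the same claim.
    paid_by_claim = defaultdict(float)
    for remit in remittances:
        paid_by_claim[remit["claim_id"]] += remit["paid"]

    status = {}
    for claim in claims:
        paid = paid_by_claim.get(claim["claim_id"], 0.0)
        if paid <= 0:
            # Simplifying assumption: no payment is treated as a denial.
            status[claim["claim_id"]] = "denied"
        elif paid < claim["billed"]:
            status[claim["claim_id"]] = "partially paid"
        else:
            status[claim["claim_id"]] = "fully paid"
    return status


statuses = payment_status(claims, remittances)
print(statuses)
```

Even in this toy form, the one-to-many relationship (claim C1 is settled by two remittances) shows why linkage adds complexity beyond simply counting columns.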
The speed at which an analytic outcome must be produced (e.g. near real-time, hourly, daily), or the time it takes to develop and implement the analytic solution, is another key consideration. This dimension causes a lot of angst for most Data Scientists, primarily because they generally want to come up with an optimal solution regardless of time. However, we can all agree that if an enterprise needs to deploy new predictions every 15 minutes, but it takes 1.5 hours to retrain the algorithm, then the solution will not be successful.
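The arithmetic behind that example can be sketched in a few lines. The function name `fits_refresh_window` and its parameters are hypothetical, and the sketch assumes the model must be fully retrained before every refresh, which is the situation the example describes.

```python
import math


def fits_refresh_window(retrain_minutes, scoring_minutes, refresh_minutes):
    """Return True if one retrain-and-score cycle fits inside the refresh interval."""
    return retrain_minutes + scoring_minutes <= refresh_minutes


# The example from the text: new predictions every 15 minutes,
# but retraining takes 1.5 hours. Scoring time is assumed negligible here.
retrain_minutes = 1.5 * 60
refresh_minutes = 15

feasible = fits_refresh_window(retrain_minutes, 0, refresh_minutes)
# Number of refresh intervals that pass while a single retrain is running.
intervals_per_retrain = math.ceil(retrain_minutes / refresh_minutes)

print(feasible)               # False
print(intervals_per_retrain)  # 6
```

Six refresh intervals elapse during a single retrain, so either the algorithm has to change or the retraining has to be decoupled from the prediction schedule.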
Algorithm complexity is measured in terms of complexity class and the execution resources required. This dimension can be limiting if complexity needs to stay low in order for the business to grasp what is going on, which clearly restricts a Data Scientist's ability to create an optimal outcome. Some industries will accept lower-quality predictions in exchange for a better understanding of the factors contributing to a prediction; this is true in the healthcare industry. A great example of complexity getting in the way is the $1 million Netflix Prize. One team reported more than 2,000 hours of work to arrive at the combination of 107 algorithms that won the first Progress Prize, and the eventual grand-prize solution beat Netflix's own algorithm by 10%. However, Netflix never implemented the full grand-prize solution because of the engineering effort needed to bring it into a production environment.
Accuracy & Precision
Most businesses do not appreciate the nuances of predictive accuracy, so it is essential for a Data Scientist to help the organization move beyond the simple notion of accuracy. Obviously, we all want to hit the proverbial target, at least directionally. As a Data Scientist, you will want to steer the conversation toward something more useful, such as whether an algorithm produces “high accuracy/low precision” or “high accuracy/high precision”. It usually helps a business audience to distinguish accuracy from precision, because the two appear close in meaning. Help them see that “accuracy” refers to the closeness of a predicted value to the actual value. For example, if a data science model predicts the weight of a package to be 19 lbs, but the actual weight of the package is 28 lbs, the model demonstrates “low accuracy”. “Precision”, on the other hand, refers to the closeness of two or more predictions to each other. For example, if a model predicts the weight of the same package to be 19 lbs on each of 5 separate iterations, then it is said to be “precise”. From a business perspective, it is critical to note that a data science model can be extremely precise yet inaccurate in its predictions.
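One way to show a business audience the difference is to compute both quantities for the package example. In the sketch below, accuracy is summarized as bias (mean prediction minus actual value) and precision as the spread (standard deviation) of repeated predictions; these summary choices, the function name, and the second set of predictions are assumptions for illustration, not the only possible definitions.

```python
from statistics import mean, pstdev


def accuracy_and_precision(predictions, actual):
    """Return (bias, spread) for repeated predictions of one actual value.

    bias:   mean prediction minus actual; closer to 0 means more accurate.
    spread: standard deviation of the predictions; closer to 0 means more precise.
    """
    bias = mean(predictions) - actual
    spread = pstdev(predictions)
    return bias, spread


actual_weight = 28  # lbs, the actual package weight from the text

# The example from the text: 19 lbs predicted on 5 separate iterations.
precise_but_inaccurate = [19, 19, 19, 19, 19]
# Hypothetical contrast: right on average, but scattered.
accurate_but_imprecise = [22, 34, 25, 31, 28]

bias_a, spread_a = accuracy_and_precision(precise_but_inaccurate, actual_weight)
bias_b, spread_b = accuracy_and_precision(accurate_but_imprecise, actual_weight)

print(bias_a, spread_a)  # -9 0.0
print(bias_b, round(spread_b, 2))
```

The first model is perfectly repeatable but off by 9 lbs every time, while the second is centered on the true weight but varies by several pounds from run to run.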
The size of a data set is measured by its number of rows and number of fields. Many organizations may not understand that, when dealing with prediction, more data generally leads to better output. However, there may be a point at which the size of the data goes beyond the typical tools and skill set of the average Data Scientist. In fact, many of the classic algorithms one might use on smaller data sets may simply vanish as options once one begins navigating bigger data waters. As a Data Scientist, it is worth investigating the limits of your skills and tools before you get in front of an executive audience; they are counting on you to be the expert, as well they should.
In almost every case, the business user will dictate one or more constraints on the problem the Data Scientist faces. Once a single dimension is fixed, the hard work begins: working out what can still be done with the other dimensions. Take, for example, a hospital that needs near real-time analytics to help physicians make clinical decisions: the speed dimension is already fixed, and trade-offs among the other four dimensions must be made. For many Data Scientists, a sense of the right balance develops over the course of a career. It is more of an art than a science, but that does not mean you should not devote significant resources to expediting your learning.
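A minimal sketch of that idea: treat the fixed dimension as a hard constraint, then pick the best remaining option on another dimension. The candidate models, their latency and accuracy figures, and the `choose` helper are entirely hypothetical and exist only to illustrate the selection logic.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    latency_ms: float    # speed: time to produce one prediction
    accuracy: float      # held-out accuracy between 0 and 1
    interpretable: bool  # rough stand-in for low analytic complexity


# Hypothetical candidates with made-up numbers.
candidates = [
    Candidate("logistic regression", 5, 0.81, True),
    Candidate("gradient boosting", 40, 0.88, False),
    Candidate("large ensemble", 900, 0.90, False),
]


def choose(candidates, max_latency_ms, require_interpretable=False):
    """Fix speed (and optionally interpretability), then maximize accuracy."""
    feasible = [
        c for c in candidates
        if c.latency_ms <= max_latency_ms
        and (c.interpretable or not require_interpretable)
    ]
    if not feasible:
        return None  # no candidate satisfies the fixed constraints
    return max(feasible, key=lambda c: c.accuracy)


# Near real-time requirement: at most 100 ms per prediction.
best = choose(candidates, max_latency_ms=100)
best_explainable = choose(candidates, max_latency_ms=100, require_interpretable=True)
print(best.name)              # gradient boosting
print(best_explainable.name)  # logistic regression
```

Under these made-up numbers, the most accurate candidate is ruled out by the speed constraint, and adding an interpretability requirement costs further accuracy. That is the kind of trade-off conversation the Data Scientist has to lead.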
Data Science in the real world always has to consider the five analytic dimensions, and Data Scientists should aim for the balance among them that best serves the business. Whether it is data complexity, analytic complexity, accuracy & precision, speed, or data size, each is important in its own right. As a Data Scientist, it is vital to understand how to guide the business analytically by keeping the five analytic dimensions in balance.
Bio: Damian R. Mingle, MBA, is a Chief Data Scientist at WPC Healthcare and an expert in healthcare analytics.