Mind of a Data Scientist – Part 2
The first part of this series covered the formulation of the business problem and the engineering of data points. This second part covers exploratory data analysis and feature engineering.
In our case, the basic unit connected with the business problem is the individual battery. If our business problem were to predict plant sites that could potentially fail, then our basic unit would be each plant site. The second term, the aggregating metric, is an aggregated measure of a variable associated with the basic unit under consideration. In our case it would be some aggregation of the conductance of each battery. Again, the type of aggregation metric depends on the business problem. So let us take a step back to the problem we set out for ourselves. We were concerned with identifying the batteries that had a falling trend. The more pronounced the falling trend, the more likely the battery is failing. So when we think about an aggregating metric, we should think about one that accentuates the spread of the data. A very handy metric for the spread of data is the standard deviation. So if we aggregate the values of each battery by taking the standard deviation of its conductance, we have a very effective way to identify the set of batteries we want. The same is represented in the plot below.
The above figure plots the batteries along the x axis and the standard deviation of conductance along the y axis. Using our aggregating metric, we clearly have two groups of batteries: one with a standard deviation of less than 100 and the other with more than 300. The second group, i.e. batteries A and C, whose standard deviation is way above the rest, are potentially the cases we are looking for. Let us also plot the actual conductance values of these batteries over time to corroborate our hypothesis.
We can clearly see from the above plot that batteries A and C show a dropping trend, which was indicated by the high standard deviation for these batteries. So taking an aggregating metric like this helps us zero in on the cases we want to dig into further.
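To make the aggregation step concrete, here is a minimal sketch in plain Python. The readings are made-up numbers, not the data behind the plots, and the function names `conductance_spread` and `flag_batteries` are hypothetical. The threshold of 300 is borrowed from the gap between the two groups described above; in practice it would be read off the plot.

```python
from statistics import stdev

# Hypothetical conductance readings over time, one list per battery.
# These values are invented for illustration only.
readings = {
    "A": [1500, 1400, 1200, 900, 600],
    "B": [1500, 1490, 1510, 1495, 1505],
    "C": [1450, 1300, 1100, 800, 500],
    "D": [1480, 1470, 1490, 1475, 1485],
}


def conductance_spread(readings):
    """Return the sample standard deviation of conductance per battery."""
    return {battery: stdev(values) for battery, values in readings.items()}


def flag_batteries(spread, threshold):
    """Return the batteries whose spread exceeds the threshold, sorted by name."""
    return sorted(b for b, s in spread.items() if s > threshold)


spread = conductance_spread(readings)
suspects = flag_batteries(spread, threshold=300)
print(suspects)  # ['A', 'C']
```

Note that a high standard deviation only says the values vary a lot. It does not say in which direction, which is why plotting the actual values over time is still needed to confirm a falling trend.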
Now that we have identified the set of batteries that could potentially be problematic, the next step is to dive deep into those cases and try to identify other indicators associated with falling conductance. We need to look closely at some pictorial representation of the data and then ask further questions:
- Is there any period of time when such trends are happening ?
- Are there any specific patterns which we can unearth before the falling trend in conductance ?
- Is there anything special about the slope of the curve which shows a falling trend ? … etc
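The questions about the period and the slope of a falling trend can be explored numerically as well as visually. The sketch below is one possible approach, not a method prescribed by this series: it fits a least-squares line to a sliding window of readings and reports the window with the steepest fall. The series, the window size and the function names `trend_slope` and `steepest_drop` are all assumptions made for illustration.

```python
def trend_slope(values):
    """Least-squares slope of the values against their index (units per reading)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two readings to fit a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


def steepest_drop(values, window):
    """Return (start_index, slope) of the window with the most negative slope."""
    if window < 2 or window > len(values):
        raise ValueError("window must be between 2 and the number of readings")
    candidates = [
        (start, trend_slope(values[start:start + window]))
        for start in range(len(values) - window + 1)
    ]
    return min(candidates, key=lambda item: item[1])


# Hypothetical conductance series for one battery: flat, then a sharp fall.
series = [1500, 1495, 1505, 1500, 1300, 1050, 900, 890]

overall = trend_slope(series)
start, slope = steepest_drop(series, window=3)
print(f"overall slope: {overall:.1f}, steepest window starts at index {start}")
```

The overall slope summarises the whole history, while the sliding window points to when the fall is happening, which is the period worth examining more closely.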
We need to look at all discernible patterns within that variable and build our intuitions on them. Once we have built our intuitions on one variable, it is time to move further and bring in other variables. We can bring in variables like voltage, current, temperature etc. and see how they behave with respect to the specific trends we saw when we analysed only one variable (conductance). Some of the trends we can look at are the following:
- How have voltage, current or temperature behaved during the period when we saw a drop in conductance ?
- Are there any specific trends for these variables before we saw the falling trend in conductance ?
- How have these variables behaved after the fall in conductance values ?
- Are there any prospects for more variables other than the ones we have ? … etc
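One simple way to put the before, during and after questions to the data is to average each additional variable over those three phases. The sketch below assumes the drop window has already been identified; the indices `drop_start` and `drop_end`, the sample values and the function name `phase_means` are hypothetical.

```python
from statistics import mean


def phase_means(series, drop_start, drop_end):
    """Average a variable before, during and after the conductance drop.

    drop_start is the first index of the drop (inclusive) and drop_end is
    the first index after it (exclusive). An empty phase yields None.
    """
    phases = {
        "before": series[:drop_start],
        "during": series[drop_start:drop_end],
        "after": series[drop_end:],
    }
    return {name: (mean(vals) if vals else None) for name, vals in phases.items()}


# Hypothetical readings aligned with the conductance series, one per time step.
other_variables = {
    "temperature": [25, 26, 25, 24, 31, 33, 35, 30, 29, 28],
    "voltage": [12.6, 12.5, 12.6, 12.5, 12.1, 12.0, 11.9, 12.2, 12.3, 12.4],
}

# Assumed drop window: conductance fell between index 4 and index 6.
summary = {
    name: phase_means(values, drop_start=4, drop_end=7)
    for name, values in other_variables.items()
}
for name, phases in summary.items():
    print(name, phases)
```

A clear shift between the "before" and "during" averages for a variable is a hint, not proof, that it is associated with the drop and deserves a closer look.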
These are the kinds of questions we have to ask to unearth the various relationships that exist between the variables in our data set. Asking all these questions and slicing and dicing each of the variables helps us achieve the following:
- Helps in determining the relative importance of variables
- Provides a rough idea about relationships between variables
- Gives insights into any variables that need to be derived from the existing variables
- Gives us intuitions on any new variables which need to be brought in
All insights we unearth by asking such questions will help us immensely when we get into the downstream modelling activities.
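As a rough first pass at the relationships between variables, we could compute the Pearson correlation of each candidate variable with conductance and rank them by strength. This is only a sketch under stated assumptions: the data is invented, the function names `pearson` and `rank_by_association` are hypothetical, and correlation captures linear association only, so it is a starting point for intuition rather than a measure of importance.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


def rank_by_association(target, candidates):
    """Rank candidate variables by absolute correlation with the target."""
    scores = {name: pearson(target, values) for name, values in candidates.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical aligned readings for a single battery.
conductance = [1500, 1400, 1200, 900, 600]
candidates = {
    "voltage": [12.6, 12.5, 12.6, 12.4, 12.5],
    "temperature": [25, 27, 30, 34, 38],
}

ranking = rank_by_association(conductance, candidates)
for name, r in ranking:
    print(f"{name}: r = {r:.2f}")
```

In this invented data, temperature rises steadily as conductance falls, so it ranks first with a strongly negative correlation.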
Now that we have seen the business perspective of the data discovery phase, let us summarise the main steps in the process:
- Identify a variable which potentially gives an indication of the problem we are trying to solve
- Derive some aggregation metric for the identified variable to help us split the basic unit related to our problem
- Dive deep into the cases we have earmarked and look for trends with respect to the variable we are looking at
- Introduce other variables and look for associations between the newly introduced variables and the trends we saw in the first variable
- Look for relationships between variables which give clues to the problem statement
- Build intuitions on any new variable that can be introduced to help in solving the problem
The above are a set of broad guidelines on how we can structure our thought process for the business perspective of the data discovery phase. In the next post we will deal with the statistical perspective of data discovery and how we can connect the dots between the two perspectives to give us intuitions for feature engineering and modelling. Watch this space for more.
Original post. Reposted with permission.
Bio: Thomas Joseph is head of Data Science at Quadrant 4 System Corporation, involved in building the Data Science practice and engaging with customers in solving their business problems leveraging tools and methodologies in Data Science.