7 Common Data Science Mistakes and How to Avoid Them
A data scientist's role in business resembles that of a detective: discovering the unknown. Along the way, however, data scientists tend to fall into some common pitfalls. Understand how these mistakes are made and how you can avoid them.
By Khushbu Shah, DeZyre.
“Mistakes are the portals of discovery.” - James Joyce (famous Irish novelist). This holds for data scientists too: making mistakes helps them discover new data trends and find more patterns in the data. That said, it is imperative to understand that data scientists have a very small margin for error. Data scientists are hired after much deliberation and at a high cost. Organizations cannot afford to overlook bad data practices and repeated mistakes by data scientists. Mistakes and bad practices in data science can cost a data scientist her or his career. It is vital for data scientists to track all their data science experiments, learn from their mistakes and avoid them in future projects.
A famous line from Sherlock Holmes captures how closely the role of a data scientist in business resembles that of a detective:
“My name is Sherlock Holmes. It is my business to know what other people don’t know.”
For a business to stay competitive, it has to do more than just Big Data Analytics. Without assessing the quality of the data it has, the kind of outcome it wants and how much profit it expects from the analysis, it becomes difficult to figure out which data science projects will be profitable and which will not. A data science mistake is acceptable once, considering that there is a learning curve, but if the same mistakes happen more than twice, they can cost the business.
Common Data Science Mistakes to Avoid
- Confusion between Correlation and Causation
Mistaking correlation for causation can be costly for any data scientist. The best-known example comes from Freakonomics: confusing correlation with causation led to a proposal in Illinois to send books to every child in the state, because analysis had revealed that the number of books available at home is directly correlated with high test scores. Further analysis showed that students from homes with many books performed better academically even if they had never read the books. This corrected the earlier assumption, with the insight that homes in which parents usually buy books tend to offer a more stimulating learning environment.
Many data scientists working with big data assume that correlation directly implies causation. Using big data to understand the correlation between two variables is often good practice, but always reasoning in terms of “cause and effect” can produce false predictions and unproductive decisions. To get the best results from big data, data scientists must understand the difference between correlation and root cause. Correlation means that X and Y tend to be observed at the same time, whereas causality means that X causes Y. These are two completely different things, yet many data scientists ignore the difference. A decision based on correlation alone might be good enough to act on without knowing the cause, but this depends entirely on the kind of data and the problem being solved.
A lesson every data scientist must learn is that correlation is not causation. If two items appear to be related to each other, it does not mean that one causes the other.
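The books-at-home example can be imitated with a small simulation. This is only a sketch under invented assumptions: the variable names, the noise levels and the "home environment" factor are hypothetical and are not taken from the Freakonomics analysis. Books and scores both depend on a hidden factor, and books have no direct effect on scores.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical hidden factor: how much the household values learning.
home_env = [rng.gauss(0, 1) for _ in range(5000)]

# Books at home and test scores both depend on the hidden factor,
# but books have NO direct effect on scores in this simulation.
books = [e + rng.gauss(0, 1) for e in home_env]
scores = [e + rng.gauss(0, 1) for e in home_env]

r_raw = pearson(books, scores)

# Subtract the hidden factor's contribution from both variables and re-check.
books_resid = [b - e for b, e in zip(books, home_env)]
scores_resid = [s - e for s, e in zip(scores, home_env)]
r_controlled = pearson(books_resid, scores_resid)

print(f"correlation(books, scores)             = {r_raw:.2f}")
print(f"after controlling for home environment = {r_controlled:.2f}")
```

The raw correlation is clearly positive, yet it falls to roughly zero once the hidden factor is accounted for. In real data the confounder is rarely observed this cleanly, so the subtraction step here is a simplification.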
- Not Choosing the Right Visualization Tools
Many data scientists concentrate on learning the technical aspects of analysis. They neglect to explore the data with different visualization techniques, which could help them derive insight much faster. The value of even the best machine learning model is diluted if a data scientist does not choose the right kind of visualization for model development, for exploratory data analysis or for presenting the results. In fact, many data scientists choose a chart type based on their aesthetic taste instead of the characteristics of their dataset. This can be avoided by defining the goal of the visualization as the first step.
Even if a data scientist develops the best possible machine learning model, it will not shout “Eureka” on its own. Effective visualization of the results is what separates merely finding a data pattern from recognizing it well enough to use it for business outcomes. As the popular saying goes, “A picture is worth a thousand words.” Data scientists therefore need not only to familiarize themselves with data visualization tools but also to understand the principles of effective data visualization, so that results are presented in a compelling way.
A crucial step towards solving any data science problem is to gain insight into what the data is about by representing it through rich visuals that can form the foundation for analysis and modelling.
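One way to see why looking at the data matters is the first two datasets of Anscombe's quartet, a well-known teaching dataset that is not part of the original article. The sketch below uses only summary statistics and a sorted listing as a crude stand-in for a scatter plot; in practice a plotting library would be used.

```python
def mean(values):
    return sum(values) / len(values)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# First two datasets of Anscombe's quartet; both share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

# The summary statistics are almost indistinguishable.
for name, y in (("y1", y1), ("y2", y2)):
    print(f"{name}: mean = {mean(y):.2f}, correlation with x = {pearson(x, y):.2f}")

# Listing the points in x order hints at what a scatter plot shows at a glance:
# y1 rises roughly linearly, while y2 rises, peaks and then falls.
for xi, a, b in sorted(zip(x, y1, y2)):
    print(f"x = {xi:>2}   y1 = {a:5.2f}   y2 = {b:5.2f}")
```

The two series have nearly the same mean and correlation, so a model or report built on those numbers alone would treat them as the same, while a plot would immediately show one straight trend and one curve.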
- Not Choosing the Right Model-Validation Frequency
Some data scientists feel that building a successful machine learning model is the highest level of success. Building the right model is only half the battle; it is also necessary to ensure that the predictive power of the model is maintained. Many data scientists forget, or tend to ignore, that they need to re-validate their models at set intervals. A common mistake is to think that a predictive model is ideal simply because it fits the observed data. The predictive power of a model can disappear quickly, depending on how often the modelled relationships change. To avoid this, the best practice for any data scientist is to score their models with new data every hour, every day or every month, depending on how fast the relationships in the model change.
The predictive power of models often decays due to several factors, so data scientists constantly need to ensure that it does not drop below the acceptable level. There could be instances where data scientists have to re-build the model. It is always better to build several models and interpret the distributions of variables than to consider a single model as the best one.
To retain the predictive power and validity of the built models, selecting the right re-validation frequency is very important; failing to do so can lead to false results.
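Periodic re-validation can be sketched as a loop that scores the model on each new batch of labelled data and flags the periods in which accuracy falls below an acceptable level. Everything here is a hypothetical illustration: the toy model, the drifting cutoff, the monthly cadence and the 0.85 threshold are assumptions, not values from the article.

```python
def accuracy(model, batch):
    """Share of (features, label) pairs that the model predicts correctly."""
    hits = sum(1 for features, label in batch if model(features) == label)
    return hits / len(batch)

def revalidate(model, batches, min_accuracy):
    """Score the model on each new batch and return the periods below the threshold."""
    flagged = []
    for period, batch in batches:
        score = accuracy(model, batch)
        print(f"{period}: accuracy = {score:.2f}")
        if score < min_accuracy:
            flagged.append(period)
    return flagged

# Hypothetical model, fitted when spending above 50 implied a purchase.
def model(spend):
    return spend > 50

def make_batch(true_cutoff):
    """Labelled data for a period in which the real cutoff is `true_cutoff`."""
    return [(spend, spend > true_cutoff) for spend in range(0, 100, 5)]

# The real relationship drifts from one period to the next.
batches = [
    ("month 1", make_batch(50)),
    ("month 2", make_batch(60)),
    ("month 3", make_batch(80)),
]

needs_rebuild = revalidate(model, batches, min_accuracy=0.85)
print("re-validate or re-build after:", needs_rebuild)
```

The model is perfect on the data it was fitted to, still acceptable in the second month and below the threshold by the third. How often to run such a check depends on how fast the underlying relationships change.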
- Analysis without a Question/Plan
“One of the highest uses of data science is to design experiments, posing the right question and collecting the right datasets, and doing it all up to scientific standards. Then you gather the results and interpret it,” said Michael Walker, President of the Data Science Association.
Data science is a structured process that begins with well-defined objectives and questions, followed by a few hypotheses to fulfill the objective. Data scientists often jump into the data without thinking about the questions they need to answer through analysis. It is essential for any data science project to have a clear project goal and a well-defined model goal. Data scientists who do not know what they want end up with analysis results that they do not want.
Most data science projects end up answering the “what” kind of questions because data scientists do not follow the ideal path of starting the analysis with the questions at hand. Data science is all about answering the “why” kind of questions using big data. Data scientists should analyse a given dataset with the aim of answering questions that have not been answered before, for example by merging datasets that have never been merged. To avoid aimless analysis, data scientists should focus on getting their results right by defining the design, variables and data accurately, and by clearly understanding what they want to learn from the analysis. This eases the process of answering business questions through statistical methods whose assumptions are met. As the popular quote attributed to Voltaire goes, “Judge a man by his questions rather than by his answers.” Having well-defined questions beforehand is extremely important for achieving the data science goals of any organization.
- Paying Attention Only to Data
According to Kirk Borne, principal data scientist at Booz Allen Hamilton, “People forget that there really are ethical issues about the use of data, protections of data, and even statistical concerns such as [thinking] correlation is causation. People forget that if you crunch data long enough, it will say anything. If you have a very large collection you’re going to find correlations. People think now if they have big data they can believe anything they see.”
Data scientists often get excited about having data from multiple sources and start creating charts and visuals to report analysis without developing the required business acumen. This can be dangerous for any organization. Data scientists often give too much decision-making power to the data and too little importance to developing the business acumen needed to understand how the analysis can benefit an organization. Data scientists should not merely let the data speak and thrust aside the wisdom they have. Data should influence decision making but should not be the final voice in any data science project. Hiring data scientists who combine domain knowledge with technical expertise is an ideal way for organizations to avoid such mistakes.
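Borne's remark that a very large collection will always contain correlations can be illustrated with pure noise. In this sketch every column is random and unrelated to the target by construction; the sizes (30 rows, 200 columns) and the 0.3 cutoff are arbitrary assumptions chosen for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)  # fixed seed so the run is repeatable
n_rows, n_columns = 30, 200

# A random "business metric" and 200 random, unrelated candidate columns.
target = [rng.gauss(0, 1) for _ in range(n_rows)]
columns = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_columns)]

correlations = [abs(pearson(column, target)) for column in columns]
strongest = max(correlations)
n_notable = sum(1 for r in correlations if r > 0.3)

print(f"strongest |correlation| among pure-noise columns: {strongest:.2f}")
print(f"columns with |correlation| above 0.3: {n_notable} of {n_columns}")
```

Some of the noise columns look meaningfully related to the target purely by chance. Without domain knowledge to judge whether a relationship is plausible, such findings are easy to mistake for insight.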
- Ignoring the Probabilities
Data scientists often ignore the range of possible outcomes of a solution, which leads to wrong decisions more often than necessary. A common mistake is to claim that if the business takes action X, it will definitely achieve goal Y. There is no single right answer to a specific problem, so data scientists have to make informed choices among various possibilities. More than one possibility always exists for a particular question, and each carries some level of uncertainty. Scenario planning and probability theory are two essential aspects of data science that should not be ignored if the decisions made are to be correct more often.
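A minimal scenario-planning sketch looks like the following. The scenario names, probabilities and profit figures are invented for illustration. Instead of promising a single outcome for an action, it reports the expected result together with the chance of a loss.

```python
# Hypothetical scenarios for one business action: (name, probability, profit).
scenarios = [
    ("strong uptake", 0.25, 120_000),
    ("moderate uptake", 0.50, 40_000),
    ("campaign fails", 0.25, -60_000),
]

# The scenarios are meant to cover every outcome, so probabilities must sum to 1.
total_probability = sum(p for _, p, _ in scenarios)
if abs(total_probability - 1.0) > 1e-9:
    raise ValueError("scenario probabilities must sum to 1")

# Probability-weighted average outcome, and the chance of losing money.
expected_profit = sum(p * profit for _, p, profit in scenarios)
loss_probability = sum(p for _, p, profit in scenarios if profit < 0)

print(f"expected profit: {expected_profit:,.0f}")
print(f"probability of a loss: {loss_probability:.0%}")
```

With these made-up numbers the expected profit is 35,000, yet one run in four still loses money, which is the kind of uncertainty a flat "X will achieve Y" statement hides.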
- Building a Model on the Wrong Population
If the goal of a data science project is to model customer influence patterns, then considering only the behavioural data of highly influential customers is not good practice. The model should be built on the behavioural data of customers who are highly influential and also of those who are less influential but likely to be influenced. Underestimating the predictive power of either group in the population can skew the model, and some important variables might fall into the under-represented segment.
These are some of the common mistakes data scientists make. If you can think of any others, we would love to hear your thoughts in the comments below.
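The effect of building on the wrong population can be shown with a simulated customer base. The influence score, the referral relationship and the 0.8 cutoff below are hypothetical assumptions. The sketch compares an estimate taken only from highly influential customers with one taken from a random sample of the same size.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population of (influence_score, monthly_referrals) pairs.
# Referrals rise with influence, so the segments behave differently.
population = []
for _ in range(10_000):
    influence = rng.random()  # 0 = not influential, 1 = highly influential
    referrals = 10 * influence + rng.gauss(0, 1)
    population.append((influence, referrals))

def mean_referrals(customers):
    """Average monthly referrals across a group of customers."""
    return sum(referrals for _, referrals in customers) / len(customers)

# Wrong population: only the highly influential customers.
highly_influential = [c for c in population if c[0] > 0.8]

# Better: a random sample of the same size drawn from everyone.
representative = rng.sample(population, len(highly_influential))

print(f"whole population:        {mean_referrals(population):.2f}")
print(f"highly influential only: {mean_referrals(highly_influential):.2f}")
print(f"random sample:           {mean_referrals(representative):.2f}")
```

The estimate from the influential segment alone is far above the population value, while the random sample stays close to it. A model fitted only on that segment would carry the same distortion into its predictions for everyone else.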