Putting the “Science” Back in Data Science
By Rubens Zimbres, Data Scientist & Machine Learning Researcher.
Lately I’ve seen a lot of hype surrounding the Data Science field, and many newcomers to it. But what exactly is the SCIENCE in Data Science? In my view, approaching a problem with the scientific method is the best way to tackle it and arrive at the best solution. If you start your data analysis by simply stating hypotheses and applying Machine Learning algorithms, you are starting in the wrong place.
The picture below shows the steps necessary for scientific research, with the corresponding data analysis and simulation. In fact, it is a sketch of what I did in my PhD thesis. In short, I studied the past 27 years of Business Management literature and tried to develop an epistemologically disruptive approach to measuring and predicting service quality, mixing Business Administration with Electrical Engineering concepts. Over the course of 4 years I performed quali-quantitative longitudinal research and developed a simulation using Agent-Based Modeling to try to find a 5-state Cellular Automaton rule that could mimic human behavior. Along the way I drew on Complexity concepts, self-organizing systems, emergence of order, and social networks.
A paper titled Dynamics of Quality Perception in a Social Network: A Cellular Automaton Based Model in Aesthetics Services was published in Elsevier's Electronic Notes in Theoretical Computer Science (2009).
One of the things I learned from the scientific method was to get rid of a priori and a posteriori bias when working on a solution to a problem. A priori bias happens when you start analyzing something with a preconceived idea. In this case, your findings will only confirm what you initially stated, because the whole research process was biased. A posteriori bias happens when you start analyzing something while already knowing what the outcome is. The whole process is then biased too, because it is steered toward that known result.
Once you get rid of your pre-conceived ideas about the problem, you find new ways to solve it. In the Data Science process this is vital, because creativity allows you to have a clear picture of the whole environment.
First, what is the business problem? What do you want to achieve? Do you want to increase profit or return on investment? Do you know exactly HOW your business adds value to the customer? What is value? Who exactly is the customer? What are the needs and perceptions of the customers? How are you going to get all these data? Is there market research you can use along with the business data?
To start building a scientific approach to your problem, first define the problem, the gap in the literature (if you are a Master's or PhD student) or the business need: WHAT is going on, WHAT you want to achieve, WHO the beneficiaries (stakeholders) of the strategy and data analysis are, WHEN you are going to start and finish, with what kind of resources and algorithms, HOW you are going to achieve your goals, and WHY.
After assessing all these variables and making a mind map, ask yourself: What kind of knowledge is involved in the problem? Suppose you are dealing with customer churn. What is making people leave your business? Of course, anyone can have an intuition about the causes, but remember that scientific articles are a far more valuable source of knowledge than random guesses.
Let’s say customers are leaving because they don’t see VALUE in the business. Value is something unique, usually brought by human resources, that cannot be copied and no competitor can offer. This leads to competitive advantage, more profits, loyalty, word-of-mouth advertising and repurchase.
Note that so far we haven’t even thought about hypotheses and algorithms. Only after knowing exactly which variables are involved in the problem do we develop hypotheses. Let’s say you suppose that profit is driven by a positive customer perception of product quality and by strong word-of-mouth advertising about your firm. This is the nomological network, where you draw correlations and causalities. In Data Science terms, you need to know what the customer perception is and also whether there is word-of-mouth advertising. Then you realize you are working with different datasets: one from Market Research and another with Social Media referrals. You also have a third dataset with the financial data of your company, which contains the profit data.
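As a minimal sketch of how such a hypothesis could be checked once the three datasets are aligned, the snippet below joins them on a shared month key and computes Pearson correlations with profit. The dataset names, the monthly layout, and every number are made-up placeholders for illustration, not real data. A correlation by itself also says nothing about the causal links in the nomological network.

```python
from math import sqrt

# Hypothetical toy data keyed by month (all values are illustrative placeholders).
market_research = {"2017-01": 3.1, "2017-02": 3.4, "2017-03": 3.9, "2017-04": 4.2}  # perceived quality, 1-5
social_media = {"2017-01": 12, "2017-02": 15, "2017-03": 21, "2017-04": 26}  # word-of-mouth referrals
financial = {"2017-01": 100.0, "2017-02": 108.0, "2017-03": 121.0, "2017-04": 133.0}  # profit


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def align(*datasets):
    """Keep only the months present in every dataset, in sorted order."""
    shared = sorted(set.intersection(*(set(d) for d in datasets)))
    return shared, [[d[month] for month in shared] for d in datasets]


months, (quality, referrals, profit) = align(market_research, social_media, financial)
r_quality = pearson(quality, profit)
r_referrals = pearson(referrals, profit)
print(f"perceived quality vs profit: r = {r_quality:.2f}")
print(f"referrals vs profit:         r = {r_referrals:.2f}")
```

Aligning on a shared key first matters here because the three sources rarely cover exactly the same periods. Any month missing from one dataset is dropped rather than silently mismatched.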
Now it’s time to choose: do you opt for a quantitative approach, using the structured data in the Market Research and Financial datasets? Social Media data is unstructured, so there you have to take a qualitative approach using Natural Language Processing. Even worse, you may want to make a longitudinal analysis, converting the data into time series to analyze with ARIMA. And profit can be predicted with Deep Neural Networks, using data from Market Research, Financial Data and word embeddings from Social Media as features!
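To make the "converting data into time series" step concrete, here is a small sketch that turns timestamped Social Media mentions into a monthly series and then takes first differences, the operation behind the "I" (integrated) part of ARIMA. The mention dates and the function names `monthly_counts` and `difference` are hypothetical. This covers only data preparation; actually fitting an ARIMA model would require a statistics library and is not shown.

```python
from collections import Counter
from datetime import date

# Hypothetical social media mentions: one date per referral (illustrative only).
mentions = [
    date(2017, 1, 5), date(2017, 1, 20),
    date(2017, 2, 3), date(2017, 2, 14), date(2017, 2, 27),
    date(2017, 4, 9),
]


def monthly_counts(events, start, end):
    """Count events per (year, month) from start to end inclusive.

    Months with no events are kept with a count of 0, so the series
    has no gaps in time.
    """
    counts = Counter((d.year, d.month) for d in events)
    series = []
    year, month = start
    while (year, month) <= end:
        series.append(((year, month), counts.get((year, month), 0)))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return series


def difference(values):
    """First difference: the change from each period to the next."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


series = monthly_counts(mentions, start=(2017, 1), end=(2017, 4))
levels = [count for _, count in series]
print("monthly mentions:", levels)
print("month-over-month change:", difference(levels))
```

Filling empty months with zeros is a deliberate choice: a time series model assumes evenly spaced observations, so a month without mentions should appear as 0 rather than disappear.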
NOW we get to the part Data Scientists enjoy: algorithms, classification, regression, deep learning, unsupervised learning, accuracy, overfitting, bias-variance trade-off, hyperparameter tuning. The fun begins!
Yes, the fun has begun, but note that in this specific case it was a long journey before we reached the algorithms. An entire phase of planning around the research problem came first. One cannot simply “apply the algorithm” and check measures of fit and overfitting. Another big issue is that once you have a complete picture of what is going on, you will usually find that some of the data you need simply does not exist.
Then comes the validation process of your algorithm. A lot is being said about the external validity of models (their power of generalization): your model performs well on the training set and test set, with little or no overfitting, but do the findings apply to new situations? Does your test set distribution replicate real-world scenarios? These questions matter, but we cannot forget other types of validation, like:
- Empirical Validation: success in comparison with reality
- Conceptual Validation: your Machine Learning model successfully translates the natural system into Mathematical language
- Internal Validation: there are no errors in your code
- Ergodicity and homoscedasticity: your model is stable across different time steps (when artificial intelligence and complex behavior are not present) and the variance of its errors is constant
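Two of the checks above lend themselves to a quick numerical sketch: the gap between training and test error (a rough signal of overfitting) and a crude look at homoscedasticity. The function names below are hypothetical, the variance ratio is an informal diagnostic rather than a formal statistical test, and the numbers are toy values chosen only to show the behavior. Empirical and conceptual validation cannot be reduced to a snippet like this.

```python
from statistics import mean, pvariance


def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return mean((a - p) ** 2 for a, p in zip(actual, predicted))


def generalization_gap(train_actual, train_pred, test_actual, test_pred):
    """Test error minus training error.

    A large positive gap suggests the model is overfitting the training set.
    """
    return (mean_squared_error(test_actual, test_pred)
            - mean_squared_error(train_actual, train_pred))


def residual_variance_ratio(actual, predicted):
    """Residual variance for the highest predictions divided by the lowest.

    Observations are sorted by predicted value and split into two halves.
    A ratio near 1 is consistent with homoscedastic errors; a ratio far
    from 1 hints that error variance changes with the prediction level.
    """
    pairs = sorted(zip(predicted, actual))
    residuals = [a - p for p, a in pairs]
    half = len(residuals) // 2
    if half < 2:
        raise ValueError("need at least 4 observations")
    low_var = pvariance(residuals[:half])
    high_var = pvariance(residuals[-half:])
    if low_var == 0:
        raise ValueError("lower-half residuals have zero variance")
    return high_var / low_var


# Toy example: errors grow with the size of the prediction.
predicted = [1, 2, 3, 4, 5, 6, 7, 8]
actual = [1.1, 1.9, 3.1, 3.9, 6.0, 5.0, 8.0, 7.0]
ratio = residual_variance_ratio(actual, predicted)
print(f"residual variance ratio: {ratio:.1f}")
```

In this toy case the ratio comes out far above 1, because the residuals for the larger predictions are roughly ten times bigger than those for the smaller ones.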
After validating your data analysis result, you will confirm or reject the hypotheses and suggest strategic moves to upper management. Note that Data Scientists need the complete involvement of Business Managers to succeed. Findings from data analysis and modeling must generate insights for strategic decisions, market positioning, product launch, brand image and so many other areas.
So, the SCIENCE in Data Science is not only about Machine Learning, Deep Learning, Natural Language Processing, A.I. algorithms and formulas. It’s not only STEM. It’s about an interdisciplinary and rigorous approach that we borrow from academia to bring above-average profits to businesses, frequently involving Psychology, Game Theory, Business Administration, Complexity, nonlinear effects and complex causality.
Bio: Rubens Zimbres is a Data Scientist and holds both a Master's and a PhD in Business Administration with a focus on Electrical Engineering. His research focuses on Machine Learning, Deep Learning and NLP.