Hypothesis Vetting: The Most Important Skill Every Successful Data Scientist Needs

A well-thought hypothesis sets the direction and plan for a Data Science project. Accordingly, a hypothesis is the most important item for evaluating whether a Data Science project will be successful.

By AbderRahman Sobh, Data Scientist, Machine Learning Engineer, Software Developer, Mentor

The most successful Data Science starts with good hypothesis building. A well-thought hypothesis sets the direction and plan for a Data Science project. Accordingly, a hypothesis is the most important item for evaluating whether a Data Science project will be successful.

This skill is unfortunately often neglected or taught in a hand-wavy fashion, in favor of hands-on testing for feature significance and applying models on data in order to see if they are able to predict anything. While there is certainly always a need for feature engineering and model selection, doing so without true understanding of a problem can be dangerous and inefficient.

Through experience, I’ve derived a systematic way of approaching Data Science problems which guarantees both a relevant hypothesis and a strong signal as to whether a Data Science approach will be successful. In this article, I outline the steps used to do this: defining a data context, examining available data, and forming the hypothesis. I also take a look at a few completed competitions from Kaggle and run them through the process in order to provide real examples of this method in action.


Crafting the Perfect Hypothesis

1. Mapping the Data Context

Much like forming a Free Body Diagram in a physics problem or utilizing Object-Oriented Design in Computer Science, describing all valid entities in a given data context helps to map out the expected interplay between them. The goal of this step is to completely designate all the data that could possibly be collected about anything in the context i.e. a description of the perfect dataset. If all of this data were available, then the interactions between the components would be completely defined and heuristic formulas could be used to define every type of cause and effect.

2. Examining Overlap of Available Data to Data Context

Next, an observation is conducted to determine how much of the available data fits into the perfect dataset defined in the previous step. The more overlap found, the better a solution space for defining interactions between entities. While not a numeric metric, this observation gives a strong signal for intuition of whether the available data is appropriate and relevant enough. If there is a complete overlap, then a heuristic is probably a better solution than fitting a data model. If there is very little overlap, then even the best modeling techniques will be unable to consistently provide accurate predictions. Note that the strongest predictive signals will always be those that are directly related, thus indirect feature relation is given less emphasis in favor of the real thing. The intuition behind focusing mainly on strong signals is that collecting better data with strong signals will guarantee better performance and reliability. Conversely, predictions using weak signals are prime candidates to become obsolete once better data is available.

3. Forming the Hypothesis

Having completed the previous two steps, forming a hypothesis itself becomes trivial. Hypotheses are generally formed by combining a set of available features to predict the outcome of another feature that may be hard to collect in the future or whose value may be needed ahead of its outcome.

Now that the overlap between the data context and available data has been clearly defined, and assuming a reasonable amount of the available data is relevant/class-balanced/etc, a hypothesis can simply be stated as: “The available data can produce significant results predicting ____ in the given data context”. Note that the prediction label is left open-ended since it can potentially be filled in with many different things! This is because a hypothesis can be formed around almost any of the given features as long as it fits within the data context and there is enough overlap with the available data. Picking the specific feature to predict will depend upon what use cases are most beneficial.


Example Projects

Below, I’ve carefully selected a couple different Kaggle projects to analyze using the hypothesis forming technique. These examples illustrate some characteristic setups for hypothesis forming as well as each have a given hypothesis which can be evaluated in its respective data context.

Project 1: Predicting the Winner of a PubG Match (https://www.kaggle.com/c/pubg-finish-placement-prediction)

This project was selected because the data context is simplified, due to the nature that predictions are performed for a virtual world. Predictions of simulations are expected to behave much more consistently and operate along very clearly defined paths, as opposed to real-life systems which may have factors not easily observable. In other words, the data context can very confidently be exhaustively defined. In this video game, players control a single unit which can perform a limited set of actions. The data context has been mapped out below:

Image for post

The four major entities which data can be mapped to are: The Player, the Player-Controlled Character, the Map, and the Match itself. I’ve selected these high-level entities based on the significant interactions between them as well as the uniqueness of each entity. Thus, collectively they define most if not all of the Data Context at play in this scenario.

Next, the goal is to observe how much of the data provided by the competition fits the Data Context:

Image for post

  • For Player data, primarily the information provided is about past match history. There is very little information about the players themselves.
  • There is a complete overlap i.e. an exhaustive amount of data regarding the Player-Controlled Character as well as the Match.
  • There are no features provided directly for the Map.

Using these insights, strong candidates for hypotheses can be defined. Again, this technique for generating hypotheses ignores feature correlation or indirect relation in favor of focusing on the absolute strongest signals i.e. directly related features. In this competition, predictions involving the Player-Controlled Character and the Match are captured with a good amount of detail. Contrary to this, although we might be able to infer things about a Player based on their Player-Controlled Character this is avoided since it makes for a weaker hypothesis.

Some example hypotheses are evaluated below, including the competition’s given hypothesis:

Image for post

Project 2: Predicting Poverty in Costa Rica (https://www.kaggle.com/c/costa-rican-household-poverty-prediction)

For the second example, I selected a project with real-world elements which still maintained a fairly simple context. Specifically, this competition’s hypothesis is that poverty level can be predicted using the given data. The data context surrounding this hypothesis can be defined as household costs versus incomes, as has been mapped out in the diagram below:

Image for post

The features in the given data for this competition are very granular, so I’ve broadly summarized them into categories: 53 features for House Conditions (material, architecture, quality), 6 features for Family Conditions, 23 features for Family Attributes, 32 features for Individual Attributes, 6 features for Rental Conditions, and 8 features for Location. Next, the overlap of the given data with the data context is demonstrated:

Image for post

  • Major poverty-inducing cost factors have not been collected in the data. Life changes such as a recent death in the family are not included. Similarly, outstanding debts have not been mentioned in this dataset at all.
  • Day-to-day cost factors such as grocery and luxury purchases are not included in this dataset.
  • There is only one broad feature available for describing household income i.e. the poverty level. In the diagram, I included a few indirect features that might be leveraged given the scarcity of direct features. However, these will never be as strong signals as having proper data about jobs/salaries/investments.

The biggest insight to take away from observing the overlap in data is that there are many aspects related to predicting poverty which have not been provided in the competition’s dataset. The information that is provided is mostly centered around housing and family attributes.

Some example hypotheses are evaluated below, including the competition’s given hypothesis:

Image for post


Further Reading

My goal in writing this article was to help others gain a practical understanding of the essence of Data Science at the highest level and a starting point for when to use it. Data Science undertakings should aim to truly explain the world around us, helping us fill in the blanks with numerical methods as a means of discovery.

Returning back to the powerful models and algorithms behind it all, they are a deep study of their own which help us derive even more understanding from complex data scenarios. If you are interested in knowing more about how those algorithms work, I encourage you to check out my book on Kindle: Essential Data Science: A Guide to the Applied Machine Learning Toolkit.

Bio: AbderRahman Sobh is a Data Scientist, a Machine Learning Engineer, a Software Developer, and a Mentor.

Original. Reposted with permission.