Design of Experiments in Data Science

Read this overview of the process of designing experiments for collecting data.

By Benjamin Obi Tayo, Ph.D., KDnuggets on September 3, 2020 in Design of Experiments, Experimentation

I. Introduction

Data plays a central role in data science and machine learning. Most often, we assume that the data to be used for analysis or model building is readily available and free. Sometimes, however, we do not have the data, and collecting the full dataset is either impossible or would take too long. In this case, we need a way to collect the best possible subset of data quickly and efficiently. The process of designing an experiment for collecting data is called the design of experiments. Examples include surveys and clinical trials.

In this article, we will discuss 4 main factors to keep in mind when designing and executing experiments for data collection.

II. Factors to keep in mind when designing experiments for data collection

In this section, we discuss 4 main factors to consider when designing experiments for data collection.

1. Time

We need to make sure the experiment can be designed and implemented within a reasonable period of time. For example, suppose the customer service department of an organization is experiencing exponential growth in the number of calls. The organization can design surveys in which employees and customers participate. This has to be done promptly so that the data collected can be analyzed and used for data-driven decisions that improve the customer experience. If the design of the experiment and the analysis of the data collected are not carried out in a timely manner, sales and profits could suffer.

2. Quantity of Data

In designing experiments, we need to make sure the data collected will be sufficient to answer the questions we are asking. The sample should be small relative to the population, since collecting data from the entire population would take too long, but it must still be representative of the whole population. For example, an experiment designed to study the efficacy of a medication should be demographically representative (it should include different age groups, genders, ethnicities, etc.).
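As a rough illustration of "sufficient but small", the sketch below estimates how many responses a survey would need in order to estimate a proportion within a chosen margin of error, and then splits that total across demographic groups in proportion to their size. It uses Cochran's sample-size formula with a finite population correction, which is one common choice and not something the article prescribes. The function names, the age groups, and the population counts are assumptions made up for this example.

```python
import math
from statistics import NormalDist


def sample_size(margin_of_error, confidence=0.95, proportion=0.5, population=None):
    """Sample size for estimating a proportion (Cochran's formula).

    proportion=0.5 is the most conservative guess; population, if given,
    applies the finite population correction.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided z value
    n = z**2 * proportion * (1 - proportion) / margin_of_error**2
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


def proportional_allocation(total, strata_sizes):
    """Split `total` across strata in proportion to their population size.

    Uses the largest-remainder method so the allocations sum to `total`.
    """
    population = sum(strata_sizes.values())
    exact = {name: total * size / population for name, size in strata_sizes.items()}
    allocation = {name: math.floor(value) for name, value in exact.items()}
    leftover = total - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - allocation[name], reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical customer base of 10,000 people split into three age groups.
strata = {"18-34": 3000, "35-54": 4600, "55+": 2400}
n = sample_size(margin_of_error=0.05, population=sum(strata.values()))
plan = proportional_allocation(n, strata)
print(n, plan)
```

Proportional allocation is only one way to pursue representativeness. If a small group is of special interest, you may choose to oversample it and reweight the results afterwards.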

3. Determine Important Factors

In designing experiments for data collection, you need to decide what your independent variables (predictors) are and what your dependent variable (target) is. For example, if the goal of the experiment is to collect data that would enable you to estimate house prices in a given neighborhood, the target is the house price, and you may decide to predict it from predictors or features such as the number of bedrooms, the number of bathrooms, square footage, zip code, school district, year built, HOA, etc. It is important to identify which features you expect to matter most and which ones you need to control.
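One way to make this decision concrete before any data is collected is to write the planned variables down as a small schema and check every collected record against it. The sketch below does this for the house-price example. The column names, the helper function, and the two sample records are all invented for illustration and are not real data.

```python
# Hypothetical variable names for a house-price study; adjust to your design.
PREDICTORS = ["bedrooms", "bathrooms", "square_feet", "zip_code", "year_built"]
TARGET = "price"


def split_features_target(records, predictors=PREDICTORS, target=TARGET):
    """Split collected records into predictor rows (X) and target values (y).

    Raises ValueError if a record lacks a planned variable, which signals
    that the collection form does not match the experiment design.
    """
    X, y = [], []
    for i, record in enumerate(records):
        missing = [name for name in (*predictors, target) if name not in record]
        if missing:
            raise ValueError(f"record {i} is missing {missing}")
        X.append([record[name] for name in predictors])
        y.append(record[target])
    return X, y


# Made-up records, purely for illustration.
records = [
    {"bedrooms": 3, "bathrooms": 2, "square_feet": 1500,
     "zip_code": "00001", "year_built": 1998, "price": 210000},
    {"bedrooms": 4, "bathrooms": 3, "square_feet": 2200,
     "zip_code": "00002", "year_built": 2010, "price": 320000},
]

X, y = split_features_target(records)
print(X)
print(y)
```

Checking records as they arrive catches a missing variable while it is still cheap to fix the survey or form, rather than after the collection budget has been spent.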

4. Cost

Designing an experiment for collecting data can be very costly, and executing it adds further cost. For example, participants in a survey may be compensated as an incentive to encourage participation. Before designing an experiment, it is important to estimate what executing it would cost and whether the benefits of the experiment outweigh that cost. For instance, if the results of the survey can improve customer experience and increase profits, then the investment would be worthwhile.
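The comparison described above is simple arithmetic, and a back-of-the-envelope version might look like the sketch below. The cost model (a fixed design cost plus a per-participant incentive) and all of the dollar amounts are assumptions chosen for illustration. A real estimate would also include staff time, tooling, and analysis.

```python
def experiment_cost(fixed_cost, participants, incentive_per_participant):
    """Total cost: one-off design cost plus incentives paid to participants."""
    return fixed_cost + participants * incentive_per_participant


def net_benefit(expected_benefit, fixed_cost, participants, incentive_per_participant):
    """Expected benefit minus total cost; a positive value favors running it."""
    return expected_benefit - experiment_cost(
        fixed_cost, participants, incentive_per_participant
    )


# Hypothetical figures: $5,000 to design the survey, 370 participants,
# a $10 incentive each, and an expected $20,000 gain from acting on the results.
cost = experiment_cost(fixed_cost=5000, participants=370, incentive_per_participant=10)
gain = net_benefit(20000, fixed_cost=5000, participants=370, incentive_per_participant=10)
print(cost, gain)
```

Because the expected benefit is itself an estimate, it is worth repeating the calculation with pessimistic and optimistic values to see whether the conclusion changes.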

III. Summary

In summary, we’ve discussed four factors to consider when designing an experiment for data collection: time, quantity of data, important factors, and cost. The key goal is to design a way to collect the best subset of data quickly and efficiently.

For questions and inquiries, please email me: benjaminobi@gmail.com

Bio: Benjamin Obi Tayo, Ph.D., is a physics professor at the University of Central Oklahoma as well as a data science educator and writer with interests in data science, machine learning, AI, Python and R, predictive analytics, materials science, and biophysics.

Original. Reposted with permission.
