How to Design Experiments for Data Collection

Several factors must be taken into consideration when designing experiments for data collection.



Key Takeaways


  • Designing experiments for data collection is important when the data required for analysis isn’t available.
  • The key goal is to design a way to collect the best subset of data quickly and efficiently.


Data plays a central role in data science and machine learning. We often assume that the data needed for analysis or model building is readily available and free. Sometimes, however, we do not have the data, and collecting the full dataset is either impossible or would take too long. In that case, we need to design a way to collect the best subset of data that we can, quickly and efficiently. The process of planning how this data will be collected is called design of experiments. Surveys and clinical trials are two common settings where it is applied.

We now discuss four main factors to keep in mind when designing and executing experiments for data collection.


Factors For Designing Experiments



Time

We need to make sure the experiment can be designed and implemented within a reasonable period of time. For example, suppose the customer service department of an organization is experiencing rapid growth in the number of calls, along with long call center wait times. The organization can design surveys for employees and customers to complete. This has to be done promptly so that the data collected can be analyzed and used for data-driven decisions that improve the customer service experience. If the design of the experiment and the analysis of the collected data are not completed in a timely manner, sales and profits could suffer.


Quantity of Data

In designing experiments, we need to make sure the data collected will be sufficient to answer the questions we are asking. At the same time, the amount of data collected (the sample) is usually small compared to the total data that could be collected (the population), because collecting everything would take too long. The sample must therefore be representative of the whole population. For example, an experiment designed to study the efficacy of a medication should be demographically representative, including participants of different age groups, genders, ethnicities, and so on.
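
As a rough illustration of how "sufficient" and "representative" might be quantified for a survey, the sketch below uses the standard sample-size formula for estimating a proportion, with an optional finite population correction, and then splits the sample across demographic groups in proportion to their size. The function names, the 95% confidence level, the 5% margin of error, and the age-group counts are all assumptions made for illustration; they do not come from any real study.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin_of_error, confidence=0.95,
                               expected_proportion=0.5, population_size=None):
    """Estimate how many responses are needed to estimate a proportion.

    Uses n0 = z^2 * p * (1 - p) / e^2, optionally adjusted with the
    finite population correction n = n0 / (1 + (n0 - 1) / N).
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = expected_proportion  # 0.5 is the most conservative choice
    n = z ** 2 * p * (1 - p) / margin_of_error ** 2
    if population_size is not None:
        n = n / (1 + (n - 1) / population_size)
    return math.ceil(n)


def proportional_allocation(total_sample, strata_sizes):
    """Split a sample across strata in proportion to each stratum's size."""
    population = sum(strata_sizes.values())
    raw = {name: total_sample * size / population
           for name, size in strata_sizes.items()}
    allocation = {name: math.floor(value) for name, value in raw.items()}
    # Hand out any remaining units to the strata with the largest remainders.
    leftover = total_sample - sum(allocation.values())
    by_remainder = sorted(raw, key=lambda k: raw[k] - allocation[k], reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical population of 10,000 customers in three age groups.
age_groups = {"18-34": 3000, "35-54": 4500, "55+": 2500}
n_large = sample_size_for_proportion(0.05)
n_finite = sample_size_for_proportion(0.05, population_size=10_000)
allocation = proportional_allocation(n_finite, age_groups)
print(n_large, n_finite, allocation)
```

Under these assumed inputs, roughly 385 responses would be needed for a very large population and roughly 370 for a population of 10,000. Proportional allocation is only one way to pursue representativeness; other sampling schemes may suit a given study better.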


Determine Important Features

In designing experiments for data collection, you need to decide what your dependent (target) variable is and what your independent (predictor) variables are. For example, if the goal of the experiment is to collect data that would enable you to estimate house prices in a given neighborhood, the house price is the target, and you may decide to predict it from features such as the number of bedrooms, the number of bathrooms, square footage, zip code, school district, year built, homeowners association (HOA) fee, etc. It is important to identify which features matter most, and which ones you need to control for, before collecting any data.
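
One way to make this decision concrete is to write down the fields to be collected before any data is gathered, marking which one is the target and which are predictors. The following is a minimal sketch of that idea for the house price example; the class name, field names, and the two sample records are hypothetical and invented purely for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class HouseRecord:
    """One collected observation (hypothetical schema for the house example)."""
    bedrooms: int
    bathrooms: float
    square_footage: float
    zip_code: str
    school_district: str
    year_built: int
    hoa_fee: float
    sale_price: float  # the target (dependent variable)


TARGET = "sale_price"
# Every field other than the target is treated as a predictor (feature).
PREDICTORS = [f.name for f in fields(HouseRecord) if f.name != TARGET]


def split_features_and_target(records):
    """Return (X, y): predictor rows and the matching target values."""
    X = [[getattr(r, name) for name in PREDICTORS] for r in records]
    y = [getattr(r, TARGET) for r in records]
    return X, y


# Made-up records, not real sales data.
records = [
    HouseRecord(3, 2.0, 1500.0, "00001", "District A", 1998, 50.0, 250000.0),
    HouseRecord(4, 2.5, 2200.0, "00002", "District B", 2010, 0.0, 340000.0),
]
X, y = split_features_and_target(records)
print(PREDICTORS)
print(X[0], y[0])
```

Fixing the schema up front makes it harder to discover, after the experiment has run, that a feature needed for the analysis was never recorded.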



Cost

Designing an experiment for collecting data can be very costly, and executing it adds further cost. For example, survey participants may be compensated as an incentive to encourage participation, and the data scientists and data analysts who analyze the collected data also have to be paid. Before designing an experiment, it is important to estimate what it would cost to execute and whether the expected benefits outweigh that cost. For instance, if the results from the survey can improve customer experience and increase profits, then the investment would be worthwhile.
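
The cost-versus-benefit comparison described above amounts to simple arithmetic, which can be sketched as follows. Every number here (participant count, incentive, analyst hours, hourly rate, design cost, and expected benefit) is a hypothetical placeholder, and the function name is an assumption made for this example; a real estimate would use figures from the organization's own budget.

```python
def estimate_experiment_cost(participants, incentive_per_participant,
                             analyst_hours, analyst_hourly_rate,
                             design_cost=0.0):
    """Add up the main cost components of running a survey-style experiment."""
    incentives = participants * incentive_per_participant
    analysis = analyst_hours * analyst_hourly_rate
    return design_cost + incentives + analysis


# Hypothetical figures for illustration only.
total_cost = estimate_experiment_cost(
    participants=370,
    incentive_per_participant=10.0,
    analyst_hours=80,
    analyst_hourly_rate=60.0,
    design_cost=2000.0,
)
expected_benefit = 25000.0  # assumed value of the improvements the survey enables

net_benefit = expected_benefit - total_cost
worthwhile = net_benefit > 0
print(total_cost, net_benefit, worthwhile)
```

In practice the expected benefit is the hardest figure to pin down, so it can help to repeat the calculation with pessimistic and optimistic values before committing to the experiment.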

In summary, we’ve discussed several factors that have to be considered when designing an experiment for data collection. The key goal is to design a way to collect the best subset of data quickly and efficiently.

Benjamin O. Tayo is a Physicist, Data Science Educator, and Writer, as well as the Owner of DataScienceHub. Previously, Benjamin taught Engineering and Physics at U. of Central Oklahoma, Grand Canyon U., and Pittsburg State U.