Stop Hard Coding in a Data Science Project – Use Config Files Instead

How to efficiently interact with config files in Python.

By Khuyen Tran, Data Science Intern on June 23, 2023 in Python

Problem

In your data science project, certain values tend to change frequently, such as file names, selected features, train-test split ratio, and hyperparameters for your model.

Stop Hard Coding in a Data Science Project - Use Config Files Instead

It is okay to hard-code these values when writing ad-hoc code for hypothesis testing or demonstration purposes. However, as your code base and team expand, it becomes essential to avoid hard-coding because it can give rise to various issues:

Maintainability: If values are scattered throughout the codebase, updating them consistently becomes harder. This can lead to errors or inconsistencies when values must be updated.

Reusability: Hardcoding values limits the reusability of code for different scenarios.

Security concerns: Hardcoding sensitive information like passwords or API keys directly into the code can be a security risk. If the code is shared or exposed, it could lead to unauthorized access or data breaches.

Testing and debugging: Hardcoded values can make testing and debugging more challenging. If values are hard-wired into the code, it becomes difficult to simulate different scenarios or test edge cases effectively.

Solution – Configuration Files

Configuration files solve these problems by offering the following benefits:

Separation of configuration from code: A config file allows you to store parameters separately from the code, which improves code maintainability and readability.

Flexibility and modifiability: With a config file, you can easily modify project configurations without modifying the code itself. This flexibility allows for quick experimentation, parameter tuning, and adapting the project to different scenarios or environments.

Version control: Storing the config file in version control allows you to track changes to the configuration over time. This helps maintain a historical record of the project’s configurations and facilitates collaboration among team members.

Deployment and productionization: When deploying a data science project to a production environment, a config file enables easy customization of settings specific to the production environment without the need for code modifications. This separation of configuration from code simplifies the deployment process.

Introduction to Hydra

Among the numerous Python libraries available for creating configuration files, Hydra stands out as my preferred configuration management tool because of its impressive set of features, including:

Convenient parameter access
Command-line configuration override
Composition of configurations from multiple sources
Execution of multiple jobs with different configurations

Let’s dig deeper into each of these features.

Feel free to play and fork the source code of this article here:

View on GitHub

Convenient parameter access

Suppose all configuration files are stored under the conf folder and all Python scripts are stored under the src folder.

.
├── conf/
│   └── main.yaml
└── src/
    ├── __init__.py
    ├── process.py
    └── train_model.py

And the main.yaml file looks like this:

Accessing a configuration file within a Python script is as simple as applying a single decorator to your Python function.

To access a specific parameter from the configuration file, we can use the dot notation (.e.g., config.process.cols_to_drop), which is a cleaner and more intuitive way compared to using brackets (e.g., config['process']['cols_to_drop']).

This straightforward approach allows you to effortlessly retrieve the desired parameters.

Command-line configuration override

Let’s say you are experimenting with different test_size. It is time-consuming to repeatedly open your configuration file and modify the test_size value.

Luckily, Hydra makes it easy to directly overwrite configuration from the command line. This flexibility allows for quick adjustments and fine-tuning without modifying the underlying configuration files.

Composition of configurations from multiple sources

Imagine you want to experiment with various combinations of data processing methods and model hyperparameters. While you could manually edit the configuration file each time you run a new experiment, this approach can be time-consuming.

Hydra enables the composition of configurations from multiple sources with config groups. To create a config group for data processing, create a directory called process to hold a file for each processing method:

.
└── conf/
    ├── process/
    │   ├── process1.yaml
    │   └── process2.yaml
    └── main.yaml

If you want to use the process1.yaml file by default, add it to Hydra’s default list.

Follow the same procedures to create a config group for training hyperparameters:

.
└── conf/
    ├── process/
    │   ├── process1.yaml
    │   └── process2.yaml
    ├── train/
    │   ├── train1.yaml
    │   └── train2.yaml
    └── main.yaml

Set train1 as the default config file:

Now running the application will use the parameters in process1.yaml file and model1.yaml file by default:

This capability is particularly useful when different configuration files need to be combined seamlessly.

Multi-run

Suppose you want to conduct experiments with multiple processing methods, applying each configuration one by one can be a time-consuming task.

Luckily, Hydra allows you to run the same application with different configurations simultaneously.

This approach streamlines the process of running an application with various parameters, ultimately saving valuable time and effort.

Conclusion

Congratulations! You have just learned about the importance of using configuration files and how to create ones using Hydra. I hope this article will give you the knowledge needed to create your own configuration files.

Khuyen Tran is a prolific data science writer, and has written an impressive collection of useful data science topics along with code and articles. Khuyne is currently looking for a machine learning engineer role, a data scientist role, or a developer advocate role in Bay Area after May 2022, so please reach out if you are looking for someone with her set of skills.

Original. Reposted with permission.