Best Practices for Version Control in Data Science Projects
Improve reproducibility and collaboration in your data science projects.
Modern programming tools have made data science projects easier than ever. Data scientists are taught to use these tools to analyze data and develop machine learning models quickly and efficiently. However, data science education rarely emphasizes software development practices, as they can feel like a different discipline. As a result, many data scientists still lack the knowledge to develop data science projects in the best way.
One concept every data science project should have is versioning. Versioning makes a project reproducible and facilitates efficient collaboration. It also provides a way to trace past results and understand the changes that happen in the project over time. The benefits are substantial, yet you would be surprised how many teams haven't employed versioning.
In this article, I will assume that you already have a basic understanding of how versioning works and that you use Git as your version control tool. We will explore several versioning best practices for data science projects.
Let’s get into it.
Versioning Best Practices for Data Science Projects
As mentioned, this article assumes you have basic versioning knowledge. You don't need to be an expert, but you should at least have Git installed in your environment. If you haven't, please follow the installation instructions on the Git website. You can also use GitHub or GitLab as the platform that hosts your repository.
With that, let's get into the best practices you should follow.
1. Organize Your Repository Properly
Working with data can be messy: we need to decide where to store our data, how to separate the code for each analysis, and where the trained models should live. It's tempting to save everything in one place and focus on the technical work, but the technical debt grows the longer you neglect to organize your work.
Organizing your project, and subsequently the repository, improves the efficiency of the work. This is essential when working in a team and when you know the project will keep evolving.
A clear structure for your project and repository helps you and your team members understand the project flow. It also keeps the project consistent as it grows.
For example, you can have a simple structure like below.
project-name/
│
├── data/
│   └── processed/
│
├── src/
│   └── main.py
│
├── notebooks/
│   └── analysis.ipynb
│
├── models/
│   └── model.pkl
│
├── README.md
│
└── requirements.txt
As long as the structure exists and you (and your team) can navigate the project, the organization should be considered a success. You can always change the structure depending on your project needs.
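If you are starting from scratch, you can scaffold a layout like this from the shell. Below is a minimal sketch; the directory names simply mirror the example structure above and can be adapted to your project.
# Create the example project skeleton (names are illustrative)
mkdir -p project-name/data/processed
mkdir -p project-name/src project-name/notebooks project-name/models
touch project-name/README.md project-name/requirements.txt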
2. Use DVC to Version Your Data
Data is the main priority in a data science project; without it, there is no project at all. Even when data exists, the project will only succeed if the data quality is good. That's why we always need to ensure our data is handled properly.
Our data can change constantly over the course of a project. This can happen because of the passage of time, a change in business direction, or any number of events. So, we must think about how to version our data properly.
However, Git is not optimized to handle data versioning, especially for large artifacts such as image datasets and model objects. Given how big data science datasets can be, storing them in Git would bloat the repository and slow it down, which is why DVC came to help.
DVC, or Data Version Control, is a tool designed specifically to version the large datasets and models found in data science projects. It complements Git rather than replacing it, and it does not work on its own. DVC tracks data versions rather than storing the data directly in the repository: it versions small metadata files while the actual data is stored elsewhere, such as in local or cloud storage.
With Git and DVC, you can quickly and efficiently switch versions of your dataset and code.
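A typical workflow looks like the sketch below. The dataset path and remote URL are hypothetical; replace them with your own.
# Initialize DVC inside an existing Git repository
dvc init

# Track a large dataset; DVC writes a small .dvc metadata file
# and adds the data file itself to a local .gitignore
dvc add data/processed/dataset.csv

# Version the metadata (not the data itself) with Git
git add data/processed/dataset.csv.dvc data/processed/.gitignore
git commit -m "Track processed dataset with DVC"

# Configure remote storage and push the actual data there
dvc remote add -d storage s3://my-bucket/dvc-store
dvc push
To move between dataset versions, check out the Git revision you want and run dvc checkout to sync the data files to match.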
3. Utilize the .gitignore File
When working on our data science project, we often perform experiments to see which model works best and store many temporary files before we evaluate them. Furthermore, we might use sensitive data and credentials during our project. Ideally, we don’t want to version the unnecessary and sensitive files, especially if we push them to the cloud repository.
Using the .gitignore file, we can tell Git which files and directories to ignore, meaning Git will not track or include them during versioning. It's an important part of the versioning system, as we want our repository to stay clean and free of files that are not required.
You can create the .gitignore file in your IDE with that exact name. Inside, you specify the files and directories for Git to ignore. For example, below are the files we want to ignore.
.env
config/secret.yaml
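Beyond secrets, a data science .gitignore usually also excludes large artifacts and generated clutter. Below is an illustrative sketch of common additional entries; which ones apply depends on your project.
# Large data and model artifacts (version these with DVC instead)
data/
models/*.pkl

# Python and notebook clutter
__pycache__/
.ipynb_checkpoints/
.venv/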
Try to use the .gitignore file properly to maintain the integrity of your repository.
4. Commit with a Meaningful Message
A commit is the Git operation that records your staged changes as a snapshot. When you commit, the tracked files and directories are saved, and a unique identifier is generated so you can return to that version later. It's a best practice to commit frequently and early to capture the project's development as it happens.
We can use the optional message parameter during the commit process to describe what changed between versions. However, many people just throw in a random message without describing the intended changes, which is a mistake.
Let me set the scene. You are in the middle of developing a machine learning API for a data science project. Every hour, you might commit new code as there are a few bugs here and there to fix. A week later, the business direction changes and you need to restart the analysis, but not from the beginning. So, you turn to the versioning system, but you can't tell which version to use because there are no clear descriptions. That is the problem you will encounter if commits happen without meaningful messages.
A meaningful message doesn't need to be long. It should be concise yet get to the point of what changed in the code or analysis. Having a clear message helps you and your team members understand what you are doing in the data science project.
An example message looks like the sample below.
git commit -m "Update README with setup instructions for the new contributors."
It’s simple, but the message is meaningful enough so people know what the changes were. You can get into a little bit more detail if necessary, but keep it concise yet clear.
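For day-to-day data science work, the same idea applies to code and analysis changes. The messages below are illustrative examples, not a required convention.
git commit -m "Add outlier removal step to the preprocessing pipeline"
git commit -m "Fix data leakage by splitting before scaling"
git commit -m "Retrain model on Q2 data and update evaluation notebook"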
5. Version the Dependencies
One thing I see many beginner data scientists neglect is maintaining an isolated environment for each data science project. All the dependencies end up installed in the base environment without any virtual environment being created. The problem is that every project requires different dependencies, and mixing them together creates dependency conflicts.
The best practice is to create an isolated environment for each data science project, free from the dependencies of other projects. The dependencies themselves also need to be tracked, as our analysis and models usually depend on specific versions of libraries and even the operating system. Without proper versioning, we run into reproducibility problems: the environment will differ when we deploy the model somewhere else or when someone else tries to reproduce our analysis.
To ensure our data science project works properly, we should consistently track the dependencies in the requirements.txt file, especially whenever the dependencies change. Don't forget to pin the version of each library to prevent conflicts when libraries update.
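Here is a minimal sketch of that workflow using Python's built-in venv and pip; the library names and versions are only illustrative.
# Create and activate an isolated environment for the project
python -m venv .venv
source .venv/bin/activate

# Install what the project needs (versions shown are examples)
pip install pandas==2.2.2 scikit-learn==1.5.0

# Pin the exact dependency versions into requirements.txt
pip freeze > requirements.txt

# Later, anyone can reproduce the environment with:
pip install -r requirements.txt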
6. Use Branches Prominently
We have spoken about environment isolation, and that is a good bridge to branching. A branch is a Git feature that creates an isolated line of development for experimentation without impacting the main codebase. Branching is often used in software development to avoid introducing bugs into the main project while developing something new.
Branching is also important in data science projects. Whenever we develop a new model or perform a fresh analysis, a branch helps us experiment effectively and version the changes without worrying about messing up our stable main project.
I use branches a lot when working with developers on the API side of my data science projects. Since there are always bugs to fix and features to try, I want to make sure my experiments do not disturb production.
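A common branching flow looks like the sketch below; the branch name is hypothetical.
# Create and switch to an experiment branch
git checkout -b experiment/new-model

# ...commit your changes on the branch as you work...

# Once the experiment is validated, merge it back into main
git checkout main
git merge experiment/new-model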
Conclusion
Versioning is the activity of tracking the changes that happen to the code in our project. It's an important concept for data science projects, as versioning enables reproducibility and helps collaboration between teams.
In this article, I have outlined several best practices for version control in data science projects:
- Organize Your Repository Properly
- Use DVC to Version Your Data
- Utilize the .gitignore File
- Commit with a Meaningful Message
- Version the Dependencies
- Use Branches Prominently
I hope it helps! Let’s continue the discussion in the comment section.
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.