Gold BlogHow to organize your data science project in 2021

Maintaining proper organization of all your data science projects will increase your productivity, minimize errors, and increase your development efficiency. This tutorial will guide you through a framework on how to keep everything in order on your local machine and in the cloud.



It's always good idea to maintain two versions of your project, one locally and the other on Github.

This article will discuss some useful tips that will enable you to better organize your data science projects. Before delving into some tips for data science project management, let’s first discuss why it is important to organize your project.

 

4 Reasons why it is important to organize your project

 

  1. Organization increases productivity. If a project is well organized, with everything placed in one directory, it makes it easier to avoid wasting time searching for project files such as datasets, codes, output files, and so on.
  2. A well-organized project helps you to keep and maintain a record of your ongoing and completed data science projects.
  3. Completed data science projects could be used for building future models. If you have to solve a similar problem in the future, you can use the same code with slight modifications.
  4. A well-organized project can easily be understood by other data science professionals when shared on platforms such as Github.

For illustrative purposes, we will use the cruise ship data set. We assume that we would like to build a machine learning model for recommending cruise ship crew size based on predictor variables such as age, tonnage, passengers, length, cabins, etc. In section I, we describe how the project should be organized locally. Then in section I, we describe how to create a Github repository for the project. It is always recommended that you maintain two versions of your project, one locally and the other on Github. The advantage of this is that you can access the Github version of your project from anywhere in the world and at any time, as long as you have an internet connection. Another advantage is that if something were to happen with your local computer that could impact your computer adversely, such as viruses in the computer, then you can always be confident that you still have your project files on Github that can serve as a backup.

 

I. Local Project Directory

 

It is good to have a project directory for each project that you are working on.

Directory Name

When creating a project directory for your project, it’s good to select a directory name that reflects your project, for example, for the machine learning model for recommending crew size, one may choose a directory name such as ML_Model_for_Predicting_Ships_Crew_Size.

Directory content

Your project directory should contain the following:

(1) Project plan: This could be a world document where you describe what your project is all about. You may start by providing a brief synopsis followed by step by step plan of what you would like to accomplish. For example, before building a model, you may ask yourself:

  1. What are the predictor variables?
  2. What is the target variable? Is my target variable discrete or continuous?
  3. Should I use classification or regression analysis?
  4. How do I handle missing values in my dataset?
  5. Should I use normalization or standardization when bringing variables to the same scale?
  6. Should I use Principal Component Analysis or not?
  7. How do I tune hyperparameters in my model?
  8. How do I evaluate my model to detect biases in the dataset?
  9. Should I use ensemble methods where I train using different models then perform an ensemble average, e.g. using classifiers such as SVM, KNN, Logistic Regression, then average over 3 models?
  10. How do I select the final model?

(2) Project datasets: You should include the comma separated value (CSV) files for all the datasets to be used for the project. In this example, there is just one CSV file: cruise_ship_info.csv.

(3) Project Codes: Once you’ve figured out what your project plans and objectives are, it is time to start coding. Depending on the type of problem you are solving, you may decide to use a jupyter notebook or an R script for writing your code. Let’s just assume we are going to be using a jupyter notebook.

On your jupyter notebook, start by adding a project heading or title, for example:

Machine Learning Model for Predicting a Ship’s Crew Size

 

Then you may provide a brief synopsis of your project, followed by the author’s name, and date, for example:

We build a simple model using the cruise_ship_info.csv data set for predicting a ship’s crew size. 
This project is organized as follows: 
(a) data preprocessing and variable selection; 
(b) basic regression model; 
(c) hyper-parameters tuning; and 
(d) techniques for dimensionality reduction.

Author: Benjamin O. Tayo
Date: 4/8/2019

 

As you develop the code, you want to make sure that the jupyter notebook is well organized into sections that highlight the machine learning model building workflow, such as:

Importation of necessary python libraries

Importation of dataset

Exploratory data analysis

Feature selection and dimensionality reduction

Feature scaling and data partitioning into train and test sets

Model building, testing, and evaluation

 

For sample project jupyter notebook and R script files, please see the following Github reps, bot13956/ML_Model_for_Predicting_Ships_Crew_Size and bot13956/weather_pattern.

(4) Project Outputs: You may also store key project outputs in your local directory. Some key project outputs could be data visualizations, graphs illustrating model error as a function of different parameters, or tables containing key outputs such as R2 values, mean square errors, or regression coefficients. Project outputs are very handy because they can be used to prepare project reports or PowerPoint presentation slides to be presented to your data science team or to the business administrators in your company.

(5) Project report: In some cases, you may have to put together a project report to describe project accomplishments and provide prescribed actions to be taken based on the findings and insights from your model. In this case, you need to put together a project report using MS word. When writing a project report, you can make good use of some visualizations produced from your main code. You want to add these to the report. Your main code may be added as an appendix to the project report.

An example of a project report file can be found at bot13956/Monte_Carlo_Simulation_Loan_Status.

 

II. Github Project Directory

 

Once you’ve solved the problem of interest, you then have to create a project repository on GitHub and upload project files such as datasets, jupyter notebooks, R program scripts, and sample outputs. Creating a GitHub repository for any data science project is extremely important. It enables you to have access to your code at all times. You get to share your code with a community of programmers and other data scientists. Also, it is a means for you to showcase your data science skills.

Tips for creating a Github repository: Make sure you choose a suitable title for your repository. For example:

Repository Name: bot13956/ML_Model_for_Predicting_Ships_Crew_Size

 

Then include a README file to provide a synopsis of what your project is all about.

Author: Benjamin O. Tayo
Date: 4/8/2019

We build a simple model using the cruise_ship_info.csv data set for predicting a ship's crew size. 
This project is organized as follows: 
(a) data preprocessing and variable selection; 
(b) basic regression model; 
(c) hyper-parameters tuning; and 
(d) techniques for dimensionality reduction.

cruise_ship_info.csv: dataset used for model building.

Ship_Crew_Size_ML_Model.ipynb: the jupyter notebook containing code.

 

Then you may upload your project files, including the dataset, jupyter notebook, and sample outputs.

Here is an example of a Github repository for a machine learning project:

Repository URL: https://github.com/bot13956/ML_Model_for_Predicting_Ships_Crew_Size.

In summary, we’ve discussed how a data science project has to be organized. Good organization leads to better productivity and efficiency. When next you have to work on a new project, please take the time to organize your project. This will not only help with increasing efficiency and productivity, but it will also help to minimize errors. Also, keeping good records of all your current and completed projects enables you to create a repository where you can save all your projects for future use.

Original. Reposted with permission.

 

Related: