The 5 Rules For Good Data Science Project Documentation

Once a data scientist finishes building a project, they need to do the task that most of us hate: documenting the code.

By Sara A. Metwalli, Associate Editor at Towards Data Science on December 8, 2022 in Programming

So you finished your project, got some excellent data, processed it, cleaned it, trained your model, applied it to your data, and got terrific results. That's it.

Not really.

Often, the software is developed for others to use, so once the programmer or data scientist finishes building the project, they will need to do the task that most of us hate…

Documenting the code.

In software engineering in general, writing documentation refers to the process in which the programmer or main developer of the code writes a document explaining in detail what the code does, what its goal is, and how it achieves that goal. The main reason programmers hate writing documentation is that, as a programmer, you would rather write code than explain it.

On top of that, for documentation to be good, it needs to be simple enough for anyone to understand, even if they are not professional programmers. And as we all know (or maybe not all of us), programmers tend to be good at writing code but bad at explaining the theory behind it.

1. A Good Description

At the beginning of your documentation, there should be a concise description. It should be only a few sentences long and clearly explain what the project does and how it does it.

You have probably already used some open-source projects in your previous work, so here are some excellent descriptions from well-known data science projects.

Pandas: "pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language." — Pandas documentation

Matplotlib: "Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible." — Matplotlib documentation

Bokeh: "Bokeh is an interactive visualization library for modern web browsers. It provides elegant, concise construction of versatile graphics and affords high-performance interactivity over large or streaming datasets. Bokeh can help anyone who would like to quickly and easily make interactive plots, dashboards, and data applications." — Bokeh documentation.

2. A Clear and Concise Installation Guide

When you develop a project, you often host it on a remote server for people to use. They will probably need to clone it from GitHub, install it using pip, or download it from the official website.

This might seem straightforward for experienced programmers; after all, they have installed many libraries during their careers. However, what if a new programmer is trying to use your project? They may need a little help.

After describing your project and what it does, the next section should be "Getting Started." Its primary goal is to give the user simple steps for installing the project and running a small example that proves it was installed correctly.

This section also often lists the dependencies, that is, the other libraries or modules the project needs in order to function properly.
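As a rough illustration of the "small example proving it was installed correctly," a Getting Started page could include a short check like the one below. This is only a sketch: the function name check_dependencies and the package list are hypothetical, and a real project would substitute its own requirements.

```python
from importlib import metadata

# Hypothetical dependency list; a real project would list its own requirements.
REQUIRED_PACKAGES = ["pip"]


def check_dependencies(packages):
    """Map each package name to its installed version, or None if it is missing."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            report[name] = None
    return report


if __name__ == "__main__":
    for name, version in check_dependencies(REQUIRED_PACKAGES).items():
        status = version if version is not None else "NOT INSTALLED"
        print(f"{name}: {status}")
```

A new user can run this right after installation and see at a glance which dependencies, if any, are still missing.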

3. Tutorials

At this point, the user knows what the project does and how to install it. Now, it's time to dig deep and start using it. Good documentation will often contain tutorials covering the different use cases of the project and how to accomplish them using the project's inner functions.

More tutorials do not equal better documentation; what matters is quality, not quantity. Some projects have only a few tutorials, but those tutorials highlight the project's usage and clearly state how it can be extended to other cases. This means the user only needs to read 2 or 3 tutorials to understand the inner workings of the project's code.

The most common tutorial format is a few lines of explanation followed by some code, then more explanation, more code, and so on.
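The alternating explanation-then-code format can be sketched in Python with the standard doctest module, which lets the code fragments in a tutorial be executed and checked. The normalize function here is a made-up example, not taken from any of the projects mentioned above, and doctest is just one possible way to keep tutorial code runnable.

```python
import doctest


def normalize(values):
    """Scale a list of numbers to the range [0, 1].

    Start with a small list of raw measurements:

    >>> raw = [2.0, 4.0, 6.0]

    Passing it to ``normalize`` maps the smallest value to 0 and the
    largest value to 1:

    >>> normalize(raw)
    [0.0, 0.5, 1.0]
    """
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # All values are identical, so there is no range to scale by.
        return [0.0 for _ in values]
    return [(v - low) / span for v in values]


if __name__ == "__main__":
    # Runs every ">>>" example in this module and reports any mismatches.
    doctest.testmod()
```

Because the examples are executed, a tutorial written this way fails loudly when the code changes and the explanation is no longer accurate.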

4. Detailed API Reference

This is the part that usually comes to mind when you hear (or read) the word documentation. This section is where you go through all the functions, public variables, and classes in your project, briefly explaining what each one does, its attributes, and what it returns.

The brief explanation is often two or three sentences that directly state the purpose of the function or class. It is accompanied by a function header showing its type, the common types of its attributes, and its return value. This header often includes an embedded link to the function or class definition in the source code (wherever it is hosted).
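In Python projects, this kind of reference entry is commonly written as a docstring next to a typed function header, which documentation tools can then collect into an API page. The sketch below uses a hypothetical Dataset class and column_mean function and one common docstring layout (Args, Returns, Raises); other layouts work just as well.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Dataset:
    """A small in-memory table of numeric rows.

    Attributes:
        name: Human-readable label for the dataset.
        rows: List of rows, each a list of floats of equal length.
    """

    name: str
    rows: List[List[float]]


def column_mean(dataset: Dataset, column: int) -> float:
    """Return the arithmetic mean of one column.

    Args:
        dataset: The dataset to read from.
        column: Zero-based index of the column to average.

    Returns:
        The mean of the selected column as a float.

    Raises:
        ValueError: If the dataset has no rows.
    """
    if not dataset.rows:
        raise ValueError("dataset has no rows")
    return sum(row[column] for row in dataset.rows) / len(dataset.rows)
```

Calling help(column_mean) in an interpreter prints the header and the docstring together, which is essentially the reference entry a reader would see.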

5. Architecture Explanation

So far, your documentation has explained how the inner core of your project works, the main functions, and some use cases. The last section of the documentation should explain why your project works the way it does.

Not all documentation contains this section; it is, after all, not as essential as the other sections we went through. However, it may be necessary if you make your project open source. In that case, this section can guide contributors on how to add new functions to the code without affecting its core functionality.

Takeaways

Nowadays, writing robust code is not enough; to prove your capabilities and show how familiar you are with the project and the field, you need to provide well-written documentation that highlights how your code works and how it can be used.

Producing good documentation depends on many factors; in this article, we went through 5 sections that, if included, will add value to your documentation and make it as helpful as possible.

You might think the term "good documentation" is very vague and depends on the person. For example, I might see some documentation as good and helpful, while you may think the opposite. If we can't set rigid rules for what makes good documentation, how can we decide if documentation is good?

The simple answer to that is feedback. Most modern documentation includes a "give us feedback" section, where users can contact the programmer about missing or inaccurate information so the documentation can be made better and more helpful.

Sara Metwalli is a Ph.D. candidate at Keio University researching ways to test and debug quantum circuits. She is an IBM research intern and Qiskit advocate helping build a more quantum future. She is also a writer on Medium, Built-in, She Can Code, and KDN, writing articles about programming, data science, and tech topics. She is also a lead in the Women Who Code Python international chapter, a train enthusiast, a traveler, and a photography lover.