Best Practices for Using Notebooks for Data Science
Are you interested in implementing notebooks for data science? Check out these 5 things to consider as you begin the process.
By Armin Wasicek, Sumo Logic
In the data science world, notebooks have emerged as an important tool - they are active documents created by individuals or groups to write and run code, display results, and share outcomes and insights.
Like every other story, a data science notebook follows a certain structure that is typical for its genre. Usually there are four parts - (1) It starts with defining a data set, (2) continue to clean and prepare the data, (3) perform some modeling using the data and (4) interpret the results. In essence, a notebook should record an explanation of why experiments were initiated, how they were performed and then display the results.
In order to explain notebooks, let's take a step back to understand their anatomy, discuss human speed versus machine speed, explore how notebooks can increase productivity, and outline the top five best practices for writing notebooks.
A notebook segments a computation in individual steps called paragraphs. A paragraph contains an input and an output section. Each paragraph executes separately and modifies the global state of the notebook. State can be defined as the ensemble of all relevant variables, memories, and registers. Paragraphs must not contain computations, but can contain text or visualizations to illustrate the workings of the code.
The power of the notebook roots in its ability to segment and then slow down computation. Common executions of computer programs are done at machine speed. Machine speed suggests that when a program is submitted to the processor for execution, it will run from start to end as fast as possible and only block for IO or user input. Consequently, the state of the program changes so fast that it is neither observable nor modifiable by humans. Programmers would typically attach debuggers to stop programs during execution at so-called breakpoints and read out and analyze their state. Thus, they would slow down execution to human speed.
Notebooks make interrogating the state more explicit. Certain paragraphs are dedicated to make progress in the computation, i.e., advance the state, whereas other paragraphs would simply serve to read out and display the state. Moreover, it is possible to rewind state during execution by overwriting certain variables. It is also simple to kill the current execution, thereby deleting the state and starting anew.
Notebooks increase productivity, by facilitating incremental improvement. It is cheap to modify code and rerun only the relevant paragraph. So when developing a notebook, the user builds up state and then iterates on that state until progress is made. Running a stand-alone program on the contrary will incur more setup time and might be prone to side-effects. A notebook will most likely keep all its state in the working memory whereas every new execution of a stand-alone program will need to build up the state on every time it is run.
This takes more time and the required IO operations might fail. Iterating on a program state in the memory proved to be very efficient. This is particularly true for data scientists, as their programs usually deal with a large amount of data that has to be loaded in and out of memory as well as computations that can be time-consuming.
From an the organizational point of view, notebooks are a valuable tool for knowledge management. They are designed to be self-contained, sharable units of knowledge and amend themselves for:
- Knowledge transfer
- Auditing and validation
So, interested in implementing notebooks? Here are a few things to consider as you begin the process:
A notebook contains a complete record of procedures, data, and thoughts to pass on to other people. For that purpose, they need to be focused. Although it is tempting to put everything in one place, this might be confusing for readers. Better write two or more notebooks than overloading a single notebook.
A common source of confusion is when program state gets passed on between paragraphs through hidden variables. The set of variables that represent the interface between two subsequent paragraphs should be made explicit. Referencing variables from other paragraphs than the previous one should be avoided.
A notebook integrates code, it is not a tool for code development. The tool for code development is an Integrated Development Environment (IDE). Therefore, a notebook should one contain glue code and maybe one core algorithm. All other code should be developed in an IDE, unit tested, version controlled, and then imported via libraries into the notebook. Modularity and all other good software engineering practices are still valid in notebooks. As in practice number one too much code clutters the notebook and distracts from the original purpose or analysis goal.
Notebooks are meant to be shared and read by others. Others might not have an easy time following our thought process, if we did not come up with good, self-explaining names. Tidying up the code goes a long way, too. Notebooks impose an even higher standard than traditional code on quality.
A picture is worth a thousand words. A diagram, however, will need some words to label axes, describe lines and dots, and comprehend other important informations such sample size, etc. A reader can have a hard time to seize the proportion or importance of a diagram without that information. Also keep in mind that diagrams are easily copy-pasted from the notebook into other documents or in chats. Then they lose the context of the notebook in which they were developed.
Bottom line here - The segmentation of a thought process is what fuels the power of the notebook. Facilitating incremental improvements when iterating on a problem boosts productivity. A notebook should provide a stat-of-the-art user experience coupled with access to machine learning frameworks to unlock the value of data.
Bio: Armin Wasicek is a senior software engineer at Sumo Logic working on advanced analytics. Previously, he spent many happy years as a researcher in academia and industry. His interests are machine learning, security and the internet of things. Armin holds PhD and MSc degrees from Technical University Vienna, Austria and he was a Marie Curie Fellow at University of California, Berkeley.
- Programming Best Practices For Data Science
- How to Set Up a Free Data Science Environment on Google Cloud
- Why do I Call Myself a Data Scientist?