Netflix’s Polynote is a New Open Source Framework to Build Better Data Science Notebooks
The new notebook environment provides substantial improvements to streamline experimentation in machine learning workflows.
Notebooks are a data scientist's best friend, but they can also be a nightmare to work with. For someone accustomed to working with modern integrated development environments (IDEs), working with notebooks feels like going back decades. Furthermore, modern notebook environments are mostly constrained to Python programs and lack first-class support for other programming languages. A few days ago, Netflix open sourced Polynote, a new notebook environment that addresses some of those challenges.
Polynote was born out of the necessity to accelerate data science experimentation at Netflix. Over the years, Netflix has built a world-class machine learning platform based mostly on JVM languages like Scala. The support for those languages in mainstream notebook technologies such as Jupyter is fairly basic, so the team needed a better solution. Polynote started from that basic requirement, but it also incorporates the lessons learned from building one of the most ambitious notebook-based experimentation platforms in the data science world.
Inside Netflix’s Notebook-Driven Architecture
Over the last few years, Netflix has transformed its use of data science notebooks from an experimentation artifact into a key component of the lifecycle of machine learning solutions. Initially, Netflix adopted Jupyter Notebooks as a data exploration and analysis tool. However, the engineering team quickly realized that Jupyter offered tangible advantages in terms of runtime abstraction, extensibility, interpretability of the code and debugging that could have a major impact on data science workloads if used correctly. In order to expand the use of Jupyter as a data science runtime, the Netflix team needed to solve a few major challenges:
- The Code-Output Mismatch: Notebooks are frequently changed and, many times, the output you are seeing in the environment does not correspond to the current code.
- The Server Requirement: Notebooks typically require a Notebook server runtime to run, which represents an architectural challenge when adopted at scale.
- Scheduling: Most data science models need to be executed on a periodic basis, but the tools for scheduling Notebooks are still fairly limited.
- Parametrizing: Notebooks are fairly static code environments, and the processes for passing input parameters are far from trivial.
- Integration Testing: Notebooks are isolated code environments which are notoriously difficult to integrate with other Notebooks. As a result, tasks like integration testing become a nightmare when using Notebooks.
To address those requirements, Netflix built a very ambitious architecture that enables the operationalization of Jupyter notebooks. The initial implementation included technologies such as Papermill, which enables the parametrization of notebooks.
While the initial notebook architecture at Netflix was certainly ambitious, it was also constrained to Python programs. Now it was time to expand.
Polynote is a multi-language notebook experimentation environment. In addition to Python, the current release supports languages such as SQL, Vega (for visualizations) and, of course, Scala. The platform also integrates with data science infrastructure such as Apache Spark. At its core, Polynote includes the following capabilities:
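To make the parametrization idea concrete, here is a minimal sketch of how input values can be injected into a notebook before it runs. It is not Papermill's implementation: it only imitates the general approach of placing a generated cell right after a cell tagged `parameters`, using a plain dictionary in the Jupyter notebook JSON layout. The function name `inject_parameters`, the sample cells and the parameter names are all assumptions made for illustration.

```python
import copy


def inject_parameters(notebook: dict, parameters: dict) -> dict:
    """Return a copy of the notebook with an extra cell that overrides defaults."""
    result = copy.deepcopy(notebook)
    cells = result["cells"]
    # Find the cell tagged "parameters"; if none exists, inject at the top.
    index = next(
        (i for i, cell in enumerate(cells)
         if "parameters" in cell.get("metadata", {}).get("tags", [])),
        -1,
    )
    source = "\n".join(f"{name} = {value!r}" for name, value in parameters.items())
    injected = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        "source": source,
        "outputs": [],
        "execution_count": None,
    }
    cells.insert(index + 1, injected)
    return result


# Hypothetical template notebook: defaults first, then the code that uses them.
template = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "code", "metadata": {"tags": ["parameters"]},
         "source": "region = 'us'\ndays = 7", "outputs": [], "execution_count": None},
        {"cell_type": "code", "metadata": {},
         "source": "summary = f'{region}:{days}'", "outputs": [], "execution_count": None},
    ],
}

run = inject_parameters(template, {"region": "eu", "days": 30})

# Execute the code cells top to bottom to show that the injected values win.
namespace = {}
for cell in run["cells"]:
    if cell["cell_type"] == "code":
        exec(cell["source"], namespace)
print(namespace["summary"])
```

Because the template is copied rather than modified, the same notebook can be reused for many scheduled runs, each with its own set of values.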
a) Improved Editing Experience: Polynote tries to enable an editing experience closer to modern IDEs.
b) Multi-Language Support: Polynote introduces first-class support for Scala and other languages used in data science environments.
c) Data Visualization Improvements: Polynote natively integrates visualizations of the datasets in a notebook, without the need to add a lot of code.
d) Configuration and Dependency Management: Programs written in languages like Scala often have complex package dependencies. Polynote saves the package dependency configuration within the notebook itself, addressing some of the common challenges in this area experienced by JVM developers.
e) Reproducibility: The combination of code, data and execution results in a single document makes notebooks powerful, but also difficult to reproduce. Polynote includes reproducibility as a first-class capability of the framework.
Improved Editing Experience
Polynote includes common IDE features such as code auto-completion and syntax error highlighting, which improve the experience for data scientists and researchers building Notebooks. Most of the editing capabilities are powered by the Monaco editor, which also powers the editing experience of Visual Studio Code.
Multi-Language Support
Polynote not only provides support for multiple languages, it also allows those languages to be combined in a single program. In Polynote, every cell can be based on a different language. When a cell is run, the kernel provides the available typed input values to the cell’s language interpreter. In turn, the interpreter provides the resulting typed output values back to the kernel. This allows cells in Polynote notebooks to operate within the same context. For example, a notebook can use a Python library to compute an isotonic regression of a dataset generated with Scala.
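The exchange of typed values between the kernel and the language interpreters can be sketched in a few lines. This is only a toy model of the idea described above, not Polynote's code: `TypedValue`, `Kernel`, `literal_interpreter` and `python_interpreter` are hypothetical names, and the "literal" language is a stand-in for a second language such as Scala.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class TypedValue:
    """A value together with the type name the kernel tracks for it."""
    type_name: str
    value: Any


# An interpreter receives the cell source plus typed inputs and returns typed outputs.
Interpreter = Callable[[str, Dict[str, TypedValue]], Dict[str, TypedValue]]


def literal_interpreter(source: str, inputs: Dict[str, TypedValue]) -> Dict[str, TypedValue]:
    """Stand-in for a second language: each line is `name = v1, v2, ...`."""
    outputs = {}
    for line in source.strip().splitlines():
        name, _, raw = line.partition("=")
        values = [float(item) for item in raw.split(",")]
        outputs[name.strip()] = TypedValue("List[Double]", values)
    return outputs


def python_interpreter(source: str, inputs: Dict[str, TypedValue]) -> Dict[str, TypedValue]:
    """Run Python source with the inputs in scope; report new or rebound names."""
    scope = {name: typed.value for name, typed in inputs.items()}
    exec(source, scope)
    outputs = {}
    for name, value in scope.items():
        if name.startswith("__"):
            continue  # skip entries such as __builtins__ added by exec
        if name not in inputs or inputs[name].value is not value:
            outputs[name] = TypedValue(type(value).__name__, value)
    return outputs


class Kernel:
    """Keeps the shared symbol table and routes each cell to its interpreter."""

    def __init__(self, interpreters: Dict[str, Interpreter]):
        self.interpreters = interpreters
        self.symbols: Dict[str, TypedValue] = {}

    def run_cell(self, language: str, source: str) -> Dict[str, TypedValue]:
        outputs = self.interpreters[language](source, dict(self.symbols))
        self.symbols.update(outputs)
        return outputs


kernel = Kernel({"literal": literal_interpreter, "python": python_interpreter})
kernel.run_cell("literal", "xs = 3, 1, 2")
kernel.run_cell("python", "total = sum(xs)\nordered = sorted(xs)")
print(kernel.symbols["total"])
print(kernel.symbols["ordered"])
```

The design choice worth noticing is that interpreters never talk to each other directly. Each one only sees the values the kernel hands over, which is what lets a cell in one language consume a dataset produced by a cell in another.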
Data Visualization Improvements
Data visualizations are a common component of most notebook environments. However, Polynote takes the visualization value proposition to another level by including it as a native component of the platform, so developers do not need to write any code in order to visually explore a dataset.
Configuration and Dependency Management
Most of the time, data scientists working on notebooks can enjoy the efficiency of Python’s package management model to handle the dependencies of a program. However, in JVM languages like Scala, dependency management can become a total nightmare. Polynote addresses that challenge by storing the configuration and dependency information directly in the notebook itself, rather than relying on external files. Additionally, Polynote provides a user-friendly Configuration section where users can set dependencies for each notebook.
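As a rough illustration of what "storing dependencies in the notebook itself" means, the sketch below keeps a per-language dependency list inside the notebook document and shows that it survives a save and reload. The `metadata.config.dependencies` layout, the `set_dependencies` helper and the package coordinates are assumptions chosen for this example; they are not taken from Polynote's actual file format.

```python
import json


def set_dependencies(notebook: dict, language: str, coordinates: list) -> dict:
    """Record dependencies for one language inside the notebook's own metadata."""
    config = notebook.setdefault("metadata", {}).setdefault("config", {})
    deps = config.setdefault("dependencies", {}).setdefault(language, [])
    for coordinate in coordinates:
        if coordinate not in deps:  # avoid duplicate entries
            deps.append(coordinate)
    return notebook


# Hypothetical notebook and hypothetical package coordinates.
notebook = {"metadata": {}, "cells": []}
set_dependencies(notebook, "scala", ["com.example:feature-lib_2.12:1.0.0"])
set_dependencies(notebook, "python", ["scikit-learn"])

# The dependency list travels with the document: save it, load it, still there.
serialized = json.dumps(notebook, indent=2, sort_keys=True)
restored = json.loads(serialized)
print(restored["metadata"]["config"]["dependencies"])
```

Anyone who opens the restored notebook gets the same dependency list without consulting a separate build file, which is the practical benefit described above.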
Reproducibility
With Polynote, Netflix introduces a new code interpretation model instead of relying on a REPL like a traditional notebook. One of the key capabilities of the new interpretation model is that it removes hidden state, which allows data scientists to copy cells within a notebook without introducing any state from the previous position.
Polynote is a new entrant in the competitive field of data science notebooks, but one that stands on its own merits. The support for JVM-based languages could make Polynote a favorite of developers working on Spark infrastructures. The editing and reproducibility capabilities are also welcome enhancements over traditional notebook environments. Polynote is available on GitHub and you can also follow the project’s website.
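The difference between a REPL-style notebook and one without hidden state can be shown with a small sketch. It assumes, as a simplification, that "no hidden state" means a cell only sees what the cells above it define; the positional version below simply re-runs those cells, which illustrates the visibility rule rather than how Polynote actually tracks state. The function names and sample cells are hypothetical.

```python
def run_repl(cells, execution_order):
    """Traditional model: every executed cell mutates one shared namespace."""
    namespace = {}
    for index in execution_order:
        exec(cells[index], namespace)
    return namespace


def run_positional(cells, target):
    """Position-based model: a cell sees only what the cells above it define."""
    namespace = {}
    for source in cells[: target + 1]:
        exec(source, namespace)
    return namespace


cells = [
    "bonus = 10",
    "price = 100",
    "total = price + bonus",
    "bonus = 50",  # defined below the cell that computes `total`
]

# REPL model: the last cell was run before the third one, so its value leaks in.
repl_total = run_repl(cells, [0, 1, 3, 2])["total"]

# Positional model: the third cell ignores anything defined below it.
positional_total = run_positional(cells, 2)["total"]

print(repl_total, positional_total)
```

In the REPL case the result depends on the order in which someone happened to click through the cells, while in the positional case it depends only on what the notebook shows from top to bottom.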
Original. Reposted with permission.