Data Versioning: Does it mean what you think it means?
Does data versioning mean what you think it means? Read this overview with use cases to see what data versioning really is, and the tools that can help you manage it.
By Einat Orr, PhD., Co-founder & CEO at Treeverse
When we first thought about a tagline for lakeFS, our recently released OSS project, we instinctively used terms such as “Data versioning”, “Manage data the way you manage code”, “It’s Git for data”, and any variation of the three that is a grammatically correct sentence in English. We were very pleased with ourselves for 5 minutes, maybe 7, before realizing these phrases don’t really mean anything, or mean too many things to properly describe the value we bring. They are also commonly used by other players in the domain that address completely different use cases.
We decided to map the world of projects declaring “Data Versioning”, “Manage data the way you manage code”, and “It’s Git for Data” according to use cases.
Use Case #1: Collaboration over data
The pain: Data analysts and data scientists use many data sets, external and internal, that change over time. Managing access to these data sets, and to the different versions of each data set over time, is hard and error prone.
The solution: An interface that allows collaboration over data and management of its versions. The actual repository may be a proprietary database (e.g. DoltHub), or the interface may provide efficient access to data distributed within your systems (e.g. Quilt or Splitgraph). These interfaces grant easy access to, and management of, different versions of the same data set. Most players in this category also provide collaboration over other aspects of the workflow, most popularly the ability to collaborate over ML models. In this category you can find the likes of DAGsHub, DoltHub, data.world, Kaggle, Splitgraph, Quilt, FloydHub and DataLad.
Use Case #2: Managing ML pipelines
The pain: Running ML pipelines, from input data to tagged data, validation, modeling, optimizing hyperparameters, and introducing the models to production. There’s no simple way to manage this pipeline, or the many tools used in the process.
The solution: MLOps tools. At this point you might be asking yourself: why would Ops tools be mentioned in the context of data versioning? Because managing data pipelines is a major challenge in the ML application life cycle. Since ML is scientific work, it requires reproducibility, and reproducibility means data + code. A few MLOps tools enable data versioning, among them DVC, Pachyderm and MLflow.
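The core idea behind tools in this category, such as DVC, is to store large data files in a content-addressed cache while versioning small pointer files in Git alongside the code, so any past data version can be restored exactly. A minimal sketch of that idea in Python (file names and layout are illustrative, not any tool's actual API):

```python
import hashlib
import json
from pathlib import Path

def snapshot(data_path: Path, cache_dir: Path) -> str:
    """Store a content-addressed copy of the data file; return its hash."""
    content = data_path.read_bytes()
    digest = hashlib.sha256(content).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / digest).write_bytes(content)
    return digest

def write_pointer(data_path: Path, digest: str) -> Path:
    """Write a small pointer file that is committed to Git in place of the data."""
    pointer = data_path.parent / (data_path.name + ".meta")
    pointer.write_text(json.dumps({"path": data_path.name, "sha256": digest}))
    return pointer

def restore(pointer_path: Path, cache_dir: Path, dest: Path) -> None:
    """Recreate the exact data version recorded in a pointer file."""
    meta = json.loads(pointer_path.read_text())
    dest.write_bytes((cache_dir / meta["sha256"]).read_bytes())
```

Because the pointer file is tiny and deterministic, checking out an old Git commit and restoring from its pointer reproduces both the code and the data of that experiment.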
Use Case #3: The need for Insert and Delete in immutable data lakes
The pain: Data-lakes over object-storage are immutable (both objects and formats), but mutability is essential to:
- Comply with GDPR and other privacy regulations (delete records on demand)
- Ingest streaming data (requires appends)
- Backfills or late data (require updates to already saved data).
The solution: Structured data formats that allow Insert, Delete, and Upsert. The formats are columnar, and provide the ability to change an existing object by saving the delta of the changes into another object. The metadata of those objects includes the instructions on how to generate the latest version of an object from its saved delta objects. Data versioning is added here mainly to provide concurrency control. In this category you can find the open source projects Apache Iceberg, Apache Hudi and Delta Lake (by Databricks).
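The reconstruction step these formats perform, replaying delta objects over a base object to produce the latest view, can be sketched as follows. This is a simplified illustration of the technique, not the actual implementation of any of these projects:

```python
def apply_deltas(base, deltas):
    """Reconstruct the latest view of a record set from an immutable base
    object plus a sequence of delta objects, applied in commit order."""
    records = dict(base)  # key -> row
    for delta in deltas:
        for op, key, row in delta:
            if op == "upsert":
                records[key] = row      # insert new or update existing record
            elif op == "delete":
                records.pop(key, None)  # remove record without rewriting base
    return records

# Example: one base object and two later delta objects
base = {1: "alice", 2: "bob"}
deltas = [
    [("delete", 2, None)],     # e.g. a GDPR deletion request
    [("upsert", 3, "carol")],  # e.g. a late-arriving record
]
print(apply_deltas(base, deltas))  # -> {1: 'alice', 3: 'carol'}
```

The base object is never rewritten; readers merge it with the deltas on demand, which is why periodic compaction is needed in practice to keep read amplification in check.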
Use Case #4: Data lake manageability and resilience
The pain: Managing multiple data producers and consumers of an object storage based data lake. The consumers access the data using different tools, such as Hadoop/Spark, Presto, and analytics databases. Coordination between data producers and data consumers is challenging. It relies on internal processes and manual updates of catalogs or files. In addition, there’s no easy way to provide isolation without copying data, and there is no way to ensure consistency between multiple data collections.
The solution: An interface that allows collaboration over the data and version management. For example, the interface can expose Git-like semantics that allow versioning the lake by branching, committing, and merging changes.
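A toy model of that Git-like interface, assuming commits are immutable snapshots of the lake's contents and branches are cheap pointers, so branching copies no data (illustrative Python, not lakeFS's actual API):

```python
class Repo:
    """Toy Git-like versioning over a data lake: commits are immutable
    snapshots (path -> object id), branches are named pointers to commits."""

    def __init__(self):
        self.commits = {}             # commit id -> (parent id, snapshot)
        self.branches = {"main": None}
        self._next = 0

    def commit(self, branch, snapshot):
        """Record an immutable snapshot and advance the branch pointer."""
        cid = f"c{self._next}"
        self._next += 1
        self.commits[cid] = (self.branches[branch], dict(snapshot))
        self.branches[branch] = cid
        return cid

    def branch(self, name, source):
        # Branching copies only a pointer, never the underlying objects,
        # which is what makes isolation cheap on an object store.
        self.branches[name] = self.branches[source]

    def read(self, branch):
        """Return the snapshot the branch currently points at."""
        cid = self.branches[branch]
        return {} if cid is None else dict(self.commits[cid][1])

    def merge(self, source, dest):
        # Simplistic merge: overlay the source snapshot on the destination
        # and commit the result (real systems also detect conflicts).
        merged = {**self.read(dest), **self.read(source)}
        return self.commit(dest, merged)
```

A consumer reading from `main` sees only committed snapshots, while a producer experiments on its own branch in isolation and merges once the data is validated.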
We created lakeFS after meeting over 30 companies managing a data lake over an object storage. These pains, which we knew well from our own experience, kept coming up. lakeFS is designed to deliver resilience and manageability to object storage data lakes. It is format agnostic and works with all formats that support mutability.
Bio: Einat Orr, PhD. is Co-founder and Chief Executive Officer at Treeverse.
Original. Reposted with permission.