Design Patterns for Machine Learning Pipelines
ML pipeline design has undergone several evolutions in the past decade with advances in memory and processor performance, storage systems, and the increasing scale of data sets. We describe how these design patterns changed, what processes they went through, and their future direction.
By Davit Buniatyan, CEO of Activeloop, a Y-Combinator alum startup.
Design patterns for ML pipelines have evolved several times in the past decade. These changes are usually driven by imbalances between memory and CPU performance. They are also distinct from traditional data processing pipelines (something like map reduce) as they need to support the execution of long-running, stateful tasks associated with deep learning.
As growth in dataset sizes outpace memory availability, we have seen more ETL pipelines designed with distributed training and distributed storage as first-class principles. Not only can these pipelines train models in a parallel fashion using multiple accelerators, but they can also replace traditional distributed file systems with cloud object stores.
Along with our partners from the AI Infrastructure Alliance, we at Activeloop are actively building tools to help researchers train arbitrarily large models over arbitrarily large datasets, like the open-source dataset format for AI, for instance.
Single machine + Single GPUs
Pre-2012, training ML models was a relatively straightforward exercise. Datasets like ImageNet and KITTI were stored on a single machine and accessed by a single GPU. For the most part, researchers could get decent GPU utilization without prefetching.
Life was easier in this era: there was little need to think about gradient sharing, parameter servers, resource scheduling, or synchronization between multiple GPUs. As long as jobs were GPU-bound rather than IO-bound, this simplicity meant first-generation deep learning frameworks like Caffe running on a single machine were often good enough for most ML projects.
Although it wasn't uncommon to spend weeks training a model, researchers started exploring the use of multiple GPUs to parallelize training.
Single machine + Multiple GPUs
By 2012, larger datasets like COCO, larger model architectures, and empirical results justified the effort required for configuring multi-GPU infrastructure. Consequently, we saw the emergence of frameworks like TensorFlow and PyTorch, along with dedicated hardware rigs with multiple GPUs that could handle distributed training with minimal code overhead.
Behind the seemingly simple requirements of data parallelism (parameter sharing followed by all-reduce operation) were non-trivial problems associated with parallel computing, such as fault tolerance and synchronization. As a result, an entirely new set of design patterns and best practices from the high-performance computing (HPC) world (such as MPI) had to be internalized by the community before multi-GPU training could be reliably put into practice, as we saw with frameworks like Horovod.
During this period, a popular strategy for multi-GPU training was copying and placing data locally with GPUs. Modules such as tf.data excelled at maximizing GPU utilization by combining prefetching with highly efficient data formats while maintaining flexibility for on-the-fly data augmentation.
However, as datasets increased in size (with datasets like Youtube8M, which weighed in at 300 TB+ with more complicated data types) and therefore required cloud-scale storage with random access patterns, distributed file system formats for unstructured, multi-dimensional data like HDF became less useful over time.
Not only were those formats not designed with cloud object stores in mind, but they also were not optimized for data access patterns unique to model training (shuffle followed by sequential traversal) or read types (multi-dimensional data accessed as blocks or chunks without reading the entire file).
This gap between a need for cloud storage to store large datasets and a need to maximize GPU utilization meant the ML pipeline had to be redesigned again.
Object storage + Multiple GPUs
In 2018, we started seeing more distributed training with cloud object stores with libraries like s3fs and AIStore, as well as services like AWS SageMaker’s Pipe Mode. We also saw the emergence of HDF-inspired data storage formats with cloud storage in mind, such as Zarr. Perhaps most interesting, we noticed a number of industry ML teams (usually working with 100TB+) developing in-house solutions to stream data from S3 to models onto VMs.
Even though the problem of transfer speed remained (reading from cloud object storage could be orders of magnitude slower than reading from SSD), this design pattern is widely considered as the most feasible technique for working with petascale datasets.
While current techniques of sharding over EBS volumes or piping data directly to models, along with clever workarounds like WebDataset, are sufficient, the fundamental problem of throughput mismatch remains: cloud object stores can push ~30MB/sec per request while GPUs reads can hit 140GB/sec, which meant costly accelerators are often underutilized.
Accordingly, there are several key considerations that need to be addressed by any new data storage format designed for large-scale ML:
- Dataset Size: datasets often exceed the capacity of node-local disk storage, requiring distributed storage systems and efficient network access.
- Number of Files: datasets often consist of billions of files with uniformly random access patterns, something that often overwhelms both local and network file systems. Reading from a dataset containing a large number of small files takes a long time.
- Data Rates: training jobs on large datasets often use many GPUs, requiring aggregate I/O bandwidths to the dataset of many GBytes/s; these can only be satisfied by massively parallel I/O systems.
- Shuffling and Augmentation: training data needs to be shuffled and augmented prior to training. Repeatedly reading a dataset of files in shuffled order is inefficient when the dataset is too large to be cached in memory.
- Scalability: users often want to develop and test on small datasets and then rapidly scale up to large datasets.
In the current regime of petascale datasets, researchers should be able to train arbitrarily large models on arbitrarily large datasets in the cloud. Just like distributed training separated model architectures from computing resources, cloud object storage has the potential to make ML training independent of dataset sizes.
 Installing and configuring NCCL or MPI are not for the faint of heart.
 A particularly well designed API is PyTorch’s DistributedDataParallel.
 There were innovations such as Parquet, but those primarily dealt with structured tabular data.
 AWS’ Pipe Mode also helpfully optimizes lower level details such as ENA and multipart downloading.
Bio: Davit Buniatyan (@DBuniatyan) started his Ph.D. at Princeton University at 20. His research involved reconstructing the connectome of the mouse brain under the supervision of Sebastian Seung. Trying to solve hurdles he faced analyzing large datasets in the neuroscience lab, David became the founding CEO of Activeloop, a Y-Combinator alum startup. He is also a recipient of the Gordon Wu Fellowship and AWS Machine Learning Research Award.
- The Prefect Way to Automate & Orchestrate Data Pipelines
- Vaex: Pandas but 1000x faster
- Build Pipelines with Pandas Using pdpipe