How Uber manages Machine Learning Experiments with Comet.ml





Sponsored Post.

By Comet.ml

Originally published here.

Scale is an interesting, often over-simplified challenge in machine learning. Intuitively, almost everyone understands that bigger models require more resources (larger datasets, more computational firepower, and so on), but cost is just one piece of ML’s scale problem. The complexity of managing offline ML experiments at scale, for example, is a particularly thorny one.

In contrast with online A/B tests, offline ML experiments are attempts to improve the accuracy or performance of an ML model by testing changes to it before they reach production.

At Uber, where ML is fundamental to most products, an easy way to manage offline experiments is essential to developer velocity. To address this, Uber AI looked for a solution that could complement and extend its in-house experiment management and collaboration capabilities.

 

What is hard about managing experiments at Uber-scale?

 
Take an ML-powered feature of Uber, say estimating the ETA for every ride and delivery. Imagine every market where Uber is active, and then every product offering within each market. The sheer quantity and diversity of product and market segments present logistical challenges right out of the gate. As Olcay Cirit, research scientist at Uber AI, says, “We can’t simply look at a global accuracy metric, rather we must use sliced analysis to gauge the improvements or degradations in each market and product category.”

This results in a huge volume of experiments, conducted by different teams with different processes, all of which must be compared in order to make optimal improvements to the model. Performing deep analysis on model results is extremely difficult when each model may behave differently across segments.
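To make sliced analysis concrete, here is a small, hypothetical sketch (the segments, column names, and numbers are invented for illustration, not Uber data): instead of a single global accuracy number, an error metric is computed for each market-and-product slice so that two candidate models can be compared segment by segment.

```python
# A rough illustration of sliced analysis with pandas: compare two candidate
# models slice by slice instead of with one global metric. All data and column
# names below are invented for the example.
import pandas as pd

# Each row: a completed trip with its true ETA and each model's prediction (minutes).
trips = pd.DataFrame({
    "city":        ["sf", "sf", "nyc", "nyc", "delhi", "delhi"],
    "product":     ["ride", "delivery", "ride", "delivery", "ride", "delivery"],
    "actual_eta":  [12.0, 25.0, 18.0, 30.0, 22.0, 35.0],
    "model_a_eta": [13.5, 24.0, 20.0, 29.0, 25.0, 33.0],
    "model_b_eta": [12.5, 27.0, 19.0, 31.5, 23.0, 36.0],
})

trips["model_a_abs_err"] = (trips["model_a_eta"] - trips["actual_eta"]).abs()
trips["model_b_abs_err"] = (trips["model_b_eta"] - trips["actual_eta"]).abs()

# Mean absolute error per (city, product) slice, rather than a single global number.
sliced = trips.groupby(["city", "product"])[["model_a_abs_err", "model_b_abs_err"]].mean()
sliced["b_improves_over_a"] = sliced["model_b_abs_err"] < sliced["model_a_abs_err"]
print(sliced)
```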

 

Building experiment management infrastructure at Uber

 
Comet supports most of the ML experiment tracking workload out-of-the-box. As Cirit explains, “As my team piloted Comet, it organized my team’s deep learning experiments by tracking all of our hyperparameters, metrics, and code changes… Developers working on improvements to ML infrastructure can more easily gauge, for example, whether improvements in training speed might adversely impact model convergence.”
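As a rough illustration of what that tracking looks like in practice, the snippet below logs hyperparameters and per-epoch metrics for a single run with Comet’s Python SDK; the project name, workspace, and training values are placeholders rather than Uber’s actual setup, and Comet can also capture code and git metadata for the run automatically.

```python
# A minimal sketch of experiment tracking with Comet's Python SDK.
# The project name, workspace, and training values are placeholders.
from comet_ml import Experiment

experiment = Experiment(
    api_key="YOUR_API_KEY",          # or configure via the COMET_API_KEY env var
    project_name="eta-prediction",   # hypothetical project name
    workspace="your-workspace",      # hypothetical workspace
)

# Hyperparameters are logged once per run and become comparable across experiments.
experiment.log_parameters({
    "learning_rate": 1e-3,
    "batch_size": 256,
    "hidden_units": 128,
})

# Metrics are logged per epoch so training curves show up automatically in the UI.
# Dummy values stand in for a real training loop here.
for epoch in range(10):
    fake_train_loss = 1.0 / (epoch + 1)
    fake_val_mae = 5.0 - 0.3 * epoch
    experiment.log_metric("train_loss", fake_train_loss, epoch=epoch)
    experiment.log_metric("val_mae", fake_val_mae, epoch=epoch)

experiment.end()
```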

“We found that Comet had these critical customizability features, including code panels, and were also quite willing to engage with us during our evaluation.”


For Uber, customizability and extensibility are critical concerns, as Uber AI has many different teams working on state-of-the-art models in different domains. As Cirit explains:

  • “We are working on integrating Comet into our in-house framework so that all users of the platform can benefit from experiment tracking… With Comet, it will be very easy for product team members to add new, project-specific metrics to track and visualize without having to make changes to the platform…”
  •  “Uber has its own internally developed Bayesian Optimization package that we use to tune hyperparameters. Comet lets us visualize all of the trials using panels and parallel coordinate charts, so that we can derive high-level insights about the hyperparameter search space in addition to getting a tuned set of hyperparameters.”
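Uber’s internal Bayesian optimization package is not public, so the sketch below substitutes a plain random search as a stand-in; the point is the logging pattern, in which each trial becomes its own Comet experiment with its hyperparameters and objective value, which a project-level parallel coordinates chart can then plot across trials. All names and values here are illustrative.

```python
# A sketch of logging hyperparameter-search trials to Comet. A plain random
# search stands in for Uber's internal Bayesian optimizer; the logging pattern
# is the point, not the search strategy.
import random

from comet_ml import Experiment


def fake_objective(learning_rate, dropout, hidden_units):
    # Stand-in for training a model and measuring validation error.
    return (
        abs(learning_rate - 0.01) * 10
        + abs(dropout - 0.2)
        + abs(hidden_units - 256) / 512
    )


for trial in range(20):
    params = {
        "learning_rate": 10 ** random.uniform(-4, -1),
        "dropout": random.uniform(0.0, 0.5),
        "hidden_units": random.choice([64, 128, 256, 512]),
    }
    val_error = fake_objective(**params)

    # One Comet experiment per trial; the project's parallel coordinates chart
    # can then plot each hyperparameter against the objective across trials.
    experiment = Experiment(project_name="eta-hparam-search")  # assumes COMET_API_KEY is set
    experiment.log_parameters(params)
    experiment.log_metric("val_error", val_error)
    experiment.end()
```

In the real workflow the trial proposals would come from the Bayesian optimizer rather than random sampling, but each proposal would be logged in the same way.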

At that scale, what you need are tools that let you build ML infrastructure suited to your specific needs.