Project Hydrogen: A New Initiative Based on Apache Spark to Support AI and Data Science

An introduction to Project Hydrogen: how it can assist machine learning and AI frameworks on Apache Spark and what distinguishes it from other open source projects.



By Reynold Xin, Co-Founder, Databricks

Question #1 - What is Project Hydrogen?

Project Hydrogen aims to enable first-class support for all distributed machine learning frameworks on Apache Spark™ by substantially improving the performance and fault recovery of distributed deep learning and machine learning frameworks on Spark.

Question #2 - What makes Project Hydrogen different from other open source projects around machine learning and AI?

Most open source projects around machine learning and AI are focused on the algorithms and distributed training frameworks.

Project Hydrogen is a new SPIP (Spark Project Improvement Proposal) that introduces one of the largest changes to Spark's scheduler since the project's inception, when the scheduler was just 600 lines of code.

Big data and AI are joined at the hip: the best AI applications require massive amounts of constantly updated training data to build state-of-the-art models. AI has always been one of the most exciting applications of big data and Apache Spark. Driven in part by deep learning, more and more Spark users want to integrate Spark with distributed machine learning frameworks built for state-of-the-art training.

The problem is that big data frameworks like Spark and distributed deep learning frameworks don't play well together, due to the disparity between how big data jobs and deep learning jobs are executed. On Spark, for example, each job is divided into a number of individual tasks that are independent of each other. This pattern is called "embarrassingly parallel," and it is a massively scalable way of doing data processing, up to petabytes of data.
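
As a minimal sketch of that execution model (the dataset and transformation here are made up for illustration), each Spark task below processes its own partition with no communication between tasks:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("embarrassingly-parallel").getOrCreate()

    # Hypothetical example: the data is split into 8 partitions, and each
    # task transforms its own partition independently. Because no task
    # depends on another, a failed task can simply be retried in isolation.
    rdd = spark.sparkContext.parallelize(range(1000000), numSlices=8)
    doubled = rdd.map(lambda x: x * 2)  # runs as 8 independent tasks
    print(doubled.take(5))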

Deep learning frameworks, however, use a different execution scheme: they assume complete coordination and dependency among the tasks. That means the pattern is optimized for constant communication between tasks, rather than for the kind of large-scale data processing that scales to petabytes of data.

Project Hydrogen is positioned as a potential solution to this dilemma.

Question #3 - What will this project deliver to contributors and to users?

Project Hydrogen introduces a new scheduling primitive for Spark - Gang Scheduling.

In this mode, the scheduler works "all or nothing": either all of the tasks are scheduled in one shot, or none of the tasks are scheduled at all. This reconciles the fundamental incompatibility between how Spark works and what distributed ML frameworks need.

Users can now use a simple API to introduce a barrier. The barrier indicates to Spark whether it should use the embarrassingly parallel mode or the gang scheduling mode at each stage of the machine learning pipeline.

For example, the new gang scheduling mode can be used for model training with a distributed training framework, while the embarrassingly parallel mode can be used beforehand for data preparation, and again for model inference once the model has been trained.
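
As a rough sketch of how such a pipeline could look in PySpark (the API isn't final, so the barrier-mode calls below are modeled on the barrier execution mode proposed in the SPIP, and raw_rdd, parse_record, is_valid, and run_distributed_training are hypothetical placeholders):

    from pyspark import BarrierTaskContext

    def train_partition(iterator):
        # Gang-scheduled stage: Spark launches all of these tasks together,
        # or none of them at all.
        context = BarrierTaskContext.get()
        workers = [info.address for info in context.getTaskInfos()]
        data = list(iterator)
        context.barrier()  # wait until every task in the stage reaches this point
        # Hypothetical helper that hands the partition and the peer list to a
        # distributed training framework such as Horovod.
        yield run_distributed_training(data, workers)

    # Data prep runs as ordinary, embarrassingly parallel Spark tasks ...
    prepped = raw_rdd.map(parse_record).filter(is_valid)
    # ... while training runs as a single gang-scheduled barrier stage.
    updates = prepped.barrier().mapPartitions(train_partition).collect()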

Question #4 - How do you see Project Hydrogen supporting the wider need for analytics and machine learning?

The goal of Project Hydrogen is really to embrace all the distributed machine learning frameworks as first-class citizens on Spark. It truly unifies data processing with machine learning, and specifically with distributed training, on Spark.

We want to make every other framework as easy to run directly on Spark as MLlib is, whether it is TensorFlow, Horovod, or a future popular ML framework. This significantly expands the ecosystem of ML frameworks that can be used effectively on Spark for deep learning applications.

The new API isn’t final, but it’s expected to be added to the core Apache Spark project soon.

Question #5 - How well do you think developers are bringing together different sets of data and open source projects? Are we closer to a unified theory of analytics, or do we still have a lot of work to do around making AI fit for purpose?

Artificial Intelligence (AI) has massive potential to drive disruptive innovations affecting most enterprises on the planet. However, most enterprises are struggling to succeed with AI. Why is that? Simply put, AI and data are siloed in different systems and different organizations.

Apache Spark was the first unified analytics engine - it kind of sparked this revolution because it's the only engine out there that actually combines data processing and machine learning. With Project Hydrogen, we are expanding Spark's built-in optimizations well beyond MLlib, so that developers can benefit from a unified approach to data processing and machine learning, using any ML framework on Spark.

The proliferation of ML frameworks also has downstream impacts when it comes to productionizing AI applications, such as sharing and tracking experiments and pushing models into production. That's why we just introduced MLflow, a new cross-cloud open source framework designed to simplify the end-to-end machine learning lifecycle.

Question #6 - How long do you think it will take for AI to become “business as usual”?

As William Gibson said: “The future is already here — it's just not very evenly distributed.”

Very few companies have been successful at doing AI at scale and company-wide. This is what we call the 1% problem. The majority of companies - the 99% - continue to struggle due to disparate systems and technologies, and organizational divides between data engineers and data scientists. To achieve AI, organizations need to unify data and AI.

Apache Spark was the first step toward unifying data and AI, but that alone is not enough - organizations still need to manage a lot of infrastructure. To eliminate obstacles to AI, companies have to leverage unified analytics. Unified Analytics brings data processing together with AI technologies, making AI much more achievable for enterprise organizations and enabling them to accelerate their AI initiatives. It makes it easier for enterprises to build data pipelines across siloed data storage systems and to prepare labelled datasets for model building, which allows organizations to do AI on their existing data and to iterate on massive data sets.

A Unified Analytics Platform provides collaboration capabilities for data scientists and data engineers to work effectively across the entire development-to-production lifecycle. The organizations that succeed in unifying their domain data at scale and unifying that data with the best AI technologies will be the ones that succeed with AI.

For more information, watch Reynold Xin's presentation on Project Hydrogen:

https://databricks.com/session/databricks-keynote-2

Bio: Reynold Xin is a computer scientist and engineer specializing in big data, distributed systems, and cloud computing. He is a co-founder and Chief Architect of Databricks.
