Kedro-Airflow: Orchestrating Kedro Pipelines with Airflow
The Kedro team and Astronomer have released Kedro-Airflow 0.4.0 to help you develop modular, maintainable & reproducible code with orchestration superpowers!
Kedro is an open-source Python framework for creating reproducible, maintainable, and modular data science code. Its focus is on authoring code and not orchestrating, scheduling and monitoring pipeline runs. We emphasise infrastructure independence, and this is crucial for consultancies such as QuantumBlack, where Kedro was born.
Kedro is not an orchestrator. It aims to stay very lean and unopinionated about where and how your pipeline will be run.
You can deploy your Kedro projects virtually anywhere with minimal effort as long as you can run Python. Our users have the freedom to choose their deployment targets. The future of deploying Kedro pipelines is in designing a deployment process with a great developer experience in mind.
One of the benefits of being an open source community is that we can explore partnerships with other, like-minded frameworks and technologies. We are particularly excited to work with the Astronomer team, which helps organisations adopt Apache Airflow, the leading open-source data workflow orchestration platform.
Workflows in Airflow are modelled and organised as DAGs, making it a suitable engine to orchestrate and execute a pipeline authored with Kedro. To keep the workflow seamless, we are pleased to unveil the latest version of the Kedro-Airflow plugin, which simplifies deployment of a Kedro project on Airflow.
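To make the DAG idea concrete, here is a minimal sketch in plain Python of how a pipeline's nodes can be turned into a dependency graph and an execution order. It uses neither the Kedro nor the Airflow API: the node names, dataset names, and helper functions are hypothetical and chosen only for illustration. The assumption is simply that each node declares the datasets it reads and writes, so a node depends on whichever nodes produce its inputs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each node lists the datasets it reads and writes.
NODES = {
    "split_data": {"inputs": ["raw_data"], "outputs": ["train", "test"]},
    "train_model": {"inputs": ["train"], "outputs": ["model"]},
    "evaluate_model": {"inputs": ["model", "test"], "outputs": ["metrics"]},
}


def build_dependencies(nodes):
    """Map each node to the set of nodes that produce its inputs."""
    # Which node writes each dataset?
    producer = {
        dataset: name
        for name, spec in nodes.items()
        for dataset in spec["outputs"]
    }
    # Inputs with no producer (such as raw_data) come from outside the
    # pipeline, so they create no dependency edge.
    return {
        name: {producer[d] for d in spec["inputs"] if d in producer}
        for name, spec in nodes.items()
    }


def execution_order(nodes):
    """Return one valid order in which an orchestrator could run the nodes."""
    return list(TopologicalSorter(build_dependencies(nodes)).static_order())


DEPENDENCIES = build_dependencies(NODES)
ORDER = execution_order(NODES)
```

In this sketch each node would correspond to one task in the orchestrator, and each entry in `DEPENDENCIES` to an upstream relationship between tasks.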
Our work with Astronomer provides a simple way for our users to deploy their pipelines. We would like to continue our work and make the process even smoother and eventually achieve a “one-click-deployment” workflow for Kedro pipelines on Airflow.
We have edited the conversation for length and clarity.
Pete DeJoy, you’re a Product Manager at Astronomer. Tell us a little about yourself!
I’m one of the founding team members at Astronomer, where we’ve built a company around the open source orchestration framework Apache Airflow. I’ve done many things here through the years, but have spent most of my energy working on our product as it has developed from an idea on a whiteboard to a high-scale system supporting thousands of users.
What prompted the creation of Airflow 2.0? And what does the success of this version of Airflow look like?
Airflow has evolved quite a lot since its inception in 2014; it now has over 20,000 stars on GitHub, 600k downloads/month, and tens of thousands of users worldwide. Airflow 1.x solved a lot of first-order problems for developers, but an uptick in enterprise requirements followed Airflow’s widespread adoption, along with increased pressure to improve the developer experience. Airflow 2.0 meets the needs of users with a handful of much-anticipated features. These include:
- A highly available, horizontally scalable scheduler
- An upgraded, stable REST API
- Decoupled workflow integrations (called “providers” in Airflow), shipped as independently versioned and maintained Python packages, and much more
We see 2.0 as a major milestone for the project; not only does it significantly improve the scalability of Airflow, but also it sets a foundation upon which we can continuously build new features.
How did you find out about Kedro? When did you realise it was compatible with Airflow for users?
I had chatted with a few data scientists who were using Kedro to author their pipelines and looking for a good way to deploy those pipelines to Airflow. Kedro does an outstanding job of allowing data scientists to apply good software engineering principles to their code and make it modular, but Kedro pipelines need a separate scheduling and execution environment to run at scale. Given this need, there was a natural bond between Kedro pipelines and Airflow: we wanted to do everything we could to build a great developer experience at the intersection of the two tools.
Where do you think Kedro-Airflow could go, in terms of future development?
Airflow 2.0 extends and upgrades the Airflow REST API, allowing it to be robust in the coming years. As the API develops, there will be new opportunities for specific abstraction layers to assist with DAG authoring and deployment, leading to a richer plugin ecosystem. There will be extra opportunity to integrate the kedro-airflow package with the Airflow API for a great developer experience.
What is the future of Airflow?
As we look towards Airflow 3.0 and beyond, building upon developer love and trust is inevitable. But it won’t stop there. As data orchestration becomes critical to a growing number of business units, we want Airflow to become a medium for making data engineering more approachable. We seek to democratise access such that product owners and data scientists alike can leverage Airflow’s distributed execution and scheduling power without having to master Python or Kubernetes. Empowering users to author and deploy data pipelines from a framework of their choice will become increasingly important in that journey.
What is the future of workflow orchestration technologies?
Airflow’s inception kicked off a “data pipelines as code” movement that changed the way enterprises thought about workflow orchestration. For many years, job scheduling was handled by a combination of legacy drag-and-drop frameworks and complex networks of cron jobs. As we transitioned into the “big data” era and companies began building dedicated teams to operationalise their siloed data, the need for additional flexibility, control, and governance became apparent.
When Maxime Beauchemin and the folks at Airbnb built and open sourced Airflow with flexible, codified data pipelines as a first-class feature, they propelled code-driven orchestration into the spotlight. Airflow solved many first-order problems for data engineers, which explains its explosive adoption. But with that early adoption came some pitfalls; since Airflow is highly configurable by design, users began applying it to use cases it was not necessarily designed for. This imposed evolutionary stress on the project, pushing the community to add additional configuration options to “mould” Airflow to various use cases.
While the added configuration options helped Airflow extend to accommodate these additional use cases, they introduced a new class of user needs. Data platform owners and administrators now need a way to deliver standard patterns to their pipeline authors to abate business risk. Likewise, pipeline authors need additional guardrails to be sure they don’t “use Airflow wrong”. Finally, engineers with a Python background now need to learn how to operationalise big data infrastructure for stable and reliable orchestration at scale.
We see the future of workflow orchestration technology accommodating some of these categorical changes in the needs of the user. If the journey thus far has been one of “The Rise of the Data Engineer”, we see the future as “The Democratisation of Data Engineering”. All users, from the data scientist to the data platform owner, will have access to powerful, distributed, flexible data pipeline orchestration. They’ll benefit as it integrates with the authoring tools that they know and love, while providing guardrails that accommodate specific usage patterns and keep folks from straying off the happy path.
You can find out more about the Kedro-Airflow plugin in the Kedro documentation and check out the GitHub repository too. This article was edited by Jo Stichbury — Technical Writer and Yetunde Dada — Product Manager, with input from Ivan Danov (Tech Lead at Kedro) and Lim Hoang (Senior Software Engineer at Kedro).
Original. Reposted with permission.