A Beginner’s Guide to Data Engineering – Part I
Data Engineering: The Close Cousin of Data Science.
ETL: Extract, Transform, and Load
All of the examples we referenced above follow a common pattern known as ETL, which stands for Extract, Transform, and Load. These three conceptual steps are how most data pipelines are designed and structured. They serve as a blueprint for how raw data is transformed to analysis-ready data. To understand this flow more concretely, I found the following picture from Robinhood’s engineering blog very useful:
Source: Vineet Goel’s “Why Robinhood uses Airflow?” Medium Post
- Extract: this is the step where sensors wait for upstream data sources to land (e.g. a upstream source could be machine or user-generated logs, relational database copy, external dataset … etc). Upon available, we transport the data from their source locations to further transformations.
- Transform: This is the heart of any ETL job, where we apply business logic and perform actions such as filtering, grouping, and aggregation to translate raw data into analysis-ready datasets. This step requires a great deal of business understanding and domain knowledge.
- Load: Finally, we load the processed data and transport them to a final destination. Often, this dataset can be either consumed directly by end-users or it can be treated as yet another upstream dependency to another ETL job, forming the so called data lineage.
While all ETL jobs follow this common pattern, the actual jobs themselves can be very different in usage, utility, and complexity. Here is a very simple toy example of an Airflow job:
The example above simply prints the date in bash every day after waiting for a second to pass after the execution date is reached, but real-life ETL jobs can be much more complex. For example, we could have an ETL job that extracts a series of CRUD operations from a production database and derive business events such as a user deactivation. Another ETL can take in some experiment configuration file, compute the relevant metrics for that experiment, and finally output p-values and confidence intervals in a UI to inform us whether the product change is preventing from user churn. Yet another example is a batch ETL job that computes features for a machine learning model on a daily basis to predict whether a user will churn in the next few days. The possibilities are endless!
Choosing ETL Frameworks
When it comes to building ETLs, different companies might adopt different best practices. Over the years, many companies made great strides in identifying common problems in building ETLs and built frameworks to address these problems more elegantly.
In the world of batch data processing, there are a few obvious open-sourced contenders at play. To name a few: Linkedin open sourced Azkaban to make managing Hadoop job dependencies easier. Spotify open sourced Python-based framework Luigi in 2014, Pinterest similarly open sourced Pinball and Airbnb open sourced Airflow (also Python-based) in 2015.
Different frameworks have different strengths and weaknesses, and many experts have made comparisons between them extensively (see here and here). Regardless of the framework that you choose to adopt, a few features are important to consider:
Source: Marton Trencseni’s comparison between Luigi, Airflow, and Pinball
- Configuration: ETLs are naturally complex, and we need to be able to succinctly describe the data flow of a data pipeline. As a result, it is important to evaluate how ETLs are authored. Is it configured on a UI, a domain specific language, or code? Nowadays, the concept of configuration as code is gaining popularity, because it allows users to expressively build pipelines programmatically that are customizable.
- UI, Monitoring, Alerts: Long running batch processes inevitably can run into errors (e.g. cluster failures) even when the job itself does not have bugs. As a result, monitoring and alerting are crucial in tracking the progress of long running processes. How well does a framework provide visual information for job progress? Does it surface alerts or warnings in a timely and accurate manner?
- Backfilling: Once a data pipeline built, we often need to go back in time and re-process the historical data. Ideally, we do not want to build two separate jobs, one for backfilling historical data and another for computing current or future metrics. How easy does a framework support backfilling? Can it do so in a way that is standardized, efficient, and scalable? All these are important questions to consider.
Naturally, as someone who works at Airbnb, I really enjoy using Airflow and I really appreciate how it elegantly addresses a lot of the common problems that I encountered during data engineering work. Given that there are already 120+ companies officially using Airflow as their de-facto ETL orchestration engine, I might even go as far as arguing that Airflow could be the standard for batch processing for the new generation start-ups to come.
Two Paradigms: SQL- v.s. JVM-Centric ETL
As we can see from the above, different companies might pick drastically different tools and frameworks for building ETLs, and it can be a very confusing to decide which tools to invest in as a new data scientist.
This was certainly the case for me: At Washington Post Labs, ETLs were mostly scheduled primitively in Cron and jobs are organized as Vertica scripts. At Twitter, ETL jobs were built in Pig whereas nowadays they are all written in Scalding, scheduled by Twitter’s own orchestration engine. At Airbnb, data pipelines are mostly written in Hive using Airflow.
During my first few years working as a data scientist, I pretty much followed what my organizations picked and take them as given. It was not until much later when I came across Josh Will’s talk did I realize there are typically two ETL paradigms, and I actually think data scientists should think very hard about which paradigm they prefer before joining a company.
Video Source: Josh Wills’ Keynote @ DataEngConf SF 2016
- JVM-centric ETL is typically built in a JVM-based language (like Java or Scala). Engineering data pipelines in these JVM languages often involves thinking data transformation in a more imperative manner, e.g. in terms of key-value pairs. Writing User Defined Functions (UDFs) are less painful because one does not need to write them in a different language, and testing jobs can be easier for the same reason. This paradigm is quite popular among engineers.
- SQL-centric ETL is typically built in languages like SQL, Presto, or Hive. ETL jobs are often defined in a declarative way, and almost everything centers around SQL and tables. Writing UDFs sometimes is troublesome because one has to write it in a different language (e.g. Java or Python), and testing can be a lot more challenging due to this. This paradigm is popular among data scientists.
As a data scientist who has built ETL pipelines under both paradigms, I naturally prefer SQL-centric ETLs. In fact, I would even argue that as a new data scientist, you can learn much more quickly about data engineering when operating in the SQL paradigm. Why? Because learning SQL is much easier than learning Java or Scala (unless you are already familiar with them), and you can focus your energy on learning DE best practices than learning new concepts in a new domain on top of a new language.
Wrapping Up Beginner’s Guide — Part I
In this post, we learned that analytics are built upon layers, and foundational work such as building data warehousing is an essential prerequisite for scaling a growing organization. We briefly discussed different frameworks and paradigms for building ETLs, but there are so much more to learn and discuss.
In the second post of this series, I will dive into the specifics and demonstrate how to build a Hive batch job in Airflow. Specifically, we will learn the basic anatomy of an Airflow job, see extract, transform, and load in actions via constructs such as partition sensors and operators. We will learn how to use data modeling techniques such as star schema to design tables. Finally, I will highlight some ETL best practices that are extremely useful.
If you found this post useful, stay tuned for Part II and Part III.
Bio: Robert Chang is a data scientist at Airbnb working on Machine Learning, Machine Learning Infrastructure, and Host Growth. Prior to Airbnb, he was a data scientist at Twitter and have a degree in Statistics from Stanford University and a degree of Operations Research from UC Berkeley.
Original. Reposted with permission.
- Why the Data Scientist and Data Engineer Need to Understand Virtualization in the Cloud
- How to Build a Data Science Pipeline
- Managing Machine Learning Workflows with Scikit-learn Pipelines Part 1: A Gentle Introduction