A Beginner’s Guide to Data Engineering – Part I
Data Engineering: The Close Cousin of Data Science.
By Robert Chang, Airbnb.
Image credit: A beautiful former slaughterhouse / warehouse at Matadero Madrid, architected by Iñaqui Carnicero
The more experienced I become as a data scientist, the more convinced I am that data engineering is one of the most critical and foundational skills in any data scientist’s toolkit. I find this to be true for both evaluating project or job opportunities and scaling one’s work on the job.
In an earlier post, I pointed out that a data scientist’s capability to convert data into value is largely correlated with the stage of her company’s data infrastructure as well as how mature its data warehouse is. This means that a data scientist should know enough about data engineering to carefully evaluate how her skills are aligned with the stage and need of the company. Furthermore, many of the great data scientists I know are not only strong in data science but are also strategic in leveraging data engineering as an adjacent discipline to take on larger and more ambitious projects that are otherwise not reachable.
Despite its importance, education in data engineering has been limited. Given its nascency, in many ways the only feasible path to get training in data engineering is to learn on the job, and it can sometimes be too late. I am very fortunate to have worked with data engineers who patiently taught me this subject, but not everyone has the same opportunity. As a result, I have written up this beginner’s guide to summarize what I learned to help bridge the gap.
Organization of This Beginner’s Guide
The scope of my discussion will not be exhaustive in any way, and is designed heavily around Airflow, batch data processing, and SQL-like languages. That said, this focus should not prevent the reader from getting a basic understanding of data engineering and hopefully it will pique your interest to learn more about this fast-growing, emerging field.
- Part I (this post) is designed to be a high-level introductory post. Using a combination of personal anecdotes and expert insights, I will contextualize what data engineering is, why it is challenging, and how it can help you or your organization to scale. The primary audience of this post is for aspiring data scientists who need to learn the basics to evaluate job opportunities or early-stage founders who are about to build the company’s first data team.
- Part II is more technical in nature. This post will focus on Airflow, Airbnb’s open-sourced tool for programmatically author, schedule, and monitor workflows. Specifically, I will demonstrate how to write a Hive batch job in Airflow, how to design table schemas using techniques such as star schema, and finally highlight some best practices around building ETLs. This post is suitable for starting data scientists and starting data engineers who are trying to hone their data engineering skills.
- Part III will be the final post of this series, where I will describe advanced data engineering patterns, higher level abstractions, and extended frameworks that would make building ETLs a lot easier and more efficient. A lot of these patterns are taught to me by Airbnb’s experienced data engineers who learned the hard way. I find these insights particularly useful for seasoned data scientists and seasoned data engineers who are trying to further optimize their workflows.
Given that I am now a huge proponent for learning data engineering as an adjacent discipline, you might find it surprising that I had the completely opposite opinion a few years ago — I struggled a lot with data engineering during my first job, both motivationally and emotionally.
My First Industry Job out of Graduate School
Right after graduate school, I was hired as the first data scientist at a small startup affiliated with the Washington Post. With endless aspirations, I was convinced that I will be given analysis-ready data to tackle the most pressing business problems using the most sophisticated techniques.
Shortly after I started my job, I learned that my primary responsibility was not quite as glamorous as I imagined. Instead, my job was much more foundational — to maintain critical pipelines to track how many users visited our site, how much time each reader spent reading contents, and how often people liked or retweeted articles. It was certainly important work, as we delivered readership insights to our affiliated publishers in exchange for high-quality contents for free.
Secretly though, I always hope by completing my work at hand, I will be able to move on to building fancy data products next, like the ones described here. After all, that is what a data scientist is supposed to do, as I told myself. Months later, the opportunity never came, and I left the company in despair. Unfortunately, my personal anecdote might not sound all that unfamiliar to early stage startups (demand) or new data scientists (supply) who are both inexperienced in this new labor market.
Image credit: Me building ETL pipelines diligently (guy in blue in the middle)
Reflecting on this experience, I realized that my frustration was rooted in my very little understanding of how real life data projects actually work. I was thrown into the wild west of raw data, far away from the comfortable land of pre-processed, tidy .csv files, and I felt unprepared and uncomfortable working in an environment where this is the norm.
Many data scientists experienced a similar journey early on in their careers, and the best ones understood quickly this reality and the challenges associated with it. I myself also adapted to this new reality, albeit slowly and gradually. Over time, I discovered the concept of instrumentation, hustled with machine-generated logs, parsed many URLs and timestamps, and most importantly, learned SQL (Yes, in case you were wondering, my only exposure to SQL prior to my first job was Jennifer Widom’s awesome MOOC here).
Nowadays, I understand counting carefully and intelligently is what analytics is largely about, and this type of foundational work is especially important when we live in a world filled with constant buzzwords and hypes.
The Hierarchy of Analytics
Among the many advocates who pointed out the discrepancy between the grinding aspect of data science and the rosier depictions that media sometimes portrayed, I especially enjoyed Monica Rogati’s call out, in which she warned against companies who are eager to adopt AI:
Think of Artificial Intelligence as the top of a pyramid of needs. Yes, self-actualization (AI) is great, but you first need food, water, and shelter (data literacy, collection, and infrastructure).
This framework puts things into perspective. Before a company can optimize the business more efficiently or build data products more intelligently, layers of foundational work need to be built first. This process is analogous to the journey that a man must take care of survival necessities like food or water before he can eventually self-actualize. This rule implies that companies should hire data talents according to the order of needs. One of the recipes for disaster is for startups to hire its first data contributor as someone who only specialized in modeling but have little or no experience in building the foundational layers that is the pre-requisite of everything else (I called this “The Hiring Out-of-Order Problem”).
Source: Monica Rogati’s fantastic Medium post “The AI Hierarchy of Needs”
Unfortunately, many companies do not realize that most of our existing data science training programs, academic or professional, tend to focus on the top of the pyramid knowledge. Even for modern courses that encourage students to scrape, prepare, or access raw data through public APIs, most of them do not teach students how to properly design table schemas or build data pipelines. As a result, some of the critical elements of real-life data science projects were lost in translation.
Luckily, just like how software engineering as a profession distinguishes front-end engineering, back-end engineering, and site reliability engineering, I predict that our field will be the same as it becomes more mature. The composition of talent will become more specialized over time, and those who have the skill and experience to build the foundations for data-intensive applications will be on the rise.
What does this future landscape mean for data scientists? I would not go as far as arguing that every data scientist needs to become an expert in data engineering. However, I do think that every data scientist should know enough of the basics to evaluate project and job opportunities in order to maximize talent-problem fit. If you find that many of the problems that you are interested in solving require more data engineering skills, then it is never too late then to invest more in learning data engineering. This is in fact the approach that I have taken at Airbnb.
Building Data Foundations & Warehouses
Regardless of your purpose or interest level in learning data engineering, it is important to know exactly what data engineering is about. Maxime Beauchemin, the original author of Airflow, characterized data engineering in his fantastic post The Rise of Data Engineer:
Data engineering field could be thought of as a superset of business intelligenceand data warehousing that brings more elements from software engineering. This discipline also integrates specialization around the operation of so called “big data” distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and in computation at scale.
Among the many valuable things that data engineers do, one of their highly sought-after skills is the ability to design, build, and maintain data warehouses. Just like a retail warehouse is where consumable goods are packaged and sold, a data warehouse is a place where raw data is transformed and stored in query-able forms.
Source: Jeff Hammerbacher’s slide from UC Berkeley CS 194 course
In many ways, data warehouses are both the engine and the fuels that enable higher level analytics, be it business intelligence, online experimentation, or machine learning. Below are a few specific examples that highlight the role of data warehousing for different companies in various stages:
- Building Analytics at 500px: In this post, Samson Hu explains the challenges 500px faced when it tried to grow beyond product-market fit. He describes in details the process how he built the data warehouse from the ground and up.
- Scaling Airbnb’s Experimentation Platform: Jonathon Parksdemonstrates how Airbnb’s data engineering team built specialized data pipelines to power internal tools like the experimentation reporting framework. This work is critical in shaping and scaling Airbnb’s product development culture.
- Using ML to Predict the Value of Homes on Airbnb: Written by myself, I explain why building batch training, offline scoring machine learning models requires a lot of upfront data engineering work. Notably, many tasks associated with feature engineering, building, and backfilling training data resemble data engineering works.
Without these foundational warehouses, every activity related to data science becomes either too expensive or not scalable. For example, without a properly designed business intelligence warehouse, data scientists might report different results for the same basic question asked at best; At worst, they could inadvertently query straight from the production database, causing delays or outages. Similarly, without an experimentation reporting pipeline, conducting experiment deep dives can be extremely manual and repetitive. Finally, without data infrastructure to support label collection or feature computation, building training data can be extremely time consuming.