Data Engineering Landscape in the AI-Driven World

Generative AI has only begun to capture the imagination of data engineers, so its impact so far is a fraction of what it will be a year or two from now.

Image from Bing Image Creator


One of the biggest impacts has been the wider adoption of “prompt engineering,” essentially the skill of prompting AI to assist in coding-related tasks. I’ve seen Andrej Karpathy joke on Twitter, “The hottest new programming language is English.”

Generative AI has also kick-started a gold rush, with dozens of early-stage startups racing to develop an AI that can query the data warehouse and return an intelligent answer to the ad hoc questions data consumers ask in natural language. "This would radically simplify the self-service analytics process and further democratize data, but it will be difficult to solve beyond basic 'metric fetching,' given the complexity of data pipelines for more advanced analytics," commented Monte Carlo CTO Shane Murray.

"When I evaluate data engineering candidates for a role, I’m looking for their track record of making an impact and hitting the ground running," Murray mentioned. That track record could come from their primary occupation or from contributing to open-source projects. In either case, what matters is not that you were there but the impact you made.

If you don’t like change, data engineering is not for you. "Little in this space has escaped reinvention," Murray remarked. It’s clear that the process of building and maintaining data pipelines will become much easier, as will the ability for data consumers to access and manipulate data.

However, what hasn’t changed is the data lifecycle. "It is emitted, it is transformed for a use, and then it is archived," Murray pointed out. "While the underlying infrastructure may change and automation will shift time and attention to the right or left, human data engineers will continue to play a crucial role in extracting value from data, whether architecting scalable and reliable data systems or as specialist engineers within a chosen domain of data."


Data Platform Teams Offer Opportunities


I've found that data platform teams, which are now quite common in data organizations of various sizes, are great places for data engineers to cut their teeth.

Murray further explained, "Here, you can specialize in a specific domain of data that is central to the business operations, such as customer data or product / behavioral data. In this role, you should aim to gain an understanding of the end-to-end problem—from source to the analytical use case—as it'll make you an asset to the team and the business."

"Alternatively, one might specialize in a specific capability of the data platform, such as reliability engineering, business intelligence, experimentation, or feature engineering," Murray specified. "These types of roles typically give a broader, but shallower, understanding of each business use case, but may be an easier jump from a software engineering role into data."

"Another path I'm seeing more often for data engineers is the data product manager role," said Murray. Engineers who are growing their data engineering skills but find themselves more drawn to talking to end users, articulating the problems to be solved, and distilling the vision and roadmap for the team may find product management a natural next step.

Data teams are beginning to invest in this skillset as we move to treat "data as a product," ranging from critical dashboards and decision-support tools to applications of machine learning that are central to business operations or customer experience. "Great data product managers will have an understanding of how to build a reliable and scalable data product, but also apply product thinking to drive the vision, roadmap, and adoption," Murray affirmed.


Modern Data Stack


The modern data stack is quickly becoming the dominant tech stack in the data engineering field, according to Murray. This stack has a cloud-based data warehouse or lake at the center and complementary cloud-based solutions for data ingestion, transformation, orchestration, visualization, and data observability.

It’s advantageous because it has a quick time to value, is fundamentally more user-friendly than the prior generation of tools, is extensible to a wide range of analytical and machine learning use cases, and can scale to the size and complexity of data managed in today’s world.

"The exact solutions will vary depending on organizational size and specific data use cases, but generally the most common modern data stack is Snowflake, Fivetran, dbt, Airflow, Looker, and Monte Carlo. There may also be Atlan and Immuta to address data catalog and access, respectively," Murray explained. "Larger organizations or those with more machine learning use cases will typically have data stacks that more heavily utilize Databricks and Spark."
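To make the layering concrete, here is a minimal sketch of how those stages hand data to one another. It uses only the Python standard library and calls none of the vendor tools named above. Every function name, field name, and record in it is hypothetical and stands in for a real ingestion, transformation, observability, or reporting tool.

```python
# Illustrative only: each function is a stand-in for one layer of the stack.
# All names and sample records are assumptions made for this sketch.

Rows = list[dict]


def ingest() -> Rows:
    """Ingestion layer: pretend these raw records were loaded from a source system."""
    return [
        {"order_id": 1, "amount_cents": 1250, "country": "us"},
        {"order_id": 2, "amount_cents": 400, "country": "de"},
        {"order_id": 3, "amount_cents": None, "country": "us"},  # incomplete record
    ]


def transform(rows: Rows) -> Rows:
    """Transformation layer: drop incomplete rows and normalize units and casing."""
    return [
        {
            "order_id": r["order_id"],
            "amount": r["amount_cents"] / 100,
            "country": r["country"].upper(),
        }
        for r in rows
        if r["amount_cents"] is not None
    ]


def check_quality(rows: Rows, min_rows: int = 1) -> list[str]:
    """Observability layer: return a list of detected issues (empty means healthy)."""
    issues = []
    if len(rows) < min_rows:
        issues.append(f"expected at least {min_rows} rows, got {len(rows)}")
    if any(r["amount"] < 0 for r in rows):
        issues.append("negative order amount found")
    return issues


def revenue_by_country(rows: Rows) -> dict[str, float]:
    """Reporting layer: the aggregate a dashboard might display."""
    totals: dict[str, float] = {}
    for r in rows:
        totals[r["country"]] = totals.get(r["country"], 0.0) + r["amount"]
    return totals


def run_pipeline() -> dict[str, float]:
    """Orchestration layer: run the stages in order and stop on quality issues."""
    clean = transform(ingest())
    issues = check_quality(clean)
    if issues:
        raise RuntimeError("; ".join(issues))
    return revenue_by_country(clean)


if __name__ == "__main__":
    print(run_pipeline())
```

In a real stack each function would be a separate managed service, and the orchestrator would schedule and retry the steps rather than call them inline.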


A Potential Disruption


"The modern data stack era kicked off by Snowflake and Databricks hasn’t even reached a point of consolidation yet, and already we are seeing ideas that may further disrupt the status quo of modern data pipelines," Murray reflected. "On the near horizon are the more widespread adoption of streaming data, zero-ETL, data sharing, and a unified metrics layer." Zero-ETL and data sharing are particularly interesting as they have the potential to simplify the complexity of modern data pipelines, which have multiple points of integration and thus failure.
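A rough way to see why each integration point matters: if every hop between tools succeeds independently with the same probability, the chance that a run gets through all of them is that probability raised to the number of hops. The sketch below works through that arithmetic. Both the independence assumption and the 99 percent figure are illustrative simplifications, not measurements from any real pipeline.

```python
def end_to_end_success(per_hop_success: float, hops: int) -> float:
    """Probability that every hop succeeds, assuming hops fail independently."""
    if not 0.0 <= per_hop_success <= 1.0:
        raise ValueError("per_hop_success must be between 0 and 1")
    if hops < 0:
        raise ValueError("hops must be non-negative")
    return per_hop_success ** hops


if __name__ == "__main__":
    # Hypothetical reliability of 99% per integration point.
    for hops in (1, 3, 6):
        print(f"{hops} hop(s): {end_to_end_success(0.99, hops):.4f}")
```

Under these assumptions, removing hops, which is what zero-ETL and data sharing aim to do, raises end-to-end reliability without making any individual tool more reliable.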


Tech Job Landscape


The tech industry job market is projected to experience a significant shift in 2023, driven by the growth of big data analytics. According to Dice Media's analysis, this shift will occur as the global big data analytics market is expected to grow at an impressive rate of 30.7 percent, reaching a projected value of $346.24 billion by 2030. This growth is anticipated to create numerous opportunities for skilled professionals in the field, such as data engineers, business analysts, and data analysts.

"I strongly believe that data engineering jobs will not be solely about writing code, but rather, they will involve more communication with business stakeholders and designing end-to-end systems," commented Deexith Reddy, an experienced data engineer and open-source enthusiast. "Therefore, to ensure job security, one must focus on both the breadth of data analytics and the depth of data engineering."

Generative AI is likely to make the data engineering field more competitive. Even so, Reddy emphasized during our call that, whatever technological advances and AI breakthroughs arrive, contributing to open-source projects will remain a reliable way to build a strong portfolio.

Reddy shed further light on the critical role data engineers play in enhancing an organization's capabilities by utilizing open-source technologies. For instance, open-source technologies like Apache Spark, Apache Kafka, and Elasticsearch have been widely adopted by data engineers, as has Kubernetes by data scientists. These open-source technologies help meet the computational requirements of deep learning and machine learning workloads, as well as MLOps workflows.

Companies often identify and recruit top contributors from open-source projects like these, fostering an environment that values and encourages open-source contributions. This approach helps retain skilled data engineers and allows organizations to benefit from their expertise.

Saqib Jan is a writer and technology analyst with a passion for data science, automation, and cloud computing.