Data Engineering Technologies 2021

Emerging technologies supporting the field of data engineering are growing at a rapid clip. This curated list includes the most important offerings available in 2021.

By Tech Ninja, @techninjathere, OpenSource, Analytics & Cloud enthusiast.

Top Data Engineering Technologies
A partial list of top engineering technologies, image created by KDnuggets.

Complete curated list of emerging technologies in Data Engineering

  • Abacus AI, enterprise AI with AutoML, similar space to DataRobot.
  • Algorithmia, enterprise MLOps.
  • Amundsen, an open-sourced data discovery and metadata engine.
  • Anodot, monitors all your data in real time to detect incidents quickly.
  • Apache Arrow, essential because it is a language-independent (non-JVM), in-memory columnar format that enables vectorized processing.
  • Apache Calcite, framework for building SQL databases and data management systems without owning data. Hive, Flink, and others use Calcite.
  • Apache Hop, facilitates all aspects of data and metadata orchestration.
  • Apache Iceberg is an open table format for massive analytic datasets.
  • Apache Pinot, real-time distributed OLAP datastore. Its growth is impressive, and it sits in a similar space to Apache Druid, though the two are not identical.
  • Apache Superset, open source BI with many connectors available.
  • Beam, for implementing batch and streaming data processing jobs that run on any execution engine.
  • Cnvrg, enterprise MLOps.
  • Confluent, Apache Kafka and its surrounding ecosystem.
  • Dagster, a data orchestrator for machine learning. It is very programming-based and in a similar space to Airflow, but emphasizes state flow.
  • Dask, parallel Data Science purely in Python.
  • DataRobot, solid ML platform with a strong focus on enterprise MLOps.
  • Databricks, with its new SQL Analytics and the lakehouse paper; more open-source releases are expected.
  • DataFrame Whale is a straightforward data discovery tool.
  • Dataiku, enterprise AI/MLOps platform.
  • Delta Lake, ACID on Apache Spark.
  • DVC, open-source version control system for ML projects and well suited to MLOps.
  • Feast, open-source feature store, now with Tecton.
  • Fiddler, enterprise explainable AI.
  • Fivetran, data integration pipeline.
  • Getdbt (dbt) is hitting the sweet spot of Apache Spark by bringing a simplified SQL-based pipeline.
  • Great Expectations, a Data Science testing framework that is already amazing!
  • Hopsworks, open-source MLOps feature store.
  • Hudi brings transactions, record-level updates/deletes, and change streams to data lakes.
  • Koalas, Pandas on Apache Spark.
  • The Kubeflow project is dedicated to making deployments of machine learning workflows on Kubernetes simple, portable, and scalable.
  • lakeFS enables you to manage your data lake the way you manage your code. Run parallel pipelines for experimentation and CI/CD for your data.
  • maiot-ZenML, an open-source MLOps framework with a bit of everything.
  • Marquez, open-source metadata service with a fantastic UI.
  • Metabase, an open-source BI with excellent visualization.
  • MLflow, an open-source platform for the machine learning lifecycle.
  • Montecarlodata (Monte Carlo), spanning data governance, data discovery, and data observability.
  • Nextflow, data-driven computational pipelines designed for bioinformatics, but applicable beyond it.
  • Pachyderm, MLOps platform, in a similar space to MLflow.
  • Papermill, for parameterizing notebooks, which makes Data Science more exciting and more accessible.
  • Prefect, designed to make workflow management easier and better compared to Apache Airflow.
  • RAPIDS, Data Science on GPUs.
  • Ray, distributed machine learning and now streaming.
  • Starburst, unlock the value of distributed data by making it fast and easy to access.
  • Tecton, enterprise feature store.
  • Trino, formerly PrestoSQL. Now clearly separated from Presto, Trino can focus heavily on features.
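
Several entries above (Apache Arrow in particular) lean on the idea of an in-memory columnar layout. As a rough illustration only, the sketch below uses just the Python standard library to contrast row records with per-column typed arrays. It does not use Arrow's API; the sample data, the `to_columns` helper, and the field names are all hypothetical.

```python
from array import array

# Row-oriented layout: one dict per record (hypothetical sample data).
rows = [
    {"user_id": 1, "amount": 10.0},
    {"user_id": 2, "amount": 32.5},
    {"user_id": 3, "amount": 7.5},
]


def to_columns(records):
    """Pivot row records into one contiguous typed array per column."""
    return {
        "user_id": array("q", (r["user_id"] for r in records)),  # signed 64-bit ints
        "amount": array("d", (r["amount"] for r in records)),    # 64-bit floats
    }


columns = to_columns(rows)

# An aggregate over one column scans a single contiguous buffer
# and never touches the other fields of each record.
total_amount = sum(columns["amount"])
print(total_amount)
```

Real columnar engines add much more on top of this (null handling, zero-copy sharing across languages, SIMD kernels), so treat this as a picture of the layout rather than of any particular library.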


Reordered alphabetically, based on this original. Reposted with permission.