KDnuggets Top Blog Winner

7 Essential Cheat Sheets for Data Engineering

Learn about the data life cycle, PySpark, dbt, Kafka, BigQuery, Airflow, and Docker.



Image by Author

 

1. GCP Data Engineering Cheat Sheet

 

The Data Engineering with GCP cheat sheet covers the complete data life cycle. It is aimed at experienced practitioners who want to review the essential concepts and tools of the data engineering ecosystem.

 

Image from Cheat Sheet

 

In this cheat sheet, you will learn:

  1. Basic concepts of Data Engineering
  2. Hadoop Ecosystem
  3. Google Cloud Platform
  4. Identity and Access Management (IAM)
  5. Key concepts
  6. Compute choices
  7. Stackdriver
  8. Storage, Bigtable, BigQuery, and Cloud SQL
  9. Datastore, Dataproc, and Dataflow
  10. Pub/Sub

 

2. PySpark Cheat Sheet

 

The PySpark cheat sheet includes handy commands, with examples, for handling DataFrames in Python. It covers the basics of working with Apache Spark DataFrames, from initializing the SparkSession to running queries and saving the data.

 

Image from Cheat Sheet

 

In this cheat sheet, you will learn:

  1. Initializing SparkSession
  2. Creating DataFrames in Python
  3. Filtering
  4. Handling duplicate values
  5. Running Spark queries
  6. Running queries programmatically
  7. Modifying the columns
  8. Dealing with missing values
  9. Repartitioning
  10. GroupBy and Sorting
  11. Inspecting the data, saving the output, and stopping the session.
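
To make these topics concrete, here is a minimal sketch that strings several of them together on a tiny in-memory DataFrame. The sample rows, the column names, and the `run_demo` helper are made up for illustration and are not taken from the cheat sheet. It assumes PySpark and a Java runtime are installed locally.

```python
def run_demo():
    # Imports live inside the function so the module loads even without PySpark.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # 1. Initialize a local SparkSession (single worker thread).
    spark = (
        SparkSession.builder.master("local[1]")
        .appName("cheat-sheet-demo")
        .getOrCreate()
    )
    try:
        # 2. Create a DataFrame from Python objects (hypothetical sample data).
        rows = [
            ("alice", "books", 12.0),
            ("alice", "books", 12.0),  # exact duplicate row
            ("bob", "games", None),    # missing amount
            ("carol", "books", 30.0),
            ("dave", "games", -5.0),   # negative amount, filtered out below
        ]
        df = spark.createDataFrame(
            rows, schema="customer string, category string, amount double"
        )

        cleaned = (
            df.dropDuplicates()              # remove duplicate rows
            .na.fill({"amount": 0.0})        # replace missing amounts with 0.0
            .filter(F.col("amount") >= 0)    # keep non-negative amounts only
        )

        # Group, aggregate, and sort.
        summary = (
            cleaned.groupBy("category")
            .agg(F.count("*").alias("orders"), F.sum("amount").alias("total"))
            .orderBy("category")
        )
        return [(r["category"], r["orders"], r["total"]) for r in summary.collect()]
    finally:
        # Stop the session so local resources are released.
        spark.stop()
```

Calling `run_demo()` returns one `(category, orders, total)` tuple per category, sorted by category name.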

 

3. dbt commands Cheat Sheet

 

The dbt (data build tool) commands cheat sheet provides simple examples of the commands you can use to transform data. dbt is a transformation tool only; it does not extract or load data.

 

Image from Cheat Sheet

 

In this cheat sheet, you will learn:

  1. Introduction to dbt
  2. dbt generic commands
  3. Running based on the model name
  4. Running based on the folder name
  5. Multiple model inputs in the dbt command
  6. Special commands
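
As a rough illustration of how these selection styles look on the command line, the sketch below assembles a few `dbt` invocations as argument lists and prints them. The `dbt_command` helper and the model and folder names (`stg_orders`, `stg_customers`, `fct_sales`, `models/staging`) are hypothetical; only the `dbt run`, `dbt test`, and `--select` syntax comes from dbt itself.

```python
import shlex


def dbt_command(subcommand, select=(), flags=()):
    """Return the argv list for a dbt invocation such as `dbt run --select ...`."""
    argv = ["dbt", subcommand]
    if select:
        # Several selectors after --select are combined as a union.
        argv += ["--select", *select]
    argv += list(flags)
    return argv


examples = [
    dbt_command("run"),                                           # every model
    dbt_command("run", select=["stg_orders"]),                    # by model name
    dbt_command("run", select=["models/staging"]),                # by folder path
    dbt_command("run", select=["stg_orders", "stg_customers"]),   # multiple models
    dbt_command("run", select=["+fct_sales"]),                    # model plus its upstream parents
    dbt_command("test", select=["stg_orders"]),                   # tests for one model
]

for argv in examples:
    print(shlex.join(argv))
```

Keeping each command as a list makes it easy to pass to `subprocess.run` later without worrying about shell quoting.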

 

4. Apache Kafka Cheat Sheet

 

The Apache Kafka cheat sheet is command-based and covers the essential commands for distributed data streaming.

 

Image from Cheat Sheet

 

In this cheat sheet, you will learn:

  1. Display topic information
  2. Change topic retention
  3. List existing topics
  4. Purge a topic
  5. Delete a topic
  6. Earliest offset still in a topic
  7. Latest offset still in a topic
  8. Consume messages
  9. Get the consumer offsets for a topic
  10. Kafka consumer groups
  11. Kafkacat
  12. ZooKeeper
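
For a feel of what several of these tasks look like, here is a small sketch that builds the corresponding Kafka command lines without running them. The broker address, topic name, and group name are assumptions for illustration, as is the `kafka` helper. The sketch uses the `--bootstrap-server` flag found in recent Kafka releases; older releases (and possibly the cheat sheet itself) used a `--zookeeper` flag for some of these tools.

```python
import shlex

BOOTSTRAP = "localhost:9092"  # assumed broker address
TOPIC = "orders"              # hypothetical topic name
GROUP = "orders-consumers"    # hypothetical consumer group


def kafka(tool, *args):
    """Build the argv list for one of the Kafka shell tools."""
    return [f"{tool}.sh", "--bootstrap-server", BOOTSTRAP, *args]


commands = {
    "list topics": kafka("kafka-topics", "--list"),
    "describe topic": kafka("kafka-topics", "--describe", "--topic", TOPIC),
    # Lowering retention.ms is also a common way to purge a topic's old messages.
    "change retention": kafka(
        "kafka-configs", "--alter", "--entity-type", "topics",
        "--entity-name", TOPIC, "--add-config", "retention.ms=60000",
    ),
    "delete topic": kafka("kafka-topics", "--delete", "--topic", TOPIC),
    "consume from start": kafka(
        "kafka-console-consumer", "--topic", TOPIC, "--from-beginning"
    ),
    "consumer group offsets": kafka(
        "kafka-consumer-groups", "--describe", "--group", GROUP
    ),
}

for task, argv in commands.items():
    print(f"{task}: {shlex.join(argv)}")
```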

 

5. Google BigQuery Cheat Sheet

 

The Google BigQuery cheat sheet is command-based and explains BigQuery's features in detail. BigQuery is a fully managed data warehouse with advanced functionality such as geospatial analysis, BI tooling, and machine learning.

 

Image from Cheat Sheet

 

In this cheat sheet, you will learn:

  1. Initializing BigQuery resources with DDL
  2. Altering schemas
  3. Altering tables
  4. Altering views
  5. Altering materialized views
  6. BigQuery data types
  7. Numeric Types
  8. Adding and editing BigQuery data
  9. Common queries
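
The DDL topics above can be sketched as a few SQL statements generated from Python. This is only an illustration under stated assumptions: the dataset, table, and column names are invented, the two helper functions are hypothetical, and no identifier escaping is done, so the output is meant for reading rather than for production use.

```python
def create_table_ddl(table, columns):
    """Build a CREATE TABLE statement from a {column_name: bigquery_type} mapping."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return f"CREATE TABLE IF NOT EXISTS {table} (\n  {cols}\n);"


def add_column_ddl(table, name, dtype):
    """Build an ALTER TABLE statement that adds one column."""
    return f"ALTER TABLE {table} ADD COLUMN IF NOT EXISTS {name} {dtype};"


statements = [
    # In BigQuery DDL, a dataset is created with CREATE SCHEMA.
    "CREATE SCHEMA IF NOT EXISTS demo_dataset;",
    create_table_ddl(
        "demo_dataset.orders",
        {"order_id": "INT64", "amount": "NUMERIC", "created_at": "TIMESTAMP"},
    ),
    add_column_ddl("demo_dataset.orders", "note", "STRING"),
]

print("\n\n".join(statements))
```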

 

6. Airflow Commands Cheat Sheet

 

The Airflow cheat sheet is command-based and covers the essential commands for creating, scheduling, and monitoring workflows. Apache Airflow is a widely used data pipeline tool in the industry. It provides scalability, extensibility, and dynamic pipeline generation.

 

Image from Cheat Sheet

 

In this cheat sheet, you will learn:

  1. Miscellaneous commands
  2. Celery components
  3. View configuration
  4. Manage connections
  5. Manage DAGs
  6. Database operations
  7. Tools to help run the KubernetesExecutor
  8. Manage pools
  9. Display providers
  10. Manage roles, tasks, users, and variables
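
Most of the topics above correspond to command groups in the Airflow 2 command-line interface. The sketch below maps some of them to their group names and builds a few example invocations. The `airflow` helper, the DAG id `example_dag`, and the variable `env=staging` are hypothetical, and the sketch assumes the Airflow 2 style `airflow <group> <action>` syntax.

```python
import shlex

# Cheat-sheet topic -> Airflow 2 CLI command group.
COMMAND_GROUPS = {
    "Celery components": "celery",
    "View configuration": "config",
    "Manage connections": "connections",
    "Manage DAGs": "dags",
    "Database operations": "db",
    "Tools to help run the KubernetesExecutor": "kubernetes",
    "Manage pools": "pools",
    "Display providers": "providers",
}


def airflow(group, action, *args):
    """Build the argv list for `airflow <group> <action> [args...]`."""
    return ["airflow", group, action, *args]


examples = [
    airflow("dags", "list"),                        # list all DAGs
    airflow("dags", "trigger", "example_dag"),      # start a DAG run
    airflow("tasks", "list", "example_dag"),        # list the tasks in a DAG
    airflow("connections", "list"),                 # list stored connections
    airflow("pools", "list"),                       # list pools
    airflow("variables", "set", "env", "staging"),  # set a variable
]

for argv in examples:
    print(shlex.join(argv))
```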

 

7. Docker Cheat Sheet

 

The Docker cheat sheet covers the basic functionality of building, running, and managing Docker images. Docker provides OS-level virtualization to deliver software in packages called containers. It is used for reproducibility and management of available resources. 

 

Image from Cheat Sheet

 

In this cheat sheet, you will learn:

  1. Run a new container
  2. Manage containers
  3. Info and Stats
  4. Managing build, configs, images, and services
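
As a quick illustration of these four areas, the sketch below wraps a handful of common Docker commands in a small helper. The `docker` helper is hypothetical, and the container name `web`, the image tag `demo:latest`, and the port mapping are made-up values. By default it only returns the command line as text (a dry run); passing `dry_run=False` would actually execute it, which assumes Docker is installed.

```python
import shlex
import subprocess


def docker(*args, dry_run=True):
    """Return the docker command line (dry run) or run it and return its stdout."""
    argv = ["docker", *args]
    if dry_run:
        return shlex.join(argv)
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout


steps = [
    docker("build", "-t", "demo:latest", "."),  # build an image from ./Dockerfile
    docker("run", "--name", "web", "-d", "-p", "8080:80", "nginx:latest"),  # run a new container
    docker("ps"),                               # list running containers
    docker("stats", "--no-stream"),             # one-off resource usage snapshot
    docker("stop", "web"),                      # stop the container
    docker("rm", "web"),                        # remove the container
    docker("images"),                           # list local images
]

print("\n".join(steps))
```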

 

Conclusion

 

Day to day, data engineers handle data ingestion, data warehousing, analytics engineering, workflow management, batch processing, and streaming. Performing these tasks requires working knowledge of the tools and their commands. The 7 cheat sheets help you review various tools, commands, and concepts. They can also help you prepare for the technical interview stage of a data engineering role with less effort.

I hope you like the cheat sheets. Don’t forget to follow me on Twitter and LinkedIn, where I post engaging blogs on data science.

 
 
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.