7 Essential Cheat Sheets for Data Engineering
Learn about the data life cycle, PySpark, dbt, Kafka, BigQuery, Airflow, and Docker.
1. GCP Data Engineering Cheat Sheet
The Data Engineering with GCP cheat sheet covers the complete data life cycle. It is aimed at experienced practitioners who want to review the essential concepts and tools of the data engineering ecosystem.
In this cheat sheet, you will learn:
- Basic concepts of Data Engineering
- Hadoop Ecosystem
- Google Cloud Platform
- Identity access management
- Key concepts
- Compute choices
- Stackdriver
- Storage, Bigtable, BigQuery, and Cloud SQL
- Datastore, Dataproc, and Dataflow
- Pub/Sub
2. PySpark Cheat Sheet
The PySpark cheat sheet includes handy commands, with examples, for handling DataFrames in Python. It covers the basics of working with Apache Spark DataFrames, from initializing the SparkSession to running queries and saving the data.
In this cheat sheet, you will learn:
- Initializing SparkSession
- Creating DataFrames in Python
- Filtering
- Duplicate values
- Running Spark queries
- Running queries programmatically
- Modifying the columns
- Dealing with missing values
- Repartitioning
- GroupBy and Sorting
- Inspecting the data, saving the output, and stopping the session.
3. dbt commands Cheat Sheet
The dbt (data build tool) commands cheat sheet provides simple examples of commands you can use to transform data. dbt is a transformation tool; it does not extract or load data.
In this cheat sheet, you will learn:
- Introduction to dbt
- dbt generic commands
- Running based on the model name
- Running based on the folder name
- Multiple model inputs in the dbt command
- Special commands
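The selection patterns listed above can be illustrated by building `dbt run` command lines in Python. This is a rough sketch under stated assumptions: dbt is installed separately, the helper `dbt_run` is hypothetical and only assembles arguments, and the model names `stg_orders` and `dim_customers` and the folder `models/staging` are invented. It uses the `--select` flag; older dbt releases used `--models` for the same purpose.

```python
import shlex


def dbt_run(*selectors):
    """Build a `dbt run` argument list (hypothetical helper, not part of dbt)."""
    cmd = ["dbt", "run"]
    if selectors:
        cmd += ["--select", *selectors]
    return cmd


by_model = dbt_run("stg_orders")                  # run a single model by name
by_folder = dbt_run("path:models/staging")        # run every model under a folder
with_parents = dbt_run("+dim_customers")          # run a model plus its upstream parents
multiple = dbt_run("stg_orders", "dim_customers")  # run several models at once

# Print the shell-ready command lines without executing anything.
for cmd in (by_model, by_folder, with_parents, multiple):
    print(shlex.join(cmd))
```

Keeping each command as a list makes it safe to pass to `subprocess.run` later without worrying about shell quoting.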
4. Apache Kafka Cheat Sheet
The Apache Kafka cheat sheet is command-based and covers the essential commands for distributed data streaming.
In this cheat sheet, you will learn:
- Display topic information
- Change topic retention
- List existing topics
- Purge a topic
- Delete a topic
- Earliest offset still in a topic
- Latest offset still in a topic
- Consume messages
- Get the consumer offsets for a topic
- Kafka consumer groups
- Kafkacat
- Zookeeper
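A few of the operations listed above can be sketched as command lines for the scripts that ship with Kafka. This sketch makes several assumptions: a broker reachable at `localhost:9092`, a hypothetical topic `orders` and consumer group `orders-consumers`, and a recent Kafka release where the tools accept `--bootstrap-server` (the cheat sheet itself may show older ZooKeeper-based flags). The helper `kafka_cmd` is hypothetical and only builds the arguments.

```python
import shlex

BOOTSTRAP = "localhost:9092"   # assumed broker address
TOPIC = "orders"               # hypothetical topic name
GROUP = "orders-consumers"     # hypothetical consumer group name


def kafka_cmd(script, *args):
    """Build an argument list for one of Kafka's bundled shell scripts."""
    return [script, "--bootstrap-server", BOOTSTRAP, *args]


commands = {
    "list topics": kafka_cmd("kafka-topics.sh", "--list"),
    "describe topic": kafka_cmd("kafka-topics.sh", "--describe", "--topic", TOPIC),
    "change retention": kafka_cmd(
        "kafka-configs.sh", "--alter",
        "--entity-type", "topics", "--entity-name", TOPIC,
        "--add-config", "retention.ms=3600000",   # keep messages for one hour
    ),
    "delete topic": kafka_cmd("kafka-topics.sh", "--delete", "--topic", TOPIC),
    "consume from start": kafka_cmd(
        "kafka-console-consumer.sh", "--topic", TOPIC, "--from-beginning"
    ),
    "describe group": kafka_cmd(
        "kafka-consumer-groups.sh", "--describe", "--group", GROUP
    ),
}

# Print each command so it can be reviewed before being run in a shell.
for label, cmd in commands.items():
    print(f"{label}: {shlex.join(cmd)}")
```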
5. Google BigQuery Cheat Sheet
The Google BigQuery cheat sheet is command-based and explains BigQuery features in detail. BigQuery is a fully managed data warehouse with advanced functionality such as geospatial analysis, BI tooling, and machine learning.
In this cheat sheet, you will learn:
- Initializing BigQuery resources with DDL
- Altering schemas
- Altering tables
- Altering views
- Altering materialized views
- BigQuery data types
- Numeric Types
- Adding and editing BigQuery data
- Common queries
6. Airflow Commands Cheat Sheet
The Airflow cheat sheet is command-based and covers the essential commands for creating, scheduling, and monitoring workflows. Apache Airflow is a widely used data pipeline tool in the industry. It provides scalability, extensibility, and dynamic pipeline generation.
In this cheat sheet, you will learn:
- Miscellaneous commands
- Celery components
- View configuration
- Manage connections
- Manage DAGs
- Database operations
- Tools to help run the KubernetesExecutor
- Manage pools
- Display providers
- Manage roles, tasks, users, and variables
7. Docker Cheat Sheet
The Docker cheat sheet covers the basic functionality of building, running, and managing Docker images. Docker provides OS-level virtualization to deliver software in packages called containers. It is used to make environments reproducible and to manage available resources.
In this cheat sheet, you will learn:
- Run a new container
- Manage containers
- Info and Stats
- Managing build, configs, images, and services
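The build, run, and inspect workflow listed above can be sketched as a short sequence of Docker CLI calls driven from Python. This is only an illustrative sketch: the image tag `myapp:latest`, the container name `web`, and the port mapping are hypothetical, and the `run` helper is an assumption of this example that defaults to a dry run, so nothing is executed unless you opt in.

```python
import shlex
import shutil
import subprocess


def run(cmd, dry_run=True):
    """Return the command line; execute it only when dry_run is False
    and the executable is actually found on PATH."""
    line = shlex.join(cmd)
    if dry_run or shutil.which(cmd[0]) is None:
        return line
    subprocess.run(cmd, check=True)
    return line


planned = [
    ["docker", "build", "-t", "myapp:latest", "."],     # build an image from ./Dockerfile
    ["docker", "run", "-d", "--name", "web",
     "-p", "8080:80", "myapp:latest"],                  # start a detached container
    ["docker", "ps"],                                   # list running containers
    ["docker", "stats", "--no-stream"],                 # one-off resource usage snapshot
    ["docker", "stop", "web"],                          # stop the container
    ["docker", "rm", "web"],                            # remove the stopped container
]

# Dry run: collect and print the command lines without touching Docker.
lines = [run(cmd) for cmd in planned]
for line in lines:
    print(line)
```

Calling `run(cmd, dry_run=False)` would execute the command for real, so it is worth reviewing the printed lines first.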
Conclusion
Day to day, data engineers handle data ingestion, data warehousing, analytics engineering, workflow management, batch processing, and streaming. To perform all of these tasks, you need to know the tools and their commands. These 7 cheat sheets help you review various tools, commands, and concepts. They can also help you prepare for the technical stage of a data engineering interview.
I hope you like the cheat sheets. Don't forget to follow me on Twitter and LinkedIn, where I post engaging blogs on data science.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.