Data Engineering Technologies 2021

Emerging technologies supporting the field of data engineering are growing at a rapid clip. This curated list includes the most important offerings available in 2021.


By Tech Ninja, @techninjathere, OpenSource, Analytics & Cloud enthusiast.

A partial list of top engineering technologies, image created by KDnuggets.

Complete curated list of emerging technologies in Data Engineering

Abacus AI, enterprise AI with AutoML, similar space to DataRobot.
Algorithmia, enterprise MLOps.
Amundsen, an open-sourced data discovery and metadata engine.
Anodot, monitors all your data in real-time for lightning-fast detection of incidents.
Apache Arrow, essential for its non-JVM, in-memory columnar format and vectorized processing.
Apache Calcite, framework for building SQL databases and data management systems without owning data. Hive, Flink, and others use Calcite.
Apache Hop, facilitates all aspects of data and metadata orchestration.
Apache Iceberg, an open table format for massive analytic datasets.
Apache Pinot, real-time distributed OLAP datastore. Its growth is impressive, and it sits in a similar, though not identical, space to Druid.
Apache Superset, open source BI with many connectors available.
Beam, implements batch and streaming data processing jobs that run on any execution engine.
Cnvrg, enterprise MLOps.
Confluent, Apache Kafka and its surrounding ecosystem.
Dagster, a data orchestrator for machine learning, heavily code-driven and in a similar space to Airflow, but with an emphasis on state flow.
Dask, Data Science purely in Python.
DataRobot, solid ML platform with a strong focus in enterprise MLOps.
Databricks, with its new SQL Analytics and lakehouse paper, expect more amazing OSS to follow.
DataFrame Whale is a straightforward data discovery tool.
Dataiku, enterprise AI/MLOps platform.
Delta Lake, ACID transactions on Apache Spark.
DVC, open-source version control system for ML projects and desired for MLOps.
Feast, open-source feature store, now with Tecton.
Fiddler, enterprise explainable AI.
Fivetran, data integration pipeline.
Getdbt (dbt), hitting the sweet spot of Apache Spark by bringing a simplified SQL-based pipeline.
Great Expectations, a Data Science testing framework that is already amazing.
Hopsworks, open-source MLOps feature store.
Hudi brings transactions, record-level updates/deletes, and change streams to data lakes.
Koalas, Pandas on Apache Spark.
The Kubeflow project is dedicated to making machine learning workflows on Kubernetes simple, portable, and scalable.
lakeFS enables you to manage your data lake the way you manage your code. Run parallel pipelines for experimentation and CI/CD for your data.
maiot-ZenML, an open-source MLOps framework with a bit of everything.
Marquez, open-source metadata with a fantastic UI.
Metabase, an open-source BI with excellent visualization.
MLflow, a machine learning platform.
Montecarlodata, spanning data governance, data discovery, and data observability.
Nextflow, data-driven computational pipelines designed for bioinformatics, but applicable beyond it.
Pachyderm, MLOps platform, in the space of MLflow.
Papermill, parameterizes notebooks, making Data Science more exciting and more accessible.
Prefect, designed to make workflow management easier and better compared to Apache Airflow.
RAPIDS, Data Science on GPUs.
Ray, distributed machine learning and now streaming.
Starburst, unlock the value of distributed data by making it fast and easy to access.
Tecton, enterprise feature store.
Trino, formerly PrestoSQL. Now clearly separated from Presto, Trino can focus heavily on features.
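
To make the "in-memory, columnar format" mentioned for Apache Arrow a little more concrete, here is a minimal sketch of the general idea using only the Python standard library. It does not use the Arrow API; the sample records, the column names, and the to_columns helper are all hypothetical and exist only for illustration.

```python
from array import array

# Hypothetical sample records (not taken from the article).
rows = [
    {"user_id": 1, "amount": 9.5},
    {"user_id": 2, "amount": 20.0},
    {"user_id": 3, "amount": 5.25},
]


def to_columns(records):
    """Pivot row-oriented records into one contiguous typed buffer per column."""
    return {
        # "q" = signed 64-bit integers, "d" = 64-bit floats
        "user_id": array("q", (r["user_id"] for r in records)),
        "amount": array("d", (r["amount"] for r in records)),
    }


columns = to_columns(rows)

# Row layout: an aggregate has to visit every record object.
total_from_rows = sum(r["amount"] for r in rows)

# Columnar layout: the same aggregate scans a single contiguous buffer.
total_from_columns = sum(columns["amount"])
```

Both totals are equal; the difference is in how the values are laid out in memory. Keeping each column in its own typed buffer is what lets engines apply vectorized operations to a whole column at once, which is the property the Arrow entry highlights.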

Reordered alphabetically, based on this original. Reposted with permission.
