Data Engineering Technologies 2021
Emerging technologies supporting the field of data engineering are growing at a rapid clip. This curated list includes the most important offerings available in 2021.
By Tech Ninja, @techninjathere, OpenSource, Analytics & Cloud enthusiast.
A partial list of top engineering technologies, image created by KDnuggets.
Complete curated list of emerging technologies in Data Engineering
- Abacus AI, enterprise AI with AutoML, similar space to DataRobot.
- Algorithmia, enterprise MLOps.
- Amundsen, an open-sourced data discovery and metadata engine.
- Anodot, monitors all your data in real-time for lightning-fast detection of incidents.
- Apache Arrow, essential because of its non-JVM, in-memory columnar format and vectorized processing.
- Apache Calcite, a framework for building SQL databases and data management systems that does not itself store or own the data. Hive, Flink, and others use Calcite.
- Apache Hop, facilitates all aspects of data and metadata orchestration.
- Apache Iceberg is an open table format for massive analytic datasets.
- Apache Pinot, real-time distributed OLAP datastore. Its growth is impressive, and it sits in a similar space to Druid, though the two are not identical.
- Apache Superset, open source BI with many connectors available.
- Apache Beam, for implementing batch and streaming data processing jobs that run on any supported execution engine.
- Cnvrg, enterprise MLOps.
- Confluent, Apache Kafka and its surrounding ecosystem.
- Dagster, a data orchestrator for machine learning. It is very programming-based and in a similar space to Airflow, but emphasizes state flow.
- Dask, parallel Data Science purely in Python.
- DataRobot, solid ML platform with a strong focus on enterprise MLOps.
- Databricks, with its new SQL Analytics offering and lakehouse paper; expect more amazing OSS to follow.
- DataFrame Whale is a straightforward data discovery tool.
- Dataiku, enterprise AI/MLOps platform.
- Delta Lake, ACID on Apache Spark.
- DVC, open-source version control system for ML projects and a good fit for MLOps.
- Feast, open-source feature store, now with Tecton.
- Fiddler, enterprise explainable AI.
- Fivetran, data integration pipeline.
- dbt (getdbt), which is hitting the sweet spot of Apache Spark by bringing a simplified SQL-based pipeline.
- Great Expectations, a Data Science testing framework that is already amazing!
- Hopsworks, open-sourced MLOps feature store.
- Hudi brings transactions, record-level updates/deletes, and change streams to data lakes.
- Koalas, Pandas on Apache Spark.
- The Kubeflow project is dedicated to making deployments of machine learning workflows on Kubernetes simple, portable, and scalable.
- lakeFS enables you to manage your data lake the way you manage your code. Run parallel pipelines for experimentation and CI/CD for your data.
- maiot-ZenML, an open-sourced MLOps framework with a bit of everything.
- Marquez, open-source metadata with a fantastic UI.
- Metabase, an open-source BI with excellent visualization.
- MLflow, an open-source platform for the machine learning lifecycle.
- Monte Carlo (montecarlodata), spanning data governance, data discovery, and data observability.
- Nextflow, data-driven computational pipelines designed for bioinformatics, but applicable well beyond it.
- Pachyderm, MLOps platform, in the space of MLflow.
- Papermill, for parameterizing notebooks, which makes Data Science more exciting and more accessible.
- Prefect, designed to make workflow management easier and better compared to Apache Airflow.
- RAPIDS, Data Science on GPUs.
- Ray, distributed machine learning and now streaming.
- Starburst, unlock the value of distributed data by making it fast and easy to access.
- Tecton, enterprise feature store.
- Trino, formerly PrestoSQL. Now clearly separated from Presto, Trino can focus heavily on features.
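
Several entries in the list (Apache Arrow in particular) rely on the idea of an in-memory columnar layout. As a rough illustration of that idea only, here is a minimal sketch in plain Python using the standard library. It is not Arrow's API or format; the sample records and the `to_columns` helper are hypothetical, made up for this example.

```python
from array import array

# Hypothetical sample records in a row-oriented layout: (name, count, price).
rows = [("a", 1, 2.5), ("b", 2, 4.0), ("c", 3, 1.5)]


def to_columns(records):
    """Pivot row tuples into one typed, contiguous buffer per numeric column."""
    return {
        "name": [r[0] for r in records],
        "count": array("q", (r[1] for r in records)),  # signed 64-bit ints
        "price": array("d", (r[2] for r in records)),  # 64-bit floats
    }


columns = to_columns(rows)

# An aggregate now scans a single contiguous column instead of every row.
total_price = sum(columns["price"])
print(total_price)
```

Storing each column in its own typed buffer is what lets analytic engines scan only the fields a query touches and process them in batches, which is the general motivation behind columnar formats.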
Reordered alphabetically, based on this original. Reposted with permission.