Data Engineer

Articles

Awesome Repositories

Blogs

Organization

  • Big Data Europe: Integrating Big Data, software & communities for addressing Europe’s societal challenges
  • DataTalksClub: The place to talk about data

Landscape

Topic

Youtube

CDC Pattern

Note

What is CDC?

CDC is a mechanism that continuously monitors the source data system for changes, extracts them, and distributes them to downstream systems. Instead of repeating bulk data loads, Change Data Capture loads data incrementally, in near real time.
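The incremental idea can be illustrated with a small in-memory sketch. It is not how Debezium or any other tool is implemented; the `ChangeEvent` type, the `capture_changes` and `apply_changes` helpers, and the integer offsets are all hypothetical names chosen for illustration. The point is only that each poll picks up the changes recorded after the last processed offset instead of reloading the whole table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical change record, loosely modelled on a database change log entry."""
    offset: int           # position in the source's change log
    op: str               # "insert", "update", or "delete"
    key: int              # primary key of the affected row
    row: Optional[dict]   # new row state; None for deletes


def capture_changes(change_log, last_offset):
    """Return only the events recorded after last_offset (the incremental part)."""
    return [event for event in change_log if event.offset > last_offset]


def apply_changes(target, events, last_offset):
    """Apply events in log order to a downstream copy and return the new offset."""
    for event in sorted(events, key=lambda e: e.offset):
        if event.op == "delete":
            target.pop(event.key, None)
        else:
            # Inserts and updates both overwrite the row for that key.
            target[event.key] = event.row
        last_offset = event.offset
    return last_offset


# Simulated change log of a source table (assumed data, for illustration only).
change_log = [
    ChangeEvent(1, "insert", 1, {"name": "alice"}),
    ChangeEvent(2, "insert", 2, {"name": "bob"}),
    ChangeEvent(3, "update", 1, {"name": "alicia"}),
    ChangeEvent(4, "delete", 2, None),
]

replica = {}
offset = 0

# First poll: pretend only the first two events have happened so far.
offset = apply_changes(replica, capture_changes(change_log[:2], offset), offset)

# Second poll: only events 3 and 4 are new, so only they are read and applied.
new_events = capture_changes(change_log, offset)
offset = apply_changes(replica, new_events, offset)

print(replica)  # {1: {'name': 'alicia'}}
```

Real CDC tools typically read the database's transaction log rather than a Python list, but the stored offset plays the same role: it lets the consumer resume where it left off and skip data it has already processed.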

Data Engineer Tools (Curious Version 🔭)

Data Orchestration Workflow

  • kestra: ⚡ Workflow Automation Platform
  • prefect: A workflow orchestration framework for building resilient data pipelines in Python.

DataLake / Lakehouse

  • openhouse: An open source control plane designed for efficient management of tables within open data lakehouse deployments

Streaming Process

  • bytewax: Python Stream Processing

Data Engineer Tools

Big Data

  • Spark: A multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters 🌟 (Recommended)
  • Trino: The distributed SQL query engine for big data, formerly known as PrestoSQL 🌟 (Recommended)

CDC

  • debezium: Change data capture for a variety of databases 🌟 (Recommended)
  • flink-cdc: Flink CDC is a streaming data integration tool

Data Orchestration Workflow

  • airbyte: The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted. 🌟 (Recommended)

  • airflow: A platform to programmatically author, schedule, and monitor workflows 🌟 (Recommended)

    • Astronomer Registry: Building Blocks for your Apache Airflow Data Pipelines.
    • Astro: A fully-managed SaaS application for data orchestration that helps teams write and run data pipelines with Apache Airflow

DataLake / Lakehouse

  • polaris: The interoperable, open source catalog for Apache Iceberg 🌟 (Recommended)
  • iceberg: A high-performance format for huge analytic tables 🌟 (Recommended)