Hudi — Upserts in a Data Lake¶

Apache Hudi is designed for efficient record-level updates in a data lake, which makes it a natural fit for CDC pipelines with frequent upserts. Uber originally developed Hudi to manage tables with billions of records that receive frequent updates.
Two Table Types¶
- Copy-on-Write (CoW) — rewrites the affected data files on each update; read-optimized
- Merge-on-Read (MoR) — appends changes to delta log files that are merged at read or compaction time; write-optimized
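The table type is chosen at write time through the `hoodie.datasource.write.table.type` option. A minimal sketch of the two settings (the option key and values are standard Hudi configs):

```python
# Hudi selects the storage strategy per table via a write option.
cow_opts = {'hoodie.datasource.write.table.type': 'COPY_ON_WRITE'}  # default; read-optimized
mor_opts = {'hoodie.datasource.write.table.type': 'MERGE_ON_READ'}  # write-optimized
```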
```python
hudi_opts = {
    'hoodie.table.name': 'orders',
    'hoodie.datasource.write.recordkey.field': 'order_id',     # unique key used for upserts
    'hoodie.datasource.write.precombine.field': 'updated_at',  # newest version wins on conflict
    'hoodie.datasource.write.operation': 'upsert',
}

df.write.format("hudi") \
    .options(**hudi_opts) \
    .mode("append") \
    .save("/data/hudi/orders")
```
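Conceptually, an upsert deduplicates incoming rows by the record key and, when a key already exists, keeps the row with the larger precombine value. A minimal pure-Python model of that semantics (not Hudi's implementation — just an illustration of the merge rule):

```python
def upsert(table, incoming, key='order_id', precombine='updated_at'):
    """Merge incoming rows into table, keeping the record with the
    greater precombine value per key (models Hudi's upsert rule)."""
    merged = {row[key]: row for row in table}
    for row in incoming:
        existing = merged.get(row[key])
        if existing is None or row[precombine] >= existing[precombine]:
            merged[row[key]] = row
    return list(merged.values())

table = [{'order_id': 1, 'status': 'new', 'updated_at': 10}]
incoming = [
    {'order_id': 1, 'status': 'shipped', 'updated_at': 20},  # newer version of order 1
    {'order_id': 2, 'status': 'new', 'updated_at': 15},      # brand-new order
]
result = upsert(table, incoming)
# order 1 is replaced by its newer version; order 2 is inserted
```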
Incremental Processing¶
```python
# Pull only the records that changed after the given commit.
incr_df = spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", "0") \
    .load("/data/hudi/orders")
# "0" reads from the first commit; pass a commit timestamp to read only later changes
```
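The idea behind an incremental query is simple: the table's timeline records each commit, and the query returns only rows from commits after the given begin instant. A pure-Python model of that behavior (an illustration, not Hudi's implementation):

```python
def incremental_read(commits, begin_instant):
    """Return rows from commits strictly after begin_instant,
    in commit order (models a Hudi incremental query)."""
    return [row
            for instant, rows in sorted(commits.items())
            if instant > begin_instant
            for row in rows]

# Hypothetical timeline: commit instant -> rows written in that commit
commits = {
    '20240101': [{'order_id': 1}],
    '20240102': [{'order_id': 2}],
    '20240103': [{'order_id': 3}],
}
changed = incremental_read(commits, '20240101')  # only the two later commits
```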
When to Use Hudi¶
Apache Hudi is an ideal choice for scenarios where you need to regularly update existing records in a data lake — for example, synchronizing with a production database via CDC (Change Data Capture). Unlike the traditional approach of rewriting entire Parquet files, Hudi enables efficient upserts and incremental reads of only changed data.
In practice, Hudi is often deployed alongside Debezium and Apache Kafka: Debezium captures changes from PostgreSQL or MySQL, and Hudi writes them to S3 or HDFS. Thanks to its timeline and rollback mechanisms, Hudi also provides ACID transaction guarantees. If your data pipeline processes millions of records daily and you need near-real-time access to current data, Hudi is a strong choice.
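For a CDC workload like this, a Merge-on-Read table with inline compaction is a common starting point. A hedged sketch of the write options — the key names are standard Hudi configs, but the table name, fields, and values are illustrative and should be tuned per pipeline:

```python
# Illustrative options for CDC ingestion into a Merge-on-Read table.
cdc_opts = {
    'hoodie.table.name': 'orders_cdc',                      # hypothetical table name
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',  # fast writes; merged on read
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.recordkey.field': 'order_id',
    'hoodie.datasource.write.precombine.field': 'updated_at',
    'hoodie.compact.inline': 'true',                        # periodically fold delta logs
    'hoodie.compact.inline.max.delta.commits': '10',        # compact every N delta commits
}
```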
Summary¶
Hudi excels at CDC and frequent updates; the Copy-on-Write and Merge-on-Read table types let you trade read performance against write performance.