Hudi — Upserts in a Data Lake¶

Apache Hudi is designed for efficient record-level updates in a data lake, which makes it a natural fit for CDC pipelines with frequent upserts. Uber originally developed Hudi to manage tables with billions of records that receive frequent updates.
Two Table Types¶
- Copy-on-Write (CoW) — rewrites the affected data files on each update; read-optimized
- Merge-on-Read (MoR) — appends changes to delta log files that are merged at read or compaction time; write-optimized
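The table type is chosen at write time through the `hoodie.datasource.write.table.type` option. A minimal sketch of the two settings (the option key and values are standard Hudi configs):

```python
# Hudi selects the storage strategy per table via a write option.
cow_opts = {'hoodie.datasource.write.table.type': 'COPY_ON_WRITE'}  # default; read-optimized
mor_opts = {'hoodie.datasource.write.table.type': 'MERGE_ON_READ'}  # write-optimized
```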
```python
hudi_opts = {
    'hoodie.table.name': 'orders',
    'hoodie.datasource.write.recordkey.field': 'order_id',     # unique key used for upserts
    'hoodie.datasource.write.precombine.field': 'updated_at',  # newest version wins on conflict
    'hoodie.datasource.write.operation': 'upsert',
}

df.write.format("hudi") \
    .options(**hudi_opts) \
    .mode("append") \
    .save("/data/hudi/orders")
```
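Conceptually, an upsert deduplicates incoming rows by the record key and, when a key already exists, keeps the row with the larger precombine value. A minimal pure-Python model of that semantics (not Hudi's implementation — just an illustration of the merge rule):

```python
def upsert(table, incoming, key='order_id', precombine='updated_at'):
    """Merge incoming rows into table, keeping the record with the
    greater precombine value per key (models Hudi's upsert rule)."""
    merged = {row[key]: row for row in table}
    for row in incoming:
        existing = merged.get(row[key])
        if existing is None or row[precombine] >= existing[precombine]:
            merged[row[key]] = row
    return list(merged.values())

table = [{'order_id': 1, 'status': 'new', 'updated_at': 10}]
incoming = [
    {'order_id': 1, 'status': 'shipped', 'updated_at': 20},  # newer version of order 1
    {'order_id': 2, 'status': 'new', 'updated_at': 15},      # brand-new order
]
result = upsert(table, incoming)
# order 1 is replaced by its newer version; order 2 is inserted
```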
Incremental Processing¶
```python
# Pull only the records that changed after the given commit.
incr_df = spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", "0") \
    .load("/data/hudi/orders")
# "0" reads from the first commit; pass a commit timestamp to read only later changes
```
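The idea behind an incremental query is simple: the table's timeline records each commit, and the query returns only rows from commits after the given begin instant. A pure-Python model of that behavior (an illustration, not Hudi's implementation):

```python
def incremental_read(commits, begin_instant):
    """Return rows from commits strictly after begin_instant,
    in commit order (models a Hudi incremental query)."""
    return [row
            for instant, rows in sorted(commits.items())
            if instant > begin_instant
            for row in rows]

# Hypothetical timeline: commit instant -> rows written in that commit
commits = {
    '20240101': [{'order_id': 1}],
    '20240102': [{'order_id': 2}],
    '20240103': [{'order_id': 3}],
}
changed = incremental_read(commits, '20240101')  # only the two later commits
```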
When to Use Hudi¶
Apache Hudi is an ideal choice for scenarios where you need to regularly update existing records in a data lake — for example, synchronizing with a production database via CDC (Change Data Capture). Unlike the traditional approach of rewriting entire Parquet files, Hudi enables efficient upserts and incremental reads of only changed data.
In practice, Hudi is often deployed alongside Debezium and Apache Kafka: Debezium captures changes from PostgreSQL or MySQL, and Hudi writes them to S3 or HDFS. Thanks to its timeline and rollback mechanisms, Hudi also provides ACID transaction guarantees. If your data pipeline processes millions of records daily and you need near-real-time access to current data, Hudi is a strong choice.
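For a CDC workload like this, a Merge-on-Read table with inline compaction is a common starting point. A hedged sketch of the write options — the key names are standard Hudi configs, but the table name, fields, and values are illustrative and should be tuned per pipeline:

```python
# Illustrative options for CDC ingestion into a Merge-on-Read table.
cdc_opts = {
    'hoodie.table.name': 'orders_cdc',                      # hypothetical table name
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',  # fast writes; merged on read
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.recordkey.field': 'order_id',
    'hoodie.datasource.write.precombine.field': 'updated_at',
    'hoodie.compact.inline': 'true',                        # periodically fold delta logs
    'hoodie.compact.inline.max.delta.commits': '10',        # compact every N delta commits
}
```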
Summary¶
Hudi excels at CDC and frequent updates; the Copy-on-Write and Merge-on-Read table types let you trade read performance against write performance.