Hudi — Upserts in a Data Lake
Uber developed Hudi to manage data-lake tables with billions of records that receive frequent updates, a workload classic append-only lakes handle poorly. This makes it a natural fit for CDC pipelines built on upserts.
Two Table Types
- Copy-on-Write (CoW) — rewrites the affected data files on every update; optimized for read-heavy workloads
- Merge-on-Read (MoR) — appends updates to delta log files that are merged at query time; optimized for write-heavy workloads
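The table type is selected at write time through the `hoodie.datasource.write.table.type` option (Copy-on-Write is the default when the option is omitted). A minimal sketch of how the two choices look as write options; the `table_type` helper is illustrative, not part of Hudi:

```python
# Table type is just another Hudi write option; COPY_ON_WRITE is the default.
cow_opts = {"hoodie.datasource.write.table.type": "COPY_ON_WRITE"}  # read-optimized
mor_opts = {"hoodie.datasource.write.table.type": "MERGE_ON_READ"}  # write-optimized

def table_type(opts):
    """Illustrative helper: the effective table type, with Hudi's default fallback."""
    return opts.get("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
```

Once a table is written with one type, subsequent writes must keep using it; the choice is a per-table trade-off, not a per-write one.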
hudi_opts = {
    'hoodie.table.name': 'orders',
    # Unique record key: rows with the same order_id are upserted, not duplicated
    'hoodie.datasource.write.recordkey.field': 'order_id',
    # Precombine field: when two rows share a key, the larger updated_at wins
    'hoodie.datasource.write.precombine.field': 'updated_at',
    'hoodie.datasource.write.operation': 'upsert',
}

df.write.format("hudi").options(**hudi_opts)\
    .mode("append").save("/data/hudi/orders")
# Incremental read: only records changed after a given commit instant
spark.read.format("hudi")\
    .option("hoodie.datasource.query.type", "incremental")\
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")\
    .load("/data/hudi/orders")  # instant time above is an example value
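Conceptually, the upsert above deduplicates incoming rows by the record key and, when a key already exists, keeps the row with the larger precombine value (`updated_at`). A minimal pure-Python sketch of that merge logic — not Hudi's implementation, just the semantics:

```python
def upsert(table, incoming, key="order_id", precombine="updated_at"):
    """Merge incoming rows into table: per key, the newest precombine value wins."""
    merged = {row[key]: row for row in table}
    for row in incoming:
        existing = merged.get(row[key])
        # Accept the incoming row only if it is at least as new as the stored one.
        if existing is None or row[precombine] >= existing[precombine]:
            merged[row[key]] = row
    return list(merged.values())

table = [{"order_id": 1, "status": "open", "updated_at": 100}]
incoming = [
    {"order_id": 1, "status": "shipped", "updated_at": 200},  # update: newer row wins
    {"order_id": 2, "status": "open", "updated_at": 150},     # insert: new key
]
rows = upsert(table, incoming)  # one updated row, one inserted row
```

This is why the precombine field matters in CDC pipelines: when a batch contains several changes to the same order, only the latest one survives.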
Summary
Hudi is well suited to CDC pipelines and update-heavy tables. The CoW and MoR table types let you trade read performance against write performance for each table.