Apache Hudi primer
Apache Hudi brings upserts, deletes, and incremental pulls to data lakes built on object storage. This primer walks through Hudi's core concepts, table types, and a minimal Meridian-friendly write path you can wire into your warehouse today.
1. Tables, timelines, and table types
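Every Hudi table is a directory of base files (Parquet by default) plus a .hoodie/ metadata folder holding the timeline, an append-only log of the actions performed on the table, such as commits, compactions, and cleans. Choose Copy on Write for read-heavy analytics: each update rewrites the affected Parquet files, so reads stay fast at the cost of higher write amplification. Choose Merge on Read when you need streaming-style ingest with low write amplification: updates land in row-based log files that are merged with the base files at query time or by a later compaction.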
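The table type is selected through a writer option. As a minimal sketch, assuming the Spark datasource key hoodie.datasource.write.table.type and a hypothetical helper named table_type_options (not part of Hudi), the choice between the two types might be expressed like this:

```python
def table_type_options(read_heavy: bool) -> dict:
    """Return the Hudi writer option that selects a table type.

    table_type_options is a hypothetical helper for illustration only.
    The option key and the two values are Hudi's own names.
    """
    # Copy on Write favors readers; Merge on Read favors frequent writes.
    table_type = "COPY_ON_WRITE" if read_heavy else "MERGE_ON_READ"
    return {"hoodie.datasource.write.table.type": table_type}
```

The returned dict can be handed to a Spark writer with .options(**table_type_options(read_heavy=True)) alongside the other write options.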
2. Writing your first upsert
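The snippet below configures a Spark writer to upsert into a partitioned Hudi table. Set a stable record key and a monotonically increasing precombine field: when two incoming records share a key, Hudi keeps the one with the larger precombine value. Hudi then handles file sizing and indexing for you.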
# df is an existing Spark DataFrame with order_id, updated_at, and region columns.
df.write.format("hudi") \
    .option("hoodie.table.name", "orders") \
    .option("hoodie.datasource.write.recordkey.field", "order_id") \
    .option("hoodie.datasource.write.precombine.field", "updated_at") \
    .option("hoodie.datasource.write.partitionpath.field", "region") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .mode("append") \
    .save("s3://meridian-lake/orders")
3. Incremental queries downstream
Hudi's killer feature is the incremental query: pass the latest commit instant your pipeline saw, and Hudi returns only the rows written since. Wire this into Meridian to drive lightweight change-data-capture jobs, materialized views, and event fan-out without re-scanning entire partitions on every run.
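The incremental read described above can be sketched in PySpark as follows. This is an illustration under assumptions, not a definitive implementation: the helper names incremental_read_options and read_incremental are hypothetical, the spark argument is assumed to be an existing SparkSession with the Hudi bundle available, and the option keys are the ones used by Hudi's Spark datasource.

```python
def incremental_read_options(begin_instant: str) -> dict:
    """Build reader options for a Hudi incremental query.

    begin_instant is the last commit instant the pipeline has processed.
    """
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }


def read_incremental(spark, base_path: str, begin_instant: str):
    """Load the rows written since begin_instant as a DataFrame.

    spark is assumed to be a SparkSession that can already read Hudi tables.
    """
    return (
        spark.read.format("hudi")
        .options(**incremental_read_options(begin_instant))
        .load(base_path)
    )
```

A call such as read_incremental(spark, "s3://meridian-lake/orders", last_seen), where last_seen is a hypothetical variable holding your stored checkpoint, would return the changed rows. Each row carries a _hoodie_commit_time column, whose maximum can be saved as the next checkpoint. Whether the begin instant itself is included in the result differs between Hudi releases, so check the documentation for the version you run.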