Recipe

Apache Hudi primer

Apache Hudi brings upserts, deletes, and incremental pulls to data lakes built on object storage. This primer walks through Hudi's core concepts, table types, and a minimal Meridian-friendly write path you can wire into your warehouse today.

1. Tables, timelines, and table types

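Every Hudi table is a directory of Parquet base files plus a .hoodie/ metadata folder that holds the timeline, an append-only log of the commits and other actions applied to the table. Choose Copy on Write, which rewrites Parquet files at write time, for read-heavy analytics. Choose Merge on Read, which appends changes to row-based log files and merges them at read or compaction time, when you need streaming-style ingest with low write amplification.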
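The table type is chosen through a writer option. The sketch below shows one way to pick it. The helper name table_type_options and its read_heavy flag are hypothetical and exist only for illustration. The option key and its two values are the ones Hudi's Spark datasource documents.

```python
# Values accepted by the hoodie.datasource.write.table.type option.
COPY_ON_WRITE = "COPY_ON_WRITE"
MERGE_ON_READ = "MERGE_ON_READ"


def table_type_options(table_name, read_heavy):
    """Hypothetical helper: build writer options that select a table type.

    read_heavy=True  -> Copy on Write (files rewritten at write time)
    read_heavy=False -> Merge on Read (changes appended, merged later)
    """
    table_type = COPY_ON_WRITE if read_heavy else MERGE_ON_READ
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": table_type,
    }


if __name__ == "__main__":
    # Pass the result to a Spark writer with .options(**opts).
    print(table_type_options("orders", read_heavy=True))
```

The table type is recorded when the table is first created, so treat it as a one-time choice for a given table path.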

2. Writing your first upsert

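The snippet below configures a Spark writer to upsert into a partitioned Hudi table. Set a stable record key and a precombine field whose value increases with each update; when two incoming records share a key, Hudi keeps the one with the larger precombine value. Hudi handles file sizing and indexing for you.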

# df is an existing Spark DataFrame with order_id, updated_at, and region columns.
df.write.format("hudi") \
  .option("hoodie.table.name", "orders") \
  .option("hoodie.datasource.write.recordkey.field", "order_id") \
  .option("hoodie.datasource.write.precombine.field", "updated_at") \
  .option("hoodie.datasource.write.partitionpath.field", "region") \
  .option("hoodie.datasource.write.operation", "upsert") \
  .mode("append") \
  .save("s3://meridian-lake/orders")
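To make the role of the record key and precombine field concrete, here is a plain-Python sketch of the deduplication an upsert performs on an incoming batch. It is an illustration of the idea, not Hudi's implementation. The function name precombine and the sample rows are made up, and keeping the later-arriving row on a tie is an assumption of this sketch.

```python
def precombine(records, record_key="order_id", precombine_field="updated_at"):
    """Keep one row per record key: the one with the largest precombine value.

    Illustrative only. On a tie this sketch keeps the later-arriving row,
    which is an assumption and not a statement about Hudi's behavior.
    """
    latest = {}
    for rec in records:
        key = rec[record_key]
        kept = latest.get(key)
        if kept is None or rec[precombine_field] >= kept[precombine_field]:
            latest[key] = rec
    return list(latest.values())


if __name__ == "__main__":
    batch = [
        {"order_id": 1, "updated_at": 10, "status": "placed"},
        {"order_id": 2, "updated_at": 11, "status": "placed"},
        {"order_id": 1, "updated_at": 12, "status": "shipped"},
    ]
    for row in precombine(batch):
        print(row)
```

This is why the precombine field should only move forward: if updated_at can go backwards, an older version of a row can win over a newer one.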

3. Incremental queries downstream

Hudi's standout feature is the incremental query: pass the last commit instant your pipeline processed, and Hudi returns only the records written after it. Wire this into Meridian to drive lightweight change-data-capture jobs, materialized views, and event fan-out without re-scanning entire partitions on every run.
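A minimal sketch of an incremental read follows. The helper names incremental_read_options and read_incremental are hypothetical. The option keys are the ones documented for Hudi's Spark datasource in 0.x releases, so check them against the version you run. The spark argument is assumed to be an existing SparkSession with the Hudi bundle on its classpath.

```python
def incremental_read_options(begin_instant):
    """Reader options for an incremental query.

    begin_instant is the last commit instant already processed, as a
    timestamp string such as "20240101000000" (example value only).
    """
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }


def read_incremental(spark, base_path, begin_instant):
    """Hypothetical helper: load only records committed after begin_instant.

    spark is assumed to be an existing SparkSession.
    """
    return (
        spark.read.format("hudi")
        .options(**incremental_read_options(begin_instant))
        .load(base_path)
    )


if __name__ == "__main__":
    print(incremental_read_options("20240101000000"))
```

A call such as read_incremental(spark, "s3://meridian-lake/orders", last_instant) returns a DataFrame of changed rows. Each row carries Hudi's _hoodie_commit_time metadata column, so the largest value in a batch can be stored as the checkpoint for the next run.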