Recipe

Delta Lake primer

Delta Lake brings ACID transactions, time travel, and schema enforcement to object storage. This recipe walks through the three building blocks Meridian uses when wiring a Delta-backed table into a streaming or batch pipeline, with copy-paste snippets for the common cases.

1. Create a managed Delta table

Start from a Spark session with the Delta extension wired in. The table lands on the configured object store and registers metadata in the catalog. Partitioning on a low-cardinality column keeps file pruning cheap.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
  .appName("meridian-delta")
  .config("spark.sql.extensions",
          "io.delta.sql.DeltaSparkSessionExtension")
  .getOrCreate())

(spark.read.json("s3://meridian/raw/events/")
  .write.format("delta")
  .partitionBy("event_date")
  .mode("overwrite")
  .save("s3://meridian/lake/events"))

2. Merge upserts without rewriting the world

The MERGE statement matches incoming change records against the target by primary key, updating mutated rows and inserting new ones in a single transaction. Useoptimizeafter high-volume merges so the small-file count stays bounded.

MERGE INTO lake.events t
USING staging.cdc s
ON t.event_id = s.event_id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

3. Time travel for audit and replay

Every commit writes a numbered version into the transaction log. Querying a prior version recovers the exact rows seen by a downstream consumer at that point in time, which makes regression diffs and accidental-delete recovery a one-liner.

-- inspect the log
DESCRIBE HISTORY lake.events;

-- read a prior version
SELECT * FROM lake.events VERSION AS OF 142;

-- read at a wall-clock instant
SELECT * FROM lake.events
TIMESTAMP AS OF '2026-06-01 00:00:00';