RECIPE

Apache Iceberg primer

Apache Iceberg is an open table format for huge analytic datasets. It brings ACID transactions, schema evolution, hidden partitioning, and time travel to data lakes sitting on object storage like S3 or GCS. This recipe walks through wiring Iceberg into a Meridian data pipeline so your warehouse stays cheap, queryable, and reproducible.

1. Why Iceberg over raw Parquet

Raw Parquet directories rot fast: writes are not atomic, nothing enforces a schema, and changing the partition layout means rewriting the entire dataset. Iceberg layers a metadata tree on top of the Parquet files, so every commit is atomic, schema changes are tracked, and old snapshots remain queryable for audits or rollbacks.
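To make the "every commit is atomic" claim concrete, here is a minimal sketch of the idea behind it: the catalog holds one pointer per table to the current metadata file, and a commit succeeds only if it can swap that pointer from the value it started from. ToyCatalog, CommitConflict, and commit are hypothetical names for illustration; this is a toy in-memory model, not Iceberg's or Meridian's implementation.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer moved the table pointer first."""


class ToyCatalog:
    """Toy stand-in for a catalog: one metadata pointer per table."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        # Path of the table's current metadata file, or None if the table is new.
        return self._pointers.get(table)

    def swap(self, table, expected, new):
        # Compare-and-swap: move the pointer only if nobody else has moved it.
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(table)
            self._pointers[table] = new


def commit(catalog, table, write_metadata, max_retries=3):
    """Optimistic commit: re-read the base and retry when the swap loses a race."""
    for _ in range(max_retries):
        base = catalog.current(table)
        new = write_metadata(base)  # writes a new metadata file, returns its path
        try:
            catalog.swap(table, base, new)
            return new
        except CommitConflict:
            continue
    raise CommitConflict(f"gave up on {table} after {max_retries} attempts")
```

Readers only ever follow the pointer, so they see either the old table state or the new one, never a half-written commit.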

  • Snapshot isolation across concurrent writers
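  • Hidden partitioning: queries filter on the source column (for example, an event timestamp) and Iceberg prunes partitions itself, so WHERE clauses need no derived partition columns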
  • Time-travel queries via snapshot ID or timestamp
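As an illustration of the time-travel bullet, the sketch below shows how a timestamp can be resolved to a snapshot: pick the latest snapshot committed at or before the requested time. The Snapshot class, the snapshot_as_of helper, and the sample IDs and timestamps are all made up for this example; real snapshot history would come from the table's metadata.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Sequence


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int  # commit time, milliseconds since the Unix epoch


def snapshot_as_of(snapshots: Sequence[Snapshot], as_of: datetime) -> Optional[Snapshot]:
    """Return the latest snapshot committed at or before `as_of`, or None."""
    cutoff_ms = int(as_of.timestamp() * 1000)
    eligible = [s for s in snapshots if s.timestamp_ms <= cutoff_ms]
    return max(eligible, key=lambda s: s.timestamp_ms, default=None)


# Hypothetical history: three commits, one hour apart.
history = [
    Snapshot(101, 1_700_000_000_000),
    Snapshot(102, 1_700_003_600_000),
    Snapshot(103, 1_700_007_200_000),
]

# A point in time between the second and third commit resolves to snapshot 102.
as_of = datetime.fromtimestamp(1_700_005_000, tz=timezone.utc)
print(snapshot_as_of(history, as_of))
```

With PyIceberg, the resolved ID can then be passed to the scan, as in table.scan(snapshot_id=102), to read the table as it was at that commit.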

2. Catalog setup

Iceberg needs a catalog to track each table's current metadata pointer. Common options are a REST catalog, AWS Glue, or a Hive Metastore. For a Meridian-managed warehouse we recommend the REST catalog because it is engine-agnostic and keeps credentials server-side.

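from pyiceberg.catalog import load_catalog

# Connect to the Meridian REST catalog.
catalog = load_catalog(
    "meridian",
    **{
        "uri": "https://iceberg.meridian.dev",
        "warehouse": "s3://meridian-lake/warehouse",
        "s3.region": "us-east-1",
    },
)

# Load the table by its "namespace.table" identifier.
table = catalog.load_table("analytics.events")

# Note: this reads the whole table into memory. For large tables,
# pass row_filter= or limit= to scan() before converting to pandas.
df = table.scan().to_pandas()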
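The snippet above hard-codes its connection settings. One way to keep them out of source control is to build the property dict from the environment, sketched below. The MERIDIAN_* variable names and both helper functions are assumptions for illustration; "uri", "warehouse", "s3.region", and "token" are property keys PyIceberg's REST catalog accepts, and the token is only needed if your catalog requires a bearer token.

```python
import os
from typing import Mapping, Optional


def rest_catalog_properties(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build REST catalog properties, falling back to the values used above."""
    env = os.environ if env is None else env
    props = {
        "type": "rest",
        # Hypothetical variable names; rename to match your deployment.
        "uri": env.get("MERIDIAN_ICEBERG_URI", "https://iceberg.meridian.dev"),
        "warehouse": env.get("MERIDIAN_ICEBERG_WAREHOUSE", "s3://meridian-lake/warehouse"),
        "s3.region": env.get("MERIDIAN_S3_REGION", "us-east-1"),
    }
    token = env.get("MERIDIAN_ICEBERG_TOKEN")
    if token:
        # Only send a bearer token when one is configured.
        props["token"] = token
    return props


def open_catalog(name: str = "meridian", env: Optional[Mapping[str, str]] = None):
    """Open the catalog; pyiceberg is imported lazily so the helper above stays dependency-free."""
    from pyiceberg.catalog import load_catalog

    return load_catalog(name, **rest_catalog_properties(env))
```

Calling open_catalog().load_table("analytics.events") then behaves like the hard-coded version, with each setting overridable per environment.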

3. Maintenance and compaction

Streaming writes leave Iceberg tables with many small files. Schedule daily compaction to merge them into larger Parquet files, and expire snapshots older than your retention window so storage does not grow without bound. Meridian ships a managed maintenance worker that runs both jobs on a cron schedule.
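To show what the two maintenance jobs decide, here is a rough sketch of the planning logic: group small files into bins of about the target size, and pick snapshots that have aged out of the retention window while always keeping the newest ones. Both functions and their parameters are hypothetical and simplified; this is not how Meridian's worker or Iceberg's own rewrite action is implemented.

```python
from typing import List, Sequence, Tuple


def plan_compaction(file_sizes: Sequence[int], target_bytes: int) -> List[List[int]]:
    """Greedily pack files smaller than the target into groups of at most target_bytes."""
    small = sorted((s for s in file_sizes if s < target_bytes), reverse=True)
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in small:
        # Close the current group when the next file would overflow it.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    # Rewriting a single file gains nothing, so keep only groups that merge files.
    return [g for g in groups if len(g) > 1]


def snapshots_to_expire(
    snapshots: Sequence[Tuple[int, int]],  # (snapshot_id, timestamp_ms)
    now_ms: int,
    retention_ms: int,
    keep_last: int = 1,
) -> List[int]:
    """Return IDs older than the retention window, never touching the newest keep_last."""
    ordered = sorted(snapshots, key=lambda s: s[1], reverse=True)
    protected = {sid for sid, _ in ordered[:keep_last]}
    cutoff = now_ms - retention_ms
    return [sid for sid, ts in ordered if ts < cutoff and sid not in protected]
```

If you run maintenance yourself from Spark, Iceberg's rewrite_data_files and expire_snapshots procedures cover the same two jobs.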