Recipe

DuckDB primer

DuckDB is the embedded analytical database we lean on across the Meridian stack — from local notebooks to the cost-cockpit workers. It runs in-process, reads Parquet/CSV/JSON natively, and handles billions of rows on a laptop without a server.

1. Install and verify

DuckDB ships as a single binary plus language bindings. The Python package is what Meridian agents call by default, but the CLI is handy for ad-hoc poking at warehouse files.

# Install via pip
pip install duckdb

# Or via the CLI binary
curl https://install.duckdb.org | sh

# Verify
duckdb -c "SELECT 'hello meridian' AS greeting;"

2. Query Parquet without an ETL step

The killer feature for Meridian: point DuckDB at object storage and run SQL directly. No staging tables, no warehouse copy, no Airflow DAG. The query below aggregates the last seven days of gateway usage from S3 in a single round-trip.

import duckdb

con = duckdb.connect("meridian.db")

con.execute("""
  CREATE TABLE IF NOT EXISTS events (
    ts        TIMESTAMP,
    user_id   VARCHAR,
    model     VARCHAR,
    tokens    INTEGER,
    cost_usd  DOUBLE
  );
""")

# Query Parquet directly, no ETL required
df = con.execute("""
  SELECT model, SUM(tokens) AS tokens, SUM(cost_usd) AS spend
  FROM read_parquet('s3://meridian-logs/*.parquet')
  WHERE ts >= now() - INTERVAL 7 DAY
  GROUP BY model
  ORDER BY spend DESC
  LIMIT 10;
""").df()

print(df)

3. When to reach for DuckDB

  • Single-node analytics under ~1 TB — faster than spinning up a cluster.
  • Ad-hoc joins across Parquet, CSV, and Postgres in one SQL statement.
  • Notebook and CI workloads where a long-lived warehouse connection is overkill.
  • Embedded use inside a worker process, the way our cost-cockpit MTD job runs.

Next up: pair DuckDB with the Meridian gateway logs to build a self-serve usage dashboard. See the cost-cockpit recipe for the end-to-end pipeline.