Recipe
DuckDB primer
DuckDB is the embedded analytical database we lean on across the Meridian stack — from local notebooks to the cost-cockpit workers. It runs in-process, reads Parquet/CSV/JSON natively, and handles billions of rows on a laptop without a server.
1. Install and verify
DuckDB ships as a single binary plus language bindings. The Python package is what Meridian agents call by default, but the CLI is handy for ad-hoc poking at warehouse files.
# Install via pip
pip install duckdb
# Or via the CLI binary
curl https://install.duckdb.org | sh
# Verify
duckdb -c "SELECT 'hello meridian' AS greeting;"2. Query Parquet without an ETL step
The killer feature for Meridian: point DuckDB at object storage and run SQL directly. No staging tables, no warehouse copy, no Airflow DAG. The query below aggregates the last seven days of gateway usage from S3 in a single round-trip.
import duckdb
con = duckdb.connect("meridian.db")
con.execute("""
CREATE TABLE IF NOT EXISTS events (
ts TIMESTAMP,
user_id VARCHAR,
model VARCHAR,
tokens INTEGER,
cost_usd DOUBLE
);
""")
# Query Parquet directly, no ETL required
df = con.execute("""
SELECT model, SUM(tokens) AS tokens, SUM(cost_usd) AS spend
FROM read_parquet('s3://meridian-logs/*.parquet')
WHERE ts >= now() - INTERVAL 7 DAY
GROUP BY model
ORDER BY spend DESC
LIMIT 10;
""").df()
print(df)3. When to reach for DuckDB
- • Single-node analytics under ~1 TB — faster than spinning up a cluster.
- • Ad-hoc joins across Parquet, CSV, and Postgres in one SQL statement.
- • Notebook and CI workloads where a long-lived warehouse connection is overkill.
- • Embedded use inside a worker process, the way our cost-cockpit MTD job runs.