Parquet Primer
Parquet is the columnar storage format that powers nearly every modern analytics stack. This primer walks you through why Meridian stores extracted datasets as Parquet by default, how to read them efficiently, and the three common pitfalls that trip up first-time users moving off CSV.
01. Why columnar wins for analytics
Row-oriented formats like CSV and JSONL force the reader to scan every byte even when a query only touches a single column. Parquet stores each column's values together in contiguous chunks, so a SELECT on one field of an N-column file reads roughly one-Nth of the bytes, assuming the columns are of similar size. Compression also gets dramatically better because adjacent values in the same column share a type and distribution. In production Meridian workloads we routinely see 8x to 30x smaller files and 50x faster scans versus the equivalent CSV.
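The one-Nth estimate above can be sketched as simple arithmetic. This is a toy model, not a measurement: the per-column sizes are made-up numbers, the description column is hypothetical, and bytes_scanned is a helper invented for illustration.

```python
def bytes_scanned(column_bytes, wanted):
    """Return (row_oriented, columnar) bytes read for a query touching `wanted`."""
    # A row-oriented reader has to pass over every column of every row.
    row_oriented = sum(column_bytes.values())
    # A columnar reader only touches the chunks of the requested columns.
    columnar = sum(column_bytes[name] for name in wanted)
    return row_oriented, columnar


# Hypothetical on-disk sizes per column, in bytes (illustrative only).
sizes = {
    "product_id": 4_000_000,
    "price": 4_000_000,
    "captured_at": 4_000_000,
    "description": 4_000_000,
}

row, col = bytes_scanned(sizes, ["price"])
print(row, col, col / row)  # 16000000 4000000 0.25
```

With four equally sized columns, a single-column query reads a quarter of the bytes. Real files deviate from this because wide text columns usually dominate the total.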
The cost is that Parquet is a binary format. You cannot tail it, you cannot grep it, and you cannot open it in Excel without a plugin. For staging extracts that is the right tradeoff. For human-in-the-loop debugging, keep a small CSV sample alongside the Parquet.
02. Reading a Meridian Parquet drop
Every Meridian extract lands as a single Parquet file (or a directory of part-files for larger jobs) plus a sidecar manifest. The fastest way to inspect one locally is DuckDB, which can query Parquet without loading the whole file into memory:
# Install once (run in a shell, not in Python):
#   pip install duckdb pyarrow

# Query directly from disk
import duckdb

con = duckdb.connect()  # in-memory database; the Parquet file stays on disk
rows = con.execute("""
    SELECT product_id, price, captured_at
    FROM 'extracts/run_42.parquet'
    WHERE price < 50
    ORDER BY captured_at DESC
    LIMIT 100
""").fetchall()
for r in rows:
    print(r)

For pandas users, pd.read_parquet(path) reads the same file, but it loads all of it into a DataFrame. Polars and Arrow give you lazy scans if the file is too large to materialize.
03. Three pitfalls to avoid
- Schema drift. Parquet files are self-describing. If a later extract adds a column, older readers will not see it unless you union schemas explicitly. Pin the manifest version.
- Tiny files. Writing one Parquet per row is an antipattern. Aim for 64 MB to 512 MB per part-file so column chunks are large enough to benefit from compression.
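- Timezone confusion. Meridian always emits timestamps in UTC. Tools that auto-localize (looking at you, pandas) can silently shift values. Make UTC explicit when you read, for example by passing tz="UTC" where the reader accepts it. pd.read_parquet has no such argument, so in pandas call .dt.tz_localize("UTC") on timezone-naive columns and .dt.tz_convert("UTC") on aware ones.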
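For the timezone pitfall, one way to make UTC explicit in pandas is a small normalizing helper. This is a sketch under assumptions: ensure_utc is a name invented here, the sample values are made up, and it treats any timezone-naive column as already being in UTC, which matches what Meridian emits but may not hold for data from other sources.

```python
import pandas as pd


def ensure_utc(series):
    """Return a datetime Series that is timezone-aware and in UTC."""
    if series.dt.tz is None:
        # Naive values: assume they are already UTC and only attach the zone.
        return series.dt.tz_localize("UTC")
    # Aware values: convert the instant to UTC.
    return series.dt.tz_convert("UTC")


# Made-up sample values standing in for a captured_at column.
naive = pd.Series(pd.to_datetime(["2024-01-01 12:00:00"]))
aware = pd.Series(pd.to_datetime(["2024-01-01 07:00:00-05:00"]))

print(ensure_utc(naive).iloc[0])
print(ensure_utc(aware).iloc[0])
```

Both inputs describe the same instant, so both come out as noon UTC on 2024-01-01. Running a check like this right after loading makes any silent shift visible early.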