Apache Arrow primer
Apache Arrow is a language-agnostic columnar memory format designed for flat and hierarchical data. Meridian uses Arrow as the zero-copy interchange layer between our ingest pipeline, the analytics engine, and customer SDKs. This recipe walks through the three things you need to know to ship Arrow-backed analytics on Meridian today.
1. Why columnar matters
Row-oriented formats like JSON or CSV force every column to be parsed even when a query only touches one. Arrow stores each column's values contiguously, so a scan over a single field reads only that column's buffers, which are laid out for cache-friendly access and SIMD vectorization. On Meridian, this lets an aggregation over a 2.4 GB nightly export finish in 180 ms.
- Zero-copy reads across Python, Rust, JS, and Go
- Dictionary encoding for low-cardinality strings
- Native null bitmaps without sentinel values
2. Reading an Arrow IPC stream
Meridian exposes every dataset as an Arrow IPC stream behind a signed URL. Point any Arrow client at the URL and you get a typed RecordBatch reader with schema metadata baked in.
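import urllib.request
import pyarrow.ipc as ipc

# Dataset URL; substitute the signed URL for your dataset.
url = "https://meridian.getnimbus.net/api/v1/datasets/events.arrow"

# Feed the HTTP response straight into the IPC stream reader. The
# context manager closes the connection once all batches are read.
with urllib.request.urlopen(url) as resp:
    reader = ipc.open_stream(resp)
    table = reader.read_all()

print(table.schema)
print(table.num_rows, "rows")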
print(table.column("user_id").to_pandas().head())
3. Pushing Arrow back to Meridian
For uploads, wrap your RecordBatch in an IPC writer and POST the buffer to the ingest endpoint. Meridian validates the schema against the dataset contract and rejects mismatches with a typed error. No JSON conversion, no schema drift, no surprises at query time.
Tip: set the Content-Type: application/vnd.apache.arrow.stream header so the server skips content sniffing.