RECIPE

Apache Beam primer

Apache Beam is a unified programming model for batch and streaming data pipelines. This recipe walks through wiring a Beam pipeline against Meridian, so you can express transformations once and run them on Dataflow, Flink, or Spark without rewrites.

1Install the SDK

Beam ships Python, Java, and Go SDKs. For most Meridian users, the Python SDK is the fastest path. Pin the version to avoid runner mismatches in production.

pip install apache-beam==2.55.0
pip install meridian-sdk

2Build a pipeline

A pipeline is a directed graph of PTransforms over PCollections. The example below reads prompts from a file, scores them with Meridian, and writes the results back.

import apache_beam as beam
from meridian import Client

client = Client(api_key="mrd_live_...")

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("prompts.txt")
        | "Score" >> beam.Map(lambda x: client.score(x))
        | "Write" >> beam.io.WriteToText("scored.jsonl")
    )

3Pick a runner

The DirectRunner is fine for local testing. For production, point at Dataflow for autoscaling, Flink for low-latency streaming, or Spark when you already operate a cluster. The same Beam code runs on all three.

  • DirectRunner — local dev, single process
  • DataflowRunner — Google Cloud, fully managed
  • FlinkRunner — self-hosted streaming workloads