Recipe

Apache Flink primer

Flink is a distributed stream processor with exactly-once semantics, event-time windowing, and stateful operators. This primer covers the runtime model, the DataStream API, and how to ship a job to a cluster from a Meridian-backed pipeline.

1. Runtime model

A Flink program compiles to a dataflow graph. The JobManager schedules that graph and coordinates the TaskManagers that execute it. Each TaskManager hosts task slots that run parallel subtasks. Keyed state is partitioned by key and checkpointed to durable storage, so after a failure the job restores the last consistent snapshot and replays input from that point rather than from the beginning of the stream.

  • JobManager: scheduling, checkpoint coordination, failover.
  • TaskManager: slot host, network shuffles, RocksDB state backend.
  • Checkpoint barrier: marker injected into the stream at the sources; an operator snapshots its state once the barrier has arrived on all of its inputs (alignment).
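
The barrier mechanism in the list above can be illustrated without Flink on the classpath. The sketch below is a simplified, single-threaded model of aligned checkpointing for one operator with several input channels: once a barrier arrives on a channel, records from that channel are held back until the barrier has arrived on every channel, at which point the operator snapshots its state. `AlignedOperator` and its methods are hypothetical names chosen for illustration, not Flink APIs, and the model ignores unaligned checkpoints, network buffers, and forwarding the barrier downstream.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Hypothetical model of one operator that sums its inputs (not a Flink class).
class AlignedOperator {
    private final int channelCount;
    private long sum = 0;                                   // operator state
    private final Set<Integer> blocked = new HashSet<>();   // channels whose barrier has arrived
    private final Queue<Long> held = new ArrayDeque<>();    // records held back during alignment
    private final List<Long> snapshots = new ArrayList<>(); // state captured per checkpoint

    AlignedOperator(int channelCount) {
        this.channelCount = channelCount;
    }

    // A data record arrives on the given channel.
    void onRecord(int channel, long value) {
        if (blocked.contains(channel)) {
            held.add(value);   // belongs to the next checkpoint epoch, so hold it
        } else {
            sum += value;      // belongs to the current epoch, so apply it
        }
    }

    // A checkpoint barrier arrives on the given channel.
    void onBarrier(int channel) {
        blocked.add(channel);
        if (blocked.size() == channelCount) {
            snapshots.add(sum);       // barrier seen on every input: snapshot the state
            blocked.clear();          // unblock all channels
            while (!held.isEmpty()) {
                sum += held.poll();   // apply the records that were held back
            }
        }
    }

    long sum() { return sum; }

    List<Long> snapshots() { return snapshots; }
}
```

With two channels, feeding record 1 and then a barrier on channel 0, record 10 on channel 0, record 2 on channel 1, and finally the barrier on channel 1 produces a snapshot of 3 and a final sum of 13. The record 10 arrived behind the barrier on its channel, so it is excluded from the snapshot and applied only after alignment completes.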

2. DataStream API

The DataStream API expresses transformations over unbounded streams. Sources read from Kafka or Kinesis, operators map and key the stream, and sinks publish to a warehouse. The example below windows page views by user and emits a per-minute count to a downstream sink.

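import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// kafkaSource, meridianSink, PageView, and CountAgg are defined elsewhere in the job.
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

// Tolerate events that arrive up to 5 seconds out of order.
env.fromSource(kafkaSource, WatermarkStrategy
        .<PageView>forBoundedOutOfOrderness(Duration.ofSeconds(5)),
        "pageviews")
    .keyBy(PageView::userId)                               // partition the stream by user
    // One-minute event-time windows; on Flink 1.19+ prefer Duration.ofMinutes(1).
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new CountAgg())                             // incremental per-window count
    .sinkTo(meridianSink);

env.execute("pageview-counter");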
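
To make the event-time behavior of the pipeline above concrete, the following sketch reproduces the arithmetic in plain Java, with no Flink dependency. It assumes the conventions Flink uses for these two components: a tumbling window with zero offset covers `[start, start + size)`, where `start` is the timestamp rounded down to a multiple of the window size, and a bounded-out-of-orderness watermark trails the largest timestamp seen so far by the bound plus one millisecond. `EventTimeSketch`, `windowStart`, `watermark`, and `canFire` are hypothetical helper names, not Flink APIs.

```java
import java.time.Duration;

// Hypothetical helper that mirrors Flink's event-time arithmetic (not a Flink class).
final class EventTimeSketch {
    private EventTimeSketch() {}

    // Start of the tumbling window (zero offset) that contains the timestamp.
    static long windowStart(long timestampMs, Duration size) {
        long sizeMs = size.toMillis();
        // floorMod keeps the result correct for negative timestamps as well.
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    // Watermark after observing maxTimestampMs with the given out-of-orderness bound.
    static long watermark(long maxTimestampMs, Duration bound) {
        return maxTimestampMs - bound.toMillis() - 1;
    }

    // A window [start, start + size) may fire once the watermark reaches its last millisecond.
    static boolean canFire(long windowStartMs, Duration size, long watermarkMs) {
        long lastMs = windowStartMs + size.toMillis() - 1;
        return watermarkMs >= lastMs;
    }
}
```

For example, an event stamped 102,500 ms falls in the one-minute window that starts at 60,000 ms. With the 5-second bound, that window can fire only after an event stamped 125,000 ms or later has been observed, because the watermark must reach 119,999 ms.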

3. Shipping a job

Package the job as a fat jar and register it with the Meridian Flink operator; the control plane then provisions a session cluster sized to your parallelism budget. Watermark, backpressure, and restart-strategy settings are wired from the recipe manifest, so a redeploy that resumes from the latest savepoint or checkpoint does not lose in-flight state.

For production, pin the savepoint path, enable incremental RocksDB checkpoints, and use a fixed-delay restart strategy with a bounded number of attempts, so that repeated failures stop the job instead of restarting it indefinitely.
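
The fixed-delay recommendation above amounts to a small policy: wait a constant delay before each restart and give up once a fixed number of attempts is used. Flink exposes this through the `restart-strategy.fixed-delay.attempts` and `restart-strategy.fixed-delay.delay` options. The plain-Java sketch below models only the decision logic; `FixedDelayPolicy` is a hypothetical name, not a Flink or Meridian class.

```java
import java.time.Duration;
import java.util.Optional;

// Hypothetical model of a fixed-delay restart policy with a bounded attempt count.
final class FixedDelayPolicy {
    private final int maxAttempts;
    private final Duration delay;
    private int attemptsUsed = 0;

    FixedDelayPolicy(int maxAttempts, Duration delay) {
        if (maxAttempts < 0) {
            throw new IllegalArgumentException("maxAttempts must be non-negative");
        }
        this.maxAttempts = maxAttempts;
        this.delay = delay;
    }

    // Called on each job failure. Returns the delay to wait before restarting,
    // or empty when the attempt budget is exhausted and the job should stay failed.
    Optional<Duration> onFailure() {
        if (attemptsUsed >= maxAttempts) {
            return Optional.empty();
        }
        attemptsUsed++;
        return Optional.of(delay);
    }
}
```

With `new FixedDelayPolicy(2, Duration.ofSeconds(10))`, the first two failures each return a 10-second delay and the third returns empty, which is the point at which the job stops restarting.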