Recipe

Apache Spark primer

Apache Spark is a unified analytics engine for large-scale data processing. This primer walks through the core mental model so you can read, write, and reason about Spark jobs from a Meridian notebook without surprises. We cover the cluster topology, DataFrames, and a minimal end-to-end job.

1. Driver, executors, and the cluster

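A Spark application has one driver and many executors. The driver holds the SparkSession and schedules tasks; executors run those tasks against partitions of your data in parallel. Transformations are lazy: nothing runs until you call an action. At that point the driver builds a DAG of stages, ships serialized tasks (including your closures) to the executors, and receives the results. Understanding this split is the most useful starting point for debugging slow jobs, because it tells you whether time is going to scheduling on the driver, work on the executors, or data moving between them.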

2. DataFrames over RDDs

Prefer the DataFrame API over raw RDDs. DataFrames flow through the Catalyst optimizer and Tungsten execution engine, which means predicate pushdown, column pruning, and code generation come for free. Reach for an RDD only when you genuinely need fine-grained control of partitioning or you are working with non-tabular data that resists a schema.

3. A minimal job

The skeleton below reads a Parquet dataset, filters it, aggregates, and writes the result. In a Meridian notebook the SparkSession is already attached as spark, so you skip the builder boilerplate.

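from pyspark.sql import functions as F

# Read one day of events; the date filter is applied at read time.
events = (
    spark.read.parquet("s3://meridian/events/")
         .where(F.col("event_date") == "2026-06-27")
)

# Count events and distinct sessions per user, keeping users with 5+ events.
per_user = (
    events.groupBy("user_id")
          .agg(
              F.count("*").alias("n_events"),
              F.countDistinct("session_id").alias("n_sessions"),
          )
          .where(F.col("n_events") >= 5)
)

# The write is the action that triggers the job; "overwrite" replaces
# whatever already exists under this day's output path.
per_user.write.mode("overwrite").parquet(
    "s3://meridian/derived/active_users/dt=2026-06-27/"
)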