RECIPE / PROBABILISTIC DATA

Bloom Filter Primer

A Bloom filter is a space-efficient probabilistic data structure that answers the question “is this item possibly in the set?” with zero false negatives and a tunable false-positive rate. Meridian uses them at the edge to skip expensive backend lookups for known-cold keys.

1.How it works

A Bloom filter is a fixed-size bit array of m bits and k independent hash functions. To insert an item, hash it k times, map each hash to a bit position modulo m, and set those bits to 1. To query, hash again and check whether all kbits are set.

If any bit is 0, the item is definitely not in the set. If all bits are 1, the item is probably in the set — with a false-positive probability that grows as the filter fills.

2.Sizing the filter

Given n expected items and target false-positive rate p, optimal parameters are m = -n ln(p) / (ln 2)^2 bits and k = (m/n) ln 2 hashes. For 1M items at 1% FPR, you need ~9.6 bits per item and 7 hash functions.

import math

def optimal_bloom(n: int, p: float):
    m = -n * math.log(p) / (math.log(2) ** 2)
    k = (m / n) * math.log(2)
    return int(math.ceil(m)), int(math.ceil(k))

bits, hashes = optimal_bloom(1_000_000, 0.01)
# bits = 9585059, hashes = 7

3.When to reach for one

Skipping disk or network lookups for likely-missing keys.
Deduplicating high-volume event streams with bounded memory.
Front-of-cache filters that absorb the long tail of cold reads.
Routing decisions where false positives only cost a wasted probe.