Recipe

Count-Min Sketch primer

A Count-Min Sketch is a probabilistic data structure that estimates the frequency of events in a stream using sub-linear space. Trade a small over-count error for constant-time updates and constant-time queries across billions of items.

1. When to reach for it

Use a Count-Min Sketch when you need approximate frequencies for a high-cardinality stream and exact counts are too expensive in memory. Typical fits: heavy-hitter detection, rate-limiting, anomaly scoring, and tail-latency budgeting on routers.

  • Stream cardinality is huge (millions+ of distinct keys).
  • You can tolerate over-counts but never under-counts.
  • You need constant memory regardless of input size.

2. Sizing the sketch

Pick width w = ceil(e / epsilon) and depth d = ceil(ln(1 / delta)). With epsilon = 0.001 and delta = 0.01 you get w = 2719 and d = 5, a 13.6k-counter table that holds for any stream length.

Each row uses an independent hash. On update, increment all d cells. On query, return the minimum across rows: collisions can only inflate counts, never deflate.

3. Reference implementation

The Meridian gateway ships a Count-Min Sketch behind the abuse-detector for per-tenant token spend. Here is the minimal shape:

class CountMinSketch:
    def __init__(self, w=2719, d=5):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        self.seeds = [hash(("cms", i)) for i in range(d)]

    def _idx(self, key, row):
        return (hash((self.seeds[row], key))) % self.w

    def add(self, key, count=1):
        for r in range(self.d):
            self.table[r][self._idx(key, r)] += count

    def estimate(self, key):
        return min(
            self.table[r][self._idx(key, r)]
            for r in range(self.d)
        )