16 · Bloom Filters — Probabilistic Check for Key Existence

Storage Internals · Topic 16 of 16

What is a Bloom Filter?

A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is in a set.

Returns "definitely not in set" or "possibly in set" (never a false negative)
False positives are possible — but tunable
Uses a fraction of the memory of a hash set

How It Works

A Bloom filter is a bit array of m bits plus k hash functions.

Adding an element: hash it with all k functions → set those k bits to 1

Querying: hash with all k functions → if ALL k bits are 1, element "possibly present". If ANY bit is 0, "definitely absent".

graph LR
    Key["Key: 'user:42'"] --> H1["Hash fn 1 → bit 3"]
    Key --> H2["Hash fn 2 → bit 7"]
    Key --> H3["Hash fn 3 → bit 12"]
    H1 --> BA["Bit Array\n[0,0,0,1,0,0,0,1,0,0,0,0,1,...]"]

False Positive Rate

\[p \approx \left(1 - e^{-kn/m}\right)^k\]

Where: - \(n\) = number of inserted elements - \(m\) = number of bits - \(k\) = number of hash functions

To achieve 1% false positive rate with 1 million elements: ~9.6 MB.

Why Databases Use Bloom Filters

In LSM-Tree storage engines (Cassandra, RocksDB, LevelDB):

SSTables may span many files
A read must check all SSTables for the key
Bloom filter per SSTable lets the engine skip files that definitely don't contain the key

graph TD
    Read["Read key K"] --> BF1["Bloom Filter SSTable 1\n→ Definitely absent, skip"]
    Read --> BF2["Bloom Filter SSTable 2\n→ Possibly present, read"]
    BF2 --> Data["Find or not find K in SSTable 2"]

Cloud Implementations

CassandraDynamoDBRocksDB (used by many DBs)BigQueryPostgreSQL

Each SSTable has a Bloom filter stored alongside it
bloom_filter_fp_chance per table (default: 0.01 = 1% FP rate)

Lower FP = larger filter = more memory

CREATE TABLE events (id UUID PRIMARY KEY, ...)
WITH bloom_filter_fp_chance = 0.001;

Internally uses Bloom filters for partition lookups
Not exposed to users

Bloom filters enabled by default on all levels
NewBloomFilterPolicy(bits_per_key) configures false positive rate

Uses semi-join reduction and runtime filters (similar concept) for join optimization
Not a user-configurable Bloom filter

No built-in Bloom filter index by default
bloom extension available: CREATE INDEX ON table USING bloom(col1, col2)
Useful for multi-column equality searches with high cardinality

Summary

Property	Bloom Filter
Space complexity	O(m) bits
Query time	O(k) — constant
False negatives	Never
False positives	Tunable probability
Can delete elements?	No (standard; counting Bloom filters can)