16 · Bloom Filters — Probabilistic Check for Key Existence
Storage Internals · Topic 16 of 16
What is a Bloom Filter?
A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is in a set.
- Returns "definitely not in set" or "possibly in set" (never a false negative)
- False positives are possible — but tunable
- Uses a fraction of the memory of a hash set
How It Works
A Bloom filter is a bit array of m bits plus k hash functions.
Adding an element: hash it with all k functions → set those k bits to 1
Querying: hash with all k functions → if ALL k bits are 1, element "possibly present". If ANY bit is 0, "definitely absent".
graph LR
Key["Key: 'user:42'"] --> H1["Hash fn 1 → bit 3"]
Key --> H2["Hash fn 2 → bit 7"]
Key --> H3["Hash fn 3 → bit 12"]
H1 --> BA["Bit Array\n[0,0,0,1,0,0,0,1,0,0,0,0,1,...]"]
False Positive Rate
\[p \approx \left(1 - e^{-kn/m}\right)^k\]
Where: - \(n\) = number of inserted elements - \(m\) = number of bits - \(k\) = number of hash functions
To achieve 1% false positive rate with 1 million elements: ~9.6 MB.
Why Databases Use Bloom Filters
In LSM-Tree storage engines (Cassandra, RocksDB, LevelDB):
- SSTables may span many files
- A read must check all SSTables for the key
- Bloom filter per SSTable lets the engine skip files that definitely don't contain the key
graph TD
Read["Read key K"] --> BF1["Bloom Filter SSTable 1\n→ Definitely absent, skip"]
Read --> BF2["Bloom Filter SSTable 2\n→ Possibly present, read"]
BF2 --> Data["Find or not find K in SSTable 2"]
Cloud Implementations
- Each SSTable has a Bloom filter stored alongside it
bloom_filter_fp_chanceper table (default: 0.01 = 1% FP rate)- Lower FP = larger filter = more memory
- Internally uses Bloom filters for partition lookups
- Not exposed to users
- Bloom filters enabled by default on all levels
NewBloomFilterPolicy(bits_per_key)configures false positive rate
- Uses semi-join reduction and runtime filters (similar concept) for join optimization
- Not a user-configurable Bloom filter
- No built-in Bloom filter index by default
bloomextension available:CREATE INDEX ON table USING bloom(col1, col2)- Useful for multi-column equality searches with high cardinality
Summary
| Property | Bloom Filter |
|---|---|
| Space complexity | O(m) bits |
| Query time | O(k) — constant |
| False negatives | Never |
| False positives | Tunable probability |
| Can delete elements? | No (standard; counting Bloom filters can) |