Skip to content

16 · Bloom Filters — Probabilistic Check for Key Existence

Storage Internals · Topic 16 of 16


What is a Bloom Filter?

A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is in a set.

  • Returns "definitely not in set" or "possibly in set" (never a false negative)
  • False positives are possible — but tunable
  • Uses a fraction of the memory of a hash set

How It Works

A Bloom filter is a bit array of m bits plus k hash functions.

Adding an element: hash it with all k functions → set those k bits to 1

Querying: hash with all k functions → if ALL k bits are 1, element "possibly present". If ANY bit is 0, "definitely absent".

graph LR
    Key["Key: 'user:42'"] --> H1["Hash fn 1 → bit 3"]
    Key --> H2["Hash fn 2 → bit 7"]
    Key --> H3["Hash fn 3 → bit 12"]
    H1 --> BA["Bit Array\n[0,0,0,1,0,0,0,1,0,0,0,0,1,...]"]

False Positive Rate

\[p \approx \left(1 - e^{-kn/m}\right)^k\]

Where: - \(n\) = number of inserted elements - \(m\) = number of bits - \(k\) = number of hash functions

To achieve 1% false positive rate with 1 million elements: ~9.6 MB.


Why Databases Use Bloom Filters

In LSM-Tree storage engines (Cassandra, RocksDB, LevelDB):

  • SSTables may span many files
  • A read must check all SSTables for the key
  • Bloom filter per SSTable lets the engine skip files that definitely don't contain the key
graph TD
    Read["Read key K"] --> BF1["Bloom Filter SSTable 1\n→ Definitely absent, skip"]
    Read --> BF2["Bloom Filter SSTable 2\n→ Possibly present, read"]
    BF2 --> Data["Find or not find K in SSTable 2"]

Cloud Implementations

  • Each SSTable has a Bloom filter stored alongside it
  • bloom_filter_fp_chance per table (default: 0.01 = 1% FP rate)
  • Lower FP = larger filter = more memory
    CREATE TABLE events (id UUID PRIMARY KEY, ...)
    WITH bloom_filter_fp_chance = 0.001;
    
  • Internally uses Bloom filters for partition lookups
  • Not exposed to users
  • Bloom filters enabled by default on all levels
  • NewBloomFilterPolicy(bits_per_key) configures false positive rate
  • Uses semi-join reduction and runtime filters (similar concept) for join optimization
  • Not a user-configurable Bloom filter
  • No built-in Bloom filter index by default
  • bloom extension available: CREATE INDEX ON table USING bloom(col1, col2)
  • Useful for multi-column equality searches with high cardinality

Summary

Property Bloom Filter
Space complexity O(m) bits
Query time O(k) — constant
False negatives Never
False positives Tunable probability
Can delete elements? No (standard; counting Bloom filters can)