python-bloomfilter

Build Status

pybloom_live

A fast, scalable Python implementation of Bloom filters - space-efficient probabilistic data structures for set membership testing.

What is a Bloom Filter?

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. It can tell you with certainty that an element is not in the set, or that it might be in the set (with a controllable false positive probability). Bloom filters never produce false negatives.

Use cases:

Features

Installation

Install from PyPI:

pip install pybloom-live

Or install from source:

git clone https://github.com/joseph-fox/python-bloomfilter.git
cd python-bloomfilter
pip install .

Verify installation:

pip show pybloom-live
# Should show version 4.0.0 or higher

Quick Start

Basic BloomFilter

import pybloom_live

# Create a Bloom Filter with capacity for 1000 elements
# and a 0.1% false positive rate
f = pybloom_live.BloomFilter(capacity=1000, error_rate=0.001)

# Add elements
f.add("apple")
f.add("banana")
f.add("orange")

# Check membership
print("apple" in f)      # True
print("grape" in f)      # False

# The add() method returns True if the element might already exist
already_exists = f.add("apple")  # Returns True
already_exists = f.add("mango")  # Returns False

ScalableBloomFilter

The ScalableBloomFilter automatically grows as you add more elements, maintaining the target error rate:

import pybloom_live

# Create a scalable filter with initial capacity of 100
sbf = pybloom_live.ScalableBloomFilter(
    initial_capacity=100,
    error_rate=0.001,
    mode=pybloom_live.ScalableBloomFilter.LARGE_SET_GROWTH
)

# Add many elements - it scales automatically
for i in range(10000):
    sbf.add(i)

print(f"Elements: {len(sbf)}")           # ~10000
print(f"Capacity: {sbf.capacity}")       # Automatically expanded
print(f"Filters: {len(sbf.filters)}")    # Multiple internal filters

# Check membership
print(5000 in sbf)   # True
print(10000 in sbf)  # False

Growth modes:

Examples

Example 1: Web Crawler - Tracking Visited URLs

import pybloom_live

# Track visited URLs to avoid re-crawling
visited = pybloom_live.ScalableBloomFilter(
    initial_capacity=10000,
    error_rate=0.001
)

urls_to_crawl = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page1",  # duplicate
    "https://example.com/page3",
]

for url in urls_to_crawl:
    if url in visited:
        print(f"SKIP (already crawled): {url}")
    else:
        visited.add(url)
        print(f"CRAWL: {url}")
        # crawl_page(url)

Example 2: Set Operations

Combine filters using union and intersection:

import pybloom_live

# Create two filters with same capacity and error rate
f1 = pybloom_live.BloomFilter(capacity=100, error_rate=0.01)
f2 = pybloom_live.BloomFilter(capacity=100, error_rate=0.01)

# Add elements
for i in range(20):
    f1.add(i)

for i in range(10, 30):
    f2.add(i)

# Union - elements in either filter
f_union = f1 | f2  # or f1.union(f2)
print(0 in f_union)   # True (in f1)
print(15 in f_union)  # True (in both)
print(25 in f_union)  # True (in f2)

# Intersection - elements likely in both filters
f_intersect = f1 & f2  # or f1.intersection(f2)
print(5 in f_intersect)   # False (only in f1)
print(15 in f_intersect)  # True (in both)
print(25 in f_intersect)  # False (only in f2)

Example 3: Serialization

Save and load filters from disk:

import pybloom_live

# Create and populate a filter
bf = pybloom_live.BloomFilter(capacity=10000, error_rate=0.001)
for i in range(1000):
    bf.add(f"user_{i}")

# Save to file
with open('bloom_filter.bin', 'wb') as f:
    bf.tofile(f)

# Load from file
with open('bloom_filter.bin', 'rb') as f:
    loaded_bf = pybloom_live.BloomFilter.fromfile(f)

# Verify
print("user_500" in loaded_bf)    # True
print("user_9999" in loaded_bf)   # False

Example 4: Understanding Error Rates

import pybloom_live

# Create a filter with 0.1% error rate
f = pybloom_live.BloomFilter(capacity=1000, error_rate=0.001)

# Add exactly capacity elements
for i in range(1000):
    f.add(i)

# Test false positive rate
false_positives = 0
tests = 10000
for i in range(1000, 1000 + tests):
    if i in f:  # These were never added
        false_positives += 1

actual_error_rate = false_positives / tests
print(f"Requested error rate: {f.error_rate}")
print(f"Actual error rate: {actual_error_rate:.4f}")
# Actual rate should be close to 0.001

API Reference

BloomFilter

Constructor:

BloomFilter(capacity, error_rate=0.001)

Methods:

Properties:

ScalableBloomFilter

Constructor:

ScalableBloomFilter(initial_capacity=100, error_rate=0.001, mode=LARGE_SET_GROWTH)

Methods:

Properties:

Hash Functions

The library automatically selects the optimal hash function based on the number of bits required:

Bits Required Hash Function Type
≤ 128 xxh3_128 Non-cryptographic (fastest)
129-160 SHA-1 Cryptographic
161-256 SHA-256 Cryptographic
257-384 SHA-384 Cryptographic
> 384 SHA-512 Cryptographic

Algorithm Details

This implementation follows the Scalable Bloom Filter algorithm from:

P. Almeida, C.Baquero, N. Preguiça, D. Hutchison, Scalable Bloom Filters, (GLOBECOM 2007), IEEE, 2007.

Key optimizations:

Breaking Changes in Version 4.x

Version 4.0.0 introduces significant performance improvements but is not backward compatible with earlier versions:

Migration: If you have serialized filters from earlier versions, you must regenerate them using version 4.0+.

Performance

Typical performance characteristics (results may vary by system):

Python Version Support

Note: Python 2 support has been dropped in version 4.0+. For Python 2 compatibility, use an earlier version.

Development

We follow the git branching model described by Vincent Driessen.

Running tests:

pip install pytest
pytest pybloom_live/test_pybloom.py

License

MIT License - see LICENSE.txt for details.

Credits