Bloom filter library
Changelog | Setup | Docs | Maven Repo
Version 1 is out with a complete rewrite of almost all functionalities and many new ones.
This is a set of Bloom filters we implemented as we found all existing open-source implementations to be lacking in various aspects. This library takes some inspiration from the simple Bloom filter implementation of Magnus Skjegstad and the Ruby Bloom filters by Ilya Grigorik.
The Bloom filter is a probabilistic set data structure which is very small. This is achieved by allowing false positives with some probability p. It has an add
and contains
operation which both are very fast (time complexity O(1)). The Counting Bloom filter is an extension of the Bloom filter with a remove
operation at the cost of incurring an additional space overhead for counting. There are many good introductions to Bloom filters: the Wikipedia article is excellent, and even better is a survey by Broder and Mitzenmacher. Typical use cases of Bloom filters are content summaries and sets that would usually grow too large in fields such as networking, distributed systems, databases and analytics.
There are 4 types of Bloom filters in the Orestes Bloom filter library:
- Regular Bloom filter, a regular in-memory Java Bloom filter (
MemoryBloomFilter
) - Counting Bloom filter, a Counting Bloom Filter which supports element removal (
MemoryCountingBloomFilter
) - Redis Bloom Filter, a Redis-backed Bloom filter which can be concurrently used by different applications (
RedisBloomFilter
) - Redis Counting Bloom Filter, a Redis-backed Bloom filter which can be concurrently used by different applications, it keeps track of the number of keys added to the filter (
RedisCountingBloomFilter
)
This library if written in and for Java 8. For a Java 7 backport see: https://github.com/Crystark/Orestes-Bloomfilter
Err, Bloom what?
Bloom filters are awesome data structures: fast and space efficient.
BloomFilter<String> bf = new FilterBuilder(10_000_000, 0.01).buildBloomFilter(); //Expect 10M URLs
urls.add("http://github.com"); //Add millions of URLs
urls.contains("http://twitter.com"); //Know in an instant which ones you have or have not seen before
So what's the catch? Bloom filters allow false positives (i.e. URL contained though never added) with some probability (0.01 in the example). If you can mitigate rare false positives (false negatives never happen) then Bloom filters are probably for you.
New in 2.0
- A new expiring Bloom filter which maintains Bloom filter entry expiration completely in Redis (ExpiringBloomFilterPureRedis.java)
- Counting and expiration-based Bloom filters can now be migrated between the in-memory and Redis-backed implementations
- fixed a critical error with the hash entry calculation for counting Bloom filters: Hashes were encoded incorrectly by Jedis in the 1.x-based releases. Therefore, the Redis-backed counting Bloom filter implementation is not backward compatible with older releases. To avoid problems during upgrade, make sure to start with a clean Redis when upgrading to 2.0!
New in 1.0
- Bloom filters are now constructed and configured using a comfortable Builder interface, e.g.
new FilterBuilder(100,0.01).redisBacked().buildCountingBloomFilter()
- All Bloom filters are thread-safe and drastically improved in performance
- Arbitrarily many Bloom filter instances in a single Redis server by using
name("myfilter")
to distinguish filters - Existing filter can be loaded or overwritten and be shared from different processes without concurrency anomalies
- New and improved Hash functions: All cryptographic hash functions, Murmur3 and Murmur3 with the Kirsch&Mitzenmacher trick, advanced distribution and performance testing of hash functions
- Redis Bloom filters are much faster and simpler, fixed a very rare race-condition
- Memory and Redis Bloom filters now share a common interface
BloomFilter
resp.CountingBloomFilter
instead of being subclasses - Extensive JavaDoc documentation, test-coverage increased by a factor of at least 2, cleaner a streamlined (Java 8) design
- Redis read-slaves: allow your Bloom filter to perform reads on slaves to get even higher performance
- Library now available as Maven/Gradle repo and built using Gradle
- Population Estimation: the population of Counting and normal Bloom Filter can now be precisely estimated
- Frequency Estimation: the frequency/count of elements in Counting Bloom filter can now be estimated using the Minimum-Selection algorithm (known from spectral Bloom filters
- All add and remove method variants now return whether the element was added/removed resp. what the element's estimated count is
- Redis Bloom filters now use configurable connection pooling and are thus not limited by round-trip times anymore
- The library is now an important component of our Backend-as-a-Service startup Baqend and thus you can expect far more frequent updates. Don't worry, the Bloom filter library will always remain MIT-licensed and open-source!
Features
There are a many things we addressed as we sorely missed them in other implementations:
- Bloom filter and Counting Bloom filter in both a local and shared variants with the same interface
- Configuration of all parameters: Bit-Array size m, number of hash functions k, counting bits c
- Automatic configuration given the tolerable false positive rate p and expected elements n
- Statistics, e.g. what is my current false positive probability?
- Choice among different hash functions: the better (i.e. uniformly distributed) the hash function, the more accurate the Bloom filter but the better the hash function usually the slower it is -> choose from about 10-15 optimized hash functions, e.g. MD5, SHA, Murmur, LCGs, Carter-Wegman etc. or use a custom one
- Operations on the shared Bloom filter need to be fast (single round-trips to Redis per operation and heavy use of pipelining)
- Generation of the Bloom filter is always fast (on-the-fly pregeneration)
- Support of union and intersection
- Implementation of rejection sampling and chaining of hash values taking into account the avalanche effect (higher hash quality)
- Minimal dependencies: the local Bloom filters have none, the Redis Bloom filters need the jedis client library (in
lib
folder) - Concurrency: the shared Bloom filter can be accessed by many clients simultaneously without multi-user anomalies and performance degradation (which is quite difficult for bitwise counters and a pregnerated Bloom filter - but possible)
Java 8 is required. The recommended way to include the Bloom filter is via the Maven repo (works for Gradle, Ivy, etc., too):
<dependencies>
<dependency>
<groupId>com.baqend</groupId>
<artifactId>bloom-filter</artifactId>
<version>1.0.7</version>
</dependency>
</dependencies>
<repositories>
<repository>
<snapshots>
<enabled>false</enabled>
</snapshots>
<id>central</id>
<name>bintray</name>
<url>http://jcenter.bintray.com</url>
</repository>
</repositories>
or with Gradle:
repositories {
jcenter()
}
dependencies {
compile(
'com.baqend:bloom-filter:1.0.7'
)
}
For the normal Bloom filters it's even sufficient to only copy the source *.java files to your project (not recommended).
## Usage - [Regular Bloom Filter](#a1) - [The Filter Builder](#builder) - [Counting Bloom Filter](#a2) - [Redis Bloom Filters](#a3) - [Redis Counting Bloom Filters](#a4) - [Read Slaves](#slaves) - [JSON Representation](#a5) - [Hash Functions](#a6) - [Performance](#a7) - [Overview of Probabilistic Data Structures](#overview) ### Regular Bloom Filter The regular Bloom filter is very easy to use. It is the base class of all other Bloom filters. Figure out how many elements you expect to have in the Bloom filter ( *n* ) and then which false positive rate is tolerable ( *p* ).//Create a Bloom filter that has a false positive rate of 0.1 when containing 1000 elements
BloomFilter<String> bf = new FilterBuilder(1000, 0.1).buildBloomFilter();
The Bloom filter class is generic and will work with any type that implements the toString()
method in a sensible way, since that String is what the Bloom filter feeds into its hash functions. The hashCode()
method is not used, since in Java it returns integers that normally do not satisfy a uniform distribution of outputs that is essential for the optimal performance of the Bloom filter. Now lets add something:
//Add a few elements
bf.add("Just");
bf.add("a");
bf.add("test.");
This can be done from different threads - all Bloom filters are now thread-safe. An element which was inserted in a Bloom filter will always be returned as being contained (no false negatives):
//Test if they are contained
print(bf.contains("Just")); //true
print(bf.contains("a")); //true
print(bf.contains("test.")); //true
Usually non-inserted elements will not be contained:
//Test with a non-existing element
print(bf.contains("WingDangDoodel")); //false
If we add enough elements, false positives will start occurring:
//Add 300 elements
for (int i = 0; i < 300; i++) {
String element = "Element " + i;
bf.add(element);
}
//test for false positives
for (int i = 300; i < 1000; i++) {
String element = "Element " + i;
if(bf.contains(element)) {
print(element); //two elements: 440, 669
}
}
Let's compare this with the expected amount of false positives:
//Compare with the expected amount
print(bf.getFalsePositiveProbability(303) * 700); //1.74
So our two false positives are in line with the expected amount of 1.74.
and lets "estimate" how many elements are in the filter using statistically sound computations of the amount of bits that are one:
//Estimate cardinality/population
print(bf.getEstimatedPopulation()); //303.6759147801151
This estimation is very good, even though the estimation was performed on a "quite full" Bloom filter (remember, we allowed the false positive probability to be 10% for 1000 elements).
The Bloom filter can be cleared and cloned:
//Clone the Bloom filter
bf.clone();
//Reset it, i.e. delete all elements
bf.clear();
Also elements can be added and queried in bulk:
List<String> bulk = Arrays.asList(new String[] { "one", "two", "three" });
bf.addAll(bulk);
print(bf.containsAll(bulk)); //true
To get the best performance for a given use-case the parameters of the bloom filter must be chosen wisely. So for example we could choose the Bloom filter to use 1000 Bits and then use the best number of hash functions for an expected amount of 6666 inserted elements. We choose Murmur as our hash function which is faster than cryptographic hash functions like MD5:
//Create a more customized Bloom filter
BloomFilter<Integer> bf2 = new FilterBuilder()
.expectedElements(6666) //elements
.size(10000) //bits to use
.hashFunction(HashMethod.Murmur3) //our hash
.buildBloomFilter();
print("#Hashes:" + bf2.getHashes()); //2
print(FilterBuilder.optimalK(6666, 10000)); //you can also do these calculations yourself
Bloom filters allow other cool stuff too. Consider for instance that you collected two Bloom filters which are compatible in their parameters. Now you want to consolidate their elements. This is achieved by ORing the respective Bit-Arrays of the Bloom filters:
//Create two Bloom filters with equal parameters
BloomFilter<String> one = new FilterBuilder(100, 0.1).buildBloomFilter();
BloomFilter<String> other = new FilterBuilder(100, 0.1).buildBloomFilter();
one.add("this");
other.add("that");
one.union(other);
print(one.contains("this")); //true
print(one.contains("that")); //true
The good thing about the union()
operation is, that it returns the exact Bloom filter which would have been created, if all elements were inserted in one Bloom filter from the get-go. There is a similar intersect
operation that ANDs the Bit-Arrays. It does however behave slightly different as it does not return the Bloom filter that only contains the intersection. It guarantees to have all elements of the intersection but the false positive rate might be slightly higher than that of the pure intersection:
other.add("this");
other.add("boggles");
one.intersect(other);
print(one.contains("this")); //true
print(one.contains("boggles")); //false
//Create a Counting Bloom filter that has a FP rate of 0.01 when 1000 are inserted
//and uses 4 bits for counting
CountingBloomFilter<String> cbf = new FilterBuilder(1000, 0.01).buildCountingBloomFilter();
cbf.add("http://google.com");
cbf.add("http://twitter.com");
print(cbf.contains("http://google.com")); //true
print(cbf.contains("http://twitter.com")); //true
If you insert one distinct item multiple times, the same counter always get updated so you should pick a higher c so that 2^c > inserted_copies. When 8, 16, 32, 64 bits are specified as the counter size, internally an optimized short-, byte-, int- resp. long-array will be used, whereas other sizes will use a custom bit vector build on the Java BitSet. For optimal performance in terms of time complexity you should therefore prefer 8, 16, 32, 64 counting bits.
The Counting Bloom Filter extends the normal Bloom Filter by remove
and removeAll
methods:
//What only the Counting Bloom filter can do:
cbf.remove("http://google.com");
print(cbf.contains("http://google.com")); //false
To handle overflows (which is unlikely to ever be an issue) you can set an overflow callback:
//Use the Memory Bloom filter explicitly (for the overflow method):
FilterBuilder fb = new FilterBuilder(1000, 0.01).countingBits(4);
CountingBloomFilterMemory<String> filter = new CountingBloomFilterMemory<>(fb);
filter.setOverflowHandler(() -> print("ups"));
for (int i = 1; i < 20; i++) {
print("Round " + i);
filter.add("http://example.com"); //Causes onOverflow() in Round >= 16
}
To understand the inner workings of the Counting Bloom filter lets actually look at the bits of a small filter:
CountingBloomFilter<String> small = new FilterBuilder(3, 0.2)
.countingBits(4)
.buildCountingBloomFilter();
small.add("One"); small.add("Two"); small.add("Three");
print(small.toString());
This gives:
Bloom Filter Parameters: size = 11, hashes = 3, Bits: {0, 2, 6, 8, 10}
1 0001
0 0000
1 0001
0 0000
0 0000
0 0000
1 0001
0 0000
1 0001
0 0000
1 0101
The Counting Bloom filter thus has a bit size of 11, uses 3 hash functions and 4 bits for counting. The first row is the materialized bit array of all counters > 0. Explicitly saving it makes contains
calls fast and generation when transferring the Counting Bloom Filter flattened to a Bloom filter.
These kind of use-cases are ideal for the Redis-backed Bloom filters of this library. They have the same Java Interfaces as the normal and Counting Bloom filter but store the Bloom filter bits in the in-memory key-value store Redis.
Reasons to use these Redis-backed Bloom filters instead of their pure Java brothers are:
- Distributed Access to on Bloom filter
- Persistence Requirements (e.g. saving the Bloom filter to disk once every second)
- Scalability of the Bloom Filter beyond one machine using replication to speed up all read operations
Using the Redis-backed Bloom filter is straightforward:
- Install Redis. This is extremely easy: see Redis Installation.
- Start Redis with
$ redis-server
. The server will listen on port 6379. - In your application (might be on a different machine) instantiate a Redis-backed Bloom filter giving the IP or host name of Redis and its port and the number of concurrent connections to the FilterBuilder using
redisHost
,redisPort
,redisConnections
.
The Redis-backed Bloom filters have the same interface as the normal Bloom filters and can be constructed through the FilterBuilder, too:
String host = "localhost";
int port = 6379;
String filterName = "normalbloomfilter";
//Open a Redis-backed Bloom filter
BloomFilter<String> bfr = new FilterBuilder(1000, 0.01)
.name(filterName) //use a distinct name
.redisBacked(true)
.redisHost(host) //Default is localhost
.redisPort(port) //Default is standard 6379
.buildBloomFilter();
bfr.add("cow");
//Open the same Bloom filter from anywhere else
BloomFilter<String> bfr2 = new FilterBuilder(1000, 0.01)
.name(filterName) //load the same filter
.redisBacked(true)
.buildBloomFilter();
bfr2.add("bison");
print(bfr.contains("cow")); //true
print(bfr.contains("bison")); //true
The Redis-backed Bloom filters are concurrency/thread-safe at the backend as-well-as in Java. That means you can concurrently insert from any machine without running into anomalies, inconsistencies or lost data. The Redis-backed Bloom filters are implemented using efficient Redis bit arrays. They make heavy use of pipelining so that every add
and contains
call only needs one round-trip. This is the most performance critical aspect and usually not found in other implementations which need one round-trip for every Bit or worse. Moreover, Redis connections are pooled so they are reused, while profiting from concurrent use.
The Redis-backed Bloom filters save their metadata (like number and kind of hash functions) in Redis, too. Thus other clients can easily to connect to a Redis instance that already holds a Bloom filter with a given name and specify whether to use or overwrite it.
## Redis Counting Bloom Filters The Redis Counting Bloom filter saves the counters as separate counters in a compact [Redis hash](http://redis .io/commands#hash) and keeps the materialized flat Bloom filter as a bit array. It is compatatible with Redis 2.4 or higher.//Open a Redis-backed Counting Bloom filter
CountingBloomFilter<String> cbfr = new FilterBuilder(10000, 0.01)
.name("myfilter")
.overwriteIfExists(true) //instead of loading it, overwrite it if it's already there
.redisBacked(true)
.buildCountingBloomFilter();
cbfr.add("cow");
//Open a second Redis-backed Bloom filter with a new connection
CountingBloomFilter<String> bfr2 = new FilterBuilder(10000, 0.01)
.name("myfilter") //this time it will be load it
.redisBacked(true)
.buildCountingBloomFilter();
bfr2.add("bison");
bfr2.remove("cow");
print(cbfr.contains("bison")); //true
print(cbfr.contains("cow")); //false
CountingBloomFilter<String> filter = new FilterBuilder(m,k)
.name("slavetest")
.redisBacked(true)
.addReadSlave(host, port +1); //add slave
.addReadSlave(host, port +2); //and another
.buildCountingBloomFilter() //or a normal one
filter.containsAll(items); //directed to the slave
filter.getEstimatedPopulation(); //that one too
filter.getEstimatedCount("abc"); //dito
filter.getBitSet(); //and again
In the following example the Sentinel Nodes are a simple Set of form "host:port" and the sentinelClusterName is the name of Sentinel you want to connect to
return new BloomFilterRedis<>(new FilterBuilder(n, p).hashFunction(hm)
.redisBacked(true)
.name(name)
.pool(RedisPool.sentinelBuilder()
.master(sentinelClusterName)
.sentinels(getSentinelNodes())
.database(database)
.redisConnections(connections)
.build())
.overwriteIfExists(overwrite)
.redisConnections(connections).complete());
Moreover, the Memory Counting Bloom filter can also be serialized and deserialized in the normal Java way.
## Hash Functions There is a detailed description of the available hash functions in the Javadocs of the HashMethod enum. Hash uniformity (i.e. all bits of the Bloom filter being equally likely) is of great importance for the false positive rate. But there is also an inherent trade-off between hash uniformity and speed of computation. For instance cryptographic hash functions have very good distribution properties but are very CPU intensive. Pseudorandom number generators like the [linear congruential generator](http://en.wikipedia.org/wiki/Linear_congruential_generator) are easy to compute but do not have perfectly random outputs but rather certain distribution patterns which for some inputs are notable and for others are negligible. The implementations of all hash functions are part of the BloomFilter class and use tricks like [rejection sampling](https://en.wikipedia.org/wiki/Rejection_sampling) to get the best possible distribution for the respective hash function type.Here is a Box plot overview of how good the different hash functions perform (Intel i7 with 4 cores, 16 GB RAM). The configuration is 100000 hashes using k = 10, m = 1000 averaged over 20 runs.
Speed of computation doesn't tell anything about the quality of hash values. A good hash function is one, which has a discrete uniform distribution of outputs. That means that every bit of the Bloom filter's bit vector is equally likely to bet set. To measure if and how good the hash functions follow a uniform distribution goodness of fit Chi-Square hypothesis tests are the mathematical instrument of choice.
Here are some of the results. The inputs are random strings. The p-value is the probability of getting a statistical result that is at least as extreme as the obtained result. So the usual way of hypothesis testing would be rejecting the null hypothesis ("the hash hash function is uniformly distributed") if the p-value is smaller than 0.05. We did 100 Chi-Square Tests:
If about 5 runs fail the test an 95 pass it, we can be very confident that the hash function is indeed uniformly distributed. For random inputs it is relatively easy though, so we also tested other input distribution, e.g. increasing integers:
Here the LCG is too evenly distributed (due to its modulo arithmetics) which is a good thing here, but shows, that LCGs do not have a random uniform distribution.
The performance optimization of using two hash functions and combining them through the formula hash_i = hash1 + i * hash2 as suggested by Kirsch and Mitzenmacher is theoretically sound as asymptotically hash values are perfectly uniform given to perfect hash values. In practice however, the distribution grows uneven for some inputs (the Cassandra team which uses this trick should have a look at that).
Now a real example of inserting random words in the Bloom filter with the resulting false positive rate after 30000 inserted elements demanding a false positive probability of 0.01:
Hash function | Speed (ms) | f during insert (%) | f final (%) |
---|---|---|---|
RNG | 80.529251 | 0,138 | 1,024 |
CarterWegman | 1007.66743 | 0,199 | 0,956 |
SecureRNG | 309.619582 | 0,185 | 0,902 |
CRC32 | 67.484589 | 0,121 | 1,007 |
Adler32 | 83.327968 | 10,074 | 22,539 |
Murmur2 | 100.20518 | 0,175 | 1,047 |
Murmur3 | 100.243902 | 0,189 | 0,993 |
Murmur3KirschMitzenmacher | 45.999454 | 0,162 | 0,852 |
FNVWithLCG | 42.25881 | 0,145 | 0,960 |
MD2 | 3635.465565 | 0,138 | 0,869 |
MD5 | 207.823532 | 0,148 | 0,936 |
SHA1 | 213.755936 | 0,189 | 0,923 |
SHA256 | 223.060422 | 0,165 | 1,054 |
SHA384 | 176.003345 | 0,165 | 0,832 |
SHA512 | 172.444648 | 0,152 | 1,064 |
In summary, cryptographic hash functions offer the most consistent uniform distribution, but are slightly more expensive to compute. LCGs, for instance Java Random, perform quite well in most cases and are cheap to compute. The best compromise seems to be the Murmur 3 hash function, which has a good distribution and is quite fast to compute.
It's also possible to provide a custom hash function:
BloomFilter<String> bf = new FilterBuilder(1000, 0.01)
.hashFunction((value, m, k) -> null)
.buildBloomFilter();
//Test the performance of the in-memory Bloom filter
BloomFilter<String> bf = new FilterBuilder(100_000, 0.01).hashFunction(HashMethod.Murmur3).buildBloomFilter();
MemoryBFTest.benchmark(bf, "Normal Bloom Filter", 1_000_000);
This gives over 2 Mio operations per second (on my laptop):
Normal Bloom Filter
hashes = 7 falsePositiveProbability = 3.529780138533512E-281 expectedElements = 1000000 size = 958506
add(): 0.687s, 1455604.0757 elements/s
addAll(): 0.47s, 2127659.5745 elements/s
contains(), existing: 0.472s, 2118644.0678 elements/s
contains(), nonexisting: 0.445s, 2247191.0112 elements/s
100000 hash() calls: 0.008s, 1.25E7 elements/s
Hash Quality (Chi-Squared-Test): p-value = 0.8041807628127277 , Chi-Squared-Statistic = 957318.7388845441
The Redis-backed and Counting Bloom filters can be tested similarly.
## Overview of Probabilistic Data StructuresData Structure | Set membership: „Have I seen this item before?“ | Frequency estimation: “How many of this kind have I seen?” | Cardinality estimation: “How many distinct items have I seen in total” | Item removal | Persistence and distributed access |
---|---|---|---|---|---|
Memory Bloom Filter | Yes, with configurable false positive probability, O(1) | No | Yes, O(#bits) | No | No |
Memory Counting Bloom Filter | Yes, with configurable false positive probability, O(1) | Yes (Minimum Selection Algorithm), O(1) | Yes, O(#bits) | Yes, O(1) | No |
Redis Bloom Filter | Yes, with configurable false positive probability, O(1), single roundtrip, scalable through read slaves | No | Yes, O(#bits), single roundtrip, scalable through read slaves | No | Yes, configurable Redis persistence & replication |
Redis Counting Bloom Filter | Yes, with configurable false positive probability, O(1) , single roundtrip, scalable through read slaves | Yes (Minimum Selection Algorithm), O(1) , single roundtrip, scalable through read slaves | Yes, O(#bits), single roundtrip, scalable through read slaves | Yes, O(1), in average 2 roundtrips | Yes, configurable Redis persistence & replication |
Other sketches (not part of this lib) | Hashsets, Bitvectors, Cuckoo Filter | Count-Min-Sketch, Count-Mean-Sketch | K-Minimum-Values, HyperLogLog |
Up next
- Compatible Javascript implementation which can consume the JSON Bloom filter representation
- Redis Bloom filters that leverage Redis Cluster (once it's ready)
License
This Bloom filter library is published under the very permissive MIT license, see LICENSE file.