A bloom filter is a rapid and memory efficient data structure, used to test and tell whether an element is present in a set or not. However, the efficiency comes for a price i.e. it is a probabilistic data structure. It can help us identify that either element is not in the set or may be present in the set.
Illustrating an example, we will use Fnv and Murmur hash functions here.
When we hash a string “hello”, fnv results 6 and murmur results 0. So, the Bloom Filter looks like the one below.In reality these cells are not colored but the bits are set to 1.
Now to test whether a string is already in the Bloom Filter, we simply hash the string with the same hash functions. Then we check if those values are set in the bit vector. If they aren't, we know that the element isn't in the set. If they are, we only know that it might be, because another element or some combination of other elements could have set the same bits.
Choosing Hash functions for Bloom Filters
The hash functions used in a Bloom filter should be independent, fast and uniformly distributed. Cryptographic hashes such as sha1, though widely used therefore are not very good choices. Examples of fast, simple hashes that are independent enough3 include murmur, the fnv series of hashes, and HashMix.
How big should I make my Bloom filter?
The false positive of a Bloom filter can be modified. A larger filter will have less false positives, and a smaller one will have more.
The false positive of a Bloom filter can be modified. A larger filter will have less false positives, and a smaller one will have more.