Tuesday 6 November 2018

Bloom Filter

A bloom filter is a rapid and memory efficient data structure, used to test and tell whether an element is present in a set or not. However, the efficiency comes for a price i.e. it is a probabilistic data structure. It can help us identify that either element is not in the set or may be present in the set.
The base data structure of a Bloom Filter is a Bit Vector. A Bit Vector looks like this,



It has empty cells in the table, each representing a bit, and the number below it is its index like indices in array. To add an element to the Bloom Filter, we hash the element a few times, which gives us hashes. Then we set the bits in the Bit Vector to 1, which correspond to the hashes.
Illustrating an example, we will use Fnv and Murmur hash functions here.
When we hash a string “hello”, fnv results 6 and murmur results 0. So, the Bloom Filter looks like the one below.In reality these cells are not colored but the bits are set to 1.


Now to test whether a string is already in the Bloom Filter, we simply hash the string with the same hash functions. Then we check if those values are set in the bit vector. If they aren't, we know that the element isn't in the set. If they are, we only know that it might be, because another element or some combination of other elements could have set the same bits.
Choosing Hash functions for Bloom Filters
The hash functions used in a Bloom filter should be independent, fast and uniformly distributed. Cryptographic hashes such as sha1, though widely used therefore are not very good choices.  Examples of fast, simple hashes that are independent enough3 include murmur, the fnv series of hashes, and HashMix.
How big should I make my Bloom filter?
The false positive of a Bloom filter can be modified. A larger filter will have less false positives, and a smaller one will have more.

How many hash functions should I use?
The more hash functions we use, the slower is the bloom filter, and the quicker it fills up. On the other hand, if we have too few, we may suffer too many false positives. So, we need to maintain the right balance.

No comments:

Post a Comment