Code Salty: Bloom Filter

Tuesday, 6 November 2018

Bloom Filter

A bloom filter is a rapid and memory efficient data structure, used to test and tell whether an element is present in a set or not. However, the efficiency comes for a price i.e. it is a probabilistic data structure. It can help us identify that either element is not in the set or may be present in the set.

The base data structure of a Bloom Filter is a Bit Vector. A Bit Vector looks like this,

It has empty cells in the table, each representing a bit, and the number below it is its index like indices in array. To add an element to the Bloom Filter, we hash the element a few times, which gives us hashes. Then we set the bits in the Bit Vector to 1, which correspond to the hashes.

Illustrating an example, we will use Fnv and Murmur hash functions here.

When we hash a string “hello”, fnv results 6 and murmur results 0. So, the Bloom Filter looks like the one below.In reality these cells are not colored but the bits are set to 1.

Now to test whether a string is already in the Bloom Filter, we simply hash the string with the same hash functions. Then we check if those values are set in the bit vector. If they aren't, we know that the element isn't in the set. If they are, we only know that it might be, because another element or some combination of other elements could have set the same bits.

Choosing Hash functions for Bloom Filters

The hash functions used in a Bloom filter should be independent, fast and uniformly distributed. Cryptographic hashes such as sha1, though widely used therefore are not very good choices. Examples of fast, simple hashes that are independent enough3 include murmur, the fnv series of hashes, and HashMix.

How big should I make my Bloom filter?
The false positive of a Bloom filter can be modified. A larger filter will have less false positives, and a smaller one will have more.

How many hash functions should I use?
The more hash functions we use, the slower is the bloom filter, and the quicker it fills up. On the other hand, if we have too few, we may suffer too many false positives. So, we need to maintain the right balance.

Merkle Tree

Merkle Trees

Merkle trees are data structure trees where each non-leaf node is a hash of its respective child node. The leaf nodes are the lowest tier of nodes in the tree. It was named after Ralph Merkle who patented this concept in 1979.

Merkle Trees help in efficient and secure verification of large data structures. These are a fundamental component of blockchains because in the case of blockchains, potentially boundless data sets are used. The implementation of Merkle trees in blockchains has multiple effects. It allows them to scale and provide the hash-based architecture for them to maintain data integrity. Cryptographic hash functions are the underlying technology that allows for Merkle trees to work.

It is easier to understand the concept using the above pictorial view. Notice the non-leaf nodes or branches are represented by hashes of their respective children. The pictorial example is the most common and simple form of a Merkle Tree known as a Binary Merkle Tree. The top hash of the tree is known as the root hash. So, we can also understand Merkle Tree as a data structure which can take ‘n’ number of elements and represent it with a single hash.

How is this convenient for data verification?
The tree structure allows efficient mapping of an arbitrarily large amount of data. And helps in easy identification of where changes in data occur. This is also the underlying basis of Merkle proofs. It can be verified that the data chunk is consistent with the root hash by checking only a small subset of the hashes rather than the entire data set.

Source: https://blockonomi.com/merkle-tree/

Code Salty

Pages

Tuesday, 6 November 2018

Bloom Filter

How many hash functions should I use?
The more hash functions we use, the slower is the bloom filter, and the quicker it fills up. On the other hand, if we have too few, we may suffer too many false positives. So, we need to maintain the right balance.

Merkle Tree

About Me

Pages

Tuesday, 6 November 2018

Bloom Filter

How many hash functions should I use? The more hash functions we use, the slower is the bloom filter, and the quicker it fills up. On the other hand, if we have too few, we may suffer too many false positives. So, we need to maintain the right balance.

Merkle Tree

How many hash functions should I use?
The more hash functions we use, the slower is the bloom filter, and the quicker it fills up. On the other hand, if we have too few, we may suffer too many false positives. So, we need to maintain the right balance.