Bloom filters

Summary

Bloom filters are space-efficient data structures used to determine if a given element is not present in a set. A bloom filter could also be called a sometimes-wrong-much-smaller-and-faster hash table.

So it's a hash table?

In a hash table, the original value is stored for exact match comparison while performing a lookup on the hashed version of the value. In a bloom filter however, we do not care about the original value. We are only interested in the existence of the hash and therefore only store boolean values to denote the hash exists. Storing boolean-only values results in an incredibly smaller data set and quicker executions.

The right tool for the job

The caveat with bloom filters, if it hasn’t come to mind yet, is collisions. In a hash table we hash the value to find the position(s) in the data set. When we’ve found one or more matching hashes we then compare the original stored value to the data we are searching for resulting in a 100% true or false match.

In a bloom filter we cannot perform a comparison on the original value as we have only stored true or false, but that doesn’t really matter. A bloom filter is a specialized data structure to be used when we need lookups to be “good enough for who it’s for”.

When would I use this weak ass hash table?

Spell checking
A spell checker is the go to example for bloom filters. If the purpose of our spell checker is to only tell the user that a word is misspelled (squiggly red line) we do not need to make an exact match in a data set, we only need to know that the word exists somewhere in the data set and because it exists in the set, it is misspelled.

Recently viewed/used data
Let’s say we’ve written a messaging application that shows you a list of recent people you’ve chatted with. Every time a conversation happens the server will update our list of recent contacts making a round trip for each conversation. We can prevent this request by iterating over every person in our list comparing values with a loop, or we could use a bloom filter to quickly match the key only.

Real world applications

  • Google Chrome & Bit.ly malicious URL blocking
  • Yahoo's email contact list
  • Bitcoin payment verification
  • IoT networking

Mitigating false positives

To reduce the amount of false positives for our BF we have a handy algorithm that will calculate a decent sized table to help us get into the range of false positives we're comfortable with. The larger our table the lower number of false positives we will receive.

private static int TableSize(int n, double errorRate)
{
    var m = Math.Ceiling(Math.Abs(n * Math.Log(errorRate)) / Math.Pow(Math.Log(2), 2));
    return (int)m;
}

Indexing

Once our table has been created well have a range of elements set to false. To populate our table with data we need to hash our data and create an index value to flip bits in our array.

In the github example I'm using crypto algorithms; they're slow, never use them in the real world. In the real world we use something like a Jenkin's hash 1 which is light and quick.

I run the value through each hasher with the handy LINQ Aggregate method and then convert the value to a useable integer index value.

public int Index(string value, int m)
{
    var hashValue = _hash.Aggregate(Encoding.Unicode.GetBytes(value), (current, hashAlgorithm) => hashAlgorithm.ComputeHash(current));

    var index = Math.Abs(BitConverter.ToInt32(hashValue, 0) % m);
    return index;
}

With each index we can flip the bit in corresponding position in our array

public void Add(string value)
{
    var index = _indexer.Index(value, _m);
    _table[index] = true;
}

Searching

With our data index we can now perform our usually-correct search by hashing and indexing the provided search criteria and returning the assumed index position's value

public bool Contains(string value)
{
    var index = _indexer.Index(value, _m);
    return _table[index];
}

Conclusion

The use case for bloom filters is basically anywhere you would normally iterate and compare but are 100% ok with false positives and want a very efficient way to determine if something may exist or not.