Bloom filters are some of the simplest, and yet the coolest, of all probabilistic data structures. Basically, a Bloom filter is a set-like data structure: you train it on your dataset and later use it to quickly check whether it was trained on a certain value. The cool part is that, unlike hashtables or trees, if you train a Bloom filter on N elements (say 10,000) it doesn't actually store any of those elements, so it's like a compressed representation of your dataset which can be used for quick hit testing. But this comes at a price: Bloom filters might lie at times. They're still useful, nonetheless. Every time we query a Bloom filter, it's like asking it the question "Were you trained on the value X?", to which it replies with a yes or a no (a boolean result). So there are three possibilities:
- It tells the truth, which is the most likely case and all is fine.
- It might say it was trained on X, while it really wasn’t. (A false positive)
- It might say it wasn’t trained on X, while it actually was trained on it. (A false negative)
Ok, so the property which makes Bloom filters really useful is that they never give false negatives, so the third case never happens. A Bloom filter mostly tells the truth, but in the unlikely event that it lies, it always errs on the conservative side. This, combined with the fact that they don't have to store the actual elements they were trained on, makes Bloom filters incredibly useful in a lot of circumstances.
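The core idea can be sketched in a few lines of Python. This is a toy illustration, not the SimpleBloomFilter implementation mentioned below; the class name, the double-hashing scheme, and the parameters are all my own assumptions:

```python
import hashlib

class BloomFilter:
    """A toy Bloom filter: a bit array plus k hash-derived positions per value."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, value):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(value.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        # "Train" the filter on a value: set all k bits for it.
        for pos in self._positions(value):
            self.bits[pos] = True

    def __contains__(self, value):
        # All k bits set -> "probably yes" (maybe a false positive).
        # Any bit unset -> "definitely no" (never a false negative).
        return all(self.bits[pos] for pos in self._positions(value))
```

Note the asymmetry in `__contains__`: a value the filter was trained on will always have all its bits set, which is exactly why false negatives can't happen.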
One place I used them in Tagz is to check if a user has voted on a particular link. If the filter replies affirmatively, then we go query the backend (DB, Memcached, whatever) to get the actual score for the vote. Assuming the Bloom filter lies 5% of the time, we still eliminate about 95% of the unnecessary backend queries in this case. And the best part is that the rate at which it lies can be tuned. It's really a compromise between the size of the filter and the rate at which it lies: simply put, the bigger a Bloom filter, the less it lies. So the constructor for the SimpleBloomFilter class in my implementation takes 2 parameters, capacity and err.
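That capacity/err trade-off follows a standard formula: for n expected elements and a target false-positive rate p, the optimal filter uses about m = -n·ln(p)/(ln 2)² bits and k = (m/n)·ln 2 hash functions. Here's a sketch of that math (the function name is mine, and this is the textbook derivation, not necessarily what SimpleBloomFilter does internally):

```python
import math

def bloom_parameters(capacity, err):
    """Optimal bit count and hash count for a Bloom filter.

    capacity: expected number of elements (n)
    err:      acceptable false-positive rate (p), e.g. 0.05
    """
    # m = -n * ln(p) / (ln 2)^2
    num_bits = math.ceil(-capacity * math.log(err) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to the nearest whole hash function
    num_hashes = max(1, round(num_bits / capacity * math.log(2)))
    return num_bits, num_hashes
```

For capacity=10,000 and err=0.05 this comes out to roughly 62,000 bits (under 8 KB) and 4 hash functions, which is the "compressed representation" in concrete terms.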
Another application for Bloom filters is synchronizing stuff. Let's say you have to sync a lot of keyed items between two machines A & B over a slow connection, where each machine holds quite a few items. Instead of sending all the keys it has to B, A can build a compact Bloom filter out of its keys and send that to B; B can then test its own item set against the filter and send back only the items it doesn't find in it. The elaborate setup I just described is called a Bloom Join.
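A toy version of that exchange might look like this (the key sets, filter size, and helper names are made up for illustration):

```python
import hashlib

def _positions(value, num_bits=4096, num_hashes=4):
    # Same double-hashing trick: k positions from one SHA-256 digest.
    digest = hashlib.sha256(value.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big")
    return [(h1 + i * h2) % num_bits for i in range(num_hashes)]

def build_filter(keys, num_bits=4096):
    bits = [False] * num_bits
    for key in keys:
        for pos in _positions(key, num_bits):
            bits[pos] = True
    return bits

# Machine A: build a compact filter over its keys and "send" it to B.
a_keys = {"k1", "k2", "k3"}
bits_from_a = build_filter(a_keys)

# Machine B: ship back only the items whose keys A definitely doesn't have.
b_items = {"k2": "payload2", "k4": "payload4", "k5": "payload5"}
to_send = {k: v for k, v in b_items.items()
           if not all(bits_from_a[p] for p in _positions(k))}
```

One caveat: because of false positives, B may occasionally withhold an item A doesn't actually have, so a real Bloom join has to tolerate (or reconcile) that small miss rate.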
One more thing: there is a variant of the Bloom filter called a counting Bloom filter. Plain old Bloom filters are add-only structures: you can train one on an element, but you can't untrain it on something it's already been trained on. Counting Bloom filters gain the ability to be untrained on elements by consuming a little more space. Also, I noticed that the data in the filter tends to be pretty sparse, so I do some simple RLE compression on it before serializing it to JSON, which makes it a lot more compact.
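The counting variant swaps each bit for a small counter, so removal becomes a decrement. A rough sketch, under the same assumptions as before (names and hashing scheme are my own, not the post's implementation):

```python
import hashlib

class CountingBloomFilter:
    """Toy counting Bloom filter: counters instead of bits, so elements
    can be removed ("untrained") at the cost of extra space per slot."""

    def __init__(self, num_counters=1024, num_hashes=4):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _positions(self, value):
        digest = hashlib.sha256(value.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % len(self.counters)
                for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.counters[pos] += 1

    def remove(self, value):
        # Only safe for values that were actually added; removing an
        # untrained value would corrupt the counters.
        for pos in self._positions(value):
            self.counters[pos] -= 1

    def __contains__(self, value):
        return all(self.counters[pos] > 0 for pos in self._positions(value))
```

Since most counters stay at zero, the sparseness the post mentions shows up here too, which is what makes run-length encoding pay off before serialization.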
PS: What's next? BK-Trees, anyone?