Apr 21

Continuing the theme of implementing simple datastructures in python and javascript, here’s a simple counting bloom filter implementation in python and javascript which I’d written for Tagz. I’d almost forgotten about this, until a thread today on compsci.reddit reminded me of it. With this implementation, you can build a bloom filter in python and add/remove/lookup elements. You can also serialize it to JSON, send it over the wire to a javascript client and use the same filter from javascript code, which can come in handy at times.

Bloom filters are some of the simplest and yet the coolest of all probabilistic datastructures. Basically, are a set like datastructure, which you train on your dataset and then later use it to quickly check if it was trained on a certain value. The cool part is, unlike hashtables or trees, if you train a bloom filter on N elements (say 10,000) it doesn’t actually store any of those elements, so its like a compressed representation of your dataset, which can be used to do quick hit testing. But this comes at a price. Bloom filters might lie at times :) . But its still useful, nonetheless. Every time we query a bloom filter, its like asking it a question “Were you trained with the value X ?”, to which it replies with a yes or a no (a boolean result), so, there are 3 possibilites.

  1. It tells the truth, which is the most likely case and all is fine.
  2. It might say it was trained on X, while it really wasn’t. (A false positive)
  3. It might say it wasn’t trained on X, while it actually was trained on it. (A false negative)

Ok, so the property which makes bloom filters really useful is that it never gives false negatives, so #3 never happens. So, it mostly tells the truth, but in the unlikely event that it lies, it always errs on the conservative side. This, combined with the property that they don’t have to store the actual elements they were trained on, make bloom filters incredibly useful in a lot of of circumstances.

One place I used them in Tagz is to check if a user has voted on a particular link. If the filter replies afirmatively then we can go query the backend (DB, Memcached, whatever) to get the actual score for the vote. Assuming that the bloom filter lies 5% of the times, we still eliminate 95% of the backend queries in this case. And the best part is that the rate at which it lies can be tuned. Its really a compromise between the size of the filter and the rate at which it lies. Simply put, the bigger a bloom filter, the lesser it lies. So, the constructor for the SimpleBloomFilter class in my implementation takes 2 parameters, capacity and err.

Another application for bloom filters is to synchronize stuff. Lets say, you have to update a lot of items with a certain keys between two machines A & B over a slow connection, where each of them have quite a few items between them. So, A, instead of sending all the keys it has to B, can instead build a compact bloom filter out the keys it has and send it to B, B can test its item set against the bloom filter and send only the items it doesn’t find in it. The elaborate setup I just described is called a Bloom Join.

One more thing, this is a variant of a bloom filter called a counting bloom filter. Plain old bloom filters are add only structures. You can train it on an element, but you can’t untrain it on something which its already been trained on. Counting bloom filters gain the ability to be untrained on elements by consuming a little more space. Also, I noticed that the data in the filter tends to be pretty sparse, so I do some simple RLE compression on it before serializing it to JSON which makes it a lot more compact.

That concludes my watered down explanation of a counting bloom filter. The Wikipedia link i’d mentioned in the beginning has a much better explanation of bloom filters. As usual, the code can be found on bitbucket. All the code is MIT Licensed except for the 3rd party sha256.js javascript module from Christoph Bichlmeier.

PS: Whats next ? BK-Trees anyone ?

Apr 11

There are a couple of places in which Tagz, where I needed efficient prefix matching. The most obvious way to do this is to use a Trie or a Ternary Search Tree. So, I ended up implementing both in Python. I’ve had this stuff lying around in my mercurial repo for Tagz for quite some time now. I just thought of releasing it today.

Interestingly, some crude benchmarks indicate that the trie implementation is more efficient than the ternary search tree implementation in terms of both speed and space (Look for test.py in the repo), I also ported the trie module to javascript, and its included in the repo.

Here are two GraphViz graphs visualizing both the structures, generated using the excellent GvGen library.

In the Ternary Search Tree graph, Blue vertices = left, Green vertices = middle, Red vertices = right.

The code can be found here.