Posted on April 21, 2009 at 7:13 pm

Counting bloom filters in python and javascript

Continuing the theme of implementing simple datastructures in python and javascript, here’s a simple counting bloom filter implementation in python and javascript which I’d written for Tagz. I’d almost forgotten about this, until a thread today on compsci.reddit reminded me of it. With this implementation, you can build a bloom filter in python and add/remove/lookup elements. You can also serialize it to JSON, send it over the wire to a javascript client and use the same filter from javascript code, which can come in handy at times.

Bloom filters are some of the simplest and yet the coolest of all probabilistic datastructures. Basically, are a set like datastructure, which you train on your dataset and then later use it to quickly check if it was trained on a certain value. The cool part is, unlike hashtables or trees, if you train a bloom filter on N elements (say 10,000) it doesn’t actually store any of those elements, so its like a compressed representation of your dataset, which can be used to do quick hit testing. But this comes at a price. Bloom filters might lie at times :) . But its still useful, nonetheless. Every time we query a bloom filter, its like asking it a question “Were you trained with the value X ?”, to which it replies with a yes or a no (a boolean result), so, there are 3 possibilites.

  1. It tells the truth, which is the most likely case and all is fine.
  2. It might say it was trained on X, while it really wasn’t. (A false positive)
  3. It might say it wasn’t trained on X, while it actually was trained on it. (A false negative)

Ok, so the property which makes bloom filters really useful is that it never gives false negatives, so #3 never happens. So, it mostly tells the truth, but in the unlikely event that it lies, it always errs on the conservative side. This, combined with the property that they don’t have to store the actual elements they were trained on, make bloom filters incredibly useful in a lot of of circumstances.

One place I used them in Tagz is to check if a user has voted on a particular link. If the filter replies afirmatively then we can go query the backend (DB, Memcached, whatever) to get the actual score for the vote. Assuming that the bloom filter lies 5% of the times, we still eliminate 95% of the backend queries in this case. And the best part is that the rate at which it lies can be tuned. Its really a compromise between the size of the filter and the rate at which it lies. Simply put, the bigger a bloom filter, the lesser it lies. So, the constructor for the SimpleBloomFilter class in my implementation takes 2 parameters, capacity and err.

Another application for bloom filters is to synchronize stuff. Lets say, you have to update a lot of items with a certain keys between two machines A & B over a slow connection, where each of them have quite a few items between them. So, A, instead of sending all the keys it has to B, can instead build a compact bloom filter out the keys it has and send it to B, B can test its item set against the bloom filter and send only the items it doesn’t find in it. The elaborate setup I just described is called a Bloom Join.

One more thing, this is a variant of a bloom filter called a counting bloom filter. Plain old bloom filters are add only structures. You can train it on an element, but you can’t untrain it on something which its already been trained on. Counting bloom filters gain the ability to be untrained on elements by consuming a little more space. Also, I noticed that the data in the filter tends to be pretty sparse, so I do some simple RLE compression on it before serializing it to JSON which makes it a lot more compact.

That concludes my watered down explanation of a counting bloom filter. The Wikipedia link i’d mentioned in the beginning has a much better explanation of bloom filters. As usual, the code can be found on bitbucket. All the code is MIT Licensed except for the 3rd party sha256.js javascript module from Christoph Bichlmeier.

PS: Whats next ? BK-Trees anyone ?

Tags:, ,

5 Responses to “Counting bloom filters in python and javascript”

  1. Pat on April 21st, 2009 at 8:06 PM says:

    Cool stuff, I had thought of trying to implement the very same thing; python for the server side and js for the client. My goal was (is) to check for membership of a headword in a large dictionary, so the idea was to check it in the filter with JS and only then to look it up in SQL on the server side.

    Question: does the bloom.py handle Unicode okay? I tried to feed it some non-ASCII utf-8 text and it threw an error. Going to dig around in the code but I was just wondering if that was a design criterion off the top or not.

    Anyway, thanks for sharing.

  2. Jeethu on April 23rd, 2009 at 5:09 AM says:

    @Pat The exception is due to the hashing module. If you’re adding unicode strings, you should do something like “filter.add(s.encode(‘utf-8′))”.

  3. Browser toolbars and privacy | Jeethu's Blog on June 22nd, 2009 at 1:10 PM says:

    […] for a while, thought the toobars should hash the urls before sending it to the servers, and using bloom filters to reduce the number of time the client would have to call home to check if a user has rated a url. […]

  4. Jason Madsen on March 7th, 2010 at 4:33 AM says:

    This is great! Thanks for your article. I am new at development and this was a big help.

  5. Thanh Do on May 8th, 2011 at 3:16 AM says:

    Thanks for the article. I just have a quick question. We need to initialize the capacity of the bloom filter before hand right? Is there anyway to implement a scalable counting bloom filter?

Leave a Reply