Apr 21

Continuing the theme of implementing simple datastructures in python and javascript, here’s a simple counting bloom filter implementation in python and javascript which I’d written for Tagz. I’d almost forgotten about this, until a thread today on compsci.reddit reminded me of it. With this implementation, you can build a bloom filter in python and add/remove/lookup elements. You can also serialize it to JSON, send it over the wire to a javascript client and use the same filter from javascript code, which can come in handy at times.

Bloom filters are some of the simplest and yet the coolest of all probabilistic datastructures. Basically, are a set like datastructure, which you train on your dataset and then later use it to quickly check if it was trained on a certain value. The cool part is, unlike hashtables or trees, if you train a bloom filter on N elements (say 10,000) it doesn’t actually store any of those elements, so its like a compressed representation of your dataset, which can be used to do quick hit testing. But this comes at a price. Bloom filters might lie at times :) . But its still useful, nonetheless. Every time we query a bloom filter, its like asking it a question “Were you trained with the value X ?”, to which it replies with a yes or a no (a boolean result), so, there are 3 possibilites.

  1. It tells the truth, which is the most likely case and all is fine.
  2. It might say it was trained on X, while it really wasn’t. (A false positive)
  3. It might say it wasn’t trained on X, while it actually was trained on it. (A false negative)

Ok, so the property which makes bloom filters really useful is that it never gives false negatives, so #3 never happens. So, it mostly tells the truth, but in the unlikely event that it lies, it always errs on the conservative side. This, combined with the property that they don’t have to store the actual elements they were trained on, make bloom filters incredibly useful in a lot of of circumstances.

One place I used them in Tagz is to check if a user has voted on a particular link. If the filter replies afirmatively then we can go query the backend (DB, Memcached, whatever) to get the actual score for the vote. Assuming that the bloom filter lies 5% of the times, we still eliminate 95% of the backend queries in this case. And the best part is that the rate at which it lies can be tuned. Its really a compromise between the size of the filter and the rate at which it lies. Simply put, the bigger a bloom filter, the lesser it lies. So, the constructor for the SimpleBloomFilter class in my implementation takes 2 parameters, capacity and err.

Another application for bloom filters is to synchronize stuff. Lets say, you have to update a lot of items with a certain keys between two machines A & B over a slow connection, where each of them have quite a few items between them. So, A, instead of sending all the keys it has to B, can instead build a compact bloom filter out the keys it has and send it to B, B can test its item set against the bloom filter and send only the items it doesn’t find in it. The elaborate setup I just described is called a Bloom Join.

One more thing, this is a variant of a bloom filter called a counting bloom filter. Plain old bloom filters are add only structures. You can train it on an element, but you can’t untrain it on something which its already been trained on. Counting bloom filters gain the ability to be untrained on elements by consuming a little more space. Also, I noticed that the data in the filter tends to be pretty sparse, so I do some simple RLE compression on it before serializing it to JSON which makes it a lot more compact.

That concludes my watered down explanation of a counting bloom filter. The Wikipedia link i’d mentioned in the beginning has a much better explanation of bloom filters. As usual, the code can be found on bitbucket. All the code is MIT Licensed except for the 3rd party sha256.js javascript module from Christoph Bichlmeier.

PS: Whats next ? BK-Trees anyone ?

Apr 13

Before moving to YUI about a year ago, I was using Mochikit as my primary JS library. As advertised, Mochikit happens to be one of the most pythonic javascript libraries ever. One of the sweetest parts of Mochikit IMO has been Mochikit.DOM. This is something which I’ve always missed with YUI. innerHTML is fast, but icky and it feels a little inelegant. So, I ended up writing something like Mochikit.DOM for YUI while writing Tagz. Thought it might be useful to others as well. So, here’s the mercurial repo with the code.

The utils.js file contains some utility functions like forEach, map, filter, partial. The only function from utils.js used in dombuilder.js is partial, so you might want to add it to dombuilder.js to remove the dependence on utils.js.

Here’s an obligatory trivial example (included in the repo).

Apr 11

There are a couple of places in which Tagz, where I needed efficient prefix matching. The most obvious way to do this is to use a Trie or a Ternary Search Tree. So, I ended up implementing both in Python. I’ve had this stuff lying around in my mercurial repo for Tagz for quite some time now. I just thought of releasing it today.

Interestingly, some crude benchmarks indicate that the trie implementation is more efficient than the ternary search tree implementation in terms of both speed and space (Look for test.py in the repo), I also ported the trie module to javascript, and its included in the repo.

Here are two GraphViz graphs visualizing both the structures, generated using the excellent GvGen library.

In the Ternary Search Tree graph, Blue vertices = left, Green vertices = middle, Red vertices = right.

The code can be found here.

Apr 06

I’ve been using memcached for all the caching on Tagz. Redis is a relatively new key value database which covers a superset of memcached’s functionality. One of the biggest problems I’ve had with memcached (actually it has nothing to do with memcached) is that whenever I store a large datastructure on memcached, deserializing (unpickling) it takes quite a while (only a couple of milliseconds, but it still counts).

This happens whenever I end up storing large lists or dictionaries on memcached. Redis solves this problem effectively by providing list and set primitives besides storing plain old strings. This effectively solves the  aforementioned problem. Also, the list primitive supports atomic push/pop commands, which could be used to implement efficient queues. And this along with the on disk persistance feature solves another problem I have with beanstalkd, which is the lack of persistance.

Overall, this seems like a great solution to quite a few tiny little problems I have with performance on Tagz. I’m planning on playing with redis a little more tonight and if all goes well, I’ll shift from memcached to redis.

Apr 01

Tagz has come a long way since I launched it last September. Something which began as a clean room django application has been accumulating a lot of cruft. One patch at a time, its turned itself into an unmaintainable mess of a codebase.

In retrospect, I feel Python and Postgres weren’t really the best choices I made for writing Tagz. I believe Tagz would be better written in PHP with MySQL as the DB. I’ve come to learn the hard way that Django with Postgresql can’t quite match the blazing speeds possible using raw PHP with MySQL (and MyISAM DBs).

Starting today, I’ve decided on stopping all development on the current code base of Tagz. I’ve begun a rewrite of Tagz in PHP. The current users may rest assured, since backwards compatibility is an important goal for this rewrite. I’m hoping to finish the rewrite in less than a month. I’m expecting the transition to be a smooth one.

Finally, Thanks to all the current users (no thanks to all the spammers) for all the encouragement and the feature requests, without which Tagz would’ve never have come close to what it is today.