Apr 21

Continuing the theme of implementing simple datastructures in python and javascript, here’s a simple counting bloom filter implementation in python and javascript which I’d written for Tagz. I’d almost forgotten about this, until a thread today on compsci.reddit reminded me of it. With this implementation, you can build a bloom filter in python and add/remove/lookup elements. You can also serialize it to JSON, send it over the wire to a javascript client and use the same filter from javascript code, which can come in handy at times.

Bloom filters are some of the simplest and yet the coolest of all probabilistic datastructures. Basically, are a set like datastructure, which you train on your dataset and then later use it to quickly check if it was trained on a certain value. The cool part is, unlike hashtables or trees, if you train a bloom filter on N elements (say 10,000) it doesn’t actually store any of those elements, so its like a compressed representation of your dataset, which can be used to do quick hit testing. But this comes at a price. Bloom filters might lie at times :) . But its still useful, nonetheless. Every time we query a bloom filter, its like asking it a question “Were you trained with the value X ?”, to which it replies with a yes or a no (a boolean result), so, there are 3 possibilites.

  1. It tells the truth, which is the most likely case and all is fine.
  2. It might say it was trained on X, while it really wasn’t. (A false positive)
  3. It might say it wasn’t trained on X, while it actually was trained on it. (A false negative)

Ok, so the property which makes bloom filters really useful is that it never gives false negatives, so #3 never happens. So, it mostly tells the truth, but in the unlikely event that it lies, it always errs on the conservative side. This, combined with the property that they don’t have to store the actual elements they were trained on, make bloom filters incredibly useful in a lot of of circumstances.

One place I used them in Tagz is to check if a user has voted on a particular link. If the filter replies afirmatively then we can go query the backend (DB, Memcached, whatever) to get the actual score for the vote. Assuming that the bloom filter lies 5% of the times, we still eliminate 95% of the backend queries in this case. And the best part is that the rate at which it lies can be tuned. Its really a compromise between the size of the filter and the rate at which it lies. Simply put, the bigger a bloom filter, the lesser it lies. So, the constructor for the SimpleBloomFilter class in my implementation takes 2 parameters, capacity and err.

Another application for bloom filters is to synchronize stuff. Lets say, you have to update a lot of items with a certain keys between two machines A & B over a slow connection, where each of them have quite a few items between them. So, A, instead of sending all the keys it has to B, can instead build a compact bloom filter out the keys it has and send it to B, B can test its item set against the bloom filter and send only the items it doesn’t find in it. The elaborate setup I just described is called a Bloom Join.

One more thing, this is a variant of a bloom filter called a counting bloom filter. Plain old bloom filters are add only structures. You can train it on an element, but you can’t untrain it on something which its already been trained on. Counting bloom filters gain the ability to be untrained on elements by consuming a little more space. Also, I noticed that the data in the filter tends to be pretty sparse, so I do some simple RLE compression on it before serializing it to JSON which makes it a lot more compact.

That concludes my watered down explanation of a counting bloom filter. The Wikipedia link i’d mentioned in the beginning has a much better explanation of bloom filters. As usual, the code can be found on bitbucket. All the code is MIT Licensed except for the 3rd party sha256.js javascript module from Christoph Bichlmeier.

PS: Whats next ? BK-Trees anyone ?

Apr 11

There are a couple of places in which Tagz, where I needed efficient prefix matching. The most obvious way to do this is to use a Trie or a Ternary Search Tree. So, I ended up implementing both in Python. I’ve had this stuff lying around in my mercurial repo for Tagz for quite some time now. I just thought of releasing it today.

Interestingly, some crude benchmarks indicate that the trie implementation is more efficient than the ternary search tree implementation in terms of both speed and space (Look for test.py in the repo), I also ported the trie module to javascript, and its included in the repo.

Here are two GraphViz graphs visualizing both the structures, generated using the excellent GvGen library.

In the Ternary Search Tree graph, Blue vertices = left, Green vertices = middle, Red vertices = right.

The code can be found here.

Apr 06

I’ve been using memcached for all the caching on Tagz. Redis is a relatively new key value database which covers a superset of memcached’s functionality. One of the biggest problems I’ve had with memcached (actually it has nothing to do with memcached) is that whenever I store a large datastructure on memcached, deserializing (unpickling) it takes quite a while (only a couple of milliseconds, but it still counts).

This happens whenever I end up storing large lists or dictionaries on memcached. Redis solves this problem effectively by providing list and set primitives besides storing plain old strings. This effectively solves the  aforementioned problem. Also, the list primitive supports atomic push/pop commands, which could be used to implement efficient queues. And this along with the on disk persistance feature solves another problem I have with beanstalkd, which is the lack of persistance.

Overall, this seems like a great solution to quite a few tiny little problems I have with performance on Tagz. I’m planning on playing with redis a little more tonight and if all goes well, I’ll shift from memcached to redis.

Dec 12

I use base36 for all most object ids in tagz. I previously used to manually convert all base36 values into integers before doing the lookups. Yesterday it dawned to me that I could subclass QuerySets and remove a lot of boilerplate code. The idea here is that I can do something like

Model.objects.get(pk_base36='2kk')

instead of something like

Model.objects.get(pk=_decode('2kk'))

The Queryset wrapper would internally decode the base36 string into an integer and then lookup by that id. Code follows.

import string

from django.db import models

DIGITS = string.digits + string.ascii_lowercase
BASE = len(DIGITS)

def _encode(n) :
    l = []
    base = len(DIGITS)
    while True :
        n, rem = divmod(n, BASE)
        l.append(DIGITS[rem])
        if n < 1 :
            break
    l.reverse()
    return ''.join(l)

def _decode(s) :
    num = 0
    base = len(DIGITS)
    for c in s.lower() :
        pos = DIGITS.index(c)
        num = (num * base) + pos

class Base36KeyedQuerySet(QuerySet) :
    def _handle_pk_base36(self, kwargs) :
        if 'pk_base36' in kwargs :
            pk_base36 = kwargs['pk_base36']
            assert ('id' not in kwargs and 'pk' not in kwargs)
            kwargs['pk'] = _decode(pk_base36)
            del kwargs['pk_base36']
        return kwargs

    def get(self, *args, **kwargs) :
        kwargs = self._handle_pk_base36(kwargs)
        return super(self.__class__,self).get(*args, **kwargs)

    def filter(self, *args, **kwargs) :
        kwargs = self._handle_pk_base36(kwargs)
        return super(self.__class__,self).filter(*args, **kwargs)

    def exclude(self, *args, **kwargs) :
        kwargs = self._handle_pk_base36(kwargs)
        return super(self.__class__,self).exclude(*args, **kwargs)

    def get_or_create(self, *args, **kwargs) :
        kwargs = self._handle_pk_base36(kwargs)
        return super(self.__class__,self).get_or_create(*args, **kwargs)

class Base36KeyedManager(models.Manager) :
    def get_query_set(self) :
        return Base36KeyedQuerySet(self.model)

class Base36KeyedModel(models.Model) :
    def get_base36_id(self) :
        return _encode(self.id)

    class Meta :
        abstract = True

class ExampleModel(Base36KeyedModel) :
    field1 = models.CharField(max_length=255)
    field2 = models.CharField(max_length=255)

    objects = Base36KeyedManager()
Nov 04

Due to various reasons beyond the scope of this blog, I’ve had to quit my current job as a Programming Specialist at Position2. So, if you’re looking for a python / django hacker in Bangalore (although telecommuting certainly is an option I’d consider), drop me a line.

Aug 29

My dear brother Thilak met with a minor accident this afternoon, and in the confusion the ensued, he’s spilt the beans on Tagz. It must’ve been painful to singlehandedly type the 228 word post (He’s got a cast on his right hand, because of the accident). The UI is kinda crude, but functional. Actually a couple of friends are already using/testing it. Well, we plan to release it sometime soon, but I honestly wish he hadn’t made it public so soon.

We’d been discussing this “`better` delicious reddit chimera” idea for quite some time now. Due to difficult personal circumstances in the past couple of months, I’ve been suffering from a terrible bout of insomnia. When the usual remedies for this (reading Nietzsche, driving through the city all night long etc) didn’t work, I started working on it. Then, on one of my infrequent visits to Mangalore, I showed a very crude prototype to Thilak and he was pretty enthusiastic about it. We setup a redmine instance, moved the mercurial repository to my vps and we were up and running, with a couple of commits every night.

We’ve got a long way to go before I can call it release ready. Until then, all I can say is its written using django and python, with postgresql for the db. And the `undumb` or `not dumb` (or whatever) tags thing he’s hinting about isn’t really all that smart, its just plain old tagging with porter stemming to identify similar tags.

Jun 26

I use Mercurial for all my coding projects. Today I hit upon the idea of using mercurial precommit hooks to run django tests before committing. I didn’t really expect it to be so easy :)

I use postgresql on the server, but I found that on win32, running tests with postgresql is excruciatingly slow.

For comparison, 23 tests with postgresql on Vista laptop take about 193 seconds (The same tests take ~14 seconds on my linux vps). With an sqlite in memory database on the same machine they take about 13.5 seconds. Moral of the story: On Windows, use sqlite in memory databases to test django apps.

Ok, so I had to figure out a way to use sqlite only for testing while using postgresql otherwise. I decided to instrument settings.py

_TEST_DB = False
if 'DJANGO_TEST_DB' in os.environ :
    _TEST_DB = True

if _TEST_DB :
    DATABASE_ENGINE = 'sqlite3'
    DATABASE_NAME =  ':memory:'
    DATABASE_USER = ''
    DATABASE_PASSWORD = ''
    DATABASE_HOST = ''
    DATABASE_PORT = ''
else :
    DATABASE_ENGINE   = 'postgresql_psycopg2'
    DATABASE_NAME     = 'project'
    DATABASE_USER     = 'projectdb'
    DATABASE_PASSWORD = 'dbpassword'
    DATABASE_HOST     = ''
    DATABASE_PORT     = ''

Now I thought of using the inprocess Python hooks feature of Mercurial, but it seemed to be too much of a hassle (need to have the hook script in PYTHONPATH), so I decided to write a simple command line script


import os, sys, subprocess
PROJ_DIR = r'C:\Users\Jeethu\code\project'

def main() :
    proj_dir = os.path.normpath(PROJ_DIR)
    os.chdir(proj_dir)
    os.environ['DJANGO_TEST_DB'] = 'True'
    r = subprocess.call(['python','manage.py','test'])
    return r

if __name__ == '__main__' :
    sys.exit(main())

Then simply add 2 lines to the .hg\hgrc file in the repo.

[hooks]
precommit.tests = c:\Users\Jeethu\code\project\hooks\test_hook.py

Hopefully, this will set the incentives right for me to write more tests.

References:
The hgrc manpage
Chapter 10 of the Mercurial Book

May 31

These days, I’m mostly working on linux and I haven’t had to use the Microsoft dev tools for a long time. I’m using Vista on my new laptop. Yesterday, I had to install pycrypto from source on Windows. The only compiler I had was MinGW, which doesn’t quite cut it with distutils. I figured that I’d have an easier time with the free Visual C++ 2008 Express Edition compiler. Apparently, distutils on Python 2.5.2 doesn’t like this compiler either.
Here’s what I had to do to get distutils to work with it.
First, patch distutils.
File: C:\Python25\Lib\site-packages\msvccompiler.py
One of the things I observed in this file was that if
the environment variables DISTUTILS_USE_SDK and MSSdk
are set and if distutils can find “cl.exe” in the system path, it assumes that everything is set up.
To quote from a comment in the source,

# Assume that the SDK set up everything alright; don’t try to be
# smarter

Find the method load_macros in the MacroExpander class.
Replace the following lines inside the try block

if version > 7.0:
     self.set_macro("FrameworkSDKDir", net, "sdkinstallrootv1.1")
else:
     self.set_macro("FrameworkSDKDir", net, "sdkinstallroot")</pre>

With:

if version > 7.09:
     pass
elif version > 7.0:
     self.set_macro("FrameworkSDKDir", net, "sdkinstallrootv1.1")
else:
     self.set_macro("FrameworkSDKDir", net, "sdkinstallroot")

Now launch a command prompt window and run the following commands.

C:\Program Files\Microsoft Visual Studio 9.0\VC\vcvarsall.bat
SET DISTUTILS_USE_SDK=1
SET MSSdk=1

Now you can run “setup.py build” and it works just fine.
The compiler spits out a couple of warnings about the option ‘GX’ being deprecated, but thats just an innocuous warning.

May 28

Everybody knows “import pdb; pdb.set_trace()”, that’s probably the first thing anyone would do when you’re trying to hunt for a bug or hit an exception (“pdb.pm()” ?). Its just as easy to embed IPython or a plain old Python shell in a python script.

from IPython.Shell import IPShellEmbed
ishell = IPShellEmbed()
ishell()

The IPython manual covers this in detail.
Another way would be to subclass code.InteractiveConsole as mentioned in this Python Cookbook entry.