Jan 07

I was trying to use the flare ActionScript decompiler, which I hadn’t used since I upgraded to Snow Leopard about a month ago. The OS X version of flare is a PPC binary and needs Rosetta to run. Unfortunately, since flare is a command line application, it just spits out a message (“You need the Rosetta software to run flare. The Rosetta installer is in Optional Installs on your Mac OS X installation disc.”) and exits. GUI applications, on the other hand, trigger a dialog prompting you to download and install Rosetta. My Snow Leopard upgrade disc is at home and I didn’t want to wait until evening to install Rosetta off the disc. So, I tried searching for a link to the Rosetta installer on the web, without any success. The only other recourse was to find a PPC-built GUI application to install and run (and eventually uninstall once Rosetta was installed) to trigger the installer dialog. After a little bit of searching, I finally managed to find PPC binaries from the Folding@home project. I installed it, and got the dialog to install Rosetta, which incidentally was only a 2MB download. I wish there was an easier way to download the Rosetta installer straight from the Apple website.

Jun 22

Some 6 months ago, out of curiosity, I tried using Wireshark to monitor the HTTP traffic my browser was generating while casually browsing the web on Firefox with the StumbleUpon toolbar. The resulting traffic dump was pretty interesting. The StumbleUpon toolbar was calling home for every single URL I visited.

The extension was making an HTTP POST request for every url I’d visited to http://74.201.117.232/getmeta.php?username=<my_user_id>.

Stumbleupon screenshot

The use case for this is presumably to check whether I’ve rated the url and, if so, what my rating was. But then, why not cache the data instead of calling home every time I open a page (even in the same browsing session) ? I freaked out and uninstalled the toolbar that very moment, and that was pretty much it. I may have mentioned it in passing to a couple of friends. Most of them didn’t really care since they don’t use StumbleUpon, and those who did thought it was par for the course with all the social bookmarking/news toolbars, which is true. Almost every other toolbar does this as well.

I thought about this for a while, and figured the toolbars should hash the urls before sending them to the servers and use Bloom filters to reduce the number of times the client has to call home to check whether a user has rated a url. And that was pretty much it, until last week.
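For readers who haven't met Bloom filters, here is a minimal sketch of the idea in Python. It is not what any toolbar actually ships; the class name, the bit-array size and the number of hash functions are all assumptions chosen for illustration. The server would hand the client a filter built from the urls the user has rated, and the client would only call home when the filter answers "maybe".

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; sizes are illustrative, not tuned."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


rated = BloomFilter()
rated.add("http://example.com/")

# Only a "maybe" answer would justify a request to the server.
needs_lookup = "http://example.com/" in rated
```

A false positive only costs one unnecessary lookup, and there are no false negatives, so a rated url is never missed.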

kamathln told me that someone wanted to write a Jetpack extension for Tagz. That someone turned out to be Yathi, an old mutual friend of ours. He wanted to write a relatively simple Jetpack extension and needed some server side support to get info on any url (comments, points, number of saves etc). I quickly added the requisite server side support and he quickly hacked together a nice little Jetpack script. It turned out to be one of the first few Jetpack extensions on userscripts.org, a couple of people started using it and all was fine. That is, until I started looking at the server logs, when it felt like déjà vu all over again. Our script was leaking our users’ browsing history, much like I’d observed with the StumbleUpon toolbar 6 months ago.

Eventually, I decided that sending urls in plain text is a bad idea. Also, the lookups should be cached at least for a short while. An extended form of the idea involved using Bloom filters, but that would have been too much work. So, we now normalize the url to a standard representation, hash it with SHA-256 and send only the hash to the server. Although this is not a completely bulletproof solution, it’s certainly better than sending the url to the server every time.
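The normalize-then-hash step might look roughly like the sketch below. The real extension is a Jetpack (JavaScript) script and its exact normalization rules aren't given here, so the function names and the rules (lowercase the scheme and host, drop default ports, userinfo and fragments, use "/" for an empty path) are assumptions for illustration only.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Reduce a url to one canonical spelling (assumed rules, see above)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    # .hostname is already lowercased and excludes any user:password@ prefix.
    host = parts.hostname or ""
    port = parts.port
    if port is not None and DEFAULT_PORTS.get(scheme) != port:
        host = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment never reaches a server anyway, so drop it.
    return urlunsplit((scheme, host, path, parts.query, ""))


def url_digest(url):
    """Hex SHA-256 digest of the normalized url; this is what gets sent."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()


digest = url_digest("HTTP://Example.com:80/a?b=1#frag")
```

Different spellings of the same url then map to the same digest. It isn't bulletproof because anyone holding a list of candidate urls can hash them and compare, but the server no longer sees arbitrary urls in the clear.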

Jun 16

This probably is the most inappropriately titled post on my blog. Maybe this should have been titled “Why I think I waste so much time on proggit and Hacker News” or “PR for my new feature on Tagz”.

For a long time I thought that people flock to social news sites to find new links pertinent to their interests (programming, compsci, math, economics … in my case). Going by this model, I thought the utility of comments on these sites was only marginal compared to the utility provided by the inflow of new and interesting links. Lately I realized that this couldn’t be further from the truth. I hadn’t noticed that I tend to spend more time reading comments than reading the linked content. Seems like I’m more interested in what people have to say about the links than in the links themselves.

This reminded me of a pain point I’ve always had with social news sites. Social news sites are kinda like walled gardens. When I’m reading the comments on one of them, I’m missing out on a lot of interesting comments on other sites. And there’s no easy way (with a few clicks) to find discussions on all sites.

Since Tagz was written with the intent of solving the things which nagged me the most about social news and bookmarking sites, I decided to annotate all posts on Tagz with links to the comments pages on Delicious, Digg, Hacker News, Reddit and Twitter. Now, there are a few little inconsistencies. Delicious doesn’t have anything like a comments page, so I link to the url info page. And due to the use of url shorteners, there isn’t a way to directly search Twitter for links, so I link to the Backtweets search results page for the URL.

Initially, I didn’t want to add more clutter to the main page, so I kept these links only on every post’s comments and history page. But yesterday, I thought it’d be a better idea to just include those links on the main page. One of my more marketing oriented friends even recommended against having them on the main page, since that’d likely increase the bounce rate. Honestly, I don’t really care about that, and I’ve always thought convenience takes precedence over everything else, so I just added them anyway.

Apr 13

Before moving to YUI about a year ago, I was using MochiKit as my primary JS library. As advertised, MochiKit happens to be one of the most pythonic javascript libraries ever. One of the sweetest parts of MochiKit IMO has been MochiKit.DOM. This is something which I’ve always missed with YUI. innerHTML is fast, but icky, and it feels a little inelegant. So, I ended up writing something like MochiKit.DOM for YUI while writing Tagz. I thought it might be useful to others as well, so here’s the Mercurial repo with the code.

The utils.js file contains some utility functions like forEach, map, filter, partial. The only function from utils.js used in dombuilder.js is partial, so you might want to add it to dombuilder.js to remove the dependence on utils.js.

Here’s an obligatory trivial example (included in the repo).

Apr 06

I’ve been using memcached for all the caching on Tagz. Redis is a relatively new key value database which covers a superset of memcached’s functionality. One of the biggest problems I’ve had with memcached (actually it has nothing to do with memcached itself) is that whenever I store a large data structure on memcached, deserializing (unpickling) it takes quite a while (only a couple of milliseconds, but it still counts).

This happens whenever I end up storing large lists or dictionaries on memcached. Redis solves this problem by providing list and set primitives besides plain old strings, so the whole structure no longer has to be unpickled just to read or change a part of it. Also, the list primitive supports atomic push/pop commands, which could be used to implement efficient queues. And this, along with the on disk persistence feature, solves another problem I have with beanstalkd, which is the lack of persistence.
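To make the difference concrete, here is a small in-process sketch. It talks to neither memcached nor Redis: `blob_cache` and `list_store` are hypothetical stand-ins, one for a cache that only holds opaque byte strings and one for a store with a native list type.

```python
import pickle
from collections import deque

# Stand-in for a string-only cache: every value is one opaque pickled blob.
blob_cache = {}


def blob_append(key, item):
    """Append to a pickled list: unpickle everything, modify, pickle again."""
    items = pickle.loads(blob_cache[key]) if key in blob_cache else []
    items.append(item)
    blob_cache[key] = pickle.dumps(items)
    return len(items)


# Stand-in for a store with list primitives: each key maps to a real list.
list_store = {}


def list_push(key, item):
    """Push one item without touching the rest of the list."""
    list_store.setdefault(key, deque()).append(item)
    return len(list_store[key])


def list_pop(key):
    """Pop the oldest item (queue behaviour); returns None when empty."""
    queue = list_store.get(key)
    return queue.popleft() if queue else None
```

With the blob approach, the cost of every append grows with the size of the list. With a list primitive, a push or pop touches a single element, which is what makes it usable as a queue. The atomicity Redis offers is not modelled here.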

Overall, this seems like a great solution to quite a few tiny little problems I have with performance on Tagz. I’m planning on playing with redis a little more tonight and if all goes well, I’ll shift from memcached to redis.

Apr 01

Tagz has come a long way since I launched it last September. Something which began as a clean room Django application has been accumulating a lot of cruft. One patch at a time, it’s turned itself into an unmaintainable mess of a codebase.

In retrospect, I feel Python and Postgres weren’t really the best choices I made for writing Tagz. I believe Tagz would be better written in PHP with MySQL as the DB. I’ve come to learn the hard way that Django with PostgreSQL can’t quite match the blazing speeds possible using raw PHP with MySQL (and MyISAM DBs).

Starting today, I’ve decided on stopping all development on the current code base of Tagz. I’ve begun a rewrite of Tagz in PHP. The current users may rest assured, since backwards compatibility is an important goal for this rewrite. I’m hoping to finish the rewrite in less than a month. I’m expecting the transition to be a smooth one.

Finally, thanks to all the current users (no thanks to all the spammers) for all the encouragement and the feature requests, without which Tagz would never have come close to what it is today.

Mar 07

Over at Codinghorror, Jeff Atwood ponders if making your pages W3C compliant is really worth all the effort. I’m sure just about everyone who has written more than a couple of html pages has thought about this. As a programmer, I find writing html and css to be a real chore. My productivity drops very close to zero whenever I’ve got to write html. But when you get down to do something as menial (from a lazy developer’s perspective) as writing html, how much harder is it to make it validate ?

But then, no fallacious argument is complete without a strawman or two. Start with more than a dozen sites (including two of his own) which don’t validate. Brilliant, capitalize on the bandwagon effect. Which is kinda like saying “Those big name sites made it big without having valid (x)html. Ipso facto, your site stands a better chance of making it big by not writing valid (x)html”. The second one’s a little better. With HTML 4.01 Strict, you can’t have a target attribute for an anchor tag. So, something like:

<a href="http://www.example.com/" target="_blank">foo</a>

doesn’t validate. Now, I’m sure everybody these days uses some JS library (for those who don’t, there’s always getElementsByClassName). Apparently, Stack Overflow uses jQuery. So, here’s a solution. Start with

<a href="http://www.example.com/" class="external">foo</a>

Or something like it. Then do something like

$("a.external").attr("target", "_blank");

Yeah, yeah, I know it’s a dirty hack, but heck, it buys you compliance. For what price ? One extra line of JavaScript. And the whole point of validating is mostly about making your pages easier to parse for all the programs and bots which crawl/fetch them, and AFAIK most of them don’t grok JavaScript. How hard is it to write

<td style="width:80px">

as opposed to

<td width=80>

Fortunately, you don’t have to compile html, so you don’t have a compiler which barks at you and also, the browsers are lax enough to consume almost any kludge that you throw at them.

The bottom line: hell yeah, it’s worth the effort, and how hard is it, really ? Just because your web site doesn’t validate, it doesn’t mean that you can brush it off. All I care about is that mine does and everyone’s should, at least in theory :)

PS:

1: In all honesty, my blog isn’t W3C validator compliant, thanks to the WordPress plugin that I use for syntax highlighting source code.

2: Accessibility matters as well. We web devs keep bitching about having to support IE6 (I do it as well). How many of our web 2.0 apps work with JavaScript and CSS disabled or on lynx or links ?

Nov 07

On my previous post with the same title, frits commented asking whether it would be possible to use filter(tags__in=l) instead of chaining filters.

I’d initially thought that chaining filters would generate a more efficient SQL query. This morning, I decided to test it.

Post.objects.filter(tags__text='python').filter(tags__text='django')

Generates the following SQL query.

SELECT "main_post"."id", "main_post"."link", "main_post"."title"
FROM "main_post"
INNER JOIN "main_post_tags" ON ("main_post"."id" = "main_post_tags"."post_id")
INNER JOIN "main_tag" ON ("main_post_tags"."tag_id" = "main_tag"."id")
INNER JOIN "main_post_tags" T4 ON ("main_post"."id" = T4."post_id")
INNER JOIN "main_tag" T5 ON (T4."tag_id" = T5."id")
    WHERE ("main_tag"."text" = 'python'  AND T5."text" = 'django' )

Notice that it generates a total of 4 joins, and I don’t really know why it’s joining on the M2M join table main_post_tags twice. Joining twice on main_tag is ok, but the 2nd join on main_post_tags could’ve been avoided.

Now, to try frits’ idea.

t1 = Tag.objects.get(text='python')
t2 = Tag.objects.get(text='django')
Post.objects.filter(tags__in=(t1,t2))

This generates a total of 3 queries, 2 to select the tags and one to select the posts. Now, if we ignore the cost of the first two queries (which are cheap, and the results can and should be cached in memory anyways), the final query has become way simpler and cheaper.

SELECT "main_post"."id", "main_post"."link", "main_post"."title"
FROM "main_post"
INNER JOIN "main_post_tags" ON ("main_post"."id" = "main_post_tags"."post_id")
    WHERE "main_post_tags"."tag_id" IN (1, 2)

The number of joins is down from 4 to 1, which makes this query far cheaper to run.
Thanks for the tip, frits.
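One caveat is worth checking before switching: the two queries don't ask the same question. Chained filters on a many-to-many relation require a post to carry both tags, which is also why Django joins main_post_tags once per filter() call. tags__in matches posts carrying either tag and can return a post once per matching tag. The sketch below replays both SQL statements against a throwaway SQLite database; the three sample posts are made up for illustration and the column list is trimmed to the id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE main_post (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE main_tag (id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE main_post_tags (post_id INTEGER, tag_id INTEGER);
INSERT INTO main_tag VALUES (1, 'python'), (2, 'django');
INSERT INTO main_post VALUES
    (1, 'both tags'), (2, 'python only'), (3, 'django only');
INSERT INTO main_post_tags VALUES (1, 1), (1, 2), (2, 1), (3, 2);
""")

# SQL shape produced by the chained filters (4 joins).
chained_sql = """
SELECT main_post.id FROM main_post
INNER JOIN main_post_tags ON (main_post.id = main_post_tags.post_id)
INNER JOIN main_tag ON (main_post_tags.tag_id = main_tag.id)
INNER JOIN main_post_tags T4 ON (main_post.id = T4.post_id)
INNER JOIN main_tag T5 ON (T4.tag_id = T5.id)
WHERE main_tag.text = 'python' AND T5.text = 'django'
ORDER BY main_post.id
"""

# SQL shape produced by tags__in (1 join).
in_sql = """
SELECT main_post.id FROM main_post
INNER JOIN main_post_tags ON (main_post.id = main_post_tags.post_id)
WHERE main_post_tags.tag_id IN (1, 2)
ORDER BY main_post.id
"""

chained_ids = [row[0] for row in conn.execute(chained_sql)]
in_ids = [row[0] for row in conn.execute(in_sql)]

print(chained_ids)  # only the post tagged with both
print(in_ids)       # every post with either tag; post 1 appears twice
```

So tags__in is the cheaper query when "any of these tags" is what's wanted, ideally with .distinct() to drop the repeats. When "all of these tags" is the requirement, the chained form (or some other formulation with the same meaning) is still needed.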

Aug 29

My dear brother Thilak met with a minor accident this afternoon, and in the confusion that ensued, he’s spilt the beans on Tagz. It must’ve been painful to singlehandedly type the 228-word post (he’s got a cast on his right hand because of the accident). The UI is kinda crude, but functional. Actually, a couple of friends are already using/testing it. Well, we plan to release it sometime soon, but I honestly wish he hadn’t made it public so soon.

We’d been discussing this “`better` delicious reddit chimera” idea for quite some time now. Due to difficult personal circumstances in the past couple of months, I’ve been suffering from a terrible bout of insomnia. When the usual remedies for this (reading Nietzsche, driving through the city all night long etc) didn’t work, I started working on it. Then, on one of my infrequent visits to Mangalore, I showed a very crude prototype to Thilak and he was pretty enthusiastic about it. We set up a Redmine instance, moved the Mercurial repository to my VPS and we were up and running, with a couple of commits every night.

We’ve got a long way to go before I can call it release ready. Until then, all I can say is that it’s written using Django and Python, with PostgreSQL for the db. And the `undumb` or `not dumb` (or whatever) tags thing he’s hinting at isn’t really all that smart, it’s just plain old tagging with Porter stemming to identify similar tags.

Jan 30

This one’s straight off Slashdot. A giant octopus tried attacking a 200k remotely controlled submarine. What’s next ? Nessie making an appearance (not that I believe in it) ?