About six months ago, out of curiosity, I used Wireshark to monitor the HTTP traffic my browser was generating while I casually browsed the web in Firefox with the StumbleUpon toolbar installed. The resulting traffic dump was pretty interesting. The StumbleUpon toolbar was calling home for every single URL I visited.
The extension was making an HTTP POST request to http://74.201.117.232/getmeta.php?username=<my_user_id> for every URL I visited.

The use case for this is presumably to check whether I’ve rated the URL and, if so, what my rating was. But then, why not cache the data instead of calling home every time I open a page (even within the same browsing session)? I freaked out and uninstalled the toolbar that very moment, and that was pretty much it. I may have mentioned it in passing to a couple of friends. Most of them didn’t really care, since they don’t use StumbleUpon, and those who did thought it was par for the course with all the social bookmarking/news toolbars, which is true. Almost every other toolbar does this too.
I thought about this for a while and figured the toolbars should hash the URLs before sending them to the servers, and use Bloom filters to reduce the number of times the client has to call home to check whether a user has rated a URL. And that was pretty much it, until last week.
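To make the Bloom filter idea concrete, here is a minimal sketch in Python. It is not what any toolbar actually does; the class name, the default sizes and the way bit positions are derived from SHA-256 are all my own assumptions, picked only for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter sketch (hypothetical sizes, not tuned for real use)."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely not added"; True means "possibly added".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

The way I imagined it, the server would hand the client a filter built from the (hashed) URLs the user has already rated, and the client would call home only when might_contain() returns True. A False answer is always correct, so most page loads would never touch the network; a True answer can be a false positive, so it still needs a real lookup.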
kamathln told me that someone wanted to write a Jetpack extension for Tagz. That someone turned out to be Yathi, an old mutual friend of ours. He wanted to write a relatively simple Jetpack extension and needed some server-side support to get info on any URL (comments, points, number of saves, etc.). I quickly added the requisite server-side support and he quickly hacked together a nice little Jetpack script. It turned out to be one of the first few Jetpack extensions on userscripts.org, a couple of people started using it, and all was fine. Until I started looking at the server logs, when it felt like déjà vu all over again. Our script was leaking our users’ browsing history, much like what I’d observed with the StumbleUpon toolbar six months earlier.
Eventually, I decided that sending URLs in plain text is a bad idea. Also, the lookups should be cached at least for a short while. An extended form of the idea involved using Bloom filters, but that would have been too much work. So we now normalize the URL to a standard representation, hash it with SHA-256, and send the hash to the server. Although this is not a completely bulletproof solution, it’s certainly better than sending the URL to the server every time.
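Roughly, the normalize, hash and cache steps look like the sketch below. This is a Python illustration rather than the actual Jetpack script: the normalization rules, the function and class names, and the five-minute cache lifetime are all assumptions made for the example, and fetch stands in for whatever function actually asks the server about a hash.

```python
import hashlib
import time
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Reduce a URL to one standard form (assumed rules, not the real ones)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # Drop the fragment; it never reaches a web server anyway.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def url_hash(url):
    """SHA-256 hex digest of the normalized URL."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()


class CachedLookup:
    """Remembers server answers for a short while, keyed by URL hash."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch          # hypothetical: takes a hash, returns info
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    def get(self, url):
        key = url_hash(url)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]            # still fresh, no request sent
        value = self._fetch(key)     # only the hash leaves the browser
        self._entries[key] = (now, value)
        return value
```

With this, HTTP://Example.COM:80/a#top and http://example.com/a produce the same hash, so they share one cache entry and one request. The reason it is not bulletproof: the server can still hash any URL it already knows about and compare, so visits to well-known pages remain recognizable.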


