Firefox/Projects/Page thumbnail service

Intro

With the home tab, bookmarks, and history all in content, we need a way to take and persistently store thumbnails of Web pages.

There are at least a couple of interesting problems:

What makes a good thumbnail? It would be easy to snap the top portion of the page as the user sees it. Can we do better?
How can thumbnails be economically stored and quickly retrieved?

STALLED, GOING DOWN, PASSENGERS EERILY QUIET

Spent some time thinking about and prototyping problem 1 above (what makes a good thumbnail). Tactics tried:

Zoom in on the most relevant parts of the page. The intuition is that a thumbnail of a Wikipedia page should focus on the page's header, maybe including the logo in the corner or images in the right-hand sidebar. Similarly, a Flickr photo page should focus on the photo. I tried a few text retrieval techniques (the Jaccard index, cosine similarity of tf-idf vectors) using the page's title (or URL) as the query and DOM nodes as the documents, combined with upweighting particular nodes. It works well for many pages but poorly for others, and when it works poorly the result is worse than the naive thumbnail.
Make the window smaller. The bigger the user's window, the smaller everything is inside the naive thumbnail. For text-heavy pages such as Wikipedia, most of the thumbnail might be text. So shrink the window so that the text reflows and moves down the page, and then snap the top portion. The advantage here is that it's never worse than the naive thumbnail and sometimes better.
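To make the first tactic concrete, here is a minimal sketch of scoring DOM nodes against the page title with the Jaccard index and with cosine similarity of tf-idf vectors. It is not the prototype's code. The tokenizer, the smoothed idf formula, the `(tag, text)` node representation, the `tag_weights` values, and every function name are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(query_tokens, doc_tokens):
    """Jaccard index of the two token sets: |A & B| / |A | B|."""
    a, b = set(query_tokens), set(doc_tokens)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def make_vectorizer(docs_tokens):
    """Build a tokens -> tf-idf vector function, with idf taken from the nodes."""
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    # Smoothed idf (an assumed variant); terms unseen in any node are dropped.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def vectorize(tokens):
        tf = Counter(tokens)
        return {t: count * idf[t] for t, count in tf.items() if t in idf}

    return vectorize


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_nodes(title, nodes, tag_weights=None):
    """Rank (tag, text) nodes against the title; returns (index, score) pairs.

    tag_weights is a hypothetical upweighting table, e.g. {"h1": 2.0}.
    """
    tag_weights = tag_weights or {}
    docs = [tokenize(text) for _tag, text in nodes]
    vectorize = make_vectorizer(docs)
    query = vectorize(tokenize(title))
    scored = []
    for i, ((tag, _text), tokens) in enumerate(zip(nodes, docs)):
        score = cosine(query, vectorize(tokens)) * tag_weights.get(tag, 1.0)
        scored.append((i, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    nodes = [
        ("h1", "Golden Gate Bridge"),
        ("p", "The bridge spans the strait connecting San Francisco Bay and the Pacific Ocean"),
        ("div", "Donate to Wikipedia navigation search"),
    ]
    print(rank_nodes("Golden Gate Bridge - Wikipedia", nodes, {"h1": 2.0}))
```

The top-ranked node would then pick the region to zoom in on. As noted above, a bad ranking gives a thumbnail worse than the naive one, so a scheme like this would probably need a fallback to the naive snapshot when no node scores well.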

To do:

File a separate bug for each of the two goals below. Bug 497543 already exists. It should probably be converted into the bug for goal 2, since it's mostly concerned with storage.