User:Bhashem/WildOnAddons
From MozillaWiki
Overview
One of the important questions for the Mozilla and Firefox platform is:
- How many add-ons make up the Mozilla add-on ecosystem?
It's important to get the answer to this so that:
- We can understand how pervasive Mozilla add-ons are
- We can help users find ALL the add-ons available on the net (not just those on AMO)
- We can use this number to show a groundswell of support for the platform to encourage others to develop for it
- We can start to index this information in a central location (AddonSearch)
This actually turns out to be quite hard to answer. AMO is one of the main distribution points for add-ons, but it's certainly not the only one. The goal of this project is to gather and index information about add-ons "in the wild".
Here are a few ideas about where add-ons can be hiding.
Aggregation Sources
- Mozilla AMO (public & sandboxed)
- Mozilla AMO Update Service (some authors don't include an update URL which means that Firefox attempts to get updates from AMO and the GUID is logged)
- AMO-like sites: AMI, Sociz, China, Mozilla Japan Addons, Addons.pl, other locale-specific sites?
- Source Repos: MozDev projects, Google Code & SourceForge
- Search results: Google ("filetype:xpi", "firefox add-ons", "firefox extensions"), Yahoo, etc...
- Those mentioned in Google Alerts (blogs & news) on a regular basis
- Blog aggregators: Foxiewire
- Addon-specific sites for XUL Apps (Songbird Nest, Flock Extensions, ...)
Individual Sources
- Corporations (Google Toolbar, Google Labs)
- Inside of Installers (Symantec Anti-Virus, McAfee, Skype, Java)
- Individual authors' blogs and websites
Project Definition
- Write a crawler that gathers info from some of the sources named above
- Index the collected info and try to extract metadata from page context and the install.{js/rdf}
- Allow "manual entries" to be entered into the index (e.g. for add-ons bundled in Installers)
- Build a search/advanced search UI on top of the index
- Initial focus should be on Firefox, Thunderbird, SeaMonkey, Flock, Songbird and Nvu only
Tech Notes
- Thankfully, most add-ons have a .xpi file extension, so they should be easier to identify
- .xpi files are ZIP files and usually contain an install.js or an install.rdf, which has info about what the add-on does
- The install.{js/rdf} contains a GUID which uniquely identifies the add-on (hopefully), so we may be able to use this as the primary index
- The install.rdf also lists targetApplication IDs and the versions supported for each
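The notes above can be sketched in Python using only the standard library. This is a minimal sketch, not a finished tool: the function name `read_install_rdf` is hypothetical, it only handles the install.rdf case (not install.js), and it assumes the manifest properties are written as child elements of the first top-level Description rather than as attributes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by install.rdf manifests.
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
EM = "{http://www.mozilla.org/2004/em-rdf#}"


def read_install_rdf(xpi_bytes):
    """Return a dict of manifest fields, or None if there is no install.rdf.

    Hypothetical helper: assumes properties are child elements, not attributes.
    """
    with zipfile.ZipFile(io.BytesIO(xpi_bytes)) as archive:
        if "install.rdf" not in archive.namelist():
            return None
        root = ET.fromstring(archive.read("install.rdf"))

    # Assumption: the first top-level Description is the install manifest.
    manifest = root.find(RDF + "Description")
    if manifest is None:
        return None

    targets = []
    for app in manifest.findall(EM + "targetApplication/" + RDF + "Description"):
        targets.append({
            "id": app.findtext(EM + "id"),
            "min_version": app.findtext(EM + "minVersion"),
            "max_version": app.findtext(EM + "maxVersion"),
        })

    return {
        "guid": manifest.findtext(EM + "id"),
        "version": manifest.findtext(EM + "version"),
        "name": manifest.findtext(EM + "name"),
        "target_applications": targets,
    }
```

Fields that are missing from a manifest come back as None, which lets the index store partial records instead of rejecting the add-on.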
Crawling
Crawling and parsing would probably be an intensive and time-consuming process.
- Google search results (filetype:xpi) are limited. For example, a Google search for (filetype:xpi site:addons.mozilla.org) only returns 62 hits. Probably best used to supplement our data rather than as the primary source.
- How much do we crawl? How deep?
- Aggregate Sites
- Two Kinds
- Hosting (AMI, AMO)
- Linking (FoxieWire)
- Site specific. Maybe only a specific subdomain (e.g. addons.mozilla.org/* instead of all of mozilla.org). Add-on authors sometimes have links on their add-on's page to their personal website with a more up-to-date version.
- Mozdev/others/..
- Individual Sites
- Wordpress/Blogspot (can extensions be uploaded here?)
- Google/Yahoo search
- Rich sources of information, but with too much noise or results of low quality
- Bouncer
- What kind of information does bouncer collect?
- Probably does not give context, rating or URL
- Good/Bad source?
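One possible answer to the "how much, how deep" questions above is a depth-limited, single-host crawl that still records .xpi links pointing off-site (for example, an author's personal website). The sketch below is only an illustration: `classify_links`, `crawl`, `allowed_host` and `max_depth` are hypothetical names, and the `fetch` argument is an assumed callable that returns a page's HTML or None, so no particular HTTP library is implied.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def classify_links(html, page_url, allowed_host):
    """Split a page's links into .xpi downloads and same-host pages to follow."""
    collector = LinkCollector()
    collector.feed(html)
    xpi_links, follow_links = [], []
    for href in collector.hrefs:
        url = urljoin(page_url, href)
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue
        if parsed.path.lower().endswith(".xpi"):
            xpi_links.append(url)  # recorded even when hosted elsewhere
        elif parsed.hostname == allowed_host:
            follow_links.append(url)
    return xpi_links, follow_links


def crawl(start_url, fetch, allowed_host, max_depth=2):
    """Breadth-first crawl; fetch(url) is assumed to return HTML or None."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    found = set()
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        xpi_links, follow_links = classify_links(html, url, allowed_host)
        found.update(xpi_links)
        if depth < max_depth:
            for next_url in follow_links:
                if next_url not in seen:
                    seen.add(next_url)
                    queue.append((next_url, depth + 1))
    return found
```

Keeping the fetcher separate makes it easy to add politeness delays or caching later, and to run the crawl logic against saved pages.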
GUID Collisions
- Same extension different version
- Same extension, same version, different website (hash comparisons?)
- Different extension, possibly malicious or coincidence
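The three cases above could be told apart roughly as follows, using a file hash for the "same version, different website" case. This is a sketch under assumptions: the `Sighting` record and the labels returned by `classify_collision` are hypothetical, and a hash mismatch only flags a file for review; it cannot say whether the difference is malicious, a repackaging, or a coincidence.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Sighting:
    """One copy of an add-on found at one URL (hypothetical record type)."""
    guid: str
    version: str
    sha256: str
    url: str


def sighting_from_bytes(guid, version, url, xpi_bytes):
    """Build a Sighting, hashing the raw .xpi contents."""
    digest = hashlib.sha256(xpi_bytes).hexdigest()
    return Sighting(guid, version, digest, url)


def classify_collision(known, new):
    """Compare a new sighting against one already in the index."""
    if known.guid != new.guid:
        return "no-collision"
    if known.version != new.version:
        return "other-version"   # same extension, different version
    if known.sha256 == new.sha256:
        return "mirror"          # same file hosted on a different website
    return "needs-review"        # same GUID and version, different contents
```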
What to Track
- Addon url (where did we find it?)
- Filename
- GUID
- Supported Applications and versions
- Locales it supports
- context (entire paragraph)
- Ratings? (Site-specific)
- Categories (How?)
- Addon version
- Operating system (from install.rdf targetPlatform; can be null if we don't know)
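As a rough illustration, the fields listed above could map onto a small SQLite schema like the one below. Every table, column and function name here is an assumption made for the sketch, as is the choice of SQLite. Locales and categories are kept as plain text for simplicity, and the GUID is indexed but not unique, because the same add-on may be found at several URLs.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS sighting (
    id              INTEGER PRIMARY KEY,
    url             TEXT NOT NULL UNIQUE,  -- where did we find it?
    filename        TEXT NOT NULL,
    guid            TEXT NOT NULL,
    addon_version   TEXT,
    locales         TEXT,                  -- e.g. "en-US,de"
    context         TEXT,                  -- surrounding paragraph
    rating          REAL,                  -- site-specific, may be NULL
    categories      TEXT,
    target_platform TEXT                   -- NULL when unknown
);
CREATE INDEX IF NOT EXISTS sighting_guid ON sighting (guid);
CREATE TABLE IF NOT EXISTS supported_app (
    sighting_id INTEGER NOT NULL REFERENCES sighting (id),
    app_id      TEXT NOT NULL,
    min_version TEXT,
    max_version TEXT
);
"""


def open_index(path=":memory:"):
    """Open (or create) the index database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_sighting(conn, url, filename, guid, addon_version=None,
                    target_platform=None, apps=()):
    """Insert one sighting; apps is an iterable of (app_id, min, max) tuples."""
    cur = conn.execute(
        "INSERT INTO sighting (url, filename, guid, addon_version, target_platform) "
        "VALUES (?, ?, ?, ?, ?)",
        (url, filename, guid, addon_version, target_platform),
    )
    sighting_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO supported_app (sighting_id, app_id, min_version, max_version) "
        "VALUES (?, ?, ?, ?)",
        [(sighting_id, app_id, lo, hi) for app_id, lo, hi in apps],
    )
    conn.commit()
    return sighting_id
```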
Tools
- Something to extract a ZIP archive (the .xpi file)
- Look for chrome.manifest
- Look for install.{rdf|js}
- Parse those files (rdf is xml, chrome.manifest should be simple, but what about install.js?)
- Something to crawl
- Something to store (database for better querying?)
- List of websites to crawl
- Crawler's settings (eg. How deep)
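To illustrate why chrome.manifest "should be simple": it is a line-based text file in which each line starts with an instruction such as content, locale, skin or overlay, and lines starting with # are comments. The sketch below uses hypothetical function names and assumes whitespace-separated fields. It reads the locale lines to list the locales an add-on ships, and it does not interpret trailing flags.

```python
def parse_chrome_manifest(text):
    """Return a list of (instruction, [arguments]) pairs, skipping comments."""
    entries = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        instruction, *args = line.split()
        entries.append((instruction, args))
    return entries


def locales_from_manifest(text):
    """Collect locale names from 'locale <package> <locale> <path>' lines."""
    return sorted({
        args[1]
        for instruction, args in parse_chrome_manifest(text)
        if instruction == "locale" and len(args) >= 2
    })
```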
Technical Resources
Manual Extensions
Extensions that are bundled with an installer, and therefore must be added manually