Websites/Taskforce/Proposals/Abandoned Sites/Archive
The following steps may be followed in order to archive a Mozilla website that has been abandoned or is being retired.
Archived Mozilla websites will be made available at http://website-archive.mozilla.org and the Retired Sites list.
Web Dev Actions
The following actions will be performed by a member of the web development team.
Subversion
The subversion repository for the Mozilla Website Archive is available at http://svn.mozilla.org/projects/website-archive.mozilla.org
Follow the svn instructions for Mozilla subversion access. Once you have access, you may checkout the website-archive repository.
svn checkout svn+ssh://svn.mozilla.org/projects/website-archive.mozilla.org
Initial Archive
The initial archive can be performed using wget. This will scrape the entire site and save it as HTML, JavaScript and CSS files. It will also save each index file with an .html extension.
Change into the root of the checkout directory and execute the following wget command.
cd website-archive.mozilla.org && wget -rpEkH -nc -t 25 -w 2 --random-wait --retry-connrefused --no-check-certificate -R '*.pdf' -R '*.bz2' -R '*.gz' -R '*.mov' -R '*.fla' -R '*.xml' -R '*.json' -R '*.rss' -D mozillaservice.org http://mozillaservice.org
This method scrapes and archives most of the website. It excludes all files that we don't want to download due to space issues, such as PDF files and zipped files (this may vary on a site-by-site basis).
For a site that is approximately 1,500 pages in size, this process took about 2 hours to finish the archive. If you're not concerned about server usage for this particular site, you can remove the -w 2 (--wait) and --random-wait flags to be more aggressive towards the server.
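For readability, the same crawl can be sketched with long-form wget options and an optional throttle. The function name archive_site and the DRY_RUN variable are hypothetical conveniences, not part of the original procedure; --adjust-extension is the long form of -E in current wget releases (older releases call it --html-extension).

```shell
# archive_site DOMAIN [WAIT_SECS]
# Sketch only: mirrors the flags of the command above in long form.
# Set DRY_RUN=1 to print the command instead of running it.
archive_site() {
    domain=$1            # e.g. mozillaservice.org
    wait_secs=${2:-2}    # seconds between requests; 0 removes the throttle

    # POSIX sh has no arrays, so build the argument list in "$@".
    set -- \
        --recursive --page-requisites --adjust-extension \
        --convert-links --span-hosts --no-clobber \
        --tries=25 --retry-connrefused --no-check-certificate \
        --reject='*.pdf,*.bz2,*.gz,*.mov,*.fla,*.xml,*.json,*.rss' \
        --domains="$domain"

    # Be polite to the server unless the caller asked for no delay.
    if [ "$wait_secs" -gt 0 ]; then
        set -- "$@" --wait="$wait_secs" --random-wait
    fi

    if [ -n "${DRY_RUN:-}" ]; then
        echo "wget $* http://$domain"
    else
        wget "$@" "http://$domain"
    fi
}
```

Run it from the root of the checkout, for example `archive_site mozillaservice.org` for the throttled crawl or `archive_site mozillaservice.org 0` for the aggressive one.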
Privacy Actions
Once the site has been downloaded locally in its entirety, you will need to remove all code that refers to or collects user identifiable information.
Forms
Forms that request user information like email addresses and passwords will need to be removed from the codebase. Currently, we are handling this process manually by grepping for the forms.
grep -rn form * | grep action | grep -v svn
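The grep above also matches words such as "platform" or "information". As a narrower first pass, the sketch below lists only the HTML files that contain an opening form tag. The helper name list_form_files is hypothetical, and the --include and --exclude-dir options assume GNU or BSD grep.

```shell
# list_form_files [DIR]
# Print every .html file under DIR (default: current directory) that
# contains a "<form" tag, skipping Subversion metadata directories.
list_form_files() {
    grep -rli --include='*.html' --exclude-dir='.svn' '<form' "${1:-.}" | sort
}
```

Each file it prints still needs a manual review to decide whether the form collects user information.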
Email Addresses
All of the user identifiable information, such as email addresses, will need to be removed from the code. To locate email addresses in the code base, you may use the following egrep statement.
egrep -rn "\w+([._-]\w)*@\w+([._-]\w)*\.\w{2,4}" * | grep -v svn | grep -v "mozilla.org"
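To get a de-duplicated list of the addresses themselves rather than the matching lines, something like the following may help. It is a sketch with a deliberately loose pattern, the helper name list_emails is hypothetical, and it assumes a grep that supports -o and --exclude-dir (GNU or BSD grep).

```shell
# list_emails [DIR]
# Print each distinct email-like string found under DIR, excluding
# addresses at mozilla.org or any of its subdomains.
list_emails() {
    grep -rEoh --exclude-dir='.svn' \
        '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' "${1:-.}" |
        grep -Ev '@([[:alnum:]-]+\.)*mozilla\.org$' |
        sort -u
}
```

Once the list is empty, or contains only addresses that are safe to publish, this step is done.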
Resolving Redirects
After all files have been downloaded, you will need to test URLs to ensure that they resolve properly. More than likely, you will find that you need to alter the .htaccess file to append .html to extension-less URLs.
Options None +FollowSymLinks
Order Allow,Deny
Allow from all
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /mozillaservice.org
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule (.*) $1.html [L]
</IfModule>
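The effect of the two RewriteCond lines and the RewriteRule can be illustrated locally. The sketch below is an approximation of the rule's logic for checking the downloaded tree, not a substitute for testing against Apache; resolve_request is a hypothetical helper name.

```shell
# resolve_request DOCROOT URLPATH
# Print the file that the rewrite rule above would end up serving.
resolve_request() {
    target="$1/$2"
    if [ -f "$target" ] || [ -d "$target" ]; then
        # A RewriteCond fails: existing files and directories are served as-is.
        echo "$target"
    else
        # RewriteRule (.*) $1.html appends the extension.
        echo "$target.html"
    fi
}
```

For example, `resolve_request . activity/stories/en_US` prints `./activity/stories/en_US.html` when no file or directory named `en_US` exists, which is the file wget should have saved for that page.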
Archived Header
Finally, you will need to add a note to the top of each page making the user aware that this is an archived website. We're doing this by including a JavaScript file on each page that prints the archive statement at the top of the page.
Create a new file - js/archive.js. This file will allow us to append text to each of the pages, and will allow us to easily edit one file if there is a text change in the future. Use something resembling the following text:
(Open to copy, document.write contains divs)
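The original script is not reproduced here. One way to create a file of that shape from the shell is sketched below: it writes a document.write call that emits two divs whose ids, archive and archive_text, match the CSS rules used on this page. The banner wording and the helper name write_archive_js are placeholders, not the text Mozilla actually used.

```shell
# write_archive_js [SITE_DIR]
# Create SITE_DIR/js/archive.js (default: ./js/archive.js).
# The banner text is a hypothetical placeholder; replace it before use.
write_archive_js() {
    site_dir=${1:-.}
    mkdir -p "$site_dir/js"
    cat > "$site_dir/js/archive.js" <<'EOF'
// Placeholder wording: adjust for the site being archived.
document.write(
  '<div id="archive"><div id="archive_text">' +
  'This is an archived version of this website. It is no longer maintained.' +
  '</div></div>'
);
EOF
}
```

Running `write_archive_js mozillaservice.org` from the checkout root would create `mozillaservice.org/js/archive.js`.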
Use sed to insert the JavaScript include into each of the .html files on the site, just before the closing head tag:
find . -name "*.html" -print | xargs sed -i "s/<\/head/<script type=\"text\/javascript\" src=\"\/mozillaservice.org\/js\/archive.js\"><\/script><\/head/g"
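Pages without a closing head tag are silently skipped by the substitution, so it can be worth confirming that every page picked up the include. A minimal check, assuming GNU or BSD grep and using the hypothetical helper name list_missing_include:

```shell
# list_missing_include [DIR]
# Print every .html file under DIR that does not reference js/archive.js.
list_missing_include() {
    grep -rL --include='*.html' --exclude-dir='.svn' 'js/archive.js' "${1:-.}" | sort
}
```

An empty result means every page references the script; any file listed needs the include added by hand.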
Now update the site's global CSS file accordingly, ensuring that the changes are consistent with the site's design, e.g.:
#archive {
  margin: 0;
  padding: 5px;
  position: relative;
  text-align: center;
  padding: 14px 10px 15px 10px;
  color: #f5f3ed;
  background-color: #4d5151;
}
#archive_text {
  margin-left: auto;
  margin-right: auto;
  width: 740px;
  text-align: center;
  font: bold 1.143em/1 Arial, Calibri, Helvetica, "Helvetica Neue", sans-serif;
  line-height: 1.2em;
}
#archive_text a,
#archive_text a:hover {
  color: #fff;
  text-decoration: underline;
}
Commit
Once this site has been downloaded and all privacy concerns have been handled, you will need to commit the site to subversion. Then you will need to file an IT request in Bugzilla to have this code pushed to production.
Bugzilla
- File a bug to decommission all related staging servers.
- File a bug to move all open bugs to Website Graveyard or Webtools Graveyard.
Systems Operations Actions
A member of the Mozilla Systems Operations team will need to perform the following actions.
Subversion
Perform an update of subversion for http://website-archive.mozilla.org .
Ensure that the archived site is accessible by appending the site's domain name to the archive URI. Using mozillaservice.org as an example, the archived site will be available at http://website-archive.mozilla.org/mozillaservice.org
GitHub
- Make a note on the GitHub repo that the website is retired.
Website
First, expunge all user-specific data from the database, namely email addresses.
Second, back up the database for the existing website.
Next, in Apache, issue a permanent (301) redirect from every page of the website to the corresponding archived page on website-archive.mozilla.org.
For example, a user accessing the retired website: http://mozillaservice.org/activity/stories/en_US
Should be redirected to: http://website-archive.mozilla.org/mozillaservice.org/activity/stories/en_US
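The mapping in this example, and a spot check that the server really answers with a 301, can be sketched as follows. The function names are hypothetical; the check relies on curl's http_code and redirect_url write-out variables and assumes the server returns the Location header exactly in the form shown above.

```shell
ARCHIVE_HOST="website-archive.mozilla.org"

# archive_url_for URL
# Print the archive URL that a retired-site URL should redirect to.
archive_url_for() {
    rest=${1#*://}    # drop the scheme, keeping host and path
    echo "http://$ARCHIVE_HOST/$rest"
}

# check_redirect URL
# Succeed only if URL answers with a 301 pointing at the expected archive URL.
check_redirect() {
    expected=$(archive_url_for "$1")
    actual=$(curl -s -o /dev/null -w '%{http_code} %{redirect_url}' "$1")
    [ "$actual" = "301 $expected" ]
}
```

For example, `check_redirect http://mozillaservice.org/activity/stories/en_US && echo ok` could be run against a handful of representative pages once the redirect is in place.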
Finally, remove the website code from the server.