Services/Meetings/2011-09-20
- Time: Tuesday at 9:15 AM PST / 12:15 PM EST / 5:15 PM UTC.
- Place: Mozilla HQ, North Bridge
- Phone (US/Intl): 650 903 0800 x92 Conf: 8616#
- Phone (Toronto): 416 848 3114 x92 Conf: 8616#
- Phone (US): 800 707 2533 (pin 369) Conf: 8616#
Who's away?: ally at CMU; mconnor.
NO MEETING TODAY. Wiki only.
Ops
MySQL upgrades
We are upgrading MySQL in production starting today (Tuesday, Sep 20). Over the next ten days, we plan to upgrade all 130 servers. This is a Q2 goal that was deferred to Q3, and the upgrade is also expected to resolve the ongoing SCL2 crash issues.
Would you like to know more?
Currently we're running InnoDB databases using MySQL 5.1.44, build maria75, dated March 2010. We're upgrading to MySQL 5.1.58, build percona12.9, dated September 2011.
The new version passes load tests and will be deployed gradually so that any problems show up before the whole fleet is upgraded.
User data remains intact during the upgrade. The desired user experience is 10 minutes of 503 Server Error. As currently implemented, a 503 shows a Server Issue error bar to users, so users will see more of that during the upgrade. Firefox 7+ improves the 503 experience; ask Philikon or Marina for details.
Server responsiveness may fluctuate for the first few hours as MySQL adapts to what we've done to it. We saw an interesting pattern over several hours as it calibrated to the write performance of the server.
The approximate timeline is:
- Tue Sep 20 - Upgrade 30 database servers.
- Thu Sep 22 - Upgrade 50 database servers.
- Tue Sep 27 .. Thu Sep 29 - Upgrade the rest.
We've discussed user notification a bit tonight on IRC, but no final decision has been made. I expect there will be a single tweet from the Mozilla Services account.
There are three classes of database hardware: 6-drive iX (old PHX), 24-drive iX (new PHX), and 9-drive Cuil (SCL2), with a variety of Linux md layouts.
Some of each hardware class will be upgraded in each batch to improve our chances of catching any issues. Rolling back will either take a few minutes, the same as the upgrade, or require a user migration.
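As a rough illustration of "some of each class in every batch", here is a minimal Python sketch that interleaves servers by hardware class and then slices them into batches. The batch sizes of 30, 50 and 50 follow from the timeline above, on the assumption that all 130 servers are database servers. The function name, the hostnames and the per-class server counts are hypothetical, invented only for the example. This is not the tooling Ops actually uses.

```python
from collections import Counter, defaultdict


def plan_batches(servers, batch_sizes):
    """Split servers into upgrade batches that mix hardware classes.

    servers: list of (hostname, hardware_class) tuples.
    batch_sizes: size of each batch; must sum to len(servers).
    Returns a list of lists of hostnames.
    """
    if sum(batch_sizes) != len(servers):
        raise ValueError("batch sizes must cover every server exactly once")

    # Group hostnames by hardware class, preserving input order.
    by_class = defaultdict(list)
    for host, hw_class in servers:
        by_class[hw_class].append(host)

    # Round-robin across classes so any contiguous slice mixes them
    # for as long as every class still has servers left.
    queues = [list(hosts) for hosts in by_class.values()]
    interleaved = []
    while any(queues):
        for queue in queues:
            if queue:
                interleaved.append(queue.pop(0))

    # Cut the interleaved list into consecutive batches.
    batches, start = [], 0
    for size in batch_sizes:
        batches.append(interleaved[start:start + size])
        start += size
    return batches


if __name__ == "__main__":
    # Hypothetical inventory: the per-class counts are made up.
    inventory = (
        [("old-phx-%d" % i, "6-drive iX") for i in range(40)]
        + [("new-phx-%d" % i, "24-drive iX") for i in range(50)]
        + [("scl2-%d" % i, "9-drive Cuil") for i in range(40)]
    )
    hw_of = dict(inventory)
    for day, batch in zip(["Sep 20", "Sep 22", "Sep 27-29"],
                          plan_batches(inventory, [30, 50, 50])):
        print(day, dict(Counter(hw_of[h] for h in batch)))
```

Round-robin is only one way to get the mix. With very uneven class sizes the smallest class can run out before the last batch, so a real plan would want to check the per-batch counts.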
We have a variety of Pencil graphs tracking Sync, MySQL, and InnoDB statistics on all production databases. There are many days of historical data.
Upgrading any single database will likely require a user-facing downtime of about 10 minutes. On any single server, MySQL may choose to make us wait up to 60 minutes for the mandatory clean shutdown. Hopefully this is rare.
Membase
https://bugzilla.mozilla.org/show_bug.cgi?id=687731 (sync server core) should be implemented for the production deployment of Membase. We need to spend a day or two storing data in Membase before the cutover occurs; otherwise production will experience several hours of poor performance and request timeouts.
[atoll] I saw load test results for Membase yesterday, so it's up and running, with what appeared to be stable performance characteristics. Petef has more details.
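To make the warm-up idea concrete, here is a minimal Python sketch of dual-writing: every write goes to both the current cache and the new one, while reads still come from the current cache only, so the new cache fills up before the cutover. The class names and the dict-backed stand-in are hypothetical. This is an assumption about the general shape of "writing to two memcaches", not the change tracked in bug 687731.

```python
class DictCache:
    """In-memory stand-in for a cache client (hypothetical, for the sketch)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)


class DualWriteCache:
    """Write to both caches; read from the primary only.

    `primary` is the cache serving traffic today; `secondary` is the
    one being warmed up ahead of the cutover.
    """

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.secondary_errors = 0

    def get(self, key):
        # Reads stay on the primary until the cutover.
        return self.primary.get(key)

    def set(self, key, value):
        self.primary.set(key, value)
        self._mirror(self.secondary.set, key, value)

    def delete(self, key):
        self.primary.delete(key)
        self._mirror(self.secondary.delete, key)

    def _mirror(self, operation, *args):
        # A failure while warming the new cache must not fail the request;
        # count it so it can be graphed or alerted on.
        try:
            operation(*args)
        except Exception:
            self.secondary_errors += 1
```

Swallowing errors from the secondary is a design choice for the warm-up period only. After the cutover the roles flip, and errors from what is then the primary need to reach the caller.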
Engineering
Core Server Platform
Roadmap (Toby)
Big Lebowski (rtilder)
Sync
Client (philikon/rnewman)
Teensy train this week after the all-hands.
- Bug 663181 - Automatic cleanup for Sync error logs.
- Bug 686366 - Canceling Sign In wizard page triggers "Weave is not defined" error.
mak is back, so favicons might progress. Lots of interesting discussions at all-hands.
Server (Tarek)
Notifications
Client/Server (JR/Paul)
- work week
- Investigated alternate notification-like applications: Deuxdrop, WebRTC
- Worked with benbangert on more MQ related requirements.
- Helped a lot of droid noobs set up their new tablets.
Next:
- Client
- Docs & testing
- Test Deployment
Beta Channel
Program Launch (mconnor)
QA
Testing and Sign-offs (tracy)
- No client train last week during the all-hands.
- Two bugs for the client train this week. Builds failed yesterday so we're waiting on respins.
- We're ramping up Miheala from Softvision to help with weekly client trains. She is 10 hours ahead of MV, so delays like the build failure will often prevent her from helping directly on s-c.
- I just finished the account setup smoketest for Mozmill. The test case runs against stage. If/when captcha is enabled there, we'll need bug 679828 fixed so automation can sidestep the captcha.
- From speaking with jgriffin during the all-hands, I believe we should write our Mozmill test cases to take advantage of TPS capabilities.
FunkLoad/Automation Scripts (jbonacci)
- Nothing new to report
- This work should pick up again once we have our new equipment and our new hire.
Sync Server (jbonacci)
- Focus this week is on the deployment of Membase to Stage and Production, tracked in the following bugs:
- Bug 686485 - roll out membase
- Bug 687731 - support writing to two memcaches
- QA also needs some face time with petef before he travels back to NJ.
- Sync Server configuration stuff, mostly
TPS (jgriffin)
- Tracy is currently working with jgriffin to synchronize their client-side automation efforts between Mozmill and TPS.