Releases/Post-mortems/Firefox 3.6.4
Schedule / Location / Call Information
- Thursday, 2010-08-12 @ 11:00 am PST (scheduled to last no longer than 1 hour 45 mins, shooting for 1 hour)
- In Warp Core
- 650-903-0800 x92 Conf# 8605 (US/INTL)
- 1-800-707-2533 (pin 369) Conf# 8605 (US)
Other communication channels
- join irc.mozilla.org #post-mortem for back channel (will be logged and attached here after)
- Etherpad for meeting notes can be found here if people want to use it
Overview
The project can be analyzed by slicing it into the following components:
- Feature development
- "One more blocking bug"/schedule slipping
- Crash-stats irregularities
- Quick spin for bug 574905
Timeline
- Firefox 3.6.3 shipped on April 1
- Christian originally scheduled 3.6.4 for May 11, got feedback and tightened up to May 4
- Lorentz / 3.6.3plugin1 beta went out on April 8
- Firefox 3.6.4 build #1 went as an opt-in beta on April 16
- Had to quickly hack/change all-beta for a manual download location. Delayed media push
- Found 9 blockers that were being fixed for build #2
- Bug 561308 prevented build 2 from being built on schedule and then bug 561817 made us start the builds a day late
- Firefox 3.6.4 build #2 had an issue and never went out
- nthomas found that bug 534666 landed on default and not the relbranch
- Christian emailed the Socorro team with OOPP reporting concerns on April 28
- Firefox 3.6.4 build #3 went as an opt-in beta on May 4
- Bug 563847 was determined to be a blocker, decided we couldn't go to the entire beta audience with build #3
- Christian posted in Farmville forums on May 5th asking for testing. The thread was promptly deleted
- chofmann asked on May 8 if we had Zynga contacts as some users were complaining about Farmville via Hendrix, beltzner said he would reach out and suggested we post in the user forums
- This was the status on May 10th
- On May 12th decided to create build #4 even though there were outstanding issues.
- This was also when it was first decided that 3.5.10 would stay tied to 3.6.4, in response to KaiRo
- Firefox 3.6.4 build #4 went to beta on May 14
- On May 17th metrics/Daniel set up the super-useful page at https://metrics.mozilla.com/stats/firefox.shtml
- Firefox 3.6.4 build #5 went to beta on May 26
- Didn't go on the 25th due to MV network issues
- Found bug 568129 before releasing to beta, knew we would have to respin but decided it wouldn't hurt to ship it to beta users on older builds
- Firefox 3.6.4 build #6 went to beta on May 28
- We called this a "release candidate" and did more press/blog posts. Weren't comfortable calling out 80% improvement as we weren't sure it would hold up in the release audience
- We were watching bug 563361 and bug 569104
- We were trying to get a handle on the Cnet issue. Also, this is the first time we seriously started to discuss turning off OOPP
- We had escalated with Adobe and were getting to the right people at Cnet
- More talk of splitting 3.5.10 and 3.6.4 at this time, TB guys getting antsy
- Christian got to the lead Flash developer and download.com product manager at Cnet on June 4
- Around June 10th (probably earlier) we noticed and became concerned about the crash spike
- Main tracking bug was bug 571118
- Also dbaron asked in conversation if socorro would be able to handle the increase in "crash" volume due to oopsies after releases
- Decided not to block on Cnet issue, but bug 562198 became a blocker as it prevented Linux users from using banking sites
- Firefox 3.6.4 build #7 went to beta on June 14
- Most crash-stats investigations were wrapping up by June 15
- TB team shipped on June 17 and didn't disclose security vulnerabilities that affect Firefox
- dbaron asked about socorro capacity on June 21
- crash-stats spike finally solved on June 21 (configuration problem)
- Firefox 3.6.4 shipped on June 22
- A security researcher got turned around and went public with this bug, thinking it was fixed in 3.6.4 (it wasn't shipped until 3.6.7)
- This bug flared up. RRRT saw it, and Lilly contacted Zynga. Required a quick 3.6.6 (as 1.9.2.5 was taken by Fennec)
- Socorro team had to turn throttling to 10% on June 25 as the system was overwhelmed
- Note that an increase was anticipated
- Firefox 3.6.6 shipped on June 26
Discussion points
- What does baking on trunk mean to us now?
- Could we have foreseen every new blocking bug and/or the bug that caused the respin? How can we not get blindsided in the future?
- Did we make the right call keeping 3.5.10 and 3.6.4 tied together?
- Did the "project branch → opt-in beta → beta → release" format work well? How might we do it differently/better?
- How can the schedule be better communicated when things are in flux?
- Is socorro at a state we are confident with? Are there more changes that need to be made? Are there future projects that may have the same sort of issues?
- Do we need an action plan for dealing with 3rd parties? How long are we expected to wait? Should we have a more formal outreach/partner program?
- What sort of things might we backport in the future? Are the lessons here specific to OOPP or can they be applied generally?
- clear mails (with correct subjects) on rel-drivers for record keeping
- not everyone reads never-ending scrollback
- easy to miss an important handoff if not reply-all, with changed subject
- hard to figure out historical
- problems tracking patches across branches
- bsmedberg reported problems tracking fixes on lorentz to m-c and then to moz192 and relbranch
- nthomas caught missed fix on relbranch with build#2
- any way to avoid one patch per respin?
- Metrics can work on Operational Metrics dashboards for systems that have complex interactions, or for systems that can be monitored for the trending effect of events such as a config change or a release. See [1]
Things that went right
Things that went wrong
Suggested improvements
- Release codenames to reduce confusion (?) (clegnitto)
- Branch landing verifier scripts (clegnitto)
- Need to not use IRC and meetings, need a written record
- Emails to release-drivers should have clear subjects with the version # and not be threaded
- Date-scoped queries for historical mining of bug state
- Need better defined/more formal beta program and feedback channels
- Create alternate plans at the beginning and add firebreaks with mitigation plans
- Use a range for certain schedule items (shipping in particular), to give some wiggle room and prevent excessive schedule churn
- Would be useful for RelEng to have SLAs so that release drivers can set expectations/urgency for each build. The information can also be looked at in post-mortems as well
- Front-load / "pre-mortem" (meeting, etc) the QA test plan to bring in new ideas and unique testing. Formally modify the plan as needed, to prevent ad-hoc QA test plans
- Socorro team came up with better trends/operational stats and linking them with events. Make this general / use it everywhere
- "Sightings" in bugs will potentially make sure fixes aren't missed when porting between branches
- Implement "Related items in external systems" key/value fields to link bugs to commits via automation
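As one possible shape for the "branch landing verifier scripts" suggested above, here is a minimal sketch in Python. It assumes commit messages mention bugs as "Bug NNNNNN" and that each branch's commit messages have already been exported to a list of strings. The function names (`bugs_in_log`, `missing_landings`) and the sample log data are hypothetical and only for illustration; this is not an existing Mozilla tool.

```python
import re

# Assumption: commit messages reference bugs as "Bug 123456" (case-insensitive).
BUG_RE = re.compile(r"\bbug\s+(\d+)", re.IGNORECASE)


def bugs_in_log(commit_messages):
    """Return the set of bug numbers mentioned in a list of commit messages."""
    found = set()
    for message in commit_messages:
        found.update(int(number) for number in BUG_RE.findall(message))
    return found


def missing_landings(expected_bugs, branch_logs):
    """For each branch, list the expected bugs that its log never mentions."""
    report = {}
    for branch, messages in branch_logs.items():
        landed = bugs_in_log(messages)
        report[branch] = sorted(set(expected_bugs) - landed)
    return report


# Illustrative data only: a fix that reached default but not the relbranch.
logs = {
    "default": ["Bug 534666 - placeholder summary for the fix"],
    "relbranch": ["Bug 111111 - unrelated placeholder commit"],
}
report = missing_landings([534666], logs)
for branch, missing in report.items():
    print(branch, "missing:", missing)
```

A check like this only matches on commit message text, so it would not notice backouts or a fix that landed under a different bug number; those cases would still need a human look.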