Firefox/Channels/Postmortem/66
Notes for 66.0 release post mortem
9:15am PT, Thursday April 4 (after the channel meeting). Vidyo channel: ReleaseCoordination. IRC: #release-drivers
Attendees
Dan, Julien, Liz, Marcia, Mike Kaply, Pascal, Ritu, Ryan, Sebastian, Tania
Stats
- On nightly: 7 weeks
- On beta: 7 weeks
- Uplifts to beta: 323
- A bit high because of planned uplifts from Activity Stream
- https://mzl.la/2TE7ZUd
- 65: 260 Beta / 16 Release
- 64: 245 Beta / 13 Release
- Uplifts to release (RC): 13
- Fixed in 66: patches from 3475 bugs
- Uplifts verified: 152. Thank you, QA folks!
What went well
- WNP (What's New Page) went smoothly! Again!
- Coordination with Activity Stream project, QA (lots of planned uplifts)
- Antivirus didn't cause big problems on release - an improvement!
- This has likely improved thanks to increased QA testing, progress on the inject/eject work, and ongoing communication with antivirus vendors
- Fennec stability looks good on the Google Play Store! ANRs (Application Not Responding reports) and crashes seem to be under control
QA Updates
- Relatively low volume of PI requests this time - https://docs.google.com/spreadsheets/d/1vPyFxPmnQt5YZThJlFaeyriLkeXU6sE6jgEHgAryUbo/edit#gid=0
- All the PI requests were filed by the deadline.
- Feature documents were due by Dec 14th; 3 features didn't meet the deadline
- The code readiness date was Dec 28th; 4 features landed after the deadline (the number was 7 for Fx65)
- Coordination with respect to WNP went really well this time, we were notified about the schedule early on
- Communication with the engineering teams was very good for most of the features, as was the documentation we received. In some cases (Update Certs Error), we needed to ask for information 2-3 times before receiving an answer
- Shipping dot releases the same day as betas - e.g. 65.0.1, 60.5.1esr, and 66.0b7 on Feb 12 - puts strain on QA resources.
- When timing isn't critical, Relman can try to schedule dot releases for Wednesdays. Otherwise, discuss with us so we can move a beta.
- 11 requests were received late in the cycle (last 3 weeks): https://docs.google.com/spreadsheets/d/1-qR-YqigLNZmOKlKUV6Ypr1QRBJ196ysejUaHrvKtF4/edit#gid=2050060441 Of these, PI-38, PI-39, and PI-41 were urgent and had to be handled during RC week.
- Office 365 - 3 new test cases were added to the regression test suite: https://testrail.stage.mozaws.net/index.php?/suites/view/1694 (C321230, C321231, and C321233)
Challenges
Last-minute release blockers:
- 4 last-minute confidential issues from search
- 3 WebRTC blockers
- More testing earlier in beta? More automated tests?
- Safebrowsing key / GLS follow-up
- aarch64/Windows blocker that was missed earlier
- Cleanup of release checklist between 66.0 and 66.0.1/2
- we missed some minor checklist items - fast pace (and maybe a little burnout)
- Will rework the checklist and process doc to help with this (liz)
- Webcompat changes - breakage in Office 365 docs.
- Need better product planning and rollout of such changes
- https://docs.google.com/document/d/1fT2v8cOKRhCiLSUiqu7slfwL4fg2_MJektVLn58PVvc/edit#heading=h.sg56l0hk5phg
- Maybe an "intent to ship" announcement could have helped here (pascal)
- [marcia] Fennec issues which prevented rollout
More examples of last-minute release-blocking issues
Blank new tab issue: This was a scramble, but it went pretty well.
Users reported a blank new tab page with extensions that change the new tab or home page; there were several duplicate reports from beta.
AMO data showed that 2 million users might be affected. Kris Maglione landed a (complex) fix in Nightly on March 1, and the patches landed on mozilla-beta on March 8 for the RC build. I asked for feedback from the Product team; the impact here was worth the risk of a late uplift of several complex patches.
https://bugzilla.mozilla.org/show_bug.cgi?id=1518863
What helps in these situations? Supporting data. Lining up super-review and testing so things can move as quickly as possible. As a developer or PM, keep checking in until the code lands and is verified *where it's going to ship*. Add it to the post mortem for later reflection.
- Is the dev/team under resourced?
- Does this area of code need more automated tests?
- Did we escalate quickly enough?
https://bugzilla.mozilla.org/show_bug.cgi?id=1498973
- UI issue that got 10+ duplicate bug reports in 65/66, affecting the Windows 10 1809 update only.
- Slowly escalated in importance from December till February; tracked in late February. Tried the mailing list. Adam fixed it last week.
- Uplift was not recommended; we tried it, but it didn't quite make the release, so the plan was to aim for a dot release (66.0.3).
- Checking telemetry for 1803 vs. 1809 users changed our minds, because the problem will become more widespread over the next couple of months. Landed for the RC build.
- Awesome work from everyone.
https://bugzilla.mozilla.org/show_bug.cgi?id=1536453
- PowerPoint and other Office 365 issues (PowerPoint editing is the blocker).
- The root cause here is a long-standing effort to "fix" a webcompat issue by shipping window.event and aligning event.keyCode/keypress behavior with Blink.
- Lots of ongoing discussion about the best way to proceed, because we're just not getting these issues reported by our pre-release population.
- The corporate nature of some of the breakage is concerning - could we do more outreach to the people who manage deployments inside large organizations? -- overholt
66.0.1 discussion
- Went smoothly
- Good preparation and planning, coordination
- QA was prepared
- Partly luck that we got the bugs on Thurs rather than on Friday
66.0.2 discussion next Tuesday.
A note on sec bug uplifts for dot releases (dan)
- Be more cautious; discuss with the sec team
- Liz to add a line to our dot release process doc to this effect.
- A sec-critical rating alone doesn't mean a fix is necessary for a dot release.
- In this particular case, the reporter was a Mozilla employee and the issue was over-rated (it should have been sec-high), so it could wait for the 67 release. Shipping it would also have meant an ESR point release, and it wasn't high priority enough to drive an ESR release on its own.
Open H264 issue
This is from the etherpad at https://public.etherpad-mozilla.org/p/66-openH264, created to coordinate discussion of a last-minute release blocker.
High volume crash in WebRTC
https://bugzilla.mozilla.org/show_bug.cgi?id=1535766 https://crash-stats.mozilla.com/signature/?signature=mozilla%3A%3AWebrtcGmpVideoEncoder%3A%3AEncoded
- Most crash URLs are from chatroulette.com, dirtyroulette.com, etc., but there are also many crashes without URLs.
- Because of Bug 1532756, OpenH264 on Android was non-functional for Firefox 66 from around Jan 10 to March 8, so we would not expect to see crashes from before March 8th.
- It was working on 65 release.
- Once bug 1532756 was fixed and the fix went out on the release channel, we started getting crashes
- Fennec 66 is live at a 5% update rate on the release channel.
Options
- We can keep investigating options and hold back updates until we have a fix and are ready to do a 66.0.1.
- Or we could disable OpenH264 entirely and do another 66.0 build today (Monday)
- Complications are possible because of the Pwn2Own contest. What are some possible complications?
- If we need to ship a fix to Fennec in response to Pwn2Own, we'll likely do that without a fix for this issue, so I'll be aiming to ship this fix next week rather than this week. That's all. -- lizzard
- Option 1: Throttle updates to 1%
- Option 2: Leave updates at 5% (This is what I'm going with for now - Lizzard)
- Option 3: Do an official release anyway (at 25%)
- Option 4: Publish release notes and note this crash as a known issue
Investigating: bc, dminor; waiting on feedback from jesup and Stefan Arentz. drno is on PTO.
- dminor has a patch ready and reviewed if we want to disable OpenH264 completely on Fennec.
- I'm going to write and test a patch to handle unaligned loads - dminor
- bc investigated but didn't find anything useful with Bughunter (which doesn't test Android anyway)
- stefan: We should address this regression. No strong opinion on how to handle the release or on disabling OpenH264.
- Andreas & Adam (PM): Doing an official release with this crash is not acceptable.
- Are there security concerns around the release? I.e., is it risky to keep users on 65 a little longer?
Can we estimate the time to fix this issue?
- If a fix is expected to take a long time (multiple weeks), we should probably do a release of 66 with OpenH264 disabled.
- If it's only a matter of a week or so until a fix has landed, we can just push the release of 66 to next week instead of doing it this week.
Risks: We don't fully understand the impact of disabling H264 for users - it might cause other issues / breakage.
- Jesup: If this is due to GMP plugin changes that happen to hit the unaligned case now, we can easily fix it by removing the alignment restriction and loading byte-at-a-time (see the sketch below), presumably in a point release. Alternatively, we could respin the OpenH264 codec with code to avoid this case (which we'd have to write, but it would go out with a small OpenH264 update).
- drno (on vacation) will look later on Monday; Maire will weigh in for drno.
- Do we have data on how common OpenH264 use is? Telemetry that shows when it's used? (OpenH264 should only be called if the user participates in a WebRTC call with specific endpoints that only talk H264.)
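For illustration, here's a minimal sketch of the byte-at-a-time approach jesup describes above. This is hypothetical code, not the actual GMP or OpenH264 source, and the function names are made up. The idea: on some ARM targets, a word-sized load through a misaligned pointer faults, while copying through memcpy lets the compiler emit whatever access pattern the target actually permits.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical sketch of the "unaligned case" discussed above; not the
// actual GMP/OpenH264 code. On some ARM targets, dereferencing a pointer
// that isn't 4-byte aligned can fault (SIGBUS).

// Risky: undefined behavior, and a possible crash, if p is misaligned.
uint32_t ReadWordAssumingAligned(const uint8_t* p) {
  return *reinterpret_cast<const uint32_t*>(p);
}

// Safe for any alignment: memcpy reads byte-at-a-time in the abstract
// machine, and compilers lower it to a single load where that is legal.
uint32_t ReadWordAnyAlignment(const uint8_t* p) {
  uint32_t value;
  std::memcpy(&value, p, sizeof(value));
  return value;
}
```

Replacing direct dereferences with the memcpy pattern removes the alignment restriction without changing the data layout, which is why it's plausible as a small point-release fix.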
Background
OpenH264 1.8.1 was just released by us within the last week or so and is only targeting Nightly at the moment, desktop only, because our graphics team noticed a crash with it on Nightly.
- (Do we have a bug number for this crash? I'm working on the update and was unaware that we had seen a problem with it on Nightly. - dminor)
- (Bug 1533001 - lizzard)
- (OK, thanks; that is a bug I filed. I thought there was a separate one as well.)
There have been no changes in the OpenH264 plugin; we're still shipping 1.7.1.
So that would leave a compiler-related change, something GMP-related, or data corruption.