Releases/Post-mortems/Firefox 3.6.4

Schedule / Location / Call Information

  • Thursday, 2010-08-12 @ 11:00 am PST (scheduled to last no longer than 1 hour 45 mins, shooting for 1 hour)
  • In Warp Core
  • 650-903-0800 x92 Conf# 8605 (US/INTL)
  • 1-800-707-2533 (pin 369) Conf# 8605 (US)

Other communication channels

  • Join #post-mortem on irc.mozilla.org for the back channel (it will be logged and attached here afterwards)
  • Etherpad for meeting notes can be found here if people want to use it

Overview

The project can be analyzed by slicing it into the following components:

Feature development

"One more blocking bug"/schedule slipping

Crash-stats irregularities

Quick spin for bug 574905

Timeline

Firefox 3.6.3 shipped on April 1

Christian originally scheduled 3.6.4 for May 11, then got feedback and tightened the date to May 4

Lorentz / 3.6.3plugin1 beta went out on April 8

Firefox 3.6.4 build #1 went as an opt-in beta on April 16

* Had to quickly hack/change all-beta for a manual download location. Delayed media push

* Found 9 blockers that were being fixed for build #2

Bug 561308 prevented build 2 from being built on schedule and then bug 561817 made us start the builds a day late

Firefox 3.6.4 build #2 had an issue and never went out

* nthomas found that bug 534666 landed on default and not the relbranch

Christian emailed the Socorro team with OOPP reporting concerns on April 28

Firefox 3.6.4 build #3 went as an opt-in beta on May 4

* Bug 563847 was determined to be a blocker, decided we couldn't go to the entire beta audience with build #3

Christian posted in Farmville forums on May 5th asking for testing. The thread was promptly deleted

chofmann asked on May 8 if we had Zynga contacts, as some users were complaining about Farmville via Hendrix. beltzner said he would reach out and suggested we post in the user forums

This was the status on May 10th

On May 12th we decided to create build #4 even though there were outstanding issues.

* This was also when it was first decided that 3.5.10 would stay tied to 3.6.4, in response to KaiRo

Firefox 3.6.4 build #4 went to beta on May 14

On May 17th metrics/Daniel set up the super-useful page at https://metrics.mozilla.com/stats/firefox.shtml

Firefox 3.6.4 build #5 went to beta on May 26

* Didn't go on the 25th due to MV network issues

* Found bug 568129 before releasing to beta, knew we would have to respin but decided it wouldn't hurt to ship it to beta users on older builds

Firefox 3.6.4 build #6 went to beta on May 28

* We called this a "release candidate" and did more press/blog posts. Weren't comfortable calling out 80% improvement as we weren't sure it would hold up in the release audience

* We were watching bug 563361 and bug 569104

We were trying to get a handle on the Cnet issue. This was also the first time we seriously discussed turning off OOPP

* We had escalated with Adobe and were getting to the right people at Cnet

* More talk of splitting 3.5.10 and 3.6.4 at this time, TB guys getting antsy

Christian got to the lead Flash developer and download.com product manager at Cnet on June 4

Around June 10th (probably earlier) we noticed and became concerned about the crash spike

* Main tracking bug was bug 571118

* Also dbaron asked in conversation if socorro would be able to handle the increase in "crash" volume due to oopsies after releases

Decided not to block on Cnet issue, but bug 562198 became a blocker as it prevented Linux users from using banking sites

Firefox 3.6.4 build #7 went to beta on June 14

Most crash-stats investigations were wrapping up by June 15

TB team shipped on June 17 and didn't disclose security vulns that affect Firefox

dbaron asked about socorro capacity on June 21

Crash-stats spike was finally solved on June 21 (configuration problem)

Firefox 3.6.4 shipped on June 22

A security researcher got turned around and went public with this bug, thinking it was fixed in 3.6.4 (it wasn't shipped until 3.6.7)

This bug flared up. RRRT saw it, and Lilly contacted Zynga. Required a quick 3.6.6 (as 1.9.2.5 was taken by Fennec)

Socorro team had to turn throttling to 10% on June 25 as the system was overwhelmed

* Note that an increase was anticipated

Firefox 3.6.6 shipped on June 26

Discussion points

  • What does baking on trunk mean to us now?
  • Could we have foreseen every new blocking bug and/or the bug that caused the respin? How can we not get blindsided in the future?
  • Did we make the right call keeping 3.5.10 and 3.6.4 tied together?
  • Did the "project branch → opt-in beta → beta → release" format work well? How might we do it differently/better?
  • How can schedule be better communicated when things are in flux?
  • Is socorro at a state we are confident with? Are there more changes that need to be made? Are there future projects that may have the same sort of issues?
  • Do we need an action plan for dealing with 3rd parties? How long are we expected to wait? Should we have a more formal outreach/partner program?
  • What sort of things might we backport in the future? Are the lessons here specific to OOPP or can they be applied generally?
  • clear mails (with correct subjects) on rel-drivers for record keeping
    • not everyone reads never-ending scrollback
    • easy to miss an important handoff if not reply-all, with changed subject
    • hard to figure out the history after the fact
  • problems tracking patches across branches
    • bsmedberg reported problems tracking fixes on lorentz to m-c and then to moz192 and relbranch
    • nthomas caught missed fix on relbranch with build#2
  • any way to avoid one patch per respin?
  • Metrics can work on Operational Metrics dashboards for systems that have complex interactions, or for systems that can be monitored for the trending effect of things such as a config change or a release. See [1]

Things that went right

Things that went wrong

Suggested improvements

  • Release codenames to reduce confusion (?) (clegnitto)
  • Branch landing verifier scripts (clegnitto)
  • Need to stop relying on IRC and meetings; need a written record
  • Emails to release-drivers should have clear subjects with the version # and not be threaded
  • Date-scoped queries for historical mining of bug state
  • Need better defined/more formal beta program and feedback channels
  • Create alternate plans at the beginning and add firebreaks with mitigation plans
  • Use a range for certain schedule items (shipping in particular), to give some wiggle room and prevent excessive schedule churn
  • Would be useful for RelEng to have SLAs so that release drivers can set expectations/urgency for each build. The information can also be reviewed in post-mortems
  • Front-load / "pre-mortem" (meeting, etc) the QA test plan to bring new ideas and unique testing. Formally modify the plan as needed, to prevent ad-hoc QA test plans
  • Socorro team came up with better trends/operational stats and linking them with events. Make this general / use it everywhere
  • "Sightings" in bugs will potentially make sure fixes aren't missed when porting between branches
  • Implement a "Related items in external systems" (key/value fields) to link bugs to commits via automation
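
The "branch landing verifier scripts" idea above could look roughly like the following Python sketch. It assumes that commit messages mention bugs as "Bug NNNNNN" and that the log of each branch has already been collected into a list of strings. The function names, the sample commit messages, and the input shape are illustrative assumptions, not an existing Mozilla tool.

```python
import re

# Assumption: commit messages reference bugs as "Bug 123456" (case-insensitive).
BUG_RE = re.compile(r"\bbug\s+#?(\d+)", re.IGNORECASE)


def bugs_in_log(commit_messages):
    """Collect every bug number mentioned in a list of commit messages."""
    found = set()
    for message in commit_messages:
        found.update(int(number) for number in BUG_RE.findall(message))
    return found


def missing_landings(required_bugs, branch_logs):
    """Return {branch: sorted bug numbers that have no matching commit}.

    required_bugs: iterable of bug numbers that must land everywhere.
    branch_logs:   hypothetical mapping of branch name -> commit messages.
    """
    report = {}
    for branch, messages in branch_logs.items():
        missing = set(required_bugs) - bugs_in_log(messages)
        if missing:
            report[branch] = sorted(missing)
    return report


if __name__ == "__main__":
    # Made-up commit messages that mirror the build #2 situation,
    # where a fix landed on default but not on the relbranch.
    logs = {
        "default": ["Bug 534666 - example fix. r=someone"],
        "relbranch": ["Bug 561308 - another example fix. r=someone"],
    }
    print(missing_landings({534666}, logs))
```

Run before a build is tagged, a non-empty report would flag the kind of missed relbranch landing that nthomas caught by hand with build #2.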

Other reference material