Releases/Firefox 3.0.6/Post Mortem: Difference between revisions

Revision as of 04:17, 6 February 2009

The following are the meeting notes from the Firefox 3.0.6 post-mortem held on Friday, February 6, 2009 at 12:00pm PST.

Development

  1. [issue here]

QA

  1. [issue here]

Build

  1. [issue here]

IT

  1. Bouncer slave database was disabled, causing updates to fail when the master was under high load (bug 476753). Had to pull/push updates a few times while debugging.
    1. Root Cause: tm-bouncer01-slave02 was disabled in the load balancer during a previous maintenance window and never re-enabled.
    2. Actions:
      1. Need to monitor backend service status on the load balancer (bug 476764) in the same way we monitor origin web servers. Nagios would have alerted after the maintenance window that the second slave was missing.
      2. Bouncer needs three databases to withstand a failure of one during release (bug 477183).

  2. Had to throttle bits after release (bug 476875) because mirrors couldn't handle load. Why? How do we ensure this doesn't happen in the future?
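
The monitoring action for the Bouncer database issue (bug 476764) could be sketched roughly as follows. This is a minimal illustration, not the check IT actually deployed. The function name check_pool, the dict-based pool input, and the hostname tm-bouncer01-slave01 are assumptions made for the example, and the notes do not describe how the load balancer exposes backend status. Only the exit codes follow the standard Nagios plugin convention.

```python
# Hypothetical Nagios-style check for load balancer pool membership.
# The pool state is passed in as a dict (backend name -> enabled flag) because
# the real load balancer's status interface is not described in these notes.
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def check_pool(backends, expected):
    """Return (exit_code, message) for the expected backends of one pool."""
    # Backends that should be in the pool but are not listed at all.
    missing = sorted(name for name in expected if name not in backends)
    # Backends that are listed but administratively disabled.
    disabled = sorted(name for name, enabled in backends.items()
                      if name in expected and not enabled)
    active = len(expected) - len(missing) - len(disabled)
    if active == 0:
        return CRITICAL, "CRITICAL: no active backends in pool"
    if missing or disabled:
        down = ", ".join(missing + disabled)
        return WARNING, f"WARNING: {active}/{len(expected)} active; down: {down}"
    return OK, f"OK: {active}/{len(expected)} backends active"


# Example: the state described in the root cause, with one slave left disabled.
# tm-bouncer01-slave01 is an assumed name for the other slave.
pool = {"tm-bouncer01-slave01": True, "tm-bouncer01-slave02": False}
code, message = check_pool(pool, expected=list(pool))
print(code, message)
```

A check like this would flag a slave left disabled after maintenance before a release starts, rather than when updates begin to fail. With three databases in the expected list (bug 477183), losing one would still leave two active backends.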

Websites

  1. [issue here]