Tree Closures

Revision as of 19:52, 7 May 2008

Overview

Whenever the main tinderbox tree has to be closed, please record the date, the time the closure started, a rough time when the problem first appeared (if different from the closure start time), and, eventually, the time the tree reopened. We need this information to track infrastructure problems and try to resolve them in the future.

Please keep all times in Mozilla Standard Time (US Pacific, same time as on tinderbox). Put more recent closures on top of old ones. Please include links to any relevant bugs.
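
As an illustration of the fields requested above, a new entry might be laid out roughly as follows in wiki markup. This is only a sketch modeled on the existing entries; the bracketed placeholders are hypothetical and should be replaced with real values.

```text
* <Day, Month Date>
** <close time> - <open time> (problem first noticed around <time>)
*** <Cause of the closure> ({{bug|<bug number>}})
```

The open time can be left blank while the tree is still closed and filled in once it reopens.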

Recent Closures

  • Wed, May 7
    • 8:00 AM - 12:50 PM
      • Waiting on backout of (bug 432492) to cycle through.

Older Closures

  • Tuesday, May 6
    • 8:20 - 9:00 PM
      • Talos machines were failing due to cvs-mirror issues (bug 432570).
  • Thursday, May 1
    • Start 12:00 PM
      • TUnit does not complete, qm-centos5-02 orange since yesterday. jst and mkaply in range, jst and sicking investigating.
  • Thursday, April 24
    • 8pm - 11pm
      • expected outage for graph server and buildbot maintenance
  • Wednesday April 16
    • 3PM - 1AM
      • Unexpected outage started ~3pm, bug 429406
      • Tree closed at 8:10 due to qm-xserve-01 still not working
  • Tuesday April 8
    • 2:24AM PDT - 4:31AM
      • Tree closed due to bug 427723 and bug 427728.
      • Windows nightly box restarted and completed, talos boxes started testing
  • Monday April 7
    • 7 PM PDT
      • Tree has been orange for too long (unit test failures) and then someone checked in theme changes (bug 427555) that caused red.
      • The orange was fixed after bug 426501 was backed out.
  • Saturday April 5
    • 00:48 PDT - 20:45
      • unit test failures across 3 platforms
      • filed bug 427248, remaining issues spun off to:
        • bug 425987 worked around reftest failures with a larger timeout
        • bug 426997 for the new PGO test box - still burning, will ignore.
  • Friday, Mar 28, 2008
    • 16:23 PDT - 22:40 PDT
      • Bonsai DB replication issues
  • Tuesday, Mar 25, 2008
    • 14:30 PDT - 15:05 PDT
      • Leak test machines orange
        • Kai's patch for bug 420187 caused a leak, he fixed it
    • 14:30 PDT - 16:10 PDT
      • Windows unit test failures
        • the Windows unit test machine (qm-win2k3-01) had failed for a few cycles for various reasons, wanted to get a green cycle in before accepting more checkins.
        • bug 425081 filed about machine trouble
  • Saturday, Mar 22, 2008
    • 13:48 PDT - 18:10 PDT
      • Windows talos machines are all red
        • fallout was from enabling strict file: URI security policy
        • alice checked in a config change to talos to disable strict URI policy on talos, filed bug 424594 to get talos in line with this strict policy
        • still closed waiting on unit test orange to resolve.
  • Tuesday, March 18, 2008
    • 12:03 Wednesday, Mar 19, 2008
      • johnath re-opened tree after test failures cleared and rapid cycle test boxes were reporting numbers in the pre-closure range
    • 20:22
      • Major network issues at the MPT colo which hosts... everything. Closed until services are back online (including IRC).
      • bug 423882 for details on the missing talos boxes.
    • 17:31 - 17:39
      • Network issues at moco, closing to avoid a mess if tinderboxes and/or bonsai are taken down by it.
  • Friday March 14, 2008
    • 14:25 - 16:03
      • mac and windows unit test boxes stopped cycling sometime between 2am and 7:45am
      • dbaron noticed when firebot announced tinderbox falling off the 12hr waterfall page at 14:23; Waldo had noticed and commented in #developers at 12:16 (and again at 12:42) but was preoccupied and only had time to follow up properly at 14:23 to ask for tree closure
      • tree closed, bug 423015 filed
    • 6:00am - 07:40am
      • stage migration, closed to make sure Talos reconfig works OK and builds keep flowing
  • Thursday March 6, 2008
    • 7:00pm - 7:40pm
      • closed due to fx-win32-tbox bustage. Cleared stuck process, reopening assuming that was the issue rather than wait another hour for the PGO-lengthened build to finish
  • Wednesday March 5, 2008
    • 2:40pm - 4:00pm
      • Closed for orange on win2k3-01 that looks like memory corruption
      • Caused by bug 418703, backed out
    • 8am - 11:30am
      • bugzilla/CVS down, closed tree for talos bustage because of no-CVS
  • Wednesday, February 27, 2008
    • 9:30pm - 12:10am
      • Overlooked 2 tests that had also broken during the day (bug 420028 due to removal of DOMi, bug 384370 due to incomplete backout) and were not focus issues
      • Box has some other test/build failures that went away by themselves after being kicked
    • 7:40pm - 9:30pm
      • Windows box was orange with focus problems (bug 420010)
      • Missed from earlier in day due to expected issues while PGO was landing
    • 5:30pm - 7:40pm
      • Closed to do Linux kernel upgrades (bug 407796)
      • Quiet tree due to B4 freeze, and holding approval1.9b4 flags for enabling PGO on Windows, so closure has minimal impact
      • Took a bit longer than expected due to a reboot problem (bug 420007)
  • Tuesday, February 26, 2008
    • 22:38 - 00:49
      • problem started at 22:33 when qm-win2k3-01 turned red again
      • same problem as earlier in the day: "Failure: buildbot.slave.commands.TimeoutError: SIGKILL failed to kill process" during tests
      • myk closed the tree around 22:38 to wait out the bustage, since the machine is our only Windows unit test box
      • got stuck with an open file such that each build failed really quickly
      • bug 419799 filed to get sysadmin help to fix unit test box
      • mrz, who was on-call that evening, jumped into IRC and then went and kicked buildbot; first he closed some dialog about some process having crashed; then he restarted buildbot, but it didn't start building; then he killed both buildbot and its cmd process and then restarted it, after which it started building and completed successfully
      • dbaron reopened the tree to metered checkins of b4 blockers
  • Tuesday, February 26, 2008
    • 18:44 - 19:40
      • problems started with linux txul perf regression
      • continued with fxdbug-win32-tb reporting: ###!!! ASSERTION: invalid active window: 'Error', file e:/builds/tinderbox/Fx-Trunk-Memtest/WINNT_5.2_Depend/mozilla/embedding/components/windowwatcher/src/nsWindowWatcher.cpp, line 1086
      • continued with five cross-platform unit test failures, three reftests and two mochitests
      • reed backed out bug 419452, one of two candidates for the perf regression
      • dbaron fixed the three reftests and one mochitest, which were from his checkin for bug 363248
      • myk backed out sicking's fix for bug 416534, which had caused the last mochitest failure
      • various folks speculated that the fxdbug-win32-tb assertion was random (it didn't show up on Mac or Linux)
      • myk reopened the tree, feeling that things were under control
      • reed backed out second perf regression candidate (bug 395609) when initial backout didn't resolve it
      • sicking fixed test failure in bug 416534 and relanded
      • others started landing again
      • unit test tinderboxes cycled green
      • reed's second backout fixed perf regression
      • reed's first or second backout also fixed the fxdbug-win32-tb assertion
  • Tuesday, February 26, 2008
    • 5:10pm - 5:46pm
      • problem started at 5:06pm when qm-win2k3-01 turned red
      • reed said that machine frequently hits this random bustage and then recovers
      • reed previously noted at the top of the page that "qm-win2k3-01 is the only Windows unit test machine, so if it is orange or red, you should NOT check in."
      • myk, the sheriff for the day, closed the tree to wait out the apparently random failure and then reopened it when the next build came up green
      • reed thought there might be an old bug on the problem but wasn't sure, so dbaron filed bug 419761 on the problem to make sure it's tracked and not forgotten
      • wolf also filed bug 419759 on fixing or replacing winxp01 so we aren't entirely reliant on win2k3-01 for windows unittests
  • Sunday, February 24, 2008
    • 11:30am - 4:45pm (with orange lasting longer, with tree open)
      • problem started 9:23am, amid some other bustage
      • closed by dbaron a little after 11:30am
      • filed bug 419328: Windows unit test box stopped cycling
      • aravind hard-rebooted the box, came back with a bunch of popup tests orange.
      • test still orange after another cycle (forced by dbaron)
      • dbaron reopened tree 4:45pm despite the unfixed machine-related orange
      • joduinn rebooted the box again around 4:30pm
      • this time it came back with the color depth wrong, so the PNG reftests failed (but the mochitests worked)
      • color depth issue fixed 6:45am Monday
  • Thursday, February 21, 2008
    • 10pm - 12am
      • Dietrich
      • Closed to facilitate l10n string freeze
  • Tuesday, February 19, 2008
    • 9:10 PM - 10:45 PM
    • 1:00 AM - 3:00 AM (guess)
      • No sheriff (night) (guess)
      • Closed for experimental landing of bug 399852. The checkin stuck.
  • Wednesday, February 13, 2008
    • 1:00 PM - 1:40 PM (problem first noticed around 9:40 AM)
      • No sheriff
      • bug 417313 -- graphs.mozilla.org can't keep up with data being submitted
      • reopened after machines started going green; db load lessened, but underlying issue has not been fixed.