Sheriffing/Deciding To Close A Tree: Difference between revisions

no edit summary
Line 1: Line 1:
== Deciding to close a tree ==
 
Many objective and subjective criteria are part of the decision to close a tree. Tree closure means that developers are prevented from pushing or merging code to a codebase. Later, sheriffs will reopen the trees when the problem appears to be resolved.
Many objective and subjective criteria are part of deciding to close a tree. Tree closure means that developers are prevented from pushing or merging code to a codebase. Later, sheriffs will reopen the trees when the problem appears to be resolved.


Some of the criteria used include:  
Line 8: Line 7:
* Infrastructure or systems failures that affect a significant number of tests or builds (e.g. AWS, data center, networking issues)
* Mass "bustage" that could hide other test failures (this is when code lands and causes multiple tests to fail across multiple chunks of tests or suites of tests, making it harder to catch further failures if something else lands *during* the period in which these tests are failing from the original code landing)
* Infra failure that affects our ability to see what's happening (e.g. treeherder being down or not ingesting jobs or the data it consumes not being updated, or treestatus being broken so we're closed by default)
* Infrastructure failure that affects our ability to see what's happening (e.g. Treeherder being down, not ingesting jobs, the data it consumes not being updated, or treestatus being broken so we're closed by default)
 
In short, if the state of the tree and the surrounding systems is such that things are going to get worse if the trees stay open, it is time to close the tree.


== Actions to take ==
Once you've decided to close the tree, you need to take the following steps:


Most tree closures are caused by failing code (e.g. build bustages or test failures) and are normally fixed very quickly. However, there may be longer tree closures caused by unplanned infrastructure problems that need deeper investigation by teams such as IT, Releng, and Taskcluster.
* Decide which trees are affected
 
** If the cause is not infrastructure- or load-related, you can probably leave the Try branch open and only close the affected tree, e.g. mozilla-inbound.
In these cases, we may need to notify developers in order to:
** If the cause *is* infrastructure- or load-related, you should close all trees, including Try.
 
* Use the [https://mozilla-releng.net/treestatus TreeStatus] tool to close the affected trees.
 
* If a bug doesn't already exist, create a bug for the tree closure.
* Avoid frustration when people want to push to Try and notice it's not possible
* Communicate the tree closure to developers. Announce the closure in IRC in #developers and change the channel topic to point to the tree closure bug. This avoids several unnecessary frustrations:
* Keep developers from spending time on failures in pending/running test runs (especially on Try) that are not caused by their changes
** developers wanting to push and noticing that it's not possible
* Reduce repeated questions to sheriffs about why a tree is closed, what the ETA is, etc.
** developers spending time investigating failures in pending/running test runs (especially on Try) that are not caused by their changes
 
** repeated inquiries to the sheriffs about why a tree is closed, what the ETA is, etc.
=== Steps for the on-duty sheriff ===
* Engage the people needed to fix the issue. This could be:
** the developer(s) who landed the suspected code (if this is known)
** domain experts for the module where the builds/tests are failing. The [[Modules/All|Module owner list]] can help track people down.
** the buildduty, releng, and/or Taskcluster teams if it's an infrastructure issue
* If the tree closure is expected to last a long time, post a short message to the mozilla.dev.platform newsgroup, e.g. https://groups.google.com/forum/#!topic/mozilla.dev.platform/Kzd1es4KiYA
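
The rule of thumb for choosing which trees to close can be sketched in a few lines of Python. This is only an illustration of the decision described in the list above: the <code>Cause</code> enum and the <code>trees_to_close</code> function are hypothetical names, not part of TreeStatus or any other Mozilla tool, and the tree names are passed in by the caller.

```python
from enum import Enum


class Cause(Enum):
    """Hypothetical classification of why a closure is needed."""
    CODE = "code"                      # a landing broke builds or tests
    INFRASTRUCTURE = "infrastructure"  # e.g. AWS, data center, networking
    LOAD = "load"                      # systems overloaded


def trees_to_close(cause, affected_tree, all_trees):
    """Return the sorted list of trees to close for the given cause."""
    if cause in (Cause.INFRASTRUCTURE, Cause.LOAD):
        # Infrastructure- or load-related: close everything, including Try.
        return sorted(all_trees)
    # Otherwise close only the affected tree and leave Try open.
    return [affected_tree]


if __name__ == "__main__":
    trees = {"mozilla-inbound", "mozilla-central", "try"}
    print(trees_to_close(Cause.CODE, "mozilla-inbound", trees))
    print(trees_to_close(Cause.INFRASTRUCTURE, "mozilla-inbound", trees))
```

The sketch only decides which trees are affected. Closing them is still done by hand in the TreeStatus tool.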


1.) Make clear that there is an outage and who the current sheriff is by changing the topic of the #developers channel on IRC. Also add the tracking bug and, if possible, an ETA to the topic of #developers.
== Next steps ==
In case it wasn't clear from the previous section, '''COMMUNICATION''' is the ''most important thing'' during any outage. When engaging others to help, letting them know that the trees are closed ''usually'' encourages prompt cooperation. Developers or service teams may need your help to test fixes efficiently on Try, or to back out particular changesets once they are implicated. Make yourself available to these people as required.


example:
Everyone else not involved in fixing the issue is simply waiting for the trees to reopen and is effectively blocked. Developers and service teams may not have the time or experience to reliably update the tree closure bugs (or IRC) with status. Sheriffs should take the lead on this and keep other developers updated, both in the bug and in IRC.
Onduty-Sheriff: Tomcat - All Tree Closure - Database issue Bug 1234567 - no eta yet
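
A topic line in the shape of the example above could be assembled with a small helper, so that every closure announcement carries the same fields. This is a minimal sketch: <code>format_topic</code> and its parameters are hypothetical, and the layout is simply copied from the example topic, not from any official convention.

```python
def format_topic(sheriff, scope, reason, bug_number, eta=None):
    """Build a #developers topic line for a tree closure.

    All parameter names are illustrative; the field order mirrors
    the example topic shown above.
    """
    # Fall back to the wording used in the example when no ETA is known.
    eta_text = f"eta {eta}" if eta else "no eta yet"
    return (
        f"Onduty-Sheriff: {sheriff} - {scope} - "
        f"{reason} Bug {bug_number} - {eta_text}"
    )


if __name__ == "__main__":
    print(format_topic("Tomcat", "All Tree Closure", "Database issue", 1234567))
```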


2.) Also post about the tree closure in #treestatus.
== Hand-off ==
In the event of longer tree closures, you may hit the end of your workday. If this happens, or better still, *before* it happens, find someone who can continue overseeing the tree closure. If you've followed all the above steps, there should be an adequate paper trail for someone to follow and continue.


3.) If the tree closure is expected to be a longer problem, post a short mail to the mozilla.dev.platform and dev-fxos <dev-fxos@lists.mozilla.org> newsgroups, like https://groups.google.com/forum/#!topic/mozilla.dev.platform/Kzd1es4KiYA. As in that thread, the next sheriff can send an all-clear message once the issue is fixed.
Note: in the event of more systemic failures, e.g. major infrastructure failures or AWS outages, it is best to invoke the [https://mana.mozilla.org/wiki/display/MOC MOC] (Mozilla Operations Center, #moc on IRC). They have 24/7 support and are well-versed in processes similar to the above.