No edit summary |
(→Deciding to close a tree: added worker monitoring link, screenshot and explanation) |
||
Line 5: | Line 5: | ||
Some of the criteria used include: | Some of the criteria used include: | ||
* Broken build on an integration or main tree (e.g. mozilla-inbound, mozilla-central, autoland) | * Broken build on an integration or main tree (e.g. mozilla-inbound, mozilla-central, autoland) | ||
* Excessive backlog for builds or tests in any platform | * Excessive backlog for builds or tests in any platform ([https://earthangel-b40313e5.influxcloud.net/d/slXwf4emz/workers?orgId=1&refresh=1m&from=now-3h&to=now&var-provisioner=aws-provisioner-v1&var-provisioner=gce&var-provisioner=proj-autophone&var-provisioner=releng-hardware&var-provisioner=scriptworker-prov-v1&var-workerType=All Grafana monitoring dashboard]). Example: | ||
[[File:Sheriffing workers vs queue.png|center]] | |||
The upper graph shows the number of active workers for a worker type; the lower graph shows the number of pending jobs waiting to run. Normally, the number of active workers increases to work off the backlog. If that is not possible (after 20:00 in the example), e.g. because the limit for the number of workers has been reached or there is an infrastructure issue, sheriffs must regularly check the trees they monitor: builds should start in less than 15 minutes and tests within 30 minutes. If they do not, the trees must be closed (category "infrastructure" if the full capacity is not in use, "backlog" if Taskcluster is already using machines up to the capacity limit). Regardless of whether the trees need to be closed, #ci on IRC should be notified about the issue and a bug should be filed. | |||
* Infrastructure or systems failures that affect a significant number of tests or builds (e.g. AWS, data center, networking issues) | * Infrastructure or systems failures that affect a significant number of tests or builds (e.g. AWS, data center, networking issues) | ||
* Mass "bustage" that could hide other test failures (this is when code lands and causes multiple tests to fail across multiple chunks of tests or suites of tests, making it harder to catch further failures if something else lands *during* the period in which these tests are failing from the original code landing) | * Mass "bustage" that could hide other test failures (this is when code lands and causes multiple tests to fail across multiple chunks of tests or suites of tests, making it harder to catch further failures if something else lands *during* the period in which these tests are failing from the original code landing) |
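The backlog criterion above (builds starting in less than 15 minutes, tests within 30 minutes, and the choice between the "infrastructure" and "backlog" categories) can be sketched as a small decision function. This is only an illustration, not an existing sheriffing tool: the names `QueueStatus` and `closure_category` are hypothetical, the wait times and worker counts are assumed to be read manually from the Grafana dashboard, and treating exactly 30 minutes as still acceptable for tests is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds taken from the text above.
BUILD_START_LIMIT_MIN = 15  # builds should start in less than 15 minutes
TEST_START_LIMIT_MIN = 30   # tests should start within 30 minutes


@dataclass
class QueueStatus:
    """Hypothetical snapshot of one worker type, read off the dashboard."""
    build_wait_min: float  # how long builds currently wait before starting
    test_wait_min: float   # how long tests currently wait before starting
    active_workers: int    # upper graph: active workers
    worker_limit: int      # configured maximum number of workers


def closure_category(status: QueueStatus) -> Optional[str]:
    """Return the tree closure category, or None if no closure is needed."""
    builds_late = status.build_wait_min >= BUILD_START_LIMIT_MIN
    tests_late = status.test_wait_min > TEST_START_LIMIT_MIN
    if not (builds_late or tests_late):
        return None
    # Capacity fully used: the queue is simply too long.
    if status.active_workers >= status.worker_limit:
        return "backlog"
    # Capacity left unused: something prevents workers from starting.
    return "infrastructure"
```

For example, `closure_category(QueueStatus(20, 10, 100, 100))` yields `"backlog"` because builds are late while all workers are in use. The function only covers the closure decision; notifying #ci and filing a bug remain necessary even when it returns `None` but the backlog is not shrinking.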