{{Release Engineering How To|Dealing with high pending counts}}
= Dealing with high pending counts =
Demand will sometimes outstrip supply in the worker pool. There are several possible issues that may result in a test/build backlog.
At this point, Mozilla's automation uses two CI tools for validating the changes made to the code base: '''Buildbot''' and '''TaskCluster'''. Hence, the first step in dealing with a high pending count is finding out which automation is affected. We have a Nagios check in place that constantly monitors the number of pending jobs in each worker pool and alerts in #buildduty when that number exceeds certain thresholds.
<pre>
<nagios-releng> Fri 01:11:44 UTC [7253] [moc] nagios1.private.releng.scl3.mozilla.com:Pending jobs is CRITICAL: CRITICAL Pending Jobs: 2556 on ['t-w864-ix'] (http://m.mozilla.org/Pending+jobs)
</pre>
For Buildbot, the number of pending jobs is available in [https://secure.pub.build.mozilla.org/buildapi/pending BuildAPI] and is also displayed per worker type in [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/index.html SlaveHealth].
As for TaskCluster, you can find the pending count for individual worker pools by accessing the corresponding endpoint from https://queue.taskcluster.net/v1/pending/{}/{}.
* e.g. https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010
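If you only need the raw numbers, the following is a minimal sketch of querying both sources directly. It assumes BuildAPI returns JSON when asked with format=json and that the TaskCluster queue response carries a "pendingTasks" field; treat both as assumptions and adjust if the payloads differ.
<pre>
import requests

def taskcluster_pending(provisioner, worker_type):
    """Return the pending count for one TaskCluster worker pool."""
    url = "https://queue.taskcluster.net/v1/pending/%s/%s" % (provisioner, worker_type)
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    # Assumption: the queue answers {"provisionerId": ..., "workerType": ..., "pendingTasks": N}
    return resp.json().get("pendingTasks")

def buildbot_pending():
    """Return the total number of pending Buildbot jobs from BuildAPI."""
    resp = requests.get("https://secure.pub.build.mozilla.org/buildapi/pending",
                        params={"format": "json"}, timeout=30)
    resp.raise_for_status()
    # Assumption: BuildAPI groups pending jobs by branch and then revision
    pending = resp.json().get("pending", {})
    return sum(len(jobs) for branch in pending.values() for jobs in branch.values())

if __name__ == "__main__":
    print("gecko-t-osx-1010:", taskcluster_pending("releng-hardware", "gecko-t-osx-1010"))
    print("buildbot total:", buildbot_pending())
</pre>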
There's also a handy Python [https://hg.mozilla.org/build/braindump/file/tip/nagios-related/check_pending_jobs.py script] that can be used to see the pending counts for each worker pool at a glance:
<pre>
$ python check_pending_jobs.py -h
usage: check_pending_jobs.py [-h] [-B] [-t] [-C CRITICAL] [-W WARNING]
                             [-c CRITICAL] [-w WARNING] [-b] [-T]

optional arguments:
  -h, --help            show this help message and exit
  -B, --builds          compute number of pending builds per machine pool
  -t, --tests           compute number of pending tests per machine pool
  -C CRITICAL, --builds_critical CRITICAL
                        Set builds CRITICAL level as integer eg. 300
  -W WARNING, --builds_warning WARNING
                        Set builds WARNING level as integer eg. 200
  -c CRITICAL, --tests_critical CRITICAL
                        Set tests CRITICAL level as integer eg. 3000
  -w WARNING, --tests_warning WARNING
                        Set tests WARNING level as integer eg. 2000
  -b, --buildbot        Display pending jobs on buildbot machine pools
  -T, --taskcluster     Display pending jobs on taskcluster workers
</pre>
== What platforms are affected? ==
== Where is the load coming from? ==
=== Buildbot ===
==== Is there a spike in pending jobs? ====
There's a dashboard in Grafana where you can look at the Buildbot pending counts per platform:
https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/pending
==== Did nightlies just get triggered? Did nightlies just trigger dependent l10n jobs? ====
These are predictable daily sources of spiky load.
==== Did the trees just open following a closure? ====
There is usually a big pulse of activity right after a tree closure as developers start landing code again.
==== Is someone abusing the try server? ====
Occasionally someone will (re)trigger a large number of jobs on a single changeset. People who do this often and with good reason (e.g. jmaher) usually do so on weekends when there is less contention for the infrastructure. If someone does this mid-week, it's best to find them on IRC and figure out why they've done this. You may need to cancel some/all of their extra jobs if it's impacting other developers.
There is no designated tool for finding these types of abuses. When looking for abuse, check which branch the pending jobs are coming from (usually try), and then walk back through the [https://treeherder.mozilla.org/#/jobs?repo=try treeherder] history for that branch looking for revisions with multiple jobs triggered.
A particular backlog source is running the same set of tests multiple times (e.g. by pushing with "--rebuild 20"). That is generally needed when dealing with intermittent failures that are difficult to reproduce and require multiple runs. Unfortunately, it will sometimes unfairly block other users from getting their test results in time. To confirm the backlog is coming from such pushes, check the current jobs in the pending queue, look for duplicated entries, and then go to treeherder and see which user those pushes belong to.
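As a rough way to spot such pushes, the sketch below groups the Buildbot pending queue by branch and revision and flags revisions with an unusually large number of pending jobs. The threshold and the BuildAPI JSON layout (pending jobs grouped by branch and then revision) are assumptions; verify the suspicious revisions in treeherder before acting on them.
<pre>
import requests

THRESHOLD = 50  # arbitrary cut-off for "suspiciously many pending jobs per push"

resp = requests.get("https://secure.pub.build.mozilla.org/buildapi/pending",
                    params={"format": "json"}, timeout=30)
resp.raise_for_status()
pending = resp.json().get("pending", {})

for branch, revisions in pending.items():
    for revision, jobs in revisions.items():
        if len(jobs) >= THRESHOLD:
            # Look the revision up in treeherder to find the author of the push
            print("%s %s: %d pending jobs" % (branch, revision, len(jobs)))
</pre>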
==== Are infrastructure problems causing retries? ====
Builds will retry if they cannot fetch a package or cannot upload their resulting binaries. For example, IT once implemented a DNS redirect to a server where we didn't have ssh keys to upload the resulting binaries (see {{bug|1198296}}). In such cases builds fail and retry, and pending counts rise.
==== Are new AWS instances starting and running buildbot? ====
There will be an alert in #buildduty regarding aws_watch_pending.log not being updated if new instances are not being created. A common cause is a typo in configs/watch_pending.cfg. Look at the logs on the aws manager instance (/var/log/messages); there should be an error message regarding a typo in the json file. We shouldn't really get to that point because there are tests to verify this, but sometimes it happens. For example, https://bugzilla.mozilla.org/show_bug.cgi?id=1195893#c8. If there are AWS instances starting, ssh to an instance that has recently started and look at /var/log/runner.log to see if there are errors. Does /builds/slave/twistd.log indicate that builds are completing on this machine?
==== Is there a problem with the AMI golden master? ====
Each night, we create new AMIs for Amazon instances from our puppet configs. Once an AMI is ready, all new instances are created with that image. If there is a problem with the image, it has to be corrected and new AMIs generated. If the image is broken to the extent that it should be pulled, you can deregister the AMI in the Amazon console so the previous night's AMI can be used instead. To quickly bring down the instances that were launched with the problem AMI, you can use this script on aws-manager2.srv.releng.scl3.mozilla.com.
For an example of this problem, see {{bug|1203104}}
==== Are we underbidding for AWS spot instances? ====
We use AWS spot instances for a large proportion of our continuous integration farm's capacity. We have an algorithm that bids for the different instance types within a range of prices. The prices are here: https://github.com/mozilla/build-cloud-tools/blob/master/configs/watch_pending.cfg#L63 and the algorithm is here: https://github.com/mozilla/build-cloud-tools/blob/master/cloudtools/aws/spot.py. If we are underbidding for the current cost of the spot instances, we won't get any new AWS instances and pending counts will go up. There's a Nagios check in place that should notify us in #buildduty when this happens.
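As an illustration only (the real bid logic lives in watch_pending.cfg and spot.py linked above), the sketch below uses boto3 to compare current spot prices against a set of hypothetical bid ceilings; the instance types, region, and prices are placeholders, not our production configuration.
<pre>
import boto3

# Hypothetical bid ceilings in USD/hour -- the real values live in watch_pending.cfg
MAX_BIDS = {"c3.xlarge": 0.10, "m3.xlarge": 0.10}

ec2 = boto3.client("ec2", region_name="us-east-1")
history = ec2.describe_spot_price_history(
    InstanceTypes=list(MAX_BIDS),
    ProductDescriptions=["Linux/UNIX"],
    MaxResults=50,
)

for record in history["SpotPriceHistory"]:
    itype = record["InstanceType"]
    price = float(record["SpotPrice"])
    if price > MAX_BIDS[itype]:
        print("underbidding %s in %s: spot %.4f > max bid %.4f"
              % (itype, record["AvailabilityZone"], price, MAX_BIDS[itype]))
</pre>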
==== Are we unable to connect to AWS due to network issues? ====
Chaos really. We depend heavily on AWS.
==== Are there problems connecting to the Buildbot database? ====
If masters cannot connect to the database due to network or other issues, the pending count will probably not increase; it will just stay the same, because jobs aren't deleted from the database as they complete.
==== Did a reconfig not complete successfully on a master? ====
For example, an alert that pending builds are increasing, combined with a stale reconfig lockfile, may indicate that a master is in a hung state and not running jobs. In one such case, it was an AWS master.
In this case, we needed to hard restart the master because the buildbot process was hung. SIGTERM didn't work, so we had to kill -9 the buildbot process. It would also be a good idea to disable such a master in slavealloc as soon as the problem appears so jobs can be allocated to other masters.
==== Is there an unusual amount of buildbot lag? ====
There's a check in place that will alert in #buildduty when the master lag goes past certain limits. You can also look at the graph for buildbot lag: {{bug|1220191}}
==== Was the increase in pending jobs sudden and after a reconfig? ====
This may have been due to bogus scheduling jobs being added. See {{bug|1223042}} for an example of how to fix this issue.
==== What code/systems have changed? ====
Figure out how long the backlog has been building, and then check the [[ReleaseEngineering/Maintenance|Maintenance wiki]] to see if code has recently landed that would be affecting that platform.
* ???
==== Has there been a network or systems event? ====
We get nagios alerts in #buildduty for the following:
* BGP flaps from AWS: this can affect connectivity between slaves and masters in AWS
If there are no alerts, it is worth asking in #MOC and/or #infra to see if IT is tracking any events not currently on our nagios radar.
=== TaskCluster ===
==== Is coalescing working? ====
We have SETA configured to coalesce (run certain test jobs less often) on taskcluster on the autoland, mozilla-inbound, and graphics branches. This coalescing does not apply to mac tests until {{bug|1382204}} is resolved. If a large number of new test jobs have been added recently, their profile might not be in SETA yet, and they will thus contribute to a higher load. See {{bug|1386405}} for an example of how to resolve this issue.
==== Are we hitting EBS limits? ====
When the demand for running workers is high, we may hit certain EBS limits for our AWS account and won't be able to spawn new instances. If that happens, we should coordinate with the TaskCluster team and investigate why we have such a large number of running workers. If needed, they can contact Amazon to increase those limits.
Bug [https://bugzilla.mozilla.org/show_bug.cgi?id=1391564 1391564] serves as a good example.
==== Bad AMIs? ====
The workers spawned from such AMIs may not be able to take any jobs, which will in turn result in a growing backlog. In such cases, we should ping someone in #taskcluster to roll back the problematic AMIs to the last known good ones. It may also help to temporarily bump the capacity for certain pools until the pending counts drop to reasonable values.
==== Is autologin not working? ====
Similar to the case above, this would also prevent the existing workers from running any new tasks, and it may be a consequence of changing the cltbld passwords on those machines. The most recent example is bug [https://bugzilla.mozilla.org/show_bug.cgi?id=1376807 1376807].
= Rebooting workers in batches =
Please note, if you need to call other Slave API actions, such as "shutdown" instead of reboot, see the API docs here: http://mozilla-slaveapi.readthedocs.org/en/latest/api/#endpoints
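For rebooting a pool in batches, a minimal sketch along the lines below can be used. The SlaveAPI base URL, the endpoint path (/slaves/<name>/actions/reboot), the worker names, and the batch size are all assumptions for illustration; check the API docs linked above for the actual endpoints before running anything like this.
<pre>
import time
import requests

SLAVEAPI = "https://slaveapi.example.releng.mozilla.com/slaves"  # assumed base URL
BATCH_SIZE = 10                                # illustrative batch size
WORKERS = ["t-w864-ix-001", "t-w864-ix-002"]   # hypothetical worker names

for i in range(0, len(WORKERS), BATCH_SIZE):
    batch = WORKERS[i:i + BATCH_SIZE]
    for name in batch:
        # Assumption: POST /slaves/<name>/actions/reboot queues a reboot for that worker
        resp = requests.post("%s/%s/actions/reboot" % (SLAVEAPI, name), timeout=60)
        print(name, resp.status_code)
    # Give each batch time to come back up before starting the next one
    time.sleep(300)
</pre>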
= See also =
https://mana.mozilla.org/wiki/display/NAGIOS/Backlog+Age