= Dealing with high pending counts =
Demand will sometimes outstrip supply in the worker pool. There are a number of possible issues that may result in a test/build backlog.

At this point, Mozilla's automation uses two CI tools for validating the changes made to the code base: '''Buildbot''' and '''TaskCluster'''. Hence, the first step in dealing with high pending counts is finding out which automation is affected. We have a Nagios check in place that constantly monitors the number of jobs in each worker pool and alerts in #buildduty when that number exceeds certain thresholds.
<pre>
<nagios-releng> Fri 01:11:44 UTC [7253] [moc] nagios1.private.releng.scl3.mozilla.com:Pending jobs is CRITICAL: CRITICAL Pending Jobs: 2556 on ['t-w864-ix'] (http://m.mozilla.org/Pending+jobs)
</pre>
For Buildbot, the number of pending jobs is available in [https://secure.pub.build.mozilla.org/buildapi/pending BuildAPI] and is also displayed per worker type in [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/index.html SlaveHealth].

As for TaskCluster, you can find the pending count for an individual worker pool by accessing the corresponding endpoint under https://queue.taskcluster.net/v1/pending/{provisionerId}/{workerType}.

* e.g. https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010
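If you just need the raw number on the command line, a minimal sketch along these lines works (assuming curl and python are available on the host; the <code>pendingTasks</code> field name is an assumption, so check the JSON the queue actually returns):

<pre>
# Query the pending count for one worker pool and print the count field.
curl -s "https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010" \
  | python -c 'import json,sys; print(json.load(sys.stdin).get("pendingTasks"))'
</pre>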
== Where is the load coming from? ==

=== Buildbot ===

==== Is there a spike in pending jobs? ====

There's a dashboard in Grafana where you can look at the Buildbot pending counts per platform:

https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/pending
==== Did nightlies just get triggered? Did nightlies just trigger dependent l10n jobs? ====
==== Are infrastructure problems causing retries? ====

For example, if builds cannot fetch a package, they will retry; the same happens if builds cannot upload their resulting binaries. IT once implemented a DNS redirect to a server where we didn't have ssh keys to upload the resulting binaries (see {{bug|1198296}}). In cases like this, builds fail and retry, and pending counts rise.
==== Are new AWS instances starting and running buildbot? ====

There will be an alert in #buildduty regarding aws_watch_pending.log not being updated if new instances are not being created. A common cause is a typo in configs/watch_pending.cfg. Look at the logs on the AWS manager instance (/var/log/messages); there should be an error message about the typo in the JSON file. We shouldn't really get to that point because there are tests to verify this, but sometimes it happens (for example, https://bugzilla.mozilla.org/show_bug.cgi?id=1195893#c8). If there are AWS instances starting, ssh to an instance that has recently started, look at /var/log/runner.log for errors, and check whether /builds/slave/twistd.log indicates that builds are completing on this machine.
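A hedged sketch of those checks from the command line (the instance hostname is a placeholder; the log paths come from the text above):

<pre>
# On the AWS manager instance, look for recent watch_pending errors (may need sudo):
grep -i error /var/log/messages | tail -n 20

# On a recently started instance (replace the placeholder hostname):
ssh <recently-started-instance> 'tail -n 50 /var/log/runner.log'
ssh <recently-started-instance> 'tail -n 50 /builds/slave/twistd.log'
</pre>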
==== Is there a problem with the AMI golden master? ====

Each night, we create new AMIs for Amazon instances from our puppet configs. Once an AMI is ready, all new instances are created with this image. If there is a problem with the image, it has to be corrected and new AMIs generated. If the image is broken to the extent that it should be pulled, you can deregister the AMI in the Amazon console so the previous night's AMI is used instead. To quickly bring down the instances that were launched with the problem AMI, you can use this script on aws-manager2.srv.releng.scl3.mozilla.com:
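(The original aws-manager script is not reproduced here. As a rough, hypothetical sketch of the same idea using the AWS CLI, assuming you have looked up the bad AMI id in the console and have credentials for the relevant account:)

<pre>
# Hypothetical sketch, not the aws-manager script: terminate running instances
# that were launched from a known-bad AMI.
BAD_AMI="ami-xxxxxxxx"   # the problem AMI id (assumption)
aws ec2 describe-instances \
  --filters "Name=image-id,Values=${BAD_AMI}" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId' --output text \
  | xargs -r aws ec2 terminate-instances --instance-ids
</pre>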
==== Are we underbidding for AWS spot instances? ====

We use AWS spot instances for a large proportion of our continuous integration farm's capacity. We have an algorithm that bids for the different instance types within a range of prices. The prices are defined at https://github.com/mozilla/build-cloud-tools/blob/master/configs/watch_pending.cfg#L63 and the algorithm lives at https://github.com/mozilla/build-cloud-tools/blob/master/cloudtools/aws/spot.py. If we are underbidding against the current spot prices, we won't get any new AWS instances and pending counts will go up. There's a Nagios check in place that should notify us in #buildduty when this happens.
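To compare our bids against what the market is currently asking, one option is a quick AWS CLI query (a sketch only; the region and instance type are examples, and this assumes you have credentials configured for the relevant account):

<pre>
# Show the most recent spot prices for one instance type in one region.
aws ec2 describe-spot-price-history \
  --region us-west-2 \
  --instance-types c3.xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --query 'SpotPriceHistory[].[AvailabilityZone,SpotPrice]' --output table
</pre>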
==== Are we unable to connect to AWS due to network issues? ====

Chaos, really. We depend heavily on AWS.

==== Are there problems connecting to the Buildbot database? ====

The masters may be unable to connect to the database due to network or other issues. The pending count will probably not increase; it will just stay the same, because jobs aren't deleted from the database as they complete.
==== Did a reconfig not complete successfully on a master? ====

For example, an alert that pending builds are increasing, together with a stale reconfig lockfile, may indicate that a master is in a hung state and not running jobs. In the case below, it was an AWS master.
<pre>
Mon 11:58:31 PDT [4448] cruncher.srv.releng.scl3.mozilla.com:Pending builds is WARNING: WARNING Pending Builds: 4697 (http://m.mozilla.org/Pending+builds)
relengbot
[sns alert] Mon 12:09:06 PDT buildbot-master04.bb.releng.usw2.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
</pre>
Looking at the log, it hadn't been updated for several hours:
<pre>
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ tail !$
tail master/twistd.log
2015-09-14 09:00:49-0700 [-] prioritizeBuilders: 0.17s builders with slaves: 0
2015-09-14 09:00:49-0700 [-] prioritizeBuilders: 0.17s prioritized 0 builder(s): []
2015-09-14 09:00:49-0700 [-] prioritizeBuilders: 0.17s using up to first 0 builders:
2015-09-14 09:00:49-0700 [-] prioritizeBuilders: 0.17s found 0 important builder(s): []
2015-09-14 09:00:49-0700 [-] prioritizeBuilders: 0.17s finished prioritization
2015-09-14 09:00:50-0700 [-] Creating BuildMaster -- buildbot.version: 0.8.2-hg-41f0fbee10f4-production-0.8
2015-09-14 09:00:50-0700 [-] loading configuration from /builds/buildbot/tests1-linux32/master/master.cfg
2015-09-14 09:00:50-0700 [-] JacuzziAllocator 1268988944: created
2015-09-14 09:00:50-0700 [-] nextAWSSlave: start
2015-09-14 09:00:50-0700 [-] nextAWSSlave: start
</pre>
Use strace to look at the hung process:
<pre>
# checking the reconfig proc
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ ls -l /proc/4847/fd | grep 3
l-wx------ 1 cltbld cltbld 64 Sep 14 12:25 1 -> /builds/buildbot/tests1-linux32/reconfig.log
l-wx------ 1 cltbld cltbld 64 Sep 14 12:25 2 -> pipe:[8988493]
lr-x------ 1 cltbld cltbld 64 Sep 14 12:25 3 -> /builds/buildbot/tests1-linux32/master/twistd.log
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ ps auwx | grep python
cltbld 1045 0.0 0.1 196880 5256 ? S Aug29 7:05 /builds/buildbot/queue/bin/python /builds/buildbot/queue/tools/buildbot-helpers/command_runner.py -q /dev/shm/queue/commands -r 10 -t 60 -m 60 -j4
cltbld 1057 0.0 0.7 226360 26316 ? S Aug29 10:39 /builds/buildbot/queue/bin/python /builds/buildbot/queue/tools/buildbot-helpers/pulse_publisher.py --passwords /builds/buildbot/queue/passwords.py -q /dev/shm/queue/pulse
cltbld 4847 0.0 0.2 176936 9292 ? S 09:00 0:02 python ./tools/buildfarm/maintenance/buildbot-wrangler.py reconfig /builds/buildbot/tests1-linux32/master
cltbld 15648 0.0 0.0 103252 840 pts/0 S+ 12:26 0:00 grep python
cltbld 17135 26.4 58.2 3015260 2173096 ? Sl Aug30 5864:05 /builds/buildbot/tests1-linux32/bin/python /builds/buildbot/tests1-linux32/bin/buildbot start /builds/buildbot/tests1-linux32/master
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ strace -p 4847
Process 4847 attached - interrupt to quit
select(0, NULL, NULL, NULL, {0, 464499}) = 0 (Timeout)
read(3, "", 8192) = 0
stat("/builds/buildbot/tests1-linux32/master/twistd.log", {st_mode=S_IFREG|0664, st_size=30765027, ...}) = 0
select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout)
read(3, "", 8192) = 0
stat("/builds/buildbot/tests1-linux32/master/twistd.log", {st_mode=S_IFREG|0664, st_size=30765027, ...}) = 0
select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout)
read(3, "", 8192) = 0
stat("/builds/buildbot/tests1-linux32/master/twistd.log", {st_mode=S_IFREG|0664, st_size=30765027, ...}) = 0
select(0, NULL, NULL, NULL, {0, 500000}^C <unfinished ...>
Process 4847 detached
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ ls -l /proc/4847/fd | grep 3
l-wx------ 1 cltbld cltbld 64 Sep 14 12:25 1 -> /builds/buildbot/tests1-linux32/reconfig.log
l-wx------ 1 cltbld cltbld 64 Sep 14 12:25 2 -> pipe:[8988493]
lr-x------ 1 cltbld cltbld 64 Sep 14 12:25 3 -> /builds/buildbot/tests1-linux32/master/twistd.log
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$
# ^ looks like it's still running and waiting on new log lines in twistd.log

# so let's check the buildbot proc itself and see why it's not logging
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ strace -p 17135
Process 17135 attached - interrupt to quit
recvfrom(131, ^C <unfinished ...>
Process 17135 detached
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ ls -l /proc/17135/fd | grep 131
lrwx------ 1 cltbld cltbld 64 Sep 14 12:28 131 -> socket:[8988995]
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ lsof |grep 8988995
buildbot 17135 cltbld 131u IPv4 8988995 0t0 TCP buildbot-master04.bb.releng.usw2.mozilla.com:48176->ec2-54-215-155-53.us-west-1.compute.amazonaws.com:http (ESTABLISHED)

# ^ so we are hung on some socket TCP connection and not doing anything else.
</pre>
In this case, we needed to hard restart the master because the buildbot process was hung. SIGTERM didn't work, so we had to kill -9 the buildbot process. It would also be a good idea to disable the master in slavealloc as soon as the problem appears so jobs can be allocated to other masters.
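A sketch of that hard restart (the PID and master path are taken from the ps output in the transcript above; substitute the real ones, and check the master is safe to restart first):

<pre>
kill 17135          # try a normal SIGTERM first
sleep 30
kill -9 17135       # hard kill if the process is still hung
/builds/buildbot/tests1-linux32/bin/buildbot start /builds/buildbot/tests1-linux32/master
</pre>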
==== Is there an unusual amount of buildbot lag? ====

There's a check in place that will alert in #buildduty when the master lag goes past certain limits. You can also look at the graph for buildbot lag: {{bug|1220191}}
==== Was the increase in pending jobs sudden and after a reconfig? ====

This may be due to bogus scheduling jobs being added. See {{bug|1223042}} for an example of how to fix this issue.
==== What code/systems have changed? ====
==== Has there been a network or systems event? ====

We get nagios alerts in #buildduty for the following:
* BGP flaps from AWS: this can affect connectivity between slaves and masters in AWS
* load on upload/stage: this can affect the download of artifacts for builds and tests, leading to retries and high pending counts
2. If no correlation can be found, let people know in #taskcluster about the spike in case it is not expected.
= Rebooting workers in batches =

When many workers are disconnected, e.g. after a network event, it is useful to be able to reboot many of them at one time. The various slave type subpages in slave health (e.g. [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-w732-ix t-w732-ix]) let you do this via batch actions.

Note that TaskCluster hardware workers have to be rebooted manually, depending on the type of machine: for gecko-t-win10-64 and gecko-t-linux-talos this has to be done via ILO, while gecko-t-osx-1010 can be rebooted via SSH.

Two batch actions are currently available:
# Reboot all broken workers - will reboot all workers that haven't reported a result in more than 6 hours
# Reboot workers that have not reported in # minutes - allows you to specify the time cut-off used for rebooting workers. This is sometimes useful when you have many workers (or even a whole pool) that are failing to connect after, e.g., a network event, and you don't want to wait for them all to idle for 6 hours.

'''Note:''' these actions don't care or check if the slave is currently running a job, only when the last job was run. As such, you *may* lose work-in-progress. However, if you're having issues across an entire pool, sometimes it is preferable to lose a few in-progress jobs to ensure the health of the larger pool.
These actions use slaveapi to perform the reboots. You could also do this by hand: create a text file containing the list of the workers you want to reboot (let's call it bad_workers.list), set the MY_LDAP_USER and MY_LDAP_PASSWORD environment variables to your LDAP credentials, make sure you are on the VPN, and then run:

<pre>
# Ask slaveapi to reboot each worker listed in bad_workers.list
cat bad_workers.list | \
while read slave; do \
  curl -u "${MY_LDAP_USER}:${MY_LDAP_PASSWORD}" \
    -dfoo=bar "https://secure.pub.build.mozilla.org/slaveapi/slaves/${slave}/actions/reboot"; \
done
</pre>

This should reboot the lot in one fell swoop.

Please note, if you need to call other Slave API actions, such as "shutdown" instead of reboot, see the API docs here: http://mozilla-slaveapi.readthedocs.org/en/latest/api/#endpoints
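For example, assuming the "shutdown" action follows the same calling convention as the reboot call above (check the API docs first; the worker name here is just an example), a single worker could be shut down with:

<pre>
curl -u "${MY_LDAP_USER}:${MY_LDAP_PASSWORD}" \
  -dfoo=bar "https://secure.pub.build.mozilla.org/slaveapi/slaves/t-w732-ix-001/actions/shutdown"
</pre>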
= See also =

https://mana.mozilla.org/wiki/display/NAGIOS/Backlog+Age