CIDuty/How To/High Pending Counts
Dealing with high pending counts
Demand will sometimes outstrip supply in the slave pool. A high number of pending builds can indicate a problem with the scheduler, with (a set of) buildbot-masters, or with a particular pool of slaves (and hence possibly with puppet).
The number of pending builds is available in graphs and is also displayed per slave type in slave health. The graphs are helpful for noticing anomalous behavior. You will also see an alert in #buildduty about the high number of pending jobs, for example:
nagios-releng Mon 17:00:34 PDT [4860] cruncher.srv.releng.scl3.mozilla.com:Pending builds is CRITICAL: CRITICAL Pending Builds: 7370 (http://m.mozilla.org/Pending+builds)
Here are some steps you can use to help figure out why it's happening:
What platforms are affected?
Some platforms, notably the talos test pools for all OSes, have finite pools of hardware. Once all the machines are running jobs, any additional work is queued up.
Spikes in the number of pending requests for Linux jobs that run on AWS instances (build, try, and test) can also occur. AWS instances are terminated when not required, and it can take a while (30 mins?) to spin up new instances to meet sudden demand.
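To get a rough idea of whether new Linux capacity is actually coming up, you can count the running EC2 instances for a pool. A minimal sketch using the AWS CLI, assuming it is installed and configured with releng credentials; the region and the moz-type tag value are examples and may not match the real tagging scheme:

# count running instances for one (assumed) pool tag
aws ec2 describe-instances --region us-west-2 \
    --filters "Name=instance-state-name,Values=running" \
              "Name=tag:moz-type,Values=tst-linux64" \
    --query 'length(Reservations[].Instances[])'

Re-running this every few minutes during a spike shows whether the pool is growing or stuck.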
Where is the load coming from?
Did nightlies just get triggered? Did nightlies just trigger dependent l10n jobs?
These are predictable daily sources of spiky load.
Are the pending jobs in jacuzzis?
Lots of pending jobs in a given jacuzzi is generally fine. That's what jacuzzis are for: to make sure a single job type doesn't soak up capacity from the entire pool. If there is an anomalously high number of pending jobs for a single jacuzzi, it's best to look for signs of abuse (below).
Did the trees just open following a closure?
There is usually a big pulse of activity right after a tree closure as developers start landing code again.
Is someone abusing the try server?
Occasionally someone will (re)trigger a large number of jobs on a single changeset. People who do this often and with good reason (e.g. jmaher) usually do so on weekends when there is less contention for the infrastructure. If someone does this mid-week, it's best to find them on IRC and figure out why they've done it. You may need to cancel some or all of their extra jobs if it's impacting other developers.
There is no designated tool for finding these types of abuse. When looking for abuse, check which branch the pending jobs are coming from (usually try), and then walk back through the treeherder history for that branch looking for revisions with multiple jobs triggered.
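Since there is no designated tool, a rough script can speed up the manual walk through treeherder. A hedged sketch using Treeherder's public REST API; the endpoint paths and parameters below are a best guess and may differ between Treeherder versions:

# count the jobs on each of the 20 most recent try pushes (assumed endpoints)
TH="https://treeherder.mozilla.org/api/project/try"
for push_id in $(curl -s "$TH/push/?count=20" \
    | python -c 'import json,sys; print "\n".join(str(p["id"]) for p in json.load(sys.stdin)["results"])'); do
  njobs=$(curl -s "$TH/jobs/?push_id=$push_id&count=2000" \
    | python -c 'import json,sys; print len(json.load(sys.stdin)["results"])')
  echo "push $push_id: $njobs jobs"
done

A push with a dramatically larger job count than its neighbours is worth a closer look.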
Are infrastructure problems causing retries?
For example, builds will retry if they cannot fetch a package, or if they cannot upload their resulting binaries. In one case, IT implemented a DNS redirect to a server where we didn't have ssh keys to upload the resulting binaries (see bug 1198296): builds failed, retried, and pending counts rose.
Is coalescing working?
We have SETA configured to coalesce (run certain test jobs less often). If this breaks, we will see a spike in load and high pending counts. There is a bug open to fix this (bug 1199347). To see if this is a problem, tail /builds/buildbot/tests_scheduler/master/twistd.log on buildbot-master81 and ensure there are lines indicating that jobs are being skipped on mozilla-inbound and fx-team, the two branches where SETA is currently enabled. For example:
[kmoir@buildbot-master81.bb.releng.scl3.mozilla.com master]$ tail -f twistd.log
2015-09-10 08:17:54-0700 [-] tests-mozilla-inbound-ubuntu32_vm-opt-unittest-7-3600: skipping with 4/7 important changes since only 81/3600s have elapsed
2015-09-10 08:17:54-0700 [-] tests-mozilla-inbound-snowleopard-debug-unittest-7-3600: skipping with 3/7 important changes since only 2190/3600s have elapsed
As an aside, another reason the logs may not show tests being skipped is that both mozilla-inbound and fx-team have been closed for a while. In that case, no jobs are being skipped because no jobs are being scheduled.
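A quick, minimal way to sanity-check that coalescing is active, run on buildbot-master81 (the log path and log lines are the ones shown above):

cd /builds/buildbot/tests_scheduler/master
# how many skip decisions are in the current log?
grep -c "skipping with" twistd.log
# watch live, limited to the branches where SETA is enabled
tail -f twistd.log | grep -E "mozilla-inbound|fx-team"

If the count stays at zero while both branches are open and busy, coalescing is probably broken.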
Are new AWS instances starting and running buildbot?
If new instances are not being created, there will be an alert in #buildduty regarding aws_watch_pending.log not being updated. A common cause is a typo in configs/watch_pending.cfg. Look at the logs on the aws manager instance (/var/log/messages); there should be an error message about the typo in the json file. We shouldn't really get to that point because there are tests to verify this, but sometimes it happens; see https://bugzilla.mozilla.org/show_bug.cgi?id=1195893#c8 for an example. If AWS instances are starting, ssh to an instance that has recently started and look at /var/log/runner.log for errors. Does /builds/slave/twistd.log indicate that builds are completing on this machine?
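A minimal sketch of those checks, using the paths mentioned above:

# on the aws manager instance: look for errors about watch_pending.cfg
grep -i watch_pending /var/log/messages | tail -20

# on a recently started instance: check runner and buildbot activity
tail -50 /var/log/runner.log
grep -i error /var/log/runner.log | tail -20
tail -20 /builds/slave/twistd.log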
Is there a problem with the AMI golden master?
Each night, we create new AMIs for Amazon instances from our puppet configs. Once an AMI is ready, all new instances are created with that image. If there is a problem with the image, it has to be corrected and new AMIs generated. If the image is broken to the extent that it should be pulled, you can deregister the AMI in the Amazon console so the previous night's AMI is used instead. To quickly bring down the instances that were launched with the problem AMI, you can use this script on aws-manager2.srv.releng.scl3.mozilla.com:
cd /builds/aws_manager/cloud-tools
python scripts/aws_terminate_by_ami_id.py <your ami id>
For an example of this problem, see bug 1203104
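If you prefer the command line to the Amazon console, the AMI can also be found and deregistered with the AWS CLI. A hedged sketch, assuming the CLI is installed and configured with releng credentials; the name filter is a placeholder:

# find the problem AMI (and confirm its creation date)
aws ec2 describe-images --owners self \
    --filters "Name=name,Values=*<name fragment>*" \
    --query 'Images[].[ImageId,Name,CreationDate]' --output table

# pull it so new launches fall back to the previous night's AMI
aws ec2 deregister-image --image-id <problem ami id>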
Are we underbidding for AWS spot instances?
We use AWS spot instances for a large proportion of our continuous integration farm's capacity. We have an algorithm that bids for the different instance types within a range of prices. The prices are defined in https://github.com/mozilla/build-cloud-tools/blob/master/configs/watch_pending.cfg#L50 and the algorithm is in https://github.com/mozilla/build-cloud-tools/blob/master/cloudtools/aws/spot.py. If we are underbidding relative to the current spot prices, we won't get any new AWS instances and pending counts will go up.
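To compare our configured bids against the market, you can pull recent spot prices for an instance type we use. A hedged sketch with the AWS CLI; the region and instance type are examples, so adjust them to the pool you are investigating:

# current spot prices per availability zone for one instance type
aws ec2 describe-spot-price-history --region us-west-2 \
    --instance-types m3.xlarge \
    --product-descriptions "Linux/UNIX" \
    --start-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --query 'SpotPriceHistory[].[AvailabilityZone,SpotPrice,Timestamp]' \
    --output table

If the listed prices sit above the maximums in watch_pending.cfg, our bids are losing and pending counts will keep climbing.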
Are ssh keys a problem on the masters?
See bug 1198332 for an example of buildbot ssh keys causing problems on the masters.
Are we unable to connect to AWS due to network issues?
Chaos, really. We depend heavily on AWS, so network problems reaching it will quickly show up as rising pending counts.
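A couple of quick, hedged checks from a releng host to see whether we can resolve and reach the EC2 API at all (the region endpoint is just an example):

# does DNS resolution for the EC2 endpoint work?
dig +short ec2.us-west-2.amazonaws.com
# can we complete an HTTPS request to it? any HTTP status line is fine here,
# we only care that the connection succeeds
curl -sI https://ec2.us-west-2.amazonaws.com/ | head -1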
Are there problems connecting to the Buildbot database?
If we cannot connect to the database due to network or other issues, the pending count will probably not increase; it will just stay the same, because jobs aren't deleted from the database as they complete.
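A hedged sketch for checking database reachability from a master. The config location varies and the host/port below are placeholders, not the real values:

# find the db_url the master is configured with (location may vary)
grep -r "db_url" /builds/buildbot/*/master/ 2>/dev/null | head -5
# then test basic TCP connectivity to the host it points at
nc -zv <buildbot-db-host> 3306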
Are jobs taking longer today than they did yesterday?
Check end-to-end times here: https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/#/dashboard/db/kpis
Did a reconfig not complete successfully on a master?
For example, an alert that pending builds are increasing, together with a stale reconfig lockfile, may indicate that a master is in a hung state and not running jobs. In this case, it was an AWS master:
Mon 11:58:31 PDT [4448] cruncher.srv.releng.scl3.mozilla.com:Pending builds is WARNING: WARNING Pending Builds: 4697 (http://m.mozilla.org/Pending+builds)
relengbot [sns alert] Mon 12:09:06 PDT buildbot-master04.bb.releng.usw2.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
Looking at the log, it hadn't been updated for several hours:
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ tail !$
tail master/twistd.log
2015-09-14 09:00:49-0700 [-] prioritizeBuilders: 0.17s builders with slaves: 0
2015-09-14 09:00:49-0700 [-] prioritizeBuilders: 0.17s prioritized 0 builder(s): []
2015-09-14 09:00:49-0700 [-] prioritizeBuilders: 0.17s using up to first 0 builders:
2015-09-14 09:00:49-0700 [-] prioritizeBuilders: 0.17s found 0 important builder(s): []
2015-09-14 09:00:49-0700 [-] prioritizeBuilders: 0.17s finished prioritization
2015-09-14 09:00:50-0700 [-] Creating BuildMaster -- buildbot.version: 0.8.2-hg-41f0fbee10f4-production-0.8
2015-09-14 09:00:50-0700 [-] loading configuration from /builds/buildbot/tests1-linux32/master/master.cfg
2015-09-14 09:00:50-0700 [-] JacuzziAllocator 1268988944: created
2015-09-14 09:00:50-0700 [-] nextAWSSlave: start
2015-09-14 09:00:50-0700 [-] nextAWSSlave: start
Use strace to look at the hung process:
# checking the reconfig proc
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ ls -l /proc/4847/fd | grep 3
l-wx------ 1 cltbld cltbld 64 Sep 14 12:25 1 -> /builds/buildbot/tests1-linux32/reconfig.log
l-wx------ 1 cltbld cltbld 64 Sep 14 12:25 2 -> pipe:[8988493]
lr-x------ 1 cltbld cltbld 64 Sep 14 12:25 3 -> /builds/buildbot/tests1-linux32/master/twistd.log
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ ps auwx | grep python
cltbld 1045 0.0 0.1 196880 5256 ? S Aug29 7:05 /builds/buildbot/queue/bin/python /builds/buildbot/queue/tools/buildbot-helpers/command_runner.py -q /dev/shm/queue/commands -r 10 -t 60 -m 60 -j4
cltbld 1057 0.0 0.7 226360 26316 ? S Aug29 10:39 /builds/buildbot/queue/bin/python /builds/buildbot/queue/tools/buildbot-helpers/pulse_publisher.py --passwords /builds/buildbot/queue/passwords.py -q /dev/shm/queue/pulse
cltbld 4847 0.0 0.2 176936 9292 ? S 09:00 0:02 python ./tools/buildfarm/maintenance/buildbot-wrangler.py reconfig /builds/buildbot/tests1-linux32/master
cltbld 15648 0.0 0.0 103252 840 pts/0 S+ 12:26 0:00 grep python
cltbld 17135 26.4 58.2 3015260 2173096 ? Sl Aug30 5864:05 /builds/buildbot/tests1-linux32/bin/python /builds/buildbot/tests1-linux32/bin/buildbot start /builds/buildbot/tests1-linux32/master
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ strace -p 4847
Process 4847 attached - interrupt to quit
select(0, NULL, NULL, NULL, {0, 464499}) = 0 (Timeout)
read(3, "", 8192) = 0
stat("/builds/buildbot/tests1-linux32/master/twistd.log", {st_mode=S_IFREG|0664, st_size=30765027, ...}) = 0
select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout)
read(3, "", 8192) = 0
stat("/builds/buildbot/tests1-linux32/master/twistd.log", {st_mode=S_IFREG|0664, st_size=30765027, ...}) = 0
select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout)
read(3, "", 8192) = 0
stat("/builds/buildbot/tests1-linux32/master/twistd.log", {st_mode=S_IFREG|0664, st_size=30765027, ...}) = 0
select(0, NULL, NULL, NULL, {0, 500000}^C <unfinished ...>
Process 4847 detached
# ^ looks like it's still running and waiting on new log lines in twistd.log
# so let's check the buildbot proc itself and see why it's not logging
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ strace -p 17135
Process 17135 attached - interrupt to quit
recvfrom(131, ^C <unfinished ...>
Process 17135 detached
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ ls -l /proc/17135/fd | grep 131
lrwx------ 1 cltbld cltbld 64 Sep 14 12:28 131 -> socket:[8988995]
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ lsof | grep 8988995
buildbot 17135 cltbld 131u IPv4 8988995 0t0 TCP buildbot-master04.bb.releng.usw2.mozilla.com:48176->ec2-54-215-155-53.us-west-1.compute.amazonaws.com:http (ESTABLISHED)
# ^ so we are hung on some socket TCP connection and not doing anything else.
In this case, we needed to hard-restart the master because the buildbot process was hung. SIGTERM didn't work, so we had to kill -9 the buildbot process. It would also have been a good idea to disable this master in slavealloc as soon as the problem appeared, so that jobs could be allocated to other masters.
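A hedged sketch of that hard restart, using the paths and pid from the transcript above (adjust them to the master that is actually hung):

BASE=/builds/buildbot/tests1-linux32
# try a clean stop first (sends SIGTERM and waits)
$BASE/bin/buildbot stop $BASE/master
# if it stays hung, kill it outright; 17135 is the buildbot pid from ps above
kill -9 17135
$BASE/bin/buildbot start $BASE/master

Remember to re-enable the master in slavealloc afterwards if you disabled it.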
Is there an unusual amount of buildbot lag?
Look at the graph for buildbot lag (bug 1220191).
What code/systems have changed?
Figure out how long the backlog has been building, and then check the Maintenance wiki to see if code has recently landed that would be affecting that platform.
Other sources to check include:
Has there been a network or systems event?
We get nagios alerts in #buildduty for the following:
- BGP flaps from AWS: this can affect connectivity between slaves and masters in AWS
- load on upload/stage: this can affect the download of artifacts for builds and tests, leading to retries and high pending counts
If there are no alerts, it is worth asking in #MOC and/or #infra to see if IT is tracking any events not currently on our nagios radar.
Rebooting slaves in batches
When many slaves are disconnected, e.g. after a network event, it is useful to be able to reboot many of them at one time. The various slave type subpages in slave health (e.g. t-w732-ix) let you do this via batch actions.
Two batch actions are currently available:
- Reboot all broken slaves - will reboot all slaves that haven't reported a result in more than 6 hours
- Reboot slaves that have not reported in # minutes - allows you to specify the time cut-off used for rebooting slaves. This is sometimes useful when you have many slaves (or even a whole pool) that are failing to connect after, e.g., a network event, and you don't want to wait for them all to idle for 6 hours.
Note: these actions don't care or check if the slave is currently running a job, only when the last job was run. As such, you *may* lose work-in-progress. However, if you're having issues across an entire pool, sometimes it is preferable to lose a few in-progress jobs to ensure the health of the larger pool.
These actions use slaveapi to perform the reboots. You could also do this by hand: create a text file containing the list of slaves you want to reboot (let's call it naughty_slaves.list), set the MY_LDAP_USER and MY_LDAP_PASSWORD environment variables to your LDAP credentials, make sure you are on the VPN, and then run:
cat naughty_slaves.list | \
while read slave; do \
    curl -u "${MY_LDAP_USER}:${MY_LDAP_PASSWORD}" \
        -dfoo=bar "https://secure.pub.build.mozilla.org/slaveapi/slaves/${slave}/actions/reboot"; \
done
This should reboot the lot in one fell swoop.
Please note, if you need to call other Slave API actions, such as "shutdown" instead of reboot, see the API docs here: http://mozilla-slaveapi.readthedocs.org/en/latest/api/#endpoints
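For example, the shutdown action can be called with the same loop as above by swapping the endpoint name; this mirrors the reboot call and assumes the endpoint pattern shown in the API docs:

cat naughty_slaves.list | \
while read slave; do \
    curl -u "${MY_LDAP_USER}:${MY_LDAP_PASSWORD}" \
        -dfoo=bar "https://secure.pub.build.mozilla.org/slaveapi/slaves/${slave}/actions/shutdown"; \
done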
Was the increase in pending jobs sudden and after a reconfig?
This may be due to bogus scheduling jobs being added. See bug 1223042 for an example of how to fix this issue.