Dealing with high pending counts
Demand will sometimes outstrip supply in the slave pool. A high number of pending builds can indicate a problem with the scheduler, with (a set of) buildbot-masters, or with a particular pool of slaves (and hence possibly with puppet).
The number of pending builds is available in graphs and is also displayed per slave type in slave health. The graphs are helpful for noticing anomalous behavior. You will also see an alert in #buildduty, for example:
nagios-relengMon 17:00:34 PDT [4860] cruncher.srv.releng.scl3.mozilla.com:Pending builds is CRITICAL: CRITICAL Pending Builds: 7370 (http://m.mozilla.org/Pending+builds)
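For a quick command-line check of the current numbers (rather than waiting for the graphs to update), something like the sketch below works. It assumes buildapi still exposes pending jobs as JSON at the URL shown, keyed by branch and then revision, and that you have jq installed; verify the shape against the live output before trusting the totals.

# Sketch only: assumes buildapi serves pending jobs as JSON shaped like
# {"pending": {"<branch>": {"<revision>": [ ...jobs... ]}}} -- verify first.
curl -s "https://secure.pub.build.mozilla.org/buildapi/pending?format=json" \
    | jq -r '.pending | to_entries[] | "\(.key) \([.value[] | length] | add)"' \
    | sort -k2 -rn | head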
Here are some steps you can use to help figure out why it's happening:
What platforms are affected?
Some platforms, notably the talos tests for all OSes, have finite pools of hardware. Once all the machines are running jobs, any additional work will be queued up.
Spikes in the number of pending requests for Linux jobs that run on AWS instances (build, try, and test) can also occur. AWS instances are terminated when not required, and it can take a while (30 mins?) to spin up new instances to meet sudden demand.
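If you want to confirm that replacement capacity is actually coming up, you can count running instances with the AWS CLI. This is only a sketch: it assumes you have AWS credentials for the releng account configured and that instances carry a "moz-type" tag naming the pool (both assumptions); adjust the filters to match reality.

# Count running instances in one (assumed) pool. The "moz-type" tag and the
# "bld-linux64-spot" value are assumptions -- substitute whatever tags the
# instances really carry.
aws ec2 describe-instances \
    --filters "Name=instance-state-name,Values=running" \
              "Name=tag:moz-type,Values=bld-linux64-spot" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text | wc -w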
Where is the load coming from?
Did nightlies just get triggered? Did nightlies just trigger dependent l10n jobs?
These are predictable daily sources of spiky load.
Are the pending jobs in jacuzzis?
Lots of pending jobs in a given jacuzzi is generally fine. That's what jacuzzis are for: to make sure a single job type doesn't soak up capacity from the entire pool. If there is an anomalously high number of pending jobs for a single jacuzzi, it's best to look for signs of abuse (below).
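You can also look at the allocations themselves to see which slaves are dedicated to the builder in question. The URL and response shape below are assumptions about the jacuzzi allocator service, so treat this as a sketch and confirm against whatever the allocator actually serves.

# Hypothetical: assumes the jacuzzi allocator serves JSON listing allocated
# machines per builder at this URL -- verify before relying on it.
curl -s "http://jacuzzi-allocator.pub.build.mozilla.org/v1/builders/" | jq .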
Did the trees just open following a closure?
There is usually a big pulse of activity right after a tree closure as developers start landing code again.
Is someone abusing the try server?
Occasionally someone will (re)trigger a large number of jobs on a single changeset. People who do this often and with good reason (e.g. jmaher) usually do so on weekends when there is less contention for the infrastructure. If someone does this mid-week, it's best to find them on IRC and figure out why they've done this. You may need to cancel some/all of their extra jobs if it's impacting other developers.
There is no designated tool for finding this type of abuse. When looking for abuse, I check which branch the pending jobs are coming from (usually try), and then walk back through the treeherder history for that branch looking for revisions with multiple jobs triggered.
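One rough way to automate that walk is to pull recent try pushes from the Treeherder API and count jobs per push, so a push with hundreds of (re)triggered jobs stands out. The endpoint names and fields below are assumptions; check the current Treeherder API docs before relying on them.

# Sketch: list recent try pushes with a job count for each. The
# /resultset/ and /jobs/ endpoints and their fields are assumptions.
TH="https://treeherder.mozilla.org/api/project/try"
for id in $(curl -s "${TH}/resultset/?count=20" | jq -r '.results[].id'); do
    count=$(curl -s "${TH}/jobs/?result_set_id=${id}&count=2000" | jq '.results | length')
    echo "push ${id}: ${count} jobs"
done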
What code/systems have changed?
Figure out how long the backlog has been building, and then check the Maintenance wiki to see if code has recently landed that would be affecting that platform.
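For the build/* repositories specifically, the hg.mozilla.org pushlog is a quick way to see what landed recently. json-pushes is the standard pushlog JSON endpoint; adjust the repository list to taste.

# Show the commit messages from the most recent pushes to the build repos.
for repo in buildbot-configs buildbotcustom puppet; do
    echo "=== ${repo} ==="
    curl -s "https://hg.mozilla.org/build/${repo}/json-pushes?full=1" \
        | jq -r '.[].changesets[].desc' | head -5
done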
Other sources to check are covered in the sections below.
Has there been a network or systems event?
We get nagios alerts in #buildduty for the following:
- BGP flaps from AWS: this can affect connectivity between slaves and masters in AWS
- load on upload/stage: this can affect the download of artifacts for builds and tests, leading to retries and high pending counts
If there are no alerts, it is worth asking in #MOC and/or #infra to see if IT is tracking any events not currently on our nagios radar.
Rebooting slaves in batches
When many slaves are disconnected, e.g. after a network event, it is useful to be able to reboot many of them at one time. The various slave type subpages in slave health (e.g. t-w732-ix) let you do this via batch actions.
Two batch actions are currently available:
- Reboot all broken slaves - will reboot all slaves that haven't reported a result in more than 6 hours
- Reboot slaves that have not reported in # minutes - allows you to specify the time cut-off used for rebooting slaves. This is sometimes useful when you have many slaves (or even a whole pool) that are failing to connect after, e.g., a network event, and you don't want to wait for them all to idle for 6 hours.
Note: these actions don't check whether the slave is currently running a job, only when its last job was run. As such, you *may* lose work in progress. However, if you're having issues across an entire pool, it is sometimes preferable to lose a few in-progress jobs to ensure the health of the larger pool.
These actions use slaveapi to perform the reboots. You can also do this by hand: create a text file containing the list of slaves you want to reboot (let's call it naughty_slaves.list), set the MY_LDAP_USER and MY_LDAP_PASSWORD environment variables to your LDAP credentials, make sure you are on the VPN, and then run:
cat naughty_slaves.list | while read slave; do
    curl -u "${MY_LDAP_USER}:${MY_LDAP_PASSWORD}" \
        -dfoo=bar "https://secure.pub.build.mozilla.org/slaveapi/slaves/${slave}/actions/reboot"
done
This should reboot the lot in one fell swoop.
Please note: if you need to call other Slave API actions, such as "shutdown" instead of "reboot", see the API docs here: http://mozilla-slaveapi.readthedocs.org/en/latest/api/#endpoints
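For example, based on the URL pattern used for reboots above, a shutdown would presumably be requested like this (confirm the exact action name and parameters in the linked docs before using it):

# Presumed to follow the same pattern as the reboot action above; ${SLAVE}
# is a placeholder for the slave name. Check the slaveapi docs first.
curl -u "${MY_LDAP_USER}:${MY_LDAP_PASSWORD}" \
    -dfoo=bar "https://secure.pub.build.mozilla.org/slaveapi/slaves/${SLAVE}/actions/shutdown"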