CIDuty/How To/High Pending Counts

__TOC__

= Dealing with high pending counts =
Demand will sometimes outstrip supply in the slave pool. Here are some steps you can use to help figure out why it's happening:
== What platforms are affected? ==
Some platforms, notably talos tests for all OSes, have finite pools of hardware. Once all the machines are running jobs, any other work will be queued up.
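
To see which pools are actually saturated, it helps to count pending requests per platform. Below is a rough Python sketch; the builds-pending.js URL and its JSON layout (pending jobs grouped by branch and then revision, each entry carrying a buildername) are assumptions, so adjust it to whatever pending-jobs feed you actually use.

<pre>
# Rough sketch: count pending requests per builder-name prefix.
# Assumes a builds-pending.js feed exists at this URL and groups
# pending jobs by branch -> revision -> list of {"buildername": ...}.
import collections
import requests

PENDING_URL = "https://secure.pub.build.mozilla.org/builddata/buildjson/builds-pending.js"

def pending_by_prefix(prefix_len=3):
    data = requests.get(PENDING_URL, timeout=60).json()
    counts = collections.Counter()
    for branch, revisions in data.get("pending", {}).items():
        for rev, jobs in revisions.items():
            for job in jobs:
                name = job.get("buildername", "unknown")
                # Group by the first few words of the builder name, which
                # usually encode the platform (e.g. "Windows 7 32-bit").
                counts[" ".join(name.split()[:prefix_len])] += 1
    return counts

if __name__ == "__main__":
    for name, count in pending_by_prefix().most_common(20):
        print("%5d  %s" % (count, name))
</pre>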
 
Spikes in the number of pending requests for Linux jobs that run on AWS instances (build, try, and test) can also occur. AWS instances are terminated when not required, and it can take a while (roughly 30 minutes) to spin up new instances to meet sudden demand.
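
If the backlog is on AWS platforms, it is worth checking whether replacement instances are still spinning up. The following sketch uses boto3 to count pending/running instances per tag; the regions listed and the moz-type tag name are assumptions about how the pool is labelled, and it assumes working AWS credentials.

<pre>
# Rough sketch: count AWS instances per type tag to see whether new
# capacity is still spinning up. Assumes standard AWS credentials and
# that instances carry a "moz-type" tag (adjust to the tags you use).
import collections
import boto3

REGIONS = ["us-east-1", "us-west-2"]  # assumption: the regions the pool uses

def instance_counts():
    counts = collections.Counter()
    for region in REGIONS:
        ec2 = boto3.client("ec2", region_name=region)
        pages = ec2.get_paginator("describe_instances").paginate(
            Filters=[{"Name": "instance-state-name", "Values": ["pending", "running"]}]
        )
        for page in pages:
            for reservation in page["Reservations"]:
                for inst in reservation["Instances"]:
                    tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
                    counts[(tags.get("moz-type", "untagged"), inst["State"]["Name"])] += 1
    return counts

if __name__ == "__main__":
    for (moz_type, state), count in sorted(instance_counts().items()):
        print("%5d  %-10s %s" % (count, state, moz_type))
</pre>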
 
== Where is the load coming from? ==
 
=== Did nightlies just get triggered? Did nightlies just trigger dependent l10n jobs? ===
 
These are predictable daily sources of spiky load.
 
=== Are the pending jobs in [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/jacuzzis.html jacuzzis]? ===
Lots of pending jobs in a given jacuzzi is generally fine. That's what [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/jacuzzis.html jacuzzis] are for: to make sure a single job type doesn't soak up capacity from the entire pool. If there is an anomalously high pending count for a single jacuzzi, it's best to look for signs of abuse (below).
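
To tell whether a jacuzzi backlog is just expected queueing or a real problem, compare the pending count against the size of the builder's allocation. The sketch below assumes the jacuzzi allocator exposes per-builder allocations as JSON with a machines list at the URL shown; the URL, the response shape, and the example builder name are all assumptions.

<pre>
# Rough sketch: look up the jacuzzi allocation for a builder so you can
# compare it against that builder's pending count. URL and response
# shape ({"machines": [...]}) are assumptions; adjust to the real service.
import requests

JACUZZI_URL = "http://jacuzzi-allocator.pub.build.mozilla.org/v1/builders/"

def allocation_size(buildername):
    resp = requests.get(JACUZZI_URL + buildername, timeout=60)
    if resp.status_code == 404:
        return None  # builder is not in a jacuzzi
    return len(resp.json().get("machines", []))

if __name__ == "__main__":
    builder = "Linux x86-64 mozilla-inbound build"  # hypothetical builder name
    size = allocation_size(builder)
    print("%s: %s" % (builder, "%d allocated slaves" % size if size is not None else "no jacuzzi"))
</pre>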
 
=== Did the trees just open following a closure? ===
There is usually a big pulse of activity right after a tree closure as developers start landing code again.
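
A quick way to confirm this is to look at when the tree last reopened. The sketch below queries TreeStatus; the URL layout, the format=json parameter, and the response fields are assumptions about that API, so verify them before relying on the output.

<pre>
# Rough sketch: check a tree's current status and recent status changes via
# TreeStatus, to see whether a recent reopening explains a load spike.
# The URL format and JSON fields here are assumptions; check the API first.
import requests

TREESTATUS = "https://treestatus.mozilla.org/%s"

def recent_status(tree="mozilla-inbound"):
    current = requests.get(TREESTATUS % tree, params={"format": "json"}, timeout=60).json()
    logs = requests.get(TREESTATUS % tree + "/logs", params={"format": "json"}, timeout=60).json()
    if isinstance(logs, dict):          # assumption: logs may be wrapped in a dict
        logs = logs.get("logs", [])
    return current, logs

if __name__ == "__main__":
    current, logs = recent_status()
    print("current status:", current.get("status"))
    for entry in logs[:5]:
        print(entry)
</pre>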
 
=== Is someone abusing the try server? ===
Occasionally someone will (re)trigger a large number of jobs on a single changeset. People who do this often and with good reason (e.g. jmaher) usually do this on weekends when there is less contention for the infrastructure. If someone does this mid-week, it's best to find them on IRC and figure out why they've done this. You may need to cancel some/all of their extra jobs if it's impacting other developers.
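
Heavy retriggering usually shows up as one or two try revisions holding a disproportionate share of the pending queue. The sketch below groups pending try jobs by revision, reusing the same (assumed) builds-pending.js feed described above.

<pre>
# Rough sketch: find the try pushes with the most pending requests, which is
# usually where excessive (re)triggering shows up. Reuses the assumed
# builds-pending.js feed and its branch -> revision -> jobs layout.
import collections
import requests

PENDING_URL = "https://secure.pub.build.mozilla.org/builddata/buildjson/builds-pending.js"

def busiest_try_pushes(top=10):
    data = requests.get(PENDING_URL, timeout=60).json()
    per_rev = collections.Counter()
    for rev, jobs in data.get("pending", {}).get("try", {}).items():
        per_rev[rev] = len(jobs)
    return per_rev.most_common(top)

if __name__ == "__main__":
    for rev, count in busiest_try_pushes():
        print("%5d pending  https://hg.mozilla.org/try/rev/%s" % (count, rev))
</pre>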
 
== What code/systems have changed? ==
Figure out how long the backlog has been building, and then check the [[ReleaseEngineering/Maintenance|Maintenance wiki]] to see if code has recently landed that would be affecting that platform.
 
Other sources to check include (a quick way to scan their recent changes is sketched after this list):
* [https://hg.mozilla.org/build/puppet puppet]
* [https://github.com/mozilla/build-runner runner]
* ???
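
The sketch below lists recent landings on build/puppet (via the hg pushlog's json-pushes endpoint) and on build-runner (via the GitHub commits API). The full=1 and startdate pushlog parameters and the exact response fields are assumptions, so double-check them against the pushlog documentation.

<pre>
# Rough sketch: list recent landings on build/puppet and build-runner to see
# what changed around the time the backlog started. The json-pushes
# parameters (full=1, startdate) are assumptions about the pushlog API.
import datetime
import requests

def recent_puppet_pushes(days=2):
    start = (datetime.datetime.utcnow() - datetime.timedelta(days=days)).strftime("%Y-%m-%d")
    url = "https://hg.mozilla.org/build/puppet/json-pushes"
    pushes = requests.get(url, params={"full": 1, "startdate": start}, timeout=60).json()
    for push_id in sorted(pushes, key=int):
        push = pushes[push_id]
        for cset in push["changesets"]:
            print("puppet  %s  %s" % (push["user"], cset["desc"].splitlines()[0]))

def recent_runner_commits(days=2):
    since = (datetime.datetime.utcnow() - datetime.timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%SZ")
    url = "https://api.github.com/repos/mozilla/build-runner/commits"
    for commit in requests.get(url, params={"since": since}, timeout=60).json():
        print("runner  %s  %s" % (commit["commit"]["author"]["date"],
                                  commit["commit"]["message"].splitlines()[0]))

if __name__ == "__main__":
    recent_puppet_pushes()
    recent_runner_commits()
</pre>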
 
== Has there been a network or systems event? ==
We get nagios alerts in #buildduty for the following:
* BGP flaps from AWS: this can affect connectivity between slaves and masters in AWS
* load on upload/stage: this can affect the download of artifacts for builds and tests, leading to retries and high pending counts
 
If there are no alerts, it is worth asking in #MOC and/or #infra to see if IT is tracking any events not currently on our nagios radar.


= Rebooting slaves in batches =