= What to do in case of high pending tests under an AWS worker pool =
Sometimes an AWS worker pool gets overloaded with tests, or we simply don't have enough workers in a specific pool. If this happens you will see an alert such as: <br />
<pre>nagios1.private.releng.mdc1.mozilla.com:Pending tests is CRITICAL: CRITICAL Pending tests: 3589 on gecko-t-linux-xlarge.</pre>
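If you need to pull the pending count and the pool name out of such an alert (for example to paste into #ci), a minimal sketch could look like the following. The regular expression is inferred only from the single sample alert above, so other alert variants may not match; the function name <code>parse_pending_alert</code> is made up for this example.

```python
import re

# Pattern inferred from the one sample alert on this page (an assumption);
# real alerts may come in other shapes that this does not match.
ALERT_RE = re.compile(r"Pending tests: (?P<count>\d+) on (?P<pool>[\w-]+)")


def parse_pending_alert(line):
    """Return (pending_count, worker_pool), or None if the line does not match."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    return int(match.group("count")), match.group("pool")


alert = (
    "nagios1.private.releng.mdc1.mozilla.com:Pending tests is CRITICAL: "
    "CRITICAL Pending tests: 3589 on gecko-t-linux-xlarge."
)
print(parse_pending_alert(alert))  # (3589, 'gecko-t-linux-xlarge')
```

Returning <code>None</code> on a non-matching line, rather than raising, lets the caller skip unrelated messages.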
When this alert fires, the first step is to check whether we are getting outbid. You can see this [https://tools.taskcluster.net/aws-provisioner/recent-errors here].
Look for the number of InsufficientInstanceCapacity errors belonging to the affected pool.
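If you have copied the recent errors into a list of records, the per-pool count can be tallied as in the sketch below. The record layout (the <code>workerType</code> and <code>code</code> keys) and the sample data are hypothetical and made up for illustration; they are not the documented output of the provisioner page.

```python
from collections import Counter


def count_capacity_errors(errors, code="InsufficientInstanceCapacity"):
    """Count errors with the given code, grouped by worker type.

    `errors` is a list of dicts; the "workerType" and "code" keys are
    assumed field names, not a documented schema.
    """
    return Counter(e["workerType"] for e in errors if e.get("code") == code)


# Hypothetical records, invented purely for illustration.
sample_errors = [
    {"workerType": "gecko-t-linux-xlarge", "code": "InsufficientInstanceCapacity"},
    {"workerType": "gecko-t-linux-xlarge", "code": "InsufficientInstanceCapacity"},
    {"workerType": "example-pool", "code": "SomeOtherError"},
]

counts = count_capacity_errors(sample_errors)
print(counts["gecko-t-linux-xlarge"])  # 2
```

A pool that is absent from the result simply has a count of zero, because <code>Counter</code> returns 0 for missing keys.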
The next step is to check [https://papertrailapp.com/ papertrail].
You can filter the logs by worker type.
= Escalation =
It is always a good idea to let people know about the queue in #ci before starting on the steps above.
If we are simply missing workers, or the number of jobs keeps piling up, escalate to the sheriffs so they can close the trees until the queues go down, and notify #ci that the trees are closed because of InsufficientInstanceCapacity.
Sometimes the problem is not easy to find, so the next step is to ping people on IRC/Slack:
For the EU time zone we have pmoore <br />
For the US time zone we have bstack and wcosta.
You can also check the escalation path [https://docs.google.com/spreadsheets/d/1pUFq6Z5M5a1ydbSzxNjQivFfryVoksdXa9xXTg9gtzc/edit#gid=0 here].