CIDuty/How To/AWS Pending Test

= What to do in case of high pending tests under an AWS worker pool =
Sometimes an AWS worker pool gets overloaded with tests, or we simply don't have enough workers in a specific pool. If this happens, you will see an alert such as: <br />
<pre>nagios1.private.releng.mdc1.mozilla.com:Pending tests is CRITICAL: CRITICAL Pending tests: 3589 on gecko-t-linux-xlarge.</pre>
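To triage alerts like this quickly, the pool name and pending count can be pulled out of the alert text. The following is a minimal sketch in Python, assuming every alert follows the format of the sample above; the function name <code>parse_pending_alert</code> and the regular expression are illustrative and not part of any existing CIDuty tooling.

```python
import re

# Pattern derived from the single sample alert above (an assumption);
# alerts with a different wording will not match.
ALERT_RE = re.compile(r"Pending tests: (?P<count>\d+) on (?P<pool>[\w-]+)")


def parse_pending_alert(line):
    """Return (pool, pending_count), or None if the line does not match."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    return match.group("pool"), int(match.group("count"))


alert = (
    "nagios1.private.releng.mdc1.mozilla.com:Pending tests is CRITICAL: "
    "CRITICAL Pending tests: 3589 on gecko-t-linux-xlarge."
)
print(parse_pending_alert(alert))  # ('gecko-t-linux-xlarge', 3589)
```

The returned pool name is the worker type to look for in the checks that follow.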


When this happens, the first step is to check whether we are getting outbid. You can see this [https://tools.taskcluster.net/aws-provisioner/recent-errors here].

Look for the number of InsufficientInstanceCapacity errors belonging to the affected pool.
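If the recent errors have been copied out of that page into a list of records, the count per pool can be tallied as sketched below. This is only an illustration: the record fields <code>workerType</code> and <code>code</code> and the sample data are hypothetical, and the real page may structure its data differently.

```python
from collections import Counter


def count_capacity_errors(errors, code="InsufficientInstanceCapacity"):
    """Count errors with the given code, grouped by worker type.

    `errors` is a list of dicts; the keys "workerType" and "code" are
    hypothetical field names chosen for this sketch.
    """
    return Counter(e["workerType"] for e in errors if e.get("code") == code)


# Made-up sample records, for illustration only.
sample_errors = [
    {"workerType": "gecko-t-linux-xlarge", "code": "InsufficientInstanceCapacity"},
    {"workerType": "gecko-t-linux-xlarge", "code": "InsufficientInstanceCapacity"},
    {"workerType": "gecko-t-linux-large", "code": "InsufficientInstanceCapacity"},
    {"workerType": "gecko-t-linux-xlarge", "code": "SomeOtherError"},
]

counts = count_capacity_errors(sample_errors)
print(counts["gecko-t-linux-xlarge"])  # 2
```

A high count for the affected pool compared with the other pools suggests that capacity, rather than the workers themselves, is the bottleneck.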


The next step is to check [https://papertrailapp.com/ papertrail].

You can filter the logs by worker type.


= Escalation =
Letting people know about the queue in #ci before starting the steps above is always a good idea.
If we are simply short on workers, or the number of jobs keeps piling up, escalate to the sheriffs so they can close the trees until the queues go down, and notify #ci that the trees are closed because of InsufficientInstanceCapacity.




 
Sometimes the problem isn't easy to find, so the next step is to ping people on IRC/Slack:


For EU time-zone we have pmoore <br />
For US time-zone we have bstack, wcosta.
You can also check the escalation path [https://docs.google.com/spreadsheets/d/1pUFq6Z5M5a1ydbSzxNjQivFfryVoksdXa9xXTg9gtzc/edit#gid=0 here].