CIDuty/How To/High Pending Counts

= Rebooting workers in batches =
When many workers are disconnected, e.g. after a network event, it is useful to be able to reboot many of them at one time. The various slave type subpages in slave health (e.g. [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-w732-ix t-w732-ix]) let you do this via batch actions.


Two batch actions are currently available:
# Reboot all broken workers - will reboot all workers that haven't reported a result in more than 6 hours
# Reboot workers that have not reported in # minutes - allows you to specify the time cut-off used for rebooting workers. This is sometimes useful when you have many workers (or even a whole pool) that are failing to connect after, e.g., a network event, and you don't want to wait for them all to idle for 6 hours (see the sketch below for one way to compute the same cut-off by hand).


'''Note:''' these actions don't check whether the worker is currently running a job, only when its last job was run. As such, you *may* lose work in progress. However, if you're having issues across an entire pool, it is sometimes preferable to lose a few in-progress jobs to ensure the health of the larger pool.
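
For reference, the cut-off used by the second batch action is easy to reproduce by hand. The sketch below is only an illustration: it supposes you have already exported a two-column file, last_report.tsv, of worker names and last-report times as Unix timestamps. The file name, its format, and the 120-minute cut-off are placeholders, not something slave health produces for you.

  # Emit the names of workers whose last report is older than the cut-off.
  # last_report.tsv is a hypothetical export: "<worker-name><TAB><last-report-unix-timestamp>"
  CUTOFF_MINUTES=120
  now=$(date +%s)
  awk -v now="$now" -v cutoff=$((CUTOFF_MINUTES * 60)) \
      'now - $2 > cutoff { print $1 }' last_report.tsv > bad_workers.list

The resulting bad_workers.list is exactly the kind of file the manual reboot loop below expects.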


These actions use slaveapi to perform the reboots. You can also do this by hand: create a text file containing the list of the workers you want to reboot (let's call it bad_workers.list), set the MY_LDAP_USER and MY_LDAP_PASSWORD environment variables to your LDAP credentials, make sure you are on the VPN, and then run something like the following (double-check the slaveapi reboot URL for your deployment; the one below is only an example):


  # Assumption: the slaveapi reboot URL below is an example; adjust it for your deployment.
  cat bad_workers.list | \
   while read slave; do \
     curl -u "${MY_LDAP_USER}:${MY_LDAP_PASSWORD}" \
       -X POST "https://secure.pub.build.mozilla.org/slaveapi/slaves/${slave}/actions/reboot"; \
   done