ReleaseEngineering/How To/Restart Buildbot Masters
We occasionally need to restart buildbot masters for various reasons:
- upgrades to the underlying OS
- gradual increase in memory usage over time, leading to reduced master performance
If you need to restart a single master by hand, here's the sequence you should follow:
- disable the master in slavealloc. This prevents the master from taking more slave connections while you're waiting for it to shut down.
- click the "Clean Shutdown" button on the web interface for the given master, e.g. http://buildbot-master82.bb.releng.scl3.mozilla.com:8001/
- wait for the jobs currently running on that master to complete. You can track progress by searching in-page for "Running" on the master's buildslaves page, e.g. http://buildbot-master82.bb.releng.scl3.mozilla.com:8001/buildslaves?no_builders=1
- once the master is shut down, perform whatever upgrades or other maintenance are required.
- restart the master. NOTE: buildbot masters are configured to restart buildbot automatically on boot, so if you reboot the master, buildbot will restart itself. To restart manually:
    xebec:buildduty ccooper$ ssh cltbld@buildbot-master82
    Unauthorized access prohibited
    $ cd /builds/buildbot/build1/
    $ make start
- re-enable the master in slavealloc.
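The manual sequence above can be sketched as a small driver. This is only an illustration of the ordering, not the logic of any existing tool: the names `count_running` and `restart_master`, and the five injected callables, are hypothetical placeholders for the slavealloc, web-interface, and ssh steps. Counting the word "Running" simply mirrors the in-page search described above.

```python
import time


def count_running(page_text):
    """Count occurrences of "Running" in the buildslaves page text.

    This mirrors the manual in-page search; it is a heuristic, not a parser.
    """
    return page_text.count("Running")


def restart_master(disable, clean_shutdown, fetch_buildslaves_page,
                   start, enable, poll_seconds=60, sleep=time.sleep):
    """Run the manual restart steps in order.

    All five callables are hypothetical stand-ins supplied by the caller.
    fetch_buildslaves_page is assumed to return an empty string once the
    master has exited and its web page is no longer reachable.
    """
    disable()                    # stop new slave connections (slavealloc)
    clean_shutdown()             # equivalent of the "Clean Shutdown" button
    while count_running(fetch_buildslaves_page()) > 0:
        sleep(poll_seconds)      # wait for running jobs to finish
    # Upgrades, if any, would happen here while the master is down.
    start()                      # e.g. "make start" in the master directory
    enable()                     # accept slave connections again
```

Passing the steps in as callables keeps the ordering easy to test without touching a real master.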
The above actions have been encapsulated into a script: https://hg.mozilla.org/build/tools/file/default/buildfarm/maintenance/restart_masters.py. The script is set up to run on buildduty-tools.srv.releng.usw2.mozilla.com, located at /home/buildduty/restart_masters/ under buildduty's account.
The wrapper script /home/buildduty/restart_masters.sh is used to update repositories and call restart_masters.py. The latter will prompt for the usernames and passwords it needs (LDAP for slavealloc; cltbld and root for ssh).
Here is an example invocation:
    # ssh -A buildduty@buildduty-tools
    $ screen -R restart_masters
    $ /home/buildduty/restart_masters.sh
You can enter nonsense for the cltbld and root passwords because key auth is used. Forwarding your ssh agent (ssh -A) is required for ssh access to the masters. If you use a timeout, make sure it's long enough for the script to repeat, but don't leave it running indefinitely. NB: rebooting masters by adding the -r arg to restart_masters.py does not work because root logins are disabled.
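Because the script depends on key auth through a forwarded agent, it can help to confirm the agent is visible before starting. A minimal sketch, assuming the usual OpenSSH behaviour of exporting `SSH_AUTH_SOCK` in a session that has agent forwarding; the helper name `agent_available` is hypothetical and is not part of restart_masters.py.

```python
import os


def agent_available(environ=None):
    """Return True if an ssh agent socket is advertised in the environment.

    This only checks that the variable is set; it does not verify that
    the agent actually holds a usable key.
    """
    env = os.environ if environ is None else environ
    return bool(env.get("SSH_AUTH_SOCK"))


if __name__ == "__main__":
    if agent_available():
        print("ssh agent socket found")
    else:
        print("no ssh agent socket; reconnect with ssh -A")
```

Running `ssh-add -l` in the same session lists the keys the agent currently offers, which is a stronger check than the environment variable alone.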
You can send a SIGUSR1 to restart_masters.py to prompt an extra progress report; allow time for the current status check to complete first.
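For illustration, this is roughly how a Python script can respond to SIGUSR1 between status checks. It is a generic sketch of the flag-setting signal pattern on Unix, not the code in restart_masters.py; the names `report_requested`, `request_report`, and `maybe_report` are hypothetical.

```python
import signal

# Set by the signal handler, cleared once a report has been emitted.
report_requested = False


def request_report(signum, frame):
    """Signal handler: only set a flag; the main loop does the reporting."""
    global report_requested
    report_requested = True


def maybe_report(status_lines, out=print):
    """Emit a progress report if one was requested since the last call.

    Intended to be called by the main loop after each status check.
    """
    global report_requested
    if not report_requested:
        return False
    report_requested = False
    for line in status_lines:
        out(line)
    return True


# Register the handler (Unix only; must be done from the main thread).
signal.signal(signal.SIGUSR1, request_report)
```

With a pattern like this, `kill -USR1 <pid>` from another shell only records the request, so the report appears once the status check in progress has finished.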
The above script requires sensitive credentials that shouldn't be stored on disk. For now, we're still running this script by hand.
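One way to keep such credentials off disk is to prompt for them at start-up and hold them only in memory. A minimal sketch using Python's standard `getpass` module; the function name `prompt_credentials` and the prompt labels are hypothetical, and this is not necessarily how restart_masters.py does it.

```python
import getpass


def prompt_credentials(labels, ask_user=input, ask_secret=getpass.getpass):
    """Prompt for a username and password per label.

    The answers are returned in a dict and never written to disk.
    getpass.getpass reads the password without echoing it to the terminal.
    """
    creds = {}
    for label in labels:
        user = ask_user("%s username: " % label)
        password = ask_secret("%s password: " % label)
        creds[label] = (user, password)
    return creds
```

A call such as `prompt_credentials(["LDAP", "ssh"])` would ask for two username and password pairs in turn; the two prompt functions are parameters only so the helper can be exercised without a terminal.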