ReleaseEngineering/How To/Restart Buildbot Masters

From MozillaWiki

We occasionally need to restart buildbot masters for various reasons:

  • upgrades to the underlying OS
  • gradual increase in memory usage over time, leading to reduced master performance


If you need to restart a single master by hand, here's the sequence you should follow:

  • disable the master in slavealloc. This prevents the master from taking more slave connections while you're waiting for it to shut down.
  • click the "Clean Shutdown" button on the web interface for the given master, e.g.
  • wait for the jobs currently running on that master to complete. You can track progress by searching in-page for "Running" on the master's buildslaves page, e.g.
  • once the master is shut down, perform whatever upgrades are required, etc.
  • restart the master. NOTE: buildbot masters are configured to restart buildbot automatically on boot, so if you reboot the master, buildbot will restart itself. To restart manually:
xebec:buildduty ccooper$ ssh cltbld@buildbot-master82
Unauthorized access prohibited
[ ~]$ cd /builds/buildbot/build1/
[ build1]$ make start
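
The manual restart above can also be issued as a single remote command. The following is only a sketch: `restart_master` and the `DRY_RUN` switch are hypothetical names introduced for illustration, and the host and base directory are taken from the session above. It assumes key-based ssh access as cltbld.

```shell
#!/bin/sh
# Hypothetical helper (not an existing tool): restart buildbot on one master.
# The base directory defaults to the one used in the session above.
restart_master() {
    host="$1"
    basedir="${2:-/builds/buildbot/build1}"
    cmd="cd $basedir && make start"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Illustrative switch: only print what would be executed.
        printf 'ssh cltbld@%s %s\n' "$host" "'$cmd'"
    else
        ssh "cltbld@$host" "$cmd"
    fi
}
```

Running `DRY_RUN=1 restart_master buildbot-master82` prints the ssh command without connecting, which is a cheap way to check the host name and path before doing it for real.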

By script

The above actions have been encapsulated into a script: The script is set up to run on, located at /home/buildduty/restart_masters/ under buildduty's account.

The wrapper script /home/buildduty/ is used to update repositories and call The latter will prompt for the usernames and passwords it needs (ldap for slavealloc, cltbld and root for ssh).

Here is an example invocation:

# ssh -A buildduty@buildduty-tools
$ screen -R restart_masters
$ /home/buildduty/

You can enter nonsense for the cltbld and root passwords because key auth is used. Forwarding your ssh agent is required for ssh access to the masters. If you use a timeout, make sure it's sufficiently long for the script to repeat, but don't leave it running indefinitely. NB: rebooting masters by adding the -r arg to is non-functional because root logins are disabled.
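
Since the script depends on a forwarded ssh agent, it may be worth checking that the agent is reachable before starting. The check below is a minimal sketch; `agent_ready` is a hypothetical function name, and it assumes the standard OpenSSH `ssh-add` client is installed on the host.

```shell
#!/bin/sh
# Pre-flight check: is an ssh agent reachable, and does it hold a key?
agent_ready() {
    # ssh sets SSH_AUTH_SOCK when agent forwarding (-A) is active.
    [ -n "${SSH_AUTH_SOCK:-}" ] || return 1
    # ssh-add -l exits 0 only if the agent answers and lists at least one identity.
    ssh-add -l >/dev/null 2>&1
}

if agent_ready; then
    echo "ssh agent available"
else
    echo "no usable ssh agent; reconnect with ssh -A" >&2
fi
```

Inside a reattached screen session the socket path can be stale; in that case `ssh-add -l` fails and the check reports the agent as unusable.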

The logs go to papertrail, and the script outputs a progress report every 60 minutes. For a really minimal view (which may hide errors) see this filtered view.

You can send a SIGUSR1 to to prompt an extra progress report; allow time for the current status check to complete first.
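
Sending the signal might look like the sketch below. `request_report` is a hypothetical helper, and the PID of the running script is left as an argument because this page does not show how to look it up.

```shell
#!/bin/sh
# Hypothetical helper: ask a running process for an extra progress report.
# $1 is the PID of the restart script, which you must find yourself.
request_report() {
    pid="$1"
    # kill -0 sends no signal; it only checks that the process exists
    # and that we are allowed to signal it.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "no such process: $pid" >&2
        return 1
    fi
    kill -USR1 "$pid"
}
```

Checking the PID first avoids signalling the wrong target after the script has already exited. A process that has no SIGUSR1 handler is terminated by the signal, so make sure the PID really belongs to the restart script.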


The above script requires sensitive credentials that shouldn't be stored on disk. For now, we're still running this script by hand.