CIDuty/How To/Deprecated / Archived/Slave Management
Slave Management
In general, slave management involves:
- keeping as many slaves up as possible, including
- proactively checking for hung/broken slaves - see the last build per slave page which is updated once an hour. The Puppet dashboard is also useful.
- returning re-imaged slaves to production
- handling nagios alerts for slaves
- interacting with IT regarding slave maintenance
Known failure modes
- talos-r3-fed|fed64
- these slaves frequently fail to reboot cleanly, knocking themselves off the network entirely. If they fail to puppetize cleanly, also check for a stale puppet lock file (/var/lib/puppet/state/puppetdlock).
- talos-r3-[w7|xp]
- Windows slaves have issues with modal dialogs, and sometimes the msys shell will fail to close properly. A manual reboot will usually clear this up.
- talos-r4-[lion|snow]
- These slaves will sometimes fail to puppetize correctly. The remote_scutil_cmds.bash script can help with this.
- r4 slaves
- tegras and pandas
- tegras and pandas can fail in many disparate ways. See ReleaseEngineering/How_To/Android_Tegras for more info.
- AWS slaves
- a common failure is running out of disk space. They have default disk allocations of 150GB, versus 250GB for our in-house slaves. Catlee is working on changing that.
- To clean them up, you can run mock_mozilla -v -r mozilla-centos6-i386 --scrub=all; see bug 829186 for an example.
- Rail wrote a tool to manage AWS slaves: it can enable or disable automatic reboot and automatic shutdown.
- Mozilla DNS servers don't resolve AWS hostnames; a separate document describes how to resolve them.
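The disk-space cleanup above can be sketched as a small check-then-suggest script. This is a sketch, not a supported tool: the 90% threshold is an assumption rather than policy, and the script only prints the mock_mozilla command from above instead of running it.

```shell
#!/bin/sh
# Sketch: check root-partition usage on an AWS slave and suggest the
# mock scrub shown above when usage crosses a threshold.
# The 90% cutoff is an assumption, not an official policy.

THRESHOLD=90

# POSIX df -P: line 2, column 5 is "Use%"; strip the trailing '%'.
used_pct=$(df -P / | awk 'NR==2 {gsub("%", "", $5); print $5}')

if [ "$used_pct" -ge "$THRESHOLD" ]; then
    echo "/ is ${used_pct}% full; run: mock_mozilla -v -r mozilla-centos6-i386 --scrub=all"
else
    echo "/ is ${used_pct}% full; no cleanup needed"
fi
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; on a real slave you would replace the echo with the scrub itself.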
File a bug
- File a bug using the link in slavealloc - it will "do the right thing" to set up a new bug if needed.
- Make the individual slave bug block the appropriate colo reboot/recovery bug (check the machine domain):
- reboots-mtv1 - MTV
- reboots-scl1 - SCL1
- reboots-scl3 - SCL3
- tegra-recovery - tegras
- These bugs get closed when IT has recovered all of the individual blocking slaves. Clone the recovery bug and move the alias forward as required; otherwise, machines added to the original alias may be unintentionally rebooted.
- Make sure the alias of the bug is the hostname (done automatically if you follow slavealloc bug link)
- Create dependent bugs for any IT actions (beyond normal reboot)
- should block both the datacenter bug & the per host bug (for record keeping)
- consider whether the slave should be disabled in slavealloc, and note that in the bug (a slave should not be disabled without a detailed bug)
- DCOps assumes that if there is no separate bug, they only need to reboot the machine and see it come back online.
Slave Tracking
- Slave tracking is done via the Slave Allocator. Please disable/enable slaves in slavealloc and add relevant bug numbers to the Notes field.
Slavealloc
Adding a slave
Slaves are added to slavealloc via the 'dbimport' subcommand of the 'slavealloc' command. This is generally run as the slavealloc user on the slavealloc server, which is most easily accessed via su from root.
You'll want a command line something like
/tools/slavealloc/bin/slavealloc dbimport -D $db_url --slave-data mydata.csv
where $db_url is most easily found in slavealloc's shell history. The CSV file should have the headers specified by 'slavealloc dbimport --help':
name,distro,bitlength,speed,datacenter,trustlevel,environment,purpose,pool,basedir,enabled
talos-r3-xp-096,winxp,32,mini,scl1,try,prod,tests,tests-scl1-windows,C:\\talos-slave,1
Adding masters is similar - see dbimport's help for more information.
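The CSV format above is easy to get wrong by hand, so a small wrapper that writes the file, sanity-checks it, and runs the import can help. A sketch, assuming the header list from 'slavealloc dbimport --help' above; the slave row and $db_url are illustrative, and the import step only runs if the slavealloc binary is present.

```shell
#!/bin/sh
# Sketch: write the CSV that 'slavealloc dbimport' expects, then import it.
# The slave row below is the example from this page; adjust per slave.

cat > mydata.csv <<'EOF'
name,distro,bitlength,speed,datacenter,trustlevel,environment,purpose,pool,basedir,enabled
talos-r3-xp-096,winxp,32,mini,scl1,try,prod,tests,tests-scl1-windows,C:\\talos-slave,1
EOF

# Sanity-check: every row must have the same number of fields as the header.
awk -F, 'NR==1 {n=NF} NF!=n {print "bad row: " $0; exit 1}' mydata.csv

# Run as the slavealloc user on the slavealloc host; $db_url comes from
# slavealloc's shell history (see above).
if [ -x /tools/slavealloc/bin/slavealloc ]; then
    /tools/slavealloc/bin/slavealloc dbimport -D "$db_url" --slave-data mydata.csv
fi
```

The field-count check catches the most common mistake (a stray comma in the basedir); for anything subtler, trust dbimport's own error messages.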
Removing slaves
Connect to slavealloc@slavealloc and look at the history for a command looking like this:
mysql -h $host_ip -p -u buildslaves buildslaves   # type the password when prompted
SELECT name FROM slaves WHERE notes LIKE '%bumblebumble%';
DELETE FROM slaves WHERE notes LIKE '%bumblebumble%';
Using briar-patch tools (kitten) to manage slaves
See ReleaseEngineering:Buildduty:Kitten
How to decommission a slave
- disable the slave in slavealloc, also setting its environment to "decomm"
- if the hardware has failed:
- file a bug against Server Ops:Releng to decommission the slave. They should (at the very least) make sure the nagios alerts are updated, DNS is updated, and the hardware is recovered from the data center.
- if the hardware is still viable and can be used by another pool (e.g. r3 mini)
- file a bug against Server Ops:Releng to have the slave re-imaged as another OS whose pool has bad wait times (usually Windows)
- add the new slave to the buildbot configs, and make sure nagios monitoring is set up for the new slave (this may require a new bug against relops)
- remove the slave from the buildbot configs
- remove the slave from puppet or opsi configs, if it exists in one