CIDuty/How To/Deprecated / Archived/Slave Management

From MozillaWiki
Revision as of 19:44, 29 January 2013 by ChrisCooper (talk | contribs)

Slave Management

In general, slave management involves:

  • keeping as many slaves up as possible, including
  • handling nagios alerts for slaves
  • interacting with IT regarding slave maintenance

Known failure modes

  • talos-r3-fed|fed64
    • these slaves frequently fail to reboot cleanly, knocking themselves off the network entirely. Also check for a stale puppet lock (/var/lib/puppet/state/puppetdlock) if they fail to puppetize cleanly.
  • talos-r3-[w7|xp]
    • Windows slaves have issues with modal dialogs, and sometimes the msys shell will fail to close properly. A manual reboot will usually clear this up.
  • talos-r4-[lion|snow]
    • These slaves will sometimes fail to puppetize correctly. The remote_scutil_cmds.bash script can help with this.
    • r4 slaves
  • tegras and pandas
  • AWS slaves
    • A common failure is running out of disk space: these instances have default disk allocations of 150GB versus 250GB for our in-house slaves. Catlee is working on changing that.
      • To clean them up, run mock_mozilla -v -r mozilla-centos6-i386 --scrub=all. See bug 829186 for an example.
    • Rail wrote a tool to manage AWS slaves: it can enable or disable automatic reboot and automatic shutdown.
    • Mozilla DNS servers don't resolve AWS hostnames; this document describes how to resolve them.
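
For the AWS disk-space case above, a cleanup sketch; the build directory path and the free-space threshold are assumptions, not part of the original instructions:

```shell
#!/bin/sh
# Sketch: free disk space on an AWS slave using the mock scrub command
# from the list above. BUILDS_DIR and the 10GB cutoff are assumptions.

free_kb() {
    # print available kilobytes on the filesystem holding $1
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

BUILDS_DIR=${BUILDS_DIR:-/builds}     # assumed build directory
THRESHOLD_KB=$((10 * 1024 * 1024))    # assumed cutoff: 10GB free

if command -v mock_mozilla >/dev/null 2>&1 &&
   [ -d "$BUILDS_DIR" ] &&
   [ "$(free_kb "$BUILDS_DIR")" -lt "$THRESHOLD_KB" ]; then
    # scrub all mock chroots and caches (see bug 829186)
    mock_mozilla -v -r mozilla-centos6-i386 --scrub=all
fi
```

Checking the threshold first keeps the scrub (which throws away all chroot caches) from running on every pass.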

File a bug

  • File a bug using the link in slavealloc - it will "do the right thing" to set up a new bug if needed.
  • Make the individual slave bug block the appropriate colo reboot/recovery bug (check the machine domain):
    • reboots-mtv1 - MTV
    • reboots-scl1 - SCL1
    • reboots-scl3 - SCL3
    • tegra-recovery - tegras
    • These bugs get closed when IT has recovered all of the individual blocking slaves. You should clone the recovery bug and move the alias forward as required; otherwise you risk having machines that were later added to the original alias unintentionally rebooted.
  • Make sure the alias of the bug is the hostname (done automatically if you follow slavealloc bug link)
  • Create dependent bugs for any IT actions (beyond normal reboot)
    • should block both the datacenter bug & the per host bug (for record keeping)
    • consider whether the slave should be disabled in slavealloc, and note that in the bug (no slave should be disabled without a detailed bug)
    • DCOps assumes that if there is no separate bug, they only need to reboot the machine and see it come back online.

Slave Tracking

  • Slave tracking is done via the Slave Allocator. Please disable/enable slaves in slavealloc and add relevant bug numbers to the Notes field.

Slavealloc

Adding a slave

Slaves are added to slavealloc via the 'dbimport' subcommand of the 'slavealloc' command. This is generally run as the slavealloc user on the slavealloc server, which is most easily accessed via su from root.

You'll want a command line something like

/tools/slavealloc/bin/slavealloc dbimport -D $db_url --slave-data mydata.csv

where $db_url is most easily found in slavealloc's shell history. The CSV file should have the headers specified by 'slavealloc dbimport --help':

name,distro,bitlength,speed,datacenter,trustlevel,environment,purpose,pool,basedir,enabled
talos-r3-xp-096,winxp,32,mini,scl1,try,prod,tests,tests-scl1-windows,C:\\talos-slave,1

Adding masters is similar - see dbimport's help for more information.
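
The dbimport invocation above can be sketched end to end; the hostname range here is hypothetical, and $db_url is still the value you dig out of slavealloc's shell history:

```shell
#!/bin/sh
# Sketch: build a slave-data CSV for a batch of similarly named slaves
# and hand it to dbimport. The talos-r3-xp-096..098 range is made up
# for illustration; the column values mirror the example row above.

csv=$(mktemp)
printf '%s\n' 'name,distro,bitlength,speed,datacenter,trustlevel,environment,purpose,pool,basedir,enabled' > "$csv"
for n in $(seq -w 96 98); do
    printf '%s\n' "talos-r3-xp-0$n,winxp,32,mini,scl1,try,prod,tests,tests-scl1-windows,C:\\\\talos-slave,1" >> "$csv"
done

# Run as the slavealloc user on the slavealloc server:
# /tools/slavealloc/bin/slavealloc dbimport -D "$db_url" --slave-data "$csv"
```

Using printf rather than echo keeps the doubled backslashes in the Windows basedir from being mangled by shells whose echo interprets escapes.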

Removing slaves

Connect to slavealloc@slavealloc and look at the history for a command looking like this:

 mysql -h $host_ip -p -u buildslaves buildslaves
 # type the password
 SELECT name FROM slaves WHERE notes LIKE '%bumblebumble%';
 -- verify the SELECT only matches the slaves you expect, then:
 DELETE FROM slaves WHERE notes LIKE '%bumblebumble%';
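
Because the DELETE is destructive, it helps to keep the preview SELECT and the DELETE together in one reviewable file. A minimal sketch, using the placeholder pattern from the snippet above:

```shell
#!/bin/sh
# Sketch: write the removal SQL to a file so the preview and the DELETE
# travel together; 'bumblebumble' is the placeholder pattern from the
# history snippet above.
pattern='bumblebumble'
sqlfile=$(mktemp)
cat > "$sqlfile" <<SQL
-- preview exactly which rows the DELETE will remove
SELECT name FROM slaves WHERE notes LIKE '%$pattern%';
DELETE FROM slaves WHERE notes LIKE '%$pattern%';
SQL

# Review $sqlfile, then feed it to the mysql command from the history:
# mysql -h $host_ip -p -u buildslaves buildslaves < "$sqlfile"
```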

Using briar-patch tools (kitten) to manage slaves

See ReleaseEngineering:Buildduty:Kitten

How to decommission a slave

  • disable the slave in slavealloc, also setting its environment to "decomm"
  • if the hardware has failed:
    • file a bug against Server Ops:Releng to decommission the slave. They should (at the very least) make sure the nagios alerts are updated, DNS is updated, and the hardware is recovered from the data center.
  • if the hardware is still viable and can be used by another pool (e.g. r3 mini)
    • file a bug against Server Ops:Releng to have the slave re-imaged to an OS whose pool has bad wait times (usually Windows)
    • add the new slave to the buildbot configs, and make sure nagios monitoring is set up for the new slave (may require a new bug against relops)
  • remove the slave from the buildbot configs
  • remove the slave from puppet or opsi configs, if it exists in one