ReleaseEngineering/How To/Manage Buildbot with Fabric: Difference between revisions

 
RelEng has started writing some tools to manage all the buildbot masters using fabric.


The [http://hg.mozilla.org/build/tools/file/default/buildfarm/maintenance/manage_masters.py manage_masters.py] script is available from the [http://hg.mozilla.org/build/tools tools] repository, in the buildfarm/maintenance directory.


[http://docs.fabfile.org/0.9.3/ Fabric] is a prerequisite for running these tools. It can be installed into a virtual environment with easy_install or pip. Setting up <tt>ssh-agent</tt> is strongly recommended (see [[#Hosts_and_role_groups|below]] for details).
 
= Setup =
 
  hg clone ssh://hg.mozilla.org/build/tools
  cd tools
  mkvirtualenv tools
  pip install fabric


= Usage =
  python manage_masters.py \
     -f http://hg.mozilla.org/build/tools/raw-file/tip/buildfarm/maintenance/production-masters.json \
     -R scheduler check
or, if your tools repo is up to date, just:
  python manage_masters.py -f production-masters.json -R scheduler check


= buildbot-wrangler.py =
Make sure you run Fabric from the "buildfarm/maintenance" directory: buildbot-wrangler.py lives there and needs to be uploaded to the masters when we do a reconfig.
  Traceback (most recent call last):
   File "build/bdist.macosx-10.6-universal/egg/fabric/main.py", line 540, in main


= Suggestions =
Don't use Fabric to reconfig the test masters if you are in a rush (e.g. backing something out): the reconfigs run sequentially and take a very long time.


If you need to reconfig everything, it is much better to run four instances of Fabric, each in a different terminal. The reconfig step is blocking: Fabric won't continue to the next host in a role group until the current one finishes. (Remember that the reconfig step does NOT update.)


  # in case it is not clear: run each one in a different window
  python manage_masters.py -f production-masters.json -j16 -R scheduler update checkconfig reconfig
  python manage_masters.py -f production-masters.json -j16 -R build     update checkconfig reconfig
  python manage_masters.py -f production-masters.json -j16 -R try       update checkconfig reconfig
  python manage_masters.py -f production-masters.json -j16 -R tests     update checkconfig reconfig
 
The tests reconfig can take a really long time, so you can parallelize it by using -M {macosx|windows|linux|tegra|panda} instead of "-R tests", each in a different tab and each with -j16. So, replace the last line/window with these 5 (for a total of 8 windows):
 
  python manage_masters.py -f production-masters.json -j16 -M macosx  update checkconfig reconfig
  python manage_masters.py -f production-masters.json -j16 -M windows update checkconfig reconfig
  python manage_masters.py -f production-masters.json -j16 -M linux   update checkconfig reconfig
  python manage_masters.py -f production-masters.json -j16 -M tegra   update checkconfig reconfig
  python manage_masters.py -f production-masters.json -j16 -M panda   update checkconfig reconfig
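The eight command lines can also be generated rather than typed by hand. The following is only a sketch: the helper name <tt>build_commands</tt> is hypothetical and not part of the tools repo, and the role and platform lists are simply the ones used on this page. It only builds the command lines; each one still has to be started in its own window.

```python
# Hypothetical helper (not part of the tools repo): builds the per-window
# command lines described on this page without running anything.
MASTERS_FILE = "production-masters.json"
ACTIONS = ["update", "checkconfig", "reconfig"]


def build_commands(roles, platforms, jobs=16):
    """Return one manage_masters.py argument list per window."""
    base = ["python", "manage_masters.py", "-f", MASTERS_FILE, "-j%d" % jobs]
    commands = []
    for role in roles:
        # One window per role group, selected with -R.
        commands.append(base + ["-R", role] + ACTIONS)
    for platform in platforms:
        # One window per test platform, selected with -M instead of "-R tests".
        commands.append(base + ["-M", platform] + ACTIONS)
    return commands


if __name__ == "__main__":
    for cmd in build_commands(
        ["scheduler", "build", "try"],
        ["macosx", "windows", "linux", "tegra", "panda"],
    ):
        print(" ".join(cmd))
```

Keeping the platform list in one place makes it easier to keep in sync with the masters file when a new platform is added.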
 
To validate the above (i.e. to confirm that no new platforms have been added since these docs were last updated), run:
  diff -u \
    <(./manage_masters.py -f production-masters.json -l -R tests) \
    <(./manage_masters.py -f production-masters.json -l -M macosx \
        -M windows -M linux -M tegra -M panda)
If any differences are reported, include those platforms and update the docs.
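The idea behind this check can be sketched in Python. This is an illustration, not how manage_masters.py does it: it assumes that -M is a plain substring match on the master name, and every master name below other than bm19-tests1-tegra is made up. It checks one direction only (test masters not covered by any platform), whereas the diff reports differences in both directions.

```python
def uncovered_masters(test_master_names, platform_patterns):
    """Return the test masters whose name matches none of the -M patterns.

    Assumption: -M behaves like a substring match on the master name.
    """
    return sorted(
        name
        for name in test_master_names
        if not any(pattern in name for pattern in platform_patterns)
    )


if __name__ == "__main__":
    # Hypothetical names, apart from bm19-tests1-tegra which appears on this page.
    names = ["bm19-tests1-tegra", "bm20-tests1-linux", "bm99-tests1-newplatform"]
    patterns = ["macosx", "windows", "linux", "tegra", "panda"]
    print(uncovered_masters(names, patterns))
```

An empty result corresponds to the diff reporting no missing test masters.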


= Hosts and role groups =
Fabric works on individual hosts, and supports organizing these hosts into groups.  This is mostly a good fit for how we need to work, except we often have multiple buildbot masters on a single host, so there is a bit of hacking in master_fabric.py to pick out the right hosts to operate on depending on what the user has selected.


Hosts are selected with the -H flag, and roles with the -R flag. Hosts correspond to the 'name' field in the masters JSON file; they are short abbreviations that refer to each master, e.g. bm13-build1, bm19-tests1-tegra, bm33-try1, bm36-build_scheduler. We have 4 roles defined: '''build''', '''scheduler''', '''try''', and '''tests'''. Selecting a role restricts Fabric to masters that have that role.


The string 'all', when specified via -H or -R, means that all masters in the masters file will be operated on. You can also use the -M flag to match on strings in the master name, e.g. -M tests1-windows to pick up all the Windows test masters. Note that manage_masters.py will "or" all host specifications from the command line: "-R tests -M windows" will return all hosts in role "tests", not just the Windows test masters.
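The "or" behaviour is easy to misread, so here is a minimal sketch of the selection logic as described above. It is not the code in manage_masters.py: the function name <tt>select_masters</tt> is hypothetical, the 'role' field name is an assumption (only the 'name' field is documented here), -M is assumed to be a substring match, and bm20-tests1-windows is a made-up master name.

```python
def select_masters(masters, hosts=(), roles=(), matches=()):
    """Return the names of masters picked by any of -H, -R or -M (a union)."""
    selected = []
    for master in masters:
        name = master["name"]
        if (
            "all" in hosts
            or "all" in roles
            or name in hosts                      # -H: exact master name
            or master.get("role") in roles        # -R: assumed 'role' field
            or any(m in name for m in matches)    # -M: assumed substring match
        ):
            selected.append(name)
    return selected


if __name__ == "__main__":
    masters = [
        {"name": "bm13-build1", "role": "build"},
        {"name": "bm19-tests1-tegra", "role": "tests"},
        {"name": "bm20-tests1-windows", "role": "tests"},  # hypothetical
        {"name": "bm33-try1", "role": "try"},
    ]
    # "-R tests -M windows" selects every test master, not only the Windows one.
    print(select_masters(masters, roles=["tests"], matches=["windows"]))
    # "-M windows" alone narrows the selection to the Windows test master.
    print(select_masters(masters, matches=["windows"]))
```

To operate on only the Windows test masters, pass -M on its own rather than combining it with -R.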


Fabric relies on being able to ssh to the masters without password authentication, so be sure to have your ssh keys set up! That means having the needed keys added to the running instance of your ssh-agent (your "<tt>~/.ssh/config</tt>" file is ''not'' consulted by Paramiko). If you don't have the keys set up, you'll be asked for your password once per invocation, so use multiple commands per invocation where appropriate.
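A quick way to check the agent before starting a long run is sketched below. This is not part of the tools repo; the function name <tt>agent_has_keys</tt> is hypothetical, and the sketch assumes Python 3.5 or later and an OpenSSH <tt>ssh-add</tt> on the PATH. It relies on <tt>ssh-add -l</tt> exiting with status 0 only when the agent is reachable and holds at least one identity.

```python
import os
import subprocess


def agent_has_keys(env=None):
    """Return True if an ssh-agent is reachable and holds at least one key."""
    env = os.environ if env is None else env
    # Without SSH_AUTH_SOCK there is no agent to talk to.
    if not env.get("SSH_AUTH_SOCK"):
        return False
    try:
        result = subprocess.run(
            ["ssh-add", "-l"],
            env=dict(env),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # ssh-add is not installed or could not be executed.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if not agent_has_keys():
        print("No keys in ssh-agent: expect a password prompt per invocation.")
```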


= Updating checkout =
= Checkconfig =
<pre>
python manage_masters.py -f production-masters.json -R build -R scheduler checkconfig
bm3            OK
pm02-sm        OK
...
</pre>


= Reconfigure =
'''''Reminder:''''' ''<tt>reconfig</tt> only does the reconfig; you need to have run <tt>update</tt> and <tt>checkconfig</tt> beforehand.''
<pre>
python manage_masters.py -f production-masters.json -R build reconfig
...
Disconnecting from buildbot-master2.build.mozilla.org... done.
</pre>
If the reconfig gets stuck, see [https://wiki.mozilla.org/ReleaseEngineering/How_To/Unstick_a_Stuck_Slave_From_A_Master How To/Unstick a Stuck Slave From A Master].
As a special case for test masters, you can unstick things by either:
* triggering a "Clean Shutdown" from the web UI for that master, or
* using the <tt>graceful_restart</tt> command of manage_masters.py
After the running jobs complete, the master will shut down (its web page will no longer be served). Fabric should notice and unstick itself at that point. If Fabric doesn't notice, run the update and start steps individually in a separate window. If Fabric still doesn't notice, good luck, and document what works.