ChrisCooper (talk | contribs) |
|||
(15 intermediate revisions by 7 users not shown) | |||
Line 2: | Line 2: | ||
RelEng has started writing some tools to manage all the buildbot masters using fabric. | RelEng has started writing some tools to manage all the buildbot masters using fabric. | ||
The [http://hg.mozilla.org/build/tools/file/default/buildfarm/maintenance/manage_masters.py manage_masters.py] script is available from the [http://hg.mozilla.org/build/tools tools] repository | The [http://hg.mozilla.org/build/tools/file/default/buildfarm/maintenance/manage_masters.py manage_masters.py] script is available from the [http://hg.mozilla.org/build/tools tools] repository, in the buildfarm/maintenance directory. | ||
[http://docs.fabfile.org/0.9.3/ Fabric] is a pre-requisite for running these tools. It is easy-installable into a virtual environment. | [http://docs.fabfile.org/0.9.3/ Fabric] is a pre-requisite for running these tools. It is easy-installable into a virtual environment. Setup of <tt>ssh-agent</tt> is strongly recommended (see [[#Hosts_and_role_groups|below]] for details). | ||
= Setup = | |||
hg clone ssh://hg.mozilla.org/build/tools | |||
cd tools | |||
mkvirtualenv tools | |||
pip install fabric | |||
= Usage = | = Usage = | ||
Line 17: | Line 24: | ||
python manage_masters.py \ | python manage_masters.py \ | ||
-f http://hg.mozilla.org/build/tools/raw-file/tip/buildfarm/maintenance/production-masters.json \ | -f http://hg.mozilla.org/build/tools/raw-file/tip/buildfarm/maintenance/production-masters.json \ | ||
-R scheduler - | -R scheduler check | ||
or if your tools repo is up to date just | |||
python manage_masters.py -f production-masters.json -R scheduler check | |||
= buildbot-wrangler.py = | = buildbot-wrangler.py = | ||
Make sure you run fabric from "buildfarm/maintenance" since buildbot-wrangler.py is there and needs to be uploaded to the masters when we try to do a | Make sure you run fabric from "buildfarm/maintenance" since buildbot-wrangler.py is there and needs to be uploaded to the masters when we try to do a reconfig. | ||
Traceback (most recent call last): | Traceback (most recent call last): | ||
File "build/bdist.macosx-10.6-universal/egg/fabric/main.py", line 540, in main | File "build/bdist.macosx-10.6-universal/egg/fabric/main.py", line 540, in main | ||
Line 31: | Line 40: | ||
= Suggestions = | = Suggestions = | ||
Don't use fabric with the test masters to | Don't use fabric with the test masters to reconfig if you are in a rush (backing something out) as it takes forever (sequential reconfigs). | ||
If you need to | If you need to reconfig everything it is much better if you run four instances of fabric (each on a different terminal). The reconfig step is blocking and it won't continue to the next host on a role group until it finishes. (Remember the reconfig step does NOT update.) | ||
# in case it is not clear; Run each one on a different window | # in case it is not clear; Run each one on a different window | ||
python manage_masters.py -f production-masters.json -R scheduler | python manage_masters.py -f production-masters.json -j16 -R scheduler update checkconfig reconfig | ||
python manage_masters.py -f production-masters.json -R build | python manage_masters.py -f production-masters.json -j16 -R build update checkconfig reconfig | ||
python manage_masters.py -f production-masters.json -R try | python manage_masters.py -f production-masters.json -j16 -R try update checkconfig reconfig | ||
python manage_masters.py -f production-masters.json -R tests | python manage_masters.py -f production-masters.json -j16 -R tests update checkconfig reconfig | ||
The tests reconfig can take a really long time, so you can parallelize it by using -M {macosx|windows|linux|tegra|panda} (instead of "-R tests"), each in a different tab, plus -j16. So, replace the last line/window with these 5 (for a total of 8 windows):
python manage_masters.py -f production-masters.json -j16 -M macosx update checkconfig reconfig | |||
python manage_masters.py -f production-masters.json -j16 -M windows update checkconfig reconfig | |||
python manage_masters.py -f production-masters.json -j16 -M linux update checkconfig reconfig | |||
python manage_masters.py -f production-masters.json -j16 -M tegra update checkconfig reconfig | |||
python manage_masters.py -f production-masters.json -j16 -M panda update checkconfig reconfig | |||
To validate the above (i.e. we haven't added any new platforms since the docs were updated), run: | |||
diff -u \ | |||
<(./manage_masters.py -f production-masters.json -l -R tests) \ | |||
<(./manage_masters.py -f production-masters.json -l -M macosx \ | |||
-M windows -M linux -M tegra -M panda) | |||
If any differences are reported, include those platforms and update the docs. | |||
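The multi-window approach above can also be scripted. The following is a minimal sketch, not part of the tools repository: it builds the same eight command lines shown above and starts them concurrently. The helper names (<tt>build_command</tt>, <tt>all_commands</tt>, <tt>run_all</tt>) are hypothetical, and it assumes you run it from buildfarm/maintenance with your keys loaded in ssh-agent, since concurrent password prompts would collide.

```python
import subprocess

# Values taken from the commands in this section.
MASTERS_FILE = "production-masters.json"
ACTIONS = ["update", "checkconfig", "reconfig"]
ROLES = ("scheduler", "build", "try")
TEST_PLATFORMS = ("macosx", "windows", "linux", "tegra", "panda")


def build_command(selector_flag, selector_value, jobs=16):
    """Return the argv list for one manage_masters.py invocation."""
    return [
        "python", "manage_masters.py",
        "-f", MASTERS_FILE,
        "-j%d" % jobs,
        selector_flag, selector_value,
    ] + ACTIONS


def all_commands():
    """One command per role, plus one per test platform (8 in total)."""
    commands = [build_command("-R", role) for role in ROLES]
    commands += [build_command("-M", platform) for platform in TEST_PLATFORMS]
    return commands


def run_all(commands):
    """Start every command at once, wait for all, return their exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]
```

Calling <tt>run_all(all_commands())</tt> would start all eight invocations at once. Their output will interleave on one terminal, which is one reason separate windows may still be preferable when you need to watch a particular master.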
= Hosts and role groups = | = Hosts and role groups = | ||
Fabric works on individual hosts, and supports organizing these hosts into groups. This is mostly a good fit for how we need to work, except we often have multiple buildbot masters on a single host, so there is a bit of hacking in master_fabric.py to pick out the right hosts to operate on depending on what the user has selected. | Fabric works on individual hosts, and supports organizing these hosts into groups. This is mostly a good fit for how we need to work, except we often have multiple buildbot masters on a single host, so there is a bit of hacking in master_fabric.py to pick out the right hosts to operate on depending on what the user has selected. | ||
Hosts are selected with the -H flag, and roles are selected with the -R flag. Hosts correspond to the 'name' field in the masters json file, and are short abbreviations to refer to each master, e.g. | Hosts are selected with the -H flag, and roles are selected with the -R flag. Hosts correspond to the 'name' field in the masters json file, and are short abbreviations to refer to each master, e.g. bm13-build1, bm19-tests1-tegra, bm33-try1, bm36-build_scheduler. We have 4 roles defined: '''build''', '''scheduler''', '''try''', and '''tests'''. Selecting a role will restrict fabric to only operate on masters that operate on that role. | ||
The string 'all' when specified via -H or -R means that all masters in the masters file will be operated on. | The string 'all' when specified via -H or -R means that all masters in the masters file will be operated on. You can also use the -M flag to match on strings in the master name, e.g. -M tests1-windows to pick up all the windows test masters. Note that manage_masters.py will "or" all host specifications from the command line, e.g. "-R tests -M windows" will return all hosts in role "tests", not just the windows test masters. | ||
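To illustrate the "or" behaviour described above, here is a small sketch of union-style selection. It is not the actual logic in manage_masters.py or master_fabric.py: the <tt>select_masters</tt> function, the 'role' key, and the windows master name are assumptions made for the example; only the 'name' field and the other master names come from this page.

```python
def select_masters(masters, roles=(), matches=(), hosts=()):
    """Return names of masters matching ANY of the given specifications."""
    selected = []
    for master in masters:
        name = master["name"]
        if (
            "all" in roles or "all" in hosts          # 'all' selects everything
            or master["role"] in roles                # -R style: role match
            or name in hosts                          # -H style: exact name
            or any(text in name for text in matches)  # -M style: substring
        ):
            selected.append(name)
    return selected


# Illustrative masters list; "bm20-tests1-windows" is a hypothetical name.
masters = [
    {"name": "bm13-build1", "role": "build"},
    {"name": "bm19-tests1-tegra", "role": "tests"},
    {"name": "bm20-tests1-windows", "role": "tests"},
    {"name": "bm33-try1", "role": "try"},
]

# "-R tests -M windows" selects every tests master, not only windows ones.
both = select_masters(masters, roles=("tests",), matches=("windows",))

# "-M windows" on its own narrows the selection to the windows master.
only_windows = select_masters(masters, matches=("windows",))
```

In this sketch <tt>both</tt> contains the tegra and the windows test masters, while <tt>only_windows</tt> contains just the windows one, so to narrow a selection you drop the broader flag rather than add a second one.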
Fabric relies on being able to ssh to the masters without password authentication, so be sure to have your ssh keys set up! | Fabric relies on being able to ssh to the masters without password authentication, so be sure to have your ssh keys set up! This means having the needed keys added to the running instance of your ssh-agent (your "<tt>~/.ssh/config</tt>" file is ''not'' consulted by Paramiko). If you don't have the keys set up, you'll be asked for your password one time per invocation, so use multiple commands per invocation where appropriate. | ||
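Before starting a long run, it can help to confirm that ssh-agent is reachable and holds at least one key. The sketch below is one possible pre-flight check, written for Python 3 and not part of the tools repository; the function name <tt>agent_has_keys</tt> is hypothetical. It relies on <tt>ssh-add -l</tt> exiting with status 0 only when the agent lists at least one identity.

```python
import os
import subprocess


def agent_has_keys(env=None):
    """Return True if an ssh-agent is reachable and has at least one key."""
    env = os.environ if env is None else env
    # Without SSH_AUTH_SOCK there is no agent for ssh-add to talk to.
    if not env.get("SSH_AUTH_SOCK"):
        return False
    try:
        result = subprocess.run(
            ["ssh-add", "-l"],
            env=dict(env),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # ssh-add is not installed or could not be executed.
        return False
    # Exit status 0 means keys were listed; non-zero means no keys or no agent.
    return result.returncode == 0
```

If this returns False, loading your key with <tt>ssh-add</tt> first should avoid a password prompt on each invocation.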
= Updating checkout = | = Updating checkout = | ||
Line 119: | Line 143: | ||
= Checkconfig = | = Checkconfig = | ||
<pre> | <pre> | ||
python manage_masters.py -f production-masters.json -R build | python manage_masters.py -f production-masters.json -R build -R scheduler checkconfig | ||
bm3 OK | bm3 OK | ||
pm02-sm OK | pm02-sm OK | ||
Line 134: | Line 158: | ||
= Reconfigure = | = Reconfigure = | ||
'''''Reminder:''''' ''<tt>reconfig</tt> only does the reconfig; you need to have previously done an <tt>update</tt> and a <tt>checkconfig</tt>.''
<pre> | <pre> | ||
python manage_masters.py -f production-masters.json -R build reconfig | python manage_masters.py -f production-masters.json -R build reconfig | ||
Line 155: | Line 182: | ||
Disconnecting from buildbot-master2.build.mozilla.org... done. | Disconnecting from buildbot-master2.build.mozilla.org... done. | ||
</pre> | </pre> | ||
If the reconfig gets stuck, see [https://wiki.mozilla.org/ReleaseEngineering/How_To/Unstick_a_Stuck_Slave_From_A_Master How To/Unstick a Stuck Slave From A Master]. | |||
As a special case for test masters, you can unstick things by either: | |||
* triggering a "Clean Shutdown" from the web UI for that master, or | |||
* using the manage_masters.py graceful_restart command
After jobs complete, the master will shut down (web page will not be served). Fabric should notice and unstick itself at that point. If fabric doesn't notice, in a separate window, individually do the update and start steps. If fabric still doesn't notice, good luck and document what works. |