Confirmed users
1,989
edits
Line 74: | Line 74: | ||
Check end to end times here https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/#/dashboard/db/kpis | Check end to end times here https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/#/dashboard/db/kpis | ||
=== Did a master hang during reconfig and as result is not running jobs? === | |||
For example, the alert that pending builds are increasing as well as a stale reconfig lockfile may indicate that a master is in a hung state and not running jobs. In this case, it was an AWS master. | |||
<pre> | |||
Mon 11:58:31 PDT [4448] cruncher.srv.releng.scl3.mozilla.com:Pending builds is WARNING: WARNING Pending Builds: 4697 (http://m.mozilla.org/Pending+builds) | |||
relengbot | |||
[sns alert] Mon 12:09:06 PDT buildbot-master04.bb.releng.usw2.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes. | |||
</pre> | |||
Looking at the log it hadn't been updated for several hours | |||
<pre> | |||
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ tail !$ | |||
tail master/twistd.log | |||
2015-09-14 09:00:49-0700 [-] prioritizeBuilders: 0.17s builders with slaves: 0 | |||
2015-09-14 09:00:49-0700 [-] prioritizeBuilders: 0.17s prioritized 0 builder(s): [] | |||
2015-09-14 09:00:49-0700 [-] prioritizeBuilders: 0.17s using up to first 0 builders: | |||
2015-09-14 09:00:49-0700 [-] prioritizeBuilders: 0.17s found 0 important builder(s): [] | |||
2015-09-14 09:00:49-0700 [-] prioritizeBuilders: 0.17s finished prioritization | |||
2015-09-14 09:00:50-0700 [-] Creating BuildMaster -- buildbot.version: 0.8.2-hg-41f0fbee10f4-production-0.8 | |||
2015-09-14 09:00:50-0700 [-] loading configuration from /builds/buildbot/tests1-linux32/master/master.cfg | |||
2015-09-14 09:00:50-0700 [-] JacuzziAllocator 1268988944: created | |||
2015-09-14 09:00:50-0700 [-] nextAWSSlave: start | |||
2015-09-14 09:00:50-0700 [-] nextAWSSlave: start | |||
</pre> | |||
Jordan used strace to look at the hung process | |||
</pre> | |||
# checking the reconfig proc | |||
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ ls -l /proc/4847/fd | grep 3 | |||
l-wx------ 1 cltbld cltbld 64 Sep 14 12:25 1 -> /builds/buildbot/tests1-linux32/reconfig.log | |||
l-wx------ 1 cltbld cltbld 64 Sep 14 12:25 2 -> pipe:[8988493] | |||
lr-x------ 1 cltbld cltbld 64 Sep 14 12:25 3 -> /builds/buildbot/tests1-linux32/master/twistd.log | |||
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ ps auwx | grep python | |||
cltbld 1045 0.0 0.1 196880 5256 ? S Aug29 7:05 /builds/buildbot/queue/bin/python /builds/buildbot/queue/tools/buildbot-helpers/command_runner.py -q /dev/shm/queue/commands -r 10 -t 60 -m 60 -j4 | |||
cltbld 1057 0.0 0.7 226360 26316 ? S Aug29 10:39 /builds/buildbot/queue/bin/python /builds/buildbot/queue/tools/buildbot-helpers/pulse_publisher.py --passwords /builds/buildbot/queue/passwords.py -q /dev/shm/queue/pulse | |||
cltbld 4847 0.0 0.2 176936 9292 ? S 09:00 0:02 python ./tools/buildfarm/maintenance/buildbot-wrangler.py reconfig /builds/buildbot/tests1-linux32/master | |||
cltbld 15648 0.0 0.0 103252 840 pts/0 S+ 12:26 0:00 grep python | |||
cltbld 17135 26.4 58.2 3015260 2173096 ? Sl Aug30 5864:05 /builds/buildbot/tests1-linux32/bin/python /builds/buildbot/tests1-linux32/bin/buildbot start /builds/buildbot/tests1-linux32/master | |||
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ strace -p 4847 | |||
Process 4847 attached - interrupt to quit | |||
select(0, NULL, NULL, NULL, {0, 464499}) = 0 (Timeout) | |||
read(3, "", 8192) = 0 | |||
stat("/builds/buildbot/tests1-linux32/master/twistd.log", {st_mode=S_IFREG|0664, st_size=30765027, ...}) = 0 | |||
select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout) | |||
read(3, "", 8192) = 0 | |||
stat("/builds/buildbot/tests1-linux32/master/twistd.log", {st_mode=S_IFREG|0664, st_size=30765027, ...}) = 0 | |||
select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout) | |||
read(3, "", 8192) = 0 | |||
stat("/builds/buildbot/tests1-linux32/master/twistd.log", {st_mode=S_IFREG|0664, st_size=30765027, ...}) = 0 | |||
select(0, NULL, NULL, NULL, {0, 500000}^C <unfinished ...> | |||
Process 4847 detached | |||
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ ls -l /proc/4847/fd | grep 3 | |||
l-wx------ 1 cltbld cltbld 64 Sep 14 12:25 1 -> /builds/buildbot/tests1-linux32/reconfig.log | |||
l-wx------ 1 cltbld cltbld 64 Sep 14 12:25 2 -> pipe:[8988493] | |||
lr-x------ 1 cltbld cltbld 64 Sep 14 12:25 3 -> /builds/buildbot/tests1-linux32/master/twistd.log | |||
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ | |||
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ | |||
# ^ looks like it's still running and waiting on new log lines in twistd.log | |||
# so let's check the buildbot proc itself and see why it's not logging | |||
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ | |||
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ strace -p 17135 | |||
Process 17135 attached - interrupt to quit | |||
recvfrom(131, ^C <unfinished ...> | |||
Process 17135 detached | |||
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ ls -l /proc/17135/fd | grep 131 | |||
lrwx------ 1 cltbld cltbld 64 Sep 14 12:28 131 -> socket:[8988995] | |||
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ lsof |grep 8988995 | |||
buildbot 17135 cltbld 131u IPv4 8988995 0t0 TCP buildbot-master04.bb.releng.usw2.mozilla.com:48176->ec2-54-215-155-53.us-west-1.compute.amazonaws.com:http (ESTABLISHED) | |||
# ^ so we are hung on some socket TCP connection and not doing anything else. | |||
</pre> | |||
Jordan had to do hard restart the master because the buildbot process was hung. SIGTERM didn't work so had to kill -9 the buildbot process. It would also be a good idea to disable this master in slavealloc as soon as the problem appeared so jobs could be allocated to other masters. | |||
== What code/systems have changed? == | == What code/systems have changed? == |