CIDuty/How To/High Pending Counts: Difference between revisions

Jump to navigation Jump to search
Line 74: Line 74:


Check end to end times here https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/#/dashboard/db/kpis
Check end to end times here https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/#/dashboard/db/kpis
=== Did a master hang during reconfig and as result is not running jobs? ===
For example, the alert that pending builds are increasing as well as a stale reconfig lockfile may indicate that a master is in a hung state and not running jobs.  In this case, it was an AWS master. 
<pre>
Mon 11:58:31 PDT [4448] cruncher.srv.releng.scl3.mozilla.com:Pending builds is WARNING: WARNING Pending Builds: 4697 (http://m.mozilla.org/Pending+builds)
relengbot
[sns alert] Mon 12:09:06 PDT buildbot-master04.bb.releng.usw2.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
</pre>
Looking at the log it hadn't been updated for several hours
<pre>
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ tail !$
tail master/twistd.log
2015-09-14 09:00:49-0700 [-] prioritizeBuilders: 0.17s builders with slaves: 0
2015-09-14 09:00:49-0700 [-] prioritizeBuilders: 0.17s prioritized 0 builder(s): []
2015-09-14 09:00:49-0700 [-] prioritizeBuilders: 0.17s using up to first 0 builders:
2015-09-14 09:00:49-0700 [-] prioritizeBuilders: 0.17s found 0 important builder(s): []
2015-09-14 09:00:49-0700 [-] prioritizeBuilders: 0.17s finished prioritization
2015-09-14 09:00:50-0700 [-] Creating BuildMaster -- buildbot.version: 0.8.2-hg-41f0fbee10f4-production-0.8
2015-09-14 09:00:50-0700 [-] loading configuration from /builds/buildbot/tests1-linux32/master/master.cfg
2015-09-14 09:00:50-0700 [-] JacuzziAllocator 1268988944: created
2015-09-14 09:00:50-0700 [-] nextAWSSlave: start
2015-09-14 09:00:50-0700 [-] nextAWSSlave: start
</pre>
Jordan used strace to look at the hung process
</pre>
# checking the reconfig proc
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ ls -l /proc/4847/fd | grep 3
l-wx------ 1 cltbld cltbld 64 Sep 14 12:25 1 -> /builds/buildbot/tests1-linux32/reconfig.log
l-wx------ 1 cltbld cltbld 64 Sep 14 12:25 2 -> pipe:[8988493]
lr-x------ 1 cltbld cltbld 64 Sep 14 12:25 3 -> /builds/buildbot/tests1-linux32/master/twistd.log
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ ps auwx | grep python
cltbld    1045  0.0  0.1 196880  5256 ?        S    Aug29  7:05 /builds/buildbot/queue/bin/python /builds/buildbot/queue/tools/buildbot-helpers/command_runner.py -q /dev/shm/queue/commands -r 10 -t 60 -m 60 -j4
cltbld    1057  0.0  0.7 226360 26316 ?        S    Aug29  10:39 /builds/buildbot/queue/bin/python /builds/buildbot/queue/tools/buildbot-helpers/pulse_publisher.py --passwords /builds/buildbot/queue/passwords.py -q /dev/shm/queue/pulse
cltbld    4847  0.0  0.2 176936  9292 ?        S    09:00  0:02 python ./tools/buildfarm/maintenance/buildbot-wrangler.py reconfig /builds/buildbot/tests1-linux32/master
cltbld  15648  0.0  0.0 103252  840 pts/0    S+  12:26  0:00 grep python
cltbld  17135 26.4 58.2 3015260 2173096 ?    Sl  Aug30 5864:05 /builds/buildbot/tests1-linux32/bin/python /builds/buildbot/tests1-linux32/bin/buildbot start /builds/buildbot/tests1-linux32/master
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ strace -p 4847
Process 4847 attached - interrupt to quit
select(0, NULL, NULL, NULL, {0, 464499}) = 0 (Timeout)
read(3, "", 8192)                      = 0
stat("/builds/buildbot/tests1-linux32/master/twistd.log", {st_mode=S_IFREG|0664, st_size=30765027, ...}) = 0
select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout)
read(3, "", 8192)                      = 0
stat("/builds/buildbot/tests1-linux32/master/twistd.log", {st_mode=S_IFREG|0664, st_size=30765027, ...}) = 0
select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout)
read(3, "", 8192)                      = 0
stat("/builds/buildbot/tests1-linux32/master/twistd.log", {st_mode=S_IFREG|0664, st_size=30765027, ...}) = 0
select(0, NULL, NULL, NULL, {0, 500000}^C <unfinished ...>
Process 4847 detached
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ ls -l /proc/4847/fd | grep 3
l-wx------ 1 cltbld cltbld 64 Sep 14 12:25 1 -> /builds/buildbot/tests1-linux32/reconfig.log
l-wx------ 1 cltbld cltbld 64 Sep 14 12:25 2 -> pipe:[8988493]
lr-x------ 1 cltbld cltbld 64 Sep 14 12:25 3 -> /builds/buildbot/tests1-linux32/master/twistd.log
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$
# ^ looks like it's still running and waiting on new log lines in twistd.log
# so let's check the buildbot proc itself and see why it's not logging
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ strace -p 17135
Process 17135 attached - interrupt to quit
recvfrom(131, ^C <unfinished ...>
Process 17135 detached
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ ls -l /proc/17135/fd | grep 131
lrwx------ 1 cltbld cltbld 64 Sep 14 12:28 131 -> socket:[8988995]
[cltbld@buildbot-master04.bb.releng.usw2.mozilla.com tests1-linux32]$ lsof |grep 8988995
buildbot  17135    cltbld  131u    IPv4 8988995      0t0    TCP buildbot-master04.bb.releng.usw2.mozilla.com:48176->ec2-54-215-155-53.us-west-1.compute.amazonaws.com:http (ESTABLISHED)
# ^ so we are hung on some socket TCP connection and not doing anything else.
</pre>
Jordan had to do hard restart the master because the buildbot process was hung.  SIGTERM didn't work so had to kill -9 the buildbot process.  It would also be a good idea to disable this master in slavealloc as soon as the problem appeared so jobs could be allocated to other masters.


== What code/systems have changed? ==
== What code/systems have changed? ==
Confirmed users
1,989

edits

Navigation menu