ReleaseEngineering/NoReboots: Difference between revisions

From MozillaWiki
(Fill in links to source code.)
Revision as of 09:18, 13 January 2015

Why no reboots:

Turning off reboots saves machine time in several ways: it avoids 70-120 seconds of reboot time, keeps the filesystem cache warm between jobs, and makes it easier to share work (such as clobbering) across jobs.

Results:

A preliminary survey of the time saved by non-rebooting spot instances suggests that, if jobs are not failing more often as a result of the changes, a great deal of wasted time is being recovered:

One hour sample from 01-13-2015 01 CST

 type                start_to_bb_sec_avg  reboot_sec_avg  start_to_bb_sec_total  halts_seen  reboot_percentage  iterations_seen  reboot_sec_total
 all spot instances  47                   79              132104                 178         6                  2804             14070
 test and try        45                   79              121606                 167         6                  2676             13329
 builders            82                   67              10498                  11          8                  128              741

Three hour sample from 01-13-2015 02 CST

 type                start_to_bb_sec_avg  reboot_sec_avg  start_to_bb_sec_total  halts_seen  reboot_percentage  iterations_seen  reboot_sec_total
 all spot instances  47                   92              410104                 605         7                  8547             55769
 test and try        45                   80              362177                 553         7                  7884             44648
 builders            72                   213             47927                  52          7                  663              11121
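
The derived columns in these samples can be checked against the raw counts. The sketch below assumes (the page does not say so) that reboot_percentage is halts_seen divided by iterations_seen and that reboot_sec_avg is reboot_sec_total divided by halts_seen, both truncated to whole numbers. The `avoided_sec_estimate` field is a hypothetical rough estimate, assuming every non-halting iteration would otherwise have ended in an average-length reboot.

```python
# Raw counts copied from the one hour sample above.
SAMPLES = {
    "all spot instances": {"halts_seen": 178, "iterations_seen": 2804, "reboot_sec_total": 14070},
    "test and try": {"halts_seen": 167, "iterations_seen": 2676, "reboot_sec_total": 13329},
    "builders": {"halts_seen": 11, "iterations_seen": 128, "reboot_sec_total": 741},
}


def summarize(row):
    """Recompute the derived columns from the raw counts (integer truncation assumed)."""
    halts = row["halts_seen"]
    iterations = row["iterations_seen"]
    reboot_sec_avg = row["reboot_sec_total"] // halts
    reboot_percentage = 100 * halts // iterations
    # Hypothetical estimate: iterations that did not halt, times the average reboot time.
    avoided_sec_estimate = (iterations - halts) * reboot_sec_avg
    return {
        "reboot_sec_avg": reboot_sec_avg,
        "reboot_percentage": reboot_percentage,
        "avoided_sec_estimate": avoided_sec_estimate,
    }


if __name__ == "__main__":
    for machine_type, row in SAMPLES.items():
        print(machine_type, summarize(row))
```

Under these assumptions the recomputed averages and percentages match the table rows for all three machine types.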

No reboots timeline (puppet):

 date:        Tue Jan 06 14:26:55 2015 -0600
 summary:     Bug 1118125 - Turn off osx reboots; r=Callek
 date:        Tue Dec 30 19:40:23 2014 +0000
 summary:     Bug 1103123 - Turn off rebooting of all linux slaves; r=callek
 date:        Thu Dec 18 18:45:10 2014 +0000
 summary:     Bug 1113245 - Remove cleanslate process list on Linux and Mac machines during reboots with halt.py; r=rail
 date:        Fri Dec 12 21:33:04 2014 +0000
 summary:     Bug 1103123 - Turn off rebooting of talos machines; r=catlee

How no-reboot mode is enabled (idleizer, post_flight)

Buildbot is now started and managed by runner, which runs tasks in an infinite loop in a specified order; each task blocks until it completes. Because of this, buildbot initiates a graceful shutdown immediately after accepting a job, so that once the job finishes, buildbot exits and runner can move on through the remaining tasks and loop around again. A single runner loop looks like this:

   <tasks before buildbot> -> buildbot.py [graceful shutdown] -> <tasks after buildbot> -> post_flight.py

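The graceful shutdown is initiated by idleizer.py (http://hg.mozilla.org/build/buildbot/file/09de3a58d602/slave/buildslave/idleizer.py#l176); post_flight.py (http://hg.mozilla.org/build/puppet/file/1185781bb6c1/modules/runner/files/post_flight.py) then decides whether to shut down the machine or go forward with another loop.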
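
The loop described above can be sketched roughly as follows. This is an illustration, not runner's actual code: `run_loop` and `runner` are hypothetical names, tasks are modelled as plain Python callables rather than scripts, and the last task stands in for post_flight.py by returning a true value when the machine should halt.

```python
def run_loop(tasks):
    """Run each (name, callable) task once, in order.

    Each call blocks until the task returns. The return value of the
    last task is returned, standing in for post_flight's decision.
    """
    result = None
    for _name, task in tasks:
        result = task()
    return result


def runner(tasks, max_loops=None):
    """Repeat the task loop until the last task asks to halt.

    max_loops is a hypothetical safety limit for experimentation; with
    the default of None the loop is unbounded, as on the page above.
    Returns the number of loops that were run.
    """
    loops = 0
    while max_loops is None or loops < max_loops:
        loops += 1
        if run_loop(tasks):
            break
    return loops
```

Because every task is blocking, nothing after the buildbot task runs until buildbot has shut down, which is why the graceful shutdown is requested as soon as a job is accepted.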

post_flight checks:

hostname blacklist

Any machine whose hostname matches a regular expression in this list will be rebooted by post_flight. For example, ["^tst-", "^t-"] would reboot all test machines after every job.
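
A minimal sketch of such a check, assuming the list is a plain list of regular expression strings. The function name and the hostnames in the usage lines are hypothetical, and whether post_flight.py uses `re.search` or `re.match` is not stated here (with `^`-anchored patterns the two behave the same).

```python
import re

# Example patterns taken from the text above.
HOSTNAME_BLACKLIST = ["^tst-", "^t-"]


def hostname_blacklisted(hostname, blacklist=HOSTNAME_BLACKLIST):
    """Return True if the hostname matches any pattern in the blacklist."""
    return any(re.search(pattern, hostname) for pattern in blacklist)


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    print(hostname_blacklisted("tst-linux64-spot-001"))
    print(hostname_blacklisted("bld-linux64-spot-001"))
```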

build api

BuildAPI is used to fetch data about the most recent job; if that job failed, the slave is rebooted.
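
The decision itself might look like the sketch below. The `job` dictionary is a hypothetical stand-in for whatever BuildAPI returns (no real BuildAPI call or response format is shown), and the numeric codes follow Buildbot's result constants. Treating warnings as a pass, and rebooting when no job data is available, are assumptions made for the example.

```python
# Buildbot result codes, in the order Buildbot defines them.
SUCCESS, WARNINGS, FAILURE, SKIPPED, EXCEPTION, RETRY = range(6)


def should_reboot_after(job):
    """Return True if the most recent job did not finish cleanly.

    `job` is a hypothetical dict with a "result" key; None means the
    job data could not be fetched.
    """
    if job is None:
        # Assumption: reboot when the outcome is unknown.
        return True
    # Assumption: success and warnings both count as a clean finish.
    return job.get("result") not in (SUCCESS, WARNINGS)
```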

todo: jobname blacklist

Works like the hostname blacklist, except that it acts on the name of the most recently run job (which BuildAPI knows).

What issues have been noted since no-reboot work started?

Since December 2014, the following bugs have been noted as possibly connected to Runner/NoReboots:

https://bugzilla.mozilla.org/show_bug.cgi?id=1114541

https://bugzilla.mozilla.org/show_bug.cgi?id=989048

https://bugzilla.mozilla.org/show_bug.cgi?id=1109932

https://bugzilla.mozilla.org/show_bug.cgi?id=1114688

https://bugzilla.mozilla.org/show_bug.cgi?id=1111137

How are we tracking the status of machines, and measuring effectiveness?

Runner continuously uploads task stats to InfluxDB. For dashboards, see: https://stats.taskcluster.net/grafana/#/dashboard/db/runner