ReleaseEngineering/NoReboots
Why no reboots:
Turning off reboots saves machine time in several ways: it recovers the 70-120 seconds each reboot takes, keeps the filesystem cache warm between jobs, and makes it easier to share work (such as clobbering) across jobs.
Results:
A preliminary survey of spot instances running without reboots suggests that a great deal of previously wasted time is being recovered, provided jobs are not failing more often as a result of the changes:
One-hour sample from 01-13-2015 01 CST (raw data)
On average, the test/try machines spend more time rebooting (79s) than doing pre-flight tasks (~45s), while builders show the opposite skew (82s on pre-flight tasks and 67s rebooting).
start_to_bb_sec_avg | reboot_sec_avg | start_to_bb_sec_total | halts_seen | reboot_percentage | iterations_seen | type | reboot_sec_total |
---|---|---|---|---|---|---|---|
47 | 79 | 132104 | 178 | 6 | 2804 | all spot instances | 14070 |
45 | 79 | 121606 | 167 | 6 | 2676 | test and try | 13329 |
82 | 67 | 10498 | 11 | 8 | 128 | builders | 741 |
Three-hour sample from 01-13-2015 02 CST (raw data)
During the longer sample, average builder reboot times increased dramatically (from 67s to 213s). This is likely the result of builder issues that were occurring around the time the data was taken; test/try results remained stable.
start_to_bb_sec_avg | reboot_sec_avg | start_to_bb_sec_total | halts_seen | reboot_percentage | iterations_seen | type | reboot_sec_total |
---|---|---|---|---|---|---|---|
47 | 92 | 410104 | 605 | 7 | 8547 | all spot instances | 55769 |
45 | 80 | 362177 | 553 | 7 | 7884 | test and try | 44648 |
72 | 213 | 47927 | 52 | 7 | 663 | builders | 11121 |
NOTE: The data above was gathered via runner's InfluxDB logging and verified by spot-checking runner logs (/var/log/runner.log). The reports themselves were generated by this script: http://pastebin.mozilla.org/8191890
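The derived columns can be sanity-checked from the raw counts in the tables above. The sketch below is not the report script linked in the note. It assumes that reboot_percentage is halts_seen divided by iterations_seen, truncated to an integer; this is an inference from the numbers, and it does reproduce the column for all six rows. It also gives a rough estimate of the reboot time avoided, assuming every iteration would otherwise have ended in a reboot of average length. The function names are made up for illustration.

```python
# Rows copied from the tables above:
# (type, halts_seen, iterations_seen, reboot_sec_avg)
ONE_HOUR = [
    ("all spot instances", 178, 2804, 79),
    ("test and try", 167, 2676, 79),
    ("builders", 11, 128, 67),
]
THREE_HOUR = [
    ("all spot instances", 605, 8547, 92),
    ("test and try", 553, 7884, 80),
    ("builders", 52, 663, 213),
]


def reboot_percentage(halts_seen, iterations_seen):
    """Share of runner iterations that ended in a halt, truncated to an integer.

    Assumption: this is how the report column was computed.
    """
    return 100 * halts_seen // iterations_seen


def avoided_reboot_seconds(halts_seen, iterations_seen, reboot_sec_avg):
    """Rough estimate of reboot time not spent.

    Assumption: without no-reboot mode, every iteration would have ended in a
    reboot taking reboot_sec_avg seconds.
    """
    return (iterations_seen - halts_seen) * reboot_sec_avg


for label, rows in (("one hour", ONE_HOUR), ("three hour", THREE_HOUR)):
    for kind, halts, iterations, reboot_avg in rows:
        pct = reboot_percentage(halts, iterations)
        hours = avoided_reboot_seconds(halts, iterations, reboot_avg) / 3600.0
        print("%s | %s | %d%% | ~%.1f machine-hours" % (label, kind, pct, hours))
```

For the one-hour sample across all spot instances, the estimate is (2804 - 178) * 79 = 207454 seconds, or roughly 58 machine-hours. The three-hour builder figure should be read with caution, since the builder reboot average in that sample was inflated by unrelated issues.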
No reboots timeline (puppet):
date: Tue Jan 06 14:26:55 2015 -0600 summary: Bug 1118125 - Turn off osx reboots; r=Callek
date: Tue Dec 30 19:40:23 2014 +0000 summary: Bug 1103123 - Turn off rebooting of all linux slaves; r=callek
date: Thu Dec 18 18:45:10 2014 +0000 summary: Bug 1113245 - Remove cleanslate process list on Linux and Mac machines during reboots with halt.py; r=rail
date: Fri Dec 12 21:33:04 2014 +0000 summary: Bug 1103123 - Turn off rebooting of talos machines; r=catlee
How no-reboot mode is enabled (idelizer, post_flight)
Buildbot is now started and managed by runner, which runs tasks in an infinite loop in a specified order; each task blocks until it finishes. Buildbot therefore initiates a graceful shutdown immediately after accepting a job, so that once the job is finished the remaining runner tasks can run and the loop can start again. A single runner loop looks like this:
<tasks before buildbot> -> buildbot.py [graceful shutdown] -> <tasks after buildbot> -> post_flight.py
The graceful shutdown is initiated by idelizer.py. Afterwards, post_flight.py decides whether to shut down the machine or go forward with another loop.
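The loop described above can be sketched as follows. This is a minimal illustration, not runner's actual code: the HALT sentinel, run_loop, and the stand-in tasks are all hypothetical, and the real tasks are separate scripts rather than Python callables.

```python
HALT = "halt"  # hypothetical sentinel a task returns to stop the loop


def run_loop(tasks, max_loops=None):
    """Run (name, callable) tasks in order, blocking on each one.

    Returns (halting_task_name, completed_loops). completed_loops counts the
    loops that finished without a halt; halting_task_name is None if
    max_loops was reached without any task asking to halt.
    """
    completed = 0
    while max_loops is None or completed < max_loops:
        for name, task in tasks:
            # Blocking call: the next task starts only after this one returns.
            if task() == HALT:
                return name, completed
        completed += 1
    return None, completed


def make_demo_tasks(jobs_before_halt):
    """Build stand-in tasks; post_flight halts after the given number of jobs."""
    log = []
    state = {"jobs": 0}

    def pre_flight():
        log.append("pre-flight")

    def buildbot():
        # Stand-in for buildbot.py: take one job, then shut down gracefully
        # so that control returns to the loop.
        state["jobs"] += 1
        log.append("job %d" % state["jobs"])

    def post_flight():
        log.append("post_flight")
        return HALT if state["jobs"] >= jobs_before_halt else None

    tasks = [
        ("pre_flight", pre_flight),
        ("buildbot", buildbot),
        ("post_flight", post_flight),
    ]
    return tasks, log


tasks, log = make_demo_tasks(jobs_before_halt=3)
halted_by, completed = run_loop(tasks)
print(halted_by, completed, log)
```

Because each task blocks, buildbot has to exit after its job for post_flight to run at all, which is why the graceful shutdown is part of the design.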
post_flight checks:
hostname blacklist
Any machine whose hostname matches a regular expression in this list will be rebooted by post_flight. For example, ["^tst-", "^t-"] would reboot all test machines after any job.
build api
BuildAPI is used to fetch data about the most recent job; if the job failed, the slave is rebooted.
todo: jobname blacklist
Works like the hostname blacklist, except that it acts on the name of the most recently run job (which BuildAPI knows).
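Taken together, the checks above amount to a small decision function. The sketch below is an assumption about how they could be combined, not the contents of post_flight.py: the function names, the argument shapes, and the example hostnames and job names are hypothetical, and the BuildAPI lookup is replaced by plain arguments.

```python
import re


def matches_any(patterns, value):
    """Return True if any regular expression in patterns matches value."""
    return any(re.search(pattern, value) for pattern in patterns)


def should_reboot(hostname, last_job_name, last_job_failed,
                  hostname_blacklist, jobname_blacklist=()):
    """Decide whether to reboot after a job, applying the checks in order.

    last_job_name and last_job_failed stand in for data that would be
    fetched from BuildAPI. The jobname blacklist is the todo item above and
    is empty by default.
    """
    if matches_any(hostname_blacklist, hostname):
        return True
    if last_job_failed:
        return True
    if matches_any(jobname_blacklist, last_job_name):
        return True
    return False


hostname_blacklist = ["^tst-", "^t-"]

# Hypothetical hostnames and job names, for illustration only.
print(should_reboot("tst-example-001", "example test", False, hostname_blacklist))
print(should_reboot("bld-example-001", "example build", False, hostname_blacklist))
print(should_reboot("bld-example-001", "example build", True, hostname_blacklist))
```

With this blacklist, the first and third calls ask for a reboot (hostname match and failed job respectively), while the second lets the loop continue.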
What issues have been noted since no-reboot work started?
Since December 2014, these bugs have been noted as possibly connected to Runner/NoReboots:
https://bugzilla.mozilla.org/show_bug.cgi?id=1114541
https://bugzilla.mozilla.org/show_bug.cgi?id=989048
https://bugzilla.mozilla.org/show_bug.cgi?id=1109932
https://bugzilla.mozilla.org/show_bug.cgi?id=1114688
https://bugzilla.mozilla.org/show_bug.cgi?id=1111137
How are we tracking the status of machines, and measuring effectiveness?
Runner constantly uploads task stats to InfluxDB. For dashboards, see: https://stats.taskcluster.net/grafana/#/dashboard/db/runner