QA/Execution/Web Testing/Flame Buildout

Goals

Short Term

  • build a small-capacity (>=8 Flames) automation system running b2gperf and a subset of UI tests per gaia commit, flashing gecko periodically from b2g-inbound
  • ETA: July 21st

Long Term

  • build a higher-capacity (>30 Firefox OS Flames) pool of phones hooked up to Mozpool, fully capable (SIM cards, Wi-Fi, SD cards, etc.), reporting to Treeherder
  • ETA: Q3

Jenkins URL: http://jenkins1.qa.scl3.mozilla.com/

Etherpad

https://etherpad.mozilla.org/b2g-per-checkin-smoketest

Architecture

[Diagram: original planned architecture, with a Mozpool instance per Jenkins node]

Pictured above is the original planned architecture for automated testing on Flames. Although each Jenkins node will execute only one job at a time, for redundancy each node has multiple Flames (probably two or three) attached via USB to the test machine and assigned to its Mozpool instance. It is Mozpool's job both to prep a Flame and to verify that it is fully functional before passing it back (via its serial number) to the node's job. If a Flame is nonfunctional, Mozpool will try the next one, as long as there are Flames available.
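
As a rough illustration of that flow, the sketch below shows what a node's job might do to obtain a verified Flame from its Mozpool instance, falling back to the next attached device if one cannot be prepped. The Mozpool URL, endpoint, and response fields here are assumptions for illustration only; the real Mozpool API and its asynchronous request states may differ.

 # Per-job device acquisition sketch: ask this node's Mozpool instance to
 # prep and verify each attached Flame in turn, and use the first one that
 # becomes ready. Endpoints and response fields are assumed, not verified.
 import time
 import requests

 MOZPOOL = "http://localhost:8010"           # this node's Mozpool instance (assumed port)
 ATTACHED_FLAMES = ["flame-01", "flame-02"]  # serials assigned to this node (example)

 def request_working_flame(image="b2g", duration=3600, timeout=600):
     for serial in ATTACHED_FLAMES:
         resp = requests.post(
             "%s/api/device/%s/request/" % (MOZPOOL, serial),
             json={"assignee": "jenkins", "image": image, "duration": duration})
         if not resp.ok:
             continue  # Mozpool rejected the request outright; try the next Flame
         status_url = resp.json()["request"]["url"] + "status/"  # assumed field
         deadline = time.time() + timeout
         while time.time() < deadline:
             state = requests.get(status_url).json().get("state")
             if state == "ready":
                 return serial  # prepped and verified; hand it to the test run
             if state in ("failed", "closed"):
                 break          # this Flame could not be prepped; try the next one
             time.sleep(10)
     return None

 if __name__ == "__main__":
     device = request_working_flame()
     if device is None:
         raise SystemExit("no functional Flame available on this node")
     print("running b2gperf/UI tests against", device)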

Note that the Flames are seated in controllable power harnesses, which will act as the relays do in the original Panda-based Pulse system.
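
If a device needs a hard reset before or after a job, the harness can be driven from the node, much as the relay boards are for the Pandas. The control interface below (a simple HTTP relay per port) is purely hypothetical; the actual harnesses may speak a different protocol.

 # Power-cycle a Flame through its harness port (hypothetical HTTP relay API).
 import time
 import requests

 HARNESS = "http://power-harness-01.example.com"  # hypothetical harness controller

 def power_cycle(port, off_seconds=5):
     requests.post("%s/port/%d/off" % (HARNESS, port))  # cut power to the phone
     time.sleep(off_seconds)
     requests.post("%s/port/%d/on" % (HARNESS, port))   # restore power

 power_cycle(3)  # e.g. the Flame seated in harness port 3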

This per-node arrangement isn't terribly efficient, since many of the phones will sit idle. In addition, if all the devices assigned to a node fail, that node will be forced to fail jobs, even if another node on the same machine has several working devices.

Thus it might make sense to have one Mozpool instance per machine to handle all connected devices:

[Diagram: one Mozpool instance per machine managing all attached Flames]

The dotted lines in the diagram above indicate that any node on a machine may access any device connected to that machine, as selected by Mozpool. This eliminates the possibility of a node having no working devices while another node on the same machine has several.
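
With a per-machine Mozpool instance, the acquisition sketch above simplifies: instead of iterating over its own short list of serials, a node can ask the machine-wide pool for any working device. The "any" device name and the assigned-device field below are assumptions for illustration.

 # Ask the machine-wide Mozpool instance for any functional Flame.
 import requests

 MACHINE_MOZPOOL = "http://localhost:8010"  # one instance per test machine (assumed)

 def request_any_flame(image="b2g", duration=3600):
     resp = requests.post(
         "%s/api/device/any/request/" % MACHINE_MOZPOOL,  # "any" is assumed here
         json={"assignee": "jenkins", "image": image, "duration": duration})
     if not resp.ok:
         return None  # no functional device on this machine right now
     # Mozpool reports which device it actually assigned (field name assumed).
     return resp.json()["request"].get("assigned_device")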

There is still the problem that a node may not be able to get a working device, if all devices attached to the machine are either currently assigned to nodes or nonfunctional. A solution to this would be to dynamically disable nodes when a device is marked as nonfunctional, so that the number of nodes always matches the number of functional devices:

[Diagram: node controller enabling and disabling Jenkins nodes to match the number of functional devices]

In this case, there needs to be some process responsible for notifying the master that a node is down, and likewise that a node is back up if its device is later fixed or replaced. This is the "node controller" in the diagram above. As before, any node can use any device, so when a device is determined to be nonfunctional we can disable a currently unused node, and the node that picked up the job can continue looking for a functional device.
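
A minimal sketch of such a node controller is below, using the python-jenkins bindings to enable or disable this machine's nodes so that the number of enabled nodes tracks the number of functional devices. Node names, credentials, and the device-state lookup are placeholders.

 # Keep the number of enabled Jenkins nodes on this machine equal to the
 # number of functional devices (node names and credentials are examples).
 import jenkins  # python-jenkins

 MASTER = jenkins.Jenkins("http://jenkins1.qa.scl3.mozilla.com/",
                          username="node-controller", password="secret")
 MACHINE_NODES = ["flame-node-01", "flame-node-02", "flame-node-03"]

 def count_functional_devices():
     # Placeholder: in practice, ask Mozpool how many devices on this
     # machine are not in a failed state.
     return 2

 def reconcile():
     working = count_functional_devices()
     for i, node in enumerate(MACHINE_NODES):
         if i < working:
             MASTER.enable_node(node)   # backed by a working device
         else:
             MASTER.disable_node(node, msg="no functional Flame available")

 reconcile()  # run periodically, or whenever Mozpool marks a device failed/fixed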

There is still the edge case in which the chosen node determines that the last available device is nonfunctional. In this case it would have to fail the job, unless we can somehow indicate to the Jenkins master that it should reschedule the job: it should be the master that waits for a free node/device, not the node waiting for a free device itself. In other words, a node should only wait for a device if there are available devices to try; it should not wait for a device to become free when none are currently available. That is the Jenkins master's job. We are stretching the standard Jenkins model a bit in this setup, because servicing a job requires both a free node *and* a working, free device, and when a node accepts a job we don't yet know whether a working device can actually be found, since even an available device might fail during initialization.
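
One possible shape for that hand-back, sketched under the assumption that the job can simply be re-queued via the standard Jenkins build endpoint: the node re-triggers the job on the master and exits with a distinct status so the result can be filtered out of reporting. The job name, credentials, and exit-code convention are illustrative only, and this does not by itself guarantee that a different node will pick the job up.

 # If no working device could be obtained, hand the job back to the master
 # instead of failing it outright.
 import sys
 import requests

 MASTER = "http://jenkins1.qa.scl3.mozilla.com"
 JOB = "flame-b2gperf"       # example job name
 NO_DEVICE_EXIT_CODE = 42    # chosen convention for "please reschedule"

 def reschedule_and_bail():
     requests.post("%s/job/%s/build" % (MASTER, JOB),
                   auth=("node-controller", "secret"))
     sys.exit(NO_DEVICE_EXIT_CODE)

 device = None  # imagine request_working_flame() above returned None
 if device is None:
     reschedule_and_bail()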

The general way to work around this problem is to have the Jenkins master use Mozpool *during* the scheduling phase, that is, to have the master contact Mozpool when a job comes in and have Mozpool return a node/device combination. Something like this:

[Diagram: Jenkins master consulting Mozpool for a node/device pair during scheduling]

However, at this point we are starting to do some of Jenkins' work (the job scheduling) ourselves. Again, the reasoning behind putting some of the scheduling work onto Mozpool is that we need both a node and a working device. In a standard Jenkins setup, a node that is down will never be selected for a job, but in our setup a node can be up yet unable to get a working device. It's not clear whether shifting some of the scheduling responsibility away from the Jenkins master is a net gain, particularly if we can instead have a node indicate to the master that it cannot perform the given job and that the job should be rescheduled.
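
For comparison, a sketch of what such a scheduling shim might look like: on each incoming commit it asks Mozpool for a node/device pair and then triggers a parameterized Jenkins build pinned to that node. The Mozpool "assign" endpoint is hypothetical, and the NODE/DEVICE parameters assume a parameterized job (for example via the NodeLabel Parameter plugin).

 # Scheduling shim: let Mozpool pick the node/device pair, then dispatch the
 # Jenkins job to exactly that node with the device serial as a parameter.
 import requests

 MOZPOOL = "http://mozpool.example.com"          # hypothetical central instance
 MASTER = "http://jenkins1.qa.scl3.mozilla.com"
 JOB = "flame-b2gperf"                           # example parameterized job

 def schedule(gaia_commit):
     assignment = requests.post("%s/api/assign/" % MOZPOOL,   # hypothetical endpoint
                                json={"image": "b2g"}).json()
     requests.post("%s/job/%s/buildWithParameters" % (MASTER, JOB),
                   params={"NODE": assignment["node"],
                           "DEVICE": assignment["device"],
                           "GAIA_COMMIT": gaia_commit},
                   auth=("scheduler", "secret"))

 schedule("abc123")  # e.g. invoked from a gaia commit hook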