|
|
Architecture
[[File:Flame-setup.png]]
|
Pictured above is an architecture diagram for automated testing on Flames, showing a sample system with two nodes and two devices per node. The Flames are attached to the test machines via USB and assigned to a local [[ReleaseEngineering/Mozpool|Mozpool]] instance.

Because we need not just a free Jenkins executor but a working and prepped Flame, the Jenkins scheduler contacts a central Mozpool instance, which routes requests to a local Mozpool instance that has an available device (all Mozpool instances share a single database). The local Mozpool instance flashes the Flame and verifies that it is fully functional before passing its serial number and executor to the scheduler, which then instructs the node to begin the job. If a Flame is nonfunctional, the Jenkins master will choose another or block until one becomes available.
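
As a rough sketch of what this handshake could look like from a job's point of view (the Mozpool URL, endpoint paths, field names, and states below are illustrative assumptions, not Mozpool's documented API):

<pre>
# Illustrative sketch only: the Mozpool URL, endpoint paths, field names, and
# states below are placeholders, not Mozpool's documented API.
import time
import requests

MOZPOOL = "http://mozpool.example.org"   # hypothetical central Mozpool instance

def get_prepped_flame(image="b2g", timeout=1800):
    """Ask Mozpool for any available Flame, flashed and verified, and return its serial."""
    req = requests.post(MOZPOOL + "/api/request/new/",
                        json={"requested_device": "any",
                              "image": image,
                              "duration": timeout}).json()
    deadline = time.time() + timeout
    while time.time() < deadline:
        # Poll until Mozpool reports the device flashed and verified, or gives up
        # because no working device could be prepared.
        status = requests.get(req["status_url"]).json()
        if status["state"] == "ready":
            return status["assigned_device"]      # the Flame's serial number
        if status["state"] == "failed":
            raise RuntimeError("Mozpool could not supply a working Flame")
        time.sleep(15)
    raise RuntimeError("timed out waiting for a Flame")
</pre>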
|
Note that the Flames are seated in controllable power harnesses, which will act as the relays do in the original Panda-based Pulse system.
|
In the original planned architecture, each Jenkins node would execute only one job at a time, but for redundancy each would have multiple Flames, probably two or three, attached via USB to the test machine and assigned to its own Mozpool instance. Mozpool's job would be both to prep a Flame and to verify that it is fully functional before passing it back (via its serial number) to the node's job; if a Flame were nonfunctional, Mozpool would try the next one, as long as Flames were available.

That arrangement isn't terribly efficient, since many of the phones will sit idle. In addition, if all the devices assigned to a node fail, that node will be forced to fail jobs, even if another node on the same machine has several working devices.
|
Thus it might make sense to have one Mozpool instance per machine to handle all connected devices:
|
[[File:Flame-setup-single-mozpool.png]]
|
The dotted lines in the diagram above indicate that any node on a machine may access any device connected to that machine, as selected by Mozpool. This eliminates the possibility of a node having no working devices while another node on the same machine has several.
|
There is still the problem that a node may not be able to get a working device if all devices attached to the machine are either currently assigned to nodes or nonfunctional. A solution would be to dynamically disable nodes when a device is marked as nonfunctional, so that the number of nodes always matches the number of functional devices:
|
[[File:Flame-setup-dynamic-nodes.png]]
|
In this case, there needs to be some process responsible for notifying the master that a node is down, and similarly that a node is back up if the device is later fixed or replaced. This is referred to as the "node controller" in the diagram above. As before, any node can use any device, so we can mark currently unused nodes as disabled when a device is determined to be nonfunctional; the chosen node can then continue to find a functional device.
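
A minimal sketch of such a node controller, assuming Jenkins' standard toggleOffline endpoint and a hypothetical device health check (the URL, credentials, and node names below are placeholders, and CSRF-crumb handling is omitted):

<pre>
# Sketch of a "node controller": keep the number of online Jenkins nodes equal
# to the number of working devices on this machine. The Jenkins URL, credentials,
# and check_device() are placeholders; CSRF-crumb handling is omitted.
import requests

JENKINS = "http://jenkins.example.org"
AUTH = ("node-controller", "api-token")

def check_device(serial):
    """Placeholder health check; in practice this would ask Mozpool (or adb) about the device."""
    return True

def set_node_offline(node, offline, reason=""):
    """Take a Jenkins node offline or bring it back online."""
    info = requests.get("%s/computer/%s/api/json" % (JENKINS, node), auth=AUTH).json()
    if info["offline"] != offline:
        # toggleOffline flips the current state, so only POST when a change is needed.
        requests.post("%s/computer/%s/toggleOffline" % (JENKINS, node),
                      params={"offlineMessage": reason}, auth=AUTH)

def reconcile(devices, nodes):
    """Keep the number of online nodes equal to the number of working devices."""
    # A real controller would prefer to take idle nodes offline first, as described
    # above; this sketch simply disables nodes from the end of the list.
    working = sum(1 for serial in devices if check_device(serial))
    for i, node in enumerate(sorted(nodes)):
        set_node_offline(node, offline=(i >= working),
                         reason="fewer working Flames than nodes on this machine")
</pre>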
|
There is still the edge case in which the chosen node determines that the last available device is nonfunctional. In this case it would have to fail the job, unless we can somehow indicate to the Jenkins master that it should reschedule the job; it should be the master that waits for a free node/device, rather than the node waiting for a free device itself. In other words, a node should only be waiting for a device if there are available devices to try; it should not be waiting for a device to become free if none are currently available. That's the job of the Jenkins master. We are stretching the standard Jenkins model a bit in this setup because both a free node *and* a working, free device are required to service a job, and when a node accepts a job, we don't yet know whether we can actually find a working device, since even an available device might fail during initialization.
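
One way to make that state visible, sketched below under the assumption that the test wrapper can ask Mozpool how many working devices remain on the machine (the helper and the exit-code convention are hypothetical, not an existing Jenkins or Mozpool feature):

<pre>
# Sketch: exit with a distinguishable code when there is no functional device
# left to try, so the result could be treated as "reschedule me" rather than a
# real test failure. count_working_devices() is a hypothetical Mozpool query;
# EXIT_NO_DEVICE is an arbitrary convention, not a Jenkins feature.
import sys

EXIT_NO_DEVICE = 42

def run_with_device(count_working_devices, run_tests):
    """Fail fast with a distinct exit code if there is no functional device to try."""
    if count_working_devices() == 0:
        print("no functional Flame available on this machine; job should be rescheduled")
        sys.exit(EXIT_NO_DEVICE)
    sys.exit(0 if run_tests() else 1)
</pre>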
|
The general way to work around this problem is to have the Jenkins master use Mozpool *during* the scheduling phase, that is, to have the master contact Mozpool when a job comes in and have Mozpool return a node/device combination. Something like this:
|
[[File:Flame-setup-mozpool-picks-node.png]]
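
In code, that pairing step might look roughly like the following sketch (MACHINE_OF, mozpool_request_device, and the node names are illustrative assumptions, not existing Jenkins or Mozpool interfaces):

<pre>
# Sketch: at scheduling time, pair a free Jenkins node with a working device from
# the Mozpool instance on the same machine. mozpool_request_device() stands in for
# whatever call assigns and verifies a device; MACHINE_OF maps node names to the
# machine (and Mozpool instance) they run on. All names here are illustrative.

MACHINE_OF = {"flame-node-1": "machine-a", "flame-node-2": "machine-a",
              "flame-node-3": "machine-b", "flame-node-4": "machine-b"}

def pick_node_and_device(free_nodes, mozpool_request_device):
    """Return a (node, device_serial) pair, or None if no machine can supply both."""
    for node in free_nodes:
        machine = MACHINE_OF[node]
        serial = mozpool_request_device(machine)   # None if no working device there
        if serial is not None:
            return node, serial
    return None   # leave the job queued until a node/device pair frees up
</pre>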
|
However, at this point we are starting to do some of Jenkins' work (the job scheduling part) ourselves. Again, the reason for putting some of the scheduling work onto Mozpool is that we need both a node and a working device. In a standard Jenkins setup, a node that is down will never be selected for a job, but in our setup a node can be up yet unable to get a working device. It's not clear whether shifting some of the scheduling responsibilities away from the Jenkins master is a net gain, particularly if we can have a node indicate to the master that it cannot perform the given job and that the job should be rescheduled.
Goals
Short Term
- build a small-capacity (>=8 Flames) automation system running b2gperf and a subset of UI tests per gaia commit, flashing gecko periodically from b2g-inbound
- ETA: July 21st
Long Term
- build a higher-capacity (>30 Firefox OS Flames) pool of phones hooked up to Mozpool, fully capable (SIM cards, Wi-Fi, SD cards, etc.), reporting to treeherder
- ETA: Q3
Etherpad
https://etherpad.mozilla.org/b2g-per-checkin-smoketest