ReleaseEngineering/Mozpool/How To Interpret Device State in Mozpool

If you look at the Lifeguard or BMM UIs in Mozpool, you'll see that each device has a state. This page breaks those states down.

Note: This is not an exhaustive list of states. For that, see devicemachine.py.

Normal States

These are the states in which a panda normally stays for a while.

new

Newly-added pandas show up in this state. Mozpool will not allocate these. They should be manually imaged using lifeguard instead, as a smoke test for bum hardware. The imaging will leave them in the "ready" state, or in a failure state.

free

The device is functional, but currently not allocated to any request. Devices in this state may be allocated by mozpool at any time.

ready

The device has been set up by mozpool to meet a request, and is in use. This name could be confusing -- think of it as "in use".

maintenance_mode

The device is in maintenance mode: live-booted to the Ubuntu image and awaiting login via SSH. You can re-image or power-cycle it from this state when you're done with your maintenance.

Action States

When lifeguard is doing things to a device, it cycles the device through a number of states. Devices don't stay in these states for very long. Lifeguard has timeouts for each state, and will only retry each state a configured number of times before it decides the device has failed (see the failure states, below).
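
As a rough illustration of the timeout-and-retry behavior described above, here is a minimal Python sketch. It is not Lifeguard's actual code: the function name run_action_state, the attempt callback, and the max_attempts parameter are all hypothetical, and the real timeout is replaced by a callback that simply reports whether the expected event arrived in time.

```python
from typing import Callable, Tuple


def run_action_state(attempt: Callable[[int], bool], max_attempts: int) -> Tuple[str, int]:
    """Retry one action state up to max_attempts times.

    attempt(n) stands in for "perform the action and wait for the timeout";
    it returns True if the expected event arrived on try number n.
    All names here are illustrative assumptions, not Mozpool APIs.
    """
    for n in range(1, max_attempts + 1):
        if attempt(n):
            # The device did what was expected; move on to the next state.
            return ("advanced", n)
    # Every try timed out; the device would land in a failure state.
    return ("failed", max_attempts)


# A device that only responds on the second try still advances.
outcome = run_action_state(lambda n: n == 2, max_attempts=3)
```

The point of the sketch is only that a single timeout is not a failure: a device is marked failed after the configured number of tries is exhausted.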

If a device is in an action state, the correct behavior on your part is to wait patiently until it gets to a normal or failed state. Lifeguard's still doing its thing, and you'll only get in the way. Don't make the lifeguard angry.

The details of the action states will change as we develop the tool, but there's a general pattern:

Lifeguard operations begin with one of two actions: either cycle the power and boot from the sdcard (states with a pc_ prefix), or cycle the power with a PXE config in place (states with a pxe_ prefix). Depending on the PXE config selected, the latter moves into purpose-specific states. These have prefixes like android_, b2g_ or maintenance. For installs, the suffixes roughly track the progress of the install:

  • download the binaries
  • extract them onto the sdcard
  • reboot
  • come up with an active network connection

You can see these state transitions in the second-stage scripts at http://hg.mozilla.org/build/puppet/file/tip/modules/bmm/templates/.
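
If you are scripting against state names, the prefix pattern above can be sketched roughly as follows. This is a hypothetical helper, not part of Mozpool: classify_state and both constants are made-up names, the prefixes are just the ones mentioned on this page, and since the list of states here is not exhaustive, anything unrecognized is reported as "unknown" rather than guessed at. The example state names pxe_booting and android_downloading are inferred from the naming pattern, so check devicemachine.py for the real list.

```python
# Normal states listed on this page (not exhaustive).
NORMAL_STATES = {"new", "free", "ready", "maintenance_mode"}

# Action-state prefixes mentioned on this page.
ACTION_PREFIXES = ("pc_", "pxe_", "android_", "b2g_", "maintenance")


def classify_state(state: str) -> str:
    """Return 'normal', 'failed', 'action', or 'unknown' for a state name."""
    # Check normal states first so maintenance_mode is not mistaken
    # for an action state by the "maintenance" prefix.
    if state in NORMAL_STATES:
        return "normal"
    # Failure states (described later on this page) start with failed_.
    if state.startswith("failed_"):
        return "failed"
    if state.startswith(ACTION_PREFIXES):
        return "action"
    return "unknown"


for name in ("ready", "pxe_booting", "android_downloading", "failed_pxe_booting"):
    print(name, "->", classify_state(name))
```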

Failed States

Sometimes pandas go bad. Bad panda! When that happens, Lifeguard generally detects it by not seeing the expected behavior, and assigns the device to a failed_ state. Many of these states are named by adding "failed_" to the name of the action state in which the device failed repeatedly.

When pandas fail, it's not always a hardware problem. See Handling Panda Failures for the process to follow; this section just describes the states.

failed_pxe_booting

In this state, after cycling the power many times, lifeguard has never heard anything from the board. This represents the absence of data available to mozpool, so it could be caused by any of a number of physical conditions on the device. It is generally *not* related to the image on the device. Consult the logs.

/failed_.*_downloading/

In these states, the device has booted, talked to lifeguard, and begun downloading the stuff it needs to blast onto the sdcard, and then died. The download is to RAM. Bad download URLs, server failures, or network failures might cause this. Consult the logs.

/failed_.*_extracting/

In these states, the download was successful, but extracting it onto the sdcard failed. This can indicate a bad sdcard, or a corrupt image. The logs may help to distinguish those two.

/failed_.*_pinging/

After a re-image, Lifeguard tries to ping the device. If it doesn't come up after a while, the device ends up in this state. This is almost always an image problem.

failed_power_cycling

This state occurs when the power-cycle operation itself fails repeatedly. It is more likely an indication of a bad relay board than a problem with the device.
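
The failure states above are written as patterns, so a first-pass triage lookup can be sketched with regular expressions. This is an illustration only, not a Mozpool feature: triage_hint and TRIAGE_HINTS are hypothetical names, the hints just paraphrase the descriptions on this page, and a concrete name such as failed_b2g_downloading is inferred from the patterns rather than taken from the code.

```python
import re
from typing import Optional

# (pattern, hint) pairs; the hints paraphrase the descriptions on this page.
TRIAGE_HINTS = [
    (re.compile(r"^failed_pxe_booting$"),
     "board never contacted lifeguard; likely physical, not the image; check logs"),
    (re.compile(r"^failed_.*_downloading$"),
     "died during download to RAM; check URLs, server, and network; check logs"),
    (re.compile(r"^failed_.*_extracting$"),
     "extraction to sdcard failed; bad sdcard or corrupt image; check logs"),
    (re.compile(r"^failed_.*_pinging$"),
     "no ping after re-image; almost always an image problem"),
    (re.compile(r"^failed_power_cycling$"),
     "power-cycle itself failed; suspect the relay board before the device"),
]


def triage_hint(state: str) -> Optional[str]:
    """Return a short hint for a failure state, or None if none applies."""
    for pattern, hint in TRIAGE_HINTS:
        if pattern.match(state):
            return hint
    return None


print(triage_hint("failed_b2g_downloading"))
```

A None result means only that the state is not one of the failure states described here; it may still be a failure state listed in devicemachine.py.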