ReleaseEngineering/Mozpool/How To Interpret Device State in Mozpool
If you look at the Lifeguard or BMM UIs in Mozpool, you'll see that each device has a state. This page will help to break those down.
Note: This is not an exhaustive list of states. For that, see devicemachine.py.
Normal States
These are states that a panda normally stays in for a while.
new
Newly-added pandas show up in this state. Mozpool will not allocate these. They should be manually imaged using lifeguard instead, as a smoke test for bum hardware. The imaging will leave them in the "ready" state, or in a failure state.
free
The device is functional, but currently not allocated to any request. Devices in this state may be allocated by mozpool at any time.
ready
The device has been set up by mozpool to meet a request, and is in use. This name could be confusing -- think of it as "in use".
Action States
When lifeguard is doing things to a device, it cycles the device through a number of states. Devices don't stay in these states for very long. Lifeguard has timeouts for each state, and will only retry each state a configured number of times before it decides the device has failed (see the failure states, below). The details of the action states will change as we develop the tool, but there's a general pattern:
Lifeguard operations begin with one of two actions: either cycle the power and boot from the sdcard (states with a pc_ prefix), or cycle the power with a PXE config in place (states with a pxe_ prefix). Depending on the PXE config selected, the latter moves into purpose-specific states. These have prefixes like android_, b2g_ or maintenance. For installs, the suffixes roughly track the progress of the install:
- download the binaries
- extract them onto the sdcard
- reboot
- come up with an active network connection
You can see these state transitions in the second-stage scripts at http://hg.mozilla.org/build/puppet/file/tip/modules/bmm/templates/.
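As a rough illustration of the naming pattern described above, here is a minimal Python sketch that sorts a state name into a broad category by its prefix. The function `describe_state` is hypothetical and not part of Mozpool; only the prefixes and the three normal state names come from this page, and `pc_example` in the usage below is a placeholder rather than a real state name.

```python
# Hypothetical helper, not part of Mozpool: classify a device state by name.
# The prefixes and normal state names are taken from this page.
NORMAL_STATES = ("new", "free", "ready")
SECOND_STAGE_PREFIXES = ("android_", "b2g_", "maintenance")


def describe_state(state: str) -> str:
    """Return a coarse, human-readable category for a state name."""
    if state in NORMAL_STATES:
        return "normal"
    # Check failed_ first, since failed states embed an action state name.
    if state.startswith("failed_"):
        return "failed"
    if state.startswith("pc_"):
        return "action: power-cycle and boot from the sdcard"
    if state.startswith("pxe_"):
        return "action: power-cycle with a PXE config in place"
    if state.startswith(SECOND_STAGE_PREFIXES):
        return "action: purpose-specific second stage"
    # This page is not an exhaustive list, so unknown names are expected.
    return "unknown (see devicemachine.py)"


if __name__ == "__main__":
    for name in ("free", "pc_example", "b2g_downloading", "failed_pxe_booting"):
        print(name, "->", describe_state(name))
```

Because the list of states on this page is not exhaustive, the sketch falls back to "unknown" instead of guessing.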
Failed States
Sometimes pandas go bad. Bad panda! When that happens, Lifeguard will generally detect it by not seeing the expected behavior, and will assign the device to a failed_ state. Many of these states are named by adding "failed_" to the name of the action state in which the device failed repeatedly.
The first level of diagnosis for a panda is which failed state lifeguard has chosen for the panda. The next level is usually to look at the logs in lifeguard.
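The way an action state turns into a failed state can be sketched roughly as follows. This is a simplified model, not Lifeguard's actual code: the `ActionState` class, its `max_attempts` field, and the `on_timeout` method are all hypothetical names, and the real timeouts and retry limits live in Lifeguard's own configuration. It also assumes the common case where the failed state is simply the action state name with "failed_" in front, which this page says holds for many states but not necessarily all.

```python
from dataclasses import dataclass


@dataclass
class ActionState:
    """Simplified, hypothetical model of one action state with a retry limit."""

    name: str
    max_attempts: int  # assumed configuration value, not a real Mozpool setting
    attempts: int = 0

    def on_timeout(self) -> str:
        """Return the name of the state to enter after this state times out."""
        self.attempts += 1
        if self.attempts >= self.max_attempts:
            # Out of retries: give up and move to the matching failed state.
            return "failed_" + self.name
        # Otherwise stay in the same state and try again.
        return self.name


if __name__ == "__main__":
    state = ActionState("pxe_booting", max_attempts=3)
    for _ in range(3):
        print(state.on_timeout())
```

In this model a board that never responds ends up in failed_pxe_booting once its attempts are used up, which matches the first failed state described below.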
failed_pxe_booting
In this state, after cycling the power many times, lifeguard has never heard anything from the board. Check the logs for anything from the board, tagged "syslog". If you see that, then the board is booting and has network. Otherwise, you'll need to investigate starting at the beginning:
- Does the board have power? (blinkenlights)
- Does the board have link? (lights on the switch)
- Does the board have an sdcard?
- Does the board become pingable, even briefly, if you force a power-cycle from the BMM UI? If so, then power, link, and sdcard are all working at least a little bit, so try a new sdcard.
After all of that, if you haven't found the problem, then it's time for some serial diagnostics.
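The order of the checks above can be written down as a small decision function. This is only a sketch of the checklist, assuming you have already gathered each observation by hand; the function name `triage_failed_pxe_booting` and its parameters are hypothetical, and nothing here talks to Lifeguard or the BMM UI. The returned strings are short reminders of the next step, not an official procedure.

```python
def triage_failed_pxe_booting(
    syslog_seen: bool,
    has_power: bool,
    has_link: bool,
    has_sdcard: bool,
    pingable_after_power_cycle: bool,
) -> str:
    """Return the next step for a board in failed_pxe_booting.

    Each argument is an observation made by a human, in the order the
    checklist asks for them.
    """
    if syslog_seen:
        # Log entries tagged "syslog" mean the board booted and has network.
        return "board is booting and has network"
    if not has_power:
        return "check power"
    if not has_link:
        return "check link"
    if not has_sdcard:
        return "check sdcard"
    if pingable_after_power_cycle:
        # Power, link, and sdcard all work at least a little bit.
        return "try a new sdcard"
    return "serial diagnostics"


if __name__ == "__main__":
    print(triage_failed_pxe_booting(False, True, True, True, True))
```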
failed_android_downloading or failed_b2g_downloading
In this state, the board has booted, talked to lifeguard, and begun downloading the stuff it needs to blast onto the sdcard. Then it died. Most of the time, the cause is a corrupt or dead sdcard. Try swapping in another card.
Let's also try re-writing the u-boot image to the card, in case it was corrupted, and marking the sdcard somehow when we do. If it turns out that the u-boot image gets corrupted sometimes, but re-writing it fixes that, then we can avoid trashing a lot of good sdcards. If it never helps, delete this paragraph.