ReleaseEngineering/Archive/Android Tegras

From MozillaWiki

Notes for buildduty

Tegra Problem Tracking

If a Tegra is failing tests, the basic "escalation path" is as follows:

  • reimage the sdcard
  • failing that, replace the sdcard
  • failing that, sometimes a second sdcard replacement works. Refer to job history success rate as to whether it's worth replacing the sdcard a second time.
  • as a last resort, decomm the Tegra. Only do this if the Tegra looks especially bad compared to the others since we have a limited number of these devices.
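The escalation path above can be written down as a small lookup, which may be handy when annotating a problem tracking bug. This is only a sketch: the step wording mirrors the list above, and the `next_step` helper is hypothetical and does not talk to any real infrastructure.

```python
# Hypothetical helper: the step names mirror the escalation list above.
ESCALATION_STEPS = [
    "reimage the sdcard",
    "replace the sdcard",
    "replace the sdcard a second time (only if job history suggests it is worth it)",
    "decomm the Tegra (last resort)",
]


def next_step(steps_already_tried):
    """Return the next escalation step, or None once every step has been tried."""
    if steps_already_tried < 0:
        raise ValueError("steps_already_tried must be >= 0")
    if steps_already_tried >= len(ESCALATION_STEPS):
        return None
    return ESCALATION_STEPS[steps_already_tried]
```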


Common tegra error states and how to recover from them

PING checks are failing

Nagios will alert us in channel (and send email) after it hits the retry limit for ping attempts.

See the section power cycle a tegra.

tegra agent check is CRITICAL

Verify that the tegra is not rebooting. A reboot can take up to 15 minutes, so wait and then verify the current status again. If it is still "rebooting", treat it the same as "PING checks are failing".

Unable to properly remove /mnt/sdcard/tests

Common buildduty requests

Removing a Tegra temporarily from service

You may receive a request to remove a tegra from service that is burning builds (or you may find one yourself). Check the recent list to see if it's still behaving badly, e.g.: (LDAP auth required)

   https://secure.pub.build.mozilla.org/buildapi/recent/tegra-338 
  • If it's burning builds, connect to the associated foopy listed in the dashboard and stop the tegra(s).

Return a tegra to production

When dcops returns tegras to production by closing the "tegra-recovery" bug:

  1. Check the status of the tegras to ensure everything is up
  2. Start the tegras (if they were manually stopped) to return them to production
  3. Update & resolve the tegras' problem tracker bugs

Deploying a new version of SUTAgent

While infrequent, you may be asked to help deploy a new version of the SUTAgent.

Basic tegra management

Find what foopy a Tegra is on

Open the Tegra page in Slave Health, click on the name of the Tegra in question, and look for the foopy name. The foopy name is also in devices.json.
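If you have a tools checkout handy, the devices.json lookup can be scripted. The sketch below assumes devices.json is a JSON object keyed by device name, with each entry carrying a "foopy" field (the field shown in the decommission example on this page). The helper names and the sample values are hypothetical.

```python
import json


def load_devices(path):
    """Load devices.json from a tools checkout (the path is supplied by the caller)."""
    with open(path) as handle:
        return json.load(handle)


def find_foopy(devices, device_name):
    """Return the foopy recorded for device_name, or None if there is none.

    Assumption: decommissioned devices carry the string "None" in the
    "foopy" field, so that value is treated as "no foopy".
    """
    entry = devices.get(device_name)
    if entry is None:
        return None
    foopy = entry.get("foopy")
    return None if foopy in (None, "None") else foopy


# Made-up sample in the assumed layout; real entries carry more fields.
SAMPLE_DEVICES = json.loads(
    '{"tegra-338": {"foopy": "foopy-example"}, "tegra-339": {"foopy": "None"}}'
)
```

For example, `find_foopy(load_devices("buildfarm/mobile/devices.json"), "tegra-338")` would return the foopy name for that device when run from the top of a tools checkout.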

Check status of Tegra(s)

Find the Tegra on the Dashboard and then ssh to that foopy

ssh cltbld@foopy##
cd /builds
./check.sh -t tegra-###

To check on the status of all Tegras covered by that foopy

./check.sh

check.sh is found in /builds on a foopy

Clear an error flag

This is done automatically, once an hour. But if you need to do it manually for some reason...

Find the Tegra on the Dashboard, ssh to that foopy and then

ssh cltbld@foopy##
rm -f /builds/tegra-NNN/error.flg

start Tegra(s)

Find out which foopy server you need to be on and then run:

cd /builds
rm -f /builds/tegra-###/{disabled,error}.flg

The device should then attempt to start up within 5 minutes, running through verify and then starting buildbot if verify succeeds.
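If you need to clear flags for several devices, the rm -f step can be scripted. A minimal sketch, assuming the /builds/<device>/ layout described on this page; the `clear_flags` helper is hypothetical and would have to be run on the foopy itself.

```python
import os


def clear_flags(builds_dir, device, flags=("disabled.flg", "error.flg")):
    """Remove the given flag files for one device and return the names removed.

    A missing flag file is ignored, which gives the same effect as `rm -f`.
    """
    removed = []
    for flag in flags:
        path = os.path.join(builds_dir, device, flag)
        try:
            os.remove(path)
            removed.append(flag)
        except FileNotFoundError:
            pass  # nothing to clear for this flag
    return removed
```

Calling `clear_flags("/builds", "tegra-###", flags=("error.flg",))` would clear only the error flag and leave a deliberately disabled device disabled.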

Should it seem to have trouble starting, you can check its watcher log:

tail /builds/tegra-###/watcher.log

If that log is stale, see the "Recover a foopy" section.
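One way to judge whether watcher.log is stale is to compare its modification time with the 5 minute watch_devices cycle. The sketch below is an assumption-laden example: the 600 second default (two cycles) is a guess rather than a documented threshold, and the `log_is_stale` helper is hypothetical.

```python
import os
import time


def log_is_stale(path, max_age_seconds=600, now=None):
    """Return True if the file is missing or has not been modified recently.

    max_age_seconds defaults to two 5-minute cycles; this is an assumed
    threshold, so adjust it to taste.
    """
    if now is None:
        now = time.time()
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return True  # a missing log is treated as stale
    return (now - mtime) > max_age_seconds
```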

Disable a tegra

First find the foopy server for the Tegra and then run:

cd /builds
touch tegra-NNN/disabled.flg

This will then stop the device within 5 minutes, at the next watch_devices cycle.

Should it seem to have trouble stopping, you can check its watcher log:

tail /builds/tegra-###/watcher.log

If that log is stale, see the "Recover a foopy" section.

Reboot a tegra

To reboot a batch of tegras, specify '-D <tegra>' for each:

cd tools; hg pull -u
python buildfarm/maintenance/manage_foopies.py -f buildfarm/mobile/devices.json -D tegra-xxx -D tegra-yyy reboot

Note: double-check the enabled/disabled state before rebooting.
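When the batch is long, it can be easier to generate the command line than to type every '-D' by hand. This sketch only builds the argument list shown above and does not run anything; the `reboot_command` helper is hypothetical and assumes you run the result from the top of the tools checkout.

```python
def reboot_command(devices, devices_json="buildfarm/mobile/devices.json"):
    """Build the manage_foopies.py argv for rebooting a batch of devices."""
    if not devices:
        raise ValueError("at least one device is required")
    argv = ["python", "buildfarm/maintenance/manage_foopies.py", "-f", devices_json]
    for name in devices:
        argv += ["-D", name]  # one -D per device, as in the command above
    argv.append("reboot")
    return argv
```

The resulting list can be passed to `subprocess.call` or joined with spaces for pasting into a shell.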

The more manual method to reboot a tegra:

ssh cltbld@<foopy>
cd /builds
python sut_tools/tegra_powercycle.py ###

You will see the snmpset call result if it worked.

You can run this from your own machine rather than a foopy by doing this:

cd ~/repos/tools/sut_tools
# If your tools checkout is out of date, you risk toggling the wrong PDU (the PDU info may have changed)
PYTHONPATH=. python tegra_powercycle.py ###

If rebooting via PDU does not clear the problem, here are things to try:

  • reboot again - fairly common to have 2nd one clear it
    • especially if device responsive to ping & telnet (port 20701) after first reboot
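To check whether the device answers on the SUTAgent port after a reboot, a plain TCP connect is enough. A minimal sketch using only the standard library; the `port_is_open` helper and its 5 second timeout are assumptions, and a successful connect only shows that something is listening, not that the agent is healthy.

```python
import socket


def port_is_open(host, port=20701, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # refused, timed out, or the name did not resolve
```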

Recover a foopy

If a foopy has been shut down without having cleanly stopped all Tegras, you will need to do the following:

Note: Establish the base screen session if needed, by trying screen -x first

ssh cltbld@foopy##
screen -x
cd /builds
rm -f tegra-*/watcher.lock
./watch_devices.sh

Decommission a Tegra

  • file a RelOps bug to decommission the tegra and have it block the build slave tracking bug
  • once RelOps and DcOps have removed the tegra from DNS and Inventory:
    • in slavealloc, disable the tegra, set the environment to 'decomm', and set the comment to "decommissioned in bug XXXXXX"
    • remove the /builds/tegra-XXX directory from the foopy
    • locate the tegra in [tools]/buildfarm/mobile/devices.json and set the following fields (example):
 "foopy":"None"
 "_comment": "disabled in bug XXXXXX"
    • add tegra to the list of decommissioned tegras in [buildbot-configs]/mozilla-tests/production_config.py (example)
  • once all the above is done and landed, close the build slave tracking bug
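The devices.json change in the steps above could be sketched as follows. It assumes the file is a JSON object keyed by device name; the `mark_decommissioned` helper and the sample entry are hypothetical, and the bug number stays a placeholder.

```python
import json


def mark_decommissioned(devices, device_name, bug):
    """Set the two fields listed above on one devices.json entry.

    Raises KeyError if the device is not listed, which is worth knowing
    about before landing the change.
    """
    entry = devices[device_name]
    entry["foopy"] = "None"
    entry["_comment"] = "disabled in bug %s" % bug
    return devices


# Made-up sample entry in the assumed layout.
sample = {"tegra-XXX": {"foopy": "foopy-example"}}
print(json.dumps(mark_decommissioned(sample, "tegra-XXX", "XXXXXX"), indent=1, sort_keys=True))
```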

Advanced tegra management

Remote format an SDCard

To remote format the SDCard, make sure that as little as possible has any chance of touching the tegra until you are done. This is usually a simple matter of making sure clientproxy and buildbot are not running for the tegra.

Then you want to telnet in:

telnet tegra-### 20701

Assuming the tegra is up, this will get you to a command prompt. Check the current mount info (as a sanity check in case anything goes wrong), execute the format command, then reboot the tegra to ensure it starts in a clean, known state. Optionally, telnet back in and recheck the mount info.

exec mount
exec newfs_msdos -F 32 /dev/block/vold/179:9
exec rebt
^]
quit

Yes, the ^] is ctrl+], which gets you out of the telnet prompt. rebt does not return a value, so you might otherwise be stuck at the prompt until the connection closes remotely.

If you receive this error when attempting the format:

newfs_msdos: /dev/block/vold/179:9: No such file or directory

That is usually an indication of a bad SDCard or bad SDCard reader (the OS doesn't think there is an SDCard).

You should PDU reboot the Tegra and clear the error flag after formatting the SDcard.

check.py options

To manually run check.py, find the appropriate foopy server and

cd /builds
python sut_tools/check.py [-m [s|p]] [-r] [-c] [-t tegra-###]
  • -m [s|p] restrict Tegra list to 's'taging or 'p'roduction
  • -r reset any error.flg semaphore if found and send "rebt" command to tegra
  • -c powercycle the Tegra by telneting to the appropriate PDU

This will scan a given Tegra (or all of them) and report back its status.

Start ADB

On the Tegra do:

telnet tegra-### 20701
exec su -c "setprop service.adb.tcp.port 5555"
exec su -c "stop adbd"
exec su -c "start adbd"

On your computer do:

adb tcpip 5555
adb connect <ipaddr of tegra>
adb shell

Add a tegra to staging

  • on the new foopy, if it is already in service (if the foopy is brand new, the initial install will do all these steps):
    • update the local tools: cd /builds/tools ; hg pull --update; cd -
    • manually add the tegra to the /builds/create_dirs.sh file
    • manually run cd /builds; ./create_dirs.sh
  • copy a buildbot.tac file from an existing tegra directory and modify the contents to point to the new tegra
  • add new tegra to slavealloc
  • add new tegra to localconfig.py on the dev-master
    • don't forget to checkconfig && reconfig the dev-master

Move a tegra from one foopy to another

The steps are written for moving one tegra. If you're moving a bunch, then you may want to apply each major step to all tegras involved, and use the "reconfigure foopy" approach to save work.

NOTE: use this technique to replace a tegra as well. (It's really two moves: move old to dust bin, then move replacement to live.)

  1. update foopies.sh & tegras.json in your working directory
  2. commit the changes to foopies.sh and tegras.json
    • make sure json is clean: python -c 'import json; json.loads(open("tegras.json").read())'
  3. in buildbot, request a "graceful shutdown"
    • wait for tegra to show "idle"
  4. on the old foopy:
    • stop the tegra via /builds/stop_cp.sh
    • manually remove the tegra from the /builds/create_dirs.sh file
      • OR run ./foopies.sh old_foopy_number from your working directory (blocked by bug 713690)
  5. on the new foopy, if it is already in service (if the foopy is brand new, the initial install will do all these steps):
    • update the local tools: cd /builds/tools ; hg pull --update; cd -
    • manually add the tegra to the /builds/create_dirs.sh file
    • manually run cd /builds; ./create_dirs.sh
    • if this is a replacement tegra, manually push the ini files by judicious use of: grep python update_tegra_ini.sh | sed 's/$TEGRA/tegra-xxx/'
  6. on the new foopy:
    • restart the tegras using screen -x # or you will hit bug 642369; cd /builds ; ./start_cp.sh
      • NOTE: do not start any new tegras, which require a reconfig to be active, until after the reconfig is complete.

Move a tegra from staging to production

  1. If the tegra is running, stop it: /builds/stop_cp.sh tegra-###
  2. Edit the tegra's buildbot.tac: /builds/tegra-###/buildbot.tac
  3. Adjust the master, port and password to the appropriate server
  4. Save and restart the Tegra: screen -x # or you will hit bug 642369; /builds/start_cp.sh tegra-###

Note - yes, it's a manual process (sorry) until Tegras are in SlaveAlloc

This process also applies to moving tegras from one dev staging master to another.

Managing SUTagent

Update the SUTagent in the staging environment

Note: Foopy 110 is the staging foopy

ssh -l userid to relengwebadm and sudo to root

Copy the new SUTagent apk to /mnt/netapp/relengweb/talos-bundles/mobile

It should be called something like SUTagentAndroid.1.13.apk. To ensure it can be read, chmod o+r the file.

On each staging foopy, connect as cltbld

screen -x
cd sut_tools
edit target_version in updateSUT.py to reflect the new version
cd ..
./stop_cp.sh
detach from screen, this will take 45-60 minutes to bring down all the attached tegras
./start_cp.sh

Update the SUTagent in production

ssh -l userid to relengwebadm and sudo to root

Copy the new SUTagent apk to /mnt/netapp/relengweb/talos-bundles/mobile

It should be called something like SUTagentAndroid.1.17.apk. To ensure it can be read, chmod o+r the file

Land the patch to update the version (for instance, bug 827873). Graceful all the masters that control the tegras and pandas, then update all the foopies with the latest version of the tools repo using the manage_foopies.py command below.

To avoid blue when the sutagent tries to update: Throw a disabled.flg/error.flg at all devices temporarily

python manage_foopies.py -f ~/hg/tools/buildfarm/mobile/devices.json update -H all

bug 827903 has been logged for suggestions on how to improve this process

How to tell what version of SUTagent is running on a device

telnet device-hostname 20701
ver
Escape character is '^]'.
$>activity
##AGENT-WARNING## [activity] command is currently not implemented.
$>ver
SUTAgentAndroid Version 1.15
$>activity
##AGENT-WARNING## [activity] command is currently not implemented.
$>

If the version on the device doesn't match the one in /builds/tools/sut_tools/updateSUT.py, you have a problem where verify is failing to update the SUTagent on the device.
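If you capture the telnet session text, the comparison can be automated. The sketch below parses the "SUTAgentAndroid Version" line shown in the session above. It assumes target_version in updateSUT.py is a plain version string such as "1.15"; both helper names are hypothetical.

```python
import re

# Matches the reply to the "ver" command as it appears in the session above.
VERSION_RE = re.compile(r"SUTAgentAndroid Version (\d+(?:\.\d+)*)")


def parse_agent_version(session_text):
    """Return the version string reported by the agent, or None if absent."""
    match = VERSION_RE.search(session_text)
    return match.group(1) if match else None


def version_matches(session_text, target_version):
    """Compare the reported version with the expected one as plain strings."""
    return parse_agent_version(session_text) == target_version
```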

Environment

The Tegra builders are run on multiple "foopy" servers with about 15-20 Tegras per foopy. Each Tegra has its own buildslave environment and they share common tool and talos environments -- all found inside of /builds.

  • Each Tegra has a /builds/tegra-### containing a buildslave and a symlink to /builds/tools/sut_tools/clientproxy.py
  • All of the shared talos info is in /builds/talos-data and HG is used to maintain it
  • All of the sut related helper code is found in /builds/sut_tools (a symlink to /builds/tools/sut_tools/)

Tegra is the short name for the Tegra 250 Developer Kit test board, see http://developer.nvidia.com/tegra-250-tango-development-kits for details. It allows us to install and test Firefox on a device that runs Android Froyo while also allowing for debugging.

Unlike the N900s, we don't run a buildbot environment on the device, but rather communicate with the device via the sutAgentAndroid program that the a-team maintains. All of the buildslave activities are handled by the clientproxy.py program, which monitors the Tegra and its state and starts/stops the buildslave as needed.

Foopies

Update software on the foopies

You can update the version of the tools repo on all foopies using fabric.

python manage_foopies.py -H [all|foopyNN] [-H [foopyNN]]* <action>

where action can be (currently) "update" or "show_revision"

See bug 809763 or bug 846354 for details

Pandas

Determining what Pandas are attached to a Chassis

At the moment, there are two ways you could look up this information:

  • Check the name of the relay board in the notes section for a given panda. That will also tell you the chassis name.
  • Use the rack-level view in inventory, like so:

https://inventory.mozilla.org/en-US/systems/racks/?rack=224

You can use the rack location in the first column to identify which boards and which relay are sharing a chassis.

Also, in general what you probably care more about is which relay it's attached to. Each production panda has a k/v store key of system.relay.0 that will tell you which relay it's hooked up to (which corresponds to chassis as well) and which bank and slot on the relay.

Disabling a Panda

Any panda can be disabled via a disabled.flg file on its corresponding foopy (as listed in devices.json). The disabled.flg file goes in the /builds/panda-XXXX/ dir on the foopy. Please add any information as to *why* the panda is being disabled to the disabled.flg file (bug #, etc).

There is no need to disable a panda via the Mozpool/Lifeguard UI, except in rare circumstances.

If you find a panda that does not have an entry in devices.json, that is a bug, not a feature. To deal with that, the best way is to do a cssh across all foopies looking for /builds/panda-XXXX/.

Rebooting Pandas

To reboot a batch of pandas, specify '-D <panda>' for each:

cd tools; hg pull -u
python buildfarm/maintenance/manage_foopies.py -f buildfarm/mobile/devices.json -D panda-xxx -D panda-yyy reboot

Note that panda reboots can fail due to contention on the relay board used for them. Also, double-check the enabled/disabled state.

An alternative method:

  • You can reboot from any machine that can access the panda-relay-* via ping, it's just TCP
  • Soon they'll all be hooked up to BMM and you'll just be able to HTTP POST to reboot a panda
  • relay.py just talks to the relay control board that the pandas' power supplies are wired through so it's a hard powercycle
  • Method #2
    • There's also a script to query the devices.json so you just have to specify the device name here https://hg.mozilla.org/build/braindump/reboot_panda.py. You'll need to update the path to the relay.py in the script.
    • ie. python reboot_panda.py -f ~/hg/tools/buildfarm/mobile/devices.json -d panda-0873
    • I reboot ranges of pandas like this
for ((c=615;c<=623;c++)) ; do echo panda-0$c; python reboot_panda.py -f ~/hg/tools/buildfarm/mobile/devices.json -d panda-0$c ; done
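The same range can be generated in Python, which avoids hand-editing the zero padding. This is a sketch that assumes panda names are always "panda-" followed by a four-digit, zero-padded number, as in panda-0873 above; the `panda_names` helper is hypothetical.

```python
def panda_names(first, last):
    """Return panda hostnames for the inclusive range first..last.

    Assumption: names use a four-digit, zero-padded number.
    """
    if first > last:
        raise ValueError("first must not be greater than last")
    return ["panda-%04d" % number for number in range(first, last + 1)]


# The range from the shell loop above, one name per line.
for name in panda_names(615, 623):
    print(name)
```

Each name can then be passed to reboot_panda.py with -d, or to manage_foopies.py with -D.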

Moving replacement pandas into production

Pandas die and the hardware needs to be replaced. DC Ops have several spare pandas they use for replacement purposes. The new pandas will have different hostnames than the old ones. The procedure to replace a dead panda from the releng side is:

  • If the panda is used for Android tests, it needs to be imaged in Lifeguard with the android image. After the imaging process has completed, you need to set its status to "locked_out".
  • Since the hostnames are different, the range of pandas allocated to production may need to be modified in mozilla-tests/production_config.py.
  • The devices.json file needs to be modified to remove the old panda's definition and include the new pandas. See bug 841486 for an example of the changes required. See https://bugzilla.mozilla.org/show_bug.cgi?id=836857#c2 for an example of the old to new panda mapping from DC Ops.
  • Update or add the directory on the appropriate foopy with the new panda name. Ensure the appropriate buildbot master is listed in the buildbot.tac
  • Update slavealloc as appropriate to include or enable the pandas

How to run smoketests on the pandas

jmaher blogged about it

http://elvis314.wordpress.com/2013/03/01/smoketest-for-firefox-android-on-panda-boards/

How to determine the version of preseed running on a panda

  • Put the panda in question into maintenance mode via mozpool. In lifeguard, ensure that the panda is in an unlocked state; if not, unlock it. Then from the drop down image menu, select please start_maintenance. (This is the same menu that has please_image.)
  • Wait for it to reboot and appear as in maintenance_mode state.
  • ssh to the device as the build user (current password is from the time the devices were imaged so n-1)
  • mkdir /home/cltbld/tmp
  • sudo -i mount /dev/mmcblk0p1 /home/cltbld/tmp
  • cat /home/cltbld/tmp/boot.txt

The current version looks like this

setenv bootargs "console=ttyO2,115200n8 rootwait ro earlyprintk fixrtc nocompcache vram=48M omapfb.vram=0:24M,1:24M mem=456M@0x80000000 mem=512M@0xA0000000 init=/init androidboot.console=ttyO2 omapdss.def_disp=dvi omapdss.hdmimode=0 omapdss.mode=dvi:640x480MR-24@60 consoleblank=0"
setenv bootandroid "echo Booting Android from SDcard; fatload mmc 0:1 0x80200000 uImage; fatload mmc 0:1 0x81600000 uInitrd; bootm 0x80200000 0x81600000"
setenv bootpxefirst "echo Launching PXE boot... ; if usb start; then set autoload no; bootp; if pxe get; then pxe boot; fi; fi; usb stop; run bootandroid"

Which dividehex assures me is the current version

kmoir: that looks correct. the previous version had the bootargs truncated like so "setenv bootargs "console=ttyO2,115200n8 rootwait ro earlyprintk fixrtc nocompcache vram=48M omapfb.vram=0:24M,1:24M mem=456M@0x80000000 mem=512M@0xA0000000 init=/init androidboot.console=ttyO2"
  • Reimage the devices with the correct image as appropriate. You'll have to force the devices into the free state, reimage, and then lock if required.
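The comparison between the current and the previous boot.txt can be scripted once you have the file contents. The sketch below treats a bootargs line as current if it contains the arguments that the truncated previous version quoted above lacks. That test is an assumption drawn only from the two examples on this page, and the helper names are hypothetical.

```python
import re

# Arguments present in the current bootargs but missing from the truncated
# previous version quoted above.
CURRENT_ONLY_ARGS = ("omapdss.def_disp=dvi", "omapdss.hdmimode=0", "consoleblank=0")


def bootargs_from_boot_txt(text):
    """Return the bootargs as a list of arguments, or None if not found."""
    match = re.search(r'^setenv bootargs "([^"]*)"', text, re.MULTILINE)
    return match.group(1).split() if match else None


def looks_current(text):
    """Return True if every argument unique to the current version is present."""
    args = bootargs_from_boot_txt(text)
    if args is None:
        return False
    return all(arg in args for arg in CURRENT_ONLY_ARGS)
```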

How to deal with failing pandas in mozpool

Non-talos tests for Android run via mozpool (see bug 829211). The devices should be in the "ready" state with the android image in mozpool to be able to be allocated to test jobs. The device is allocated by buildbot based on an idle slave that is available to the master, so mozpool doesn't allocate the specific device; buildbot does. When the "run_script" part of the job runs and invokes mozharness, the device is verified by mozpool. The mozharness scripts also ensure that the correct version of the sutagent is installed and update it if necessary.

  • These messages in the buildbot log mean that the panda requested by buildbot was not made available in the ready state within the timeout parameters specified by mozpool. Sometimes this seems to be an intermittent error. If not, the panda should be disabled from production runs and a bug opened with DC Ops to investigate this device.
13:17:41     INFO - Waiting for request 'ready' stage.  Current state: 'contacting_lifeguard'
13:18:41     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:19:41     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:20:41     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:21:41     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:22:41     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:23:41     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:24:41     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:25:41     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:26:41     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:27:41     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:28:42     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:29:42     INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:30:42     INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:31:42     INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:32:42     INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:33:42     INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:34:42     INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:35:42     INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:36:42     INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:37:42    ERROR - INFRA-ERROR: Request did not become ready in time
  • Devices in the "failed_pxe_booting" state in mozpool. Not sure what to do in this state since requests to reboot or reimage are refused because the device isn't available on the network. Opening a bug with DC Ops to investigate this device is a good first step.
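When triaging several of these logs, it can help to summarize which states a request passed through. The sketch below counts the states in the "Waiting for request" lines and notes whether the INFRA-ERROR line appeared. It relies only on the log format shown above; the `summarize_request_log` helper is hypothetical.

```python
import re
from collections import Counter

STATE_RE = re.compile(
    r"Waiting for request 'ready' stage\.\s+Current state: '([^']+)'"
)
TIMEOUT_MARKER = "INFRA-ERROR: Request did not become ready in time"


def summarize_request_log(lines):
    """Return (Counter of states seen, whether the request timed out)."""
    states = Counter()
    timed_out = False
    for line in lines:
        match = STATE_RE.search(line)
        if match:
            states[match.group(1)] += 1
        elif TIMEOUT_MARKER in line:
            timed_out = True
    return states, timed_out
```

A log dominated by 'failed_device_busy' that ends in the INFRA-ERROR line matches the case described in the first bullet above.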

References

One source of truth

As of Oct 2012, tools/buildfarm/mobile/devices.json should be the most authoritative document.

  • if you find a tegra deployed that is not listed here, check slavealloc. Once found, file a releng bug to get devices.json updated.
  • if you find a PDU not labeled per the devices.json file, file a releng bug to update the human labels.