ReleaseEngineering/Archive/Android Tegras
Notes for buildduty
Tegra Problem Tracking
If a Tegra is failing tests, the basic "escalation path" is as follows:
- reformat the sdcard
- failing that, replace the sdcard
- failing that, sometimes a second sdcard replacement works. Refer to job history success rate as to whether it's worth replacing the sdcard a second time.
- as a last resort, decomm the Tegra. Only do this if the Tegra looks especially bad compared to the others since we have a limited number of these devices.
Terminology note:
- reformat = wipe out the SD card
- reimage = re-flash the Tegra
"always reimage" is a good rule of thumb when attempting to recover Tegras.
Common tegra error states and how to recover from them
PING checks are failing
Nagios will alert us in channel (and send email) after it hits the retry limit for ping attempts.
See the section power cycle a tegra.
tegra agent check is CRITICAL
Verify that the tegra is not rebooting. A reboot can take up to 15 minutes, so wait and then verify the current status. If it is still "rebooting", treat it the same as "PING checks are failing" above.
Unable to properly remove /mnt/sdcard/tests
- Disable the tegra: Stop tegra(s)
- Format the SDCard: Remote format an SDCard
- Start back up: Start Tegra(s)
Common buildduty requests
Removing a Tegra temporarily from service
You may receive a request to remove a tegra from service that is burning builds (or you may find one yourself). Check the recent list to see if it's still behaving badly, e.g.: (LDAP auth required)
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-338
- If it's burning builds, connect to the associated foopy listed in the dashboard and stop the tegra(s).
Return a tegra to production
When dcops returns tegras to production by closing the "tegra-recovery" bug:
- Check the status of the tegras to ensure everything is up
- Start the tegras (if they were manually stopped) to return them to production
- Update & resolve the tegras' problem tracker bugs
Deploying a new version of SUTAgent
While infrequent, you may be asked to help deploy a new version of the SUTAgent.
Basic tegra management
Find what foopy a Tegra is on
Open the Tegra page in Slave Health, click on the name of the Tegra in question, and look for the foopy name. The foopy name is also in devices.json.
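The devices.json lookup can be sketched as below. This is a minimal sketch, assuming devices.json is a top-level object keyed by device name whose entries carry a "foopy" field; the sample data and the find_foopy helper are hypothetical.

```python
import json


def find_foopy(devices, device_name):
    """Return the foopy recorded for device_name, or None if it is not listed."""
    entry = devices.get(device_name)
    if entry is None:
        return None
    return entry.get("foopy")


# Hypothetical sample in the assumed devices.json layout.
SAMPLE = json.loads('{"tegra-338": {"foopy": "foopy05"}}')

print(find_foopy(SAMPLE, "tegra-338"))
```

With a real checkout you would load tools/buildfarm/mobile/devices.json with json.load instead of the inline sample.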
Check status of Tegra(s)
Find the Tegra on the Dashboard and then ssh to that foopy
ssh cltbld@foopy##
cd /builds
./check.sh -t tegra-###
To check on the status of all Tegras covered by that foopy
./check.sh
check.sh is found in /builds on a foopy
Clear an error flag
This is done automatically, once an hour. But if you need to do it manually for some reason...
Find the Tegra on the Dashboard, ssh to that foopy and then
ssh cltbld@foopy05
rm -f /builds/tegra-NNN/error.flg
start Tegra(s)
Find out which foopy server you need to be on and then run:
cd /builds
rm -f /builds/tegra-###/{disabled,error}.flg
The device should then attempt to start up within 5 minutes, running through verify and then starting buildbot if verify succeeds.
Should it seem to have trouble starting, you can check its watcher log:
tail /builds/tegra-###/watcher.log
And if that is stale, you might want to look at "Recover a foopy".
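The flag removal above can also be scripted. A minimal sketch, assuming it runs on the foopy itself; clear_flags is a hypothetical helper, not part of sut_tools.

```python
import os


def clear_flags(builds_dir, device, flags=("disabled.flg", "error.flg")):
    """Remove the given flag files for one device and return the names removed."""
    removed = []
    for flag in flags:
        path = os.path.join(builds_dir, device, flag)
        if os.path.exists(path):
            os.remove(path)
            removed.append(flag)
    return removed
```

For example, clear_flags("/builds", "tegra-123") has the same effect as the rm -f command above and tells you which flags were actually present.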
Disable a tegra
First find the foopy server for the Tegra and then run:
cd /builds
touch tegra-NNN/disabled.flg
This will then stop the device within 5 minutes, at the next watch_devices cycle.
Should it seem to have trouble stopping, you can check its watcher log:
tail /builds/tegra-###/watcher.log
And if that is stale, you might want to look at "Recover a foopy".
Reboot a tegra
To reboot a batch of tegras, specify '-D <tegra>' for each:
cd tools; hg pull -u
python buildfarm/maintenance/manage_foopies.py -f buildfarm/mobile/devices.json -D tegra-xxx -D tegra-yyy reboot
Note: double-check the enabled/disabled state before rebooting.
The more manual method to reboot a tegra:
ssh cltbld@<foopy>
cd /builds
python sut_tools/tegra_powercycle.py ###
You will see the snmpset call result if it worked.
You can run this from your own machine rather than a foopy by doing this:
cd ~/repos/tools/sut_tools # If tools is not up to date, you risk toggling the wrong PDU if the info changed
PYTHONPATH=. tegra_powercycle.py ###
If rebooting via PDU does not clear the problem, here are things to try:
- reboot again - fairly common to have 2nd one clear it
- especially if device responsive to ping & telnet (port 20701) after first reboot
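Checking whether the agent port answers after a reboot can be sketched like this. It only tests that a TCP connection to port 20701 succeeds; port_open is a hypothetical helper and the timeout value is an assumption.

```python
import socket


def port_open(host, port=20701, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

A device that answers ping and returns True here after the first reboot is a good candidate for a second reboot.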
Recover a foopy
If a foopy has been shutdown without having cleanly stopped all Tegras, you will need to do the following:
Note: Try screen -x first, and establish the base screen session only if it is needed.
ssh cltbld@foopy##
screen -x
cd /builds
rm -f tegra-*/watcher.lock
./watch_devices.sh
Decommission a Tegra
- file a RelOps bug to decommission the tegra and have it block the build slave tracking bug
- once RelOps and DcOps have removed the tegra from DNS and Inventory:
- in slavealloc, disable the tegra, set the environment to 'decomm', and set the comment to "decommissioned in bug XXXXXX"
- remove the /builds/tegra-XXX directory from the foopy
- locate the tegra in [tools]/buildfarm/mobile/devices.json and set the following fields (example):
"foopy": "None",
"_comment": "disabled in bug XXXXXX"
- add tegra to the list of decommissioned tegras in [buildbot-configs]/mozilla-tests/production_config.py (example)
- once all the above is done and landed, close the build slave tracking bug
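The devices.json edit in the steps above can be sketched as follows. This assumes the file is a top-level object keyed by device name; mark_decommissioned is a hypothetical helper, and the bug number stays a placeholder.

```python
import copy


def mark_decommissioned(devices, device_name, bug):
    """Return a copy of devices with device_name marked as decommissioned."""
    updated = copy.deepcopy(devices)
    entry = updated[device_name]
    # Field names and values follow the example given in the steps above.
    entry["foopy"] = "None"
    entry["_comment"] = "disabled in bug %s" % bug
    return updated
```

The input is left untouched, so you can diff the old and new structures before writing the file back and landing the change.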
Advanced tegra management
Remote format an SDCard
To remote format the SDCard, first make sure that as little as possible has a chance of touching the tegra until you are done. This is usually a simple matter of making sure clientproxy and buildbot are not running for the tegra.
Then you want to telnet in:
telnet tegra-### 20701
Assuming the tegra is up, this will get you to a command prompt. You'll then check the current mount info (a sanity check in case anything goes wrong), execute the command to format, and then reboot the tegra to ensure it starts in a clean, known state. Optionally, you can telnet back in and recheck the mount info.
exec mount
exec newfs_msdos -F 32 /dev/block/vold/179:9
exec rebt
^]
quit
Yes, the ^] is a ctrl+] so that you can get out of the telnet prompt, since rebt does not return a value and might otherwise leave you stuck in the prompt until the connection closes remotely.
If you receive this error when attempting the format:
newfs_msdos: /dev/block/vold/179:9: No such file or directory
That is usually an indication of a bad SDCard or bad SDCard reader (the OS doesn't think there is an SDCard).
You should PDU reboot the Tegra and clear the error flag after formatting the SDcard.
check.py options
To manually run check.py, find the appropriate foopy server and
cd /builds
python sut_tools/check.py [-m [s|p]] [-r] [-c] [-t tegra-###]
- -m [s|p] restrict Tegra list to 's'taging or 'p'roduction
- -r reset any error.flg semaphore if found and send "rebt" command to tegra
- -c powercycle the Tegra by telneting to the appropriate PDU
This will scan a given Tegra (or all of them) and report back its status.
Start ADB
On the Tegra do:
telnet tegra-### 20701
exec su -c "setprop service.adb.tcp.port 5555"
exec su -c "stop adbd"
exec su -c "start adbd"
On your computer do:
adb tcpip 5555
adb connect <ipaddr of tegra>
adb shell
Add a tegra to staging
- on the new foopy, if it is already in service (if the foopy is brand new, the initial install will do all these steps):
- update the local tools: cd /builds/tools ; hg pull --update; cd -
- manually add the tegra to the /builds/create_dirs.sh file
- manually run cd /builds; ./create_dirs.sh
- copy a buildbot.tac file from an existing tegra directory and modify the contents to point to the new tegra
- add new tegra to slavealloc
- add new tegra to localconfig.py on the dev-master
- don't forget to checkconfig && reconfig the dev-master
Move a tegra from one foopy to another
The steps are written for moving one tegra. If you're moving a bunch, then you may want to apply each major step to all tegras involved, and use the "reconfigure foopy" approach to save work.
NOTE: use this technique to replace a tegra as well. (It's really two moves: move old to dust bin, then move replacement to live.)
- update foopies.sh & tegras.json in your working directory
- commit the changes to foopies.sh and tegras.json
- make sure json is clean: python -c 'import json; json.loads(open("tegras.json").read())'
- in buildbot, request a "graceful shutdown"
- wait for tegra to show "idle"
- on the old foopy:
- stop the tegra via /builds/stop_cp.sh
- manually remove the tegra from the /builds/create_dirs.sh file
OR run ./foopies.sh old_foopy_number from your working directory (blocked by: bug 713690)
- on the new foopy, if it is already in service (if the foopy is brand new, the initial install will do all these steps):
- update the local tools: cd /builds/tools ; hg pull --update; cd -
- manually add the tegra to the /builds/create_dirs.sh file
- manually run cd /builds; ./create_dirs.sh
- if this is a replacement tegra, manually push the ini files by judicious use of: grep python update_tegra_ini.sh | sed 's/$TEGRA/tegra-xxx/'
- on the new foopy:
- restart the tegras: run screen -x first (or you will hit bug 642369), then cd /builds ; ./start_cp.sh
- NOTE: do not start any new tegras, which require a reconfig to be active, until after the reconfig is complete.
Move a tegra from staging to production
- If the tegra is running, stop it: /builds/stop_cp.sh tegra-###
- Edit the tegra's buildbot.tac: /builds/tegra-###/buildbot.tac
- Adjust the master, port and password to the appropriate server
- Save and restart the Tegra: run screen -x first (or you will hit bug 642369), then /builds/start_cp.sh tegra-###
Note - yes, it's a manual process (sorry) until Tegras are in SlaveAlloc
This process also applies to moving tegras from one dev staging master to another.
Managing SUTagent
Update the SUTagent in the staging environment
Note: Foopy 110 is the staging foopy
ssh -l userid to relengwebadm and sudo to root
copy the new SUTagent apk here:
cd /mnt/netapp/relengweb/talos-bundles/mobile
It should be called something like SUTagentAndroid.1.13.apk. To ensure it can be read, chmod o+r the file.
On each staging foopy, connect as cltbld
screen -x
cd sut_tools
# edit target_version in updateSUT.py to reflect the new version
cd ..
./stop_cp.sh
# detach from screen; this will take 45-60 minutes to bring down all the attached tegras
./start_cp.sh
Update the SUTagent in production
ssh -l userid to relengwebadm and sudo to root
copy the new SUTagent apk here:
cd /mnt/netapp/relengweb/talos-bundles/mobile
It should be called something like SUTagentAndroid.1.17.apk. To ensure it can be read, chmod o+r the file.
- Land the patch to update the version, for instance bug 827873.
- Graceful all the masters that control the tegras and pandas.
- Update all the foopies with the latest version of the tools repo using manage_foopies.py (command below).
To avoid blue when the sutagent tries to update: Throw a disabled.flg/error.flg at all devices temporarily
python manage_foopies.py -f ~/hg/tools/buildfarm/mobile/devices.json update -H all
bug 827903 has been logged for suggestions on how to improve this process
How to tell what version of SUTagent is running on a device
telnet device-hostname 20701
ver
Escape character is '^]'.
$>activity
##AGENT-WARNING## [activity] command is currently not implemented.
$>ver
SUTAgentAndroid Version 1.15
$>activity
##AGENT-WARNING## [activity] command is currently not implemented.
$>
If the version on the device doesn't match the one in /builds/tools/sut_tools/updateSUT.py, verify is failing to update the SUTagent on the device.
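The comparison can be sketched as below, working from captured telnet output. The regular expression is based on the "SUTAgentAndroid Version 1.15" line in the session above; agent_version and needs_update are hypothetical helpers, and the target version is passed in by hand rather than read from updateSUT.py.

```python
import re

VERSION_RE = re.compile(r"SUTAgentAndroid Version (\d+(?:\.\d+)*)")


def agent_version(session_text):
    """Return the version string reported by `ver`, or None if it is absent."""
    match = VERSION_RE.search(session_text)
    return match.group(1) if match else None


def needs_update(session_text, target_version):
    """True when the reported version is missing or differs from the target."""
    return agent_version(session_text) != target_version
```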
Environment
The Tegra builders are run on multiple "foopy" servers with about 15-20 Tegras per foopy. Each Tegra has its own buildslave environment and they share common tool and talos environments, all found inside of /builds.
- Each Tegra has a /builds/tegra-### containing a buildslave and a symlink to /builds/tools/sut_tools/clientproxy.py
- All of the shared talos info is in /builds/talos-data and HG is used to maintain it
- All of the sut related helper code is found in /builds/sut_tools (a symlink to /builds/tools/sut_tools/)
Tegra is the short name for the Tegra 250 Developer Kit test board, see http://developer.nvidia.com/tegra-250-tango-development-kits for details. It allows us to install and test Firefox on a device that runs Android Froyo while also allowing for debugging.
Unlike the N900s, we don't run a buildbot environment on the device, but rather communicate with the device via the sutAgentAndroid program that the a-team maintains. All of the buildslave activities are handled by the clientproxy.py program, which monitors the Tegra and its state and starts/stops the buildslave as needed.
Foopies
Update software on the foopies
You can update the version of the tools repo on all foopies using fabric.
python manage_foopies.py -H [all|foopyNN] [-H [foopyNN]]* <action>
where action can be (currently) "update" or "show_revision"
See bug 809763 or bug 846354 for details
Pandas
Determining what Pandas are attached to a Chassis
At the moment, there are two ways you could look up this information:
- Check the name of the relay board in the notes section for a given panda. That will also tell you the chassis name.
- Use the rack-level view in inventory, like so:
https://inventory.mozilla.org/en-US/systems/racks/?rack=224
You can use the rack location in the first column to identify which boards and which relay are sharing a chassis.
Also, in general what you probably care more about is which relay it's attached to. Each production panda has a k/v store key of system.relay.0 that will tell you which relay it's hooked up to (which corresponds to chassis as well) and which bank and slot on the relay.
Disabling a Panda
Any panda can be disabled via the disabled.flg file on its corresponding foopy (as listed in devices.json). The disabled.flg file goes in the /builds/panda-XXXX/ dir on the foopy. Please add any information as to *why* the panda is being disabled to the disabled.flg file (bug #, etc).
There is no need to disable a panda via the Mozpool/Lifeguard UI, except in rare circumstances.
If you find a panda that does not have an entry in devices.json, that is a bug, not a feature. To deal with that, the best way is to do a cssh across all foopies looking for /builds/panda-XXXX/.
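Writing the reason into disabled.flg can be sketched like this. It assumes the script runs on the foopy and that the /builds/panda-XXXX/ directory already exists; disable_device is a hypothetical helper.

```python
import os


def disable_device(builds_dir, device, reason):
    """Create disabled.flg for a device, recording why it was disabled."""
    path = os.path.join(builds_dir, device, "disabled.flg")
    with open(path, "w") as flag:
        flag.write(reason + "\n")
    return path
```

For example, disable_device("/builds", "panda-0010", "bug XXXXXX") leaves a flag file that says which bug to consult before re-enabling.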
Rebooting Pandas
To reboot a batch of pandas, specify '-D <panda>' for each:
cd tools; hg pull -u
python buildfarm/maintenance/manage_foopies.py -f buildfarm/mobile/devices.json -D panda-xxx -D panda-yyy reboot
Note that panda reboots can fail due to contention on the relay board used for them. Also, double-check the enabled/disabled state.
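For a contiguous range of pandas, the command line above can be generated rather than typed. A minimal sketch: it assumes four-digit zero-padded names (as in panda-0010), and panda_names and reboot_argv are hypothetical helpers that only build the argument list.

```python
def panda_names(first, last):
    """Return panda hostnames for an inclusive numeric range, padded to 4 digits."""
    return ["panda-%04d" % n for n in range(first, last + 1)]


def reboot_argv(devices, devices_json="buildfarm/mobile/devices.json"):
    """Build the manage_foopies.py reboot command with one -D per device."""
    argv = ["python", "buildfarm/maintenance/manage_foopies.py", "-f", devices_json]
    for name in devices:
        argv.extend(["-D", name])
    argv.append("reboot")
    return argv


print(" ".join(reboot_argv(panda_names(615, 617))))
```

Run the printed command from the tools checkout, after hg pull -u as above.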
An alternative method:
- You can reboot from any machine that can access the panda-relay-* via ping, it's just TCP
- Soon they'll all be hooked up to BMM and you'll just be able to HTTP POST to reboot a panda
- relay.py just talks to the relay control board that the pandas' power supplies are wired through so it's a hard powercycle
- Method #1
- Clone http://hg.mozilla.org/users/tmielczarek_mozilla.com/relay-control to get relay.py
- Look in inventory for the panda you want to reboot, https://inventory.mozilla.org/en-US/#inventory, for example panda-0010: https://inventory.mozilla.org/en-US/systems/show/6812/
- At the bottom of the page, you can see the server, bank and relay: panda-relay-001.build.scl1.mozilla.com:bank1:relay4
- These are the parameters to relay.py: "python relay.py powercycle panda-relay-001.build.scl1.mozilla.com 1 4"
- Method #2
- There's also a script to query the devices.json so you just have to specify the device name here https://hg.mozilla.org/build/braindump/reboot_panda.py. You'll need to update the path to the relay.py in the script.
- ie. python reboot_panda.py -f ~/hg/tools/buildfarm/mobile/devices.json -d panda-0873
- I reboot ranges of pandas like this
for ((c=615;c<=623;c++)) ; do echo panda-0$c; python reboot_panda.py -f ~/hg/tools/buildfarm/mobile/devices.json -d panda-0$c ; done
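For Method #1, turning the inventory string into relay.py parameters can be sketched as follows. The host:bankN:relayN format is taken from the example above; relay_powercycle_args is a hypothetical helper that only builds the argument list and does not talk to the relay.

```python
import re

RELAY_RE = re.compile(r"^(?P<host>[^:]+):bank(?P<bank>\d+):relay(?P<relay>\d+)$")


def relay_powercycle_args(spec):
    """Convert 'host:bankN:relayM' into the relay.py powercycle argument list."""
    match = RELAY_RE.match(spec.strip())
    if match is None:
        raise ValueError("unrecognised relay spec: %r" % spec)
    return ["python", "relay.py", "powercycle",
            match.group("host"), match.group("bank"), match.group("relay")]
```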
Moving replacement pandas into production
Pandas die and the hardware needs to be replaced. DC Ops have several spare pandas they use for replacement purposes. The new pandas will have different hostnames than the old ones. The procedure to replace a dead panda from the releng side is:
- If the panda is used for Android tests, it needs to be imaged in Lifeguard with the android image. After the imaging process has completed, you need to set its status to "locked_out".
- Since the hostnames are different, the range of pandas allocated to production may need to be modified in mozilla-tests/production_config.py.
- The devices.json file needs to be modified to remove the old panda's definition and include the new pandas. See bug 841486 for an example of the changes required. See https://bugzilla.mozilla.org/show_bug.cgi?id=836857#c2 for an example of the old to new panda mapping from DC Ops.
- Update or add the directory on the appropriate foopy with the new panda name. Ensure the appropriate buildbot master is listed in the buildbot.tac.
- Update slavealloc as appropriate to include or enable the pandas
How to run smoketests on the pandas
jmaher blogged about it
http://elvis314.wordpress.com/2013/03/01/smoketest-for-firefox-android-on-panda-boards/
How to determine the version of preseed running on a panda
- Put the panda in question into maintenance mode via mozpool. In lifeguard, ensure that the panda is in an unlocked state; if not, unlock it. Then from the drop-down image menu, select please_start_maintenance. (This is the same menu that has please_image.)
- Wait for it to reboot and appear as in maintenance_mode state.
- ssh to the device as the build user (current password is from the time the devices were imaged so n-1)
- mkdir /home/cltbld/tmp
- sudo -i mount /dev/mmcblk0p1 /home/cltbld/tmp
- cat /home/cltbld/tmp/boot.txt
The current version looks like this
setenv bootargs "console=ttyO2,115200n8 rootwait ro earlyprintk fixrtc nocompcache vram=48M omapfb.vram=0:24M,1:24M mem=456M@0x80000000 mem=512M@0xA0000000 init=/init androidboot.console=ttyO2 omapdss.def_disp=dvi omapdss.hdmimode=0 omapdss.mode=dvi:640x480MR-24@60 consoleblank=0"
setenv bootandroid "echo Booting Android from SDcard; fatload mmc 0:1 0x80200000 uImage; fatload mmc 0:1 0x81600000 uInitrd; bootm 0x80200000 0x81600000"
setenv bootpxefirst "echo Launching PXE boot... ; if usb start; then set autoload no; bootp; if pxe get; then pxe boot; fi; fi; usb stop; run bootandroid"
Which dividehex assures me is the current version
kmoir: that looks correct. the previous version had the bootargs truncated like so "setenv bootargs "console=ttyO2,115200n8 rootwait ro earlyprintk fixrtc nocompcache vram=48M omapfb.vram=0:24M,1:24M mem=456M@0x80000000 mem=512M@0xA0000000 init=/init androidboot.console=ttyO2"
- Reimage the devices with the correct image as appropriate. You'll have to force the devices into the free state, reimage, and then lock if required.
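Telling the current boot.txt from the previous one can be sketched as below. It relies only on the difference described above: the previous version's bootargs stop at androidboot.console=ttyO2, so the later arguments are missing. The marker list and the helper names are assumptions for illustration.

```python
import re

# Arguments that appear in the current bootargs but not in the truncated one.
CURRENT_ONLY_ARGS = ("omapdss.def_disp=dvi", "omapdss.hdmimode=0", "consoleblank=0")


def bootargs(boot_txt):
    """Return the list of arguments from the `setenv bootargs "..."` line."""
    match = re.search(r'setenv bootargs "([^"]*)"', boot_txt)
    return match.group(1).split() if match else []


def looks_current(boot_txt):
    """True if every argument unique to the current version is present."""
    args = bootargs(boot_txt)
    return all(arg in args for arg in CURRENT_ONLY_ARGS)
```

Feed it the output of cat /home/cltbld/tmp/boot.txt; a False result suggests the panda needs reimaging.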
How to deal with failing pandas in mozpool
Non-talos tests for Android run via mozpool (see bug 829211). The devices should be in the "ready" state with the android image in mozpool to be able to be allocated to test jobs. The device is allocated by buildbot based on an idle slave that is available to the master. So mozpool doesn't allocate the specific device, buildbot does. When the "run_script" part of the job runs and invokes mozharness, the device is verified by mozpool. The mozharness scripts also ensure that the correct version of the sutagent is installed and update it if necessary.
- These messages in the buildbot log mean that the panda requested by buildbot was not made available in the ready state within the timeout parameters specified by mozpool. Sometimes this seems to be an intermittent error. If not, the panda should be disabled from production runs and a bug opened with DC Ops to investigate this device.
13:17:41 INFO - Waiting for request 'ready' stage. Current state: 'contacting_lifeguard'
13:18:41 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:19:41 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:20:41 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:21:41 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:22:41 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:23:41 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:24:41 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:25:41 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:26:41 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:27:41 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:28:42 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:29:42 INFO - Waiting for request 'ready' stage. Current state: 'failed_device_busy'
13:30:42 INFO - Waiting for request 'ready' stage. Current state: 'failed_device_busy'
13:31:42 INFO - Waiting for request 'ready' stage. Current state: 'failed_device_busy'
13:32:42 INFO - Waiting for request 'ready' stage. Current state: 'failed_device_busy'
13:33:42 INFO - Waiting for request 'ready' stage. Current state: 'failed_device_busy'
13:34:42 INFO - Waiting for request 'ready' stage. Current state: 'failed_device_busy'
13:35:42 INFO - Waiting for request 'ready' stage. Current state: 'failed_device_busy'
13:36:42 INFO - Waiting for request 'ready' stage. Current state: 'failed_device_busy'
13:37:42 ERROR - INFRA-ERROR: Request did not become ready in time
- Devices in the "failed_pxe_booting" state in mozpool. Not sure what to do in this state since requests to reboot or reimage are refused because the device isn't available on the network. Opening a bug with DC Ops to investigate this device is a good first step.
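Spotting the "not ready in time" case in a saved buildbot log can be sketched as follows. The patterns come from the log excerpt above; request_states and timed_out are hypothetical helpers, and the rest of the mozharness log format is not assumed.

```python
import re

STATE_RE = re.compile(r"Current state: '([^']+)'")
TIMEOUT_MARKER = "INFRA-ERROR: Request did not become ready in time"


def request_states(log_text):
    """Return the distinct mozpool request states, in order of first appearance."""
    seen = []
    for state in STATE_RE.findall(log_text):
        if state not in seen:
            seen.append(state)
    return seen


def timed_out(log_text):
    """True if the log contains the request timeout error."""
    return TIMEOUT_MARKER in log_text
```

A panda whose logs repeatedly show timed_out with a final state of failed_device_busy is a candidate for disabling and a DC Ops bug.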
References
One source of truth
As of Oct 2012, tools/buildfarm/mobile/devices.json should be the most authoritative document.
- if you find a tegra deployed that is not listed here, check slavealloc. Once found, file a releng bug to get devices.json updated.
- if you find a PDU not labeled per the devices.json file, file a releng bug to update the human labels.
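The first check above can be sketched as a set difference. This assumes you already have a list of deployed device names (for example from the /builds directories on the foopies) and that devices.json is keyed by device name; missing_from_devices_json is a hypothetical helper.

```python
def missing_from_devices_json(deployed, devices):
    """Return the deployed device names that have no entry in devices.json."""
    return sorted(set(deployed) - set(devices))
```

Any name it returns is worth checking in slavealloc before filing the releng bug.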