ReleaseEngineering/Archive/Android Tegras
Tegras have been decommissioned, but this doc contains relevant documentation for pandas.
Managing SUTagent
Update the SUTagent in the staging environment
Note: Foopy 110 is the staging foopy
ssh to relengwebadm as your userid and sudo to root.
Copy the new SUTagent apk into /mnt/netapp/relengweb/talos-bundles/mobile.
It should be called something like SUTagentAndroid.1.13.apk. To ensure it can be read, chmod o+r the file.
On each staging foopy, connect as cltbld and run through the following steps (a sketch for waiting out the long stop step follows the list):
- screen -x
- cd sut_tools
- edit target_version in updateSUT.py to reflect the new version
- cd ..
- ./stop_cp.sh
- detach from screen; this will take 45-60 minutes to bring down all the attached pandas
- ./start_cp.sh
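Because stopping clientproxy can take 45-60 minutes, it helps to wait for the processes to exit before restarting. A minimal sketch, assuming the clientproxy processes are identifiable by the clientproxy.py script name (matching the symlink layout described under Environment below):

# run on the foopy after ./stop_cp.sh; poll until every clientproxy.py process has exited
while pgrep -f clientproxy.py > /dev/null; do
  sleep 60
done
./start_cp.sh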
Update the SUTagent in production
ssh to relengwebadm as your userid and sudo to root.
Copy the new SUTagent apk into /mnt/netapp/relengweb/talos-bundles/mobile.
It should be called something like SUTagentAndroid.1.17.apk. To ensure it can be read, chmod o+r the file.
Land the patch to update the version, for instance bug 827873. Graceful all the masters that control the tegras and pandas, then update all the foopies with the latest version of the tools repo using the handy script below.
To avoid blue builds when the sutagent tries to update: throw a disabled.flg/error.flg at all devices temporarily (a sketch follows the command below).
python manage_foopies.py -f ~/hg/tools/buildfarm/mobile/devices.json update -H all
Bug 827903 has been logged for suggestions on how to improve this process.
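A minimal sketch of flagging every device on a foopy; the glob and the message are illustrative, and the flag files live in each device's /builds/panda-XXXX dir as described under "Disabling a Panda" below:

# run as cltbld on each foopy; record why the devices are flagged
for d in /builds/panda-*; do
  echo "temporarily disabled for SUTagent update" > "$d/disabled.flg"
done
# once the update is complete, remove the flags:
# for d in /builds/panda-*; do rm -f "$d/disabled.flg"; done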
How to tell what version of SUTagent is running on a device
telnet device-hostname 20701
Then issue the ver command at the prompt. A sample session:
Escape character is '^]'.
$>activity
##AGENT-WARNING## [activity] command is currently not implemented.
$>ver
SUTAgentAndroid Version 1.15
$>activity
##AGENT-WARNING## [activity] command is currently not implemented.
$>
If the version on the device doesn't match the one in /builds/tools/sut_tools/updateSUT.py, verify is failing to update the SUTagent on that device.
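For a non-interactive check, something like the following works. A sketch only: it assumes nc is available on the foopy and that the agent answers a piped ver/quit the same way it answers an interactive telnet session (both assumptions worth verifying locally):

# query the agent's version over the SUT port
printf 'ver\nquit\n' | nc -w 5 device-hostname 20701
# compare against the version the foopy expects
grep target_version /builds/tools/sut_tools/updateSUT.py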
Environment
The panda builders are run on multiple "foopy" servers, with about 10-12 pandas per foopy. Each panda has its own buildslave environment, and they share common tool and talos environments -- all found inside of /builds. (A quick sanity check of this layout is sketched below the list.)
- Each panda has a /builds/panda-### containing a buildslave and a symlink to /builds/tools/sut_tools/clientproxy.py
- All of the shared talos info is in /builds/talos-data and HG is used to maintain it
- All of the sut related helper code is found in /builds/sut_tools (a symlink to /builds/tools/sut_tools/)
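A quick way to sanity-check this layout on a foopy; a sketch that assumes each device dir's clientproxy symlink is named clientproxy.py (which matches the layout above, but verify on your foopy):

# flag any device dir that is missing its clientproxy symlink
for d in /builds/panda-*; do
  [ -e "$d/clientproxy.py" ] || echo "$d: missing clientproxy.py symlink"
done
# confirm the shared dirs and symlinks exist
ls -ld /builds/talos-data /builds/sut_tools /builds/tools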
Foopies
Update software on the foopies
You can update the version of the tools repo on all foopies using fabric.
python manage_foopies.py -H [all|foopyNN] [-H [foopyNN]]* <action>
where action can be (currently) "update" or "show_revision"
See bug 809763 or bug 846354 for details. This usually happens automatically on reconfig, so as a general rule it's not something we need to do manually.
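For example, to update two specific foopies and then confirm the revision they're running (hostnames illustrative, -f path as used elsewhere on this page):

python manage_foopies.py -f ~/hg/tools/buildfarm/mobile/devices.json -H foopy39 -H foopy40 update
python manage_foopies.py -f ~/hg/tools/buildfarm/mobile/devices.json -H foopy39 -H foopy40 show_revision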
Pandas
Determining what Pandas are attached to a Chassis
At the moment, there are two ways you could look up this information:
- Check the name of the relay board in the notes section for a given panda. That will also tell you the chassis name.
- Use the rack-level view in inventory, like so:
https://inventory.mozilla.org/en-US/systems/racks/?rack=224
You can use the rack location in the first column to identify which boards and which relay are sharing a chassis.
In general, though, what you probably care about more is which relay a panda is attached to. Each production panda has a k/v store key of system.relay.0 that will tell you which relay it's hooked up to (which corresponds to the chassis as well) and which bank and slot on the relay.
Disabling a Panda
All pandas can be disabled via the disabled.flg file on its corresponding foopy (as listed in devices.json). The disabled.flg file goes in the /builds/panda-XXXX/ dir on the foopy. Please add any information as to *why* the panda is being disabled to the disabled.flg file (bug #, etc).
There is no need to disable a panda via the Mozpool/Lifeguard UI, except in rare circumstances.
If you find a panda that does not have an entry in devices.json, that is a bug, not a feature. The best way to deal with it is to cssh across all foopies looking for /builds/panda-XXXX/ (or script the search, as sketched below).
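If you prefer a plain loop to cssh, something like this works; a sketch that assumes a foopies.txt file listing foopy hostnames (a hypothetical helper file) and working ssh access as cltbld:

# report which foopy, if any, hosts the device dir for panda-0123 (example device)
while read -r f; do
  ssh "cltbld@$f" "ls -d /builds/panda-0123" 2>/dev/null && echo "found on $f"
done < foopies.txt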
Rebooting Pandas
Go to the Lifeguard UI at http://mobile-imaging-010.p10.releng.scl3.mozilla.com/ui/lifeguard.html. Select the pandas you want to reboot, select please_power_cycle in the drop-down menu in the bottom left-hand corner of the page, and select go. Their status should show rebooting.
Moving replacement pandas into production
Pandas die and the hardware needs to be replaced. DC Ops have several spare pandas they use for replacement purposes. The new pandas will have different hostnames than the old ones. The procedure to replace a dead panda from the releng side is:
- If the panda is used for Android tests, it needs to be imaged in Lifeguard with the android image. After the imaging process has completed, you need to set its status to "locked_out".
- Since the hostnames are different, the range of pandas allocated to production may need to be modified in mozilla-tests/production_config.py.
- The devices.json file needs to be modified to remove the old panda's definition and include the new pandas. See bug 841486 for an example of the changes required. See https://bugzilla.mozilla.org/show_bug.cgi?id=836857#c2 for an example of the old to new panda mapping from DC Ops.
- Update or add the directory on the appropriate foopy with the new panda name, and ensure the appropriate buildbot master is listed in the buildbot.tac (see the sketch after this list)
- Update slavealloc as appropriate to include or enable the pandas
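A sketch of the per-device work on the foopy; the device name is illustrative, the symlink layout follows the Environment section above, and the create_device_dirs action shown later on this page can do the directory creation for you:

# create the device dir and clientproxy symlink for the replacement panda
mkdir -p /builds/panda-0123
ln -s /builds/tools/sut_tools/clientproxy.py /builds/panda-0123/clientproxy.py
# confirm the buildslave points at the intended master
grep buildmaster_host /builds/panda-0123/buildbot.tac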
How to run smoketests on the pandas
jmaher blogged about it
http://elvis314.wordpress.com/2013/03/01/smoketest-for-firefox-android-on-panda-boards/
How to determine the version of preseed running on a panda
- Put the panda in question into maintenance mode via mozpool. In lifeguard, ensure that the panda is in an unlocked state; if not, unlock it. Then from the drop-down image menu, select please_start_maintenance. (This is the same menu that has please_image.)
- Wait for it to reboot and appear as in maintenance_mode state.
- ssh to the device as the build user (the current password dates from when the devices were imaged, so n-1); the commands that follow are consolidated in the sketch after this list
- mkdir /home/cltbld/tmp
- sudo -i mount /dev/mmcblk0p1 /home/cltbld/tmp
- cat /home/cltbld/tmp/boot.txt
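The same steps as a single block, with an unmount added at the end for tidiness (the unmount is an addition, not part of the original procedure):

mkdir -p /home/cltbld/tmp
sudo -i mount /dev/mmcblk0p1 /home/cltbld/tmp   # boot partition of the SD card
cat /home/cltbld/tmp/boot.txt
sudo -i umount /home/cltbld/tmp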
The current version looks like this
setenv bootargs "console=ttyO2,115200n8 rootwait ro earlyprintk fixrtc nocompcache vram=48M omapfb.vram=0:24M,1:24M mem=456M@0x80000000 mem=512M@0xA0000000 init=/init androidboot.console=ttyO2 omapdss.def_disp=dvi omapdss.hdmimode=0 omapdss.mode=dvi:640x480MR-24@60 consoleblank=0"
setenv bootandroid "echo Booting Android from SDcard; fatload mmc 0:1 0x80200000 uImage; fatload mmc 0:1 0x81600000 uInitrd; bootm 0x80200000 0x81600000"
setenv bootpxefirst "echo Launching PXE boot... ; if usb start; then set autoload no; bootp; if pxe get; then pxe boot; fi; fi; usb stop; run bootandroid"
dividehex assures me this is the current version:
kmoir: that looks correct. the previous version had the bootargs truncated like so "setenv bootargs "console=ttyO2,115200n8 rootwait ro earlyprintk fixrtc nocompcache vram=48M omapfb.vram=0:24M,1:24M mem=456M@0x80000000 mem=512M@0xA0000000 init=/init androidboot.console=ttyO2"
- Reimage the devices with the correct image as appropriate. You'll have to force the devices into the free state, reimage, and then lock if required.
How to deal with failing pandas in mozpool
Non-talos tests for Android run via mozpool (bug 829211). The devices should be in the "ready" state with the android image in mozpool to be able to be allocated to test jobs. The device is allocated by buildbot based on an idle slave that is available to the master, so mozpool doesn't allocate the specific device, buildbot does. When the "run_script" part of the job runs and invokes mozharness, the device is verified by mozpool. The mozharness scripts also ensure that the correct version of the sutagent is installed, and update it if necessary.
- These messages in the buildbot log mean that the panda requested by buildbot was not made available in the ready state within the timeout parameters specified by mozpool. Sometimes this is an intermittent error. If not, the panda should be disabled from production runs and a bug opened with DC Ops to investigate the device. (A quick way to scan a log for this pattern is sketched after this list.)
13:17:41 INFO - Waiting for request 'ready' stage. Current state: 'contacting_lifeguard'
13:18:41 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:19:41 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:20:41 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:21:41 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:22:41 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:23:41 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:24:41 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:25:41 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:26:41 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:27:41 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:28:42 INFO - Waiting for request 'ready' stage. Current state: 'pending'
13:29:42 INFO - Waiting for request 'ready' stage. Current state: 'failed_device_busy'
13:30:42 INFO - Waiting for request 'ready' stage. Current state: 'failed_device_busy'
13:31:42 INFO - Waiting for request 'ready' stage. Current state: 'failed_device_busy'
13:32:42 INFO - Waiting for request 'ready' stage. Current state: 'failed_device_busy'
13:33:42 INFO - Waiting for request 'ready' stage. Current state: 'failed_device_busy'
13:34:42 INFO - Waiting for request 'ready' stage. Current state: 'failed_device_busy'
13:35:42 INFO - Waiting for request 'ready' stage. Current state: 'failed_device_busy'
13:36:42 INFO - Waiting for request 'ready' stage. Current state: 'failed_device_busy'
13:37:42 ERROR - INFRA-ERROR: Request did not become ready in time
- Devices in the "failed_pxe_booting" state in mozpool: it's not clear what to do here, since requests to reboot or reimage are refused because the device isn't available on the network. Opening a bug with DC Ops to investigate the device is a good first step.
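To scan a saved buildbot/mozharness log for this failure mode (filename illustrative):

# summarize how long the request sat in each state, then surface the final error
grep -oE "Current state: '[a-z_]+'" run.log | sort | uniq -c
grep "INFRA-ERROR" run.log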
Releng changes required when moving panda racks from one data centre to another
We moved the pandas from scl1 to scl3 in bug 1001705. Here are the changes needed to implement that from the releng side of things.
To disable pandas
Follow the steps here https://bugzilla.mozilla.org/show_bug.cgi?id=1001705#c1
To enable pandas
- update devices.json in the tools repo so relays point to the correct data centre. Example: https://bugzilla.mozilla.org/attachment.cgi?id=8419438&action=edit. With the next reconfig, ensure that the latest version of the tools repo is updated on the foopies
- Relops will reimage the foopies. Two post imaging steps are needed on each foopy:
- The directories in /builds will need to be recreated. Use:
tools/buildfarm/maintenance/manage_foopies.py -f ../mobile/devices.json -H $foopy create_device_dirs
- Make sure the "tools" checkout is up to date. Use:
for f in $(grep ^foop ~1/train_2_short.txt | cut -d, -f1) ; do ./manage_foopies.py -f ../mobile/devices.json -H $f update ; done
for f in $(grep ^foop ~1/train_2_short.txt | cut -d, -f1) ; do ./manage_foopies.py -f ../mobile/devices.json -H $f show_revision ; done
- For good measure (i.e. optional but good if you have time), reimage the panda devices in the Lifeguard UI of mozpool: http://mobile-imaging-010.p10.releng.scl3.mozilla.com/ui/lifeguard.html
- In slavealloc db: update the datacentre they belong to and enable the devices
References
One source of truth
As of Oct 2012, tools/buildfarm/mobile/devices.json should be the most authoritative document.
- if you find a tegra deployed that is not listed here, check slavealloc. Once found, file a releng bug to get devices.json updated.
- if you find a PDU not labeled per the devices.json file, file a releng bug to update the human labels.