ReleaseEngineering/Archive/Android Tegras

From MozillaWiki
{{Release Engineering How To|Android Pandas}}
This page is obsolete.
 
'''<font color='red'>Tegras have been decommissioned, but this doc contains relevant documentation for pandas.</font>'''
 
= Managing SUTagent =
 
== Update the SUTagent in the staging environment ==
 
Note: Foopy 110 is the staging foopy.
 
Connect to relengwebadm (ssh -l userid) and sudo to root.

Copy the new SUTagent apk to this directory:
cd /mnt/netapp/relengweb/talos-bundles/mobile

It should be called something like SUTagentAndroid.1.13.apk. To ensure it can be read, chmod o+r the file.
 
On each staging foopy, connect as cltbld and run these steps:
* <tt>screen -x</tt>
* <tt>cd sut_tools</tt>
* Edit target_version in updateSUT.py to reflect the new version.
* <tt>cd ..</tt>
* <tt>./stop_cp.sh</tt>
* Detach from screen. It takes 45-60 minutes to bring down all the attached pandas.
* <tt>./start_cp.sh</tt>
 
== Update the SUTagent in production ==
 
Connect to relengwebadm (ssh -l userid) and sudo to root.

Copy the new SUTagent apk to this directory:
cd /mnt/netapp/relengweb/talos-bundles/mobile

It should be called something like SUTagentAndroid.1.17.apk. To ensure it can be read, chmod o+r the file.
 
Land the patch to update the version, for instance {{bug|827873}}.  Update all the foopies with the latest version of the tools repo using the manage_foopies.py command shown below, and graceful all the masters that control the tegras and pandas.

To avoid blue jobs when the SUTagent tries to update, temporarily throw a disabled.flg/error.flg at all devices.
 
python manage_foopies.py -f ~/hg/tools/buildfarm/mobile/devices.json update -H all
 
{{bug|827903}} has been logged for suggestions on how to improve this process
 
== How to tell what version of SUTagent is running on a device ==
<pre>
telnet device-hostname 20701
ver
</pre>
 
<pre>
Escape character is '^]'.
$>activity
##AGENT-WARNING## [activity] command is currently not implemented.
$>ver
SUTAgentAndroid Version 1.15
$>activity
##AGENT-WARNING## [activity] command is currently not implemented.
$>
</pre>
 
If the version on the device doesn't match the target_version in /builds/tools/sut_tools/updateSUT.py, verify is failing to update the SUTagent on the device.
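The comparison above can be sketched in Python. This is only an illustration, not part of sut_tools: the function names are hypothetical, and the sketch assumes the <tt>ver</tt> reply always has the form shown in the sample output.

```python
import re


def parse_sutagent_version(response):
    """Return the version string from a SUTagent 'ver' reply, or None."""
    # Assumes the reply format shown in the sample session above.
    match = re.search(r"SUTAgentAndroid Version (\d+(?:\.\d+)*)", response)
    return match.group(1) if match else None


def needs_update(response, target_version):
    """True when the device does not report exactly target_version."""
    return parse_sutagent_version(response) != target_version


# Sample reply copied from the telnet session above.
sample = "$>ver\nSUTAgentAndroid Version 1.15\n$>"
print(parse_sutagent_version(sample))
print(needs_update(sample, "1.17"))
```

A reply that cannot be parsed is treated as needing an update, which errs on the side of flagging the device for a closer look.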
 
= Environment =
 
The panda builders are run on multiple "foopy" servers with about 10-12 pandas per foopy.  Each panda has its own buildslave environment and they share common tool and talos environments -- all found inside of '''/builds'''.
 
* Each panda has a '''/builds/panda-###''' containing a buildslave and a symlink to /builds/tools/sut_tools/clientproxy.py
* All of the shared talos info is in '''/builds/talos-data''' and HG is used to maintain it
* All of the sut related helper code is found in '''/builds/sut_tools''' (a symlink to /builds/tools/sut_tools/)
 
= Foopies =
 
== Update software on the foopies ==
 
You can update the version of the tools repo on all foopies using fabric. 
 
python manage_foopies.py -H [all|foopyNN] [-H [foopyNN]]* <action>
 
where action can be (currently) "update" or "show_revision"
 
See {{bug|809763}} or {{bug|846354}} for details.  This usually happens automatically on reconfig so it's not something we need to do manually as a general rule.
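As a rough sketch, the invocation pattern above can be assembled programmatically. The helper name and the foopy host names are hypothetical, the <tt>-f</tt> argument follows the production example earlier on this page, and only the two actions listed above are accepted.

```python
def build_manage_foopies_command(hosts, action, devices_json="devices.json"):
    """Build an argument list for manage_foopies.py.

    hosts is a list such as ["all"] or ["foopy01", "foopy02"]
    (hypothetical names); action is "update" or "show_revision".
    """
    allowed_actions = ("update", "show_revision")
    if action not in allowed_actions:
        raise ValueError("unsupported action: %s" % action)
    if not hosts:
        raise ValueError("at least one host (or 'all') is required")
    command = ["python", "manage_foopies.py", "-f", devices_json]
    for host in hosts:
        # One -H flag per host, matching the usage line above.
        command += ["-H", host]
    command.append(action)
    return command


print(" ".join(build_manage_foopies_command(["all"], "show_revision")))
```

Returning a list rather than a string keeps the result suitable for <tt>subprocess.run</tt> without shell quoting concerns.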
 
= Pandas =
 
== Determining what Pandas are attached to a Chassis ==
 
At the moment, there are two ways you could look up this information:
 
* Check the name of the relay board in the notes section for a given panda. That will also tell you the chassis name.
* Use the rack-level view in inventory, like so:
 
https://inventory.mozilla.org/en-US/systems/racks/?rack=224
 
You can use the rack location in the first column to identify which boards and which relay are sharing a chassis.
 
Also, in general what you probably care more about is which relay it's attached to.  Each production panda has a k/v store key of system.relay.0 that will tell you which relay it's hooked up to (which corresponds to chassis as well) and which bank and slot on the relay.
 
== Disabling a Panda ==
 
Any panda can be disabled via a disabled.flg file on its corresponding foopy (as listed in [http://hg.mozilla.org/build/tools/file/default/buildfarm/mobile/devices.json devices.json]). The disabled.flg file goes in the /builds/panda-XXXX/ dir on the foopy. Please add any information as to *why* the panda is being disabled to the disabled.flg file (bug #, etc).
 
<em>There is no need to disable a panda via the Mozpool/Lifeguard UI, except in rare circumstances.</em>
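A minimal sketch of creating the flag file with a reason recorded in it, assuming it runs on the foopy with write access to /builds. The function name is hypothetical and the demonstration uses a temporary directory in place of /builds.

```python
import os
import tempfile


def disable_panda(builds_root, panda_name, reason):
    """Write disabled.flg into the panda's directory, recording why."""
    device_dir = os.path.join(builds_root, panda_name)
    if not os.path.isdir(device_dir):
        # Refuse to create a directory for a device this foopy does not host.
        raise FileNotFoundError(device_dir)
    flag_path = os.path.join(device_dir, "disabled.flg")
    with open(flag_path, "w") as flag:
        flag.write(reason + "\n")
    return flag_path


# Demonstration against a throwaway directory standing in for /builds.
demo_root = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_root, "panda-XXXX"))
print(disable_panda(demo_root, "panda-XXXX", "disabled pending investigation, see bug NNNNNN"))
```

On a real foopy the first argument would be <tt>/builds</tt> and the reason would name the actual bug.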
 
If you find a panda that does not have an entry in devices.json, that is a bug, not a feature.  To deal with that, the best way is to do a cssh across all foopies looking for /builds/panda-XXXX/.
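Once the directory names have been collected from the foopies, the cross-check against devices.json can be sketched as below. This assumes devices.json is a JSON object keyed by device name, which should be confirmed against the file itself; the function name and the sample entries are hypothetical.

```python
import json


def pandas_missing_from_devices(devices, panda_dirs):
    """Return directory names that have no entry in the devices mapping."""
    return sorted(name for name in panda_dirs if name not in devices)


# Hypothetical sample data; the real file lives in the tools repo.
devices = json.loads('{"panda-0001": {"foopy": "foopy01"}}')
found_on_foopies = ["panda-0001", "panda-0002"]
print(pandas_missing_from_devices(devices, found_on_foopies))
```

Any name it reports is a candidate for a releng bug to get devices.json updated.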
 
== Rebooting Pandas ==
 
Go to the Lifeguard UI at http://mobile-imaging-010.p10.releng.scl3.mozilla.com/ui/lifeguard.html.  Select the pandas you want to reboot.  Select please_power_cycle in the drop down menu in the bottom left hand corner of the page.  Select go. Their status should show rebooting.
 
==Moving replacement pandas into production==
 
Pandas die and the hardware needs to be replaced.  DC Ops have several spare pandas they use for replacement purposes. The new pandas will have different hostnames than the old ones.  The procedure to replace a dead panda from the releng side is:
* If the panda is used for Android tests, it needs to be imaged in Lifeguard with the android image. After the imaging process has completed, you need to set its status to "locked_out".
* Since the hostnames are different, the range of pandas allocated to production may need to be modified in mozilla-tests/production_config.py.
* The devices.json file needs to be modified to remove the old panda's definition and include the new pandas.  See {{bug|841486}} for an example of the changes required. See https://bugzilla.mozilla.org/show_bug.cgi?id=836857#c2 for an example of the old to new panda mapping from DC Ops.
* Update or add the directory on the appropriate foopy with the new panda name.  Ensure the appropriate buildbot master is listed in the buildbot.tac.
* Update slavealloc as appropriate to include or enable the pandas
 
==How to run smoketests on the pandas==
 
jmaher blogged about it
 
http://elvis314.wordpress.com/2013/03/01/smoketest-for-firefox-android-on-panda-boards/
 
==How to determine the version of preseed running on a panda==
 
* Put the panda in question into maintenance mode via mozpool.  In lifeguard, ensure that the panda is in an unlocked state; if it is not, unlock it.  Then from the drop down image menu, select please_start_maintenance.  (This is the same menu that has please_image.)
* Wait for it to reboot and appear in the maintenance_mode state.
* ssh to the device as the build user (current password is from the time the devices were imaged so n-1)
* mkdir /home/cltbld/tmp
* sudo -i mount /dev/mmcblk0p1 /home/cltbld/tmp
* cat /home/cltbld/tmp/boot.txt
 
The current version looks like this
 
<pre>
setenv bootargs "console=ttyO2,115200n8 rootwait ro earlyprintk fixrtc nocompcache vram=48M omapfb.vram=0:24M,1:24M mem=456M@0x80000000 mem=512M@0xA0000000 init=/init androidboot.console=ttyO2 omapdss.def_disp=dvi omapdss.hdmimode=0 omapdss.mode=dvi:640x480MR-24@60 consoleblank=0"
setenv bootandroid "echo Booting Android from SDcard; fatload mmc 0:1 0x80200000 uImage; fatload mmc 0:1 0x81600000 uInitrd; bootm 0x80200000 0x81600000"
setenv bootpxefirst "echo Launching PXE boot... ; if usb start; then set autoload no; bootp; if pxe get; then pxe boot; fi; fi; usb stop; run bootandroid"
</pre>
 
Which dividehex assures me is the current version
<pre>
kmoir: that looks correct. the previous version had the bootargs truncated like so "setenv bootargs "console=ttyO2,115200n8 rootwait ro earlyprintk fixrtc nocompcache vram=48M omapfb.vram=0:24M,1:24M mem=456M@0x80000000 mem=512M@0xA0000000 init=/init androidboot.console=ttyO2"
</pre>
 
* Reimage the devices with the correct image as appropriate.  You'll have to force the devices into the free state, reimage, and then lock if required.
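The difference between the current and previous preseed can be checked mechanically. The sketch below assumes the bootargs sit on a single <tt>setenv bootargs</tt> line in boot.txt and treats the presence of <tt>omapdss.def_disp=dvi</tt> as the marker of the current version, since the truncated version quoted above ends before it. The function names are hypothetical.

```python
def bootargs_line(boot_txt):
    """Return the 'setenv bootargs' line from boot.txt contents, or None."""
    for line in boot_txt.splitlines():
        if line.startswith("setenv bootargs"):
            return line
    return None


def has_current_bootargs(boot_txt):
    """True when bootargs include the display settings the old preseed lacked."""
    line = bootargs_line(boot_txt)
    return line is not None and "omapdss.def_disp=dvi" in line


# Shortened examples; "..." stands for the omitted leading arguments.
current = 'setenv bootargs "... init=/init androidboot.console=ttyO2 omapdss.def_disp=dvi consoleblank=0"'
previous = 'setenv bootargs "... init=/init androidboot.console=ttyO2"'
print(has_current_bootargs(current), has_current_bootargs(previous))
```

The input would be the text printed by the <tt>cat /home/cltbld/tmp/boot.txt</tt> step.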
 
== How to deal with failing pandas in mozpool ==
 
Non-talos tests for Android run via mozpool (see {{bug|829211}}).  The devices should be in the "ready" state with the android image in mozpool to be able to be allocated to test jobs.  The device is allocated by buildbot based on an idle slave that is available to the master.  So mozpool doesn't allocate the specific device, buildbot does. When the "run_script" part of the job runs and invokes mozharness, the device is verified by mozpool.  The mozharness scripts also ensure that the correct version of the sutagent is installed and update it if necessary.
 
* The following messages in the buildbot log mean that the panda requested by buildbot was not made available in the ready state within the timeout specified by mozpool.  Sometimes this seems to be an intermittent error.  If not, the panda should be disabled from production runs and a bug opened with DC Ops to investigate this device.
 
<pre>
13:17:41    INFO - Waiting for request 'ready' stage.  Current state: 'contacting_lifeguard'
13:18:41    INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:19:41    INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:20:41    INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:21:41    INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:22:41    INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:23:41    INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:24:41    INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:25:41    INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:26:41    INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:27:41    INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:28:42    INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:29:42    INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:30:42    INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:31:42    INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:32:42    INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:33:42    INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:34:42    INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:35:42    INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:36:42    INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:37:42    ERROR - INFRA-ERROR: Request did not become ready in time
</pre>
 
* Devices in the "failed_pxe_booting" state in mozpool.  It is not clear what to do in this state, since requests to reboot or reimage are refused because the device isn't available on the network. Opening a bug with DC Ops to investigate this device is a good first step.
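When triaging many logs, the request states in a log excerpt like the one in this section can be tallied with a short script. This is a sketch that assumes the message format shown in that excerpt; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the state reported on each "Waiting for request" line.
STATE_RE = re.compile(r"Current state: '([^']+)'")


def summarize_request_log(log_text):
    """Count the mozpool states seen and report whether the request timed out."""
    states = Counter(STATE_RE.findall(log_text))
    timed_out = "Request did not become ready in time" in log_text
    return states, timed_out


# A few lines taken from the excerpt in this section.
sample_log = "\n".join([
    "13:17:41    INFO - Waiting for request 'ready' stage.  Current state: 'contacting_lifeguard'",
    "13:18:41    INFO - Waiting for request 'ready' stage.  Current state: 'pending'",
    "13:29:42    INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'",
    "13:37:42    ERROR - INFRA-ERROR: Request did not become ready in time",
])
states, timed_out = summarize_request_log(sample_log)
print(dict(states), timed_out)
```

A device that times out repeatedly, rather than intermittently, is the one to disable and report to DC Ops.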
 
== Releng changes required when moving panda racks from one data centre to another ==
 
We moved the pandas from scl1 to scl3 in {{bug|1001705}}.  Here are the changes needed to implement that from the releng side of things.
 
===To disable pandas===
 
Follow the steps here https://bugzilla.mozilla.org/show_bug.cgi?id=1001705#c1
 
===To enable pandas===
* Update devices.json in the tools repo so relays point to the correct data centre.  Example: https://bugzilla.mozilla.org/attachment.cgi?id=8419438&action=edit.  With the next reconfig, [https://wiki.mozilla.org/ReleaseEngineering/How_To/Android_Tegras#Update_software_on_the_foopies ensure that the latest version of the tools repo is updated on the foopies].
* Relops will reimage the foopies. Two post imaging steps are needed on each foopy:
** The directories in /builds will need to be recreated. Use:
  tools/buildfarm/maintenance/manage_foopies.py -f ../mobile/devices.json -H $foopy create_device_dirs
** Make sure the "tools" checkout is up to date. Use:
    for f in $(grep ^foop ~1/train_2_short.txt | cut -d, -f1) ; do
        ./manage_foopies.py -f ../mobile/devices.json -H $f update
    done
 
    for f in $(grep ^foop ~1/train_2_short.txt | cut -d, -f1) ; do
        ./manage_foopies.py -f ../mobile/devices.json -H $f show_revision
    done
* For good measure (i.e. optional-but-good-if-you-have-time), I reimage the panda devices in lifeguard ui of mozpool http://mobile-imaging-010.p10.releng.scl3.mozilla.com/ui/lifeguard.html
* In slavealloc db: update the datacentre they belong to and enable the devices
 
===To loan pandas===
* disable the panda and all associated pandas on that foopy in slavealloc
* remove the buildbot.tac on the associated foopy and any keys
 
= References =
 
== One source of truth ==
As of Oct 2012, [https://hg.mozilla.org/build/tools/file/default/buildfarm/mobile/devices.json <tt>tools/buildfarm/mobile/devices.json</tt>] should be the most authoritative document.
* if you find a tegra deployed that is not listed here, check [http://slavealloc.build.mozilla.org/ui/ slavealloc]. Once found, file a releng bug to get <tt>devices.json</tt> updated.
* if you find a PDU not labeled per the <tt>devices.json</tt> file, file a releng bug to update the human labels.

Latest revision as of 01:01, 11 July 2017
