ReleaseEngineering/Archive/Android Tegras: Difference between revisions

From MozillaWiki
Revision as of 02:08, 7 October 2015


Tegras have been decommissioned, but this doc contains relevant documentation for pandas.

Managing SUTagent

Update the SUTagent in the staging environment

Note: Foopy 110 is the staging foopy.

ssh -l userid to relengwebadm and sudo to root

Copy the new SUTagent apk to /mnt/netapp/relengweb/talos-bundles/mobile

It should be called something like SUTagentAndroid.1.13.apk. To ensure it can be read, chmod o+r the file.
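The readability step can be checked from a script. The following is a minimal sketch, assuming a POSIX filesystem; the helper names are hypothetical and are not part of the tools repo.

```python
import os
import stat


def is_world_readable(path):
    """Return True if 'others' have read permission (the bit chmod o+r sets)."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)


def make_world_readable(path):
    """Equivalent of `chmod o+r path`: add the others-read bit, keep the rest."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IROTH)
```

For example, calling make_world_readable() on the copied apk and then asserting is_world_readable() confirms the file can be fetched before any foopy is touched.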

On each staging foopy, connect as cltbld

screen -x
cd sut_tools
edit target_version in updateSUT.py to reflect the new version
cd ..
./stop_cp.sh
detach from screen; stop_cp.sh will take 45-60 minutes to bring down all the attached pandas
./start_cp.sh

Update the SUTagent in production

ssh -l userid to relengwebadm and sudo to root

Copy the new SUTagent apk to /mnt/netapp/relengweb/talos-bundles/mobile

It should be called something like SUTagentAndroid.1.17.apk. To ensure it can be read, chmod o+r the file.

Land the patch to update the version, for instance bug 827873. Graceful all the masters that control the tegras and pandas. Then update the foopies with the latest version of the tools repo, using the manage_foopies.py command shown below.

To avoid blue jobs when the SUTagent tries to update, temporarily put a disabled.flg or error.flg on all devices.

python manage_foopies.py -f ~/hg/tools/buildfarm/mobile/devices.json update -H all

bug 827903 has been logged for suggestions on how to improve this process

How to tell what version of SUTagent is running on a device

telnet device-hostname 20701
ver
Escape character is '^]'.
$>activity
##AGENT-WARNING## [activity] command is currently not implemented.
$>ver
SUTAgentAndroid Version 1.15
$>activity
##AGENT-WARNING## [activity] command is currently not implemented.
$>

If the version on the device doesn't match the one in /builds/tools/sut_tools/updateSUT.py, you have a problem where verify is failing to update the SUTagent on the device.
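The telnet check can also be scripted. This is a rough sketch rather than an existing tool: the function names are hypothetical, and the "\r\n" line ending and the "$>" prompt handling are assumptions drawn from the transcript above.

```python
import re
import socket

SUT_PORT = 20701  # port used in the telnet example


def parse_sut_version(output):
    """Extract e.g. '1.15' from 'SUTAgentAndroid Version 1.15'; None if absent."""
    match = re.search(r"SUTAgentAndroid Version (\d+(?:\.\d+)*)", output)
    return match.group(1) if match else None


def query_sut_version(host, port=SUT_PORT, timeout=10.0):
    """Send 'ver' to the agent and return the version it reports, or None."""
    data = b""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"ver\r\n")  # assumed line ending
        try:
            while True:
                chunk = sock.recv(1024)
                if not chunk:
                    break
                data += chunk
                # Stop once the version line is followed by a fresh prompt.
                if b"Version" in data and data.rstrip().endswith(b"$>"):
                    break
        except socket.timeout:
            pass  # fall through and parse whatever arrived
    return parse_sut_version(data.decode("ascii", "replace"))
```

Comparing the returned string with target_version in updateSUT.py shows whether a device missed the update.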

Environment

The panda builders are run on multiple "foopy" servers with about 10-12 pandas per foopy. Each panda has its own buildslave environment and they share common tool and talos environments, all found inside of /builds.

  • Each panda has a /builds/panda-### containing a buildslave and a symlink to /builds/tools/sut_tools/clientproxy.py
  • All of the shared talos info is in /builds/talos-data and HG is used to maintain it
  • All of the sut related helper code is found in /builds/sut_tools (a symlink to /builds/tools/sut_tools/)

Foopies

Update software on the foopies

You can update the version of the tools repo on all foopies using fabric.

python manage_foopies.py -H [all|foopyNN] [-H [foopyNN]]* <action>

where action can be (currently) "update" or "show_revision"

See bug 809763 or bug 846354 for details. This usually happens automatically on reconfig so it's not something we need to do manually as a general rule.

Pandas

Determining what Pandas are attached to a Chassis

At the moment, there are two ways you could look up this information:

  • Check the name of the relay board in the notes section for a given panda. That will also tell you the chassis name.
  • Use the rack-level view in inventory, like so:

https://inventory.mozilla.org/en-US/systems/racks/?rack=224

You can use the rack location in the first column to identify which boards and which relay are sharing a chassis.

Also, in general what you probably care more about is which relay it's attached to. Each production panda has a k/v store key of system.relay.0 that will tell you which relay it's hooked up to (which corresponds to chassis as well) and which bank and slot on the relay.

Disabling a Panda

Any panda can be disabled via a disabled.flg file on its corresponding foopy (as listed in devices.json). The disabled.flg file goes in the /builds/panda-XXXX/ dir on the foopy. Please add any information as to *why* the panda is being disabled to the disabled.flg file (bug #, etc).
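Creating the flag file with a reason can be sketched as follows. The helper name is hypothetical and the reason text is a placeholder. It appends rather than overwrites so that an earlier note in the file is not lost.

```python
import os


def disable_panda(panda_dir, reason):
    """Create (or append to) disabled.flg in panda_dir, recording the reason."""
    flag_path = os.path.join(panda_dir, "disabled.flg")
    with open(flag_path, "a") as flag:
        flag.write(reason.rstrip("\n") + "\n")
    return flag_path


# Example, run on the foopy that hosts the panda (placeholder values):
# disable_panda("/builds/panda-XXXX", "bug NNNNNN: reason for disabling")
```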

There is no need to disable a panda via the Mozpool/Lifeguard UI, except in rare circumstances.

If you find a panda that does not have an entry in devices.json, that is a bug, not a feature. To deal with that, the best way is to do a cssh across all foopies looking for /builds/panda-XXXX/.
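Looking up a panda's foopy in devices.json can be scripted before resorting to cssh. The sketch below assumes devices.json is a JSON object keyed by device name, with each entry carrying a "foopy" field. That layout is an assumption, so check it against the actual file. The function names are hypothetical.

```python
import json


def load_devices(path):
    """Load devices.json (assumed layout: {device_name: {"foopy": ..., ...}})."""
    with open(path) as handle:
        return json.load(handle)


def foopy_for(devices, device_name):
    """Return the foopy listed for a device, or None if the device is missing."""
    entry = devices.get(device_name)
    return entry.get("foopy") if entry else None


def devices_on_foopy(devices, foopy):
    """Return the sorted names of all devices listed on the given foopy."""
    return sorted(
        name for name, entry in devices.items() if entry.get("foopy") == foopy
    )
```

A None result from foopy_for() is the "missing from devices.json" case described above. devices_on_foopy() also helps when every panda on one foopy has to be handled together.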

Rebooting Pandas

Go to lifeguard ui http://mobile-imaging-010.p10.releng.scl3.mozilla.com/ui/lifeguard.html. Select the pandas you want to reboot. Select please_power_cycle in the drop down menu on the bottom left hand corner of the page. Select go. Their status should show rebooting.

Moving replacement pandas into production

Pandas die and the hardware needs to be replaced. DC Ops have several spare pandas they use for replacement purposes. The new pandas will have different hostnames than the old ones. The procedure to replace a dead panda from the releng side is:

  • If the panda is used for Android tests, it needs to be imaged in Lifeguard with the android image. After the imaging process has completed, you need to set its status to "locked_out".
  • Since the hostnames are different, the range of pandas allocated to production may need to be modified in mozilla-tests/production_config.py.
  • The devices.json file needs to be modified to remove the old panda's definition and include the new pandas. See bug 841486 for an example of the changes required. See https://bugzilla.mozilla.org/show_bug.cgi?id=836857#c2 for an example of the old to new panda mapping from DC Ops.
  • Update or add the directory on the appropriate foopy with the new panda name. Ensure the appropriate buildbot master is listed in the buildbot.tac.
  • Update slavealloc as appropriate to include or enable the pandas

How to run smoketests on the pandas

jmaher blogged about it

http://elvis314.wordpress.com/2013/03/01/smoketest-for-firefox-android-on-panda-boards/

How to determine the version of preseed running on a panda

  • Put the panda in question into maintenance mode via mozpool. In lifeguard, ensure that the panda is in an unlocked state; if it is not, unlock it. Then, from the drop-down image menu, select please start_maintenance. (This is the same menu that has please_image.)
  • Wait for it to reboot and appear as in maintenance_mode state.
  • ssh to the device as the build user (current password is from the time the devices were imaged so n-1)
  • mkdir /home/cltbld/tmp
  • sudo -i mount /dev/mmcblk0p1 /home/cltbld/tmp
  • cat /home/cltbld/tmp/boot.txt

The current version looks like this

setenv bootargs "console=ttyO2,115200n8 rootwait ro earlyprintk fixrtc nocompcache vram=48M omapfb.vram=0:24M,1:24M mem=456M@0x80000000 mem=512M@0xA0000000 init=/init androidboot.console=ttyO2 omapdss.def_disp=dvi omapdss.hdmimode=0 omapdss.mode=dvi:640x480MR-24@60 consoleblank=0"
setenv bootandroid "echo Booting Android from SDcard; fatload mmc 0:1 0x80200000 uImage; fatload mmc 0:1 0x81600000 uInitrd; bootm 0x80200000 0x81600000"
setenv bootpxefirst "echo Launching PXE boot... ; if usb start; then set autoload no; bootp; if pxe get; then pxe boot; fi; fi; usb stop; run bootandroid"

Which dividehex assures me is the current version

kmoir: that looks correct. the previous version had the bootargs truncated like so "setenv bootargs "console=ttyO2,115200n8 rootwait ro earlyprintk fixrtc nocompcache vram=48M omapfb.vram=0:24M,1:24M mem=456M@0x80000000 mem=512M@0xA0000000 init=/init androidboot.console=ttyO2"
  • Reimage the devices with the correct image as appropriate. You'll have to force the devices into the free state, reimage, and then lock if required.
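The comparison of boot.txt against the truncated older version can be automated. This sketch is a heuristic based only on the difference quoted above: the older bootargs stopped after androidboot.console=ttyO2, so it treats a missing consoleblank=0 as a sign of the old preseed. The function names are hypothetical.

```python
import re


def extract_bootargs(boot_txt):
    """Return the quoted value of the 'setenv bootargs' line, or None."""
    match = re.search(r'^setenv bootargs "([^"]*)"', boot_txt, re.MULTILINE)
    return match.group(1) if match else None


def bootargs_look_truncated(boot_txt):
    """True if bootargs is missing or lacks the trailing consoleblank=0 argument."""
    args = extract_bootargs(boot_txt)
    if args is None:
        return True  # no bootargs line at all: treat as suspect
    return "consoleblank=0" not in args.split()
```

Feeding it the contents of /home/cltbld/tmp/boot.txt gives a quick yes/no answer on whether the panda needs reimaging.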

How to deal with failing pandas in mozpool

Non-talos tests for Android run via mozpool (bug 829211). The devices should be in the "ready" state with the android image in mozpool to be able to be allocated to test jobs. The device is allocated by buildbot based on an idle slave that is available to the master. So mozpool doesn't allocate the specific device, buildbot does. When the "run_script" part of the job runs and invokes mozharness, the device is verified by mozpool. The mozharness scripts also ensure that the correct version of the sutagent is installed and update it if necessary.

  • These messages in the buildbot log mean that the panda requested by buildbot was not made available in the ready state within the timeout parameters specified by mozpool. Sometimes this seems to be an intermittent error. If not, the panda should be disabled from production runs and a bug opened with DC Ops to investigate this device.
13:17:41     INFO - Waiting for request 'ready' stage.  Current state: 'contacting_lifeguard'
13:18:41     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:19:41     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:20:41     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:21:41     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:22:41     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:23:41     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:24:41     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:25:41     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:26:41     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:27:41     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:28:42     INFO - Waiting for request 'ready' stage.  Current state: 'pending'
13:29:42     INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:30:42     INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:31:42     INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:32:42     INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:33:42     INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:34:42     INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:35:42     INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:36:42     INFO - Waiting for request 'ready' stage.  Current state: 'failed_device_busy'
13:37:42    ERROR - INFRA-ERROR: Request did not become ready in time
  • Devices in the "failed_pxe_booting" state in mozpool. Not sure what to do in this state since requests to reboot or reimage are refused because the device isn't available on the network. Opening a bug with DC Ops to investigate this device is a good first step.
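When triaging many logs, the request timeline can be condensed automatically. The sketch below parses only the two message shapes shown in the log excerpt above. The function name is hypothetical and it is not part of mozharness or mozpool.

```python
import re

STATE_RE = re.compile(
    r"Waiting for request 'ready' stage\.\s+Current state: '([^']+)'"
)
TIMEOUT_MARKER = "INFRA-ERROR: Request did not become ready in time"


def summarize_request(log_lines):
    """Return (distinct states in order of appearance, whether the request timed out)."""
    states = []
    timed_out = False
    for line in log_lines:
        match = STATE_RE.search(line)
        if match:
            state = match.group(1)
            # Collapse consecutive repeats such as the long run of 'pending'.
            if not states or states[-1] != state:
                states.append(state)
        elif TIMEOUT_MARKER in line:
            timed_out = True
    return states, timed_out
```

A result ending in failed_device_busy with timed_out set to True matches the failure described in the first bullet.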

Releng changes required when moving panda racks from one data centre to another

We moved the pandas from scl1 to scl3 in bug 1001705. Here are the changes needed to implement that from the releng side of things.

To disable pandas

Follow the steps here https://bugzilla.mozilla.org/show_bug.cgi?id=1001705#c1

To enable pandas

 tools/buildfarm/maintenance/manage_foopies.py -f ../mobile/devices.json -H $foopy create_device_dirs
    • Make sure the "tools" checkout is up to date. Use:
   for f in $(grep ^foop ~1/train_2_short.txt | cut -d, -f1) ; do
       ./manage_foopies.py -f ../mobile/devices.json -H $f update
   done
   for f in $(grep ^foop ~1/train_2_short.txt | cut -d, -f1) ; do
       ./manage_foopies.py -f ../mobile/devices.json -H $f show_revision
   done
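The two shell loops can be expressed in Python for use from a larger script. This sketch only builds the command lines. The helper names are hypothetical; the flags and actions are the ones shown in the loops above.

```python
def foopies_from_list(lines):
    """Mimic `grep ^foop FILE | cut -d, -f1`: first comma field of matching lines."""
    return [
        line.rstrip("\n").split(",", 1)[0]
        for line in lines
        if line.startswith("foop")
    ]


def manage_foopies_commands(foopies, action, devices_json="../mobile/devices.json"):
    """Build one manage_foopies.py argument list per foopy for the given action."""
    return [
        ["./manage_foopies.py", "-f", devices_json, "-H", foopy, action]
        for foopy in foopies
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True) from the directory that contains manage_foopies.py, first with the update action and then with show_revision.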

To loan pandas

  • disable the panda and all associated pandas on that foopy in slavealloc
  • remove the buildbot.tac on the associated foopy and any keys

References

One source of truth

As of Oct 2012, tools/buildfarm/mobile/devices.json should be the most authoritative document.

  • if you find a tegra deployed that is not listed here, check slavealloc. Once found, file a releng bug to get devices.json updated.
  • if you find a PDU not labeled per the devices.json file, file a releng bug to update the human labels.