ReleaseEngineering/Archive/Android Tegras: Difference between revisions

From MozillaWiki

Revision as of 21:02, 2 November 2012

First time?

Is this your first time dealing with Tegras and foopies? Here are some things not to do:

  1. Do not update the talos checkout under /builds or you will hit new bugs
  2. Do not start a tegra unless you use "screen -x"

Tegra Dashboard

The current status of each Tegra, and other informational links, can be seen on the Tegra Dashboard. The Dashboard is only updated every 8 minutes; use ./check.sh on the foopy for live status.

The page is broken up into three sections: Summary, Production and Staging. Production and Staging show the same information, each focused on the named set of Tegras.

The Summary section has the current start/end date range of the displayed Tegras and a grid of counts.

                                      Production  Staging
Tegra and buildslave online           57          8
Tegra online but buildslave is not    0           0
Both Tegra and buildslave are offline 19          2


The Production/Staging section is a detailed list of all Tegras that fall into the given category.

ID 	Tegra 	CP 	BS 	Msg 	Online 	Active 	Foopy 	PDU 	active bar
  • ID Tegra-### identifier. Links to the buildslave detail page on the master
  • Tegra Shows if the Tegra is powered and responding: online|OFFLINE
  • CP Shows if the ClientProxy daemon is running: active|INACTIVE
  • BS Shows if the buildslave for the Tegra is running: active|OFFLINE
  • Msg The info message from the last check.py run for that Tegra
  • Foopy Which foopy server the Tegra is run on. Links to the hostname:tegra-dir
  • PDU Which PDU page can be used to power-cycle the Tegra. PDU0 is used for Tegras that are not yet connected to a PDU
  • Log Links to the text file that contains the cumulative check.py log entries
  • active bar A single character summary of the last 10 status checks where '_' is offline and 'A' is active
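
As a rough illustration of how the active bar can be read, here is a minimal Python sketch that counts the two documented characters. The function name is made up for this example, and any character other than '_' or 'A' is simply counted as "other", since the page does not document further symbols.

```python
def summarize_active_bar(bar):
    """Count active ('A') and offline ('_') checks in an active-bar string.

    Only 'A' and '_' are documented on this page; anything else is
    reported under "other" rather than interpreted.
    """
    active = bar.count("A")
    offline = bar.count("_")
    return {
        "active": active,
        "offline": offline,
        "other": len(bar) - active - offline,
    }


# Example: the last 10 checks, with two offline results in the middle.
print(summarize_active_bar("AAAA__AAAA"))
```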

What Do I Do When...

For the process to follow when repairing a board, see: https://etherpad.mozilla.org/panda-repair-workflow

PING checks are failing

See the section power cycle a tegra.

tegra agent check is CRITICAL

Check the dashboard; the Tegra may be rebooting. Give it up to 15 minutes, then verify the current status. If it is still "rebooting", treat it as if PING checks are failing (see that section above).

Unable to properly remove /mnt/sdcard/tests

Stop clientproxy for the Tegra (see "stop Tegra(s)"), then format the SDCard (see "Remote format an SDCard"), then start clientproxy back up (see "start Tegra(s)").

How Do I...

recover a foopy

If a foopy has been shut down without having cleanly stopped all Tegras, you will need to do the following:

Note: Try screen -x first; establish the base screen session only if one does not already exist.

ssh cltbld@foopy##
screen -x
cd /builds
./stop_cp.sh
./start_cp.sh

find what foopy a Tegra is on

Open the Tegra Dashboard - the foopy number is shown to the right

check status of Tegra(s)

Find the Tegra on the Dashboard and then ssh to that foopy

ssh cltbld@foopy##
cd /builds
./check.sh -t tegra-###

To check on the status of all Tegras covered by that foopy

./check.sh

check.sh is found in /builds on a foopy

power cycle a Tegra

Find the Tegra on the Dashboard and then ssh to that foopy

ssh cltbld@foopy##
/builds/check.sh -t tegra-### -c

You have to wait approximately 5 minutes before you can check the status of the slave.

check.py -c first checks that the Tegra is really offline and only then reboots it through the PDU. "Reboot a Tegra through the PDU", by contrast, is a hard reboot that does not check whether the Tegra is really down.

If the above did not work, try the steps in "Reboot a Tegra through the PDU". If the Tegra still does not recover, file a bug for ServerOps::Releng to get to it.

clear an error flag

Find the Tegra on the Dashboard, ssh to that foopy and then run check.sh with -r:

ssh cltbld@foopy05
cd /builds
./check.sh -t tegra-002 -r

restart Tegra(s)

Find out which foopy server you need to be on and then run:

ssh cltbld@foopy##
screen -x # or you will hit bug 642369
cd /builds
./stop_cp.sh tegra-###

Check the ps output that is generated at the end to ensure that nothing has glitched. If any zombie processes are found, you will need to kill them manually. Once clear, run:

./start_cp.sh tegra-###

start Tegra(s)

Find out which foopy server you need to be on and then run:

screen -x # or you will hit bug 642369
cd /builds
./start_cp.sh [tegra-###]

If you specify the tegra-### parameter then it will only attempt to start that Tegra; otherwise it will walk through all Tegras found in /builds/tegra-*
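
To make the "one Tegra or all of them" behaviour concrete, here is a small Python sketch of the selection logic. It is not the code of start_cp.sh; the function names are invented for this example, and the only detail taken from this page is the /builds/tegra-* directory pattern.

```python
import glob
import os


def tegra_dirs(builds_root="/builds"):
    """Return the sorted names of all tegra-* directories under builds_root."""
    pattern = os.path.join(builds_root, "tegra-*")
    return sorted(
        os.path.basename(path)
        for path in glob.glob(pattern)
        if os.path.isdir(path)  # skip stray files that match the pattern
    )


def targets(requested=None, builds_root="/builds"):
    """Return just the requested Tegra, or every Tegra found on this foopy."""
    if requested:
        return [requested]
    return tegra_dirs(builds_root)
```

Calling targets("tegra-042") yields a one-element list, while targets() falls back to whatever tegra-* directories exist on the machine.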

stop Tegra(s)

First find the foopy server for the Tegra and then run:

screen -x # or you will hit bug 642369
cd /builds
./stop_cp.sh [tegra-###]

If you specify the tegra-### parameter then it will only attempt to stop that Tegra; otherwise it will walk through all Tegras found in /builds/tegra-*

At the end of the shutdown process, stop_cp.sh will run

ps auxw | grep "tegra-###"

to allow you to check that all associated or spawned child processes have also been stopped. Sadly some of them love to zombie and that just ruins any summer picnic.

find Tegras that are hung

If you see a Tegra that has been running for 4+ hours, then it most likely has a hung fennec process. There will be a matching server.js daemon on the foopy.

The easiest way to find Tegras that are in this state is via the buildbot-master. (N.B. In buildbot reports, all tegras report their model number, e.g. "Tegra 250". Do not confuse the model name with a tegra host name, e.g. tegra-250.) Currently (2011-12-20) all tegras on a foopy use the same build master:

foopy #          Master URL
<18              test-master01
>=18 & even      buildbot-master20
>18 & odd        buildbot-master19
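
The mapping above can be written as a short Python helper. This is only a restatement of the table as of 2011-12-20; the function name is made up for this example, and the assignment may well have changed since.

```python
def master_for_foopy(foopy_number):
    """Return the build master for a foopy, per the 2011-12-20 table."""
    if foopy_number < 18:
        return "test-master01"
    if foopy_number % 2 == 0:
        # 18 and above, even-numbered foopies
        return "buildbot-master20"
    # above 18, odd-numbered foopies
    return "buildbot-master19"


for number in (5, 18, 19):
    print(number, master_for_foopy(number))
```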

Look for Tegras that have a "Last heard from" of >4 hours. If the list of "Recent builds" for the Tegra is flapping between exceptions/failures/warnings, i.e. the status is all sorts of different pretty colours, that's a good sign that there's a stray fennec process fouling things up.

Another way to identify stalled tegras is to look on the dashboard for tegras showing INACTIVE status for both the tegra and the client proxy. (These often also have a "not connected" status on the buildslaves page.)

whack a hung Tegra

Currently the only way to kick Tegras in this state is to kill the server.js daemon on the appropriate foopy.

The manual way to do it is to run:

ps auxw | grep server.js | grep tegra-### 

and then kill the resulting PID. To keep from going crazy typing that over and over again, I created kill_stalled.sh which automates that task.

cd /builds
./kill_stalled.sh 042 050 070 099

This will run the above ps and grep for each tegra id given and if a PID is found, kill it. This will cause the Tegra to be power-cycled automatically, getting it back into service.

If ./kill_stalled.sh reports "none found", then manually powercycle the tegra.
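
For readers who want to see the matching step spelled out, here is a hedged Python sketch of the same ps-and-grep idea. It is not the contents of kill_stalled.sh; the function name and the sample process lines are invented, and it assumes the usual "ps auxw" layout in which the PID is the second column and the command is the eleventh.

```python
def stalled_pids(ps_output, tegra_id):
    """Return PIDs of server.js processes whose command mentions tegra-<id>."""
    needle = "tegra-%s" % tegra_id
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)  # 10 leading columns, then the command
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or something unparseable
        command = fields[10]
        if command.startswith("grep"):
            continue  # ignore the grep processes themselves
        if "server.js" in command and needle in command:
            pids.append(int(fields[1]))
    return pids


# Made-up sample output, for illustration only.
SAMPLE_PS = (
    "USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND\n"
    "cltbld    4321  0.0  0.1  10000  2000 ?        S    10:00   0:01 js server.js /builds/tegra-042\n"
    "cltbld    4400  0.0  0.1  10000  2000 ?        S    10:00   0:01 js server.js /builds/tegra-050\n"
    "cltbld    4500  0.0  0.0   2000   500 pts/1    S+   10:06   0:00 grep server.js\n"
)

print(stalled_pids(SAMPLE_PS, "042"))
```

An empty result corresponds to the "none found" case, where a manual powercycle is the next step.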

Reboot a Tegra through the PDU

cd /builds
python sut_tools/tegra_powercycle.py ###

You will see the snmpset call result if it worked.

You can run this from your own machine rather than a foopy by doing this:

cd ~/repos/tools/sut_tools
ln -s ~/repos/talos/talos/devicemanager* .

If rebooting via PDU does not clear the problem, here are things to try:

  • reboot again - it is fairly common for a 2nd reboot to clear it
    • especially if the box is responsive to ping & telnet (port 20701) after the first reboot
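
One way to check whether a box is answering on port 20701 without opening an interactive telnet session is a plain TCP connect. The sketch below is an assumption-laden helper, not part of sut_tools: the function name and the default timeout are made up, and a successful connect only shows that something is listening on the port.

```python
import socket


def port_open(host, port=20701, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For example, port_open("tegra-042") would be run after the first PDU reboot, with tegra-042 standing in for the real host name.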

check.py options

To manually run check.py, find the appropriate foopy server and run:

cd /builds
python sut_tools/check.py [-m [s|p]] [-r] [-c] [-t tegra-###]
  • -m [s|p] restrict Tegra list to 's'taging or 'p'roduction
  • -r reset any error.flg semaphore if found and send "rebt" command to tegra
  • -c powercycle the Tegra by telnetting to the appropriate PDU
  • -t tegra-### check only the given Tegra

This will scan the given Tegra (or all of them, if -t is omitted) and report back its status.

Start ADB

Telnet to the Tegra and run:

telnet tegra-### 20701
exec su -c "setprop service.adb.tcp.port 5555"
exec su -c "stop adbd"
exec su -c "start adbd"

On your computer do:

adb tcpip 5555
adb connect <ipaddr of tegra>
adb shell

Removing a Tegra temporarily from service

If you receive a request to remove from service a tegra that is burning builds, check the recent list to see if it's still behaving badly (auth required):

      https://secure.pub.build.mozilla.org/buildapi/recent/tegra-338 

  • If it's burning builds, connect to the associated foopy listed in the dashboard and stop the tegra with stop_cp.sh
  • If it has a chance of being green but still needs to be shut down, move the buildbot.tac (to .disabled or something) in the tegra-NNN dir and request a graceful shutdown for it in buildbot.
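
The rename in the second bullet can be sketched in Python as follows. The function name and the exact ".disabled" suffix are illustrative choices (the text above only says ".disabled or something"), and the graceful shutdown in buildbot still has to be requested separately.

```python
import os


def disable_buildbot_tac(tegra_dir, suffix=".disabled"):
    """Rename buildbot.tac inside tegra_dir so it is no longer picked up.

    Returns the new path. Raises FileNotFoundError if there is no
    buildbot.tac to rename.
    """
    source = os.path.join(tegra_dir, "buildbot.tac")
    target = source + suffix
    os.rename(source, target)
    return target
```

Renaming the file back to buildbot.tac reverses the change.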

Add a tegra to staging

  • on the new foopy, if it is already in service (if the foopy is brand new, the initial install will do all these steps):
    • update the local tools: cd /builds/tools ; hg pull --update; cd -
    • manually add the tegra to the /builds/create_dirs.sh file
    • manually run cd /builds; ./create_dirs.sh
  • copy a buildbot.tac file from an existing tegra directory and modify the contents to point to the new tegra
  • add new tegra to slavealloc
  • add new tegra to localconfig.py on the dev-master
    • don't forget to checkconfig && reconfig the dev-master

Move a tegra from one foopy to another

The steps are written for moving one tegra. If you're moving a bunch, then you may want to apply each major step to all tegras involved, and use the "reconfigure foopy" approach to save work.

NOTE: use this technique to replace a tegra as well. (It's really two moves: move old to dust bin, then move replacement to live.)

  1. update foopies.sh & tegras.json in your working directory
  2. commit the changes to foopies.sh and tegras.json
    • make sure json is clean: python -c 'import json; json.loads(open("tegras.json").read())'
  3. in buildbot, request a "graceful shutdown"
    • wait for tegra to show "idle"
  4. on the old foopy:
    • stop the tegra via /builds/stop_cp.sh
    • manually remove the tegra from the /builds/create_dirs.sh file
      • OR run ./foopies.sh old_foopy_number from your working directory (blocked by bug 713690)
  5. on the new foopy, if it is already in service (if the foopy is brand new, the initial install will do all these steps):
    • update the local tools: cd /builds/tools ; hg pull --update; cd -
    • manually add the tegra to the /builds/create_dirs.sh file
    • manually run cd /builds; ./create_dirs.sh
    • if this is a replacement tegra, manually push the ini files by judicious use of: grep python update_tegra_ini.sh | sed 's/$TEGRA/tegra-xxx/'
  6. on the new foopy:
    • restart the tegras from inside screen -x (or you will hit bug 642369): cd /builds ; ./start_cp.sh
      • NOTE: do not start any new tegras, which require a reconfig to be active, until after the reconfig is complete.
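
The JSON sanity check from step 2 can also be wrapped in a small function that reports the parse error instead of a bare traceback. This is a sketch only: the function name is made up, it checks well-formedness and nothing about the expected layout of tegras.json, and the keys in the example strings are invented.

```python
import json


def check_json_text(text):
    """Return (True, None) if text parses as JSON, else (False, error message)."""
    try:
        json.loads(text)
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        return False, str(exc)
    return True, None


# A trailing comma is a typical hand-editing mistake that this catches.
print(check_json_text('{"tegra-001": {"foopy": "foopy05"}}'))
print(check_json_text('{"tegra-001": {"foopy": "foopy05"},}')[0])
```

To check the real file, pass it open("tegras.json").read() from your working directory.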

Move a tegra from staging to production

  1. If the tegra is running, stop it: /builds/stop_cp.sh tegra-###
  2. Edit the tegra's buildbot.tac: /builds/tegra-###/buildbot.tac
  3. Adjust the master, port and password to the appropriate server
  4. Save and restart the Tegra from inside screen -x (or you will hit bug 642369): /builds/start_cp.sh tegra-###

Note - yes, it's a manual process (sorry) until Tegras are in SlaveAlloc

This process also applies to moving tegras from one dev staging master to another.

Remote format an SDCard

Before remote formatting the SDCard, make sure that as little as possible has any chance of touching the tegra until you are done. This is usually a simple matter of making sure clientproxy and buildbot are not running for the tegra.

Then you want to telnet in:

telnet tegra-### 20701

Assuming the tegra is up, this will get you to a command prompt. There you check the current mount info (as a sanity check, in case anything goes wrong), run the format command, and then reboot the tegra to ensure it starts in a clean, known state. Optionally, you can telnet back in and recheck the mount info.

exec mount
exec newfs_msdos -F 32 /dev/block/vold/179:9
exec rebt
^]
quit

Yes, the ^] is Ctrl+]. It gets you out of the telnet prompt: rebt does not return a value, so you might otherwise be stuck in the prompt until the connection closes remotely.

You should PDU reboot the Tegra and clear the error flag after formatting the SDcard.

Update the SUTagent in the staging environment

Note: Foopies 05, 06 and 26 are staging foopies

ssh to relengweb1 (ssh -l userid relengweb1) and sudo to root.

Copy the new SUTagent apk to /var/www/html/build/tools/talos/mobile.

It should be called something like SUTagentAndroid.1.13.apk. To ensure it can be read, chmod o+r the file.

On each staging foopy, connect as cltbld

screen -x
cd sut_tools
# edit target_version in updateSUT.py to reflect the new version
cd ..
./stop_cp.sh
# detach from screen; it will take 45-60 minutes to bring down all the attached tegras
./start_cp.sh

Environment

The Tegra builders are run on multiple "foopy" servers with about 15-20 Tegras per foopy. Each Tegra has its own buildslave environment and they share common tool and talos environments -- all found inside of /builds.

  • Each Tegra has a /builds/tegra-### containing a buildslave and a symlink to /builds/tools/sut_tools/clientproxy.py
  • All of the shared talos info is in /builds/talos-data and HG is used to maintain it
  • All of the sut related helper code is found /builds/sut_tools (a symlink to /builds/tools/sut_tools/)

Tegra is the short name for the Tegra 250 Developer Kit test board, see http://developer.nvidia.com/tegra-250-tango-development-kits for details. It allows us to install and test Firefox on a device that runs Android Froyo while also allowing for debugging.

Unlike the N900s, we don't run a buildbot environment on the device, but rather communicate with the device via the sutAgentAndroid program that the a-team maintains. All of the buildslave activities are handled by the clientproxy.py program, which monitors the Tegra and its state and starts/stops the buildslave as needed.

References

One source of truth

As of Oct 2012, tools/buildfarm/mobile/devices.json should be the most authoritative document.

  • if you find a tegra deployed that is not listed here, check bear's master list. If there, file a releng bug to get devices.json updated.
  • if you find a PDU not labeled per the devices.json file, file a releng bug to update the human labels.

Determining what Pandas are attached to a Chassis

At the moment, there are two ways you could look up this information:

  • Check the name of the relay board in the notes section for a given panda. That will also tell you the chassis name.
  • Use the rack-level view in inventory, like so:

https://inventory.mozilla.org/en-US/systems/racks/?rack=224

You can use the rack location in the first column to identify which boards and which relay are sharing a chassis.

Also, in general what you probably care more about is which relay it's attached to. Each production panda has a k/v store key of system.relay.0 that will tell you which relay it's hooked up to (which corresponds to chassis as well) and which bank and slot on the relay.


Rebooting Pandas