ReleaseEngineering/Archive/Android Tegras

From MozillaWiki

Tegra Dashboard

The current status of each Tegra, along with other informational links, can be seen on the Tegra Dashboard. The dashboard is only updated every 8 minutes; use ./check.sh on the foopy for live status.

The page is divided into three sections: Summary, Production and Staging. The Production and Staging sections show the same information, each limited to the named set of Tegras.

The Summary section has the current start/end date range of the displayed Tegras and a grid of counts.

                                      Production  Staging
Tegra and buildslave online           57          8
Tegra online but buildslave is not    0           0
Both Tegra and buildslave are offline 19          2


The Production/Staging section is a detailed list of all Tegras that fall into the given category.

ID 	Tegra 	CP 	BS 	Msg 	Online 	Active 	Foopy 	PDU 	active bar
  • ID: the Tegra-### identifier; links to the buildslave detail page on the master
  • Tegra: whether the Tegra is powered and responding (online|OFFLINE)
  • CP: whether the ClientProxy daemon is running (active|INACTIVE)
  • BS: whether the buildslave for the Tegra is running (active|OFFLINE)
  • Msg: the info message from the last check.py run for that Tegra
  • Foopy: which foopy server the Tegra is run on; links to the hostname:tegra-dir
  • PDU: which PDU page can be used to power-cycle the Tegra; PDU0 is used for those not yet connected
  • Log: links to the text file that contains the cumulative check.py log entries
  • active bar: a single-character summary of the last 10 status checks, where '_' is offline and 'A' is active
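
As a rough illustration of reading the active bar, the sketch below counts the two documented characters in a bar string. It assumes the bar holds one character per status check; the function name summarize_active_bar and the sample string are made up for this example and are not part of the dashboard code.

```python
def summarize_active_bar(bar):
    """Count active ('A') and offline ('_') checks in an active bar string.

    Assumption: one character per status check, as the column description
    suggests. The order of the characters is not interpreted here.
    """
    return {
        "active": bar.count("A"),
        "offline": bar.count("_"),
        "checks": len(bar),
    }


# Hypothetical bar value, for illustration only.
print(summarize_active_bar("AAAA__AAAA"))
```

A Tegra whose bar is mostly '_' is a reasonable candidate for a live check with ./check.sh on its foopy.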

What Do I Do When...

PING checks are failing

Reboot the Tegra through the PDU

tegra agent check is CRITICAL

Check the dashboard; the Tegra may be rebooting. Give it up to 15 minutes, then verify the current status. If it is still "rebooting", treat it as described under "PING checks are failing".

How Do I...

recover a foopy

If a foopy has been shut down without cleanly stopping all Tegras, you will need to do the following:

Note: try screen -x first; if no base screen session exists yet, establish one before continuing.

ssh cltbld@foopy##
screen -x
cd /builds
./stop_cp.sh
./start_cp.sh

find what foopy a Tegra is on

Open the Tegra Dashboard; the foopy for each Tegra is shown in the Foopy column, towards the right.

check status of Tegra(s)

Find the Tegra on the Dashboard and then ssh to that foopy

ssh cltbld@foopy##
cd /builds
./check.sh -t tegra-###

To check on the status of all Tegras covered by that foopy

./check.sh

check.sh is found in /builds on a foopy

power cycle a Tegra

Find the Tegra on the Dashboard and then ssh to that foopy

ssh cltbld@foopy##
cd /builds
./check.sh -t tegra-### -c

If the above did not work, then you will need to follow "Reboot a Tegra through the PDU".

clear an error flag

Find the Tegra on the Dashboard, ssh to that foopy and then run check.sh with -r. For example, for tegra-002 on foopy05:

ssh cltbld@foopy05
cd /builds
./check.sh -t tegra-002 -r

restart Tegra(s)

Find out which foopy server you need to be on and then run:

ssh cltbld@foopy##
cd /builds
./stop_cp.sh tegra-###

Check the ps output that is generated at the end to ensure that nothing has glitched. If any zombie processes are found, you will need to kill them manually. Once clear, run:

./start_cp.sh tegra-###

start Tegra(s)

Find out which foopy server you need to be on and then run:

cd /builds
./start_cp.sh [tegra-###]

If you specify the tegra-### parameter then it will only attempt to start that Tegra, otherwise it will walk through all Tegras found in /builds/tegra-*
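
To see which Tegras a bare start_cp.sh or stop_cp.sh would walk through, a small sketch like the following lists the tegra-* directories under /builds. This is not how the shell scripts are implemented; the function name tegra_dirs and the builds_root parameter are assumptions made for this example.

```python
import glob
import os


def tegra_dirs(builds_root="/builds"):
    """Return the sorted names of tegra-* directories under builds_root."""
    pattern = os.path.join(builds_root, "tegra-*")
    # Keep directories only, so stray files matching tegra-* are ignored.
    return sorted(
        os.path.basename(path)
        for path in glob.glob(pattern)
        if os.path.isdir(path)
    )


if os.path.isdir("/builds"):
    for name in tegra_dirs():
        print(name)
```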

stop Tegra(s)

First find the foopy server for the Tegra and then run:

cd /builds
./stop_cp.sh [tegra-###]

If you specify the tegra-### parameter then it will only attempt to stop that Tegra, otherwise it will walk through all Tegras found in /builds/tegra-*

At the end of the shutdown process, stop_cp.sh will run

ps auxw | grep "tegra-###"

to allow you to check that all associated or spawned child processes have also been stopped. Sadly some of them love to zombie and that just ruins any summer picnic.

find Tegras that are hung

If you see a Tegra that has been running for 4+ hours, then it most likely has a hung fennec process. There will be a matching server.js daemon on the foopy.

The easiest way to find Tegras that are in this state is via the buildbot-master. (N.B. in buildbot reports, all tegras report their model number, e.g. "Tegra 250". Do not confuse the model name with a tegra host name, e.g. tegra-250.) Currently (2011-12-20) all tegras on a foopy use the same build master:

foopy #        Master URL
<18            test-master01
>=18 & even    buildbot-master20
>18 & odd      buildbot-master19
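
The table above can be expressed as a small lookup. This is only a sketch of the 2011-12-20 mapping as listed; the function name master_for_foopy is made up for this example, and the mapping should be rechecked if foopies or masters have been reassigned since.

```python
def master_for_foopy(foopy_number):
    """Return the build master for a foopy, per the table dated 2011-12-20."""
    if foopy_number < 18:
        return "test-master01"
    if foopy_number % 2 == 0:
        # 18 and above, even-numbered foopies
        return "buildbot-master20"
    # above 18, odd-numbered foopies
    return "buildbot-master19"


for n in (5, 18, 19, 22):
    print("foopy%02d -> %s" % (n, master_for_foopy(n)))
```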

Look for Tegras that have a "Last heard from" of >4 hours. If the list of "Recent builds" for the Tegra is flapping between exceptions/failures/warnings, i.e. the status is all sorts of different pretty colours, that's a good sign that there's a stray fennec process fouling things up.

Another way to spot stalled tegras is to look on the dashboard for tegras showing INACTIVE status for both the tegra and the client proxy. (These often also have a "not connected" status on the buildslaves page.)

whack a hung Tegra

Currently the only way to kick Tegras in this state is to kill the server.js daemon on the appropriate foopy.

The manual way to do it is to run:

ps auxw | grep server.js | grep tegra-### 

and then kill the resulting PID. To keep from going crazy typing that over and over again, I created kill_stalled.sh which automates that task.

cd /builds
./kill_stalled.sh 042 050 070 099

This will run the above ps and grep for each tegra id given and if a PID is found, kill it. This will cause the Tegra to be power-cycled automatically, getting it back into service.

If ./kill_stalled.sh reports "none found", then manually powercycle the tegra.
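
For reference, the matching step of the manual ps and grep pipeline can be sketched in Python as below. This is not the source of kill_stalled.sh; the function name stalled_server_pids is hypothetical, and it only reports PIDs rather than killing anything. Unlike a plain grep, it refuses to let a shorter id match inside a longer one (tegra-04 will not match tegra-042).

```python
import re


def stalled_server_pids(ps_output, tegra_id):
    """Return PIDs of server.js processes whose command line names tegra-<id>.

    ps_output is the text printed by `ps auxw`; tegra_id is the numeric
    part only, e.g. "042".
    """
    pattern = re.compile(r"tegra-%s(?!\d)" % re.escape(tegra_id))
    pids = []
    for line in ps_output.splitlines():
        if "server.js" not in line or not pattern.search(line):
            continue
        fields = line.split()
        # In `ps auxw` output the PID is the second column.
        if len(fields) > 1 and fields[1].isdigit():
            pids.append(int(fields[1]))
    return pids
```

Feeding it the output of ps auxw for each id passed to kill_stalled.sh gives the list of PIDs to inspect; an empty list corresponds to the "none found" case above.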

Reboot a Tegra through the PDU

cd /builds
python sut_tools/tegra_powercycle.py ###

You will see the snmpset call result if it worked.

If rebooting via PDU does not clear the problem, here are things to try:

  • reboot again - fairly common to have 2nd one clear it
    • especially if box responsive to ping & telnet (port 20701) after first reboot
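
To check whether the box is answering on port 20701 after a reboot without opening an interactive telnet session, a minimal sketch using only the Python standard library could look like this. The function name sut_port_open and the 5 second default timeout are assumptions for this example; a successful connect only shows that something is listening, not that the agent is healthy.

```python
import socket


def sut_port_open(host, port=20701, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False
    conn.close()
    return True
```

For example, sut_port_open("tegra-###") with the real host name filled in returns False while the Tegra is still rebooting.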

check.py options

To manually run check.py, find the appropriate foopy server and run:

cd /builds
python sut_tools/check.py [-m [s|p]] [-r] [-c] [-t tegra-###]
  • -m [s|p] restrict Tegra list to 's'taging or 'p'roduction
  • -r reset any error.flg semaphore if found and send "rebt" command to tegra
  • -c powercycle the Tegra by telnetting to the appropriate PDU
  • -t tegra-### check only the named Tegra

This will scan a given Tegra (or all of them) and report back its status.

Start ADB

On the Tegra (via a telnet session to its agent port) do:

telnet tegra-### 20701
exec su -c "setprop service.adb.tcp.port 5555"
exec su -c "stop adbd"
exec su -c "start adbd"

On your computer do:

adb tcpip 5555
adb connect <ipaddr of tegra>
adb shell

Move a tegra from one foopy to another

The steps are written for moving one tegra. If you're moving a bunch, then you may want to apply each major step to all tegras involved, and use the "reconfigure foopy" approach to save work.

NOTE: use this technique to replace a tegra as well. (It's really two moves: move old to dust bin, then move replacement to live.)

  1. update foopies.sh & tegras.json in your working directory
  2. commit the changes to foopies.sh and tegras.json
    • make sure json is clean: python -c 'import json; json.loads(open("tegras.json").read())'
  3. in buildbot, request a "graceful shutdown"
    • wait for tegra to show "idle"
  4. on the old foopy:
    • stop the tegra via /builds/stop_cp.sh
    • manually remove the tegra from the /builds/create_dirs.sh file
      • OR run ./foopies.sh old_foopy_number from your working directory (currently blocked by bug 713690)
  5. on the new foopy, if it is already in service (if the foopy is brand new, the initial install will do all these steps):
    • update the local tools: cd /builds/tools ; hg pull --update; cd -
    • manually add the tegra to the /builds/create_dirs.sh file
    • manually run cd /builds; ./create_dirs.sh
    • if this is a replacement tegra, manually push the ini files by judicious use of: grep python update_tegra_ini.sh | sed 's/$TEGRA/tegra-xxx/'
  6. on the new foopy:
    • restart the tegras using cd /builds ; ./start_cp.sh
      • NOTE: do not start any new tegras, which require a reconfig to be active, until after the reconfig is complete.
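
The "make sure json is clean" check in step 2 can also be run in a form that reports why parsing failed. The sketch below uses only the standard json module; the function name check_json_text is made up for this example, and it checks syntax only, not whether the contents of tegras.json make sense.

```python
import json


def check_json_text(text):
    """Return (True, None) if text parses as JSON, else (False, error message)."""
    try:
        json.loads(text)
    except ValueError as exc:
        # json raises a ValueError subclass describing where parsing failed.
        return False, str(exc)
    return True, None


# Typical use from the working directory (path as used in step 1):
# ok, error = check_json_text(open("tegras.json").read())
```

A trailing comma after the last entry is a common cause of failure after hand-editing the file.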

Move a tegra from staging to production

  1. If the tegra is running, stop it: /builds/stop_cp.sh tegra-###
  2. Edit the tegra's buildbot.tac: /builds/tegra-###/buildbot.tac
  3. Adjust the master, port and password to the appropriate server
  4. Save and restart the Tegra: /builds/start_cp.sh tegra-###

Note - yes, it's a manual process (sorry) until Tegras are in SlaveAlloc

Environment

The Tegra builders are run on multiple "foopy" servers with about 15-20 Tegras per foopy. Each Tegra has its own buildslave environment and they share common tool and talos environments -- all found inside of /builds.

  • Each Tegra has a /builds/tegra-### containing a buildslave and a symlink to /builds/tools/sut_tools/clientproxy.py
  • All of the shared talos info is in /builds/talos-data and HG is used to maintain it
  • All of the sut related helper code is found in /builds/sut_tools (a symlink to /builds/tools/sut_tools/)

Tegra is the short name for the Tegra 250 Developer Kit test board, see http://developer.nvidia.com/tegra/tegra-devkit-features for details. It allows us to install and test Firefox on a device that runs Android Froyo while also allowing for debugging.

Unlike the N900s, we don't run a buildbot environment on the device, but rather communicate with the device via the sutAgentAndroid program that the a-team maintains. All of the buildslave activities are handled by the clientproxy.py program, which monitors the Tegra and its state and starts/stops the buildslave as needed.

References

One source of truth

As of Oct 2011, tools/buildfarm/mobile/tegras.json should be the most authoritative document.

  • if you find a tegra deployed that is not listed here, check bear's master list. If there, file a releng bug to get tegras.json updated.
  • if you find a PDU not labeled per the tegras.json file, file a releng bug to update the human labels.