ReleaseEngineering/Archive/Android Tegras

{{Release Engineering How To|Android Tegras}}
This page is obsolete.
= Tegra Dashboard =
The current status of each Tegra, and other informational links, can be seen on the [http://bm-remote-talos-webhost-01.build.mozilla.org/tegras/ Tegra Dashboard]. ''Dashboard is only updated every 8 minutes; use [[#check status of Tegra(s)|./check.sh]] on the foopy for live status.''
 
The page has three sections: Summary, Production, and Staging. The Production and Staging sections show the same information, each restricted to the named set of Tegras.
 
The Summary section has the current start/end date range of the displayed Tegras and a grid of counts.
 
{| border="1" cellpadding="2"
! !!Production!!Staging
|-
| Tegra and buildslave online
| 57
| 8
|-
| Tegra online but buildslave is not
| 0
| 0
|-
| Both Tegra and buildslave are offline
| 19
| 2
|}
 
 
The Production/Staging section is a detailed list of all Tegras that fall into the given category.
 
ID Tegra CP BS Msg Online Active Foopy PDU active bar
 
* '''ID''' Tegra-### identifier. Links to the buildslave detail page on the master
* '''Tegra''' Shows if the Tegra is powered and responding: online|OFFLINE
* '''CP''' Shows if the ClientProxy daemon is running: active|INACTIVE
* '''BS''' Shows if the buildslave for the Tegra is running: active|OFFLINE
* '''Msg''' The info message from the last [http://hg.mozilla.org/build/tools/file/default/sut_tools/check.py check.py] run for that Tegra
* '''Foopy''' Which foopy server the Tegra is run on. Links to the hostname:tegra-dir
* '''PDU''' Which PDU page can be used to power-cycle the Tegra. PDU0 is shown for Tegras that are not yet connected to a PDU
* '''Log''' Links to the text file that contains the cumulative [http://hg.mozilla.org/build/tools/file/default/sut_tools/check.py check.py] log entries
* '''active bar''' A single character summary of the last 10 status checks where '_' is offline and 'A' is active
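
The active bar can also be summarized programmatically. The following is a minimal sketch, assuming only what is described above ('_' means offline, 'A' means active, one character per status check). The function names and the <tt>min_offline</tt> threshold are hypothetical and not part of the dashboard code.

```python
def summarize_active_bar(bar):
    """Count active ('A') and offline ('_') checks in an active bar string."""
    return {
        "active": bar.count("A"),
        "offline": bar.count("_"),
        "checks": len(bar),
    }


def looks_flaky(bar, min_offline=3):
    """Hypothetical heuristic: flag a Tegra that was offline in
    at least min_offline of the recorded checks."""
    return bar.count("_") >= min_offline
```

A bar such as <tt>AAAA__AAAA</tt> would count as 8 active and 2 offline checks, which falls below the assumed threshold of 3.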
 
= What Do I Do When... =
 
== PING checks are failing ==
[[#Reboot a Tegra through the PDU|Reboot the Tegra through the PDU]].
 
== tegra agent check is CRITICAL ==
Check the dashboard; the Tegra may be rebooting. Give it up to 15 minutes, then [[#check status of Tegra(s)|verify current status]]. If it is still "rebooting", treat it as if [[#PING checks are failing]].
 
= How Do I... =
 
== recover a foopy ==
 
If a foopy has been shut down without cleanly stopping all Tegras, you will need to do the following:
 
'''Note''': Try screen -x first to attach to the base screen session; establish a new session only if needed.
 
ssh cltbld@foopy##
screen -x
cd /builds
./stop_cp.sh
./start_cp.sh
 
== find what foopy a Tegra is on ==
 
Open the Tegra Dashboard - the foopy number is shown to the right
 
== check status of Tegra(s) ==
 
Find the Tegra on the Dashboard and then ssh to that foopy
 
ssh cltbld@foopy##
cd /builds
./check.sh -t tegra-###
 
To check on the status of all Tegras covered by that foopy
 
./check.sh
 
check.sh is found in /builds on a foopy
 
== power cycle a Tegra ==
 
Find the Tegra on the Dashboard and then ssh to that foopy
 
ssh cltbld@foopy##
cd /builds
./check.sh -t tegra-### -c
 
If the above did not work, then you will need to [[#Reboot a Tegra through the PDU]].
 
== clear an error flag ==
 
Find the Tegra on the Dashboard, ssh to that foopy, and run check.sh with -r. For example, for tegra-002 on foopy05:

ssh cltbld@foopy05
cd /builds
./check.sh -t tegra-002 -r
 
== restart Tegra(s) ==
 
Find out which foopy server you need to be on and then run:
 
ssh cltbld@foopy##
cd /builds
./stop_cp.sh tegra-###
 
Check the '''ps''' output that is generated at the end to ensure that nothing has glitched. If any zombie processes are found, you will need to kill them manually. Once clear, run
 
./start_cp.sh tegra-###
 
== start Tegra(s) ==
 
Find out which foopy server you need to be on and then run:
 
cd /builds
./start_cp.sh [tegra-###]
 
If you specify the tegra-### parameter, it will only attempt to start that Tegra; otherwise it will walk through all Tegras found in /builds/tegra-*
 
== stop Tegra(s) ==
 
First find the foopy server for the Tegra and then run:
 
cd /builds
./stop_cp.sh [tegra-###]
 
If you specify the tegra-### parameter, it will only attempt to stop that Tegra; otherwise it will walk through all Tegras found in /builds/tegra-*

At the end of the shutdown process, stop_cp.sh will run

ps auxw | grep "tegra-###"

to allow you to check that all associated or spawned child processes have also been stopped. Sadly some of them love to zombie and that just ruins any summer picnic.
 
== find Tegras that are hung ==
If you see a Tegra that has been running for 4+ hours, then it most likely has a hung fennec process. There will be a matching server.js daemon on the foopy.
 
The easiest way to find Tegras that are in this state is via the buildbot-master. ''(N.B. in buildbot reports, all tegras report their [https://en.wikipedia.org/wiki/Nvidia_Tegra#Tegra_2_series model #], e.g. "Tegra 250". Do not confuse the model name with a tegra host name, e.g. <tt>tegra-250</tt>.)'' As of 2011-12-20, all tegras on a foopy use the same build master:
 
{| border="1" cellpadding="2"
!foopy #!!Master URL
|-
| <18
| [http://test-master01.build.mozilla.org:8012/buildslaves?no_builders=1 test-master01]
|-
| >=18 & even
| [http://buildbot-master20.build.mozilla.org:8201/buildslaves?no_builders=1 buildbot-master20]
|-
| >18 & odd
| [http://buildbot-master19.build.mozilla.org:8201/buildslaves?no_builders=1 buildbot-master19]
|}
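
The table above can be expressed as a small lookup. This is a minimal sketch of the mapping exactly as listed (valid as of 2011-12-20); the function name is hypothetical, and only the master host names are returned rather than full URLs.

```python
def master_for_foopy(foopy_number):
    """Return the build master name for a foopy number, per the table above."""
    if foopy_number < 18:
        return "test-master01"
    if foopy_number % 2 == 0:
        # 18 and above, even-numbered foopies
        return "buildbot-master20"
    # above 18, odd-numbered foopies
    return "buildbot-master19"
```

For example, foopy05 maps to test-master01, foopy18 to buildbot-master20, and foopy19 to buildbot-master19.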
 
Look for Tegras that have a "Last heard from" of >4 hours. If the list of "Recent builds" for the Tegra is flapping between exceptions/failures/warnings, i.e. the status is all sorts of different pretty colours, that's a good sign that there's a stray fennec process fouling things up.
 
Another way to identify stalled tegras is to look on the dashboard for tegras showing INACTIVE status for both the tegra ''and'' the client proxy. (These often also have a "not connected" status on the buildslaves page.)
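
The ">4 hours" rule can be sketched as a simple filter. This assumes you have already collected the "Last heard from" ages by hand or by some other means; the <tt>last_heard</tt> dictionary shape and the function name are hypothetical, and nothing here scrapes the buildslaves page.

```python
from datetime import timedelta


def hung_candidates(last_heard, threshold=timedelta(hours=4)):
    """Return the names of Tegras whose 'Last heard from' age exceeds the threshold.

    last_heard maps a tegra host name to a timedelta (assumed input shape).
    """
    return sorted(name for name, age in last_heard.items() if age > threshold)
```

The result is only a list of candidates; the "Recent builds" pattern described above is still worth checking before killing anything.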
 
=== whack a hung Tegra ===
Currently, the only way to kick Tegras in this state is to kill the server.js daemon on the appropriate foopy.
 
The manual way to do it is to run:
 
ps auxw | grep server.js | grep tegra-###
 
and then kill the resulting PID. To keep from going crazy typing that over and over again, I created <code>kill_stalled.sh</code> which automates that task.
 
cd /builds
./kill_stalled.sh 042 050 070 099
 
This will run the above ps and grep for each tegra id given and if a PID is found, kill it. This will cause the Tegra to be power-cycled automatically, getting it back into service.
 
If <tt>./kill_stalled.sh</tt> reports "none found", then manually powercycle the tegra.
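
For illustration, the ps-and-grep step can be sketched in Python. This is not the contents of <tt>kill_stalled.sh</tt>; it is a minimal sketch that assumes the standard <tt>ps auxw</tt> column layout (PID in the second column). The function name and the sample process lines are made up for the example.

```python
def find_stalled_pids(ps_output, tegra_id):
    """Return PIDs of server.js processes whose command line mentions tegra-<id>."""
    needle = "tegra-%s" % tegra_id
    pids = []
    for line in ps_output.splitlines():
        # Same filter as: grep server.js | grep tegra-###
        if "server.js" not in line or needle not in line:
            continue
        fields = line.split()
        # In `ps auxw` output the second column is the PID.
        if len(fields) > 1 and fields[1].isdigit():
            pids.append(int(fields[1]))
    return pids


# Invented sample lines, shaped like `ps auxw` output.
SAMPLE_PS = """\
cltbld  4312  0.0  0.1  12345  6789 ?  S  10:01  0:00 server.js tegra-042
cltbld  4400  0.0  0.1  12345  6789 ?  S  10:02  0:00 clientproxy.py tegra-042
cltbld  4518  0.0  0.1  12345  6789 ?  S  10:03  0:00 server.js tegra-050
"""
```

Each returned PID could then be passed to <tt>os.kill</tt>; which signal the real script sends is not stated here, so that choice is left open.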
 
== Reboot a Tegra through the PDU ==
cd /builds
python sut_tools/tegra_powercycle.py ###
 
You will see the snmpset call result if it worked.
 
If rebooting via PDU does not clear the problem, here are things to try:
* reboot again - fairly common to have 2nd one clear it
** especially if box responsive to ping & telnet (port 20701) after first reboot
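
Whether the box answers on port 20701 after a reboot can be checked without an interactive telnet session. This is a minimal sketch using only the Python standard library; the function name and the 5 second default timeout are assumptions, and it only tests that the TCP port accepts a connection, not that the agent is healthy.

```python
import socket


def agent_port_open(host, port=20701, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False
    conn.close()
    return True
```

Run from the foopy, a call such as <tt>agent_port_open("tegra-###")</tt> with the real host name filled in returns False while the Tegra is still rebooting or unreachable.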
 
== check.py options ==
 
To manually run [http://hg.mozilla.org/build/tools/file/default/sut_tools/check.py check.py] '''find the appropriate foopy server''' and
 
cd /builds
python sut_tools/check.py [-m [s|p]] [-r] [-c] [-t tegra-###]
 
* -m [s|p] restrict Tegra list to 's'taging or 'p'roduction
* -r reset any error.flg semaphore if found and send "rebt" command to tegra
* -c powercycle the Tegra by telnetting to the appropriate PDU
 
This will scan a given Tegra (or all of them) and report back its status.
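
If check.py is driven from another script, the command line can be assembled from the options listed above. This is a hedged sketch: the helper name and its keyword arguments are hypothetical, and it only uses the flags documented in this section (-m, -r, -c, -t).

```python
def build_check_command(tegra=None, mode=None, reset=False, powercycle=False):
    """Build an argument list for sut_tools/check.py from the documented flags."""
    cmd = ["python", "sut_tools/check.py"]
    if mode is not None:
        if mode not in ("s", "p"):
            raise ValueError("mode must be 's' (staging) or 'p' (production)")
        cmd += ["-m", mode]
    if reset:
        cmd.append("-r")  # reset error.flg and send "rebt"
    if powercycle:
        cmd.append("-c")  # powercycle via the PDU
    if tegra is not None:
        cmd += ["-t", tegra]
    return cmd
```

The resulting list could be handed to <tt>subprocess.call(cmd, cwd="/builds")</tt> on the appropriate foopy.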
 
== Start ADB ==
Open a telnet session to the Tegra and run:
telnet tegra-### 20701
exec su -c "setprop service.adb.tcp.port 5555"
exec su -c "stop adbd"
exec su -c "start adbd"
 
On your computer do:
adb tcpip 5555
adb connect <ipaddr of tegra>
adb shell
 
== Move a tegra from one foopy to another ==
The steps are written for moving one tegra. If you're moving a bunch, then you may want to apply each major step to all tegras involved, and use the "reconfigure foopy" approach to save work.
 
'''NOTE:''' use this technique to replace a tegra as well. (It's really two moves: move old to dust bin, then move replacement to live.)
 
# update foopies.sh & tegras.json in your working directory
# commit the changes to <tt>foopies.sh</tt> and <tt>tegras.json</tt>
#* make sure json is clean: <tt>python -c 'import json; json.loads(open("tegras.json").read())'</tt>
# in buildbot, request a "graceful shutdown"
#* wait for tegra to show "idle"
# on the old foopy:
#* stop the tegra via <tt>/builds/stop_cp.sh</tt>
#* manually remove the tegra from the <tt>/builds/create_dirs.sh</tt> file
#** <strike>'''OR''' run <tt>./foopies.sh old_foopy_number</tt> from your working directory</strike> blocked by: {{bug|713690}}
# on the new foopy, if it is already in service (if the foopy is brand new, the initial install will do all these steps):
#* update the local tools: <tt>cd /builds/tools ; hg pull --update; cd -</tt>
#* manually add the tegra to the <tt>/builds/create_dirs.sh</tt> file
#* manually run <tt>cd /builds; ./create_dirs.sh</tt>
#* if this is a replacement tegra, manually push the ini files by judicious use of: <tt>grep python update_tegra_ini.sh | sed 's/$TEGRA/tegra-xxx/'</tt>
# on the new foopy:
#* restart the tegras using <tt>cd /builds ; ./start_cp.sh</tt>
#** '''NOTE:''' do not start any new tegras, which require a reconfig to be active, until after the reconfig is complete.
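
The "make sure json is clean" step above can be made slightly stricter. The one-liner only confirms that the file parses; the sketch below does the same and additionally rejects duplicate keys, which <tt>json.loads</tt> otherwise accepts silently. The function names are hypothetical, and the duplicate-key check is an addition, not part of the documented procedure.

```python
import json


def _reject_duplicates(pairs):
    """object_pairs_hook that raises on repeated keys within one JSON object."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError("duplicate key: %s" % key)
        seen[key] = value
    return seen


def validate_json_text(text):
    """Return None if text is clean JSON, otherwise a short error description."""
    try:
        json.loads(text, object_pairs_hook=_reject_duplicates)
    except ValueError as exc:
        return str(exc)
    return None
```

Calling <tt>validate_json_text(open("tegras.json").read())</tt> before committing returns None when the file is clean.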
 
= Environment =
 
The Tegra builders are run on multiple "foopy" servers with about 15-20 Tegras per foopy.  Each Tegra has its own buildslave environment and they share common tool and talos environments -- all found inside of '''/builds'''.

* Each Tegra has a '''/builds/tegra-###''' directory containing a buildslave and a symlink to /builds/tools/sut_tools/clientproxy.py
* All of the shared talos info is in '''/builds/talos-data''' and HG is used to maintain it
* All of the sut related helper code is found in '''/builds/sut_tools''' (a symlink to /builds/tools/sut_tools/)
 
Tegra is the short name for the Tegra 250 Developer Kit test board, see http://developer.nvidia.com/tegra/tegra-devkit-features for details.  It allows us to install and test Firefox on a device that runs Android Froyo while also allowing for debugging.
 
Unlike the N900s, we don't run a buildbot environment on the device, but rather communicate with the device via the sutAgentAndroid program that the a-team maintains.  All of the buildslave activities are handled by the clientproxy.py program, which monitors the Tegra and its state and starts/stops the buildslave as needed.
 
= References =
 
== One source of truth ==
 
As of Oct 2011, [https://hg.mozilla.org/build/tools/file/default/buildfarm/mobile/tegras.json <tt>tools/buildfarm/mobile/tegras.json</tt>] should be the most authoritative document.
* if you find a tegra deployed that is not listed here, check [https://docs.google.com/spreadsheet/ccc?key=0AlIN8kWEeaF0dFJHSWN4WVNVZEhlREtUNWdTYnVtMlE&hl=en_US#gid=0 bear's master list]. If there, file a releng bug to get <tt>tegras.json</tt> updated.
* if you find a PDU not labeled per the <tt>tegras.json</tt> file, file a releng bug to update the human labels.

Latest revision as of 01:01, 11 July 2017
