ReleaseEngineering/Archive/Android Tegras

{{Release Engineering How To|Android Tegras}}
This page is obsolete.
= First time? =
Is this your first time dealing with tegras and foopies? Here are some no-nos:
# Do not update the talos checkout under /builds or you will hit new bugs
# Do not start a tegra unless you use "screen -x"
 
= Tegra Dashboard =
The current status of each Tegra, and other informational links, can be seen on the [http://mobile-dashboard.pub.build.mozilla.org/ Tegra Dashboard]. ''Dashboard is only updated every 8 minutes; use [[#check status of Tegra(s)|./check.sh]] on the foopy for live status.''
 
The page is broken up into three sections: Summary, Production and Staging, where Production and Staging show the same information but focus on the named set of Tegras.
 
The Summary section has the current start/end date range of the displayed Tegras and a grid of counts.
 
{| border="1" cellpadding="2"
! !!Production!!Staging
|-
| Tegra and buildslave online || 57 || 8
|-
| Tegra online but buildslave is not || 0 || 0
|-
| Both Tegra and buildslave are offline || 19 || 2
|}
 
 
The Production/Staging section is a detailed list of all Tegras that fall into the given category.
 
ID Tegra CP BS Msg Online Active Foopy PDU active bar
 
* '''ID''' Tegra-### identifier. Links to the buildslave detail page on the master
* '''Tegra''' Shows if the Tegra is powered and responding: online|OFFLINE
* '''CP''' Shows if the ClientProxy daemon is running: active|INACTIVE
* '''BS''' Shows if the buildslave for the Tegra is running: active|OFFLINE
* '''Msg''' The info message from the last [http://hg.mozilla.org/build/tools/file/default/sut_tools/check.py check.py] run for that Tegra
* '''Foopy''' Which foopy server the Tegra is run on. Links to the hostname:tegra-dir
* '''PDU''' Which PDU page can be used to power-cycle the Tegra.  PDU0 is used for those not connected as of yet
* '''Log''' Links to the text file that contains the cumulative [http://hg.mozilla.org/build/tools/file/default/sut_tools/check.py check.py] log entries
* '''active bar''' A single character summary of the last 10 status checks where '_' is offline and 'A' is active
 
= What Do I Do When... =
 
== PING checks are failing ==
See the section [[ReleaseEngineering:How_To:Android_Tegras#power_cycle_a_Tegra|power cycle a tegra]].
 
== tegra agent check is CRITICAL ==
Check the dashboard; the Tegra may be rebooting. Give it up to 15 minutes, then [[#check status of Tegra(s)|verify current status]]. If it is still "rebooting", treat it as [[#PING checks are failing]].
 
== Unable to properly remove /mnt/sdcard/tests ==
You want to stop clientproxy on the tegra ([[#stop_Tegra.28s.29]]), then format the SDCard ([[#Remote_format_an_SDCard]]), then start clientproxy back up ([[#start_Tegra.28s.29]]).
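
Putting those three steps together on the appropriate foopy, a rough sketch (the tegra number is a placeholder, and the format/reboot commands are issued inside the telnet session as described in [[#Remote_format_an_SDCard]]):

cd /builds
./stop_cp.sh tegra-###
telnet tegra-### 20701   # in the session: exec newfs_msdos -F 32 /dev/block/vold/179:9 ; exec rebt ; then ^] and quit
./start_cp.sh tegra-###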
 
= How Do I... =
 
== recover a foopy ==
 
If a foopy has been shut down without having cleanly stopped all Tegras, you will need to do the following:
 
'''Note''': If needed, establish the base screen session by trying <tt>screen -x</tt> first
 
ssh cltbld@foopy##
screen -x
cd /builds
./stop_cp.sh
./start_cp.sh
 
== find what foopy a Tegra is on ==
 
Open the Tegra Dashboard - the foopy number is shown to the right
 
== check status of Tegra(s) ==
 
Find the Tegra on the Dashboard and then ssh to that foopy
 
ssh cltbld@foopy##
cd /builds
./check.sh -t tegra-###
 
To check on the status of all Tegras covered by that foopy
 
./check.sh
 
check.sh is found in /builds on a foopy
 
== power cycle a Tegra ==
Find the Tegra on the Dashboard and then ssh to that foopy
ssh cltbld@foopy##
/builds/check.sh -t tegra-### -c
 
You have to wait approximately 5 minutes before you can check the status of the slave.
 
What "check.py -c" does is to check that a tegra is really offline and then reboot through the PDU.
"Reboot a Tegra through the PDU" is doing a hardcore reboot without checking that the tegra is really down.
This means that if this section does not recover you will need to file a bug for ServerOps::Releng to get to it.
 
<strike>
If the above did not work, then you will need to [[#Reboot a Tegra through the PDU]].
</strike>
 
== clear an error flag ==
 
Find the Tegra on the Dashboard, ssh to that foopy and then
 
ssh cltbld@foopy05
cd /builds
./check.sh -t tegra-002 -r
 
== restart Tegra(s) ==
 
Find out which foopy server you need to be on and then run:
 
ssh cltbld@foopy##
screen -x # or you will hit bug 642369
cd /builds
./stop_cp.sh tegra-###
 
Check the '''ps''' output that is generated at the end to ensure that nothing has glitched. If any zombie processes are found then you will need to kill them manually.  Once clear, run
 
./start_cp.sh tegra-###
 
== start Tegra(s) ==
 
Find out which foopy server you need to be on and then run:
 
screen -x # or you will hit bug 642369
cd /builds
./start_cp.sh [tegra-###]
 
If you specify the tegra-### parameter then it will only attempt to start that Tegra; otherwise it will walk through all Tegras found in /builds/tegra-*
 
== stop Tegra(s) ==
 
First find the foopy server for the Tegra and then run:
screen -x # or you will hit bug 642369
cd /builds
./stop_cp.sh [tegra-###]
 
If you specify the tegra-### parameter then it will only attempt to stop that Tegra; otherwise it will walk through all Tegras found in /builds/tegra-*
 
At the end of the shutdown process, stop_cp.sh will run
 
ps auxw | grep "tegra-###"
 
to allow you to check that all associated or spawned child processes have also been stopped. Sadly some of them love to zombie and that just ruins any summer picnic.
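
If the ps output does show leftover processes that refuse to exit, a rough example of cleaning them up by hand (tegra-042 and the PID shown are placeholders):

ps auxw | grep "tegra-042" | grep -v grep
kill 12345   # escalate to kill -9 only if the process will not die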
 
== find Tegras that are hung ==
If you see a Tegra that has been running for 4+ hours, then it most likely has a hung fennec process. There will be a matching server.js daemon on the foopy.
 
The easiest way to find Tegras that are in this state is via the buildbot-master. ''(N.B. in buildbot reports, all tegras report their [https://en.wikipedia.org/wiki/Nvidia_Tegra#Tegra_2_series model #], e.g. "Tegra 250". Do not confuse model name with a tegra host name, e.g. <tt>tegra-250</tt>.)''. Currently (2011-12-20) all tegras on a foopy use the same build master:
 
{| border="1" cellpadding="2"
!foopy #!!Master URL
|-
| <18
| [http://test-master01.build.mozilla.org:8012/buildslaves?no_builders=1 test-master01]
|-
| >=18 & even
| [http://buildbot-master20.build.mozilla.org:8201/buildslaves?no_builders=1 buildbot-master20]
|-
| >18 & odd
| [http://buildbot-master19.build.mozilla.org:8201/buildslaves?no_builders=1 buildbot-master19]
|}
 
Look for Tegras that have a "Last heard from" of >4 hours. If the list of "Recent builds" for the Tegra is flapping between exceptions/failures/warnings, i.e. the status is all sorts of different pretty colours, that's a good sign that there's a stray fennec process fouling things up.
 
Another way to identify stalled tegras is to look on the dashboard for tegras showing INACTIVE status for both the tegra ''and'' the client proxy. (These often also have a "not connected" status on the buildslaves page.)
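
On the foopy itself, one quick way to spot a long-lived server.js daemon is to look at process elapsed times (a hedged example; the pid/etime/command keywords are the standard ps ones):

ps axo pid,etime,command | grep server.js | grep -v grep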
 
=== whack a hung Tegra ===
The only way currently to kick Tegras in this state is to kill the server.js daemon on the appropriate foopy.
 
The manual way to do it is to run:
 
ps auxw | grep server.js | grep tegra-###
 
and then kill the resulting PID.  To keep from going crazy typing that over and over again, I created <code>kill_stalled.sh</code> which automates that task.
 
cd /builds
./kill_stalled.sh 042 050 070 099
 
This will run the above ps and grep for each tegra id given and if a PID is found, kill it. This will cause the Tegra to be power-cycled automatically, getting it back into service.
 
If <tt>./kill_stalled.sh</tt> reports "none found", then manually powercycle the tegra.
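
For reference, a minimal sketch of roughly what <code>kill_stalled.sh</code> does, based only on the description above rather than the actual script:

#!/bin/sh
# rough sketch: for each tegra id given, find its server.js daemon and kill it
for id in "$@"; do
    pid=$(ps auxw | grep server.js | grep "tegra-$id" | grep -v grep | awk '{print $2}')
    if [ -n "$pid" ]; then
        echo "tegra-$id: killing server.js (pid $pid)"
        kill $pid
    else
        echo "tegra-$id: none found"
    fi
done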
 
== Reboot a Tegra through the PDU ==
cd /builds
python sut_tools/tegra_powercycle.py ###
 
You will see the snmpset call result if it worked.
 
You can run this from your own machine rather than a foopy by doing this:
cd ~/repos/tools/sut_tools
ln -s ~/repos/talos/talos/devicemanager* .
 
If rebooting via PDU does not clear the problem, here are things to try:
* reboot again - fairly common to have 2nd one clear it
** especially if the box is responsive to ping & telnet (port 20701) after the first reboot (see the quick check below)
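
A quick manual responsiveness check from the foopy (the hostname is a placeholder):

ping -c 3 tegra-###
telnet tegra-### 20701   # should give the SUT agent prompt; leave with ^] and then quit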
 
== check.py options ==
 
To manually run [http://hg.mozilla.org/build/tools/file/default/sut_tools/check.py check.py], '''find the appropriate foopy server''' and
 
cd /builds
python sut_tools/check.py [-m [s|p]] [-r] [-c] [-t tegra-###]
 
* -m [s|p] restrict Tegra list to 's'taging or 'p'roduction
* -r reset any error.flg semaphore if found and send "rebt" command to tegra
* -c powercycle the Tegra by telneting to the appropriate PDU
 
This will scan a given Tegra (or all of them) and report back its status.
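
For example, to clear the error flag on a single tegra and send it the "rebt" command (the tegra number is a placeholder):

cd /builds
python sut_tools/check.py -t tegra-042 -r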
 
== Start ADB ==
On the Tegra do:
telnet tegra-### 20701
exec su -c "setprop service.adb.tcp.port 5555"
exec su -c "stop adbd"
exec su -c "start adbd"
 
On your computer do:
adb tcpip 5555
adb connect <ipaddr of tegra>
adb shell
 
== Removing a Tegra temporarily from service ==
 
If you receive a request to remove a tegra from service because it is burning builds, check the recent list to see if it's still behaving badly.  Auth required:
 
      https://secure.pub.build.mozilla.org/buildapi/recent/tegra-338
* If it's burning builds, connect to the associated foopy listed in the dashboard and stop_cp the tegras
* If it has a chance of being green but still needs to be shut down, move the buildbot.tac aside (to .disabled or something) in the tegra-NNN dir and graceful it in buildbot (see the sketch below).
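
A rough sketch of the second case, using tegra-338 from the example above (the .disabled suffix is just a convention):

cd /builds/tegra-338
mv buildbot.tac buildbot.tac.disabled

Then request the graceful shutdown for that slave in buildbot.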
 
== Add a tegra to staging ==
* on the new foopy, if it is already in service (if the foopy is brand new, the initial install will do all these steps):
** update the local tools: <tt>cd /builds/tools ; hg pull --update; cd -</tt>
** manually add the tegra to the <tt>/builds/create_dirs.sh</tt> file
** manually run <tt>cd /builds; ./create_dirs.sh</tt>
* copy a buildbot.tac file from an existing tegra directory and modify the contents to point to the new tegra (see the sketch after this list)
* add new tegra to [http://slavealloc.build.mozilla.org/ui/#slaves slavealloc]
* add new tegra to localconfig.py on the [http://dev-master01.build.scl1.mozilla.com:8160/ dev-master]
** don't forget to checkconfig && reconfig the dev-master
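
A rough sketch of the buildbot.tac step (tegra numbers are placeholders; the exact field names to edit are whatever the copied file uses, typically the slave name, base directory, and the master host, port and password):

cd /builds
cp tegra-042/buildbot.tac tegra-300/buildbot.tac
vi tegra-300/buildbot.tac   # point the slave name, basedir and master host/port/password at the new tegra and the staging master
grep -E "tegra-|host|port|passwd" tegra-300/buildbot.tac   # sanity check before starting it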
 
== Move a tegra from one foopy to another ==
The steps are written for moving one tegra. If you're moving a bunch, then you may want to apply each major step to all tegras involved, and use the "reconfigure foopy" approach to save work.
 
'''NOTE:''' use this technique to replace a tegra as well. (It's really two moves: move old to dust bin, then move replacement to live.)
 
# update foopies.sh & tegras.json in your working directory
# commit the changes to <tt>foopies.sh</tt> and <tt>tegras.json</tt>
#* make sure json is clean: <tt>python -c 'import json; json.loads(open("tegras.json").read())'</tt>
# in buildbot, request a "graceful shutdown"
#* wait for tegra to show "idle"
# on the old foopy:
#* stop the tegra via <tt>/builds/stop_cp.sh</tt>
#* manually remove the tegra from the <tt>/builds/create_dirs.sh</tt> file
#** <strike>'''OR''' run <tt>./foopies.sh old_foopy_number</tt> from your working directory</strike> blocked by: {{bug|713690}}
# on the new foopy, if it is already in service (if the foopy is brand new, the initial install will do all these steps):
#* update the local tools: <tt>cd /builds/tools ; hg pull --update; cd -</tt>
#* manually add the tegra to the <tt>/builds/create_dirs.sh</tt> file
#* manually run <tt>cd /builds; ./create_dirs.sh</tt>
#* if this is a replacement tegra, manually push the ini files by judicious use of: <tt>grep python update_tegra_ini.sh | sed 's/$TEGRA/tegra-xxx/'</tt>
# on the new foopy:
#* restart the tegras using <tt>screen -x # or you will hit bug 642369; cd /builds ; ./start_cp.sh</tt>
#** '''NOTE:''' do not start any new tegras, which require a reconfig to be active, until after the reconfig is complete.
 
== Move a tegra from staging to production ==
 
# If the tegra is running, stop it: <tt>/builds/stop_cp.sh tegra-###</tt>
# Edit the tegra's buildbot.tac: <tt>/builds/tegra-###/buildbot.tac</tt>
# Adjust the master, port and password to the appropriate server
# Save and restart the Tegra: <tt>screen -x # or you will hit bug 642369; /builds/start_cp.sh tegra-###</tt>
 
'''Note''' - yes, it's a manual process (sorry) until Tegras are in SlaveAlloc
 
This process also applies to moving tegras from one dev staging master to another.
 
== Remote format an SDCard ==
 
To remote format the SDCard you want to ensure that as little as possible can touch the tegra until you are done. This is usually a simple matter of making sure clientproxy and buildbot are not running for the tegra.
 
Then you want to telnet in:
 
telnet tegra-### 20701
 
Assuming the tegra is up, this will get you to a command prompt. You'll then check the current mount info (used as a sanity check, in case anything goes wrong), execute the command to format, and reboot the tegra to ensure it starts in a clean/known state. Optionally you can telnet back in and recheck the mount info.
 
exec mount
exec newfs_msdos -F 32 /dev/block/vold/179:9
exec rebt
^]
quit
 
Yes, the ^] is a Ctrl+] so that you can get out of the telnet prompt, since rebt does not return a value and might otherwise leave you stuck in the prompt until the connection closes remotely.
 
You should [[#Reboot_a_Tegra_through_the_PDU|PDU reboot]] the Tegra after formatting the SDCard.
 
== Update the SUTagent in the staging environment ==
 
Note: Foopies 05, 06 and 26 are staging foopies
 
ssh to relengweb1 as your userid (<tt>ssh -l userid relengweb1</tt>) and sudo to root.

Copy the new SUTagent apk into the directory below:
cd /var/www/html/build/tools/talos/mobile
 
It should be called something like SUTagentAndroid.1.13.apk. To ensure it can be read, chmod o+r the file.
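
For example (the filename is whatever version you actually copied):

chmod o+r SUTagentAndroid.1.13.apk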
 
On each staging foopy, connect as cltbld
screen -x
cd /builds/sut_tools
edit target_version in updateSUT.py to reflect the new version
cd ..
./stop_cp.sh
detach from screen; this will take 45-60 minutes to bring down all the attached tegras
./start_cp.sh
 
= Environment =
 
The Tegra builders are run on multiple "foopy" servers with about 15-20 Tegras per foopy.  Each Tegra has its own buildslave environment and they share common tool and talos environments -- all found inside of '''/builds''' (see the example listing below).
 
* Each Tegra has a '''/builds/tegra-###''' containing a buildslave and a symlink to /builds/tools/sut_tools/clientproxy.py
* All of the shared talos info is in '''/builds/talos-data''' and HG is used to maintain it
* All of the sut related helper code is found in '''/builds/sut_tools''' (a symlink to /builds/tools/sut_tools/)
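
Roughly, that layout looks like the following on a foopy (an illustrative listing only; the exact set of tegra directories and scripts varies from foopy to foopy):

ls -F /builds
check.sh*  create_dirs.sh*  kill_stalled.sh*  start_cp.sh*  stop_cp.sh*
sut_tools@  talos-data/  tegra-042/  tegra-043/  ...  tools/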
 
Tegra is the short name for the Tegra 250 Developer Kit test board, see http://developer.nvidia.com/tegra-250-tango-development-kits for details.  It allows us to install and test Firefox on a device that runs Android Froyo while also allowing for debugging.
 
Unlike the N900s, we don't run a buildbot environment on the device, but rather communicate with the device via the sutAgentAndroid program that the a-team maintains.  All of the buildslave activities are handled by the clientproxy.py program, which monitors the Tegra and its state and starts/stops the buildslave as needed.
 
= References =
 
== One source of truth ==
 
As of Oct 2012, [https://hg.mozilla.org/build/tools/file/default/buildfarm/mobile/devices.json <tt>tools/buildfarm/mobile/devices.json</tt>] should be the most authoritative document.
* if you find a tegra deployed that is not listed here, check [https://docs.google.com/spreadsheet/ccc?key=0AlIN8kWEeaF0dFJHSWN4WVNVZEhlREtUNWdTYnVtMlE&hl=en_US#gid=0 bear's master list]. If there, file a releng bug to get <tt>devices.json</tt> updated.
* if you find a PDU not labeled per the <tt>devices.json</tt> file, file a releng bug to update the human labels.
 
== Determining what Pandas are attached to a Chassis ==
 
At the moment, there are two ways you could look up this information:
 
* Check the name of the relay board in the notes section for a given panda. That will also tell you the chassis name.
* Use the rack-level view in inventory, like so:
 
https://inventory.mozilla.org/en-US/systems/racks/?rack=224
 
You can use the rack location in the first column to identify which boards and which relay are sharing a chassis.
 
Also, in general, what you probably care more about is which relay it's attached to.  Each production panda has a k/v store key of <tt>system.relay.0</tt> that will tell you which relay it's hooked up to (which corresponds to the chassis as well) and which bank and slot on the relay.
 
 
== Rebooting Pandas ==
 
* Clone http://hg.mozilla.org/users/tmielczarek_mozilla.com/relay-control to get relay.py
* Look in inventory for the panda you want to reboot, https://inventory.mozilla.org/en-US/#inventory, for example panda-0010: https://inventory.mozilla.org/en-US/systems/show/6812/
* At the bottom of the page, you can see the server, bank and relay: panda-relay-001.build.scl1.mozilla.com:bank1:relay4
* These are the parameters to relay.py: "python relay.py powercycle panda-relay-001.build.scl1.mozilla.com 1 4" (see the consolidated example after this list)
* You can run it on any machine that can reach panda-relay-001 (check with ping); it's just TCP
* Soon they'll all be hooked up to BMM and you'll just be able to HTTP POST to reboot a panda
* relay.py just talks to the relay control board that the pandas' power supplies are wired through, so it's a hard powercycle
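
Putting the pieces together for the panda-0010 example above (the relay hostname, bank and relay number come from inventory):

hg clone http://hg.mozilla.org/users/tmielczarek_mozilla.com/relay-control
cd relay-control
python relay.py powercycle panda-relay-001.build.scl1.mozilla.com 1 4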
