ReleaseEngineering/Archive/Android Tegras: Difference between revisions

deprecate tegra docs
(scl1 -> scl3)
(deprecate tegra docs)
Line 2: Line 2:


'''<font color='red'>Tegras have been decommissioned, but this doc contains relevant documentation for pandas.</font>'''
'''<font color='red'>Tegras have been decommissioned, but this doc contains relevant documentation for pandas.</font>'''
= Notes for buildduty =
== Tegra Problem Tracking ==
If a Tegra is failing tests, the basic "escalation path" is as follows:
* reformat the sdcard
* failing that, replace the sdcard
* failing that, sometimes a second sdcard replacement works.  Check the job history success rate to decide whether a second replacement is worth it.
* as a last resort, decomm the Tegra.  Only do this if the Tegra looks especially bad compared to the others since we have a limited number of these devices.
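The escalation path above can be sketched as a small lookup. This is only an illustration: the step labels, the function name <tt>next_step</tt>, and the <tt>second_replacement_worthwhile</tt> flag are hypothetical and do not come from any existing tool.

```python
# Hypothetical labels for the recovery actions listed above, in order.
ESCALATION = [
    "reformat sdcard",
    "replace sdcard",
    "replace sdcard again",
    "decomm",
]


def next_step(actions_tried, second_replacement_worthwhile=True):
    """Return the next recovery action after `actions_tried` failed attempts.

    `second_replacement_worthwhile` stands in for the human judgment call
    based on job history success rate; it is an assumption of this sketch.
    """
    steps = list(ESCALATION)
    if not second_replacement_worthwhile:
        steps.remove("replace sdcard again")
    # Once everything has been tried, the last resort is all that is left.
    if actions_tried >= len(steps):
        return steps[-1]
    return steps[actions_tried]
```

For example, <tt>next_step(0)</tt> gives the reformat step, and <tt>next_step(2, second_replacement_worthwhile=False)</tt> goes straight to decomm.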
Terminology note:
* reformat = wipe out the SD card
* reimage = re-flash the Tegra
"always reimage" is a good rule of thumb when attempting to recover Tegras.
== Common tegra error states and how to recover from them ==
=== PING checks are failing ===
Nagios will alert us in channel (and send email) after it hits the retry limit for ping attempts.
See the section [[#Reboot_a_tegra|power cycle a tegra]].
=== tegra agent check is CRITICAL ===
First verify that the tegra is not simply rebooting, which can take up to 15 minutes, then [[#Check_status_of_Tegra.28s.29|verify its current status]]. If it is still "rebooting", treat it as if [[#PING checks are failing|PING checks are failing]].
=== Unable to properly remove /mnt/sdcard/tests ===
* Disable the tegra: [[#Disable a tegra|Stop tegra(s)]]
* Format the SDCard: [[#Remote_format_an_SDCard|Remote format an SDCard]]
* Start back up: [[#start_Tegra.28s.29|Start Tegra(s)]]
== Common buildduty requests ==
=== Removing a Tegra temporarily from service ===
You may receive a request to remove a tegra from service that is burning builds (or you may find one yourself). Check the recent list to see if it's still behaving badly, e.g.: (LDAP auth required)
    https://secure.pub.build.mozilla.org/buildapi/recent/tegra-338
* If it's burning builds, connect to the associated foopy listed in the dashboard and [[#Disable a tegra|stop the tegra(s)]].
<!--
TODO: This maintenance script needs updating, but docs will be (almost) perfect when done, so don't remove from page
If you have more than one or two devices to shut down, it might make sense to use the foopy fabric script (needs up to date tools checkout)
cd ~/tools/buildfarm/maintenance
# with a python that has fabric installed
python manage_foopies.py -f ../mobile/devices.json -D tegra-338 -D tegra-339 ... stop_cp
# Or a whole foopy at once
python manage_foopies.py -f devices.json -H foopy22 stop_cp
# Or a combination of the two
python manage_foopies.py -f devices.json -D tegra-031 -H foopy22 stop_cp
-->
=== Return a tegra to production ===
When dcops returns tegras to production by closing the "tegra-recovery" bug:
# [[#Check_status_of_Tegra.28s.29|Check the status of the tegras]] to ensure everything is up
# [[#start Tegra.28s.29|Start the tegras]] (if it was manually stopped) to return them to production
# Update & resolve the tegras' problem tracker bugs
=== Deploying a new version of SUTAgent ===
While infrequent, you may be asked to help [[#Update_the_SUTagent_in_production|deploy a new version of the SUTAgent]].
= Basic tegra management =
== Find what foopy a Tegra is on ==
Open the [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=tegra Tegra page] in Slave Health, click on the name of the Tegra in question, and look for the foopy name.  The foopy name is also in [http://hg.mozilla.org/build/tools/file/default/buildfarm/mobile/devices.json devices.json].
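If you have a tools checkout, the same lookup can be scripted. The sketch below assumes that devices.json is a JSON object keyed by device name, with a <tt>"foopy"</tt> field per entry (the field name matches the decommission example further down; the overall layout is an assumption). The function names are hypothetical.

```python
import json


def load_devices(path):
    """Load devices.json from a tools checkout (path supplied by the caller)."""
    with open(path) as fh:
        return json.load(fh)


def find_foopy(devices, name):
    """Return the foopy recorded for a device, or None if there is none.

    Assumes `devices` maps device names to dicts with a "foopy" key.
    Decommissioned devices carry the literal string "None".
    """
    foopy = devices.get(name, {}).get("foopy")
    if foopy in (None, "None"):
        return None
    return foopy
```

Usage would look like <tt>find_foopy(load_devices("buildfarm/mobile/devices.json"), "tegra-338")</tt>, run from the top of the tools checkout.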
== Check status of Tegra(s) ==
Find the Tegra on the [[#Find_what_foopy_a_Tegra_is_on|Dashboard]] and then ssh to that foopy
ssh cltbld@foopy##
cd /builds
./check.sh -t tegra-###
To check on the status of all Tegras covered by that foopy
./check.sh
check.sh is found in /builds on a foopy
== Clear an error flag ==
This is done automatically, once an hour. But if you need to do it manually for some reason...
Find the Tegra on the Dashboard, ssh to that foopy and then
ssh cltbld@foopy05
rm -f /builds/tegra-NNN/error.flg
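If you need to clear flags on several devices, the manual <tt>rm -f</tt> can be wrapped in a few lines of Python. This is a minimal sketch, not part of sut_tools; the function name <tt>clear_flags</tt> is hypothetical, and it only touches <tt>error.flg</tt> by default.

```python
import os


def clear_flags(device_dir, flags=("error.flg",)):
    """Remove the given flag files from a device directory.

    Behaves like `rm -f`: a missing flag file is not an error.
    Returns the names of the flags that were actually removed.
    """
    removed = []
    for flag in flags:
        try:
            os.remove(os.path.join(device_dir, flag))
            removed.append(flag)
        except FileNotFoundError:
            pass  # nothing to clear
    return removed
```

For example, <tt>clear_flags("/builds/tegra-NNN")</tt> on the foopy that hosts the device.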
== start Tegra(s) ==
Find out which foopy server you need to be on and then run:
cd /builds
rm -f /builds/tegra-###/{disabled,error}.flg
The device should then attempt to start up within 5 minutes, running through verify and then starting buildbot if verify succeeds.
Should it seem to have trouble starting, you can check its watcher log:
tail /builds/tegra-###/watcher.log
And if that is stale you might want to peek at [[#Recover_a_foopy|recover a foopy]]
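Whether the watcher log is "stale" can be judged from its modification time. The sketch below is an illustration only: the function name and the 10-minute threshold are assumptions, chosen loosely because the device should attempt to start within 5 minutes.

```python
import os
import time


def is_stale(path, max_age_seconds=600, now=None):
    """Return True if `path` was last modified more than `max_age_seconds` ago.

    A missing or unreadable log is treated as stale. The default threshold
    is an assumption of this sketch, not a documented value.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return True
    return (now - mtime) > max_age_seconds
```

For example, <tt>is_stale("/builds/tegra-###/watcher.log")</tt> with the device number filled in.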
== Disable a tegra ==
We no longer touch disabled.flg manually. Disable and enable tegras via slavealloc instead.
Should it seem to have trouble starting, you can check its watcher log:
tail /builds/tegra-###/watcher.log
And if that is stale you might want to peek at [[#Recover_a_foopy|recover a foopy]]
== Reboot a tegra ==
To reboot a batch of tegras, specify '-D <tegra>' for each:
cd tools; hg pull -u
python buildfarm/maintenance/manage_foopies.py -f buildfarm/mobile/devices.json -D tegra-xxx -D tegra-yyy reboot
Note: double-check the enabled/disabled state before rebooting.
The more manual method to reboot a tegra:
ssh cltbld@<foopy>
cd /builds
python sut_tools/tegra_powercycle.py ###
You will see the snmpset call result if it worked.
You can run this from your own machine rather than a foopy by doing this:
cd ~/repos/tools/sut_tools
# If tools not up to date, run risk of toggling wrong PDU, if info changed
PYTHONPATH=. tegra_powercycle.py ###
If rebooting via PDU does not clear the problem, here are things to try:
* reboot again - fairly common to have 2nd one clear it
** especially if device responsive to ping & telnet (port 20701) after first reboot
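To see whether the device answers on the telnet port after a reboot, a plain TCP connect is enough. This is a minimal sketch using only the standard library; the function name <tt>port_open</tt> and the 5-second timeout are assumptions. It only checks that the port accepts a connection, not that the agent behind it is healthy.

```python
import socket


def port_open(host, port=20701, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, <tt>port_open("tegra-###")</tt> with the device number filled in.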
== Recover a foopy ==
If a foopy has been shutdown without having cleanly stopped all Tegras, you will need to do the following:
'''Note''': Try <tt>screen -x</tt> first to attach to the base screen session, and establish one if it does not exist yet.
ssh cltbld@foopy##
screen -x
cd /builds
rm -f tegra-*/watcher.lock
./watch_devices.sh
== Decommission a Tegra ==
* file a RelOps bug to decommission the tegra and have it block the build slave tracking bug
* once RelOps and DcOps have removed the tegra from DNS and Inventory:
** in slavealloc, disable the tegra, set the environment to 'decomm', and set the comment to "decommissioned in bug XXXXXX"
** remove the /builds/tegra-XXX directory from the foopy
** locate the tegra in [tools]/buildfarm/mobile/devices.json and set the following fields ([https://hg.mozilla.org/build/tools/rev/a31ced7e9957 example]):
  "foopy":"None"
  "_comment": "disabled in bug XXXXXX"
** add tegra to the list of decommissioned tegras in [buildbot-configs]/mozilla-tests/production_config.py ([https://hg.mozilla.org/build/buildbot-configs/rev/7b9578c9a135 example])
* once all the above is done and landed, close the build slave tracking bug
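The devices.json change in the steps above can be sketched as follows. It assumes devices.json is a JSON object keyed by device name (an assumption about the file layout); the function name is hypothetical. Review the resulting diff before committing.

```python
import json


def mark_decommissioned(devices, name, bug):
    """Set the two fields described above on one device entry, in place.

    Assumes `devices` maps device names to dicts. Raises KeyError if the
    device is not present, so a typo in the name does not pass silently.
    """
    entry = devices[name]
    entry["foopy"] = "None"
    entry["_comment"] = "disabled in bug %s" % bug
    return devices


def rewrite_devices_file(path, name, bug):
    """Load, update and rewrite a devices.json file (formatting may change)."""
    with open(path) as fh:
        devices = json.load(fh)
    mark_decommissioned(devices, name, bug)
    with open(path, "w") as fh:
        json.dump(devices, fh, indent=4, sort_keys=True)
        fh.write("\n")
```

Note that rewriting the file with <tt>json.dump</tt> may reorder or reindent it, so a hand edit can give a smaller diff.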
= Advanced tegra management =
== Remote format an SDCard ==
To remote format the SDCard, first make sure that as little as possible can touch the tegra until you are done. This is usually a simple matter of making sure clientproxy and buildbot are not running for the tegra.
Then you want to telnet in:
telnet tegra-### 20701
Assuming the tegra is up, this will get you to a command prompt. There you check the current mount info (a sanity check in case anything goes wrong), execute the format command, and then reboot the tegra to ensure it starts in a clean, known state. Optionally, you can telnet back in and recheck the mount info.
exec mount
exec newfs_msdos -F 32 /dev/block/vold/179:9
exec rebt
^]
quit
The ^] is ctrl+], which gets you out of the telnet prompt: rebt does not return a value, so you might otherwise be stuck in the prompt until the connection closes remotely.
If you receive this error when attempting the format:
newfs_msdos: /dev/block/vold/179:9: No such file or directory
That is usually an indication of a bad SDCard or bad SDCard reader (the OS doesn't think there is an SDCard).
You should [[ReleaseEngineering/How_To/Android_Tegras#Reboot_a_tegra | PDU reboot]] the Tegra and [[ReleaseEngineering/How_To/Android_Tegras#Clear_an_error_flag | clear the error flag ]] after formatting the SDcard.
== check.py options ==
To manually run [http://hg.mozilla.org/build/tools/file/default/sut_tools/check.py check.py], find the [http://hg.mozilla.org/build/tools/file/default/buildfarm/mobile/devices.json appropriate foopy server] and
cd /builds
python sut_tools/check.py [-m [s|p]] [-r] [-c] [-t tegra-###]
* -m [s|p] restrict Tegra list to 's'taging or 'p'roduction
* -r reset any error.flg semaphore if found and send "rebt" command to tegra
* -c powercycle the Tegra by telnetting to the appropriate PDU
This will scan a given Tegra (or all of them) and report back its status.
== Start ADB ==
On the Tegra do:
telnet tegra-### 20701
exec su -c "setprop service.adb.tcp.port 5555"
exec su -c "stop adbd"
exec su -c "start adbd"
On your computer do:
adb tcpip 5555
adb connect <ipaddr of tegra>
adb shell
== Add a tegra to staging ==
* on the new foopy, if it is already in service (if the foopy is brand new, the initial install will do all these steps):
** update the local tools: <tt>cd /builds/tools ; hg pull --update; cd -</tt>
** manually add the tegra to the <tt>/builds/create_dirs.sh</tt> file
** manually run <tt>cd /builds; ./create_dirs.sh</tt>
* copy a buildbot.tac file from an existing tegra directory and modify the contents to point to the new tegra
* add new tegra to [http://slavealloc.build.mozilla.org/ui/#slaves slavealloc]
* add new tegra to localconfig.py on the [http://dev-master01.build.scl1.mozilla.com:8160/ dev-master]
** don't forget to checkconfig && reconfig the dev-master
== Move a tegra from one foopy to another ==
The steps are written for moving one tegra. If you're moving a bunch, then you may want to apply each major step to all tegras involved, and use the "reconfigure foopy" approach to save work.
'''NOTE:''' use this technique to replace a tegra as well. (It's really two moves: move old to dust bin, then move replacement to live.)
# update foopies.sh & tegras.json in your working directory
# commit the changes to <tt>foopies.sh</tt> and <tt>tegras.json</tt>
#* make sure json is clean: <tt>python -c 'import json; json.loads(open("tegras.json").read())'</tt>
# in buildbot, request a "graceful shutdown"
#* wait for tegra to show "idle"
# on the old foopy:
#* stop the tegra via <tt>/builds/stop_cp.sh</tt>
#* manually remove the tegra from the <tt>/builds/create_dirs.sh</tt> file
#** <strike>'''OR''' run <tt>./foopies.sh old_foopy_number</tt> from your working directory</strike> blocked by: {{bug|713690}}
# on the new foopy, if it is already in service (if the foopy is brand new, the initial install will do all these steps):
#* update the local tools: <tt>cd /builds/tools ; hg pull --update; cd -</tt>
#* manually add the tegra to the <tt>/builds/create_dirs.sh</tt> file
#* manually run <tt>cd /builds; ./create_dirs.sh</tt>
#* if this is a replacement tegra, manually push the ini files by judicious use of: <tt>grep python update_tegra_ini.sh | sed 's/$TEGRA/tegra-xxx/'</tt>
# on the new foopy:
#* restart the tegras using <tt>screen -x # or you will hit bug 642369; cd /builds ; ./start_cp.sh</tt>
#** '''NOTE:''' do not start any new tegras, which require a reconfig to be active, until after the reconfig is complete.
== Move a tegra from staging to production ==
# If the tegra is running, stop it: <tt>/builds/stop_cp.sh tegra-###</tt>
# Edit the tegra's buildbot.tac: <tt>/builds/tegra-###/buildbot.tac</tt>
# Adjust the master, port and password to the appropriate server
# Save and restart the Tegra: <tt>screen -x # or you will hit bug 642369; /builds/start_cp.sh tegra-###</tt>
'''Note''' - yes, it's a manual process (sorry) until Tegras are in SlaveAlloc
This process also applies to moving tegras from one dev staging master to another.


= Managing SUTagent =
= Managing SUTagent =
Line 274: Line 22:
  cd ..
  cd ..
  ./stop_cp.sh
  ./stop_cp.sh
  detach from screen, this will take 45-60 minutes to bring down all the attached tegras
  detach from screen, this will take 45-60 minutes to bring down all the attached pandas
  ./start_cp.sh
  ./start_cp.sh


Line 316: Line 64:
= Environment =
= Environment =


The Tegra builders are run on multiple "foopy" servers with about 15-20 Tegras per foopy.  Each Tegra has its own buildslave environment and they share common tool and talos environments -- all found inside of '''/builds'''.
The panda builders are run on multiple "foopy" servers with about 10-12 pandas per foopy.  Each panda has its own buildslave environment and they share common tool and talos environments -- all found inside of '''/builds'''.


* Each Tegra has a '''/builds/tegra-###''' containing a buildslave and a symlink to /builds/tools/sut_tools/clientproxy.py
* Each panda has a '''/builds/panda-###''' containing a buildslave and a symlink to /builds/tools/sut_tools/clientproxy.py
* All of the shared talos info is in '''/builds/talos-data''' and HG is used to maintain it
* All of the shared talos info is in '''/builds/talos-data''' and HG is used to maintain it
* All of the sut related helper code is found '''/builds/sut_tools''' (a symlink to /builds/tools/sut_tools/)
* All of the sut related helper code is found '''/builds/sut_tools''' (a symlink to /builds/tools/sut_tools/)
Tegra is the short name for the Tegra 250 Developer Kit test board, see http://developer.nvidia.com/tegra-250-tango-development-kits for details.  It allows us to install and test Firefox on a device that runs Android Froyo while also allowing for debugging.
Unlike the N900s, we don't run a buildbot environment on the device, but rather communicate with the device via the sutAgentAndroid program that the a-team maintains.  All of the buildslave activities are handled by the clientproxy.py program, which monitors the Tegra and its state and starts/stops the buildslave as needed.


= Foopies =
= Foopies =
Line 336: Line 80:
where action can be (currently) "update" or "show_revision"
where action can be (currently) "update" or "show_revision"


See {{bug|809763}} or {{bug|846354}} for details
See {{bug|809763}} or {{bug|846354}} for details.  This usually happens automatically on reconfig so it's not something we need to do manually as a general rule.


= Pandas =
= Pandas =