ReleaseEngineering:Sheriffing:HowTo: Difference between revisions

From MozillaWiki

Revision as of 13:21, 8 October 2010

NOTE: There is a newer RelEngWiki version of this page

This page serves as a clearinghouse of information on how to perform the various duties associated with buildduty.

Build dashboard

Re-run jobs

Trigger several talos jobs

  • Sometimes a developer wants more talos runs for a given changeset
  • Do this quickly, because tinderbox-builds are eventually removed
  • Load the revision/branch/changeset page (e.g. bug 598153)
  • Use the filter to show only the talos suites you want, then open the links on the right and rebuild
  • Remember that the queue collapses for talos, so you have to wait for a job to become a running job

Try Server

How do I trigger a talos run for a given try build?

  • Someone pings you in #build with the name of their try run directory (format: $email-$changeset, e.g. lsblkk@mozilla.org-4asf23fsd251d)
  • Either:
    • ssh into production-master{01,02,03}
    • OR run tools/buildfarm/maintenance/try_sendchange.py from your own machine
      • on the production masters there is a ~/try_sendchange.sh wrapper script, which uses argparse in /tools/buildbot-0.8.0/bin/python

Then run:

./try_sendchange.sh $email-$changeset
# OR to do custom set of talos suites
./try_sendchange.sh $email-$changeset --t scroll,svg,nochrome
# NOTICE no spaces between comma-separated suite names!
  • The script prints every sendchange it performs.
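
The directory-name format and the no-spaces rule can be checked before invoking the wrapper. The following is a minimal Python sketch, not part of try_sendchange.py: the helper names `split_try_dir` and `build_sendchange_args` are hypothetical, and it assumes the changeset itself never contains a hyphen.

```python
def split_try_dir(try_dir):
    """Split '$email-$changeset' at the last hyphen.

    Assumption: the changeset contains no hyphen, so the last one is the separator.
    """
    email, sep, changeset = try_dir.rpartition("-")
    if not sep or not email or not changeset:
        raise ValueError("expected $email-$changeset, got %r" % try_dir)
    return email, changeset


def build_sendchange_args(try_dir, suites=None):
    """Return the argument list for ./try_sendchange.sh as used on this page."""
    split_try_dir(try_dir)  # validate the directory name first
    args = ["./try_sendchange.sh", try_dir]
    if suites:
        # Suite names are joined with commas and no spaces.
        args += ["--t", ",".join(s.strip() for s in suites)]
    return args


print(build_sendchange_args("lsblkk@mozilla.org-4asf23fsd251d",
                            ["scroll", "svg", "nochrome"]))
```

Building the arguments as a list avoids accidentally introducing a space between suite names when typing the command by hand.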

Using the TryChooser to submit build/test requests

Buildduty can also use the TryChooser syntax that developers use to (re)submit build and test requests. Here is an example:

buildbot sendchange --master production-master01:9009 --revision 923103d5a656 --branch try --author mpalmgren@mozilla.com --comments "try: -b d -p linux -u all" doit
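
When several requests need to be resubmitted, it can help to assemble the command from its parts. This is only a sketch in Python: the function name `trychooser_sendchange` is hypothetical, the flags are copied from the example above, and the trailing `doit` argument is reproduced as-is.

```python
import shlex


def trychooser_sendchange(master, revision, author, try_syntax):
    """Build the `buildbot sendchange` argument list for a request on the try branch."""
    return [
        "buildbot", "sendchange",
        "--master", master,
        "--revision", revision,
        "--branch", "try",
        "--author", author,
        "--comments", try_syntax,
        "doit",  # trailing argument copied from the example above
    ]


argv = trychooser_sendchange(
    "production-master01:9009",
    "923103d5a656",
    "mpalmgren@mozilla.com",
    "try: -b d -p linux -u all",
)
# Quote each argument so the TryChooser string stays a single shell word.
print(" ".join(shlex.quote(a) for a in argv))
```

The quoting step matters because the TryChooser string contains spaces and must reach buildbot as one `--comments` value.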

How do I cancel existing jobs?

The cancellator.py script is set up on pm02. Here is a standard example:

# Dry run first to see what would be cancelled. 
python cancellator.py --db mysql://USERNAME:PASSWORD@tm-b01-master01.mozilla.org/buildbot_schedulers -b try -r 5ff84b660e90
# Same command run again with the force option specified (--yes-really) to actually cancel the builds
python cancellator.py --db mysql://USERNAME:PASSWORD@tm-b01-master01.mozilla.org/buildbot_schedulers -b try -r 5ff84b660e90 --yes-really

You'll need to fill in the appropriate USERNAME and PASSWORD above. The easiest way to get those is by running 'history | grep cancell' on pm02.
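
The two-step pattern (dry run, then the identical command with --yes-really) can be sketched so that both commands are guaranteed to differ only in the force flag. This is an illustrative Python sketch; `cancellator_args` is a hypothetical helper, and USERNAME and PASSWORD remain placeholders exactly as in the example above.

```python
def cancellator_args(db_url, branch, revision, really=False):
    """Return the cancellator.py command line; without `really` it is a dry run."""
    args = ["python", "cancellator.py", "--db", db_url, "-b", branch, "-r", revision]
    if really:
        args.append("--yes-really")  # force option: actually cancel the builds
    return args


# Placeholders, to be replaced with the real credentials.
DB = "mysql://USERNAME:PASSWORD@tm-b01-master01.mozilla.org/buildbot_schedulers"

dry_run = cancellator_args(DB, "try", "5ff84b660e90")
for_real = cancellator_args(DB, "try", "5ff84b660e90", really=True)

print(" ".join(dry_run))
print(" ".join(for_real))
```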

To loan slaves

  • change the cltbld and root passwords (passwd)
  • change the VNC password (Linux: vncpasswd / Windows: UltraVNC server "admin properties" on the bottom right of the task bar / OS X: System Preferences -> Sharing -> Remote Management)
  • stop buildbot from running after reboot (rename buildbot.tac / on Windows, rename startTalos.bat)
  • [only for build slaves] remember to remove all .ssh keys
  • give the developer the IP address, the cltbld password and the VNC password

To change autologin

  • Start -> Run -> control userpasswords2
    • (on Windows 7: Start -> Search programs and files -> netplwiz)
  • check the option “Users must enter a user name and password to use this computer”
  • apply
  • uncheck the option “Users must enter a user name and password to use this computer”
  • apply
  • when prompted, enter the account (cltbld) and the new password twice

Dealing with machines

NOTE: We now have flags. TODO update the following two bullet points next time

  • The current downtime bug should always be aliased as "releng-downtime": http://is.gd/cQO7I
  • The current machine reboots bug should always be aliased as "reboots": http://is.gd/dqSV0

Re-image machines

  • Be careful when re-imaging slaves that are supposed to be staging slaves: you will have to adjust the buildbot.tac manually and point it to the right master
    • See ReferencePlatforms/Test/WinXP#Installing_OPSI for how to make a Windows machine talk to staging-opsi after it has been re-imaged
  • Read "refreshing ref images" (ReferencePlatforms#Tracking_ref_images_being_refreshed) to know whether you need to ask IT to refresh the images of the ref machines.

Checklist:

  • is the machine a staging machine that needs to be re-imaged?
    • adjust the buildbot.tac accordingly
    • check that it is talking with the right puppet master or opsi server
      • for Linux check and for Mac check /Library/LaunchDaemons/com.reductivelabs.puppet.plist
  • very important to verify PostImage after re-imaging minis
  • clean and sign the puppet certificate
    • on the slave: puppetd --test --server ${puppet_master_hostname}.build.mozilla.org
    • on the puppet master: puppetca --clean ${hostname}.build.mozilla.org and puppetca --sign ${hostname}.build.mozilla.org

List of puppet masters:

  • mv-production-puppet.build (for whatever is left at Castro)
  • mpt-production-puppet.build (for most build VMs)
  • scl-production-puppet.build (most HW machines)
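
As a rough illustration, the list above can be combined with the puppetd command from the checklist. This Python sketch rests on assumptions: the dictionary keys are hypothetical labels chosen for this example, and the mapping is only a rule of thumb, since the list says "most" rather than "all".

```python
# Hypothetical labels mapped to the puppet masters listed above.
PUPPET_MASTERS = {
    "castro": "mv-production-puppet.build",
    "build_vm": "mpt-production-puppet.build",
    "hardware": "scl-production-puppet.build",
}


def puppetd_test_command(kind):
    """Return the `puppetd --test --server ...` command to run on the slave."""
    master = PUPPET_MASTERS[kind]
    # The list gives names ending in ".build"; the full name adds ".mozilla.org".
    return ["puppetd", "--test", "--server", master + ".mozilla.org"]


print(" ".join(puppetd_test_command("hardware")))
```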

I have also filed bug 601174 to improve how the staging machines get re-imaged.

Mobile

n810s

Once a device hits a hard state (100% of retries), it is dead. Please use this template to file a new bug with the device names.

  • 8 devices per bug max
  • if the newest open reimage bug has fewer than 8 devices, add to it until it has 8
  • once the newest bug has 8 devices in it, open a new bug
  • any bug that is resolved should not have any devices added to it
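
The batching rules above amount to: top up the newest open bug to 8 devices, then open new bugs of at most 8 each. A minimal Python sketch of that arithmetic follows; the function name `assign_devices` and the device names are hypothetical.

```python
MAX_DEVICES_PER_BUG = 8


def assign_devices(newest_open_bug_count, dead_devices):
    """Split dead devices between the newest open bug and new bugs.

    newest_open_bug_count: devices already in the newest open reimage bug,
    or None if there is no open bug (resolved bugs never get devices added).
    Returns (devices_for_existing_bug, list_of_device_groups_for_new_bugs).
    """
    if newest_open_bug_count is None:
        room = 0
    else:
        room = max(0, MAX_DEVICES_PER_BUG - newest_open_bug_count)
    to_existing = dead_devices[:room]
    rest = dead_devices[room:]
    new_bugs = [rest[i:i + MAX_DEVICES_PER_BUG]
                for i in range(0, len(rest), MAX_DEVICES_PER_BUG)]
    return to_existing, new_bugs


# Hypothetical device names, for illustration only.
devices = ["n810-%03d" % i for i in range(1, 12)]
existing, new_bugs = assign_devices(6, devices)
print(existing)
print([len(group) for group in new_bugs])
```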

Nagios

Coordinate downtime with IT

  • Some IT maintenance requires tree closure. Details here: ReleaseEngineering:RelEngITSharedDowntime
  • If possible, consolidate RelEng and IT downtimes that need tree closures, to avoid having two tree closures soon after each other. This is "nice to do", not a "requirement"; if doing two separate downtimes reduces risk, that's fine!

IPMI

  • Go to the inventory page
  • Find the machine you want to access
  • Click on the IP hyperlink
  • Log-in to the SuperMicro interface
  • Click on the "Remote Management" tab
  • Launch "Remote Console"; this might download a .jnlp file that you have to open
  • Grant access to the Java iKVM Viewer
  • NOTE: If nothing pops up when you try to launch the console, try downgrading your Java (I used 6.14 instead of 6.21)