Revision as of 13:21, 8 October 2010

NOTE: There is a newer RelEngWiki version of this page

This page serves as a clearinghouse of information on how to perform the various duties associated with buildduty.

Build dashboard

Build Dashboard Main Page
- You can get JSON dumps for people to analyze by adding &format=json to the URL
- You can see all build and test jobs for a certain branch and revision by appending branch/revision to this link (e.g. revision/places/c4f8232c7aef)
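
As a minimal sketch of the &format=json tip, the helper below adds that parameter to any dashboard URL. The host and path in the example are placeholders, not the real dashboard address, and the function name is made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_json_format(url):
    """Return url with format=json set in its query string."""
    parts = urlsplit(url)
    # Drop any existing format parameter so it is not duplicated.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "format"]
    query.append(("format", "json"))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical dashboard URL, used only to show the call.
print(with_json_format("http://dashboard.example/builds?branch=places"))
```

The same helper also works on a URL that has no query string yet, in which case it adds ?format=json.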

Re-run jobs

Trigger several talos jobs

Sometimes a developer wants more talos runs for a certain changeset
Do this quickly, because tinderbox-builds get removed
Load up the revision/branch/changeset page (e.g. bug 598153)
Use the filter to show only the talos suites you want, open the links on the right and rebuild
Remember that the queue collapses for talos, so you will have to wait for the job to become a running job

Try Server

How do I trigger a talos run for a given try build?

When someone pings you in #build with their try run dir name (format: $email-$changeset, e.g. lsblkk@mozilla.org-4asf23fsd251d):
Either:
- ssh into production-master{01,02,03}
- OR run from your machine tools/buildfarm/maintenance/try_sendchange.py
  - on the production-masters, there is a ~/try_sendchange.sh wrapper script which uses argparse in /tools/buildbot-0.8.0/bin/python

Then run:

./try_sendchange.sh $email-$changeset
# OR, to run a custom set of talos suites
./try_sendchange.sh $email-$changeset --t scroll,svg,nochrome
# NOTE: no spaces between the comma-separated suite names!

It will spew back to you all the sendchanges it does.
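
The two input conventions above (the $email-$changeset dir name and the space-free suite list) can be checked before running the script. This is only a sketch: the helper names are made up here and are not part of try_sendchange.py.

```python
def split_try_dir(name):
    """Split a try run dir name of the form $email-$changeset."""
    # Split on the LAST hyphen, since the email part may contain hyphens.
    email, sep, changeset = name.rpartition("-")
    if not sep or "@" not in email or not changeset:
        raise ValueError("expected $email-$changeset, got %r" % (name,))
    return email, changeset


def suites_argument(suites):
    """Join talos suite names with commas and no spaces."""
    cleaned = [suite.strip() for suite in suites]
    for suite in cleaned:
        if not suite or " " in suite or "," in suite:
            raise ValueError("bad suite name: %r" % (suite,))
    return ",".join(cleaned)


print(split_try_dir("lsblkk@mozilla.org-4asf23fsd251d"))
print(suites_argument(["scroll", " svg", "nochrome "]))
```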

Using the TryChooser to submit build/test requests

buildduty can also use the same TryChooser syntax that developers use to (re)submit build and test requests. Here is an example:

buildbot sendchange --master production-master01:9009 --revision 923103d5a656 --branch try --author mpalmgren@mozilla.com --comments "try: -b d -p linux -u all" doit
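
If you send these often, it may help to assemble the command from its parts. The sketch below uses only the flags that appear in the example above; the function name is hypothetical, and nothing here contacts a master, it only builds and prints the command line.

```python
import shlex


def sendchange_argv(master, revision, author, try_syntax, branch="try"):
    """Build the argument list for a TryChooser-style buildbot sendchange."""
    if not try_syntax.startswith("try:"):
        raise ValueError("TryChooser comments must start with 'try:'")
    return [
        "buildbot", "sendchange",
        "--master", master,
        "--revision", revision,
        "--branch", branch,
        "--author", author,
        "--comments", try_syntax,
        "doit",  # trailing file name, as in the example above
    ]


argv = sendchange_argv(
    "production-master01:9009",
    "923103d5a656",
    "mpalmgren@mozilla.com",
    "try: -b d -p linux -u all",
)
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in argv))
```

Keeping the arguments as a list avoids quoting mistakes around the comments string if the list is later handed to subprocess.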

How do I cancel existing jobs?

The cancellator.py script is set up on pm02. Here is a standard example:

# Dry run first to see what would be cancelled. 
python cancellator.py --db mysql://USERNAME:PASSWORD@tm-b01-master01.mozilla.org/buildbot_schedulers -b try -r 5ff84b660e90
# Same command run again with the force option specified (--yes-really) to actually cancel the builds
python cancellator.py --db mysql://USERNAME:PASSWORD@tm-b01-master01.mozilla.org/buildbot_schedulers -b try -r 5ff84b660e90 --yes-really

You'll need to fill in the appropriate USERNAME and PASSWORD above. The easiest way to get those is by running 'history | grep cancell' on pm02.
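
The dry-run-then-force habit can be wrapped so the second command is never run by accident. This is a sketch under assumptions: the wrapper functions are made up for illustration, and only the cancellator.py flags shown above are used. The example call passes a stand-in runner that records the commands instead of executing them.

```python
def cancellator_argv(db_url, branch, revision, really=False):
    """Build the cancellator.py command line; dry run unless really=True."""
    argv = ["python", "cancellator.py", "--db", db_url,
            "-b", branch, "-r", revision]
    if really:
        argv.append("--yes-really")
    return argv


def cancel_builds(db_url, branch, revision, run, confirm):
    """Do the dry run, then repeat with --yes-really only if confirm() agrees."""
    run(cancellator_argv(db_url, branch, revision))
    if not confirm():
        return False
    run(cancellator_argv(db_url, branch, revision, really=True))
    return True


# Stand-in runner: collects the commands rather than running them.
issued = []
db = "mysql://USERNAME:PASSWORD@tm-b01-master01.mozilla.org/buildbot_schedulers"
done = cancel_builds(db, "try", "5ff84b660e90",
                     run=issued.append, confirm=lambda: True)
for command in issued:
    print(command)
```

On pm02 one could pass subprocess.check_call as run and an interactive prompt as confirm; USERNAME and PASSWORD still have to be filled in as described above.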

To loan slaves

change cltbld's and root's password (passwd)
change vnc's password (Linux: vncpasswd / Windows: UltraVNC server "admin properties" on bottom right task bar / OSX: Control Panel -> Sharing -> Remote Management)
disable buildbot from running after reboot (rename buildbot.tac / rename startTalos.bat for Windows)
[only for build slaves] remember to remove all .ssh keys
give the developer the IP address, cltbld's password and the VNC password

To change autologin

start -> run -> control userpasswords2
- (on w7, start -> Search programs and files -> netplwiz)
check the option “Users must enter a user name and password to use this computer”
apply
uncheck the option “Users must enter a user name and password to use this computer”
apply
account: cltbld, enter new password twice

Dealing with machines

NOTE: We now have flags. TODO update the following two bullet points next time

The current downtime bug should always be aliased as "releng-downtime": http://is.gd/cQO7I
The current machine reboots bug should always be aliased as "reboots": http://is.gd/dqSV0

Re-image machines

Be careful when re-imaging slaves that are supposed to be staging slaves, as you will have to adjust the buildbot.tac manually and point it to the right master
- See ReferencePlatforms/Test/WinXP#Installing_OPSI for how to make a Windows machine talk to staging-opsi after it has been re-imaged
Read "refreshing ref images" (ReferencePlatforms#Tracking_ref_images_being_refreshed) to know if you need to ask IT to refresh the images of the ref machines.

Checklist:

is the machine a staging machine that needs to be re-imaged?
- adjust the buildbot.tac accordingly
- check that it is talking with the right puppet master or opsi server
  - for Linux check and for Mac check /Library/LaunchDaemons/com.reductivelabs.puppet.plist
very important to verify PostImage after re-imaging minis
clean and sign the puppet certificate
- on slave - puppetd --test --server ${puppet_master_hostname}.build.mozilla.org
- on master - puppetca --clean ${hostname}.build.mozilla.org and puppetca --sign ${hostname}.build.mozilla.org

List of puppet masters:

mv-production-puppet.build (for whatever is left at Castro)
mpt-production-puppet.build (for most build VMs)
scl-production-puppet.build (most HW machines)

I have also filed bug 601174 to improve how the staging machines get re-imaged.

Mobile

n810s

Once a device hits a hard state (100% of retries), it is dead. Please use this template to file a new bug with the device names.

8 devices per bug max
if the newest open reimage bug has fewer than 8 devices, please add to it until it has 8
once the newest bug has 8 devices in it, open a new bug
do not add devices to any bug that is resolved
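
The rules above amount to topping up the newest open bug and then batching the rest in groups of 8. A small sketch, with made-up device names and a hypothetical function name:

```python
MAX_DEVICES_PER_BUG = 8


def assign_devices(dead_devices, newest_open_bug_count=None):
    """Split dead devices into a top-up for the newest open bug and new bugs.

    newest_open_bug_count is the number of devices already on the newest
    OPEN reimage bug, or None if there is no open bug (resolved bugs never
    receive devices).
    """
    devices = list(dead_devices)
    if newest_open_bug_count is None:
        room = 0
    else:
        room = max(0, MAX_DEVICES_PER_BUG - newest_open_bug_count)
    top_up, rest = devices[:room], devices[room:]
    new_bugs = [rest[i:i + MAX_DEVICES_PER_BUG]
                for i in range(0, len(rest), MAX_DEVICES_PER_BUG)]
    return top_up, new_bugs


# Hypothetical device names, for illustration only.
dead = ["n810-%03d" % n for n in range(1, 13)]
top_up, new_bugs = assign_devices(dead, newest_open_bug_count=5)
print(top_up)
print(new_bugs)
```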

Nagios

All unacknowledged problems:
- https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=10
All unacknowledged problems which have notifications enabled:
- https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=8202
All unacknowledged problems with notifications enabled with HARD failure states (i.e. have hit the retry attempt ceiling):
- https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346
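
The serviceprops number in these links is a bitmask, which is why the three URLs differ only in that value. The sketch below decodes it. The bit values are the ones I believe Nagios defines in cgiutils.h; treat them as an assumption and check them against the Nagios version in use.

```python
# Assumed bit values (believed to match Nagios' cgiutils.h; verify locally).
SERVICE_PROPS = {
    2: "not in scheduled downtime",
    8: "unacknowledged",
    8192: "notifications enabled",
    262144: "hard state",
}


def decode_serviceprops(mask):
    """Return (labels for the known bits set in mask, leftover unknown bits)."""
    labels = [label for bit, label in sorted(SERVICE_PROPS.items())
              if mask & bit]
    unknown = mask
    for bit in SERVICE_PROPS:
        unknown &= ~bit
    return labels, unknown


for value in (10, 8202, 270346):
    print(value, decode_serviceprops(value))
```

A non-zero second element would mean the mask uses a bit this table does not list.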

Coordinate downtime with IT

Some IT maintenance requires tree closure. Details here: ReleaseEngineering:RelEngITSharedDowntime
If possible, consolidate RelEng and IT downtimes that need tree closures, to avoid having two tree closures soon after each other. This is "nice to do", not a "requirement"; if doing two separate downtimes reduces risk, that's fine!

IPMI

Go to the inventory page
Find the machine you want to access
Click on the IP hyperlink
Log-in to the SuperMicro interface
Click on the "Remote Management" tab
Launch "Remote Console", which might download a jnlp file that you have to open
Grant access to the Java iKVM Viewer
NOTE: If nothing comes up when you try to launch the console, try downgrading your Java (I used 6.14 instead of 6.21)


ReleaseEngineering:Sheriffing:HowTo: Difference between revisions
