CIDuty
Looking for who is on buildduty? - check the tree-info dropdown on tbpl
Buildduty not around? - please open a bug
Each week there is one person from the Release Engineering team dedicated to helping out developers with releng-related issues. This person will be available during his or her regular work hours for the whole week. This is similar to the sheriff role that rotates through the developer community. To avoid confusion, the releng sheriff position is known as "buildduty."
Here's how to do it.
Schedule
Mozilla Releng Sheriff Schedule (Google Calendar|iCal|XML)
General Duties
How should I make myself available for duty?
- Add 'buildduty' to your IRC nick
- be in at least #developers, #buildduty and #build (as well as #mozbuild of course)
- also useful to be in #mobile, #planning, #release-drivers, and #ateam
- watch http://tbpl.mozilla.org
What else should I take care of?
You will need to
- Direct people to http://mzl.la/tryhelp for self-serve documentation.
- Keep wiki.m.o/ReleaseEngineering:Maintenance up to date with any significant changes
You should keep on top of
- pending builds - available in graphs or in the "Infrastructure" pulldown on TBPL. The graphs are helpful for noticing anomalous behavior.
- all bugs tagged with buildduty in the whiteboard (make a saved search)
- The releng-triage search - part of buildduty is leaning on your colleagues to take bugs
- The buildduty person needs to go through all the bugs in the releng-triage query at *least* a few times each day. This doesn't mean you have to *fix* them all immediately; finding other owners is part of triaging the queue. However, you do need to at least *see* them, know whether there are any urgent problems, and categorize them appropriately. Sometimes urgent security bugs arrive here and need to be jumped on immediately, such as bug 635638.
- Bum slaves - you should see to it that bum slaves aren't burning builds, and that all slaves are tracked on their way back to operational status
- Check the hardware whiteboard tag, too, for anything that slipped between the cracks.
- See the sections below on #Requesting Reboots
- Monitor dev.tree-management newsgroup (by email or by nntp)
- wait times - either this page or the emails (un-filter them in Zimbra). Respond to any unusually long wait times (hopefully with a reason)
- there is a cronjob in anamarias' account on cruncher that runs this for each pool:
    /usr/local/bin/python $HOME/buildapi/buildapi/scripts/mailwaittimes.py \
        -S smtp.mozilla.org \
        -f nobody@cruncher.build.mozilla.org \
        -p testpool \
        -W http://cruncher.build.mozilla.org/buildapi/reports/waittimes \
        -e $(date -d "$(date +%Y-%m-%d)" +%s) -t 10 -z 10 \
        -a dev-tree-management@lists.mozilla.org
- You may need to plan a reconfig or a full downtime
- Reconfigs: look at releng-needs-reconfig broken query to see what's pending. Reconfigs can be done at any time.
- Downtimes: look at releng-needs-treeclosure broken query to see what's pending. Coordinate with Zandr and IT to send downtime notices with enough advance notice.
You will also be responsible for coordinating master reconfigs - see the releng-needs-reconfig search.
Scheduled Reconfigs
Buildduty is responsible for reconfiguring the Buildbot masters every Monday and Thursday, in their local time. During the reconfig, buildduty needs to merge the default branch to the production branches and reconfig the affected masters. This wiki page has step-by-step instructions. Additional reconfigs may also be done at any time.
If the reconfig gets stuck, see How To/Unstick a Stuck Slave From A Master.
You should use Fabric to do the reconfig!
The person doing reconfigs should also update https://wiki.mozilla.org/ReleaseEngineering:Maintenance#Reconfigs_.2F_Deployments
Tree Maintenance
Repo Errors
If a dev reports a problem pushing to hg (either m-c or try repo) then you need to do the following:
- File a bug (or have the dev file it) and then poke noahm in #ops
- If he doesn't respond, then escalate the bug to page on-call
- Follow the steps below for "How do I close the tree"
How do I see problems in TBPL?
All "infrastructure" (that's us!) problems should be purple at http://tbpl.mozilla.org. Some aren't, so keep your eyes open in IRC, but get on any purples quickly.
How do I close the tree?
See ReleaseEngineering/How_To/Close_or_Open_the_Tree
How do I claim a rentable project branch?
See ReleaseEngineering/DisposableProjectBranches#BOOKING_SCHEDULE
Re-run jobs
How to trigger Talos jobs
see ReleaseEngineering/How_To/Trigger_Talos_Jobs
How to re-trigger all Talos runs for a build (by using sendchange)
see ReleaseEngineering/How_To/Trigger_Talos_Jobs
How to re-run a build
Do not go to the page of the build you'd like to re-run and cook up a sendchange to try to re-create the change that caused it. Changes without revlinks trigger releases, which is not what you want.
Instead, find the revision you want, find a builder page for the builder you want (preferably, but not necessarily, on the same master), and plug the revision, your name, and a comment into the "Force Build" form. Note that you MUST specify the branch, so that there are no null keys in builds-running.js.
Try Server
Jobs not scheduled at all?
Recreate the comment of their change with http://people.mozilla.org/~lsblakk/trychooser/ and compare it to make sure it is correct.
Then do a sendchange and tail the scheduler master:
buildbot sendchange --master buildbot-master10:9301 --revision 923103d5a656 --branch try --username mpalmgren@mozilla.com --comments "try: -b d -p linux -u all" doit
- If tryserver was just reset verify that the scheduler has been reset
How do I trigger additional talos/test runs for a given try build?
see ReleaseEngineering/How_To/Trigger_Talos_Jobs
Using the TryChooser to submit build/test requests
buildduty can also use the same TryChooser syntax as developers use to (re)submit build and testing requests. Here is an example:
buildbot sendchange --master buildbot-master10:9301 --revision 923103d5a656 --branch try --username mpalmgren@mozilla.com --comments "try: -b d -p linux -u all" doit
How do I cancel existing jobs?
The cancellator.py script is set up on pm02. Here is a standard example:

    # Dry run first to see what would be cancelled.
    python cancellator.py -b try -r 5ff84b660e90
    # Same command again with the force option (--yes-really) to actually cancel the builds
    python cancellator.py -b try -r 5ff84b660e90 --yes-really
The script is intended for try builds, but can be used on other branches as long as you are careful to check that no other changes have been merged into the jobs. Use the revision/branch/rev report to check.
Bug Commenter
This is on cruncher and is run in a crontab in lsblakk's account:
    source /home/lsblakk/autoland/bin/activate && cd /home/lsblakk/autoland/tools/scripts/autoland \
      && time python schedulerDBpoller.py -b try -f -c schedulerdb_config.ini -u None -p None -v
You can see quickly if things are working by looking at:
    /home/lsblakk/autoland/tools/scripts/autoland/postedbugs.log  # this shows what's been posted lately
    /home/lsblakk/autoland/tools/scripts/autoland/try_cache       # this shows what the script thinks is 'pending' completion
Nightlies
How do I re-spin mozilla-central nightlies?
To rebuild the same nightly, buildbot's Rebuild button works fine.
To build a different revision, Force build all builders matching /.*mozilla-central.*nightly/, on any of the regular build masters. Set revision to the desired revision. With no revision set, the tip of the default branch will be used, but it's probably best to get an explicit revision from hg.mozilla.org/mozilla-central.
You can use https://build.mozilla.org/buildapi/self-serve/mozilla-central to initiate this build, using the changeset at the tip of http://hg.mozilla.org/mozilla-central. Sometimes the developer will request a specific changeset in the bug.
Mobile
Android Tegras
Android Updates aren't working!
- Did the version number just change? If so, you may be hitting bug 629528. Kick off another Android nightly.
- Check aus3-staging for size 0 complete.txt snippets:
- https://bugzilla.mozilla.org/show_bug.cgi?id=652667#c1
- https://bugzilla.mozilla.org/show_bug.cgi?id=651925#c5
- If so, copy a non-size-0 complete.txt over the size 0 one. Only the most recent buildid should be size 0.
- Check aus3-staging to see if the checksum is correct:
- https://bugzilla.mozilla.org/show_bug.cgi?id=652785#c2
- If so, either copy the complete.txt with the correct checksum to the 2nd-most-recent buildid directory, or kick off another Android nightly.
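To spot the zero-byte snippets quickly, here is a minimal sketch; the helper name is made up, and it assumes you run it against the relevant snippet directory on aus3-staging:

```shell
# Hypothetical helper (not an official tool): list zero-byte complete.txt
# snippets under a directory tree, e.g. the Android snippet tree on aus3-staging.
find_empty_snippets() {
    # -size 0 matches files that are exactly zero bytes
    find "${1:-.}" -name complete.txt -size 0
}
```

Only the most recent buildid's complete.txt should show up in the output; anything older that is zero bytes needs to be overwritten per the steps above.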
Update mobile talos webhosts
We have a load balancer (bm-remote) in front of three web hosts (bm-remote-talos-0{1,2,3}). Here is how to update them:
    ssh root@bm-remote-talos-webhost-01
    cd /var/www/html/talos-repo
    # NOTICE that we have uncommitted files
    hg st
    # ? talos/page_load_test/tp4
    # Take note of the current revision to revert to (just in case)
    hg id
    hg pull -u
    # 488bc187a3ef tip
    rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-02:/var/www/html/.
    rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-03:/var/www/html/.
Keep track of which revision is being run.
Deploy new tegra-host-utils.zip
There are three hosts behind a load balancer.
- See bug 742597 for previous instance of this case.
    ssh root@bm-remote-talos-webhost-01
    cd /var/www/html/tegra
    wget -O tegra-host-utils.742597.zip http://people.mozilla.org/~jmaher/tegra-host-utils.zip
    rsync -azv /var/www/html/tegra/. bm-remote-talos-webhost-02:/var/www/html/tegra/
    rsync -azv /var/www/html/tegra/. bm-remote-talos-webhost-03:/var/www/html/tegra/
Slave Maintenance
In general, slave maintenance involves:
- keeping as many slaves up as possible, including
- proactively checking for hung/broken slaves (see links below)
- moving known-down slaves toward an operational state
- handling nagios alerts for slaves
- interacting with IT regarding slave maintenance
Kitten
kitten.py is a command-line tool designed to make information gathering and basic host-management tasks easier to do from the command line. You can get information about a host and request a reboot, all in one command.
A buildduty environment has been created on cruncher to make it easier to work with all of the briarpatch tools (of which kitten.py is one).
    sudo su - buildduty
    cd /home/buildduty/briarpatch
    . bin/activate
From there you can run:
python kitten.py <hostname>
Example output (line numbers added for reference):
     1  talos-r3-xp-013: enabled
     2      farm: moz
     3      colo: scl1
     4      distro: winxp
     5      pool: tests-scl1-windows
     6      trustlevel: try
     7      master: bm16-tests1-windows
     8      fqdn: talos-r3-xp-013.build.scl1.mozilla.com.
     9      PDU?: False
    10      IPMI?: False
    11      reachable: True
    12      buildbot: running; active; job 1 minute ago
    13      tacfile: found
    14      lastseen: 1 minute ago
    15      master: buildbot-master16.build.scl1.mozilla.com
1. hostname and its status according to slavealloc
2. farm: aws or moz
3. colo: which colo the host is located in (from slavealloc)
4. distro: which OS distribution slavealloc lists
5. pool: which build/test pool slavealloc lists
6. trustlevel: the host's trustlevel per slavealloc
7. master: the master that slavealloc lists for the host
8. fqdn: the FQDN returned from the DNS lookup
9. PDU?: does Inventory (or tegras.json) list a PDU for this host
10. IPMI?: does a -mgmt DNS entry exist for this host
11. reachable: was briarpatch able to successfully ping and SSH to the host
12. buildbot: the status of buildbot and what the last activity was
13. tacfile: was a buildbot.tac file found
14. lastseen: the timestamp of the last entry in twistd.log
15. master: what the buildbot.tac file lists as the host's master
Example of a host that cannot be reached:
    (production)[buildduty@cruncher production]$ python kitten.py -v talos-r3-xp-019
    talos-r3-xp-019: enabled
        farm: moz
        colo: scl1
        distro: winxp
        pool: tests-scl1-windows
        trustlevel: try
        master: bm15-tests1-windows
        fqdn: talos-r3-xp-019.build.scl1.mozilla.com.
        PDU?: False
        IPMI?: False
    ERROR Unable to control host remotely
        reachable: False
        buildbot:
        tacfile:
        lastseen: unknown
        master: error: current master is different than buildbot.tac master []
The output up to the "ERROR" line shows all of the metadata for a host; if the host were reachable via SSH, the lines after it would show the details of the buildbot environment and its status.
kitten.py accepts the following options:

    kitten.py [--info | -i] [--reboot | -r] [--verbose | -v] [--debug]
- --info will show only the metadata and will not try to SSH to the host
- --reboot will try to gracefully shut down buildbot and reboot the host if it appears to be idle or hung
- --verbose will show what SSH commands are being run
- --debug shows everything --verbose shows and also displays the SSH output
File a bug
- Use this template so it fills in a few needed tags and the priority
- Set the subject and alias of the bug to the hostname
- Add any dependent bugs for IT actions or the slave's issue
- Submit
Slave Tracking
- Slave tracking is done via the Slave Allocator. Please disable/enable slaves in slavealloc and add relevant bug numbers to the Notes field.
Slavealloc
Adding a slave
Slaves are added to slavealloc via the 'dbimport' subcommand of the 'slavealloc' command. This is generally run as the slavealloc user on the slavealloc server, which is most easily accessed via su from root.
You'll want a command line something like
/tools/slavealloc/bin/slavealloc dbimport -D $db_url --slave-data mydata.csv
where $db_url is most easily found in slavealloc's shell history. The CSV file should have the headers specified by 'slavealloc dbimport --help':
name,basedir,distro,bitlength,purpose,datacenter,trustlevel,speed,environment,pool
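For illustration, a one-row CSV might look like the following; the header is from 'slavealloc dbimport --help', but the row values are hypothetical and should be checked against existing entries in slavealloc:

```
name,basedir,distro,bitlength,purpose,datacenter,trustlevel,speed,environment,pool
talos-r3-xp-099,C:\talos-slave,winxp,32,tests,scl1,try,mini,prod,tests-scl1-windows
```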
Adding masters is similar - see dbimport's help for more information.
Removing slaves
Connect to slavealloc@slavealloc and look at the shell history for a command like this:

    mysql -h $host_ip -p -u buildslaves buildslaves
    # type the password
    SELECT name FROM slaves WHERE notes LIKE '%bumblebumble%';
    DELETE FROM slaves WHERE notes LIKE '%bumblebumble%';
How Tos
see ReleaseEngineering/How_To for a list of public How To documents
see RelEngWiki/Category:HowTo for list of private How To documents
Nagios
What's the difference between a downtime and an ack?
Both will make nagios stop alerting, but there's an important difference: acks are forever. Never ack an alert unless the path to victory for that alert is tracked elsewhere (in a bug, probably). For example, if you're annoyed by tinderbox alerts every 5 minutes, which you can't address, and you ack them to make them disappear, then unless you remember to unack them later, nobody will ever see that alert again. For such a purpose, use a downtime of 12h or a suitable interval until someone who *should* see the alert is available.
How do I interact with the nagios IRC bot?
    nagios: status                            (gives current server stats)
    nagios: status $regexp                    (gives status for a particular host)
    nagios: status host:svc                   (gives status for a particular service)
    nagios: ignore                            (shows current ignores)
    nagios: ignore $regexp                    (ignores alerts matching $regexp)
    nagios: unignore $regexp                  (unignores an existing ignore)
    nagios: ack $num $comment                 (adds an acknowledgement comment; $num comes from [brackets] in the alert)
                                              (note that the numbers only count up to 100, so ack things quickly or use the web interface)
    nagios: unack $num                        (reverses an acknowledgement)
    nagios: downtime $service $time $comment  (copy/paste the $service from the alert; time suffixes are m, h, d)

    e.g.: nagios-sjc1: downtime buildbot-master06.build.scl1:buildbot 2h bug 712988
How do I scan all problems Nagios has detected?
- All unacknowledged problems:
- All unacknowledged problems with notifications enabled with HARD failure states (i.e. have hit the retry attempt ceiling):
- Group hosts check
How do I deal with Nagios problems?
Note that most of the nagios alerts are slave-oriented, and the slave duty person should take care of them. If you see something that needs to be rectified immediately (e.g., a slave burning builds), do so, and hand off to slave duty as soon as possible.
Nagios will alert every 2 hours for most problems. This can get annoying if you don't deal with the issues. However: do not ever disable notifications.
You can acknowledge a problem if it's tracked to be dealt with elsewhere, indicating that "elsewhere" in the comment. Nagios will stop alerting for ack'd services, but will continue monitoring them and clear the acknowledgement as soon as the service returns to "OK" status -- so we hear about it next time it goes down.
For example, this can point to a bug (often the reboots bug) or to the slave-tracking spreadsheet. If you're dealing with the problem right away, an ACK is not usually necessary, as Nagios will notice that the problem has been resolved. Do *not* ack a problem and then leave it hanging - when we were cleaning out nagios we found lots of acks from 3-6 months ago with no resolution to the underlying problem.
You can also mark a service or host for downtime. You will usually do this in advance of a planned downtime, e.g., a mass move of slaves. You specify a start time and duration for a downtime, and nagios will silence alerts during that time, but begin alerting again when the downtime is complete. Again, this avoids getting us in a state where we are ignoring alerts for months at a time.
At worst, if you're overwhelmed, you can ignore certain alerts (see above) and scan the full list of problems (again, see above), then unignore.
Known nagios alerts
    <nagios-sjc1> [28] dm-hg02:https - hg.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
    <armenzg_buildduty> arr: should I be worrying about this message?
    <nagios-sjc1> [26] dm-hg02:http - hg.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
    <nthomas> depends if ssh is down
    <nagios-sjc1> [29] talos-r3-fed-018.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
    <nthomas> seems to work ok still, so people can push
    <nthomas> I get the normal |No interactive shells allowed here!| and it kicks me out as expected
    [30] signing1.build.scl1:Ganglia IO is WARNING: CHECKGANGLIA WARNING: cpu_wio is 30.60

This is normally due to releases. We might have to bump the threshold.
Downtimes
The downtimes section had grown quite large. If you have questions about how to schedule a downtime, who to notify, or how to coordinate downtimes with IT, please see the Downtimes page.
Talos
Note: because a change to the Talos bundle always changes the baseline times, the following should be done for *any* change...
- close all trees that are impacted by the change
- ensure all pending builds are done and GREEN
- do the update step below
- send a Talos changeset to all trees to generate new baselines
How to update the talos/pageloader zips
NOTE: Deploying talos.zip is no longer scary: we no longer replace the file in place, and the a-team has to land a change in the tree.
You may need to get IT to turn on access to build.mozilla.org.
    # use your short LDAP name (jford, not jford@mozilla.com)
    ssh jford@build.mozilla.org
    cd /var/www/html/build/talos/zips/
    # NOTE: including the bug# and talos cset in the name helps with tracking back
    wget -O talos.bug#.cset.zip <whatever>talos.zip
    cd /var/www/html/build/talos/xpis
    # NOTE: we overwrite it, unlike talos.zip, since it has not been ported to the talos.json system
    wget <whatever>/pageloader.xpi
For talos.zip changes: once deployed, notify the a-team and let them know that they can land at their own convenience.
Updating talos for Tegras
To update talos on Android,
    # for foopy05-24 (note: foopy21 is not in the list)
    csshX --login=cltbld foopy{05,06,07,08,09,10,11,12,13,14,15,16,17,18,19,20,22,23,24}
    cd /builds/talos-data/talos
    hg pull -u
This will update talos on each foopy to the tip of default.
Updating talos for N900s
    ssh cltbld@production-mobile-master
    cd checkouts
    ./update.sh
This will update the fennecmark, maemkit, talos, and pageloader tarballs on pmm to the latest in their repos.
TBPL
How to deploy changes
RelEng no longer has access to do this. TBPL devs will request a push from Server Ops.
How to hide/unhide builders
- In the 'Tree Info' menu select 'Open tree admin panel'
- Filter/select the builders you want to change
- Save changes
- Enter the sheriff password and a description (with bug number if available) of your changes
Useful Links
- Build Dashboard Main Page
- You can get JSON dumps for people to analyze by adding
&format=json
- You can see all build and test jobs for a certain branch for a certain revision by appending branch/revision to this link (e.g. revision/places/c4f8232c7aef)
- http://cruncher.build.mozilla.org/~bhearsum/cgi-bin/missing-slaves.py -- a list of slaves which are known on production masters but are not connected to any production masters. Note that this includes preprod and staging slaves, as well as some slaves that just don't exist. Use with care.
- http://build.mozilla.org/builds/last-job-per-slave.html (replace html with txt for text only version)
L10n Nightly Dashboard
Slave Handling
You'll need to be familiar with the location of slaves. You can find this with 'host' if you don't know off the top of your head
    host linux-ix-slave07
    linux-ix-slave07.build.mozilla.org is an alias for linux-ix-slave07.build.mtv1.mozilla.com.
Restarting Wedged Slaves
See How To/Get a Missing Slave Back Online.
Reboot an IX slave: Connect To IPMI
Requesting Reboots
Some slaves run on unmanaged hardware, meaning that the hardware can get into a state where someone must be onsite to unwedge it. Note that iX systems and VMs are not unmanaged, and should not be on a reboots bug. When an unmanaged host becomes unresponsive, it gets added to a reboots bug, based on its datacenter:
- https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-scl1 (by far the most common, since about 10 talos machines die per week)
- https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-sjc1
- https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-mtv1
NOTE: these bugs are formulaic. Don't get creative! Just add the hostname of the slave in a comment, or if you are adding multiple slaves at once, list each on its own line. If there's something the onsite person needs to know, include it after the hostname, on the same line. Do not try to "summarize" all of the slaves on the bug in a single comment.
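For example, a comment following this formula might read (hostnames and details hypothetical):

```
talos-r3-fed-025.build.scl1.mozilla.com
talos-r3-fed-031.build.scl1.mozilla.com - power light on but no response to ping or ssh
```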
Simultaneously, 'ack' the alert in #build:
    10:27 < nagios-sjc1> [25] talos-r3-fed-025.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
    10:51 < dustin> nagios-sjc1: ack 25 reboots-scl1
    10:51 < nagios-sjc1> talos-r3-fed-025.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%;dustin;reboots-scl1
When Requested Reboots are Done
Checking Slaves
Once a reboots bug is closed by an onsite person, read the update to see which hosts were rebooted, and which (if any) require further work. Such further work should be deferred to a new bug, which you should open if relops did not (often time is tight at the datacenter). Update the slave tracking spreadsheet accordingly:
- for slaves that were rebooted normally: replace the "Bug #" column with the actual reboots bug number; set "Issue" to "rebooted"; and set "Blocked On" to "check" (which will turn the cell yellow). Check BuildAPI a few hours later to see if these slaves are building properly, and delete the rows from the spreadsheet if so.
- for slaves that were reimaged during the reboot process: replace the "Bug #" column with the actual reboots bug number; set "Issue" to "reimaged"; and set "Blocked On" to "set up". That set-up is your responsibility, too -- see How To/Set Up a Freshly Imaged Slave.
- for slaves that require further work from relops, change the "Bug #" column to reflect the bug tracking that work, and set the "Issue" and "Blocked On" columns appropriately
If any slaves were missed in the reboot process, add them to a new reboots bug.
New Bug
Once a reboots bug is closed, you will need to open a new one for any subsequent reboots. You don't have to wait until you need a reboot to do so. Here's how:
- remove the 'reboots-xxxx' alias from the previous reboots bug, and copy the bug's URL to your clipboard
- create a bug in "Server Operations: RelEng", with subject "reboot requests (xxxx)". You can leave the description blank if you don't have any slaves requiring reboot yet. Submit.
- edit the bug's colo-trip field to indicate the correct datacenter, and paste the previous reboot request's URL into the "See Also" field.
DNR
Slaves that are dead and not worth repairing are marked as "DNR" in the slave tracking spreadsheet. The types of slaves that are acceptable for DNR are listed in the "DNR'd Silos" sheet of the slave tracking spreadsheet. Such slaves should be acked in nagios, but are not tracked in any bug.
Loans
We need to track a slave from the time it is loaned out until it is back in its proper place (be that staging, preprod, or production). Currently we use bugs to track this flow.
- Bug from dev requesting loaner (build or test slave, platform, bug this is being used to help with)
- Loan it: How To/Send a slave out for loan
- File a bug to the RelEng component (connected to bug in point #1) to track the re-imaging and returning of the slave to its pool when it's returned -- I've been asking the dev to please comment in that bug when they are done with the loaner
- File a bug on ServerOps asking for re-image (blocking bug in #3) How To/Request That a Machine Be Reimaged
- When it's re-imaged, put it back in the pool How To/Set Up a Freshly Imaged Slave
Maintenance
Periodically scan the slave spreadsheet. Check slaves marked "check". Set up slaves marked "set up". Ask developers who have borrowed slaves to see if they're done with them. Ask relops about progress on broken slaves.
Common Failure Modes
Some slaves, especially linux VMs, will fail to clobber and repeatedly restart. In nagios, this causes all of the checks on that host to bounce up and down, because the reboots occur on a similar schedule to nagios's checks. Sometimes you can catch this via SSH, but the reboots are *very* quick and it may be easier to use vSphere Client to boot the host into single-user mode and clean out the build dirs.
All of the linux slaves will reboot after 10 attempts to run puppet. A puppet failure, then, will manifest as buildbot failing to start on that host. To stop the reboot cycle, log in to the slave and kill S98puppet (centos) or run-puppet-and-slave.sh (fedora).
Standard Bugs
- The current downtime bug should always be aliased as "releng-downtime": http://is.gd/cQO7I
- Reboots bugs have the Bugzilla aliases shown above.
- For IT bugs that are marked "infra only" yet still need to be readable by RelEng, it is not enough to add the release@ alias - people get updates but are not able to comment or read prior comments. Instead, cc the following:
- :aki, :armenzg, :bear, :bhearsum, :catlee, :coop, :hwine, :jhford, :joduinn, :joey, :lsblakk, :nthomas, :rail
Ganglia
- if you see that a host is reporting to ganglia incorrectly, it might take just this to fix it (e.g. bug 674233):

    # switch to root, then:
    service gmond restart
Queue Directories
If you see this in #build:
<nagios-sjc1> [54] buildbot-master12.build.scl1:Command Queue is CRITICAL: 4 dead items
It means that there are items in the "dead" queue for the given master. Look at the logs, fix any underlying issue, and then retry the command by moving *only* the json file over to the "new" queue. See the Queue directories wiki page for details.
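The retry step can be sketched as follows; the function name and the layout (dead/ and new/ as sibling directories under one queue directory) are assumptions, so check the actual queue paths on the master first:

```shell
# Hypothetical sketch, not an official tool: requeue dead items by moving
# only the *.json payloads from the "dead" queue back into "new".
# Any accompanying log files stay behind in dead/ for inspection.
retry_dead_items() {
    queue_dir="$1"
    for f in "$queue_dir"/dead/*.json; do
        [ -e "$f" ] || continue   # glob matched nothing; skip
        mv "$f" "$queue_dir/new/"
    done
}
```

Only do this after reading the logs and fixing the underlying issue, or the item will just end up dead again.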
Cruncher
If you get an alert about cruncher running out of space it might be a sendmail issue (backed up emails taking up too much space and not getting sent out):
<nagios-sjc1> [07] cruncher.build.sjc1:disk - / is WARNING: DISK WARNING - free space: / 384 MB (5% inode=93%):
As root:
    du -s -h /var/spool/*
    # confirm that mqueue or clientmqueue is the oversized culprit
    # stop sendmail, clean out the queues, restart sendmail
    /etc/init.d/sendmail stop
    rm -rf /var/spool/clientmqueue/*
    rm -rf /var/spool/mqueue/*
    /etc/init.d/sendmail start