ReleaseEngineering/How To/Process release email
This is a list of automatically generated emails you should expect to receive as a release engineer at Mozilla. It is not complete.
Subject Index
Zimbra glob/wildcard syntax, in alpha order
Field | Wildcard | Further Notes |
---|---|---|
Subject | idle kittens report | see "briar-patch idle kittens reporting" |
Subject | Humpty Dumpty Error * | see "Puppet failing too many times on a slave" |
Subject | [puppet-monitoring]* | see "Puppet Log Monitoring" |
Subject | Talos Suspected machine issue * | if you don't know, you don't care |
Subject | Try submission * | to: autolanduser@mozilla.com |
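As a rough illustration of how the wildcards in the table above could be applied, the following Python sketch maps a subject line to the note for its row. The names `wildcard_to_regex` and `classify` are made up for this example. It assumes that `*` is the only special character and that matching ignores case; Zimbra's own filter engine may behave differently.

```python
import re

# Wildcards copied from the Subject Index table, paired with the note for each row.
SUBJECT_INDEX = [
    ("idle kittens report", "briar-patch idle kittens reporting"),
    ("Humpty Dumpty Error *", "Puppet failing too many times on a slave"),
    ("[puppet-monitoring]*", "Puppet Log Monitoring"),
    ("Talos Suspected machine issue *", "if you don't know, you don't care"),
    ("Try submission *", "to: autolanduser@mozilla.com"),
]


def wildcard_to_regex(wildcard):
    """Treat '*' as 'any run of characters' and everything else literally.

    Assumption: brackets are literal text, so '[puppet-monitoring]' is not
    read as a character class.
    """
    literal_parts = [re.escape(part) for part in wildcard.split("*")]
    return re.compile(".*".join(literal_parts), re.IGNORECASE)


def classify(subject):
    """Return the note for the first wildcard matching the whole subject, else None."""
    for wildcard, note in SUBJECT_INDEX:
        if wildcard_to_regex(wildcard).fullmatch(subject):
            return note
    return None
```

The wildcard is converted by hand rather than with `fnmatch`, because `fnmatch` would interpret `[puppet-monitoring]` as a set of characters instead of literal text.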
briar-patch idle kittens reporting
Why we get them
This is an email report outlining the status of any host that has been flagged as "idle".
What is sending them
A cron job that runs the kittenreaper.py task with the following parameters:
python kittenreaper.py -w 1 -e
It pulls the list of hosts to check from http://build.mozilla.org/builds/slaves_needing_reboot.txt
What to do when one is received
Not sure yet, unless you're on buildduty, in which case you should be watching these reports.
Future plans
This will be replaced by the briar-patch dashboard.
How to best filter these emails
Filter by matching the subject line, which will not change.
Puppet Log Monitoring
Why we get them
There are messages in the puppet master logs that indicate something is wrong with a slave or master. Since we have no other master monitoring tools, we are defaulting to sending email.
What is sending them
scl-production-puppet (and soon all puppet masters) has an instance of 'watch-puppet.py' running under screen as root.
The code for this script is stored here
What to do when one is received
- if the title contains "[puppet-monitoring][master_name] <slavename> is waiting to be signed", this is for information and requires no immediate action
- if the title contains "[puppet-monitoring][master_name] <slavename> has invalid cert", the script tries once to clean the cert, then sends this email once a signing request is waiting. If the clean-up is successful, you'll see a matching "<slavename> is waiting to be signed" email. The key will be automatically signed by a cronjob.
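The two cases above can be told apart mechanically. The sketch below is only an illustration: `triage` is a hypothetical helper, and the exact spacing between `[master_name]` and `<slavename>` in real subjects is an assumption.

```python
import re

# Assumed subject layout: "[puppet-monitoring][master_name] <slavename> <condition>"
SUBJECT_RE = re.compile(
    r"\[puppet-monitoring\]\[(?P<master>[^\]]+)\]\s*(?P<slave>\S+) "
    r"(?P<condition>is waiting to be signed|has invalid cert)"
)

# Suggested reactions, paraphrased from the list above.
ACTIONS = {
    "is waiting to be signed": "informational, no immediate action",
    "has invalid cert": "expect a matching 'is waiting to be signed' email",
}


def triage(subject):
    """Return master, slave, condition and suggested action, or None if unrecognised."""
    match = SUBJECT_RE.search(subject)
    if match is None:
        return None
    result = match.groupdict()
    result["action"] = ACTIONS[result["condition"]]
    return result
```

For example, `triage("[puppet-monitoring][scl-production-puppet] example-slave has invalid cert")` returns a dictionary whose `slave` entry is `example-slave` (a placeholder hostname).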
How to silence or acknowledge this alert
It is not currently possible to silence this email. The script sends an email each time the corresponding line pattern is seen in /var/log/messages. This means that, most likely, an email will be sent each time a slave tries to run puppet.
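To see why the volume can be high, consider a minimal sketch of per-line alerting. This is not the code of watch-puppet.py: `alerts_for` is a hypothetical function, and the pattern is passed in because the real patterns are not listed here.

```python
import re


def alerts_for(lines, pattern):
    """Return one alert per matching log line, with no de-duplication.

    Assumption: every occurrence of the pattern produces its own email,
    as described above.
    """
    regex = re.compile(pattern)
    return [line.rstrip("\n") for line in lines if regex.search(line)]


# Placeholder lines, not real puppet master log output.
sample = [
    "example-slave: PLACEHOLDER problem\n",
    "unrelated line\n",
    "example-slave: PLACEHOLDER problem\n",
]
alerts = alerts_for(sample, r"PLACEHOLDER problem")
```

Because nothing remembers which slaves have already been reported, the two identical lines in `sample` yield two alerts.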
Future plans
In the short term, we'd like to have this script monitor the puppet logs for more error conditions. It would also make sense to monitor all puppet masters.
How to best filter these emails
- subject includes [puppet-monitoring]
Puppet failing too many times on a slave
Why we get them
We have no other monitoring for slaves failing to run puppet successfully. This became a large issue with the rev4 talos machines due to bug 700672. We are now doing an exponential back-off on these slaves with a set number of iterations. Once the maximum number of iterations is reached, the slave sends this email and then reboots. This helps us avoid load on the puppet master and gives the machines a chance to fix themselves by rebooting.
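The retry behaviour described above can be sketched as follows. The real logic lives in run-puppet.sh, which is a shell script; this Python version is only an illustration. The function name, the default attempt count, the base delay, and the text after "Humpty Dumpty Error" in the subject are all assumptions.

```python
import time


def run_with_backoff(run_puppet, send_email, reboot,
                     max_attempts=5, base_delay=60, sleep=time.sleep):
    """Try puppet up to max_attempts times, doubling the wait after each failure.

    run_puppet, send_email and reboot are callables supplied by the caller;
    max_attempts and base_delay (seconds) are illustrative values only.
    Returns True if a puppet run succeeded, False if the slave gave up.
    """
    for attempt in range(max_attempts):
        if run_puppet():
            return True
        # Wait 1x, 2x, 4x, ... the base delay, but not after the final failure.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    # Maximum number of iterations reached: send the email, then reboot.
    send_email("Humpty Dumpty Error: puppet failed %d times" % max_attempts)
    reboot()
    return False
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; with `base_delay=60` and `max_attempts=5` the waits would be 60, 120, 240 and 480 seconds.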
What is sending them
Each machine that has these emails enabled sends the email itself when its final puppet attempt fails, right before it reboots.
The code that sends them is unversioned, but is deployed to the slaves from
scl-production-puppet:/N/production/darwin10-i386/test/usr/local/bin/run-puppet.sh
What to do when one is received
- Either ignore the email, or find the root of the problem and fix it.
How to silence or acknowledge this alert
This email is a temporary workaround until we get a real puppet client monitoring tool. It will be sent each time the maximum number of retries is reached, which is every couple of hours.
Future plans
We would really like to replace these emails with real puppet monitoring.
How to best filter these emails
These emails are best filtered by matching "Humpty Dumpty Error" in the subject. Because the hostname on the slave might not always be correct, filtering on domain names might not catch all cases.
Sample
Why we get them
Give a brief explanation of what this email is for, what it helps us do, and why it should be watched.
What is sending them
Include a link to the source of the program sending the email. Include information on which hosts are sending the email, and explain how the program runs. Is it a daemon? Does it have an init script? Do you run it under screen?
What to do when one is received
- if the title contains "[scl-production-puppet-new] <slavename> is waiting to be signed", this is for information and requires no immediate action
- if the title contains "[scl-production-puppet-new] <slavename> has invalid cert", the script will try once to clean the cert before sending the email. If this is successful, you'll see a matching "<slavename> is waiting to be signed" email. The key will be automatically signed
How to silence or acknowledge this alert
Include information on how to make the emails stop.
Future plans
Provide any future plans for this email. Is it temporary? Is it going to be replaced by a real dashboard? Are you going to add or change things people filter on?
How to best filter these emails
Provide insight on how to filter these emails. Is there a distinguishing header? Is it always from a specific host, or family of hosts? Is there a distinctive subject?