ReleaseEngineering/How To/Process release email

This is a list of automatically generated emails you should expect to receive as a release engineer at Mozilla. It is not complete.

Subject Index

Zimbra glob/wildcard syntax, in alpha order

Field   | Wildcard                        | Further Notes
Subject | idle kittens report             | #briar-patch idle kittens reporting
Subject | Humpty Dumpty Error *           | #Puppet failing too many times on a slave
Subject | [puppet-monitoring]*            | #Puppet Log Monitoring
Subject | Talos Suspected machine issue * | if you don't know, you don't care
Subject | Try submission *                | to: autolanduser@mozilla.com
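
For anyone filtering outside Zimbra, the subject wildcards above can be approximated locally. The following is a minimal sketch, assuming "*" is the only wildcard that matters here and that a wildcard must match the whole subject; the function names are made up for this example. Python's fnmatch is deliberately avoided because it would read "[puppet-monitoring]" as a character class rather than literal text.

```python
import re

# Subject wildcards copied from the index above.
SUBJECT_WILDCARDS = [
    "idle kittens report",
    "Humpty Dumpty Error *",
    "[puppet-monitoring]*",
    "Talos Suspected machine issue *",
    "Try submission *",
]


def wildcard_to_regex(wildcard):
    """Turn a wildcard into a regex: "*" matches anything, the rest is literal."""
    literal_parts = [re.escape(part) for part in wildcard.split("*")]
    return re.compile(".*".join(literal_parts))


def matching_wildcard(subject):
    """Return the first wildcard that matches the whole subject, or None."""
    for wildcard in SUBJECT_WILDCARDS:
        if wildcard_to_regex(wildcard).fullmatch(subject):
            return wildcard
    return None
```

Calling matching_wildcard() on an incoming subject line returns the index entry it belongs to, which can then be used to pick a folder or tag.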

briar-patch idle kittens reporting

Why we get them

Email report outlining the status of any host that has been flagged as "idle"

What is sending them

A cron job that runs the kittenreaper.py task with the following parameters:

 python kittenreaper.py -w 1 -e

It pulls the list of hosts to check from http://build.mozilla.org/builds/slaves_needing_reboot.txt

What to do when one is received

Not sure yet. If you're on buildduty, you should be watching these reports.

Future plans

This will be replaced by the briar-patch dashboard.

How to best filter these emails

Filter by matching the subject line, which will not change.


Puppet Log Monitoring

Why we get them

There are messages in the puppet master logs that indicate something is wrong with a slave or master. Since we have no other master monitoring tools, we are defaulting to sending email.

What is sending them

scl-production-puppet (and soon all puppet masters) has an instance of 'watch-puppet.py' running under screen as root.

The code for this script is stored here

What to do when one is received

  • If the title contains "[puppet-monitoring][master_name] <slavename> is waiting to be signed", this is for information and requires no immediate action.
  • If the title contains "[puppet-monitoring][master_name] <slavename> has invalid cert", the script will try once to clean the cert before sending the email. If this is successful, you'll see a matching "<slavename> is waiting to be signed" email once there is a waiting signing request. The key will be automatically signed by a cron job.
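
The two subject formats above can be told apart mechanically. This is a rough sketch, assuming the subject layout is exactly "[puppet-monitoring][master_name] <slavename> <message>" as in the examples; the regex, the function name, and the action strings are illustrative and are not taken from watch-puppet.py.

```python
import re

# Assumed layout, inferred from the two example titles:
#   [puppet-monitoring][<master>] <slave> <message>
SUBJECT_RE = re.compile(
    r"\[puppet-monitoring\]\[(?P<master>[^\]]+)\]\s+(?P<slave>\S+)\s+(?P<message>.+)"
)

# Hypothetical action labels summarising the guidance in the list above.
ACTIONS = {
    "is waiting to be signed": "informational, no immediate action",
    "has invalid cert": "expect a matching 'is waiting to be signed' email",
}


def parse_subject(subject):
    """Split a puppet-monitoring subject into master, slave, message and action."""
    match = SUBJECT_RE.search(subject)
    if match is None:
        return None
    message = match.group("message").strip()
    action = "unknown"
    for phrase, label in ACTIONS.items():
        if phrase in message:
            action = label
            break
    return {
        "master": match.group("master"),
        "slave": match.group("slave"),
        "message": message,
        "action": action,
    }
```

Subjects that do not follow the assumed layout return None, so they can be left in the inbox for a human to read.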

How to silence or acknowledge this alert

It is not currently possible to silence this email. The script sends an email each time the corresponding line pattern is seen in /var/log/messages. This means that, most likely, an email will be sent each time a slave tries to run puppet.

Future plans

In the short term, we'd like to have this script monitor the puppet logs for more error conditions. It would also make sense to monitor all puppet masters.

How to best filter these emails

  • subject includes [puppet-monitoring]

Puppet failing too many times on a slave

Why we get them

We have no other monitoring for slaves that fail to run puppet successfully. This became a large issue with the rev4 talos machines due to bug 700672. We now do an exponential backoff on these slaves with a set number of iterations. Once the maximum number of iterations is reached, the slave sends this email and then reboots. This reduces load on the puppet master and lets the machines try to fix themselves by rebooting.
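
The retry behaviour described above can be sketched as follows. This is an illustration only: the deployed logic lives in the run-puppet.sh shell script, and the base delay, the iteration count, and the run_puppet and give_up callbacks used here are hypothetical stand-ins (give_up represents "send this email, then reboot").

```python
import time


def run_with_backoff(run_puppet, give_up, base_seconds=60, max_iterations=5,
                     sleep=time.sleep):
    """Retry run_puppet with exponentially growing waits between attempts.

    Returns True as soon as run_puppet() succeeds. After max_iterations
    failed attempts, calls give_up() once and returns False.
    base_seconds and max_iterations are assumed values, not the real ones.
    """
    for attempt in range(max_iterations):
        if run_puppet():
            return True
        # Wait base, 2*base, 4*base, ... but not after the final failure.
        if attempt < max_iterations - 1:
            sleep(base_seconds * (2 ** attempt))
    give_up()
    return False
```

Doubling the wait after each failure is what keeps a large group of broken slaves from hitting the puppet master at a constant rate.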

What is sending them

Each machine that has these emails enabled sends the email itself when it fails to run puppet for the last time, right before it reboots.

The code that sends them is unversioned, but is deployed to the slaves from:

scl-production-puppet:/N/production/darwin10-i386/test/usr/local/bin/run-puppet.sh 

What to do when one is received

  • Either ignore the email or find the root of the problem and fix it.

How to silence or acknowledge this alert

This email is a temporary workaround until we get a real puppet client monitoring tool. It will be sent each time the maximum number of retries is reached, which is every couple of hours.

Future plans

We would really like to replace these emails with real puppet monitoring.

How to best filter these emails

These emails are best filtered by matching "Humpty Dumpty Error" in their subject. Because the hostname on the slave might not always be correct, filtering on domain names might not catch all cases.


Sample

Why we get them

Give a brief explanation of what this email is for, what it helps us do, and why it should be watched.

What is sending them

Include a link to the source of the program sending the email. Include information on which hosts are sending the email, and explain how the program runs. Is it a daemon? Does it have an init script? Do you run it under screen?

What to do when one is received

  • if the title contains "[scl-production-puppet-new] <slavename> is waiting to be signed", this is for information and requires no immediate action
  • if the title contains "[scl-production-puppet-new] <slavename> has invalid cert", the script will try once to clean the cert before sending the email. If this is successful, you'll see a matching "<slavename> is waiting to be signed" email. The key will be automatically signed

How to silence or acknowledge this alert

Include information on how to make the emails stop.

Future plans

Provide any future plans for this email. Is it temporary? Is it going to be replaced by a real dashboard? Are you going to add or change things people filter on?

How to best filter these emails

Provide insight on how to filter these emails. Is there a distinguishing header? Is it always from a specific host, or family of hosts? Is there a distinctive subject?