CIDuty/Dealing With Outages

Communication

Communication is the most important thing during an outage.

  • Talk to sheriffs in IRC to let them know what you're working on and what the status is.
  • Mirror that status into the relevant bugs as appropriate.
  • Coordinate with others who are helping to avoid duplicating work.
    • If you are debugging collaboratively, start a public etherpad to share state.

Problems with stage/ftp

Ganglia is useful for monitoring Mozilla-hosted services:

For problems with stage/ftp, the productdelivery cluster is informative:

Problems with networking

If you're experiencing network issues, particularly between colos (e.g. SCL3<->EC2), join the #netops-alerts IRC channel. Connection details are in https://mana.mozilla.org/wiki/display/SYSADMIN/IRC+use+within+IT

Smokeping can help you visualize outages in real-time:

Investigate network packet loss between Mozilla and Amazon EC2 using mtr.
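
As a rough illustration of the mtr step, the sketch below runs mtr in report mode and lists the hops that show packet loss. It assumes mtr is installed on the machine you run it from and that its report output uses the usual "N.|-- host Loss% Snt ..." layout. The function names, the 1% threshold, and the sample report (which uses documentation-range addresses, not real Mozilla or EC2 hosts) are made up for this example.

```python
import re
import subprocess

# Matches hop lines of `mtr --report` output, for example:
#   "  2.|-- 198.51.100.7     0.0%    10    1.2   1.3   1.1   1.9   0.2"
HOP_RE = re.compile(r"^\s*(\d+)\.\|--\s+(\S+)\s+([\d.]+)%?\s+(\d+)")


def run_mtr(target, cycles=10):
    """Run mtr in report mode against `target` and return its text output."""
    cmd = ["mtr", "--report", "--report-cycles", str(cycles), "--no-dns", target]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_mtr_report(text):
    """Turn mtr report text into a list of per-hop dicts."""
    hops = []
    for line in text.splitlines():
        m = HOP_RE.match(line)
        if m:
            hops.append({
                "hop": int(m.group(1)),
                "host": m.group(2),
                "loss_pct": float(m.group(3)),
                "sent": int(m.group(4)),
            })
    return hops


def lossy_hops(hops, threshold_pct=1.0):
    """Return the hops whose loss is at or above the (assumed) threshold."""
    return [h for h in hops if h["loss_pct"] >= threshold_pct]


# Made-up sample report for illustration only (documentation IP ranges).
SAMPLE_REPORT = """\
HOST: examplehost                 Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.0.2.1                  0.0%    10    0.4   0.4   0.3   0.6   0.1
  2.|-- 198.51.100.7               0.0%    10    1.2   1.3   1.1   1.9   0.2
  3.|-- 203.0.113.9               20.0%    10   35.1  36.0  34.8  40.2   1.7
"""

hops = parse_mtr_report(SAMPLE_REPORT)
for h in lossy_hops(hops):
    print("hop %(hop)d (%(host)s): %(loss_pct).1f%% loss over %(sent)d probes" % h)
```

Against a live host you would call parse_mtr_report(run_mtr("<target>")) instead of using the sample text. Loss that shows up at an intermediate hop but not at the final hop is often just that router rate-limiting its replies, so loss that persists through to the destination is the more telling sign.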