CIDuty/Dealing With Outages
Communication
Communication is the most important thing during an outage.
- Talk to the sheriffs on IRC to let them know what you're working on and what the current status is.
- Mirror that status into the relevant bugs as appropriate.
- Coordinate with others who are helping to avoid duplicating work.
- If debugging collaboratively with others, start a public etherpad to share state.
Problems with stage/ftp
Ganglia is useful for monitoring Mozilla-hosted services:
For problems with stage/ftp, the productdelivery cluster is informative:
Problems with networking
If you're experiencing network issues, particularly between colos (e.g. SCL3<->EC2), join the #netops-alerts IRC channel. Connection details are at https://mana.mozilla.org/wiki/display/SYSADMIN/IRC+use+within+IT
Smokeping can help you visualize outages in real time:
- http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.core1-releng-scl3
- http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-use1
- http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-usw1
- http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-usw2
Investigate network packet loss between Mozilla and Amazon EC2 using mtr.
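As a rough illustration of the mtr step, the sketch below runs mtr in report mode and lists the hops that show packet loss. It assumes mtr is installed on the machine you run it from and that its text report uses the usual "N.|-- host Loss% Snt ..." layout. The function names, the 1% threshold, and the target host are hypothetical choices for this example, not part of any existing RelEng tooling.

```python
import re
import subprocess

# Matches hop lines in mtr's text report, e.g.
#   "  3.|-- 10.0.0.1    12.5%    10    0.3 ..."
# Some mtr versions print 100% loss as "100.0" without the "%" sign.
HOP_RE = re.compile(r"^\s*(\d+)\.\|--\s+(\S+)\s+([\d.]+)%?\s+(\d+)")


def build_mtr_command(host, cycles=10):
    """Build an mtr command line: report mode, fixed cycle count, no DNS lookups."""
    return ["mtr", "--report", "--report-cycles", str(cycles), "--no-dns", host]


def parse_mtr_report(text):
    """Return a list of (hop_number, host, loss_percent) from mtr report text."""
    hops = []
    for line in text.splitlines():
        match = HOP_RE.match(line)
        if match:
            hops.append((int(match.group(1)), match.group(2), float(match.group(3))))
    return hops


def lossy_hops(hops, threshold=1.0):
    """Keep only hops whose loss is at or above the threshold (in percent)."""
    return [hop for hop in hops if hop[2] >= threshold]


def run_mtr(host, cycles=10):
    """Run mtr against host and return the parsed hops (requires mtr on PATH)."""
    result = subprocess.run(
        build_mtr_command(host, cycles),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_mtr_report(result.stdout)
```

For example, `lossy_hops(run_mtr("ec2-host.example.com"))` would list the lossy hops toward a placeholder EC2 host. Loss that shows up at an intermediate hop but not at the final hop is often just ICMP rate limiting on that router, so the last hop's loss figure is usually the one worth reporting to netops.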