|
|
Line 61: |
Line 61: |
| *** for each host under heading "Stopped For A While" | | *** for each host under heading "Stopped For A While" |
| **** follow steps in dealing with [https://wiki.mozilla.org/ReleaseEngineering/How_To/Manage_AWS_slaves#Stopped_For_A_While_Instances stopped for a while instances] | | **** follow steps in dealing with [https://wiki.mozilla.org/ReleaseEngineering/How_To/Manage_AWS_slaves#Stopped_For_A_While_Instances stopped for a while instances] |
|
| |
| == Infrastructure performance ==
| |
| === Pending Jobs ===
| |
| We will sometimes be starved for capacity on one or more platforms. Because there are multiple potential causes, and hence multiple possible paths to resolution, the steps for [[ReleaseEngineering/How_To/Dealing_with_high_pending_counts|dealing with high pending counts]] are on their own page.
| |
|
| |
| === Wait times ===
| |
| Long wait times are often related to the pending jobs described above.
| |
|
| |
| Releng has made a commitment to developers that 95% or more of their jobs will start within 15 minutes of submission.
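The 95%-within-15-minutes commitment can be checked with simple arithmetic. The following is a minimal sketch, not the code behind the wait times report: the `(submitted, started)` record layout, the function names, and the sample data are all assumptions made for illustration, and a wait of exactly 15 minutes is assumed to count as on time.

```python
from datetime import datetime, timedelta

# Hypothetical job records: (submitted, started) timestamp pairs.
# The real wait times report has its own format; this is illustration only.

def fraction_started_within(jobs, limit=timedelta(minutes=15)):
    """Return the fraction of jobs whose wait (started - submitted) is <= limit."""
    if not jobs:
        return 1.0  # no jobs, so nothing missed the target
    on_time = sum(1 for submitted, started in jobs if started - submitted <= limit)
    return on_time / len(jobs)

def meets_commitment(jobs, target=0.95):
    """True if at least `target` of the jobs started within 15 minutes."""
    return fraction_started_within(jobs) >= target

t0 = datetime(2014, 1, 1, 12, 0)
sample_jobs = [
    (t0, t0 + timedelta(minutes=2)),
    (t0, t0 + timedelta(minutes=14)),
    (t0, t0 + timedelta(minutes=15)),
    (t0, t0 + timedelta(minutes=40)),  # the only job that waited too long
]

print(fraction_started_within(sample_jobs))  # 0.75
print(meets_commitment(sample_jobs))         # False
```

In this made-up sample, three of four jobs start on time, so the pool is at 75% and misses the target.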
| |
|
| |
| Build and Try (Build) slave pools have greater capacity (and can expand into AWS as required for linux/mobile/b2g), so they usually meet the 95% target unless there is an outage.
| |
|
| |
| Many Test jobs are triggered per build/try job, and the current slave pool is finite, so it is rare for us to meet our turnaround commitment for test jobs.
| |
|
| |
| Fixing errant test slaves is hence more important than fixing build slaves. See '''Slave management''' below.
| |
|
| |
| Wait times are available either from [https://secure.pub.build.mozilla.org/buildapi/reports/waittimes the buildAPI wait times report] or the daily emails that go to dev.tree-management (un-filter them in Zimbra). Respond to any unusually long wait times in email, preferably with a reason.
| |
|
| |
| The wait times emails are sent by crontab entries set up on relengwebadm.private.scl3.mozilla.com under the buildapi user.
| |
|
| |
| === Slave management ===
| |
| Bad slaves can burn builds and hung slaves can cause bad wait times. These slaves need to be rebooted, or handed to IT for recovery. Recovered slaves need to be tracked on their way back to operational status.
| |
|
| |
| The [[ReleaseEngineering/Buildduty/Nagios|Nagios wiki]] has more information about finding problem slaves using Nagios.
| |
|
| |
| See the [[ReleaseEngineering/Buildduty/Slave_Management|Slave Management wiki]] for more information about fixing those slaves.
| |
| === dev.tree-management ===
| |
| Monitor the dev.tree-management newsgroup (by [https://lists.mozilla.org/listinfo/dev-tree-management email] or by [nntp://mozilla.dev.tree-management nntp]).
| |
|
| |
| Watch for long-running builds that are holding on to a slave, i.e. builds that have been running for more than 1 day.
| |
|
| |
| See the [https://secure.pub.build.mozilla.org/buildapi/running buildAPI list of running builds].
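The more-than-1-day check can be sketched as a simple filter. This is a rough illustration only: the dictionary keys (`slave`, `builder`, `start`), the function name, and the sample entries are hypothetical and do not reflect the actual buildAPI output format.

```python
from datetime import datetime, timedelta

def long_running(builds, now, threshold=timedelta(days=1)):
    """Return (slave, builder) pairs for builds running longer than threshold."""
    return [
        (b["slave"], b["builder"])
        for b in builds
        if now - b["start"] > threshold
    ]

now = datetime(2014, 1, 3, 9, 0)
# Hypothetical running builds; names are made up for the example.
running = [
    {"slave": "slave-a", "builder": "builder-1", "start": now - timedelta(hours=3)},
    {"slave": "slave-b", "builder": "builder-2", "start": now - timedelta(days=2)},
]

for slave, builder in long_running(running, now):
    print(slave, builder)  # slave-b builder-2
```

Passing `now` explicitly keeps the check deterministic, which makes it easier to test against a saved snapshot of the running list.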
| |
|
| |
|
| == Others == | | == Others == |