Sheriffing/Job Visibility Policy: Difference between revisions

Split requirements into those for all jobs vs those for tier 1 jobs
Line 1: Line 1:


== Requirements for being shown in the default TBPL view ==
== Requirements for jobs shown in the default Treeherder view ==


This page was created to clarify the requirements that a platform/test-suite has to meet, before its jobs can be shown in the default [https://tbpl.mozilla.org/ TBPL] view for the main development trees. Owners of non-sheriff managed project/disposable repos do not need to meet these requirements before requesting visibility changes.  
Line 8: Line 8:
To propose changes to this policy, please speak to the sheriffs and/or post to [https://lists.mozilla.org/listinfo/dev-platform dev.platform].


=== 1) Has an active owner ===
==== Has an active owner ====
* Who is committed to ensuring the other requirements are met not just initially, but over the long term.
* Who will ensure the new job type is switched off to save resources, should we stop finding it useful in the future.


=== 2) Breakage is expected to be followed by tree closure or backout ===
==== Outputs failures in a TBPL-starrable format ====
* Failures visible in the default view (other than those that are known intermittents/transient), must have their cause backed out in a timely fashion or else the tree closed until diagnosed.
* Why? If tier != 1 jobs were instead made visible in the default view, they would:
** Interfere with ability to sheriff the tree:
*** Indistinguishable from tier-1 failures.
*** Appear in the failure count/cause the tab to glow.
*** Slow down navigation of failures when using keyboard shortcuts.
** Cause extra workload for sheriffs by making them perform initial diagnosis/bug filing & then starring of the failure on every push until it is fixed an indeterminate amount of time later.
** Cause confusion for non-sheriffs using project branches/try-server, as well as on all trees at the weekends when there are no employed sheriffs.
* If your platform/test falls under the category of "someone should just file a bug and it will be investigated by our team later", then it unfortunately does not meet this requirement. From past requests this normally translates to "group X think this job type is important but we want to delegate the task of monitoring it to someone else".
 
=== 3) Runs on mozilla-central and all trees that merge into it ===
* Otherwise job failures when tree X merges into mozilla-central will not be attributable to a single changeset, resulting in either tree closure or backout of the entire merge (see requirement #2).
* When filing the release engineering bug to enable your job on all the required trees, ask to enable it on "mozilla-central based trees" and release engineering will enable it in the default config from which all trunk trees inherit (unless the various tree owners have explicitly opted out). As a rough guide, mozilla-central based trees include mozilla-inbound, fx-team, b2g-inbound as well as many of the other project/disposable repositories.
 
=== 4) Scheduled on every push ===
* Otherwise job failures will not be attributable to a single changeset, resulting in either tree closure or backout of multiple pushes (see requirement #2).
* An exception is made for nightly builds with a virtually equivalent non-nightly variant that is built on every push & for tests run on PGO builds (given that PGO builds take an inordinate amount of time, we still schedule them every 3/6 hours depending on tree, and relatively speaking there are not too many PGO-only test failures).
* Note also that coalescing (buildbot queue collapsing when there is more than one queued job of the exact same tree/type) may mean that not all scheduled jobs actually get run. Whilst coalescing makes sheriffing harder, it's a necessary evil given that automation infrastructure demand frequently outstrips supply.
 
=== 5) Easily run on try server ===
* Otherwise developers who have had their landing backed out for breaking the job type may be unable to debug the failures/test the fix, particularly if they only reproduce on our infrastructure.
* Developers should not be expected to guess try chooser options, so http://trychooser.pub.build.mozilla.org/ should be updated if appropriate.
 
=== 6) Outputs failures in a TBPL-starrable format ===
* It is highly recommended that new test harnesses do not reinvent the wheel and instead use parts of [https://wiki.mozilla.org/Auto-tools/Projects/MozBase mozbase] (eg: mozcrash) if at all possible - speak to the [[Auto-tools|A-Team]] for more info.
* Failures must appear in the TBPL annotated summary (ie: matches the [https://hg.mozilla.org/webtools/tbpl/file/tip/php/inc/GeneralErrorFilter.php log parsing regexp]), otherwise the full log will have to be opened for every failure.
Line 50: Line 26:
* The sheriffs will be happy to advise regarding the above.
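
As a rough illustration of the annotated-summary requirement above, the sketch below filters a log for lines that a failure regexp would surface. The pattern used here is an assumption for demonstration purposes only; the authoritative expression is the one in GeneralErrorFilter.php, and a harness should be checked against that rather than against this example.

```python
import re

# Hypothetical pattern for illustration only; the authoritative expression
# lives in TBPL's GeneralErrorFilter.php and may differ from this one.
FAILURE_PATTERN = re.compile(r"TEST-UNEXPECTED-\w+|PROCESS-CRASH")


def summarize_failures(log_text):
    """Return the log lines that the assumed failure pattern would surface."""
    return [line for line in log_text.splitlines() if FAILURE_PATTERN.search(line)]


sample_log = "\n".join([
    "TEST-PASS | test_ok.html | all good",
    "TEST-UNEXPECTED-FAIL | test_bad.html | expected 1, got 2",
    "random harness chatter",
])
failures = summarize_failures(sample_log)
```

If a harness reports failures in a format that no such pattern matches, the summary comes back empty and the full log has to be opened for every failure, which is the situation this requirement is meant to avoid.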


=== 7) Low intermittent failure rate ===
==== Low intermittent failure rate ====
* A high failure rate:
** Causes unnecessary sheriff workload.
Line 59: Line 35:
* However, sheriffs will make the final determination of whether a job type has too many intermittent failures. This will be based on a combination of factors including failure rate, length of time the failures have been occurring, owner interest in fixing them & whether TBPL is able to make bug suggestions.


=== 8) Must avoid patterns known to cause non-deterministic failures ===
==== Must avoid patterns known to cause non-deterministic failures ====
* Must avoid pulling the tip of external repositories as part of the build - since landings there can cause non-obvious failures. If an external repository is absolutely necessary, instead reference the desired changeset from a manifest in mozilla-central (like talos or gaia do).
* Must not rely on resources outside of the build network:
Line 68: Line 44:
* See the [https://developer.mozilla.org/en-US/docs/Mozilla/QA/Avoiding_intermittent_oranges best practices for avoiding intermittent failures (oranges)].
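
To make the "reference the desired changeset from a manifest" advice concrete, here is a minimal sketch of a build step that refuses to use a floating revision. The JSON layout, the key names <code>repo</code> and <code>revision</code>, and the example URL are all hypothetical; the manifests actually used for talos or gaia may look quite different.

```python
import json

# Revisions that float with new landings and therefore must not be used.
FLOATING_REVISIONS = {"", "tip", "default"}


def pinned_revision(manifest_text):
    """Return (repo, revision) from a hypothetical JSON manifest.

    Raises ValueError if the manifest does not pin a specific changeset.
    """
    manifest = json.loads(manifest_text)
    revision = manifest.get("revision", "")
    if revision in FLOATING_REVISIONS:
        raise ValueError("external repository must be pinned to a specific changeset")
    return manifest["repo"], revision


# Hypothetical manifest content, as it might be checked in to mozilla-central.
example_manifest = '{"repo": "https://example.invalid/external-repo", "revision": "0123456789ab"}'
repo, revision = pinned_revision(example_manifest)
```

Because the manifest lives in mozilla-central, updating the external dependency becomes an ordinary push that can be backed out like any other change.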


=== 9) Supports the disabling of individual tests ===
==== Has sufficient documentation ====
* It must be possible for sheriffs to disable an individual test per platform or entirely, by either annotating the test or editing a manifest/moz.build/Makefile in the relevant gecko repository. (See also requirement #10).
 
=== 10) Has sufficient documentation ===
* Has a wiki page with:
** An overview of the test-suite.
Line 80: Line 53:
* That wiki page is linked to from https://developer.mozilla.org/docs/Mozilla/QA/Automated_testing


=== 11) Easy for a dev to run locally ===
==== Crashes should produce valid crash stacks ====
* The crash reporter should be enabled, mini-dumps processed correctly (ie: with symbols available) & the resultant crash stack output in the job log.
 
==== Logs should not exceed the max log size ====
* Logs for both successful and failed runs should not exceed the log size limit, which in the case of buildbot is 50MB for all buildsteps including those outside of mozharness.
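
A job owner could guard against oversized logs with a check along the following lines. This is only a sketch: the 50MB figure comes from the requirement above, but treating "MB" as 1024 * 1024 bytes and the function name <code>log_within_limit</code> are assumptions, not part of buildbot or mozharness.

```python
import os
import tempfile

# 50MB limit taken from the policy text; interpreting "MB" as
# 1024 * 1024 bytes is an assumption.
MAX_LOG_BYTES = 50 * 1024 * 1024


def log_within_limit(path, limit=MAX_LOG_BYTES):
    """Return True if the log file at `path` is no larger than `limit` bytes."""
    return os.path.getsize(path) <= limit


# Demonstrate with a small temporary file standing in for a job log.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"x" * 1024)
    small_log = handle.name

small_ok = log_within_limit(small_log)
tiny_limit_ok = log_within_limit(small_log, limit=512)
os.remove(small_log)
```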
 
== Additional requirements for tier 1 jobs ==
 
Tier 1 jobs are those that ... #todo (also reference https://developer.mozilla.org/en-US/docs/Supported_build_configurations). In addition to the requirements above, tier 1 jobs must also meet the following.
 
==== Breakage is expected to be followed by tree closure or backout ====
* Failures visible in the default view (other than those that are known intermittents/transient), must have their cause backed out in a timely fashion or else the tree closed until diagnosed.
* Why? If tier != 1 jobs were instead made visible in the default view, they would:
** Interfere with ability to sheriff the tree:
*** Indistinguishable from tier-1 failures.
*** Appear in the failure count/cause the tab to glow.
*** Slow down navigation of failures when using keyboard shortcuts.
** Cause extra workload for sheriffs by making them perform initial diagnosis/bug filing & then starring of the failure on every push until it is fixed an indeterminate amount of time later.
** Cause confusion for non-sheriffs using project branches/try-server, as well as on all trees at the weekends when there are no employed sheriffs.
* If your platform/test falls under the category of "someone should just file a bug and it will be investigated by our team later", then it unfortunately does not meet this requirement. From past requests this normally translates to "group X think this job type is important but we want to delegate the task of monitoring it to someone else".
 
==== Runs on mozilla-central and all trees that merge into it ====
* Otherwise job failures when tree X merges into mozilla-central will not be attributable to a single changeset, resulting in either tree closure or backout of the entire merge (see "Breakage is expected to be followed by tree closure or backout").
* When filing the release engineering bug to enable your job on all the required trees, ask to enable it on "mozilla-central based trees" and release engineering will enable it in the default config from which all trunk trees inherit (unless the various tree owners have explicitly opted out). As a rough guide, mozilla-central based trees include mozilla-inbound, fx-team, b2g-inbound as well as many of the other project/disposable repositories.
 
==== Scheduled on every push ====
* Otherwise job failures will not be attributable to a single changeset, resulting in either tree closure or backout of multiple pushes (see "Breakage is expected to be followed by tree closure or backout").
* An exception is made for nightly builds with a virtually equivalent non-nightly variant that is built on every push & for tests run on PGO builds (given that PGO builds take an inordinate amount of time, we still schedule them every 3/6 hours depending on tree, and relatively speaking there are not too many PGO-only test failures).
* Note also that coalescing (buildbot queue collapsing when there is more than one queued job of the exact same tree/type) may mean that not all scheduled jobs actually get run. Whilst coalescing makes sheriffing harder, it's a necessary evil given that automation infrastructure demand frequently outstrips supply.
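
The effect of coalescing described above can be sketched as follows. This is not buildbot's implementation: the tuple layout, the push ids, and the choice to keep the newest queued job per tree/type are assumptions made purely to show why a failure may no longer map to a single push.

```python
from collections import OrderedDict


def coalesce(pending_jobs):
    """Keep only the newest pending job per (tree, job_type) pair.

    pending_jobs is a list of (tree, job_type, push_id) tuples in queue
    order, oldest first. Returns (jobs_to_run, skipped_jobs).
    """
    newest = OrderedDict()
    skipped = []
    for job in pending_jobs:
        key = (job[0], job[1])
        if key in newest:
            # An older job of the same tree/type is superseded.
            skipped.append(newest[key])
        newest[key] = job
    return list(newest.values()), skipped


queue = [
    ("mozilla-inbound", "mochitest-1", 101),
    ("mozilla-inbound", "mochitest-1", 102),
    ("fx-team", "mochitest-1", 201),
    ("mozilla-inbound", "mochitest-1", 103),
]
to_run, skipped = coalesce(queue)
```

In this example only push 103 is tested on mozilla-inbound, so a failure there could have been introduced by push 101, 102 or 103, and the intermediate jobs would need to be retriggered to find out which.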
 
==== Easily run on try server ====
* Otherwise developers who have had their landing backed out for breaking the job type may be unable to debug the failures/test the fix, particularly if they only reproduce on our infrastructure.
* Developers should not be expected to guess try chooser options, so http://trychooser.pub.build.mozilla.org/ should be updated if appropriate.
 
== Optional, but helpful ==
 
==== Easy for a dev to run locally ====
* Supported by mach (if appropriate).
* Ideally part of mozilla-central (legacy exceptions being Talos, gaia).


=== 12) Crashes should produce valid crash stacks ===
==== Supports the disabling of individual tests ====
* The crash reporter should be enabled, mini-dumps processed correctly (ie: with symbols available) & the resultant crash stack output in the job log.
* It must be possible for sheriffs to disable an individual test per platform or entirely, by either annotating the test or editing a manifest/moz.build/Makefile in the relevant gecko repository.
 
=== 13) Logs should not exceed the max log size ===
* Logs for both successful and failed runs should not exceed the log size limit, which in the case of buildbot is 50MB for all buildsteps including those outside of mozharness.


== Requesting changes in visibility ==
Line 102: Line 108:
== The future ==
* Planned improvements to our tooling will likely mean that some of these requirements can be relaxed in the future, as well as making it easier for maintainers of non-default-view job types to track their success/failure without having to monitor TBPL continuously.
* Planned features for the successor to TBPL ([[Auto-tools/Projects/TBPL2]]) include:
* Planned features for the successor to TBPL ([[Auto-tools/Projects/Treeherder]]) include:
** Multiple dashboards/views for different use-cases/teams (giving us more flexibility than just "default view" or "&showall=1").
** Opt-in notifications (email, IRC, dashboard, ...?) of failures for desired job types (see proposal in {{bug|851061}}).
* [[Auto-tools/Projects/Bisect_in_the_cloud]] will allow sheriffs to more easily narrow regression ranges for job types that do not run on every push, making it more viable to accept them into certain views/dashboards.