Sheriffing/Job Visibility Policy: Difference between revisions

→‎Has sufficient documentation: Updated automated testing link to a Firefox source docs page
(→‎4) Scheduled on every push: Clarify PGO reasoning)
(→‎Has sufficient documentation: Updated automated testing link to a Firefox source docs page)
 
(48 intermediate revisions by 9 users not shown)
This page exists to clarify the policy towards how jobs reporting to [https://treeherder.mozilla.org/ Treeherder] are managed. Common sense will apply in cases where some of the requirements are not applicable for a particular platform/build/test type.


== Requirements for being shown in the default TBPL view ==
To propose changes to this policy, please speak to the sheriffs and/or send a message to the [https://groups.google.com/a/mozilla.org/g/sheriffs sheriffs group].


This page was created to clarify the existing requirements that a platform/test-suite has to meet, before its jobs can be shown in the default [https://tbpl.mozilla.org/ TBPL] view. To propose changes to this policy, please speak to the sheriffs and/or post to [https://lists.mozilla.org/listinfo/dev-platform dev.platform].
== Overview of the Job Visibility Tiers ==
Jobs reporting to Treeherder can fall into three tiers.
* <b>Tier 1:</b> Jobs that run on a [https://firefox-source-docs.mozilla.org/build/buildsystem/supported-configurations.html Tier-1 platform], are shown by default on Treeherder, and are sheriff-managed. Bustage will cause a tree closure and is expected to result in a quick follow-up push or a backout (at the discretion of the sheriff on duty). Bugs will be filed for new intermittent test failures and are subject to the [https://wiki.mozilla.org/Sheriffing/Test_Disabling_Policy Test Disabling Policy] if not addressed in a timely fashion.
* <b>Tier 2:</b> Jobs are shown by default on Treeherder and sheriffs will create bugs for new failures, even if those are permanent. New test failures/bustage will not result in a backout, but a tracking bug will be filed when observed. These new issues are expected to be fixed in 2 business days.
* <b>Tier 3:</b> Jobs are not shown by default on Treeherder. All responsibilities for monitoring the results will fall upon the owner of the job.


=== 1) Has an active owner ===
== Requirements for jobs shown in the default Treeherder view ==
The below section applies to <b>both Tier 1 and Tier 2 jobs</b>. Owners of non-sheriff managed project/disposable repos do not need to meet these requirements. However, they must be satisfied prior to being enabled in production.
 
==== Has an active owner ====
* Who is committed to ensuring the other requirements are met not just initially, but over the long term.
* Who will ensure the new job type is switched off to save resources, should we stop finding it useful in the future.
* Who will ensure the new job type is switched off to save resources should we stop finding it useful in the future.
 
==== Usable job logs ====
* Full logs should be available for both successful and failed runs in either raw or structured formats.
* The crash reporter should be enabled, mini-dumps processed correctly (ie: with symbols available) & the resultant valid crash stack visible in the log (it is recommended to use mozcrash to avoid reinventing the wheel).
* Failures must appear in the Treeherder failure summary in order to avoid having to open the full log for every failure.
* Failure output must be in the format expected by Treeherder's [https://github.com/mozilla/treeherder-service/blob/master/treeherder/model/error_summary.py bug suggestion generator] (otherwise sheriffs have to manually search Bugzilla when classifying/annotating intermittent failures):
** For in-tree/product issues (eg: test failures, crashes):
*** Delimiter: ' | '
*** 1st token: One of {TEST-UNEXPECTED-FAIL, TEST-UNEXPECTED-PASS, PROCESS-CRASH}.
*** 2nd token: A unique test name/filepath (not a generic test loader that runs 100s of other test files, since otherwise bug suggestions will return too many results).
*** 3rd token: The specific failure message (eg: the test part that failed, the top frame of a crash or the leaked objects list for a leak).
** For non test-specific issues (eg: infra/automation/harness):
*** Treeherder falls back to searching Bugzilla for the entire failure line (excluding mozharness logging prefix), so it should be both unique to that failure type & repeatable (ie: no use of process IDs or timestamps, for which there will rarely be a repeat match against a bug summary).
** Exceptions & timeouts must be handled with appropriate log output (eg: the failure line must state in which test the timeout occurred, not just that the entire run has timed out).
** [https://firefox-source-docs.mozilla.org/mozbase/mozlog.html Documentation for the mozlog library]
* The sheriffs will be happy to advise regarding the above.
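The three-token failure line format described above can be illustrated with a minimal Python sketch. This is not Treeherder's actual parser: the helper names <code>format_failure_line</code> and <code>parse_failure_line</code> and the example test path are hypothetical, and the sketch assumes any mozharness logging prefix has already been stripped from the line.

```python
# Illustrative only: hypothetical helpers, not part of Treeherder or mozlog.
FAILURE_STATUSES = {"TEST-UNEXPECTED-FAIL", "TEST-UNEXPECTED-PASS", "PROCESS-CRASH"}
DELIMITER = " | "


def format_failure_line(status, test, message):
    """Build a 'status | test | message' line, rejecting unknown status tokens."""
    if status not in FAILURE_STATUSES:
        raise ValueError("unexpected status token: %r" % status)
    return DELIMITER.join((status, test, message))


def parse_failure_line(line):
    """Return (status, test, message), or None if the line is not in that format."""
    # maxsplit=2 keeps any further ' | ' sequences inside the failure message.
    parts = line.strip().split(DELIMITER, 2)
    if len(parts) != 3 or parts[0] not in FAILURE_STATUSES:
        return None
    return tuple(parts)


line = format_failure_line(
    "TEST-UNEXPECTED-FAIL",
    "dom/tests/test_example.html",  # hypothetical test path
    "Test timed out",
)
print(line)
print(parse_failure_line(line))
```

Note that the second token is a specific test file rather than a generic loader, in line with the requirement above that bug suggestions must not return too many results.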
 
==== Has sufficient documentation ====
* Has a wiki page with:
** An overview of the test-suite.
** Instructions for running locally.
** How to disable an individual failing test.
** The current owner/who to contact for help.
** The Bugzilla product/component where bugs should be filed (GitHub issues are not discoverable enough and prevent the use of bug dependencies within the rest of the project).
* That wiki page is linked to from https://firefox-source-docs.mozilla.org/testing/automated-testing/index.html
 
== Additional requirements for Tier 1 jobs ==


=== 2) Breakage is expected to be followed by tree closure or backout ===
==== Breakage is expected to be followed by tree closure or backout ====
* Failures visible in the default view (other than those that are known intermittents/transient), must have their cause backed out in a timely fashion or else the tree closed until diagnosed.
* Why? If tier != 1 jobs were instead made visible in the default view, they would:
* Sheriffs will generally ping in [https://chat.mozilla.org/#/room/#developers:mozilla.org #developers on chat.mozilla.org] when such a situation arises. If sufficient time passes without acknowledgement (typically ~5min), the regressing patch(es) will be backed out in order to minimize the length of the closure for other developers.
** Interfere with ability to sheriff the tree:
* If acknowledged, sheriffs will decide in conjunction with the developer whether backing out or fixing in-place is the most reasonable resolution. The sheriff maintains the right to back out if necessary, however.
*** Indistinguishable from tier-1 failures.
*** Appear in the failure count/cause the tab to glow.
*** Slow down navigation of failures when using keyboard shortcuts.
** Cause extra workload for sheriffs by making them perform initial diagnosis/bug filing & then starring of the failure on every push until it is fixed an indeterminate amount of time later.
** Cause confusion for non-sheriffs using project branches/try-server, as well as on all trees at the weekends when there are no employed sheriffs.
* If your platform/test falls under the category of "someone should just file a bug and it will be investigated by our team later", then it unfortunately does not meet this requirement. From past requests this normally translates to "group X think this job type is important but we want to delegate the task of monitoring it to someone else".


=== 3) Runs on all trees that merge into mozilla-central ===
==== Runs on mozilla-central and autoland ====
* Otherwise job failures when tree X merges into mozilla-central will not be attributable to a single changeset, resulting in either tree closure or backout of the entire merge (see requirement #2).
* Necessary because job failures when autoland merges into mozilla-central will not be attributable to a single changeset, resulting in either tree closure or backout of the entire merge (see the previous requirement).


=== 4) Scheduled on every push ===
==== Scheduled on every push ====
* Otherwise job failures will not be attributable to a single changeset, resulting in either tree closure or backout of multiple pushes (see requirement #2).
* An exception is made for nightly builds with a virtually equivalent non-nightly variant that is built on every push & for tests run on PGO builds (given that PGO builds take an inordinate amount of time, we still schedule them every 3/6 hours depending on tree, and relatively speaking there are not too many PGO-only test failures).
* An exception is made for nightly builds with a virtually equivalent non-nightly variant that is built on every push & for tests run on shippable builds (relatively speaking there are not too many shippable-only test failures). Periodic builds have also been granted an exception as they don't run tests and have sufficient coverage on other platforms such that the odds of unique bustage are small and relatively easy to diagnose.
* Note also that coalescing (buildbot queue collapsing when there is more than one queued job) may mean that not all scheduled jobs actually get run. Whilst coalescing makes sheriffing harder, it's a necessary evil given that automation infrastructure demand frequently outstrips supply.
* Note also that scheduling optimization may mean that not all scheduled jobs actually get run. Whilst such coalescing makes sheriffing harder, it is a necessary measure to keep automation infrastructure demand at reasonable levels.
 
=== 5) Easily run on try server ===
* Otherwise developers who have had their landing backed out for breaking the job type will be unable to easily debug/fix the failures, particularly if they only reproduce on our infrastructure.
* Developers should not be expected to guess try chooser options, so http://trychooser.pub.build.mozilla.org/ must have been updated.


=== 6) Outputs failures in a TBPL-starrable format ===
==== Must avoid patterns known to cause non-deterministic failures ====
* Failures must appear in the TBPL annotated summary (ie: use the standard TEST-UNEXPECTED-{FAIL,PASS}, PROCESS-CRASH, ... format), otherwise sheriffs & devs have to open the full logs.
* Must avoid pulling the tip of external repositories or their latest release as part of the build - since landings there can cause non-obvious failures. If an external repository/dependency is absolutely necessary, instead reference the desired changeset or version from a manifest in mozilla-central.
* Failures must output the test names correctly, so TBPL can perform the BzAPI intermittent-failure searches for bug suggestions.
* Must not rely on resources from sites whose content we do not control/have no SLA:
* Exceptions & timeouts must be caught and handled with a TBPL compatible failure message.
** Since these will cause failures when the external site is unavailable, as well as impacting end to end times & adding noise to performance tests.
* The sheriffs will be happy to help advise how to meet this requirement.
** eg: Emulator/driver binaries direct from a vendor's site, package downloads from PyPI or page assets for unit/performance tests.
** Ensure MOZ_DISABLE_NONLOCAL_CONNECTIONS is defined in the automation environment (see {{bug|995417}}) & use a list of automation prefs for switching off undesirable behavior (e.g. automatic updates, telemetry pings; see {{bug|1023483}} for where these are set).
* Must not contain time bombs, e.g. tests that will fail after a certain date or when run at certain times (e.g., the day summer time starts or ends, or when the test starts before midnight and finishes after midnight).
* See the [https://developer.mozilla.org/en-US/docs/Mozilla/QA/Avoiding_intermittent_oranges best practices for avoiding intermittent failures (oranges)].
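As an illustration of the "no time bombs" requirement above, one common approach is to pass the current date into the code under test instead of reading the wall clock, so the result is the same on any day and at any time. The following Python sketch assumes a hypothetical helper <code>days_until_expiry</code>; it does not come from any Mozilla test harness.

```python
from datetime import date


def days_until_expiry(expiry, today):
    """Return the number of days from 'today' until 'expiry' (negative if past).

    'today' is an explicit argument, so callers in tests can pin it to a
    fixed date instead of depending on date.today().
    """
    return (expiry - today).days


def test_days_until_expiry():
    # A fixed reference date: the assertions hold regardless of when the test runs.
    fixed_today = date(2020, 1, 1)
    assert days_until_expiry(date(2020, 1, 31), fixed_today) == 30
    assert days_until_expiry(date(2019, 12, 31), fixed_today) == -1


test_days_until_expiry()
```

A test written against <code>date.today()</code> with a hard-coded expiry date would pass until that date and then fail permanently, which is the pattern this requirement is meant to rule out.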


=== 7) Per job intermittent failure rate of less than 5% ===
==== Low intermittent failure rate ====
* A high failure rate:
** Causes unnecessary sheriff workload.
** Affects the ability to sheriff the trees as a whole, particularly during times of heavy coalescing.
** Undermines devs' confidence in the platform/test-suite - which as demonstrated by Firefox for Android, permanently affects their willingness to believe any future failures, even once the intermittent-failure rate is lowered.
** Undermines confidence in the platform/test-suite - which permanently affects developers' willingness to believe any future failures, even once the intermittent-failure rate is lowered.
* A mozilla-central push results in ~400 jobs. A 5% failure rate would mean 20 failures on that push - ie: an OrangeFactor of 20. The typical OrangeFactor across all trunk trees is normally 3-5, so a 5% failure rate is extremely generous.
* A mozilla-central push results in 4000-10000 jobs. The typical intermittent failure rate (OrangeFactor) across all trunk trees is normally 2-4%.
 
* Therefore as a rough guide a new platform/test suite must have at most a 5% per job failure rate initially, and ideally <1% longer term.
=== 8) Supports the disabling of individual tests ===
* However, sheriffs will make the final determination of whether a job type has too many intermittent failures. This will be based on a combination of factors including failure rate, length of time the failures have been occurring, owner interest in fixing them & whether Treeherder is able to make bug suggestions.
* It must be possible for sheriffs to disable an individual test per platform or entirely, by either annotating the test or editing a manifest/moz.build/Makefile. (See also requirement #10).
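The arithmetic behind the failure-rate guideline (~400 jobs at a 5% failure rate giving 20 failures per push) can be sketched in Python as follows. The function names are hypothetical and the 5% default simply mirrors the rough guide stated in the failure-rate requirement; as noted there, sheriffs make the final determination based on several other factors.

```python
def expected_failures(jobs_per_push, failure_percent):
    """Expected number of failed jobs on one push for a given failure percentage."""
    return jobs_per_push * failure_percent / 100


def meets_initial_guideline(failed_runs, total_runs, max_percent=5):
    """True if the observed failure rate is at most max_percent.

    Uses integer arithmetic (cross-multiplication) to avoid rounding issues.
    """
    return failed_runs * 100 <= max_percent * total_runs


print(expected_failures(400, 5))         # 20.0, matching the example figures
print(meets_initial_guideline(20, 400))  # True: exactly 5%
print(meets_initial_guideline(21, 400))  # False: just above 5%
```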


=== 9) Must not rely on resources outside of the build network ===
==== Easily run on try server ====
* Since these will cause failures when the external site is unavailable, as well as impacting end to end times & adding noise to performance tests.
* Needed so that developers who have had their landing backed out for breaking the job type are able to debug the failures/test the fix, particularly if they only reproduce on our infrastructure.
* eg: Emulator/driver binaries direct from a vendor's site, package downloads from PyPI or page assets for unit/performance tests.
* The job should be visible to |./mach try fuzzy| and |./mach try chooser| without having to use the --full option.


=== 10) Has sufficient documentation ===
== Optional, but helpful ==
* Has a wiki page with:
** An overview of the test-suite.
** Instructions for running locally.
** How to disable an individual failing test.
** The current owner/who to contact for help.
* That wiki page is linked to from https://developer.mozilla.org/docs/Mozilla/QA/Automated_testing


=== 11) Easy for a dev to run locally ===
==== Easy for a dev to run locally ====
* Is supported by mach.
* Supported by mach (if appropriate).
* Ideally part of mozilla-central (legacy exception being Talos).
* Ideally part of mozilla-central (legacy exceptions being Talos).


==== Supports the disabling of individual tests ====
* It must be possible for sheriffs to disable an individual test per platform or entirely, by either annotating the test or editing a manifest in the relevant gecko repository.


== Requesting changes in visibility ==
* Please file a bug using this [https://bugzilla.mozilla.org/enter_bug.cgi?cc=emorley%40mozilla.com&cc=ryanvm%40gmail.com&cc=philringnalda%40gmail.com&comment=Job%2Fplatform%20ZZZZZZ%20now%20meets%20the%20requirements%20at%3A%0D%0Ahttps%3A%2F%2Fwiki.mozilla.org%2FSheriffing%2FJob_Visibility_Policy%0D%0A%0D%0APlease%20may%20it%20be%20unhidden&component=Tinderboxpushlog&form_name=enter_bug&op_sys=All&product=Webtools&rep_platform=All&short_desc=Please%20unhide%20job%2Fplatform%20ZZZZZZ%20on%20TBPL&version=Trunk template], so that changes in visibility are more discoverable (vs IRC or asking as part of a bug in another product/component).
* Jobs that are marked as tier 3 will be hidden in Treeherder by default.
* To adjust the tier for a Taskcluster job, use a bug either in the Firefox Build System :: Task Configuration component, or else a component related to the type of task being adjusted, then edit the in-tree task definition.
* CC :sheriffs when adjusting a job's tier, so they are aware of the change and can confirm the criteria have been met.


== My platform/test-suite does not meet the requirements, what now? ==
== Adding a new test task or a new test platform? ==
* Your platform/test-suite will still be run, just not shown on the default view. This model has worked well for many projects/build types (eg jetpack, xulrunner, spidermonkey).
* Be sure to demonstrate an acceptable intermittent failure rate for your new test tasks on try, and include the try links in the bug which adds the new tasks. Usually that means repeating each new test task at least 10 times (try: --rebuild 10).  
* To see it, append '&showall=1' to the URL ({{bug|748833}} will add a checkbox for this to the TBPL UI).
* For each known intermittent failure, check the expected frequency from recent comments in the bug, or by looking up the failure in Treeherder's Intermittent Failures view; if you see higher failure rates in your try push, consider fixing or disabling the test(s) before enabling your new task(s).
* To filter the jobs displayed, under the 'Filters' menu use the 'job name' field (which supports regex).
* eg: to see both ASan & Valgrind jobs on mozilla-central (neither of which are shown by default), use: [https://tbpl.mozilla.org/?showall=1&jobname=(asan|valgrind) https://tbpl.mozilla.org/?showall=1&jobname=(asan|valgrind)]


== The future ==
== My platform/test-suite does not meet the base requirements, what now? ==
* Planned improvements to our tooling will likely mean that some of these requirements can be relaxed in the future, as well as making it easier for maintainers of non-default-view job types to track their success/failure without having to monitor TBPL continuously.
* Your platform/test-suite will still be run, just not shown on the default view. This model has worked well for many projects/build types (e.g. spidermonkey).
* The successor to TBPL ([[Auto-tools/Projects/TBPL2]]) will support:
* To see it, click the "3" button to the left of the quick filter input field in the second toolbar of the Treeherder UI.
** Multiple dashboards/views for different use cases/teams (giving us more flexibility than just "default-view" or "&showall=1").
* To filter the jobs displayed, under the 'Filters' menu use the 'job name' field.
** Opt-in notifications (email, IRC, dashboard, ...?) of failures for desired job types (see proposal in {{bug|851061}}).
* For Try specifically, you can request that the job type be made non-default (i.e. requires explicit opt-in when the tasks to run get selected), in order to be shown in the default view on Try - see <code >UNCOMMON_TRY_TASK_LABELS</code> in [taskcluster/taskgraph/target_tasks.py target_tasks.py].
* [[Auto-tools/Projects/Bisect_in_the_cloud]] will allow sheriffs to more easily narrow regression ranges for job types that do not run on every push, making it more viable to accept them into certain views/dashboards.