Sheriffing/Job Visibility Policy: Difference between revisions


Revision as of 17:41, 9 October 2014

Requirements for jobs shown in the default Treeherder view

This page was created to clarify the requirements that a platform/test-suite has to meet, before its jobs can be shown in the default Treeherder view for the main development trees. Owners of non-sheriff managed project/disposable repos do not need to meet these requirements before requesting visibility changes.

Common sense will apply in cases where some of the requirements are not applicable for a particular platform/build/test type.

To propose changes to this policy, please speak to the sheriffs and/or post to dev.platform.

Has an active owner

  • Who is committed to ensuring the other requirements are met not just initially, but over the long term.
  • Who will ensure the new job type is switched off to save resources, should we stop finding it useful in the future.

Usable job logs

  • Logs for both successful and failed runs should not exceed the log size limit, which in the case of buildbot is 50MB for all buildsteps including those outside of mozharness.
  • The crash reporter should be enabled, mini-dumps processed correctly (ie: with symbols available) & the resultant valid crash stack output in the job log.
  • It is highly recommended that new test harnesses do not reinvent the wheel and instead use parts of mozbase (eg: mozcrash) if at all possible - speak to the A-Team for more info.
  • Failures must appear in the TBPL annotated summary (ie: matches the log parsing regexp), otherwise the full log will have to be opened for every failure.
  • Failure output must be in the format expected by TBPL's bug suggestion generator (otherwise sheriffs have to manually search Bugzilla when starring intermittent failures):
    • For in-tree/product issues (eg: test failures, crashes):
      • Delimiter: ' | '
      • 1st token: One of {TEST-UNEXPECTED-FAIL, TEST-UNEXPECTED-PASS, PROCESS-CRASH}.
      • 2nd token: A unique test name/filepath (not a generic test loader that runs 100s of other test files, since otherwise bug suggestions will return too many results).
      • 3rd token: The specific failure message (eg: the test part that failed, the top frame of a crash or the leaked objects list for a leak).
    • For non test-specific issues (eg: infra/automation/harness):
      • TBPL falls back to searching Bugzilla for the entire failure line (excluding mozharness logging prefix), so it should be both unique to that failure type & repeatable (ie: no use of process IDs or timestamps, for which there will rarely be a repeat match against a bug summary).
    • Exceptions & timeouts must be handled with appropriate log output (eg: the failure line must state in which test the timeout occurred, not just that the entire run has timed out).
  • The sheriffs will be happy to advise regarding the above.
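As an illustration of the three-token format described above, here is a minimal sketch of a checker for in-tree failure lines. It is not TBPL's actual parser: the names `FailureLine` and `parse_failure_line` are hypothetical, and the sketch assumes any mozharness logging prefix has already been stripped from the line.

```python
from typing import NamedTuple, Optional

# Delimiter and status tokens as listed in the policy above.
DELIMITER = " | "
KNOWN_STATUSES = {"TEST-UNEXPECTED-FAIL", "TEST-UNEXPECTED-PASS", "PROCESS-CRASH"}


class FailureLine(NamedTuple):
    """Hypothetical container for the three tokens of a failure line."""
    status: str
    test: str
    message: str


def parse_failure_line(line: str) -> Optional[FailureLine]:
    """Return the three tokens, or None if the line does not fit the format.

    Assumes the logging prefix (if any) was removed beforehand.
    """
    # Split at most twice so that any ' | ' inside the message is preserved.
    parts = line.strip().split(DELIMITER, 2)
    if len(parts) != 3:
        return None
    status, test, message = parts
    if status not in KNOWN_STATUSES or not test or not message:
        return None
    return FailureLine(status, test, message)
```

A harness author could run something like this over sample log output to confirm that each failure names a specific test and a specific message before asking for the job to be made visible.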

Low intermittent failure rate

  • A high failure rate:
    • Causes unnecessary sheriff workload.
    • Affects the ability to sheriff the trees as a whole, particularly during times of heavy coalescing.
    • Undermines devs' confidence in the platform/test-suite, which, as demonstrated by Firefox for Android, permanently affects their willingness to believe any future failures, even once the intermittent-failure rate is lowered.
  • A mozilla-central push results in ~400 jobs. The typical OrangeFactor across all trunk trees is normally (excluding the recent spike) 3-4, ie: a failure rate of ~1%.
  • Therefore as a rough guide a new platform/testsuite must have at most a 5% per job failure rate initially, and ideally <1% longer term.
  • However, sheriffs will make the final determination of whether a job type has too many intermittent failures. This will be based on a combination of factors including failure rate, length of time the failures have been occurring, owner interest in fixing them & whether TBPL is able to make bug suggestions.
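The arithmetic behind the figures above can be sketched as follows. This assumes OrangeFactor is read as the average number of intermittent failures per push, which is consistent with the numbers quoted; the function names are hypothetical.

```python
# Figures quoted in the policy: ~400 jobs per mozilla-central push.
JOBS_PER_PUSH = 400


def per_job_failure_rate(orange_factor: float, jobs_per_push: int = JOBS_PER_PUSH) -> float:
    """Average per-job failure rate implied by a given OrangeFactor.

    Assumption: OrangeFactor = average failures per push.
    """
    return orange_factor / jobs_per_push


def pushes_per_failure(rate: float) -> float:
    """Roughly how many pushes pass between failures of ONE job type."""
    return 1.0 / rate


if __name__ == "__main__":
    for orange_factor in (3, 4):
        rate = per_job_failure_rate(orange_factor)
        print(f"OrangeFactor {orange_factor}: {rate:.2%} per job")
    # A new job type at the initial 5% ceiling fails about once every 20 pushes;
    # at the longer-term 1% target, about once every 100 pushes.
    for rate in (0.05, 0.01):
        print(f"{rate:.0%} per job: one failure every ~{pushes_per_failure(rate):.0f} pushes")
```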


Must avoid patterns known to cause non deterministic failures

  • Must avoid pulling the tip of external repositories as part of the build - since landings there can cause non-obvious failures. If an external repository is absolutely necessary, instead reference the desired changeset from a manifest in mozilla-central (like talos or gaia do).
  • Must not rely on resources from sites whose content we do not control/have no SLA:
    • Since these will cause failures when the external site is unavailable, as well as impacting end to end times & adding noise to performance tests.
    • eg: Emulator/driver binaries direct from a vendor's site, package downloads from PyPI or page assets for unit/performance tests.
    • Ensure MOZ_DISABLE_NONLOCAL_CONNECTIONS is defined in the automation environment (see bug 995417) & use a list of automation prefs for switching off undesirable behaviour (eg automatic updates, telemetry pings; see bug 1023483 for where these are set).
  • Must not contain time bombs, e.g. tests that will fail after a certain date or when run at certain times (e.g., the day summer time starts or ends, or when the test starts before midnight and finishes after midnight).
  • See the best practices for avoiding intermittent failures (oranges).
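One common way to avoid the time bombs mentioned above is to pass the current time into the code under test instead of reading the wall clock. The sketch below is illustrative only: `session_is_valid` and the test are hypothetical and not taken from any Mozilla harness.

```python
from datetime import datetime, timedelta, timezone


def session_is_valid(issued_at: datetime, now: datetime,
                     lifetime: timedelta = timedelta(hours=1)) -> bool:
    """Hypothetical code under test: 'now' is injected, never read from the clock."""
    return now - issued_at < lifetime


def test_session_expiry() -> None:
    # A fixed reference time, deliberately chosen just before midnight UTC,
    # so the result is identical whenever and wherever the test runs.
    fixed_now = datetime(2014, 10, 9, 23, 59, 30, tzinfo=timezone.utc)
    issued = fixed_now - timedelta(minutes=59)

    assert session_is_valid(issued, fixed_now)
    # Two minutes later the clock has crossed midnight and the session has expired.
    assert not session_is_valid(issued, fixed_now + timedelta(minutes=2))


if __name__ == "__main__":
    test_session_expiry()
    print("ok")
```

Because the test never consults the real date, it cannot start failing after a particular day or when a run happens to straddle midnight or a summer-time change.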

Has sufficient documentation

  • Has a wiki page with:
    • An overview of the test-suite.
    • Instructions for running locally.
    • How to disable an individual failing test.
    • The current owner/who to contact for help.
    • The Bugzilla product/component where bugs should be filed (GitHub issues is not discoverable enough and prevents the use of bug dependencies within the rest of the project).
  • That wiki page is linked to from https://developer.mozilla.org/docs/Mozilla/QA/Automated_testing

Additional requirements for tier 1 jobs

Tier 1 jobs are those that ... #todo (also reference https://developer.mozilla.org/en-US/docs/Supported_build_configurations). In addition to the requirements above, tier 1 jobs must also meet the following.

Breakage is expected to be followed by tree closure or backout

  • Failures visible in the default view (other than those that are known intermittents/transient), must have their cause backed out in a timely fashion or else the tree closed until diagnosed.
  • Why? If tier != 1 jobs were instead made visible in the default view, they would:
    • Interfere with ability to sheriff the tree:
      • Indistinguishable from tier-1 failures.
      • Appear in the failure count/cause the tab to glow.
      • Slow down navigation of failures when using keyboard shortcuts.
    • Cause extra workload for sheriffs by making them perform initial diagnosis/bug filing & then starring of the failure on every push until it is fixed an indeterminate amount of time later.
    • Cause confusion for non-sheriffs using project branches/try-server, as well as on all trees at the weekends when there are no employed sheriffs.
  • If your platform/test falls under the category of "someone should just file a bug and it will be investigated by our team later", then it unfortunately does not meet this requirement. From past requests this normally translates to "group X think this job type is important but we want to delegate the task of monitoring it to someone else".

Runs on mozilla-central and all trees that merge into it

  • Otherwise job failures when tree X merges into mozilla-central will not be attributable to a single changeset, resulting in either tree closure or backout of the entire merge (see requirement #2).
  • When filing the release engineering bug to enable your job on all the required trees, ask to enable it on "mozilla-central based trees" and release engineering will enable it in the default config from which all trunk trees inherit (unless the various tree owners have explicitly opted out). As a rough guide, mozilla-central based trees include mozilla-inbound, fx-team, b2g-inbound as well as many of the other project/disposable repositories.

Scheduled on every push

  • Otherwise job failures will not be attributable to a single changeset, resulting in either tree closure or backout of multiple pushes (see requirement #2).
  • An exception is made for nightly builds with a virtually equivalent non-nightly variant that is built on every push & for tests run on PGO builds (given that PGO builds take an inordinate amount of time, we still schedule them every 3/6 hours depending on tree, and relatively speaking there are not too many PGO-only test failures).
  • Note also that coalescing (buildbot queue collapsing when there is more than one queued job of the exact same tree/type) may mean that not all scheduled jobs actually get run. Whilst coalescing makes sheriffing harder, it's a necessary evil given that automation infrastructure demand frequently outstrips supply.
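To make the effect of coalescing concrete, here is a toy model of the queue collapsing described above. It is a simplified sketch, not buildbot's real scheduling logic: the `coalesce` function and the tuple layout are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# (tree, job_type, push_id) for each queued job, oldest first.
PendingJob = Tuple[str, str, int]


def coalesce(pending: List[PendingJob]) -> List[PendingJob]:
    """Keep only the newest queued job for each (tree, job_type) pair.

    Toy model: older queued jobs of the exact same tree/type are dropped,
    so their pushes never get a result for that job type.
    """
    newest: Dict[Tuple[str, str], int] = {}
    for tree, job_type, push_id in pending:
        newest[(tree, job_type)] = push_id  # later entries overwrite earlier ones
    return [(tree, job_type, push_id) for (tree, job_type), push_id in newest.items()]


if __name__ == "__main__":
    queue = [
        ("mozilla-inbound", "mochitest-1", 101),
        ("mozilla-inbound", "mochitest-1", 102),
        ("fx-team", "mochitest-1", 201),
        ("mozilla-inbound", "mochitest-1", 103),
    ]
    print(coalesce(queue))
```

In this model pushes 101 and 102 never run mochitest-1 on mozilla-inbound, so a failure first seen on push 103 could have been caused by any of the three pushes. That is why coalescing makes sheriffing harder.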

Easily run on try server

  • Otherwise developers who have had their landing backed out for breaking the job type may be unable to debug the failures/test the fix, particularly if they only reproduce on our infrastructure.
  • Developers should not be expected to guess try chooser options, so http://trychooser.pub.build.mozilla.org/ should be updated if appropriate.

Optional, but helpful

Easy for a dev to run locally

  • Supported by mach (if appropriate).
  • Ideally part of mozilla-central (legacy exceptions being Talos, gaia).

Supports the disabling of individual tests

  • It must be possible for sheriffs to disable an individual test per platform or entirely, by either annotating the test or editing a manifest/moz.build/Makefile in the relevant gecko repository.

Requesting changes in visibility

  • Please file a bug using this template, so that changes in visibility are more discoverable (vs IRC or asking as part of a bug in another product/component) and reasoning/history is preserved.

My platform/test-suite does not meet the requirements, what now?

  • Your platform/test-suite will still be run, just not shown on the default view. This model has worked well for many projects/build types (eg jetpack, xulrunner, spidermonkey).
  • To see it, append '&showall=1' to the URL (bug 748833 will add a checkbox for this to the TBPL UI).
  • To filter the jobs displayed, under the 'Filters' menu use the 'job name' field (which supports regexp).
  • eg: to see both ASan & Valgrind jobs on mozilla-central (neither of which are shown by default), use: https://tbpl.mozilla.org/?showall=1&jobname=(asan|valgrind)
  • For Try specifically, you can request that the job type be made non-default (ie requires explicit opt-in when using trychooser syntax, and won't be scheduled with '-u all' or similar), in order to be shown in the default view on Try - example.
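The filtered URL shown above can also be assembled programmatically. The following is a small sketch using only the two query parameters that appear in the example (`showall` and `jobname`); the helper name `hidden_jobs_url` is hypothetical.

```python
from typing import Iterable
from urllib.parse import urlencode

TBPL_BASE = "https://tbpl.mozilla.org/"


def hidden_jobs_url(job_names: Iterable[str]) -> str:
    """Build a TBPL URL that shows hidden jobs matching any of the given names.

    The names are joined into a regexp alternation, e.g. (asan|valgrind).
    """
    pattern = "(" + "|".join(job_names) + ")"
    # Keep '(', ')' and '|' unescaped so the URL matches the form used in the policy.
    query = urlencode({"showall": 1, "jobname": pattern}, safe="()|")
    return f"{TBPL_BASE}?{query}"


if __name__ == "__main__":
    print(hidden_jobs_url(["asan", "valgrind"]))
```

Names containing regexp metacharacters would need escaping first; the sketch leaves that out to stay close to the example in the text.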

The future

  • Planned improvements to our tooling will likely mean that some of these requirements can be relaxed in the future, as well as making it easier for maintainers of non-default-view job types to track their success/failure without having to monitor TBPL continuously.
  • Planned features for the successor to TBPL (Auto-tools/Projects/Treeherder) include:
    • Multiple dashboards/views for different use-cases/teams (giving us more flexibility than just "default view" or "&showall=1").
    • Opt-in notifications (email, IRC, dashboard, ...?) of failures for desired job types (see proposal in bug 851061).
  • Auto-tools/Projects/Bisect_in_the_cloud will allow sheriffs to more easily narrow regression ranges for job types that do not run on every push, making it more viable to accept them into certain views/dashboards.