Sheriffing/How To/Intermittent bugs: Difference between revisions

(→‎Machine-specific failures: add instructions to terminate all machines)
m (fixed ordering of paragraphs)
 
(14 intermediate revisions by 3 users not shown)
Line 5: Line 5:


[[File:Treeherder suggestion.png|Treeherder suggestions]]
If a test failure has a suggestion with the test name and "single tracking bug" appended, it should be used for the failure classification if the failure is expected to be intermittent. Example: <code>Intermittent devtools/client/aboutdebugging/test/browser/browser_aboutdebugging_addons_debug_popup.js | single tracking bug</code>


==About ADB failures on Android==
Line 36: Line 38:
If you are not sure whether a failure line is a new frequent failure, or even a perma-failure which requires a backout of the change causing it, you can request more runs of the job:
* On the same push ("retrigger") by pressing the '''r''' key or clicking the button (circular arrow).
** The glove icon [[File:Treeherder - intermittent failed and successful run indicator.png|left]] for a failed task indicates there have been successful runs of the task with the same configuration (e.g. platform) on the same push - the failure is verified to be intermittent.
* On previous pushes by opening the actions menu in the bottom toolbar ("...") and calling "Backfill". This will run the job on the 5 previous pushes, regardless of whether it already ran for those pushes. If the job depends on a build which is missing, that build will be generated before the test runs.
** A quicker way - if supported - to check if this fails frequently is test-verify backfill.
Line 46: Line 49:


There is also a [https://sql.telemetry.mozilla.org/dashboard/permafailing-tests dashboard] to check for permafailing tasks.
== Filter Treeherder by test path ==
NOTE: Some test paths are not yet available. See [https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/tests.py#1282 source code]. See "chunking" bugs under this [https://bugzilla.mozilla.org/showdependencytree.cgi?maxdepth=1&id=1603255&hide_resolved=1 meta bug].
Treeherder allows filtering tasks by the tests they ran:
* Click the Filter icon button in the second toolbar at the top, not the "Filters" menu in the top toolbar.
* Select the filter with the text 'test path'.
* As the filter value, enter the path of the folder which contains the test, without the test name. E.g. if you want to find toolkit/components/passwordmgr/test/browser/browser_doorhanger_generated_password.js, the search should be for toolkit/components/passwordmgr/test/browser/
* Press Enter.
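The filter value can be derived from a full test path on the command line. This is a minimal sketch which only uses the example path from the list above; it does not talk to Treeherder.

```shell
# Example test path taken from the text above.
test_path="toolkit/components/passwordmgr/test/browser/browser_doorhanger_generated_password.js"

# dirname drops the file name; the trailing slash is added back by hand.
filter_value="$(dirname "$test_path")/"

echo "$filter_value"
```

The printed value can be pasted into the 'test path' filter field.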
= Mobile Failures =
If tasks for mobile products like Firefox for Android fail and the task group ends with "-debug", "-beta" or "-nightly", they need to be retriggered until they succeed or until the changeset causing the permanent failure has been backed out. Without a successful run, the mobile Nightly does not ship to users.
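The suffix rule can be written as a small shell check. This is only an illustration: the function name <code>must_succeed</code> and the group names are hypothetical, and only the three suffixes come from the paragraph above.

```shell
# Succeeds if the task group name ends with one of the suffixes named above.
must_succeed() {
  case "$1" in
    *-debug|*-beta|*-nightly) return 0 ;;
    *) return 1 ;;
  esac
}

# "fenix-nightly" and "fenix-lint" are hypothetical group names.
for group in fenix-nightly fenix-lint; do
  if must_succeed "$group"; then
    echo "$group: retrigger until green or back out"
  else
    echo "$group: not covered by this rule"
  fi
done
```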


= General failure messages - deciding if new bug needed =
Line 57: Line 73:


If a task fails with a message mentioning "ADBError" or something which starts with "ADB" and ends with "Error", it is a failure which the phone running the task returned to the controlling computer. Each failure message needs its own bug (e.g. for "init failed", "failed to create directory" etc.).
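To spot such a message in a failure line, a pattern for "starts with ADB, ends with Error" is enough. The sketch below assumes the error name consists of letters only; the function name <code>adb_error_name</code> and the sample failure line are hypothetical.

```shell
# Prints the first token that starts with "ADB" and ends with "Error", if any.
adb_error_name() {
  printf '%s\n' "$1" | grep -Eo 'ADB[A-Za-z]*Error' | head -n 1
}

# Hypothetical failure line; only the "ADB...Error" shape comes from the text.
adb_error_name "mozdevice.adb.ADBTimeoutError: init failed"
```

The text after the error name ("init failed" in this sample) is what decides which bug the failure belongs to.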
== Crash reports without human-readable function name ==
If a crash mentions a file and a hexadecimal address, e.g. "application crashed [@ libxul.so + 0x17eb4ff]", instead of a function name like "application crashed [@ {virtual override thunk({offset(-16)}, mozilla::net::Http2Session::TakeHttpConnection())} + 0x2d]", the translation of the crash location from a machine address to a human-readable name ("symbolication") failed. Usually the files "libxul.so" (Linux and Android) or "xul.dll" (Windows) are mentioned. This is mostly observed on systems with limited memory, e.g. Android and Windows 7 (= 32-bit). In these cases, assume such a crash matches a bug suggestion about a crash if there is only one suggestion about a crash.
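The two signature shapes can be told apart mechanically. The following sketch assumes an unsymbolicated signature always has the form "[@ <file>.so + 0x<hex>]" or "[@ <file>.dll + 0x<hex>]"; the function name <code>is_unsymbolicated</code> is hypothetical.

```shell
# Succeeds if the crash line names a bare library file plus a hex offset
# instead of a function name.
is_unsymbolicated() {
  printf '%s\n' "$1" | grep -Eq '\[@ [A-Za-z0-9_.-]+\.(so|dll) \+ 0x[0-9a-fA-F]+\]'
}

if is_unsymbolicated "application crashed [@ libxul.so + 0x17eb4ff]"; then
  echo "symbolication failed"
fi
```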


= How to file a bug for an intermittent failure =
Line 136: Line 155:
Terminating the machine from the command line:


Before the taskcluster migration on November 9th 2019:
* Run <code>export TASKCLUSTER_ROOT_URL=https://taskcluster.net</code><br>
If the migration already took place, run this instead:<br>
* Run <code>export TASKCLUSTER_ROOT_URL=https://firefox-ci-tc.services.mozilla.com</code>
 
* Sign in: <code>eval $(taskcluster signin --scope 'worker-manager:*')</code>
* Log in in the browser, close the browser after you clicked 'Create Client'.
* The syntax for terminating a worker is <code>taskcluster api workerManager removeWorker <workerPoolId> <workerGroup> <workerId></code><br/>
Example:
The page for the worker lists: <code>gecko-3 / worker-types / b-linux / us-west-1 / i-07985396a2efa2ed1</code> as identifier<br/>
The command to run is:
<code>taskcluster api workerManager removeWorker gecko-3/b-linux us-west-1 i-07985396a2efa2ed1</code>
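The identifier shown on the worker page can be turned into the three arguments mechanically. This sketch assumes the identifier always has five parts separated by " / ", with the literal "worker-types" in second position as in the example; the function name <code>worker_args</code> is hypothetical. It only prints the arguments and does not terminate anything.

```shell
# Converts "provisioner / worker-types / type / group / id" into
# "provisioner/type group id"; the second field is dropped.
worker_args() {
  printf '%s\n' "$1" | awk -F' / ' '{ print $1 "/" $3, $4, $5 }'
}

worker_args "gecko-3 / worker-types / b-linux / us-west-1 / i-07985396a2efa2ed1"
```

The printed line can be appended to <code>taskcluster api workerManager removeWorker</code> after checking that it names the intended machine.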


Terminating all machines in a worker pool:


Save the worker pool name in a variable, in this case <code>gecko-3/decision</code>:
<code>export wp=gecko-3/decision</code>
Run this script to terminate all machines in that pool:
<code>taskcluster api workerManager listWorkersForWorkerPool "$wp" | jq -r '.workers[] | .workerGroup + " " + .workerId' | while read -r wg wid; do taskcluster api workerManager removeWorker "$wp" "$wg" "$wid"; done</code>
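Because this terminates every machine in the pool, it can help to print the commands first and review them. The sketch below runs the same loop with <code>echo</code> in front of the command; the worker groups and ids are hypothetical stand-ins for the <code>jq</code> output, so nothing is contacted or terminated.

```shell
wp=gecko-3/decision

# Hypothetical "workerGroup workerId" pairs standing in for the jq output.
workers='us-west-1 i-0123456789abcdef0
us-east-1 i-0fedcba9876543210'

# Same loop as above, but echo prints each command instead of running it.
dry_run="$(printf '%s\n' "$workers" | while read -r wg wid; do
  echo taskcluster api workerManager removeWorker "$wp" "$wg" "$wid"
done)"

printf '%s\n' "$dry_run"
```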


== How to file a security bug ==
'''NOTE''': Treeherder supports the creation of security bugs from its bug filer. The process described below should be used if that feature stops working or if a bug has to be filed outside Treeherder.


When we see failures which contain “use-after-poison” in the log, it usually means that we have to file a security bug for it. Security bugs are '''not''' visible except when you are on the CC list.
Line 174: Line 194:
* stack-buffer-underflow
* use-after-poison
When following the manual process, file the bug directly in '''Bugzilla'''.
How to file such a bug:
# Access '''Bugzilla''' (https://bugzilla.mozilla.org/enter_bug.cgi) and search for the relevant component, in this case Core :: Layout.
Line 191: Line 210:
* The task generation is stored in Firefox Build System :: Task Configuration.
If it is necessary to check whether an issue is related to Treeherder or Taskcluster, open an affected job by selecting it and then clicking on the task link at the bottom left. If the job shows as expected on the Taskcluster page, Treeherder might not be receiving the (correct) data and the bug should be filed against Treeherder; otherwise, file it against Taskcluster.
== Reverting classifications ("Unclassification") ==
* Classifications of individual tasks can be removed from the "Annotations" tab of the task; the bug number should be removed first.
* The classifications of multiple tasks can be removed by code sheriffs. Pin the tasks to the pinboard, then select "Unclassify all" from the "save" button's dropdown menu.
== Nightly / Release task failure ==
Builds which ship to users of Nightly, Beta, Release and ESR use an existing push and schedule missing and additional tasks, e.g. startup tests and localized builds. All of these tasks must complete for the build to be regarded as "good to ship". The tasks are grouped by platform and by whether they are for English (en-US) builds or other locales; shipping gets activated for each group when its tasks complete.
Sometimes it is not obvious whether a task must be rerun, and in a few special cases a task is not shown by Treeherder. To verify that all tasks in the release graph completed, open the Taskcluster page which shows all tasks in the release graph. For every non-Nightly release, the <code>taskcluster-firefoxci</code> bot posts the link in the [https://chat.mozilla.org/#/room/#releaseduty:mozilla.org #releaseduty] channel (for beta, there are 2 release graphs: 1 for the Developer Edition and 1 for Beta).
To get the corresponding Taskcluster page for a Nightly:
# Open the [https://treeherder.mozilla.org/jobs?repo=mozilla-central Treeherder page for mozilla-central].
# Find the cron task which scheduled the Nightly release graph. For desktop, it has the symbol <code>cron(Nd)</code>.
# Select the task.
# At the bottom left, open the link to the Taskcluster URL for the task.
# In the Taskcluster page, click the "Task Group" link near the top left.
If all tasks are listed under the green "Completed" category and the count for all other states is 0, the release should have shipped.
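The "all completed, everything else 0" check can also be applied to a plain list of task states. This sketch assumes one state per line on standard input; how the list is obtained is left open. The function name <code>all_completed</code> and the sample states are hypothetical.

```shell
# Reads one task state per line on stdin and succeeds only if no line
# differs from "completed". Empty input also counts as success.
all_completed() {
  ! grep -qvx 'completed'
}

# Hypothetical states, e.g. copied from the task group page.
states='completed
completed
failed'

# Count per state, then the overall verdict.
printf '%s\n' "$states" | sort | uniq -c
if printf '%s\n' "$states" | all_completed; then
  echo "release graph is green"
else
  echo "some tasks have not completed"
fi
```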