Sheriffing/How To/Intermittent bugs: Difference between revisions

Sheriffing/How To/Intermittent bugs (view source)

Revision as of 21:40, 26 March 2024

11,624 bytes added , 26 March 2024

m

fixed ordering of paragraphs

Archaeopteryx

@@ Line 5: / Line 5: @@
 [[File:Treeherder suggestion.png|Treeherder suggestions]]
+If a test failure has a suggestion with the test name and "single tracking bug" appended, it should be used for the failure classification if the failure is expected to be intermittent. Example: <code>Intermittent devtools/client/aboutdebugging/test/browser/browser_aboutdebugging_addons_debug_popup.js | single tracking bug</code>
+==About ADB failures on Android==
+adb (Android Debug Bridge) is a program which allows an external computer to control a phone, e.g. to install software.
+ADBTimeoutError: The command didn't finish for an unknown reason, e.g. a connection issue.
+ADBError, ADBProcessError: One of the commands the phone was asked to run failed. Each distinct failure needs its own bug. For example,
+Intermittent raise ADBError("ADBDevice.__init__: ls could not be found")
+means the program "ls" was not found. If it were a different program, a new bug should be filed.
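The distinction between the ADB exception types can be sketched in Python. This is a minimal illustration, not part of any Mozilla tool: the function name <code>classify_adb_failure</code> and the labels <code>"timeout"</code> and <code>"command_failed"</code> are hypothetical and only mirror the two cases described above.

```python
import re

# Matches exception names such as ADBError, ADBTimeoutError, ADBProcessError.
ADB_ERROR_RE = re.compile(r"\b(ADB\w*Error)\b")


def classify_adb_failure(failure_line):
    """Return (exception_name, kind) for an ADB failure line, or None.

    The kind labels are hypothetical names for the two cases in the text:
    "timeout" for ADBTimeoutError, "command_failed" for everything else.
    """
    match = ADB_ERROR_RE.search(failure_line)
    if match is None:
        return None
    name = match.group(1)
    kind = "timeout" if name == "ADBTimeoutError" else "command_failed"
    return name, kind
```

For failures of kind <code>"command_failed"</code>, the failing command in the message still has to be compared with existing bugs by hand, because each failing command gets its own bug.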
+==TV (test-verify) failure to test suite mapping==
+The Test-Verify job checks whether added or modified test files fail by running them many times, both sequentially and in parallel, in one or more sessions. To check if they also fail in their normal test suite, the test suite has to be identified. Search the log of the TV job for the test name and find a line such as
+<code>Per-test run found test toolkit/mozapps/extensions/test/browser/browser_webapi_access.js (mochitest-browser-chrome/None)</code>
+The test belongs to mochitest browser-chrome - the <code>bc</code> tasks on Treeherder.
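Finding that line can also be scripted. The following is a minimal sketch which assumes the log line always has the shape shown in the example above; the function name <code>find_suite</code> is hypothetical.

```python
import re

# Pattern derived from the single example line above; other log formats
# are not covered.
PER_TEST_RE = re.compile(
    r"Per-test run found test (?P<test>\S+) "
    r"\((?P<suite>[^/()]+)/(?P<subsuite>[^()]+)\)"
)


def find_suite(log_text, test_path):
    """Return (suite, subsuite) for test_path, or None if it is not listed.

    Both values are returned as strings, so a missing subsuite shows up
    as the text "None", exactly as it is printed in the log.
    """
    for match in PER_TEST_RE.finditer(log_text):
        if match.group("test") == test_path:
            return match.group("suite"), match.group("subsuite")
    return None
```

Mapping the suite name to the Treeherder symbol (here <code>bc</code>) remains a manual step.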
+==Exceptions without logs linked from Treeherder==
+Tasks which failed with an exception have no log linked from Treeherder but still have a log in Taskcluster (setting up the task failed). To access it:
+# Click the "Task" link at the bottom left.
+# From the "Task Run" tab, select the correct number. This should be the highest one if the task did not get rerun or retriggered.
+# From "Run Logs", click on "public/logs/live.log".
+# Check the end of the log for failures.
 = Check for intermittency =
@@ Line 10: / Line 38: @@
 If you are not sure whether a failure line is a new frequent failure or even a perma-failure that requires a backout of the change which causes the failure, you can request more runs of the job:
 * On the same push ("retrigger") by pressing the '''r''' key or clicking the button (circular arrow).
+** The glove icon [[File:Treeherder - intermittent failed and successful run indicator.png|left]] for a failed task indicates there have been successful runs of the task with the same configuration (e.g. platform) on the same push - the failure is verified to be intermittent.
 * On previous pushes by opening the actions menu in the bottom toolbar ("...") and selecting "Backfill". This will run the job on the 5 previous pushes, regardless of whether it already ran for those pushes. If the job depends on a build which is missing, that build will be generated before the test runs.
 ** A quicker way - if supported - to check if this fails frequently is test-verify backfill.
@@ Line 18: / Line 47: @@
 **#* Add "testPath:" and the test path and name from the clipboard.
 **# The test will be run multiple times in a job TV-bf. If it fails for the later job but passes for the previous one, it is a strong indicator that the failure is related to changes of the push with the TV-bf failure.
+There is also a [https://sql.telemetry.mozilla.org/dashboard/permafailing-tests dashboard] to check for permafailing tasks.
+== Filter Treeherder by test path ==
+NOTE: Some test paths are not yet available. See [https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/tests.py#1282 source code]. See "chunking" bugs under this [https://bugzilla.mozilla.org/showdependencytree.cgi?maxdepth=1&id=1603255&hide_resolved=1 meta bug].
+Treeherder allows filtering the tasks by the tests they ran:
+* Click the Filter icon button in the second toolbar at the top, not the "Filters" menu in the top toolbar.
+* Select the filter with the text 'test path'.
+* As filter value, enter the path of the folder which contains the test but without test name. E.g. if you want to find toolkit/components/passwordmgr/test/browser/browser_doorhanger_generated_password.js the search should be for toolkit/components/passwordmgr/test/browser/
+* Press Enter.
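The filter value from the third step is the test path with the file name removed and the trailing slash kept. A minimal sketch, assuming forward slashes as in repository paths; the function name <code>folder_filter_for</code> is hypothetical:

```python
import posixpath


def folder_filter_for(test_path):
    """Return the folder part of a test path, with a trailing slash."""
    folder = posixpath.dirname(test_path)
    # A bare file name has no folder, so there is nothing to filter by.
    return folder + "/" if folder else ""
```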
+= Mobile Failures =
+If tasks for mobile products like Firefox for Android fail and the task group ends with "-debug", "-beta" or "-nightly", they need to be retriggered until they succeed or until the changeset causing the permanent failure has been backed out. Without a successful run, the mobile Nightly does not ship to users.
 = General failure messages - deciding if new bug needed =
@@ Line 26: / Line 70: @@
 Example: Btup builds started to also fail intermittently with the message from [https://bugzilla.mozilla.org/show_bug.cgi?id=1411358 bug 1411358]. The logs for these jobs showed no output before the timeout got hit, often even for more than 40 minutes. [https://bugzilla.mozilla.org/show_bug.cgi?id=1480494 bug 1480494] got created and because the scope was only on that build type, investigation by developers started quickly.
 The jobs [https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-08-03&endday=2018-08-10&tree=trunk&bug=1411358 classified as bug 1411358] - sort by "Test Suite" and look for Test Suite "opt" - showed the issue started on August 3rd while there had been many similar failure messages for other job types already before that.
+If a task fails with a message mentioning "ADBError" or something which starts with "ADB" and ends with "Error", it's a failure returned by the phone which runs the task to the controlling computer. Each failure message needs its own bug (e.g. for "init failed", "failed to create directory" etc.)
+== Crash reports without human-readable function name ==
+If a crash mentions a file and a hexadecimal address, e.g. "application crashed [@ libxul.so + 0x17eb4ff]", instead of a function name like "application crashed [@ {virtual override thunk({offset(-16)}, mozilla::net::Http2Session::TakeHttpConnection())} + 0x2d]", the translation of the location where the crash occurred from a machine address to a human-readable name ("symbolication") failed. Usually the files "libxul.so" (Linux and Android) or "xul.dll" (Windows) are mentioned, and this is observed mostly on systems with limited memory, e.g. Android and Windows 7 (= 32-bit). In these cases, assume such a crash matches a bug suggestion about a crash if there is only one suggestion about a crash.
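Such unsymbolicated crash lines can be recognised with a simple pattern. The sketch below is a heuristic, not an existing Treeherder feature: it assumes the module name ends in <code>.so</code> or <code>.dll</code>, as in the files named above, and the function name <code>unsymbolicated_module</code> is hypothetical.

```python
import re

# Heuristic: a library file name followed directly by a hexadecimal offset.
UNSYMBOLICATED_RE = re.compile(
    r"application crashed \[@ (?P<module>[\w.\-]+\.(?:so|dll)) \+ 0x[0-9a-fA-F]+\]"
)


def unsymbolicated_module(failure_line):
    """Return the module name if symbolication failed, else None."""
    match = UNSYMBOLICATED_RE.search(failure_line)
    return match.group("module") if match else None
```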
 = How to file a bug for an intermittent failure =
@@ Line 50: / Line 99: @@
 ## Find it in the repository, either with the search term 'path:/navigation-timing/test_timing_xserver_redirect.html' on [https://dxr.mozilla.org/mozilla-central/source/ DXR] or '/navigation-timing/test_timing_xserver_redirect.html' in the right path filter field of [http://searchfox.org/ searchfox]. If you don't find anything, then there are still folders from outside the source folder in the path. Delete everything e.g. up to 'gecko' or 'build' and try again.
 ## Copy the full folder and file path, e.g. <code>testing/web-platform/tests/navigation-timing/test_timing_xserver_redirect.html</code>
-## In the console with the mozilla-unified folder, run the following command to get the Bugzilla product and component in which bugs related to the file should be posted:<br>
+## In the console with the mozilla-unified folder, run the following command to get the Bugzilla product and component in which bugs related to the file should be posted: <code>./mach file-info bugzilla-component testing/web-platform/tests/navigation-timing/test_timing_xserver_redirect.html</code> In this case, we get: <code>Core :: DOM<br>testing/web-platform/tests/navigation-timing/test_timing_xserver_redirect.html</code>
-<code>./mach file-info bugzilla-component testing/web-platform/tests/navigation-timing/test_timing_xserver_redirect.html</code><br>
+## If the failure is a real crash (not a crash report generated because the test execution hung and the application eventually had to be shut down), try to identify the component from the crash:
-In this case, we get: <code>Core :: DOM<br>
+### Check if the crash signature itself discloses where the bug belongs, e.g. <code>[@ webrtc::MouseCursorMonitorX11::CaptureCursor()]</code> goes into <code>Core :: WebRTC</code>.
-testing/web-platform/tests/navigation-timing/test_timing_xserver_redirect.html</code>
+### In case it's not obvious from the crash signature, check the crashing thread. The files mentioned in the first entries ("stack frames") can belong to code which handles the crash (e.g. they contain "report" or "panic"). Skip those. After you find a frame which looks like "real" code, look up the component as described above. Example:<br>
+  Thread 24 (crashed)
+  libxul.so!GeckoCrash [nsAppRunner.cpp:38c57ccca71e24c90e73bfd2a06bd6a1de6b17db : 5076 + 0x15]
+  libxul.so!gkrust_shared::panic_hook [lib.rs:38c57ccca71e24c90e73bfd2a06bd6a1de6b17db : 241 + 0x9]
+  libxul.so!core::ops::function::Fn::call [function.rs:91856ed52c58aa5ba66a015354d1cc69e9779bdf : 69 + 0x9]
+  libxul.so!rust_panic_with_hook [panicking.rs:91856ed52c58aa5ba66a015354d1cc69e9779bdf : 482 + 0x6]
+  libxul.so!std::panicking::begin_panic [panicking.rs:91856ed52c58aa5ba66a015354d1cc69e9779bdf : 412 + 0x1e]
+  libxul.so!webrender::profiler::TimeProfileCounter::profile [profiler.rs:38c57ccca71e24c90e73bfd2a06bd6a1de6b17db : 282 + 0xaa]
+This is a Webrender failure and belongs in <code>Core :: Graphics: Webrender</code><br>
+If it cannot be identified where the bug belongs, put it in <code>Firefox :: Untriaged</code>
+## In case it's unknown into which product and component the bug belongs, put it in <code>Firefox :: Untriaged</code> and developers will take a look at it.
 # Copy the failure text from the log window into the bug
 # Set the Summary as: Intermittent navigation-timing/test_timing_xserver_redirect.html | expected OK
@@ Line 70: / Line 129: @@
 [[File:Sheriffing intermittent failures dead pixel.png|reftest analyzer with highlighted dead pixel outside of area with content created for testing]]
-Terminate the machine if you discover such an issue.
+Terminate the machine if you discover such an issue:
+See steps below if there is no '''Terminate''' button.
+# In Treeherder with the job selected, click the link '''Task: <something>''' at the bottom left.
+# Click on the tab '''Task Run 0 (Latest)'''. If the number is not <code>0</code>, the job ran multiple times. Find the one in Treeherder by comparing the '''Job Ended''' time in Treeherder (bottom left) with the line Resolved in Taskcluster. Move the mouse over it to see the timestamp.
+# Click on '''WorkerId'''.
+* If there are many failed jobs listed for that machine, terminate it if it's a rented one or quarantine it if it's owned by Mozilla (the termination button is not offered in that case).
+* If in doubt, terminate it.
+* If you quarantine a machine, let #ci on IRC know.
+The new worker provisioner doesn't support terminating machines from the browser yet - the '''Terminate''' button is missing on the Taskcluster page and the provisioner name doesn't mention '''hardware''' (like in 'releng-hardware'). These machines have to be removed from the command line:
+Initial setup:
+# Download the [https://github.com/taskcluster/taskcluster/releases/ latest taskcluster executable] (taskcluster-linux-amd64).
+# Rename it to <code>taskcluster</code>
+# Run <code>which taskcluster</code> in your console terminal.
+# In a console, run <code>sudo nautilus</code> to open the file manager with superuser permissions.
+# Open the folder returned by the <code>which</code> command.
+# Replace the <code>taskcluster</code> file there with the downloaded and renamed one.
+# In the file manager, right click the <code>taskcluster</code> file > 'Properties' > 'Permissions' and set 'Allow executing file as program' if it is not enabled yet.
+# In the console, run <code>taskcluster</code> to confirm the application gets found.
+Terminating the machine from the command line:
+* Run <code>export TASKCLUSTER_ROOT_URL=https://firefox-ci-tc.services.mozilla.com</code>
+* Sign in: <code>eval $(taskcluster signin --scope 'worker-manager:*')</code>
+* Log in via the browser; close the browser after you have clicked 'Create Client'.
+* The syntax for terminating a worker is <code>taskcluster api workerManager removeWorker <workerPoolId> <workerGroup> <workerId></code><br/>
+Example:
+The page for the worker lists: <code>gecko-3 / worker-types / b-linux / us-west-1 / i-07985396a2efa2ed1</code> as identifier<br/>
+The command to run is:
+<code>taskcluster api workerManager removeWorker gecko-3/b-linux us-west-1 i-07985396a2efa2ed1</code>
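The translation from the identifier on the worker page to the command arguments can be sketched as follows. The five-part layout with a literal <code>worker-types</code> segment is inferred from the single example above and may not hold for every worker page; the function name <code>remove_worker_command</code> is hypothetical.

```python
def remove_worker_command(identifier):
    """Build the removeWorker command from a worker page identifier.

    Assumed layout (taken from the example above):
    <provisioner> / worker-types / <worker type> / <worker group> / <worker id>
    """
    parts = [part.strip() for part in identifier.split("/")]
    if len(parts) != 5 or parts[1] != "worker-types":
        raise ValueError("unexpected identifier layout: %r" % identifier)
    provisioner, _, worker_type, worker_group, worker_id = parts
    # The worker pool ID is the provisioner and the worker type joined by "/".
    worker_pool_id = provisioner + "/" + worker_type
    return [
        "taskcluster", "api", "workerManager", "removeWorker",
        worker_pool_id, worker_group, worker_id,
    ]
```

The function only builds the argument list and does not run anything, so the result can be checked before it is pasted into the console.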
+Terminating all machines in a worker pool:
+Save the worker pool name in a variable. In this example, the pool is gecko-3/decision:
+<code>export wp=gecko-3/decision</code>
+Run this script to terminate all machines in that pool:
+<code>taskcluster api workerManager listWorkersForWorkerPool "$wp" | jq -r '.workers[] | .workerGroup + " " + .workerId' | while read -r wg wid; do taskcluster api workerManager removeWorker "$wp" "$wg" "$wid"; done</code>
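To review which machines would be removed before running the one-liner, the same transformation can be done without executing anything. This sketch assumes the JSON shape implied by the <code>jq</code> expression above (a <code>workers</code> list whose entries have <code>workerGroup</code> and <code>workerId</code>); the function name <code>removal_commands</code> is hypothetical.

```python
import json


def removal_commands(worker_pool, list_workers_json):
    """Return one removeWorker argument list per worker in the JSON output.

    list_workers_json is the text printed by
    "taskcluster api workerManager listWorkersForWorkerPool <pool>".
    """
    data = json.loads(list_workers_json)
    return [
        ["taskcluster", "api", "workerManager", "removeWorker",
         worker_pool, worker["workerGroup"], worker["workerId"]]
        for worker in data["workers"]
    ]
```

Each returned list corresponds to one iteration of the shell loop and can be inspected or logged first.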
 == How to file a security bug ==
+'''NOTE''': Treeherder supports the creation of security bugs from its bug filer. The process described below should be used if that feature stops working or a bug has to be filed outside Treeherder.
 When we see failures which contain “use-after-poison” in the log, it usually means that we have to file a security bug for it. Security bugs are '''not''' visible except when you are on the CC list.
@@ Line 90: / Line 194: @@
 * stack-buffer-underflow
 * use-after-poison
-The bug should be filed manually from '''Bugzilla''', and <span style="color:#FF0000">'''not'''</span> from '''Treeherder.'''
 How to file such a bug:
 # Access '''Bugzilla''' (https://bugzilla.mozilla.org/enter_bug.cgi) and search after the relevant Component, in this case Core :: Layout.
@@ Line 101: / Line 204: @@
 <span style="color:#14866d">'''''Note: If you need to leave a security bug for the next shift for a follow up, make sure to add one member of that shift on the CC list'''''.</span>
+== Infrastructure related issues ==
+* Treeherder only displays tasks and allows retriggering them.
+* Taskcluster runs them.
+* The task generation is stored in Firefox Build System :: Task Configuration.
+If it is necessary to check if an issue is related to Treeherder or Taskcluster, open an affected job by selecting it and then clicking on the task link at the bottom left. If the job shows as expected on the Taskcluster page, Treeherder might not have received the (correct) data and the bug should be filed against Treeherder; otherwise, file it against Taskcluster.
+== Reverting classifications ("Unclassification") ==
+* Classifications of individual tasks can be removed from the "Annotations" tab of the task; the bug number should be removed first.
+* The classifications of multiple tasks can be removed by code sheriffs. The tasks have to be pinned to the pinboard, then "Unclassify all" has to be selected from the "save" button's dropdown menu.
+== Nightly / Release task failure ==
+Builds which will ship to users of Nightly, Beta, Release and ESR use an existing push and schedule missing and additional tasks, e.g. startup tests and localized builds. These must all complete for the build to be regarded as "good to ship". The tasks are grouped by platform and by whether they are for English (en-US) builds or other locales; shipping gets activated for each of those groups when its tasks complete.
+Sometimes it is not obvious whether a task must be rerun, and in a few special cases a task is not shown by Treeherder. To validate that all tasks in the release graph completed, open the Taskcluster page which shows all tasks in the release graph. For every non-Nightly release, the <code>taskcluster-firefoxci</code> bot posts the link in the [https://chat.mozilla.org/#/room/#releaseduty:mozilla.org #releaseduty] channel (for Beta, there are 2 release graphs: 1 for the Developer Edition and 1 for Beta).
+To get the corresponding Taskcluster page for a Nightly:
+# Open the [https://treeherder.mozilla.org/jobs?repo=mozilla-central Treeherder page for mozilla-central].
+# Find the cron task which scheduled the Nightly release graph. For desktop, it has the symbol <code>cron(Nd)</code>.
+# Select the task.
+# At the bottom left, open the link to the Taskcluster URL for the task.
+# In the Taskcluster page, click the "Task Group" link near the top left.
+If all tasks are listed under the green "Completed" category and the count for all other states is 0, the release should have shipped.
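The final check amounts to a simple rule over the per-state task counts shown on the Task Group page. A minimal sketch, assuming the counts have been copied into a dictionary by hand; the function name <code>release_graph_complete</code> and the state names other than "completed" are hypothetical.

```python
def release_graph_complete(state_counts):
    """Return True if every task in the group is in the "completed" state.

    state_counts maps a state name to its task count, e.g.
    {"completed": 412, "failed": 0, "running": 0}.
    """
    completed = state_counts.get("completed", 0)
    others = sum(
        count for state, count in state_counts.items() if state != "completed"
    )
    # An empty group is not treated as shipped.
    return completed > 0 and others == 0
```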