= finding bugs to work on =
We have a [https://charts.mozilla.org/FreshOranges/index.html fresh oranges dashboard] which looks like the neglected oranges dashboard, except that it shows new failures that are high frequency and ignores [stockwell infra] bugs.

As these are new bugs, there will be issues here that are infra or harness related. Use this as an opportunity to annotate [stockwell infra] if the failure is build, taskcluster, network, or machine related. Otherwise, the rules are similar to disable-recommended: if a test case is in the bugzilla summary, we should be able to retrigger and find the patch which caused it to become so frequent.
'''Skip test-verify bugs''': test-verify already repeats tests, and only runs tests which were modified on a push. There is no need to retrigger or backfill a test-verify failure.
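If you prefer to pull candidate bugs straight from Bugzilla rather than the dashboard, here is a minimal sketch; the query fields used (the intermittent-failure keyword plus a whiteboard check) are an assumption on my part, not the dashboard's actual query.

<pre>
# Minimal sketch (assumed query): list open intermittent-failure bugs and
# skip anything already annotated as [stockwell infra].
import requests

resp = requests.get(
    "https://bugzilla.mozilla.org/rest/bug",
    params={
        "keywords": "intermittent-failure",
        "resolution": "---",              # open bugs only
        "include_fields": "id,summary,whiteboard",
        "limit": 100,
    },
    timeout=30,
)
resp.raise_for_status()
for bug in resp.json()["bugs"]:
    if "[stockwell infra]" in (bug.get("whiteboard") or ""):
        continue
    print(bug["id"], bug["summary"])
</pre>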
= choosing a config to test =
If there is not a clear winner, then consider a few factors which could help (a small counting sketch follows this list):
* debug typically provides more data than opt, but takes longer
* pgo is harder to backfill and builds take longer: try to avoid this
* ccov/jsdcov builds/tests are only run on mozilla-central: avoid these configs
* nightly is only run on mozilla-central: avoid these configs
* mac osx has a limited device pool: try to pick linux or windows
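As a rough illustration of the tie-breaking above, this sketch (the failure records and field names are made up) tallies failures per config while skipping the configs to avoid:

<pre>
# Sketch with made-up data: count failures per (platform, build type) and
# skip configs that are hard to backfill or resource-constrained.
from collections import Counter

AVOID_BUILDTYPES = {"pgo", "ccov", "jsdcov", "nightly"}

failures = [  # hypothetical records, e.g. scraped from bug comments
    {"platform": "windows7-32", "buildtype": "opt"},
    {"platform": "windows7-32", "buildtype": "pgo"},
    {"platform": "linux64", "buildtype": "debug"},
    {"platform": "windows7-32", "buildtype": "opt"},
    {"platform": "osx-10-10", "buildtype": "opt"},
]

counts = Counter(
    (f["platform"], f["buildtype"])
    for f in failures
    if f["buildtype"] not in AVOID_BUILDTYPES and not f["platform"].startswith("osx")
)
for (platform, buildtype), n in counts.most_common():
    print(f"{n:2d}  {platform} {buildtype}")
</pre>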
= choosing a starting point =
In many cases you will pick a different failure as the first point; I often like to pick the second instance of the branch/config so I can confirm that multiple revisions show the failure (i.e. that there is a pattern).
BEWARE: in many cases the first failure posted is not the earliest revision. Timestamps in orangefactor are based on when the job completed, not when the revision was pushed.

[[File:1-OF first failures.jpg|300px]]

The above example shows that windows 7 opt/pgo is common; I am picking win7-pgo on mozilla-inbound as it is where the pattern seems most frequent.
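Because of the timestamp caveat above, it can help to re-order the failure instances by push time before picking a starting revision. A tiny sketch (the field names are assumptions):

<pre>
# Sketch (assumed field names): order failures by when the revision was
# pushed, not when the job finished.
failures = [
    {"revision": "a1b2c3d4e5f6", "push_timestamp": 1512403200, "job_completed": 1512431000},
    {"revision": "0f9e8d7c6b5a", "push_timestamp": 1512399600, "job_completed": 1512433000},
]

earliest_first = sorted(failures, key=lambda f: f["push_timestamp"])
print("earliest pushed failing revision:", earliest_first[0]["revision"])
</pre>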
= how to find which job to retrigger =
Picking the first job is easy; it is usually obvious once you have chosen the config you are running against and pulled up the revision to start with. For example, it might be linux64/debug mochitest-browser-chrome-e10s-3.
[[File:TH_filtered_view.jpg|500px]]
Note that in the above picture we filter on |win pgo bc1| and then need to click the '20' link to load 20 more revisions.
[[File:TH_history.jpg|500px]]
Note that in the above picture bc1 is available to retrigger on many revisions; you can see the specific error highlighted in the preview pane, and I have circled the 'retrigger' button.
As a sanity check, I pull up the log file and search for the test name; it should show up as TEST-START, and then shortly after as TEST-UNEXPECTED-FAIL.
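The same sanity check can be scripted against a downloaded raw log. A minimal sketch (the log file name and test path below are placeholders):

<pre>
# Sketch: scan a downloaded raw log for the failing test and confirm a
# TEST-UNEXPECTED-FAIL follows its TEST-START.
TEST = "browser/components/sessionstore/test/browser_example.js"

saw_start = False
with open("live_backing.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if "TEST-START" in line and TEST in line:
            saw_start = True
        elif saw_start and "TEST-UNEXPECTED-FAIL" in line and TEST in line:
            print("confirmed:", line.strip())
            break
</pre>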
* we do not run every job/chunk on every push, so it could be 30 failures in 75 data points
* there could be retriggers on the existing data and 3 or 4 failures on a few pushes, making the overall failure rate less than 20%
[[File:TH_retriggered.jpg|500px]]
The above shows 20 retriggers (21 data points per push) for the bc1 job. 40 would give us a clearer pattern, but I wanted to save a few resources and confirm that 20 retriggers would still show the error and possibly narrow the range.
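To make the percentages concrete, here is a quick worked example (the numbers are illustrative only):

<pre>
# Illustrative arithmetic: the failure rate is failures per data point,
# where retriggers multiply the data points per push.
pushes = 20
retriggers_per_push = 20            # plus the original run = 21 data points
failures_seen = 30

data_points = pushes * (retriggers_per_push + 1)
print(f"{failures_seen} failures / {data_points} runs = {failures_seen / data_points:.1%}")
</pre>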
= backfilling =
= what to do with the data =
Once you have retriggered/backfilled a job, wait for it to finish. opt tests usually finish in <30 minutes once they start running; debug can take up to 1 hour.
When your initial tests finish, you might see a view like this:
[[File:TH_repeat.jpg|500px]]
Here you can see the 2-4 oranges per push. Check each failure to make sure the same test is failing. In the above case that is true, so we need to go further back in history.
After repeating the process a few times, the root cause will become visible:
[[File:TH_rootcause.jpg|500px]]
You can see that we switched from bc1 -> bc2 as the failing test, so the filter is now on bc instead of bc1. You can see a clear pattern of failures on every push, and almost no failures before the offending patch landed.
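In data terms, "a clear pattern" means the failure rate jumps at one push and stays high. A small sketch (the revision list and counts are made up) of spotting that point:

<pre>
# Sketch with made-up data: locate the first push where the failure rate
# jumps, which usually points at the offending patch.
history = [  # (revision, failures, runs), oldest push first
    ("rev01", 0, 21),
    ("rev02", 1, 21),
    ("rev03", 0, 21),
    ("rev04", 7, 21),
    ("rev05", 6, 21),
    ("rev06", 8, 21),
]

THRESHOLD = 0.2
for revision, failed, runs in history:
    if failed / runs > THRESHOLD:
        print("first consistently failing push:", revision)
        break
</pre>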
= exceptions and odd things =
some common exceptions to watch out for:
* if the root cause looks like a merge, repeat the process on the other integration branch
* rarely, but sometimes, failures occur on mozilla-central landings, or as a result of code merging
* sometimes it is obvious from check-in messages (or TV failures) that the failing test case was modified on a certain push: if the test was modified around the time it started failing, that is suspicious and can be used as a short-cut to find the regressing changeset (a small sketch for checking this follows the list)
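For that last case, a minimal sketch of asking mercurial when the failing test file was last touched in a local clone (the file path below is a placeholder):

<pre>
# Sketch: show recent changesets that touched the failing test file.
import subprocess

out = subprocess.run(
    ["hg", "log", "--limit", "20",
     "--template", "{node|short} {date|shortdate} {desc|firstline}\n",
     "browser/components/sessionstore/test/browser_example.js"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)
</pre>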