(→how many retriggers: - first image) |
|||
(8 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
= finding bugs to work on = | = finding bugs to work on = | ||
We have a [https://charts.mozilla.org/ | We have a [https://charts.mozilla.org/FreshOranges/index.html fresh oranges dashboard] which looks like neglected oranges, except that it shows new failures that are high frequency and ignores [stockwell infra] bugs. | ||
As these are new bugs, there will be issues here that are infra or harness related | As these are new bugs, there will be issues here that are infra or harness related. Use this as an opportunity to annotate [stockwell infra] if the failure is build, taskcluster, network, or machine related. Otherwise, the rules are similar to disable-recommended: if a test case is in the Bugzilla summary, we should be able to retrigger and find the patch which caused the failure to become so frequent. | ||
'''Skip test-verify bugs''': test-verify already repeats tests, and only runs tests which were modified on a push. There is no need to retrigger or backfill a test-verify failure. | |||
= choosing a config to test = | = choosing a config to test = | ||
Line 8: | Line 10: | ||
If there is not a clear winner, then consider a few factors which could help: | If there is not a clear winner, then consider a few factors which could help: | ||
* debug typically provides more data, but takes longer | * debug typically provides more data than opt, but takes longer | ||
* pgo is harder to backfill and builds take longer | * pgo is harder to backfill and builds take longer: try to avoid this | ||
* ccov/jsdcov builds/tests are only run on mozilla-central | * ccov/jsdcov builds/tests are only run on mozilla-central: avoid these configs | ||
* nightly is only run on mozilla-central | * nightly is only run on mozilla-central: avoid these configs | ||
* mac osx has a limited device pool | * mac osx has a limited device pool: try to pick linux or windows | ||
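The factors above can be sketched as a simple ordering over candidate configs. This is only an illustration, not an existing tool: the <code>Config</code> type, the platform and build-type strings, and the <code>prefer_debug</code> switch are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    platform: str    # hypothetical label, e.g. "linux64", "windows10-64", "macosx64"
    build_type: str  # hypothetical label, e.g. "opt", "debug", "pgo", "ccov"


# Build types the list above says to avoid when there is no clear winner.
AVOID_BUILD_TYPES = {"pgo", "ccov", "jsdcov", "nightly"}


def rank_configs(configs, prefer_debug=False):
    """Order candidate configs from most to least convenient to retrigger.

    prefer_debug=True trades speed for the extra data debug builds provide.
    """
    preferred = "debug" if prefer_debug else "opt"

    def key(config):
        return (
            config.build_type in AVOID_BUILD_TYPES,  # avoided build types last
            config.platform.startswith("macosx"),    # limited device pool
            config.build_type != preferred,          # opt vs. debug preference
        )

    # sorted() is stable, so ties keep their original order.
    return sorted(configs, key=key)


candidates = [
    Config("macosx64", "opt"),
    Config("linux64", "pgo"),
    Config("linux64", "debug"),
    Config("windows10-64", "opt"),
]
best = rank_configs(candidates)[0]
```

With the default settings the first entry is the windows opt config; passing <code>prefer_debug=True</code> moves the linux debug config to the front.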
= choosing a starting point = | = choosing a starting point = | ||
Line 64: | Line 66: | ||
= what to do with the data = | = what to do with the data = | ||
Once you have retriggered/backfilled a job, wait for it to finish. opt tests usually finish in under 30 minutes once they start running; debug tests can take up to 1 hour. | |||
When your initial tests finish, you might see a view like this: | |||
[[File:TH_repeat.jpg|500px]] | |||
Here you can see the 2-4 oranges per push. Check each failure to make sure the same test is failing. In the above case that is true and we need to go further back in history. | |||
After repeating the process a few times, the root cause will become visible: | |||
[[File:TH_rootcause.jpg|500px]] | |||
You can see that the failing test switched from bc1 to bc2, so the filter is now on bc instead of bc1. There is a clear pattern of failures on every push after the offending patch landed, and almost no failures before it.
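The pattern described above (failures on every push after the offending patch, almost none before) can also be checked mechanically once the per-push results are written down. The following is a minimal sketch under stated assumptions: the input format, the function name <code>find_suspect_push</code>, and the <code>min_rate</code> threshold are hypothetical, and the counts would have to be collected by hand or by some other means.

```python
def find_suspect_push(pushes, min_rate=0.2):
    """Return the oldest push in the unbroken run of failing pushes.

    pushes: list of (revision, failures, runs) tuples, oldest first.
    min_rate: hypothetical cut-off separating "clear pattern of failures"
              from "almost no failures"; tune it to the bug at hand.
    Returns None if the newest push with data is below the cut-off.
    """
    suspect = None
    # Walk from the newest push back in history.
    for revision, failures, runs in reversed(pushes):
        if runs == 0:
            continue  # no data for this push; it needs a retrigger/backfill
        if failures / runs >= min_rate:
            suspect = revision  # still failing, keep going further back
        else:
            break  # failures dropped off; the previous suspect stands
    return suspect


# Made-up data: (revision, failures, runs), oldest first.
history = [
    ("rev_a", 0, 10),
    ("rev_b", 1, 10),
    ("rev_c", 4, 10),
    ("rev_d", 3, 10),
    ("rev_e", 5, 10),
]
suspect = find_suspect_push(history)
```

For this made-up history the suspect is <code>rev_c</code>. If the suspect is the oldest push in the list, the data does not yet reach far enough back and more backfilling is needed.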
= exceptions and odd things = | = exceptions and odd things = | ||
some common exceptions to watch out for: | some common exceptions to watch out for: | ||
Line 74: | Line 87: | ||
* root cause looks like a merge, repeat on the other integration branch | * root cause looks like a merge, repeat on the other integration branch | ||
* occasionally, failures occur on mozilla-central landings, or as a result of code merging | * occasionally, failures occur on mozilla-central landings, or as a result of code merging | ||
* sometimes it is obvious from check-in messages (or TV failures) that the failing test case was modified on a certain push: If the test was modified around the time it started failing, that's suspicious and can be used as a short-cut to find the regressing changeset. |