Buildbot/Talos/Sheriffing: Difference between revisions

various clean ups and updates
m (→‎Expired Alerts: - removed)
(various clean ups and updates)
Line 1: Line 1:
= Overview =
= Overview =
The code sheriff team does a great job of finding regressions in unittests and getting fixes for them or backing stuff out. This keeps our trees green and usable while thousands of checkins a month take place!
The code sheriff team does a great job of finding regressions in unittests and getting fixes for them or backing stuff out. This keeps our trees green and usable while thousands of checkins a month take place!


For talos, we run about 85 jobs per push (out of ~1500+ total test jobs) to measure the performance of desktop builds.  These jobs are green and the code sheriffs have little to do.
A Performance Sheriff's job is similar: to make sure that performance regressions in Firefox are detected and dealt with. They look at the data produced by performance test jobs (mainly Talos), find regressions, determine their root cause(s) and get bugs on file to track all issues and make interested parties aware of what is going on.
 
Enter the role of a Performance Sheriff. This role looks at the data produced by these test jobs and finds regressions, root causes and gets bugs on file to track all issues and make interested parties aware of what is going on.


= What is an alert =
= What is an alert =
As of January 2016, alerts are generated in [[https://treeherder.mozilla.org/perf.html#/alerts?status=0&framework=1 PerfHerder]].  These are generated by programmatically verifying there is a sustained regression over time (original data point + [https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Noise_FAQ#Why_do_we_need_12_future_data_points 12 future data points]]).
As of January 2016, alerts are generated in [https://treeherder.mozilla.org/perf.html#/alerts?status=0&framework=1 Perfherder].  These are generated by programmatically verifying there is a sustained regression over time ([https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Noise_FAQ#Why_do_we_need_12_future_data_points original data point + 12 future data points]).


There is an alert summary outlining the alerts which match the same set of revisions.  For the summary there are a few pieces of information:
There is an alert summary outlining the alerts which match the same set of revisions.  For the summary there are a few pieces of information:
Line 25: Line 23:
* old score (median score of the previous 12 commits)
* old score (median score of the previous 12 commits)
* new score (median score of the future 12 commits)
* new score (median score of the future 12 commits)
* [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Alert_FAQ#Why_does_Alert_Manager_print_-xx.25 % change / values]]
* [https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Alert_FAQ#Why_does_Alert_Manager_print_-xx.25 % change / values]
* bar chart to show severity, green = improvement, red = regression
* bar chart to show severity, green = improvement, red = regression
* Confidence value (from the t-test code)
* Confidence value (from the t-test code)
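
The summary fields above can be sketched in a few lines of Python. This is only an illustration, not the Perfherder implementation: the function name, the input format, the choice to count the candidate push together with its 12 future data points, and the use of Welch's t statistic as the confidence value are all assumptions.

```python
import math
import statistics

WINDOW = 12  # data points on each side of the candidate push, per the list above


def summarize_alert(values, index, window=WINDOW):
    """Summarize a possible change at values[index].

    values: per-push results in push order (hypothetical input format).
    Returns None when there is not yet enough data on either side.
    """
    before = values[max(0, index - window):index]
    # Assumption: the candidate push plus its `window` future data points.
    after = values[index:index + window + 1]
    if len(before) < window or len(after) < window + 1:
        return None

    old = statistics.median(before)
    new = statistics.median(after)
    pct_change = 100.0 * (new - old) / old if old else float("nan")

    # Welch's t statistic, used here as a stand-in for the confidence value.
    stderr = math.sqrt(
        statistics.variance(before) / len(before)
        + statistics.variance(after) / len(after)
    )
    diff = abs(statistics.mean(after) - statistics.mean(before))
    t_value = diff / stderr if stderr else float("inf")

    return {"old": old, "new": new, "pct_change": pct_change, "t_value": t_value}
```

Returning None while data is still missing mirrors the idea that an alert is only raised once the future data points exist.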
Line 34: Line 32:
This is a manual process that needs to be done for every alert.  We need to:
This is a manual process that needs to be done for every alert.  We need to:
* Look at the graph and determine the original branch, date, revision where the alert occurred
* Look at the graph and determine the original branch, date, revision where the alert occurred
* Look at TreeHerder and determine if we have all the data.
* Look at Treeherder and determine if we have all the data.
* Retrigger jobs if needed (more [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Noise_FAQ#What_is_Noise noise]], more retriggers)
* Retrigger jobs if needed (more [https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Noise_FAQ#What_is_Noise noise], more retriggers)
* Once you have more data, look at the data in [https://treeherder.mozilla.org/perf.html#/comparechooser compare view] to see if other tests/platforms have changed
* Once you have more data, look at the data in [https://treeherder.mozilla.org/perf.html#/comparechooser compare view] to see if other tests/platforms have changed
* Add all related alerts you see to the summary with the reassign button
* Add all related alerts you see to the summary with the reassign button


== Determining the root cause from Perfherder ==
== Determining the root cause from Perfherder ==
When viewing a single alert and clicking on the graph link, Perfherder automatically shows multiple branches for the given test/platform.  This helps you determine the root branch.  It is best to [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/GraphServer_FAQ#Zooming zoom]] in and out to verify where the regression is.
When viewing a single alert and clicking on the graph link, Perfherder automatically shows multiple branches for the given test/platform.  This helps you determine the root branch.  It is best to [https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/GraphServer_FAQ#Zooming zoom] in and out to verify where the regression is.


While this isn't always clear, most of the time it is easy to see another alert on a different branch and mark the current one as a downstream if needed.
While this isn't always clear, most of the time it is easy to see another alert on a different branch and mark the current one as a downstream if needed.
Line 72: Line 70:


== Determining the scope of the regression from Perfherder ==
== Determining the scope of the regression from Perfherder ==
Once you have the spot, you can validate the other platforms by [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/GraphServer_FAQ#Adding_additional_Data_Points adding additional data sets]] to the graph.  It is best here to zoom out a bit as the regression might be a few revisions off on different platforms due to [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_coalescing coalescing]].
Once you have the spot, you can validate the other platforms by [https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/GraphServer_FAQ#Adding_additional_Data_Points adding additional data sets] to the graph.  It is best here to zoom out a bit as the regression might be a few revisions off on different platforms due to [https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_coalescing coalescing].


== Cases to watch out for ==
== Cases to watch out for ==
There are many reasons for an alert and different scenarios to be aware of:
There are many reasons for an alert and different scenarios to be aware of:
* [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_a_backout backout]] (usually within 1 week causing a similar regression/improvement)
* [https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_a_backout backout] (usually within 1 week causing a similar regression/improvement)
* [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_PGO pgo/nonpgo]] (some errors are pgo only and might be a side effect of pgo).  We only ship PGO, so these are the most important.
* [https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_PGO pgo/nonpgo] (some errors are pgo only and might be a side effect of pgo).  We only ship PGO, so these are the most important.
* test/infrastructure change - once in a while we change big things about our tests or infrastructure and it affects our tests (we need bugs to document these)
* test/infrastructure change - once in a while we change big things about our tests or infrastructure and it affects our tests (we need bugs to document these)
* [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_a_merge Merged]] - sometimes the root cause looks to be a merge; this is normally a side effect of [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_coalescing Coalescing]].
* [https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_a_merge Merged] - sometimes the root cause looks to be a merge; this is normally a side effect of [https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_coalescing Coalescing].
* [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_coalescing Coalesced]] - this is when we don't run every job on every platform on every push and sometimes we have a set of changes
* [https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_coalescing Coalesced] - this is when we don't run every job on every platform on every push and sometimes we have a set of changes
* Regular regression - the normal case where we get an alert and we see it merge from branch to branch
* Regular regression - the normal case where we get an alert and we see it merge from branch to branch
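
For the coalesced case, the set of suspect pushes can be narrowed down mechanically. The sketch below is a hypothetical helper, not part of Perfherder or Treeherder: the revision identifiers, the results mapping and the fixed threshold are assumptions, and it treats a higher score as worse, which is not true of every test.

```python
def candidate_pushes(pushes, results, threshold):
    """Return the pushes that could have introduced a regression.

    pushes: revision ids in push order (hypothetical identifiers).
    results: dict mapping a revision to its score; a revision is missing
             when the job was coalesced and never ran on that push.
    threshold: score above which a push counts as regressed.
    """
    last_good = -1
    for i, rev in enumerate(pushes):
        score = results.get(rev)
        if score is None:
            continue  # coalesced: no data for this push
        if score > threshold:
            # Everything after the last good push, up to and including
            # this first bad push, is a suspect.
            return pushes[last_good + 1:i + 1]
        last_good = i
    return []
```

When more than one push comes back, retriggering or backfilling the pushes without data is what shrinks the range to a single revision.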


Line 86: Line 84:
Every release of Firefox we create a tracking bug (ex. {{bug|1122690}} - Firefox 38) which we use to associate all regressions found during that release.  The reason for this is twofold:
Every release of Firefox we create a tracking bug (ex. {{bug|1122690}} - Firefox 38) which we use to associate all regressions found during that release.  The reason for this is twofold:
* We can go to one spot and see what regressions we have for reference on new bugs or to follow up.
* We can go to one spot and see what regressions we have for reference on new bugs or to follow up.
* When we [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_an_uplift uplift]] it is important to see which alerts we are expecting
* When we [https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_an_uplift uplift] it is important to see which alerts we are expecting


These bugs just contain a set of links to other bugs, no conversation is needed.
These bugs just contain a set of links to other bugs, no conversation is needed.
Line 95: Line 93:
Here are some things to check/verify when filing a bug:
Here are some things to check/verify when filing a bug:
* Product/Component - this should be the same as the bug which is the root cause, if >1 bug, file in [https://bugzilla.mozilla.org/enter_bug.cgi?product=Testing&component=Talos Talos]
* Product/Component - this should be the same as the bug which is the root cause, if >1 bug, file in [https://bugzilla.mozilla.org/enter_bug.cgi?product=Testing&component=Talos Talos]
* Dependent/Block bugs - For a new bug, add the [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing#Tracking_bugs tracking bug]] (for the current version) and root cause bug(s) as blocking this bug
* Dependent/Block bugs - For a new bug, add the [https://wiki.mozilla.org/Buildbot/Talos/Sheriffing#Tracking_bugs tracking bug] (for the current version) and root cause bug(s) as blocking this bug
* CC list - cc :jmaher, :avih, :wlach, patch author(s) and reviewer(s), and owner of the tests as documented on the [[https://wiki.mozilla.org/Buildbot/Talos/Tests talos tests wiki]]
* CC list - cc :jmaher, :avih, :wlach, patch author(s) and reviewer(s), and owner of the tests as documented on the [https://wiki.mozilla.org/Buildbot/Talos/Tests talos tests wiki]
* Summary of bug should have a check to make sure the revision is accurate
* Summary of bug should have a check to make sure the revision is accurate
* The description is auto suggested as well, please verify the revision here
* The description is auto suggested as well, please verify the revision here


As a note, the generated description refers the patch author to [[https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling guidelines and expectations]] for them about how and when to respond.
As a note, the generated description refers the patch author to [https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling guidelines and expectations] for them about how and when to respond.


Once a bug is filed it is a good idea to do a few things in another comment:
Once a bug is filed it is a good idea to do a few things in another comment:
1) provide a link to compare view to show you have done retriggers and believe this is valid
* provide a link to compare view to show you have done retriggers and believe this is valid
2) needinfo the patch author (if many patch authors, needinfo :jmaher)
* needinfo the patch author (if many patch authors, needinfo :jmaher)
3) mention how confident you are in the regression (more confidence if you have a lot of retriggers and there is only one patch, less confident if you are waiting on backfilling data, retriggers, try runs, etc.)
* mention how confident you are in the regression (more confidence if you have a lot of retriggers and there is only one patch, less confident if you are waiting on backfilling data, retriggers, try runs, etc.)


= Other common tasks =
= Other common tasks =
Line 111: Line 109:


== Merge Day - Uplifts ==
== Merge Day - Uplifts ==
Every 6 weeks we do an [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_an_uplift uplift]].  These typically result in [[https://elvis314.wordpress.com/2014/12/12/tracking-firefox-performance-as-we-uplift-the-volume-of-alerts-we-get/ dozens of alerts]] for each uplift.
Every 6 weeks we do an [https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_an_uplift uplift].  These typically result in [https://elvis314.wordpress.com/2014/12/12/tracking-firefox-performance-as-we-uplift-the-volume-of-alerts-we-get/ dozens of alerts] for each uplift.


The job here is to triage alerts as we usually do, except in this case we have a much larger volume of alerts.  One thing here is we have alerts from the upstream branch.  Take for example when we uplift Mozilla-Central to Mozilla-Aurora.  We have a tracking bug for each release, and there is a list of bugs (keep in mind some are resolved as wontfix).  In a perfect world (half the time) we can match up the alerts that are showing up on Mozilla-Aurora with the bugs that have already been filed.  The job here is to verify and add bugs to keep track of what is there.
The job here is to triage alerts as we usually do, except in this case we have a much larger volume of alerts.  One thing here is we have alerts from the upstream branch.  Take for example when we uplift Mozilla-Central to Mozilla-Aurora.  We have a tracking bug for each release, and there is a list of bugs (keep in mind some are resolved as wontfix).  In a perfect world (half the time) we can match up the alerts that are showing up on Mozilla-Aurora with the bugs that have already been filed.  The job here is to verify and add bugs to keep track of what is there.
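
Matching uplift alerts against bugs already on file is essentially a lookup. A minimal sketch, assuming alerts can be keyed by a (test, platform) pair and that the bugs linked from the release tracking bug have been collected into a dict by hand; both shapes are made up for illustration and are not a Perfherder or Bugzilla API.

```python
def match_uplift_alerts(uplift_alerts, known_regressions):
    """Split alerts seen after an uplift into expected and new ones.

    uplift_alerts: iterable of (test, platform) pairs seen on the
                   downstream branch (hypothetical shape).
    known_regressions: dict mapping (test, platform) to a bug number
                       taken from the release tracking bug.
    """
    expected = {}
    unexpected = []
    for key in uplift_alerts:
        bug = known_regressions.get(key)
        if bug is not None:
            expected[key] = bug  # already tracked upstream
        else:
            unexpected.append(key)  # needs the usual triage
    return expected, unexpected
```

Alerts in the expected group only need to be verified and linked to their existing bug; the unexpected group goes through the normal investigation described earlier.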
Line 120: Line 118:
In many cases we resolve a bug and need to wait a day or two for enough data to show up so we can verify the bug is fixed.  It is common to have associated a single bug with many alerts.
In many cases we resolve a bug and need to wait a day or two for enough data to show up so we can verify the bug is fixed.  It is common to have associated a single bug with many alerts.


It is good practice to update the state of the alert summary in Perfherder Alerts so it accurately reflects the state of the related bug.
It is good practice to update the state of the alert summary in Perfherder alerts so it accurately reflects the state of the related bug.


= Additional Resources =
= Additional Resources =
* [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Alert_FAQ Alert FAQ]]
* [https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Alert_FAQ Alert FAQ]
* [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Noise_FAQ Noise FAQ]]
* [https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Noise_FAQ Noise FAQ]
* [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/GraphServer_FAQ GraphServer FAQ]]
* [https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/GraphServer_FAQ GraphServer FAQ]
* [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ Tree FAQ]]
* [https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ Tree FAQ]