Auto-tools/Projects/Signal From Noise/StatusNovember2011: Difference between revisions

Revision as of 22:47, 28 December 2012

140 bytes added, 28 December 2012 (minor edit)

→ State of Statistics, November 2011

Jhammel (947 edits)

@@ Line 11: / Line 11: @@
 * a script versioned with graphserver (but unrelated to and unused by graphs.mozilla.org) mails dev-tree-management when a regression or improvement is detected: http://hg.mozilla.org/graphs/file/tip/server/analysis/analyze_talos.py . The methodology used is available from inspection of the script or https://wiki.mozilla.org/images/c/c0/Larres-thesis.pdf#page=73
-It is a common misconception that [https://tbpl.mozilla.org/ TBPL] talos tests turn orange for test failure.  In fact, as long as no infrastructure problems are encountered, Talos runs always report green regardless of the performance of the system under test.  Graphserver and the talos regression emails are the only source of information as to whether a regression or improvement has resulted.  However, the existing methods of calculating statistics (professing to detect regressions as low as 1%)  lead to a high degree of noise.  The consequence is confusion amongst developers with regard to whether a reported regression is legitimate or not; the consequence of this is that the regression emails are largely ignored since there are so many of them and most of them are merely noise.
+It is a common misconception that [https://tbpl.mozilla.org/ TBPL] talos tests turn orange for test failure.  In fact, as long as no infrastructure problems are encountered, Talos runs always report green regardless of the performance measurements of the system under test.  Graphserver and the talos regression emails are the only source of information as to whether a regression or improvement has resulted.  However, the existing methods of calculating statistics (professing to detect regressions as low as 1%)  lead to a high degree of noise and therefore a high ratio of false positives (and negatives) to actual regressions detected.  The consequence is confusion amongst developers with regard to whether a reported regression is legitimate or not; the consequence of this is that the regression emails are largely ignored since there are so many of them and most of them are merely noise.
 It is another common misconception that all of the "Talos regression/improvement" emails actually come from the Talos tests.  Some (e.g. numbers of constructors) do not.
@@ Line 27: / Line 27: @@
 === Types of Talos Tests ===
-Talos has, in essence, three different kinds of tests although the lines are blurred:
+Talos has, in essence, three different kinds of tests, although the lines between them are blurred:
 * startup tests : https://wiki.mozilla.org/Buildbot/Talos#Startup_Tests
 * pageloader tests :  https://wiki.mozilla.org/Buildbot/Talos#Page_Load_Tests
@@ Line 56: / Line 56: @@
    To determine whether a good point is "good" or "bad", we take 20-30 points of historical data, and 5 points of future data.  We compare these using a t-test.  See https://wiki.mozilla.org/images/c/c0/Larres-thesis.pdf#page=74 . Regressions are mailed to the dev-tree-management mailing list.  Regressions are calculated by the analyze_talos.py script which uses a configuration file based on http://hg.mozilla.org/graphs/file/tip/server/analysis/analysis.cfg.template
-(From https://wiki.mozilla.org/Buildbot/Talos#Regressions .)
+''(from https://wiki.mozilla.org/Buildbot/Talos#Regressions)''
-In practice a high amount of noise and false positives are observed with respect to regression or improvement detections. https://wiki.mozilla.org/images/c/c0/Larres-thesis.pdf#page=74 points out the general methodology used by this script and statistical shortcomings and potentially faulty assumptions going into it.  One notable violation of assumptions is that the t-test used assumes a normal distribution which we know for a fact not to be true (as documented elsewhere in the thesis).
+In practice a high amount of noise and false positives (and negatives) are observed with respect to regression or improvement detections. https://wiki.mozilla.org/images/c/c0/Larres-thesis.pdf#page=74 points out the general methodology used by this script and statistical shortcomings and potentially faulty assumptions going into it.  One notable violation of assumptions is that the t-test used assumes a normal distribution which we know for a fact not to be true (as documented elsewhere in the thesis).
-One tool available to developers is compare-talos: a semi-official web app that exists to compare talos numbers from different runs: http://perf.snarkfest.net/compare-talos/
+One tool available to developers is compare-talos: a semi-official web app that exists to compare talos numbers from different runs: http://perf.snarkfest.net/compare-talos/ .
 Larres (see https://wiki.mozilla.org/images/c/c0/Larres-thesis.pdf ) and Lewchuk (see https://bugzilla.mozilla.org/show_bug.cgi?id=710484 , https://groups.google.com/forum/#!msg/mozilla.dev.platform/kXUFafYInWs/XRCsrapUUGAJ ) investigated Talos statistics and suggested several potential areas of improvement.
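
The windowed comparison described in the "Line 56" hunk (20-30 points of historical data against 5 points of future data, compared with a t-test) can be illustrated with a minimal sketch. This is not the code in analyze_talos.py: the function names, the use of Welch's unequal-variance form of the t statistic, and the threshold of 7.0 are all assumptions made for illustration only. Like any t-test, it also inherits the normality assumption that the text above notes does not hold for Talos data.

```python
from math import copysign, sqrt
from statistics import mean, variance


def welch_t(history, future):
    """Welch's t statistic for two samples with possibly unequal variances.

    Positive values mean the future window has a higher mean than history.
    """
    n1, n2 = len(history), len(future)
    if n1 < 2 or n2 < 2:
        raise ValueError("each window needs at least two points")
    diff = mean(future) - mean(history)
    # Standard error of the difference between the two sample means.
    se = sqrt(variance(history) / n1 + variance(future) / n2)
    if se == 0:
        # Both windows are constant: any difference is infinitely significant.
        return 0.0 if diff == 0 else copysign(float("inf"), diff)
    return diff / se


def flag_change(series, index, back=30, fore=5, threshold=7.0):
    """Compare the `back` points before `index` with the `fore` points from it.

    `threshold` is a hypothetical cutoff, not a value taken from analysis.cfg.
    Returns (flagged, t).
    """
    history = series[max(0, index - back):index]
    future = series[index:index + fore]
    t = welch_t(history, future)
    return abs(t) > threshold, t


if __name__ == "__main__":
    # Synthetic data: a stable series followed by a jump of roughly 20%.
    series = [100.0, 101.0] * 15 + [120.0, 121.0, 120.0, 121.0, 120.0]
    flagged, t = flag_change(series, index=30)
    print("flagged:", flagged, "t:", round(t, 2))
```

With only 5 points in the future window, the variance estimate for that window is itself noisy, which is one way a test of this shape can produce both false positives and false negatives on small shifts.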