Auto-tools/Projects/Signal From Noise/StatusNovember2012

== State of Statistics: November 2012 ==

The Mozilla Metrics team (https://wiki.mozilla.org/Metrics) worked as part of Signal from Noise to audit our performance-testing statistical methodology and to help develop better models. Metrics looked at the following issues:
* Determining the sources of variation in the data: after running experiments on the data, Metrics identified two main sources of variation. First, aggregating all the test pages into a single number was hiding the true signal in the noise, since pageload times across the 100 pages were very different. Second, the way Talos data was collected before Q1 2012 introduced large variation within the replicates of each test page.
* Interleaved vs. non-interleaved tests: until Q1 2012, pageload tests (see https://wiki.mozilla.org/Buildbot/Talos#Page_Load_Tests) were run such that the entire pageset was cycled through N times, where N is the number of replicates per page. We were concerned that this interleaving was a source of noise, and the issue was investigated; see http://elvis314.wordpress.com/2012/03/12/reducing-the-noise-in-talos/ and http://people.mozilla.org/~ctalbert/TalosPlots/rowmajor_change/index.html. Switching to non-interleaved (row-major) execution decreased the "within" variation for individual test pages, which gives us more power to detect regressions between pushes (see the ordering sketch after this list).
* Non-normal distributions (https://wiki.mozilla.org/Metrics/Talos_Investigation#Non_Normal_distributions): several non-normal distributions were found among the Talos data sets, including multi-modal ones. One cause of multimodality was the aggregation of pages with very different pageload times, a consequence of the differing characteristics of the pages we test in tp5. Hence, it is crucial to move to page-centric testing rather than aggregated testing.
* Determining the number of observations per test page: we must strike a good balance between the machine time a Talos run consumes and having enough replicates for statistical validity of the test results. The optimal number of replicates per test page for statistical testing is about 30 (Jay L. Devore, ''Probability and Statistics for Engineering and the Sciences'', 8th ed., p. 226). Due to time constraints, we decided to collect 25 replicates; this is not optimal, but it is still a big improvement over the 10 replicates collected previously.
* Pageload tests ignore the first 5 data points (''Ignore_first:5''): Metrics determined that ignoring the first 5 data points of each pageload test increased statistical validity, because most of the variation was coming from the first few data points; see https://bugzilla.mozilla.org/show_bug.cgi?id=731391 and https://wiki.mozilla.org/images/d/dd/Talos_Statistical_Analysis_Writeup.pdf (a short sketch of this filtering appears after this list).
* Audit of the tp5 pageset: the tp5 pageset (see https://wiki.mozilla.org/Buildbot/Talos#tp5) was audited because some of its pages showed significantly large variation within replicates: https://wiki.mozilla.org/images/3/38/Tp5_Good_Pages.pdf (Metrics recommended shrinking the tp5 pageset and increasing the number of replicates).
* Quality of data: some pages show systematic patterns that may indicate a problem with the data being collected (possibly due to hardware, software, the validity of the test pages, etc.). This should be investigated to ensure that the data we collect correctly represents what we are trying to measure.
* New method for regression detection (https://wiki.mozilla.org/images/d/dd/Talos_Statistical_Analysis_Writeup.pdf): working with Datazilla results for tp5 test pages, Metrics developed a regression detection algorithm. To compare the per-page mean of a new push against the per-page mean of the current push, hypothesis tests (http://en.wikipedia.org/wiki/Statistical_hypothesis_testing) are conducted; Welch's t-test is used to determine whether a page has regressed in the new push. Moving to page-centric testing introduced a multiple-hypothesis-testing problem, and to correct for the resulting inflation of false positives the False Discovery Rate (FDR) procedure is used: http://www.stat.cmu.edu/~genovese/talks/hannover1-04.pdf. Because of the natural variation between consecutive pushes, exponential smoothing is applied before the FDR procedure. Code for this is available at https://github.com/mozilla/datazilla-metrics (a simplified sketch of the per-page test plus FDR correction appears after this list).
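
The execution-order change is easiest to see in code. The following is a minimal Python sketch of the two orderings; ''load_page'' is a hypothetical stand-in for the harness's page-load measurement, not a real Talos function.

<source lang="python">
def load_page(page):
    """Placeholder: load `page` in the browser and return its load time (ms)."""
    raise NotImplementedError

def interleaved(pages, n_replicates):
    """Pre-Q1-2012 ordering: cycle through the whole pageset N times.
    Replicates of the same page are far apart in time, so drift in machine
    state inflates the within-page variance."""
    times = {page: [] for page in pages}
    for _ in range(n_replicates):
        for page in pages:
            times[page].append(load_page(page))
    return times

def non_interleaved(pages, n_replicates):
    """Row-major ordering: collect all N replicates of a page back to back.
    Replicates see nearly identical machine state, which shrinks the
    within-page variance and makes push-over-push regressions easier to see."""
    times = {page: [] for page in pages}
    for page in pages:
        for _ in range(n_replicates):
            times[page].append(load_page(page))
    return times
</source>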
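The ''Ignore_first:5'' filtering is likewise simple to illustrate. This sketch assumes 25 replicates per page and uses made-up numbers; it shows only the idea of discarding the noisy leading replicates before summarizing, not the actual Talos configuration mechanism.

<source lang="python">
import statistics

IGNORE_FIRST = 5  # replicates to discard per page (Ignore_first:5)

def summarize_page(replicates):
    """Return (mean, stdev) of the replicates that survive the filter."""
    kept = replicates[IGNORE_FIRST:]
    return statistics.mean(kept), statistics.stdev(kept)

# Example: 25 replicates where the first few are noisy warm-up runs.
raw = [412.0, 398.5, 371.2, 355.9, 349.1] + [331.0 + i % 3 for i in range(20)]
mean, stdev = summarize_page(raw)
print("kept %d replicates: mean=%.1f stdev=%.1f" % (len(raw) - IGNORE_FIRST, mean, stdev))
</source>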
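Finally, here is a simplified sketch of the regression-detection step: one Welch's t-test per page, followed by a Benjamini-Hochberg FDR correction across pages. The exponential smoothing of the baseline is omitted, and the function below is an illustration of the idea only; the production implementation lives in https://github.com/mozilla/datazilla-metrics.

<source lang="python">
import numpy as np
from scipy.stats import ttest_ind

def regressed_pages(baseline, candidate, fdr=0.05):
    """baseline, candidate: dicts mapping page name -> list of replicate times.
    Return the set of pages flagged as regressions at the given FDR level."""
    pages = sorted(baseline)
    pvals = []
    for page in pages:
        # Welch's t-test (unequal variances) comparing the two pushes.
        t, p = ttest_ind(candidate[page], baseline[page], equal_var=False)
        # One-sided p-value for "candidate is slower than baseline".
        pvals.append(p / 2 if t > 0 else 1 - p / 2)

    # Benjamini-Hochberg: find the largest rank k with p_(k) <= (k/m) * fdr,
    # then reject (flag) the k hypotheses with the smallest p-values.
    order = np.argsort(pvals)
    m = len(pvals)
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / float(m) * fdr:
            cutoff = rank
    return {pages[idx] for rank, idx in enumerate(order, start=1) if rank <= cutoff}
</source>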


In summary, the following changes were made to Talos:
* pageload tests ignore the first 5 data points (''Ignore_first:5''); see http://k0s.org:8080/?show=active
* we run pageload tests non-interleaved (row-major)
* we use more replicates per page (25, up from 10)