* Datazilla should be scalable enough to accumulate data per push and generate a "regression/improvement" analysis for that push in real time.
* The system should also provide a UI atop its own REST interfaces so that an interested developer can start on TBPL and drill into the results of a push. The developer should be able to drill all the way down to the raw replicate values for a page (i.e. each page is loaded some number of times, and you should be able to drill down to that level if you want to).

== Metrics ==

The Mozilla Metrics team ( https://wiki.mozilla.org/Metrics ) worked as part of Signal from Noise to audit our performance statistical methodology and help develop better models. Metrics looked at the following issues:

* Determine sources of variation in the data: after running experiments, Metrics identified two main sources of variation. First, aggregating all the test pages into a single number was hiding true signal, because the pageload times for the 100 pages were very different. Second, the way Talos data was collected before Q1 2012 introduced large variation within the replicates of each test page.

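The first point can be sketched with a toy example. The numbers below are hypothetical, not real Talos data, but they show how averaging pages with very different pageload times hides a large per-page regression:

```python
# Hypothetical numbers, not real Talos data: averaging pages with very
# different pageload times hides a per-page regression.
from statistics import mean

# Replicate pageload times (ms) for two pages, before and after a push.
before = {"page_a": [100, 102, 98, 101, 99],
          "page_b": [2000, 2050, 1980, 2020, 1990]}
after = {"page_a": [150, 152, 148, 151, 149],    # page_a regressed ~50%
         "page_b": [2000, 2048, 1985, 2018, 1992]}

# Aggregated view: one number per push, dominated by the slow page.
agg_before = mean(v for reps in before.values() for v in reps)
agg_after = mean(v for reps in after.values() for v in reps)
agg_change = (agg_after - agg_before) / agg_before   # only a few percent

# Page-centric view: each page is compared against itself.
page_change = {p: (mean(after[p]) - mean(before[p])) / mean(before[p])
               for p in before}                      # page_a stands out
```

The aggregate moves by only about 2% (easily lost in noise), while the page-centric comparison shows page_a regressing by 50%.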
* Interleaved/non-interleaved tests: as of Q1 2012, pageload tests (see https://wiki.mozilla.org/Buildbot/Talos#Page_Load_Tests ) were run such that the entire pageset was cycled through 'N' times, where 'N' is the number of replicates per page. We were concerned that this could be a source of our noise, so the issue was investigated; see http://elvis314.wordpress.com/2012/03/12/reducing-the-noise-in-talos/ and http://people.mozilla.org/~ctalbert/TalosPlots/rowmajor_change/index.html . With the revised run order, the "within" variation for individual test pages decreased, giving us more power to detect regressions between pushes.

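The two run orders can be sketched as follows (page names and counts are illustrative, and this is an interpretation of the row-major change described in the posts above, not code from Talos itself):

```python
# Illustrative sketch of the two run orders; page names are made up.
pages = ["p1", "p2", "p3"]
n_replicates = 3

# Old order: cycle through the entire pageset N times,
# so each page is loaded once per cycle.
cycled = [p for _ in range(n_replicates) for p in pages]

# Revised order: load each page N times in a row before moving on,
# so a page's replicates run under more uniform machine state.
consecutive = [p for p in pages for _ in range(n_replicates)]
```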
* Non-normal distributions ( https://wiki.mozilla.org/Metrics/Talos_Investigation#Non_Normal_distributions ): several non-normal distributions were found among the Talos data sets, including multi-modal distributions. One cause of multimodality was the aggregation of pages with very different pageload times, reflecting the different characteristics of the pages we test in tp5. Hence, it is crucial to move to page-centric testing rather than aggregated testing.

* Determining the number of observations per test page: it is crucial to strike a good balance between machine time for a Talos test and having enough replicates for statistical viability of the test results. The optimal number of replicates per test page for statistical testing is about 30 (J. Devore, Probability & Statistics for Engineering & the Sciences, 8th ed., p. 226). However, due to time constraints, we decided to collect 25 replicates; still a big improvement over the previous 10 replicates, but not optimal.

* Ignore_first:5 : Metrics determined that ignoring the first 5 data points of each pageload test increased statistical viability, because most of the variation was coming from the first few data points; see https://bugzilla.mozilla.org/show_bug.cgi?id=731391 and the full analysis at https://wiki.mozilla.org/images/d/dd/Talos_Statistical_Analysis_Writeup.pdf .

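A minimal sketch of the trimming step, assuming replicates arrive as a simple list of pageload times (the function name is hypothetical, not the datazilla-metrics API):

```python
from statistics import mean

IGNORE_FIRST = 5  # per the Metrics recommendation in bug 731391

def page_mean(replicates, ignore_first=IGNORE_FIRST):
    """Mean pageload time after discarding the warm-up replicates."""
    kept = replicates[ignore_first:]
    if not kept:
        raise ValueError("not enough replicates after trimming")
    return mean(kept)

# With 25 replicates collected, 20 contribute to the statistic.
reps = [400, 330, 310, 305, 300] + [250] * 20  # hypothetical warm-up spike
```

Here `page_mean(reps)` ignores the noisy leading values and returns 250, the steady-state pageload time.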
* Audit of the tp5 pageset: the tp5 pageset (see https://wiki.mozilla.org/Buildbot/Talos#tp5 ) was audited because some of the pages showed significantly large variation within replicates: https://wiki.mozilla.org/images/3/38/Tp5_Good_Pages.pdf . Metrics recommended decreasing the number of test pages in tp5 and increasing the number of replicates.

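An audit like this can be sketched by computing each page's within-replicate coefficient of variation and flagging outliers. The 10% cutoff below is purely illustrative, not the threshold Metrics used:

```python
from statistics import mean, stdev

def noisy_pages(results, cv_threshold=0.10):
    """Flag pages whose within-replicate coefficient of variation
    (stdev / mean) exceeds a threshold; the 10% default is an
    illustrative cutoff, not the one Metrics used."""
    return {page: stdev(reps) / mean(reps)
            for page, reps in results.items()
            if stdev(reps) / mean(reps) > cv_threshold}

# Hypothetical replicate data (ms) for two pages.
results = {"stable_page": [200, 202, 198, 201, 199],
           "noisy_page": [200, 260, 150, 240, 170]}
```

With this data, only `noisy_page` (CV around 23%) is flagged; `stable_page` (CV under 1%) passes.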
* Quality of data: some pages show systematic patterns that may indicate a problem with the data being collected (due to hardware, software, validity of the test pages, etc.). This should be investigated to ensure that the data we collect correctly represents what we are trying to measure.

* New method for regression detection ( https://wiki.mozilla.org/images/d/dd/Talos_Statistical_Analysis_Writeup.pdf ): working with Datazilla results for tp5 test pages, Metrics developed a regression-detection algorithm. To compare the mean of each page in a new push against the mean of the same page in the current push, hypothesis tests ( http://en.wikipedia.org/wiki/Statistical_hypothesis_testing ) are conducted: Welch's t-test is used to determine whether a page has regressed in the new push. Moving to page-centric testing created a multiple-testing problem, so to correct for the resulting inflation of false positives the False Discovery Rate (FDR) procedure is used: http://www.stat.cmu.edu/~genovese/talks/hannover1-04.pdf . Due to the natural variation between consecutive pushes, exponential smoothing is applied before the FDR procedure. Code is available at https://github.com/mozilla/datazilla-metrics .

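The per-page test plus FDR correction can be sketched in stdlib-only Python. Two simplifications versus the writeup and the datazilla-metrics code: the p-value uses a normal approximation to the t distribution (reasonable for ~20 usable replicates per page), and the exponential smoothing of per-push means is omitted. The data and function names are illustrative:

```python
# Sketch: Welch's t-test per page, then Benjamini-Hochberg FDR across pages.
from statistics import NormalDist, mean, variance

def welch_p_value(a, b):
    """Two-sided p-value for 'a and b have equal means', without assuming
    equal variances (Welch's t-test); normal approximation to the
    t distribution, a simplification versus the real implementation."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    t = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(t)))

def fdr_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: return the pages whose null
    hypothesis is rejected at false-discovery rate alpha."""
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ranked)
    cutoff = 0
    for i, (_, p) in enumerate(ranked, start=1):
        if p <= alpha * i / m:
            cutoff = i
    return {page for page, _ in ranked[:cutoff]}

# Hypothetical per-page replicate means (ms) for two pushes.
baseline = {"page_a": [100, 101, 99, 100, 102, 98],
            "page_b": [300, 305, 295, 299, 301, 300]}
push = {"page_a": [120, 121, 119, 120, 122, 118],  # clear regression
        "page_b": [301, 304, 296, 300, 302, 299]}  # noise only

p_values = {pg: welch_p_value(push[pg], baseline[pg]) for pg in baseline}
regressed = fdr_reject(p_values)
```

Only `page_a` survives the FDR cutoff; `page_b`'s small shift is indistinguishable from noise, which is exactly the false positive the correction is meant to suppress when hundreds of pages are tested per push.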