Auto-tools/Projects/Alerts
Datazilla Alerting System
The Datazilla Alerting System, also known as dzAlerts, is a daemon that periodically inspects the Talos, B2G, and Eideticker performance results and generates alerts when there is a statistically significant regression in performance.
Collaboration
Features and development are ongoing:
- Signal from Noise meeting (notes on etherpad)
- IRC: #ateam@irc.mozilla.org
- email: klahnakoski@mozilla.com
Objective
dzAlerts' objective is to provide high-quality, detailed alerts on performance regressions.
History
dzAlerts started as code inside Datazilla, hence the name. A few design changes were made during development:
- The alerting code was distinct from the Datazilla code in that it was designed as stand-alone daemons, with little integration with the main UI-centric code.
- Integrating the alerts data access into the existing Datazilla database access pattern required too much boilerplate code, which demanded additional maintenance as the project evolved.
- The Datazilla database schema was not designed for pulling long time series on individual tests in a performant way.
- The release schedules were different.
dzAlerts is now completely separate, and uses an ElasticSearch cluster to pull slices out of the data cube of test results. It maintains enough information about the alerts so it can provide links to the Datazilla UI.
Nomenclature
The nomenclature used by dzAlerts is a little different from that of the surrounding applications it deals with. This keeps the terms consistent with their common definitions, and provides names for the finer-grained features dzAlerts has visibility into but that are invisible to other systems.
- Test Suite - or "suite" for short, is a set of bundled tests. All the "tests" referred to by Datazilla and Talos are actually bundles of individual tests. Dromaeo_css, kraken, tp5o, and tscrollx are examples of test suites.
- Test - a single test responsible for testing a feature. Dromaeo_css has 6 tests called dojo.html, ext.html, jquery.html, mootools.html, prototype.html and yui.html. These can also be called "pages", or "page tests", depending on the suite.
- Replicate - Each test is executed (replicated) multiple times, and the result stored in an array. Each result is called a "replicate", and the array of results is usually referred to as "the replicates".
- Sub-Test - A test should be atomic, but unfortunately some tests bundle yet-deeper-but-unnamed tests into their replicates. These buried tests are called "sub-tests". To handle sub-tests, the replicates are further ETLed into normal tests for dzAlerts to digest (more below).
- Time Series - A statistic from a family of test results which can reasonably be compared to each other, and sorted by time (usually by push date).
General Design
There are two main parts to dzAlerts. The ETL step copies data from Datazilla to the ElasticSearch cluster, and the dzAlerts daemon is responsible for identifying performance regressions and filling a database with what it finds.
ETL
The Extract Transform Load (ETL) daemon is responsible for
- Extracting the JSON test suite results from Datazilla,
- Transforming the bundle of test results into individual records, and
- Loading them into the ElasticSearch cluster.
The transform step is the most interesting, but still simple: in the case of tp5o, each of the 50 page test results is given its own record with a copy of all the tp5o metadata (like platform, branch, test_run_time, etc). This is highly redundant, but allows ElasticSearch to index each test individually for fast retrieval.
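To make the transform concrete, here is a minimal Python sketch (not the actual dzAlerts ETL code) of flattening one suite blob into per-test records; the field names and blob layout are assumptions for illustration only:

```python
# Hypothetical sketch of the Transform step: flatten one tp5o suite blob
# into one record per page test, copying the suite metadata onto each.
def flatten_suite(suite_blob):
    metadata = {k: v for k, v in suite_blob.items() if k != "results"}
    records = []
    for page_name, replicates in suite_blob["results"].items():
        record = dict(metadata)            # copy platform, branch, test_run_time, ...
        record["test_name"] = page_name    # e.g. one of the ~50 tp5o pages
        record["replicates"] = replicates  # raw replicate values for this page
        records.append(record)
    return records

# Made-up example blob, just to show the shape of the output records
example = {
    "platform": "win32",
    "branch": "mozilla-inbound",
    "test_run_time": 1403265120,
    "results": {"amazon.com": [251.0, 249.5, 252.3], "bing.com": [98.2, 97.7, 99.1]},
}
for r in flatten_suite(example):
    print(r["test_name"], r["replicates"])
```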
Dromaeo* ETL
The two dromaeo* suites (dromaeo_css, dromaeo_dom) are a little more complicated, and best demonstrated with a couple of charts from Datazilla. The top chart shows the modify.html test mean and variance. One of the tests is highlighted, and you can see the individual replicates in the bottom chart. Looking at the replicates, we see they come in sets of five: each set of five replicates has approximately the same value, and each set seems distinct from the others. Once you see this pattern in the replicates, you can understand why the variance shown in the top chart is consistently high.
This same pattern happens in all tests of both dromaeo_css and dromaeo_dom.
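As an illustration of the sub-test ETL mentioned above, here is a minimal Python sketch, assuming the sets-of-five structure holds exactly (the real handling may differ):

```python
# Hypothetical sketch: split a dromaeo* replicate array into sub-tests,
# assuming each consecutive group of five replicates belongs to one
# unnamed sub-test, as the charts above suggest.
def split_subtests(replicates, group_size=5):
    if len(replicates) % group_size != 0:
        raise ValueError("replicate count is not a multiple of the group size")
    return [replicates[i:i + group_size]
            for i in range(0, len(replicates), group_size)]

replicates = [10.1, 10.2, 9.9, 10.0, 10.1,   # sub-test 0
              25.3, 25.1, 25.4, 25.2, 25.0]  # sub-test 1
for index, group in enumerate(split_subtests(replicates)):
    print("sub-test", index, "median:", sorted(group)[len(group) // 2])
```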
Pulling Data
Requesting a time series is done by simply filtering the test results by product, branch, platform, and test name. ElasticSearch can pull months of Talos test results for any one combination in under a second. Each test is made of many replicates, and those replicates are reduced to a single statistic (usually the median) and forwarded for analysis.
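For illustration only, a sketch of such a pull using the official `elasticsearch` Python client; the index name, field names, and query shape here are assumptions, not the real dzAlerts schema:

```python
# Hypothetical sketch of pulling one time series from ElasticSearch and
# reducing each test run's replicates to its median.
from statistics import median
from elasticsearch import Elasticsearch  # assumes the official client is installed

es = Elasticsearch("http://localhost:9200")           # placeholder address
query = {
    "size": 10000,
    "query": {"bool": {"filter": [                    # filter by the four dimensions
        {"term": {"product": "Firefox"}},
        {"term": {"branch": "mozilla-inbound"}},
        {"term": {"platform": "win32"}},
        {"term": {"test_name": "tp5o/amazon.com"}},
    ]}},
    "sort": [{"push_date": "asc"}],                   # order by push date
}
hits = es.search(index="talos_results", body=query)["hits"]["hits"]  # index name is assumed
series = [(h["_source"]["push_date"], median(h["_source"]["replicates"])) for h in hits]
```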
Analysis
Once the test results are retrieved, and sorted by push date, several window functions are run over the data: They calculate the past stats, future stats, t-test pvalue, and median test pvalue, among other things.
The decision of *whether* there is a regression is made with the median test. The median test is insensitive to amplitude, which allows it to ignore the intermittent non-Gaussian noise we witness in many of the time series. The median test is only good at detecting clear discontinuities in the time series; this is fine because that is what we are looking for. The median test fails at isolating which revision is closest to the discontinuity, and picks any revision belonging to an increasing (or decreasing) slope. Slowly changing statistics are completely invisible to it. If the median test signals there may be a regression, the t-test is used to determine the specific revision to blame: the t-test's amplitude sensitivity allows us to detect the specific revision closest to the discontinuity.
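A minimal sketch of this two-stage decision, using SciPy's Mood's median test and Welch's t-test as stand-ins for the exact statistics dzAlerts computes; the window size and thresholds are placeholders, not the production values:

```python
# Hypothetical sketch of the two-stage check at one point in the time series:
# the median test decides *whether* a discontinuity is plausible, and only
# then is the t-test consulted to help pin down the revision to blame.
from scipy.stats import median_test, ttest_ind

def check_point(values, i, window=20, median_alpha=0.001, ttest_alpha=0.01):
    past = values[max(0, i - window):i]    # "past stats" window
    future = values[i:i + window]          # "future stats" window
    if len(past) < window or len(future) < window:
        return None                        # not enough data on both sides
    _, median_p, _, _ = median_test(past, future)
    if median_p >= median_alpha:
        return None                        # no clear discontinuity here
    _, ttest_p = ttest_ind(past, future, equal_var=False)
    return {"index": i, "median_p": median_p, "ttest_p": ttest_p,
            "suspect": ttest_p < ttest_alpha}
```

Presumably the t-test p-values of the candidate revisions near a flagged discontinuity are then compared to pick the one closest to it; the sketch above only shows the per-revision check.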
- what window size
- currently: 20 data points before/after (globally adjustable; working on making it a per-test parameter)
- how to handle bimodal data (e.g. v8, dromaeo)
- currently: using median as our metric mitigates some of the bimodal problem by naturally selecting the most common mode.
- ideally: using a mixture model (http://en.wikipedia.org/wiki/Mixture_model) will help characterize the two modes and help us construct a useful aggregate (see the sketch just after this list)
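A minimal sketch of that mixture-model idea, assuming scikit-learn's GaussianMixture and made-up bimodal replicate values; this is not part of dzAlerts today:

```python
# Hypothetical sketch of the "ideally" case: fit a two-component mixture
# model to a bimodal set of replicates and report the weight and mean of
# each mode, which could feed a more useful aggregate than a plain median.
import numpy as np
from sklearn.mixture import GaussianMixture  # assumes scikit-learn is available

replicates = np.array([10.1, 10.3, 9.8, 10.0, 25.2, 25.5, 24.9, 10.2, 25.1, 25.3])
model = GaussianMixture(n_components=2, random_state=0)
model.fit(replicates.reshape(-1, 1))
for weight, mean in zip(model.weights_, model.means_.ravel()):
    print("mode: mean=%.2f weight=%.2f" % (mean, weight))
```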
dzAlerts uses the median test, and more points allow it to produce fewer false positives on (non-Gaussian) noisy data. The t-test is too sensitive to the non-Gaussian noise we are witnessing and generates too many false alerts.