Auto-tools/Projects/Alerts
Datazilla Alerting System
The Datazilla Alerting System, also known as dzAlerts, is a daemon that periodically inspects the Talos, B2G, and Eideticker performance results and generates alerts when there is a statistically significant regression in performance.
Collaboration
Features and development are ongoing:
- Signal from Noise meeting (notes on etherpad)
- IRC: #ateam@irc.mozilla.org
- email: klahnakoski@mozilla.com
Objective
dzAlerts' objective is to provide high-quality, detailed alerts on performance regressions.
History
dzAlerts started as code inside Datazilla, hence the name. A few design changes were made during development:
- The alerting code was distinct from the Datazilla code in that it was designed as stand-alone daemons, with little integration with the main UI-centric code.
- Integrating the alerts data access into the existing Datazilla database access pattern required too much boilerplate code, which demanded additional maintenance as the project evolved.
- The Datazilla database schema was not designed for pulling long time series on individual tests in a performant way.
- The release schedules were different.
dzAlerts is now completely separate, and uses an ElasticSearch cluster to pull slices out of the data cube of test results. It maintains enough information about the alerts so it can provide links to the Datazilla UI.
Nomenclature
The nomenclature used by dzAlerts is a little different from that of the surrounding applications it deals with. This keeps the terms consistent with their common definitions, and provides names for the finer-grained features dzAlerts has visibility into but that are invisible to other systems.
- Test Suite - or "suite" for short, is a set of bundled tests. All the "tests" referred to by Datazilla and Talos are actually bundles of individual tests. Dromaeo_css, kraken, tp5o, and tscrollx are examples of test suites.
- Test - a single test responsible for testing a feature. Dromaeo_css has 6 tests called dojo.html, ext.html, jquery.html, mootools.html, prototype.html and yui.html. These can also be called "pages", or "page tests", depending on the suite.
- Replicate - Each test is executed (replicated) multiple times, and the result stored in an array. Each result is called a "replicate", and the array of results is usually referred to as "the replicates".
- Sub-Test - A test should be atomic, but unfortunately some tests bundle yet-deeper-but-unnamed tests into their replicates. These buried tests are called "sub-tests". To handle sub-tests, the replicates are further ETLed into normal tests for dzAlerts to digest (more below).
- Time Series - A statistic from a family of test results which can reasonably be compared to each other, and sorted by time (usually by push date).
General Design
There are two main parts to dzAlerts. The ETL step copies data from Datazilla to the ElasticSearch cluster, and the dzAlerts daemon is responsible for identifying performance regressions and filling a database with what it finds.
ETL
The Extract Transform Load (ETL) daemon is responsible for
- Extracting the JSON test suite results from Datazilla,
- Transforming the bundle of test results into individual records, and
- Loading them into the ElasticSearch cluster.
The transform step is the most interesting, but still simple: in the case of tp5o, each of the 50 page test results is given its own record with a copy of all the tp5o metadata (like platform, branch, test_run_time, etc). This is highly redundant, but allows ElasticSearch to index each test individually for fast retrieval.
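To make the transform concrete, here is a minimal Python sketch (not the actual dzAlerts ETL code) of flattening one suite blob into per-test records; the field names and blob layout are assumptions for illustration only:

```python
# Hypothetical sketch of the Transform step: flatten one tp5o suite blob
# into one record per page test, copying the suite metadata onto each.
def flatten_suite(suite_blob):
    metadata = {k: v for k, v in suite_blob.items() if k != "results"}
    records = []
    for page_name, replicates in suite_blob["results"].items():
        record = dict(metadata)            # copy platform, branch, test_run_time, ...
        record["test_name"] = page_name    # e.g. one of the ~50 tp5o pages
        record["replicates"] = replicates  # raw replicate values for this page
        records.append(record)
    return records

# Made-up example blob, just to show the shape of the output records
example = {
    "platform": "win32",
    "branch": "mozilla-inbound",
    "test_run_time": 1403265120,
    "results": {"amazon.com": [251.0, 249.5, 252.3], "bing.com": [98.2, 97.7, 99.1]},
}
for r in flatten_suite(example):
    print(r["test_name"], r["replicates"])
```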
Dromaeo* ETL
The two dromaeo* suites (dromaeo_css, dromaeo_dom) are a little more complicated, and best demonstrated with a couple of charts from Datazilla. The top chart shows the modify.html test mean and variance. One of the tests is highlighted, and you can see the individual replicates in the bottom chart. Looking at the replicates, we see they come in sets of five: each set of five replicates has approximately the same value, and each set seems distinct from the others. Once you see this pattern in the replicates, you can understand why the variance shown in the top chart is consistently high.
This same pattern happens in all tests of both dromaeo_css and dromaeo_dom.
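As an illustration of the sub-test ETL mentioned above, here is a minimal Python sketch, assuming the sets-of-five structure holds exactly (the real handling may differ):

```python
# Hypothetical sketch: split a dromaeo* replicate array into sub-tests,
# assuming each consecutive group of five replicates belongs to one
# unnamed sub-test, as the charts above suggest.
def split_subtests(replicates, group_size=5):
    if len(replicates) % group_size != 0:
        raise ValueError("replicate count is not a multiple of the group size")
    return [replicates[i:i + group_size]
            for i in range(0, len(replicates), group_size)]

replicates = [10.1, 10.2, 9.9, 10.0, 10.1,   # sub-test 0
              25.3, 25.1, 25.4, 25.2, 25.0]  # sub-test 1
for index, group in enumerate(split_subtests(replicates)):
    print("sub-test", index, "median:", sorted(group)[len(group) // 2])
```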
Pulling Data
Requesting a time series is done by simply filtering the test results by product, branch, platform, and test name. ElasticSearch can pull months of Talos test results for any one combination in under a second. Each test is made of many replicates, and those replicates are reduced to a single statistic (usually the median) and forwarded for analysis.
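For illustration only, a sketch of such a pull using the official `elasticsearch` Python client; the index name, field names, and query shape here are assumptions, not the real dzAlerts schema:

```python
# Hypothetical sketch of pulling one time series from ElasticSearch and
# reducing each test run's replicates to its median.
from statistics import median
from elasticsearch import Elasticsearch  # assumes the official client is installed

es = Elasticsearch("http://localhost:9200")           # placeholder address
query = {
    "size": 10000,
    "query": {"bool": {"filter": [                    # filter by the four dimensions
        {"term": {"product": "Firefox"}},
        {"term": {"branch": "mozilla-inbound"}},
        {"term": {"platform": "win32"}},
        {"term": {"test_name": "tp5o/amazon.com"}},
    ]}},
    "sort": [{"push_date": "asc"}],                   # order by push date
}
hits = es.search(index="talos_results", body=query)["hits"]["hits"]  # index name is assumed
series = [(h["_source"]["push_date"], median(h["_source"]["replicates"])) for h in hits]
```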
Analysis
Once the test results are retrieved, and sorted by push date, several window functions are run over the data: They calculate the past stats, future stats, t-test pvalue, and median test pvalue, among other things.
The decision of *whether* there is a regression is made with the median test. The median test is insensitive to amplitude, which allows it to ignore the intermittent non-Gaussian noise we witness in many of the time series. The median test is only good at detecting clear discontinuities in the time series; this is fine because that is what we are looking for. The median test fails at isolating which revision is closest to the discontinuity, and picks any revision belonging to an increasing (or decreasing) slope. Slowly changing statistics are completely invisible to it. If the median test signals there may be a regression, the t-test is used to determine the specific revision to blame: the t-test's amplitude sensitivity allows us to detect the specific revision closest to the discontinuity.
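A minimal sketch of this two-stage decision, using SciPy's Mood's median test and Welch's t-test as stand-ins for the exact statistics dzAlerts computes; the window size and thresholds are placeholders, not the production values:

```python
# Hypothetical sketch of the two-stage check at one point in the time series:
# the median test decides *whether* a discontinuity is plausible, and only
# then is the t-test consulted to help pin down the revision to blame.
from scipy.stats import median_test, ttest_ind

def check_point(values, i, window=20, median_alpha=0.001, ttest_alpha=0.01):
    past = values[max(0, i - window):i]    # "past stats" window
    future = values[i:i + window]          # "future stats" window
    if len(past) < window or len(future) < window:
        return None                        # not enough data on both sides
    _, median_p, _, _ = median_test(past, future)
    if median_p >= median_alpha:
        return None                        # no clear discontinuity here
    _, ttest_p = ttest_ind(past, future, equal_var=False)
    return {"index": i, "median_p": median_p, "ttest_p": ttest_p,
            "suspect": ttest_p < ttest_alpha}
```

Presumably the t-test p-values of the candidate revisions near a flagged discontinuity are then compared to pick the one closest to it; the sketch above only shows the per-revision check.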
- what window size
- currently: 20 data points before/after (globally adjustable; working on making it a per-test parameter)
- how to handle bimodal data (e.g. v8, dromaeo)
- currently: using median as our metric mitigates some of the bimodal problem by naturally selecting the most common mode.
- ideally: using a mixture model (http://en.wikipedia.org/wiki/Mixture_model) will help characterize the two modes and help us construct a useful aggregate (see the sketch just after this list)
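A minimal sketch of that mixture-model idea, assuming scikit-learn's GaussianMixture and made-up bimodal replicate values; this is not part of dzAlerts today:

```python
# Hypothetical sketch of the "ideally" case: fit a two-component mixture
# model to a bimodal set of replicates and report the weight and mean of
# each mode, which could feed a more useful aggregate than a plain median.
import numpy as np
from sklearn.mixture import GaussianMixture  # assumes scikit-learn is available

replicates = np.array([10.1, 10.3, 9.8, 10.0, 25.2, 25.5, 24.9, 10.2, 25.1, 25.3])
model = GaussianMixture(n_components=2, random_state=0)
model.fit(replicates.reshape(-1, 1))
for weight, mean in zip(model.weights_, model.means_.ravel()):
    print("mode: mean=%.2f weight=%.2f" % (mean, weight))
```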
dzAlerts uses the median test, and more points allow it to produce fewer false positives on (non-Gaussian) noisy data. The t-test is too sensitive to the non-Gaussian noise we are witnessing and generates too many false alerts.