Auto-tools/Projects/OrangeFactor

From MozillaWiki

Latest revision as of 09:07, 7 June 2018

OrangeFactor has been decommissioned

It has been succeeded by Intermittent Failures View.

About

The OrangeFactor web app is a tool for tracking and analysing intermittent test failures that occur during Firefox/Gecko continuous integration automation.

NB: OrangeFactor is considered near end of life. A replacement will likely use Treeherder's API as a backend instead.

Contributing

For help with OrangeFactor, please contact #ateam or :emorley, :gbrown, :jmaher.

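File bugs under Tree Management::OrangeFactor (https://bugzilla.mozilla.org/enter_bug.cgi?product=Tree+Management&component=OrangeFactor). Open bugs are listed at https://bugzilla.mozilla.org/buglist.cgi?resolution=---&component=OrangeFactor&product=Tree+Management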

The OrangeFactor web app can be run locally. See the instructions at: https://hg.mozilla.org/automation/orangefactor/file/default/README.txt

Architecture

  • brasstacks.mozilla.com
    • SSL termination occurs on the box.
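    • Only listens on port 443, since HTTP->HTTPS redirection is performed by Zeus.
  • OrangeFactor UI (https://brasstacks.mozilla.com/orangefactor/):
    • Static UI that interacts with the OrangeFactor API.
    • Served by OpenResty (https://openresty.org/).
    • Requires LDAP credentials to access.
    • Source: https://hg.mozilla.org/automation/orangefactor/file/default/html
  • OrangeFactor REST API (https://brasstacks.mozilla.com/orangefactor/api/):
    • Python FastCGI app reverse proxied by OpenResty.
    • Requires LDAP credentials to access.
    • Read-only API apart from job classification submissions from Treeherder, which use Hawk authentication.
    • API responses are a combination of results from OrangeFactor's ES instance and the public hg.mozilla.org pushlog.
    • ES queries are made using pyes, plus mozautoeslib, a helper library written on top of it.
    • Source: https://hg.mozilla.org/automation/orangefactor/file/default/server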
  • OrangeFactor Elasticsearch instance:
    • Index `bugs`: The intermittent test failure records submitted by Treeherder via OrangeFactor's REST API.
    • Index `bzcache`: A cache of public `keyword:intermittent-failure` bugs populated via a brasstacks cron job.
  • OrangeFactor bzcache refresh task:
    • A brasstacks cron job, run every four hours, that populates the `bzcache` index on the OrangeFactor Elasticsearch instance.
    • Fetches bug data using unauthenticated requests to Bugzilla's REST API.
    • Source: https://github.com/jonallengriffin/bzcache/blob/master/bzcache/bz_cache_refresh.py
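  • OrangeFactor mailer task:
    • A brasstacks cron job, formerly run every week and now disabled (bug 1426416), that emailed a summary of top failures to mozilla.dev.tree-alerts.
    • Fetches data from the OrangeFactor REST API.
    • Source: https://hg.mozilla.org/automation/orangefactor/file/default/woo_mailer.py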
  • OrangeFactor bug commenter task:
    • A brasstacks cron job, run in both daily and weekly variants, that adds failure summary comments to bugs associated with intermittent test failures (example: https://bugzilla.mozilla.org/show_bug.cgi?id=1365219#c6).
    • Fetches data from the OrangeFactor REST API.
    • Posts bug comments using the orangefactor@bots.tld account, which has no additional permissions beyond a standard user account.
    • Source: https://hg.mozilla.org/automation/orangefactor/file/default/woo_commenter.py
  • ActiveData mirror of OrangeFactor Elasticsearch data:
    • Currently synced manually by :eykle.
    • Plans to automate this in the future (bug 1344253).
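The bzcache refresh task described above fetches public `keyword:intermittent-failure` bugs with unauthenticated requests to Bugzilla's REST API. The following is a minimal sketch of what such a request URL could look like; it is not taken from bz_cache_refresh.py. The function name and the choice of fields (the three bug properties listed in the REST API section below) are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Public Bugzilla REST search endpoint.
BUGZILLA_BUG_ENDPOINT = "https://bugzilla.mozilla.org/rest/bug"


def intermittent_bugs_url(fields=("id", "status", "summary")):
    """Build an unauthenticated search URL for intermittent-failure bugs.

    Hypothetical helper: the real refresh task may request different
    fields or page through results differently.
    """
    params = {
        "keywords": "intermittent-failure",
        "include_fields": ",".join(fields),
    }
    return BUGZILLA_BUG_ENDPOINT + "?" + urlencode(params)


print(intermittent_bugs_url())
```

No API key is attached, so a request built this way would only see public bugs, which matches the description of the cache.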

Deploying

  • Ensure you are a member of the LDAP group `vpn_brasstacks` and are on the sudoers list in puppet.
  • Either connect to MozillaVPN or have the SSH jumphost set up via a `ProxyCommand` in your SSH config, then ssh to `brasstacks1.dmz.scl3.mozilla.com`.
  • To pull new changes from the Hg repo:
        sudo -u webtools hg pull -uv -R /home/webtools/apps/orangefactor/src/orangefactor/
  • To then pick up any changes to the Python REST API (not required for changes to UI static assets or the Python scripts run by cron):
        sudo service orangefactor stop; sudo service orangefactor start
  • It's no longer necessary to sync the UI static assets in the source directory with the openresty/nginx default public root, since they are served directly from the source directory.

Making Oranges Interesting

Currently, our intermittent oranges are not very interesting. After they've been identified, they are usually more-or-less ignored. This has caused us to accumulate oranges to the point where we have to deal with several of them for every commit (and by 'deal with', I mean 'log it and forget it'), which is time-consuming for the sheriffs and for anyone who pushes a commit. At the same time, it demotivates any effort to actually fix them.

We'd like to help change that. We think we can help by creating a dashboard to analyze oranges in the following ways:

  • identify the oranges that occur most frequently; these are the oranges that would produce the greatest improvement in our orange factor if fixed
  • identify significant changes in the frequency of a given orange; if a known intermittent orange suddenly begins to occur more frequently, it may be related to a recent code change, and this might give developers more information about when/why it occurred, which would hopefully help in fixing it
  • identify interesting patterns in failures; some failures may occur more frequently on certain OSes, build types, architectures, or other factors; by providing views which can track oranges across a range of factors, we might be able to provide developers with data that would help them reproduce failures or give them insight into their cause
  • identify overall trends in orange occurrences, already part of the legacy Orange Factor app; this can help track the 'orangeness' of a product over time, and can help measure the helpfulness of orange-fixing activities

Attacking Oranges

https://developer.mozilla.org/en/QA/Fixing_intermittent_oranges

History

These projects are deprecated and replaced by the new War on Orange/OrangeFactor application.

Topfails

Topfails was the first database-driven orange tracker developed in our team. It shows failures in terms of overall occurrences. It suffers from a buggy log parser, and a UI with relatively few views.

source: http://hg.mozilla.org/automation/topfails/

Old Orange Factor

Orange Factor is a newer dashboard by jmaher. It calculates the average number of oranges per push (the 'orange factor'), and tracks that number over time. We're currently using it as a base to explore the usefulness of other statistics.

source: http://github.com/jmaher/Orange-Factor

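See also the paper (http://people.mozilla.com/~mcote/war-on-orange/war-on-orange-paper-testistanbul.pdf) and the accompanying slides (http://people.mozilla.com/~mcote/war-on-orange/war-on-orange-testistanbul-slides-presentation.pdf), prepared for the TestIstanbul 2012 conference.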

ActiveData

The OrangeFactor Elasticsearch metadata is replicated to ActiveData (https://activedata.allizom.org/), and can be queried there using the "orange_factor" index:

{"from": "orange_factor"}
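As a rough illustration of the query above, the sketch below builds the same JSON body in Python. Only the "from" clause comes from this page; the "limit" clause and the function name are assumptions added to keep a result set small, and the way the body is submitted to ActiveData is not covered here.

```python
import json


def orange_factor_query(limit=10):
    """Return a query body for the "orange_factor" index.

    "from" is taken from this page; "limit" is an assumed extra clause.
    """
    return {"from": "orange_factor", "limit": limit}


# Serialise the query so it can be sent as a request body.
payload = json.dumps(orange_factor_query(), sort_keys=True)
print(payload)
```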

REST API

The War on Orange site pulls its data from a REST API. Other applications can hook into this to get the raw orange data.

The API root is at http://brasstacks.mozilla.com/orangefactor/api/. Parameters are passed via the query string, e.g. ?key1=value1&key2=value2. Example: http://brasstacks.mozilla.com/orangefactor/api/count?startday=2011-05-21&endday=2011-05-27&tree=mozilla-central

All returned data is in JSON format.
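A request URL of the form described above can be assembled with the standard library. This is a minimal sketch: the helper name `api_url` is hypothetical, and no request is actually sent, since the service has been decommissioned.

```python
from urllib.parse import urlencode

# API root as given on this page.
API_ROOT = "http://brasstacks.mozilla.com/orangefactor/api/"


def api_url(endpoint, **params):
    """Join the API root, an endpoint name and query-string parameters."""
    return API_ROOT + endpoint + "?" + urlencode(params)


url = api_url(
    "count",
    startday="2011-05-21",
    endday="2011-05-27",
    tree="mozilla-central",
)
print(url)
```

The printed URL is the same as the example given above.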

bybug

Provides a date-indexed list of oranges, with bug numbers, along with minimal details of each bug.

Parameters:

  • startday: Mandatory. In ISO format, e.g. 2011-05-27.
  • endday: Mandatory. Also in ISO format.
  • bugid: Optional. Return orange data for this bug only.
  • tree: Optional. Return information about this tree only. Defaults to mozilla-central. Pass "all" for orange data on all trees.
  • type: Optional. Return information for this build type only. Must be "opt" or "debug". Defaults to none (both build types).
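The parameter rules above can be checked on the client before a request is made. The sketch below is one way to do that; the function name, the `build_type` argument name (used to avoid shadowing Python's `type`) and the check that endday is not before startday are assumptions, not documented server behaviour.

```python
from datetime import date

VALID_BUILD_TYPES = {"opt", "debug"}


def bybug_params(startday, endday, bugid=None, tree=None, build_type=None):
    """Validate and collect query parameters for the bybug endpoint."""
    # Both dates are mandatory and must parse as ISO dates.
    start = date.fromisoformat(startday)
    end = date.fromisoformat(endday)
    if end < start:
        # Assumed sanity check; the page does not say how the API reacts.
        raise ValueError("endday must not be before startday")

    params = {"startday": startday, "endday": endday}
    if bugid is not None:
        params["bugid"] = str(bugid)
    if tree is not None:
        # Omitting tree lets the API default to mozilla-central.
        params["tree"] = tree
    if build_type is not None:
        if build_type not in VALID_BUILD_TYPES:
            raise ValueError('type must be "opt" or "debug"')
        params["type"] = build_type
    return params


print(bybug_params("2011-05-21", "2011-05-27", tree="all", build_type="debug"))
```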

Returns an object with two properties:

  • oranges: An object with dates as properties, e.g. data['oranges']['2011-05-27']. Each property is another object with orange data for the day, with the following properties:
    • orangecount: total number of oranges for that day, e.g. 54.
    • testruns: number of test runs that day, e.g. 24. The "Orange Factor" is orangecount/testruns.
    • oranges: details of the oranges that occurred that day. It is an array of objects, each one having these simple properties:
      • buildtype
      • date
      • starttime
      • timestamp
      • platform
      • branch
      • machinename
      • test
      • bug
      • revision
  • bugs: An object with bug ids as properties. Each bug in the above list of oranges is represented here. The information is gathered via pulse and thus is quicker to access than querying Bugzilla. Only a few basic properties are available; for more detailed info, you will have to consult Bugzilla:
    • id
    • status
    • summary
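Given a response shaped as described above, the per-day Orange Factor is orangecount divided by testruns. The sketch below applies that to a hand-written sample that reuses the example figures from the list (54 oranges, 24 test runs); the sample data and the choice to return None for a day with no test runs are assumptions for illustration.

```python
def orange_factor_by_day(data):
    """Map each date in a bybug/count response to orangecount / testruns."""
    factors = {}
    for day, summary in data["oranges"].items():
        testruns = summary["testruns"]
        # Avoid dividing by zero on days with no recorded test runs.
        factors[day] = summary["orangecount"] / testruns if testruns else None
    return factors


# Illustrative response fragment, not real OrangeFactor output.
sample = {
    "oranges": {
        "2011-05-27": {"orangecount": 54, "testruns": 24, "oranges": []},
    },
    "bugs": {},
}

print(orange_factor_by_day(sample))  # {'2011-05-27': 2.25}
```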

count

Returns a date-indexed summary of orange data.

The parameters are the same as for bybug. The returned data is also the same, except that the 'bugs' property is not returned, and the array of orange details (data['oranges'][<date>]['oranges']) is empty. This is a faster way to get just the summarized numerical data for, e.g., Orange Factor calculations.

bugdetails

Returns minimal details for one or more bugs.

The only parameter is "bugid", which takes a bug id or a comma-separated list of bug ids.

Returns the same data as the "bugs" property of the bybug returned data.
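Since bugdetails accepts either a single bug id or a comma-separated list, a client only needs to join the ids before building the query string. A minimal sketch follows; the helper name is hypothetical and the bug numbers are simply ones mentioned elsewhere on this page.

```python
from urllib.parse import urlencode

# API root as given on this page.
API_ROOT = "http://brasstacks.mozilla.com/orangefactor/api/"


def bugdetails_url(bug_ids):
    """Build a bugdetails URL for one or more bug ids."""
    # int() rejects anything that is not a plain bug number.
    ids = [str(int(bug_id)) for bug_id in bug_ids]
    if not ids:
        raise ValueError("at least one bug id is required")
    # Keep commas literal so the list stays readable in the URL.
    return API_ROOT + "bugdetails?" + urlencode({"bugid": ",".join(ids)}, safe=",")


print(bugdetails_url([1365219, 1344253]))
```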