MetricsDataPing: Difference between revisions

From MozillaWiki
Revision as of 03:50, 27 January 2012

Description

Measure adoption, retention, and aggregated search counts by engine. Record possible explanatory dimensions using a statistically unbiased and sound approach. Comparable projects that collect user data are TestPilot and Telemetry. Participants in these programs are self-selected. It has been demonstrated that data retrieved from TestPilot is biased and not representative of the Firefox population.

Opt-in vs. Opt-out: Layman's Explanation

Opt-in: assumes by default that each user is not in the data collection sample. The user is asked to join via UX elements, so the user's action is to opt in to the data collection process for some minimal period.

Opt-out: assumes that all users are included in the sample, which gives maximum data coverage and full representativeness. Users can leave the sample via an opt-out mechanism.

How are they different?

  • Superficially the consequences may seem the same, but the validity of any conclusions drawn from data under the two alternative approaches is quite different.
  • The attempt to reach a representative set of user data is the key differentiator between the approaches. In standard surveys one can easily see that the responses of those who volunteer to take a survey are likely to be quite different from those of a rigorously administered survey. Such self-selection bias is also a key weakness of online data collection.
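
The effect of self-selection can be illustrated with a small worked example. This is only a sketch: the two user groups, their sizes, opt-in rates and search counts are invented for illustration and are not Mozilla data.

```python
# Hypothetical population: every number below is a made-up illustration value.
groups = [
    # (label, number of users, opt-in rate, mean searches per day)
    ("casual", 9000, 0.01, 2.0),
    ("power", 1000, 0.20, 12.0),
]

def population_mean(groups):
    """Mean over every user, which is what opt-out collection approximates."""
    total_users = sum(n for _, n, _, _ in groups)
    return sum(n * mean for _, n, _, mean in groups) / total_users

def opt_in_mean(groups):
    """Expected mean over only the users who volunteer."""
    volunteers = sum(n * rate for _, n, rate, _ in groups)
    return sum(n * rate * mean for _, n, rate, mean in groups) / volunteers

print(f"population mean: {population_mean(groups):.2f}")
print(f"opt-in mean: {opt_in_mean(groups):.2f}")
```

With these invented numbers the opt-in estimate is roughly three times the population mean, because the group that volunteers more often also searches more.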

What difference does it make? Quite a bit!

  • We want to acquire representative data and analyze it for the ‘de-averaged’ benefit of multiple but still large sub-populations of users
  • Each subpopulation requires insights and actions that are not of the ‘one size fits all’ variety

Data Elements

A directory of elements collected by the various data collection pings (Metrics Data Collection Ping, Blocklist, AUS Ping, Version Check Ping, Services AMO, Telemetry) can be found here: https://metrics.etherpad.mozilla.org/8

The schema and definitions of data elements found in the Metrics Ping can be found here: https://metrics.etherpad.mozilla.org/ep/pad/view/ro.B-/latest

Document Identifier Strategy

Each profile will generate a UUID to be used as the document key. Each day's submission will use that UUID, and this will also be the key for that profile's cumulative data on the server. As of 2011/12/01, the intention is for the server to merge on the fly as new documents are submitted, not persisting the individual submissions at all.
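
A minimal sketch of the identifier strategy, assuming a random (version 4) UUID and a simple key-value preference store. The Profile class and the preference name "metrics.documentId" are hypothetical and are not taken from the Firefox code.

```python
import uuid

def new_profile_document_id() -> str:
    """Return a random (version 4) UUID string to use as the document key."""
    return str(uuid.uuid4())

class Profile:
    """Hypothetical stand-in for a browser profile's stored preferences."""

    def __init__(self):
        self.prefs = {}

    def document_id(self) -> str:
        # Generate the UUID once, then reuse it for every later submission.
        if "metrics.documentId" not in self.prefs:
            self.prefs["metrics.documentId"] = new_profile_document_id()
        return self.prefs["metrics.documentId"]

profile = Profile()
first = profile.document_id()
second = profile.document_id()
print(first == second)
```

Because the same key is sent every day, the server can find and update that profile's cumulative document without storing each submission separately.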

Client-side

The meta bug for the client-side measurement system can be found here: https://bugzilla.mozilla.org/show_bug.cgi?id=718066

Sample JSON output as received on the Mozilla server side:

2011/11/04:
{
    "ver": 1,
    "uuid": "e8a583fe-98ec-45be-9e44-96a23759067a",
    "lastPingTime": 1320340265,
    "thisPingTime": "2011-11-04T19:30:11.948Z",
    "currentTime": "2011-11-04T19:30:11.962Z",
    "env": {
        "reason": "idle-daily",
        "OS": "Linux",
        "appID": "{ec8030f7-c20a-464f-9b0e-13a3a9e97384}",
        "appVersion": "10.0a1",
        "appVendor": "Mozilla",
        "appName": "Firefox",
        "appBuildID": "20111104162615",
        "appABI": "x86_64-gcc3",
        "appUpdateChannel": "default",
        "appDistribution": "default",
        "appDistributionVersion": "default",
        "platformBuildID": "20111103103700",
        "platformVersion": "10.0a1",
        "locale": "en-US",
        "name": "Linux",
        "version": "2.6.38-12-generic",
        "cpucount": 4,
        "memsize": 7889,
        "arch": "x86-64"
    },
    "simpleMeasurements": {
        "uptime": 0,
        "main": 3,
        "firstPaint": 629,
        "sessionRestored": 502,
        "isDefaultBrowser": false,
        "crashCountSubmitted": 1,
        "profileAge": 31,
        "addonCount": 2,
        "addons": [
            {
                "id": "crashme@ted.mielczarek.org",
                "appDisabled": false,
                "version": "0.3",
                "installDate": "2011-10-25T15:02:03.000Z",
                "updateDate": "2011-10-25T15:02:03.000Z"
            },
            {
                "id": "mozmetrics@mozilla.org",
                "appDisabled": false,
                "version": "0.1",
                "installDate": "2011-10-11T14:59:08.000Z",
                "updateDate": "2011-10-26T13:26:45.000Z"
            }
        ]
    },
    "events": {
        "search": {
            "abouthome": {
                "Google": 1
            },
            "searchbar": {
                "Google": 3,
                "Amazon.com": 1,
                "Other": 1
            },
            "urlbar": {
                "Google": 1
            }
        },
        "sessions": {
            "completedSessions": 16,
            "completedSessionTime": 829,
            "completedSessionActiveTime": 535,
            "abortedSessions": 2,
            "abortedSessionTime": 7,
            "abortedSessionActiveTime": 15,
            "abortedSessionAvg": 4,
            "abortedSessionMed": 4,
            "currentSessionActiveTime": 10,
            "currentSessionTime": 20,
            "aboutSessionRestoreStarts": 0
        },
        "corruptedEvents": 0
    }
}
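
As an illustration of how a document like the sample could be read, the sketch below totals search counts per engine across the entry points. It embeds a trimmed copy of the "events" section from the sample. The function name search_counts_by_engine is an assumption, not part of any Mozilla tool.

```python
import json
from collections import Counter

# Trimmed copy of the "events" section from the sample document.
sample = json.loads("""
{
  "events": {
    "search": {
      "abouthome": {"Google": 1},
      "searchbar": {"Google": 3, "Amazon.com": 1, "Other": 1},
      "urlbar": {"Google": 1}
    }
  }
}
""")

def search_counts_by_engine(document: dict) -> Counter:
    """Sum search counts per engine across all search entry points."""
    totals = Counter()
    for counts in document.get("events", {}).get("search", {}).values():
        totals.update(counts)  # Counter.update adds counts instead of replacing
    return totals

totals = search_counts_by_engine(sample)
print(dict(totals))
```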

Server-side

  • Clients will POST data to the configured URL not more than once every 24 hours.
  • The first timer check should be one minute after startup.
  • The POST data will consist of a JSON document containing a document ID and all the metrics that were collected since the last submission.
  • The server side will receive the POST request and perform GeoIP location on the IP address. The raw IP will never be stored. The GeoIP data and submission timestamp will be added to the JSON document.
  • The server will store the JSON document into a daily staging collection with all other documents received during that date, UTC.
  • The server will return an HTTP response to the client indicating success of the storage and a document ID. For the initial feature release, this ID will be the same as the one passed in (i.e. an installation GUID). It can easily be changed to be new each time (i.e. a document GUID). If the ID is new, the client should store it to be returned on the next submission.
  • In the future this response might also include instructions to the client for things such as changing timing or MetricsDataPing configuration.
  • Asynchronously, the server will retrieve a document with the same document ID from the "latest" bucket if one exists and will insert/update the "latest" bucket with a merged document that does not include any metrics we wish to avoid collecting longitudinally per installation, such as GeoIP. This "latest" bucket is used to perform retention analysis since it will have the last submitted data for any installation even if it is no longer in use. We will set a retention policy for when these inactive installation documents shall be deleted from the "latest" bucket.
  • Longitudinal data for 6 months (e.g. intensity of use) is stored cumulatively in the JSON objects indexed by GUID. Anything older than 6 months is deleted.
  • At the end of the day, UTC, the server will aggregate all the documents submitted on that date and store the aggregate data (with no installation ID) in aggregate history tables in our data warehouse.
  • There will be UI elements inside of Firefox that allow users to delete all their data (both remote and local).
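
The flow in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual server: the field names "geoip" and "submissionTime", the in-memory dictionary standing in for the "latest" bucket, the shallow-overwrite merge rule, and the per-country aggregate are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

MIN_INTERVAL = timedelta(hours=24)
# Server-added fields that should not be kept per installation (assumed names).
NON_LONGITUDINAL_FIELDS = {"geoip", "submissionTime"}

def may_submit(last_ping, now):
    """Client side: allow a POST only if 24 hours have passed since the last one."""
    return last_ping is None or now - last_ping >= MIN_INTERVAL

def receive(document, geoip_country, now):
    """Annotate an incoming document; the raw IP is never stored in it."""
    staged = dict(document)
    staged["geoip"] = geoip_country
    staged["submissionTime"] = now.isoformat()
    return staged

def merge_into_latest(latest, staged):
    """Update the "latest" entry, leaving out the non-longitudinal fields."""
    doc_id = staged["uuid"]
    merged = dict(latest.get(doc_id, {}))
    # Placeholder merge rule: newer values simply overwrite older ones.
    merged.update(
        {k: v for k, v in staged.items() if k not in NON_LONGITUDINAL_FIELDS}
    )
    latest[doc_id] = merged
    return merged

def aggregate_day(staged_documents):
    """End of day: count documents per country, with no installation ID."""
    return Counter(doc["geoip"] for doc in staged_documents)

now = datetime(2012, 1, 27, tzinfo=timezone.utc)
latest_bucket = {}
staged = receive({"uuid": "example-id", "ver": 1}, "US", now)
merge_into_latest(latest_bucket, staged)
print(latest_bucket["example-id"])
print(aggregate_day([staged]))
```

Keeping the GeoIP data only in the daily staging document, and out of the "latest" entry, is what prevents location from being tracked per installation over time.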

Data Access Policies

Members of the Metrics Team can access this data for strategic advisory, business operations, and analytical research purposes. A more comprehensive set of policies applicable to data at Mozilla in general needs to be determined, presumably in conjunction with the UDC.
Access control is currently based on the following criteria:

  • Must be a member of the metrics team
  • Must have an SSH account with LDAP integrated key
  • Must have MPT-VPN access

User Data

Complete transparency in the data we collect and how it is used is achieved through blog posts and easy access to UI elements to turn off/on data collection. As a first step:

In subsequent versions, compare user installation data to segments of the population by dimensions (e.g. OS, hardware, version). For example, given a user's OS and number of extensions, how does their startup time compare to a 'peer group' sharing similar characteristics?
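
One way such a peer-group comparison could work is a simple percentile rank. The sketch below is an assumption about the approach, and the peer values are invented; only the user's value of 629 ms matches the firstPaint figure in the sample document.

```python
def percentile_in_peer_group(user_value, peer_values):
    """Percentage of peers whose value is less than or equal to the user's."""
    if not peer_values:
        raise ValueError("peer group is empty")
    at_or_below = sum(1 for v in peer_values if v <= user_value)
    return 100.0 * at_or_below / len(peer_values)

# Hypothetical firstPaint times (ms) for installations with the same OS
# and a similar number of add-ons; these numbers are invented.
peers_first_paint_ms = [480, 510, 629, 700, 850, 910, 1200, 1500]
print(percentile_in_peer_group(629, peers_first_paint_ms))
```

A lower percentile here means a faster startup than most of the peer group.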

UI Implementation

The Metrics Team is consulting with UX to determine the proper UI implementation. Given the opt-out requirement, UX proposes a check box to opt out in the preferences pane and notifying users through non-modal and non-chrome channels (blog posts, privacy policies, download pages).

See: https://bugzilla.mozilla.org/show_bug.cgi?id=707970

Security Reviews

Review for Bagheera, the back-end server that receives and stores user data: https://bugzilla.mozilla.org/show_bug.cgi?id=655746