Data Collection

From MozillaWiki
Revision as of 16:32, 13 July 2016 by Benjamin Smedberg (talk | contribs) (Common pitfalls and questions.)

Firefox sends various data back to Mozilla. This data keeps the browser up to date, powers various features, provides user support, and helps improve the product itself. This page documents the policy for how and why we add new data collection metrics. The owner and peers of the Firefox Data Collection policy module are responsible for making decisions about data collection systems and measurements.

Owner: Benjamin Smedberg (:bsmedberg) Peers: Ally Naagktgeboren

Data Collection Levels

There are two "levels" of data collection in Firefox:

  • opt-in data (Telemetry) is collected from users who express a choice to help with Firefox development. This includes all users who install prerelease/testing builds, plus release users who have explicitly checked the box in preferences.
  • opt-out data (Firefox Health Report) is collected by default from all Firefox users. Users may choose to turn this data collection off in preferences.

Requirements

Requirements For All Data Collection From Firefox

  • Specifics about the collected data must be documented using the in-tree histogram definitions or the in-tree documentation system (.rst files). This documentation should be detailed enough that people don't need to read the code implementation to understand what data is being collected.
  • Any changes to data collection must be approved by the data collection module owner or peers by requesting review on the patch which updates the in-tree documentation.
  • The bug or documentation must publicly identify the problem statement that will be solved by collecting data.
  • There must be a person who takes responsibility for the correctness of the data.
  • There must be a concrete plan for using the data, and a person who takes responsibility for this.
  • The data must be included in the Firefox privacy notice. Much of the time, new data collection requires no changes to the notice, but when changes are required the data stewards will work with Marshall Erwin and the Mozilla legal team to make sure that the privacy notice accurately reflects the collected data.

Note: the data stewards do not typically verify that the patch collects the data correctly according to the documentation. That is the responsibility of the code reviewer.

Requirements for opt-in (telemetry)

  • The requesting team must have a plan to use the data.
  • Exploratory data collection should be temporary and expire after 6 months/5 release cycles, but it may be renewed as long as the data is still valuable and the team is using it.
  • Permanent or longer-term data collection should have a plan for permanent monitoring.
    • Note: automated monitoring is preferred. However, the current telemetry alerting system is only designed for scalar performance measurements.
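The expiry rule for exploratory collection can be sketched as simple arithmetic. This is a minimal illustration, assuming integer major version numbers and one major version per release cycle. The function names `expiry_version` and `is_expired` are hypothetical and are not part of any Mozilla tooling.

```python
def expiry_version(current_major: int, cycles: int = 5) -> int:
    """Return the major version in which an exploratory probe should expire.

    Assumes one major version per release cycle (an assumption of this sketch).
    """
    if current_major < 1 or cycles < 1:
        raise ValueError("version and cycle count must be positive")
    return current_major + cycles


def is_expired(running_major: int, expires_in: int) -> bool:
    """Treat a probe as expired once the running version reaches its expiry."""
    return running_major >= expires_in
```

Under these assumptions, a probe added in version 47 would be due for renewal or removal in version 52.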

Requirements for opt-out data collection (Firefox Health Report)

  • The data must provide user value.

Here are some examples of providing user value:

  • Providing features or services directly to individual users.
  • Providing features or services to Firefox users as a whole.
  • Monitoring and solving product quality issues.
  • Exploratory understanding of how users are using the product, if that can be tied back to specific product improvements. This data collection should typically be limited to 6 months/5 release cycles and can be renewed only as long as it still ties to specific product improvements.

Some data is collected for correlation: for example, we collect information about the user's operating system and version, and about the user's add-ons and add-on versions, so that we can correlate and monitor crash rates and other error metrics against these groupings.

It is not sufficient that collecting data benefits Mozilla (the company/project). It is necessary that we can map collected data to benefits for either individual Firefox users or Firefox users as a group.

Requesting Approval

It is our intention to review every new data collection within Firefox, but to do so quickly and with minimal overhead. For every new measurement, even a simple new Telemetry probe, please request approval by setting the feedback flag for the data collection module owner or a peer. Simple requests should be handled within a day.

More complex requests, and especially requests which add a new kind of data collection mechanism or require changes to the privacy notice, will require more extensive review. Please consider pinging the team about these as they are being designed! Additional discussions/review may include:

  • Privacy analysis: This may involve requesting feedback from the mozilla.dev.privacy mailing list and/or privacy experts within and outside of Mozilla to discuss the feature and its privacy impact.
  • Data compliance review: a review with the Mozilla data compliance team to ensure that changes match the Mozilla data compliance policies and documents.
  • Legal review: If necessary, the module owner will request a legal review from Mozilla's legal team. A legal review will be necessary for any changes to the privacy policies/notices.
  • Data quality/statistical review: In cases where data analysis and quality are uncertain, the module owner will request additional feedback from the Mozilla metrics team and other experts to validate data analysis plans.
  • UX review: We may request/require feedback from the Firefox UX team on any proposed privacy/data-control UI.

Common Problems

Histogram descriptions

  • Histogram descriptions should record *what* is being collected, in detail.
  • It is important to say *when* a value is recorded, because this is often a confusing point when constructing analysis.
  • Include units: for example, indicate whether a time duration is measured in seconds, milliseconds, or microseconds.
  • When counting, be sure to indicate how repeat usage works. For example, when counting decoding errors, are multiple issues counted for the same video, or only the first one?
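The checklist above can be turned into a rough self-review aid. The sketch below is a heuristic only, not an existing Mozilla tool: the function name `review_description`, the word lists, and the eight-word threshold are all assumptions chosen for illustration. A description that passes still needs human review.

```python
import re
from typing import List

# Words that suggest the description names a unit (heuristic, not exhaustive).
UNIT_WORDS = re.compile(r"\b(seconds|milliseconds|microseconds|ms|bytes)\b", re.IGNORECASE)
# Phrases that suggest the description says when the value is recorded.
WHEN_WORDS = re.compile(r"\b(when|each time|every time|once per)\b", re.IGNORECASE)


def review_description(description: str, measures_duration: bool = False) -> List[str]:
    """Return reviewer-style warnings for a histogram description."""
    warnings = []
    # Arbitrary threshold: very short descriptions rarely say *what* in detail.
    if len(description.split()) < 8:
        warnings.append("description is very short; say what is collected in detail")
    if not WHEN_WORDS.search(description):
        warnings.append("description does not say when the value is recorded")
    if measures_duration and not UNIT_WORDS.search(description):
        warnings.append("duration has no unit (seconds, milliseconds, microseconds?)")
    return warnings
```

For instance, a bare description such as "Decode time" triggers all three warnings, while one that states the unit, the recording moment, and how repeats are counted triggers none.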

Enumerated histograms

Enumerated histograms should either list all the possible enumeration values in the histogram description, or reference a declared enumeration in the tree by name.
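One way to keep a description in step with its enumeration is to generate the value list from the enumeration itself. The sketch below assumes a hypothetical `DecodeResult` enumeration and a hypothetical helper `describe_enum`; neither exists in the tree, and real histogram descriptions are written by hand in the in-tree definitions.

```python
from enum import IntEnum


class DecodeResult(IntEnum):
    """Hypothetical enumeration; stands in for one declared in the tree."""
    SUCCESS = 0
    UNSUPPORTED_CODEC = 1
    CORRUPT_STREAM = 2
    OUT_OF_MEMORY = 3


def describe_enum(summary: str, enum_cls) -> str:
    """Append every 'value = NAME' pair to a histogram description."""
    values = ", ".join(f"{member.value} = {member.name}" for member in enum_cls)
    return f"{summary} Values: {values}."
```

Calling `describe_enum("Result of each video decode attempt.", DecodeResult)` produces a description that lists all four values, so a reader does not need to open the code to interpret a bucket.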

Keyed histograms

Keyed histograms contain arbitrary strings in the key, so they get extra attention. Keep the following in mind:

  • Don't use a keyed histogram if you don't need it! Many times a simpler format such as a count or enumeration histogram can solve the same problem.
  • The key should not contain user-input data, or other data that can be used to identify particular users.
  • In general, keys should be a limited set of values. If you expect more than tens of values, the default aggregations for this histogram will blow up. If this is still required, you should file a bug to have the default aggregations disabled.
  • The histogram description should describe exactly what the key contains, and the format.
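A common way to satisfy the points above is to map every candidate key onto a small, documented set before recording it. The following is a minimal sketch under stated assumptions: the allowlist contents, the `other` bucket, and the names `normalize_key` and `record` are hypothetical, and a plain dict stands in for the real histogram storage.

```python
# Hypothetical allowlist of keys; a real probe would document its own set.
ALLOWED_KEYS = frozenset({"h264", "vp8", "vp9", "opus", "aac"})
OTHER_KEY = "other"


def normalize_key(raw: str) -> str:
    """Map an arbitrary string to a small, documented set of keys.

    Anything outside the allowlist collapses to OTHER_KEY, so user-supplied
    or unexpected strings are never recorded verbatim.
    """
    candidate = raw.strip().lower()
    return candidate if candidate in ALLOWED_KEYS else OTHER_KEY


def record(counts: dict, raw_key: str) -> None:
    """Increment the count for the normalized key in a plain dict."""
    key = normalize_key(raw_key)
    counts[key] = counts.get(key, 0) + 1
```

Because unknown strings collapse into a single bucket, the number of distinct keys stays bounded and no free-form input reaches the recorded data.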

JS exceptions

It is a common request to record information about JS exceptions in certain contexts. In the general case, it is very difficult to prove that this information cannot contain personal data. If you have specific types of errors which are thrown at known locations, you can record information about those. This is an important but unsolved problem.

Data Collection Properties

When proposing a new measurement or data system, please consider your requirements and the necessary data properties:

Function:

  • Is the data collection necessary for Firefox to function properly? For example, the automatic update check must be sent in order to keep Firefox up to date.
  • Is there a specific user-visible function planned for the data?
  • Population: Is it necessary to take a measurement from all users? Or is it sufficient to measure only prerelease users?
  • Sampling: is it necessary to get data from all users, or is it sufficient to collect data from a smaller sample?
  • Will data submission be automatic, or will there be opt-in UI?

Analysis and Reporting:

  • Who will be analyzing the data?
  • Will the data that's being collected answer the questions we have?
  • Will it be a single or periodic report?
  • Is it desirable to track data changes over time? With what frequency? With what latency?
  • Will the data reporting be private or public?
  • Will the raw data being collected be private or public?
  • Is it necessary to keep the measurement forever, or is it sufficient to run a short-term experiment/single report?

Privacy (and Legal):

  • Does the data contain sensitive or personal information?
  • Can the data be used in combination with other measurements to identify a particular person?
  • What kind of user controls will be exposed to control data submission?
  • Will users be able to see their own data before or after it has been submitted, either within Firefox or from the server?
  • Does the data conform to the existing Mozilla privacy principles, the Mozilla Privacy Policy, and the Firefox privacy notice?
  • Does this data collection represent any unusual privacy or legal risk to users or Mozilla?


Other Practices

Every year, the data collection owner and peers will survey all of the existing data collection systems within Firefox. This survey has the following goals:

  • To ensure that it is still necessary and useful to collect a piece of data.
  • To re-identify who is responsible for the collection, monitoring, and reporting of collected data.