Data Collection: Difference between revisions

From MozillaWiki
(Common pitfalls and questions.)
(Adding myself (Luke Crouch))
 
(105 intermediate revisions by 39 users not shown)
Firefox sends various data back to Mozilla. This data keeps the browser up to date, powers various features, provides user support, and helps improve the product itself. This page documents the policy for how and why we add new data collection metrics. The owner and peers of the Firefox Data Collection policy module are responsible for making decisions about data collection systems and measurements.


Owner: [https://mozillians.org/en-US/u/bsmedberg/ Benjamin Smedberg] (:bsmedberg)
At Mozilla, like at many other organizations, we rely on data to make product decisions. But here, unlike many other organizations, we balance our goal of collecting useful, high-quality data with our goal to give users meaningful choice and control over their own data. The Mozilla data collection program was created to ensure we achieve both goals whenever we make a change to how we collect data in our products. 
Peers: [https://mozillians.org/en-US/u/ally/ Ally Naagktgeboren]


== Data Collection Levels ==
In [https://wiki.mozilla.org/index.php?title=Firefox/Data_Collection&diff=1183319&oldid=1181872 November 2017], we revised the program to make our policies clearer and easier to understand and our processes simpler and easier to follow. These changes are designed to reflect our commitment to data collection grounded in:


There are two "levels" of data collection in Firefox:
*  Necessity - We collect only as much data as is necessary when we can demonstrate a clear business case for that data
*  Privacy - We give users meaningful choices and control over their own data
*  Transparency - We make our decisions about data collection public and accessible
*  Accountability - We assign accountability for the design, approval, and implementation of data collection


* opt-in data (Telemetry) is collected from users who express a choice to help with Firefox development. This includes all users who install prerelease/testing builds, plus release users who have explicitly checked the box in preferences.
Owner: Nneka Soyinka
* opt-out data (Firefox Health Report) is collected by default from all Firefox users. Users may choose to turn this data collection off in preferences.


== Requirements ==
Data Stewards:
* [https://people.mozilla.org/p/kennylong/ Kenny Long]
* [https://people.mozilla.org/p/jhirsch Jared Hirsch]
* [https://people.mozilla.org/p/adavis Alex Davis]
* [https://people.mozilla.org/p/TheOne Andreas Wagner]
* [https://people.mozilla.org/p/tlong/ Travis Long]
* [https://people.mozilla.org/p/willkg Will Kahn-Greene]
* [https://people.mozilla.org/p/p--n8wmyowcldls6pvp6ab1pj Roger Yang]
* [https://people.mozilla.org/p/sancus :sancus]
* [https://people.mozilla.org/p/charlie-humphreys Charlie Humphreys]
* [https://people.mozilla.org/p/cboozarjomehri Cameron Boozarjomehri]
* [https://people.mozilla.org/p/chutten/ :chutten]
* [https://people.mozilla.org/p/sergiosonline Sergio Betancourt]
* [https://people.mozilla.org/p/aminomancer Shane Hughes]
* [https://people.mozilla.org/p/roux Roux Buciu]
* [https://people.mozilla.org/p/groovecoder Luke Crouch]


'''Requirements For All Data Collection From Firefox'''
Data stewards come from a variety of teams within Mozilla, including data science, Firefox engineering, mobile products, Pocket, Common Voice, AMO, and Thunderbird. You are welcome to tag any steward for any collection request, without respect to the nature of your collection.


* Specifics about the collected data must be documented using the in-tree histogram definitions or the in-tree documentation system (.rst files). This documentation should be detailed enough that people don't need to read the code implementation to understand what data is being collected.
Contact Us on Matrix https://chat.mozilla.org/#/room/#data-stewards:mozilla.org
* Any changes to data collection must be approved by the data collection module owner or peers by requesting review on the patch which updates the in-tree documentation.
* The bug or documentation must publicly identify the problem statement that will be solved by collecting data.
* There must be a person who takes responsibility for the correctness of the data.
* There must be a concrete plan for using the data, and a person who takes responsibility for this.
* The data must be included in the Firefox privacy notice. Much of the time, data collection requires no changes, but when changes are required the data stewards will work with Marshall Erwin and the Mozilla legal team to make sure that the privacy notice accurately reflects the collected data.


''Note: the data stewards do not typically verify that the patch collects the data correctly according to the documentation. That is the responsibility of the code reviewer.''
''Note: The data stewards aren't responsible for showing teams how to collect data, although they might be able to provide some guidance if they have time. But the Firefox data engineering team has prepared [http://docs.telemetry.mozilla.org/ data documentation] which can help!''


''' Requirements for opt-in (telemetry)'''
Most assets involved in data review can be found [https://github.com/mozilla/data-review in this repository].  References to who fills out a form when are covered in the documentation below.


* The requesting team must have a plan to use the data.
= Scope =
* Exploratory data collection should be temporary and expire after 6 months/5 release cycles but may be renewed as long as the data is still valuable and the team is using it.
* Permanent or longer-term data collection should have a plan for permanent monitoring.
** ''Note: automated monitoring is preferred. However, the current telemetry alerting system is only designed for scalar performance measurements.''


'''Requirements for opt-out data collection (Firefox Health Report)'''
These guidelines are '''required''' for data collection in products with an active user base and established privacy policies under the Firefox organization, but may be applied to any Mozilla product as needed. Changes to policies themselves or the creation of a policy for a new product is out of scope of what is described here.


* The data must provide ''user value''.
= Key Roles for Data Collection =


Here are some examples of providing user value:
While the number of people involved in data collection can vary by product or project, there are two roles necessary for any project:  


* Providing features or services directly to individual users.
* Data requester - the person requesting data to be collected
* Providing features or services to Firefox users as a whole.
* Data steward - the person who ensures the data collection process is followed and that requested data complies with Mozilla policies
* Monitoring and solving product quality issues.
* Exploratory understanding of how users are using the product, ''if that can be tied back to specific product improvements''. This data collection should typically be limited to 6 months/5 release cycles and can be renewed only as long as it still ties to specific product improvements.


Some data is collected as correlations: for example, we collect information about user operating system and version, and about user addon and addon versions, so that we can correlate and monitor crash rates and other error metrics against these groupings.
In some cases, a data steward may escalate concerns to the Trust and Legal teams. These teams are responsible for defining Firefox data collection policies and can field questions about internal policy and laws governing user privacy.


''It is not sufficient that collecting data benefits Mozilla (the company/project). It is necessary that we can map collected data to benefits for either individual Firefox users or Firefox users as a group.''
Mozilla always strives to make data reviews public. However, there are sometimes limited sets of circumstances when we may conduct our reviews in a private bug; for example, a service is part of an agreement where the partnership is not yet public.  These reviews will be made public once the actual data collection begins.


== Requesting Approval ==
= Adding or Modifying Data Collection =
The process is slightly different for collections in [https://hg.mozilla.org/mozilla-central/ mozilla-central] code (Firefox Desktop, Firefox & Focus for Android, and Gecko) than it is elsewhere. Please consult the relevant section below.


It is our intention to review every new data collection within Firefox, but to do so quickly and with minimal overhead. For every new measurement, even a simple new Telemetry probe, please request approval by setting the feedback flag for the data collection module owner or a peer. Simple requests should be handled within a day.
== Firefox Desktop, Firefox and Focus for Android, Gecko (from May 7, 2024) ==


More complex requests, and especially requests which add a new kind of data collection mechanism or require changes to the privacy notice, will require more extensive review. Please consider pinging the team about these as they are being designed! Additional discussions/review may include:
When a developer uploads a change to Phabricator that adds or modifies any data collection, Phabricator will automatically add the <tt>needs-data-classification</tt> tag, and explain what happens next.


* Privacy analysis: This may involve requesting feedback from the mozilla.dev.privacy mailing list and/or privacy experts within and outside of Mozilla to discuss the feature and its privacy impact.
If you’re adding or modifying data collection in your Phabricator revision and this doesn’t happen automatically, please manually add this tag and then follow the same procedure.
* Data compliance review: a review with the Mozilla data compliance team to ensure that changes match the Mozilla data compliance policies and documents.
* Legal review: If necessary, the module owner will request a legal review from Mozilla's legal team. A legal review will be necessary for any changes to the privacy policies/notices.
* Data quality/statistical review: In cases where data analysis and quality is uncertain, the module owner will request additional feedback from the Mozilla metrics team and other experts to validate data analysis plans.
* UX review: We may request/require feedback from the Firefox UX team on any proposed privacy/data-control UI.


== Common Problems ==
Once this tag is in place, Herald will ask the patch author and reviewer to assess the [[#Data_Collection_Categories|correct category for the data collection]]:


'''histogram descriptions'''
* If the data being collected fits in the “technical data” or “interaction data” categories described there, use the <tt>data-classification-low</tt> tag.
* If it’s any other category, or patch author and reviewer disagree about the right category, use the <tt>data-classification-high</tt> tag, and go through [[#Step_3:_Sensitive_Data_Collection_Review_Process|the sensitive data collection review process]].
* If you think that the data in question fits in “technical” or “interaction” data but would benefit from additional review, you can also explicitly choose to use the <tt>data-classification-high</tt> tag and thereby opt in to the sensitive data collection review process.
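
The tagging rules above can be sketched as a small decision function. This is only an illustration: the <tt>Category</tt> enum and the <tt>choose_tag</tt> helper are hypothetical names introduced here, not part of Phabricator, Herald, or any Mozilla tooling. Only the two tag names come from this page.

```python
from enum import IntEnum


class Category(IntEnum):
    """The four data collection categories described on this page."""
    TECHNICAL = 1
    INTERACTION = 2
    STORED_CONTENT = 3
    HIGHLY_SENSITIVE = 4


LOW_TAG = "data-classification-low"
HIGH_TAG = "data-classification-high"


def choose_tag(author_view, reviewer_view, wants_extra_review=False):
    """Return the tag suggested by the rules above (hypothetical helper)."""
    # If author and reviewer disagree, the sensitive review process applies.
    if author_view != reviewer_view:
        return HIGH_TAG
    # Either party may opt in to the sensitive review process.
    if wants_extra_review:
        return HIGH_TAG
    # Only "technical data" and "interaction data" qualify for the low tag.
    if author_view in (Category.TECHNICAL, Category.INTERACTION):
        return LOW_TAG
    return HIGH_TAG
```

For example, <tt>choose_tag(Category.TECHNICAL, Category.INTERACTION)</tt> yields the high tag because the two assessments differ, even though each category on its own would qualify for the low tag.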


* Histogram descriptions should record *what* is being collected, in detail.
When using Glean for the data collection, the data classification of the new or expanded data collections should match the <tt>data_sensitivity</tt> property in the metric definitions. The entry in the <tt>data_reviews</tt> list should reflect the bug URL.
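
As a rough illustration of that consistency check, the sketch below treats a metric definition as a plain Python dictionary. The helper names, the mapping from sensitivity values to tags, and the bug URL are assumptions made for this example. In particular, the mapping assumes that <tt>technical</tt> and <tt>interaction</tt> correspond to the low tag and every other value to the high tag, which is inferred from the rules on this page rather than taken from Glean itself.

```python
# Sensitivity values assumed to correspond to the low classification tag.
LOW_SENSITIVITIES = {"technical", "interaction"}


def expected_tag(metric):
    """Return the classification tag implied by a metric's data_sensitivity."""
    sensitivities = set(metric.get("data_sensitivity", []))
    if not sensitivities:
        raise ValueError("metric has no data_sensitivity entry")
    # Every listed sensitivity must be low for the low tag to apply.
    if sensitivities <= LOW_SENSITIVITIES:
        return "data-classification-low"
    return "data-classification-high"


def references_bug(metric, bug_url):
    """Check that the bug URL appears in the metric's data_reviews list."""
    return bug_url in metric.get("data_reviews", [])


# Hypothetical metric definition; the bug number is a placeholder.
example_metric = {
    "data_sensitivity": ["interaction"],
    "data_reviews": ["https://bugzilla.mozilla.org/show_bug.cgi?id=0000000"],
}
```

A reviewer could compare <tt>expected_tag(example_metric)</tt> with the tag actually applied to the revision and treat any mismatch as a prompt to revisit either the tag or the metric definition.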
* It is important to say *when* a value is recorded, because this is often a confusing point when constructing analysis.
* Include units: for example, indicate whether a time duration is measured in seconds, milliseconds, or microseconds.
* When counting, be sure to indicate how repeat usage works. For example, when counting decoding errors, are multiple issues counted for the same video, or only the first one?


'''enumerated histograms'''
If the reviewer is unsure or feels uncomfortable making this assessment themselves, they can [mailto:data-stewards@mozilla.com email the data stewards group] or [https://chat.mozilla.org/#/room/#data-stewards:mozilla.org contact them on Matrix] for help.


Enumerated histograms should either list all the possible enumeration values in the histogram description, or reference a declared enumeration in the tree by name.
Whichever tag you choose, please '''leave a comment explaining your choice'''. Note that you will not be able to land this revision until the revision has one of these tags and you remove the <tt>needs-data-classification</tt> tag. For low sensitivity data collection, you will be able to land the patch once this sensitivity is marked and you remove the <tt>needs-data-classification</tt> tag. For high sensitivity data collection, the [https://phabricator.services.mozilla.com/project/view/209/ <tt>data-stewards</tt>] group will be added as a blocking reviewer on the patch. They will approve or request changes to the patch based on the [[#Step_3:_Sensitive_Data_Collection_Review_Process|sensitive data collection review process]].
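
The landing conditions in the paragraph above can be summarized in a short sketch. The <tt>can_land</tt> function and its parameters are hypothetical and exist only for illustration. The real checks are enforced by Phabricator and Herald, not by code like this.

```python
from typing import Iterable

NEEDS_TAG = "needs-data-classification"
LOW_TAG = "data-classification-low"
HIGH_TAG = "data-classification-high"


def can_land(tags: Iterable[str], stewards_approved: bool = False) -> bool:
    """Approximate the landing rules described above (hypothetical helper)."""
    tag_set = set(tags)
    # The pending-classification tag must be removed before landing.
    if NEEDS_TAG in tag_set:
        return False
    # High sensitivity: the data-stewards group is a blocking reviewer.
    if HIGH_TAG in tag_set:
        return stewards_approved
    # Low sensitivity: the tag alone is enough once the pending tag is gone.
    return LOW_TAG in tag_set
```

Note that a revision carrying neither classification tag is treated as not landable here, which matches the requirement that one of the two tags be present.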


'''keyed histograms'''
Patch authors are encouraged to add these tags themselves, but '''reviewers are responsible for making sure the right tag is used'''.


Keyed histograms contain arbitrary strings in the key, so they get extra attention. Please be careful of:
If you do not yet have a code change but are in the planning stages of a change and want to proactively discuss data collection options, reach out to [mailto:data-stewards@mozilla.com the data stewards group].


* Don't use a keyed histogram if you don't need it! Many times a simpler format such as a count or enumeration histogram can solve the same problem.
== Other Products ==
* The key should not contain user-input data, or other data that can be used to identify particular users.
* In general, keys should be a limited set of values. If you expect more than tens of values, the default aggregations for this histogram will blow up. If this is still required, you should file a bug to have the default aggregations disabled.
* The histogram description should describe exactly what the key contains, and the format.


'''JS exceptions'''
== Step 1: Submit Request ==
To request a review for new or changed Data Collection in a Mozilla product, Data Review requesters are required to provide the following:
* A completed Request Form, documenting what data is to be collected, why Mozilla needs to collect this data, how much data will be collected, and for how long it will be collected:
** Take [https://github.com/mozilla/data-review/blob/main/request.md this request] and fill it out completely.
*** (If you are renewing a previously-reviewed data collection, you may use [https://github.com/mozilla/data-review/blob/main/renewal_request.md this shorter form] instead.)
** If your collection uses [https://mozilla.github.io/glean/book/index.html Glean], you can [https://blog.mozilla.org/data/2021/09/07/this-week-in-glean-data-reviews-are-important-glean-parser-makes-them-easy/ use <tt>glean_parser</tt> to generate a partially-filled template for you].
* A bug to attach the completed Request Form to:
** If you already have a bug filed to add the collection code, attach the form to that one.
** If you don't already have a bug, file a new one in your own component, or Firefox::Untriaged if you don't have a component (e.g. if your code's in GitHub).
** Tell Bugzilla that your form's extension is <tt>.txt</tt> so it can render it inline and so your Data Steward can review it more easily.
* A notification so the Data Steward knows it's time to review your Request Form:
** Flag the attached, completed Request Form for <tt>data-review</tt> by setting the <tt>data-review</tt> flag to <tt>?</tt> and entering your chosen Data Steward in the "Requestee" field that appears.
** If a Data Steward doesn't get to your review within a couple of days, please [https://chat.mozilla.org/#/room/#data-stewards:mozilla.org reach out to us on Element].


It is a common request to record information about JS exceptions in certain contexts. In the general case, it is very difficult to prove that this information cannot contain personal data. If you have specific types of errors which are thrown at known locations, you can record information about those. This is an important but unsolved problem.
== Step 2: Request is reviewed ==
Data stewards review each request to ensure that it is documented fully and to assign the data collection to one of our four privacy categories, described under Data Collection Categories below. The detailed steps in this process are:


== Data Collection Properties ==
* Data stewards receive a <tt>data-review?</tt> on a file in a bug
* Data stewards complete the [https://github.com/mozilla/data-review/blob/main/review.md data review form] based on the information provided in the data collection request. They ensure that:
** The request follows Lean Data Practices & Guidelines.
** The basic mechanics of what is being measured are documented publicly.
** Our need and justification for the data collection are documented for the record; e.g. there are complete and appropriate answers to questions on the request form.
** The request aligns with the user consent and control mechanisms outlined in the data collection categories listed below.


When proposing a new measurement or data system, please consider your requirements and the necessary data properties:
Data stewards document the outcome of their review in the bug with a <tt>data-review+</tt> or <tt>data-review-</tt> and their completed form. Typical outcomes include:
* Unapproved requests are returned to data requesters for changes or clarification.
* Simple requests that fall within Category 1 or 2 are often approved quickly.
* Complex requests that pose broader policy and legal implications may be escalated to the Trust and Legal teams. (See Step 3)
== Step 3: Sensitive Data Collection Review Process ==


Function:
=== Determine if you need to follow this process ===
* Is the data collection necessary for Firefox to function properly? For example, the automatic update check must be sent in order to keep Firefox up to date.
* Is there a specific user-visible function planned for the data?
* Population: Is it necessary to take a measurement from all users? Or is it sufficient to measure only prerelease users?
* Sampling: is it necessary to get data from all users, or is it sufficient to collect data from a smaller sample?
* Will data submission be automatic, or will there be opt-in UI?


Analysis and Reporting:
For any data collection that is classified as category 3 or 4 (described below) – including in pre-release channels and experiments – we require additional review to be performed and an announcement to a mailing list. The reason for this is that while our privacy policies describe what we can do without additional user notice, this is an upper bound; even for collection which fits within the policy, we need to determine whether that collection is appropriate and conforms to our overall commitment to privacy and minimization.
* Who will be analyzing the data?
* Will the data that's being collected answer the questions we have?
* Will it be a single or periodic report?
* Is it desirable to track data changes over time? With what frequency? With what latency?
* Will the data reporting be private or public?
* Will the raw data being collected be private or public?
* Is it necessary to keep the measurement forever, or is it sufficient to run a short-term experiment/single report?


Privacy (and Legal):
=== Create documentation and request review ===
* Does the data contain sensitive or personal information?
* Can the data be used in combination with other measurements to identify a particular person?
* What kind of user controls will be exposed to control data submission?
* Will users be able to see their own data before or after it has been submitted, either within Firefox or from the server?
* Does the data conform to the existing Mozilla [https://www.mozilla.org/en-US/privacy/principles/ privacy principles], the [https://www.mozilla.org/en-US/privacy/ Mozilla Privacy Policy], and the [https://www.mozilla.org/en-US/privacy/firefox/ Firefox privacy notice]?
* Does this data collection represent any unusual privacy or legal risk to users or Mozilla?


As a first step, it is important that the details of the implementation, intended use, and value to users be clearly documented for future reference and efficient review. As soon as this is ready (we recommend as early as possible, before you move forward with the implementation), send an email to the [https://groups.google.com/a/mozilla.com/g/data-review data-review@mozilla.com] mailing list.


The initial documentation from engineering/data stewardship and privacy/technical review should be completed as a prerequisite ahead of legal and security.


== Other Practices ==
{| class="wikitable"
|-
! Risk Assessment !! Owner !! Facilitator
|-
| Privacy/Technical Review || Office of the Firefox CTO || Martin Thompson
|-
| Legal/Trust Review || Legal || Nneka Soyinka
|-
| Security Review || Office of the CSO || Marc Perreault
|-
| Data Review || Data || Mark Reid
|}


Every year, the data collection owner and peers will survey all of the existing data collection systems with Firefox. This survey has the following goals:
Facilitators (named above) are expected to express judgement about how much risk is involved and will involve the appropriate reviewers.
 
If the level of risk is determined to be low enough and/or there is clear precedent, further discussion may not be necessary and each reviewer may give a sign-off immediately; otherwise, mitigations should be incorporated and documentation updated once they have been addressed. Live discussion is often very helpful – and should be planned for – when there is significant risk involved.
 
Data collection may not be shipped to users until final sign-offs have been obtained.
 
=== Escalation ===
In the case of a dispute about sensitive data collection and/or which mitigations are appropriate, the proposer or any reviewer should work with one of the facilitators to escalate the decision to the VP/XLT member in charge of the product (e.g., Head of Firefox, Head of Pocket). Depending on the scope and nature of the risk, there may also be cases where escalation goes beyond the immediate product owner (i.e., to the CPO or CEO). When this happens, the facilitator and escalating party:
 
* Give each party a chance to document their recommended approach in writing.
* Share the document with all involved parties for asynchronous review/comment.
* Schedule a meeting for discussion if necessary.
* Record the final decision by the product owner.
 
= Data Collection Categories =
 
There are four "categories" of data collection:
 
; '''Category 1 “Technical data”'''
: This includes information about the machine or software application itself in which there is no or little risk of personal identification.
 
:Examples include OS, crashes and errors, outcome of automated processes like updates, activation, version #s, etc.  This also includes aggregated compatibility information about features and API usage by websites, addons, and other 3rd-party software that interact with the application during usage.
 
:  It also includes information about the user's settings that is necessary to provide functionality. For example, what applications users have connected to a service or what services users have logged into using a Mozilla account.
 
; '''Category 2 “Interaction data”'''
: This includes information about the user’s direct engagement with the service in which there is no or little risk of personal identification.
 
:Examples include how many devices a user has synced, engagement with specific features like clicks, scroll position, audio and session length, status of user preferences, and account activity levels.
 
:  It also includes information about the user's in-product journeys and product choices helpful to understand engagement (attitudes). For example, selections of add-ons or tiles to determine potential interest categories etc.
 
; '''Category 3 “Stored Content & Communications”'''
: This includes information about what people store, sync, communicate or connect to where the information is generally considered to be more sensitive and personal in nature. 
 
:Examples include users' saved URLs or URL history, specific web browsing history, general information about their web browsing history (such as TLDs or categories of webpages visited over time) and potentially certain types of interaction data about specific web pages or stories visited (such as highlighted portions of a story).
 
:  It also includes information such as content saved by users to an individual account like saved URLs, tags, notes, passwords and files as well as communications that users have with one another through a Mozilla service.
 
; '''Category 4 “Highly sensitive or clearly identifiable personal data”'''
: Information that directly identifies a person, or if combined with other data could identify a person. This data may be embedded within specific website content, such as memory contents, dumps, captures of screen data, or DOM data.
 
:  Examples include account registration data like name, password, and email address associated with an account, payment data in connection with subscriptions or donations, contact information such as phone numbers or mailing addresses, email addresses associated with surveys, promotions and customer support contacts.
 
:  It also includes any data from different categories that, when combined, can identify a person, device, household or account.  For example:  Category 1 log data combined with Category 3 saved URLs.
 
: Additional examples are:  voice audio commands (including a voice audio file), speech-to-text or text-to-speech (including transcripts), biometric data, demographic information, and precise location data associated with a persistent identifier, individual or small population cohorts.  This is location inferred or determined from mechanisms other than IP such as wi-fi access points, Bluetooth beacons, cell phone towers or provided directly to us, such as in a survey or a profile.
 
== Eligibility for Default on Data Collection ==
 
At installation, Mozilla’s products and services include one or more preferences and settings.  These preferences and settings typically belong to a data collection state:  a status that describes whether data collection occurs by default or not.
 
{| class="wikitable"
|-
! State !! What it Means
|-
| Default ON || Data may be collected automatically. 
 
Users must have a way to turn off data collection. [https://support.mozilla.org/en-US/kb/telemetry-clientid Learn how to opt out] of data collection in Firefox.
|-
| Default OFF || Data may be collected, but only if a user takes a clear, express action to opt in to the collection.  This can be through a configuration option, a prompt or an update through an account profile. 
 
Users must have a way to turn off data collection.
 
|}
 
“'''Release'''” means products that are not experimental. These include Firefox, Pocket, Lockwise, Monitor, and others.
 
“'''Pre-release'''” means experimental products. They are typically identified by the words “Beta,” “Nightly,” “Preview,” “Reference Browser,” or “Developer Edition” in the name of the product.
 
{| class="wikitable"
|-
! Category 1 “Technical data”
|-
|  ''Release & Pre-Release'' - eligible for Default ON.
|}
{| class="wikitable"
|-
! Category 2  “Interaction data”
|-
|  ''Release & Pre-Release'' - eligible for Default ON.
|}
{| class="wikitable"
|-
! Category 3 “Stored Content and Communications”
|-
| ''Release'':  Default OFF.  Default ON requires prior Trust approval.
 
''Pre-Release'': Default ON eligible
 
On a case-by-case basis, collections may be eligible to be "Default ON" if mitigations are identified. Mitigations may include UX changes that make users aware of additional risk, technical mechanisms that remove the risk, or a risk assessment done on a case-by-case basis that determines the risk is limited.
|}
{| class="wikitable"
|-
! Category 4 “Highly Sensitive or Clearly identifiable personal data”
|-
| ''Release & Pre-Release'': Default OFF
Any collection requires prior Trust approval and (i) advance user notice, (ii) consent, and (iii) an opt-out.
|}
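
The eligibility tables above can be read as a lookup from category and release channel to a default state. The following sketch encodes that reading. The function name and return strings are hypothetical, and the sketch deliberately leaves out the case-by-case mitigations and Trust approvals, which remain human decisions.

```python
def default_state(category: int, pre_release: bool) -> str:
    """Return the baseline default state for a category (hypothetical helper)."""
    if category in (1, 2):
        # Technical and interaction data: same rule for all channels.
        return "eligible for Default ON"
    if category == 3:
        # Stored content and communications: depends on the channel.
        # In Release, Default ON is possible only with prior Trust approval.
        return "eligible for Default ON" if pre_release else "Default OFF"
    if category == 4:
        # Highly sensitive data: Default OFF everywhere; any collection also
        # needs Trust approval, advance notice, consent, and an opt-out.
        return "Default OFF"
    raise ValueError(f"unknown category: {category}")
```

For instance, <tt>default_state(3, pre_release=False)</tt> returns <tt>"Default OFF"</tt>, reflecting the Release row of the Category 3 table.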
 
= Other Practices =
 
Every year, the data collection owner and peers will survey all of the existing data collection systems for their product or project. This survey has the following goals:


* To ensure that it is still necessary and useful to collect a piece of data.
* To re-identify who is responsible for the collection, monitoring, and reporting of collected data.
= Additional References =
[[Data Publishing | Data Publishing process]]

Latest revision as of 19:50, 11 June 2024

At Mozilla, like at many other organizations, we rely on data to make product decisions. But here, unlike many other organizations, we balance our goal of collecting useful, high-quality data with our goal to give users meaningful choice and control over their own data. The Mozilla data collection program was created to ensure we achieve both goals whenever we make a change to how we collect data in our products.

In November 2017, we revised the program to make our policies clearer and easier to understand and our processes simpler and easier to follow. These changes are designed to reflect our commitment to data collection grounded in:

  • Necessity - We collect only as much data as is necessary when we can demonstrate a clear business case for that data
  • Privacy - We give users meaningful choices and control over their own data
  • Transparency - We make our decisions about data collection public and accessible
  • Accountability - We assign accountability for the design, approval, and implementation of data collection

Owner: Nneka Soyinka

Data Stewards:

Data stewards come from a variety of teams within Mozilla, including data science, Firefox engineering, mobile products, Pocket, Common Voice, AMO, and Thunderbird. You are welcome to tag any steward for any collection request, without respect to the nature of your collection.

Contact Us on Matrix https://chat.mozilla.org/#/room/#data-stewards:mozilla.org

Note: The data stewards aren't responsible for showing teams how to collect data, although they might be able to provide some guidance if they have time. But the Firefox data engineering team has prepared data documentation which can help!

Most assets involved in data review can be found in this repository. References to who fills out a form when are covered in the documentation below.

Scope

These guidelines are required for data collection in products with an active user base and established privacy policies under the Firefox organization, but may be applied to any Mozilla product as needed. Changes to policies themselves or the creation of a policy for a new product is out of scope of what is described here.

Key Roles for Data Collection

While the number of people involved in data collection can vary by product or project, there are two roles necessary for any project:

  • Data requester - the person requesting data to be collected
  • Data steward - the person who ensures the data collection process is followed and that requested data complies with Mozilla policies

In some cases, a data steward may escalate concerns to the Trust and Legal teams. These teams are responsible for defining Firefox data collection policies and can field questions about internal policy and laws governing user privacy.

Mozilla always strives to make data reviews public. However, in limited circumstances we may conduct a review in a private bug; for example, when a service is part of an agreement where the partnership is not yet public. These reviews will be made public once the actual data collection begins.

Adding or Modifying Data Collection

The process is slightly different for collections in mozilla-central code (Firefox Desktop, Firefox & Focus for Android, and Gecko) than it is elsewhere. Please consult the relevant section below.

Firefox Desktop, Firefox and Focus for Android, Gecko (from May 7, 2024)

When a developer uploads a change to Phabricator that adds or modifies any data collection, Phabricator will automatically add the needs-data-classification tag, and explain what happens next.

If you’re adding or modifying data collection in your Phabricator revision and this doesn’t happen automatically, please manually add this tag and then follow the same procedure.

Once this tag is in place, Herald will ask the patch author and reviewer to assess the correct category for the data collection:

  • If the data being collected fits in the “technical data” or “interaction data” categories described there, use the data-classification-low tag.
  • If it’s any other category, or patch author and reviewer disagree about the right category, use the data-classification-high tag, and go through the sensitive data collection review process.
  • If you think that the data in question fits in “technical” or “interaction” data but would benefit from additional review, you can also explicitly choose to use the data-classification-high tag and thereby opt in to the sensitive data collection review process.

When using Glean for the data collection, the data classification of the new or expanded data collections should match the data_sensitivity property in the metric definitions. The entry in the data_reviews list should reflect the bug URL.
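The consistency rule above can be sketched as a small check. This is a minimal illustration, not part of any Mozilla tooling: the function name, the dictionary shape, and the assumption that the low-sensitivity labels in data_sensitivity are "technical" and "interaction" are all hypothetical.

```python
# Assumed labels for the two low-sensitivity categories (hypothetical).
LOW_SENSITIVITY = {"technical", "interaction"}


def classification_problems(tag: str, metric: dict, bug_url: str) -> list[str]:
    """Return human-readable problems; an empty list means the metric is consistent."""
    problems = []
    sensitivity = set(metric.get("data_sensitivity", []))
    if not sensitivity:
        problems.append("metric has no data_sensitivity")
    elif tag == "data-classification-low" and not sensitivity <= LOW_SENSITIVITY:
        # A low tag must not hide a more sensitive metric definition.
        problems.append("low tag used with non-low data_sensitivity")
    if bug_url not in metric.get("data_reviews", []):
        problems.append("data_reviews does not list the bug URL")
    return problems
```

The high tag is accepted with any sensitivity because authors may opt in to the sensitive data collection review process even for technical or interaction data.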

If the reviewer is unsure or feels uncomfortable making this assessment themselves, they can email the data stewards group or contact them on Matrix for help.

Whichever tag you choose, please leave a comment explaining your choice. You cannot land the revision until it has one of these tags and you have removed the needs-data-classification tag. For low sensitivity data collection, nothing further is required. For high sensitivity data collection, the data-stewards group will be added as a blocking reviewer on the patch; they will approve or request changes to the patch based on the sensitive data collection review process.
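The tagging and landing rules in this section can be summarized as the following sketch. The tag names come from the text; the function names and the boolean inputs are hypothetical, and the code only models the described policy rather than how Phabricator or Herald actually enforce it.

```python
LOW_CATEGORIES = {"technical", "interaction"}


def choose_tag(category: str, author_and_reviewer_agree: bool,
               wants_extra_review: bool = False) -> str:
    """Pick the classification tag for a revision that touches data collection."""
    if (category in LOW_CATEGORIES and author_and_reviewer_agree
            and not wants_extra_review):
        return "data-classification-low"
    # Any other category, a disagreement, or an explicit opt-in goes high.
    return "data-classification-high"


def can_land(tags: set[str], stewards_approved: bool = False) -> bool:
    """Assumes the revision was flagged as adding or modifying data collection."""
    if "needs-data-classification" in tags:
        return False
    if "data-classification-high" in tags:
        # data-stewards is a blocking reviewer for high sensitivity.
        return stewards_approved
    return "data-classification-low" in tags
```

For example, `choose_tag("interaction", True)` yields the low tag, while the same category with `wants_extra_review=True` yields the high tag and therefore requires data steward approval before landing.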

Patch authors are encouraged to add these tags themselves, but reviewers are responsible for making sure the right tag is used.

If you do not yet have a code change but are in the planning stages of a change and want to proactively discuss data collection options, reach out to the data stewards group.

Other Products

Step 1: Submit Request

To request a review for new or changed Data Collection in a Mozilla product, Data Review requesters are required to provide the following:

  • A completed Request Form, documenting what data is to be collected, why Mozilla needs to collect this data, how much data will be collected, and for how long it will be collected.
  • A bug to attach the completed Request Form to:
    • If you already have a bug filed to add the collection code, attach the form to that one.
    • If you don't already have a bug, file a new one in your own component, or Firefox::Untriaged if you don't have a component (e.g. if your code's in GitHub).
    • Tell Bugzilla that your form's extension is .txt so it can render it inline and so your Data Steward can review it more easily.
  • A notification so the Data Steward knows it's time to review your Request Form:
    • Flag the attached, completed Request Form for data-review by setting the data-review flag to ? and entering your chosen Data Steward in the "Requestee" field that appears.
    • If a Data Steward doesn't get to your review within a couple of days, please reach out to us on Element.

Step 2: Request is reviewed

Data stewards review each request to ensure that it is documented fully and to assign the data collection to one of our four privacy categories, described under Data Collection Categories below. The detailed steps in this process are:

  • Data stewards receive a data-review? on a file in a bug
  • Data stewards complete the data review form based on the information provided in the data collection request. They ensure that:
    • The request follows Lean Data Practices & Guidelines.
    • The basic mechanics of what is being measured are documented publicly.
    • Our need and justification for the data collection are documented for the record; e.g. there are complete and appropriate answers to questions on the request form.
    • The request aligns with the user consent and control mechanisms outlined in the data collection categories listed below.

Data stewards document the outcome of their review in the bug with a data-review+ or data-review- and their completed form. Typical outcomes include:

  • Unapproved requests are returned to data requesters for changes or clarification.
  • Simple requests that fall within Category 1 or 2 are often approved quickly.
  • Complex requests that pose broader policy and legal implications may be escalated to the Trust and Legal teams. (See Step 3)

Step 3: Sensitive Data Collection Review Process

Determine if you need to follow this process

For any data collection that is classified as category 3 or 4 (described below) – including in pre-release channels and experiments – we require additional review to be performed and an announcement to a mailing list. The reason for this is that while our privacy policies describe what we can do without additional user notice, this is an upper bound; even for collection which fits within the policy, we need to determine whether that collection is appropriate and conforms to our overall commitment to privacy and minimization.

Create documentation and request review

As a first step, it is important that the details of the implementation, intended use, and value to users be clearly documented for future reference and efficient review. As soon as this is ready (we recommend as early as possible, before you move forward with the implementation), send an email to the data-review@mozilla.com mailing list.

The initial documentation from engineering/data stewardship and the privacy/technical review should be completed as a prerequisite, ahead of the legal and security reviews.

Risk Assessment | Owner | Facilitator
Privacy/Technical Review | Office of the Firefox CTO | Martin Thompson
Legal/Trust Review | Legal | Nneka Soyinka
Security Review | Office of the CSO | Marc Perreault
Data Review | Data | Mark Reid

Facilitators (named above) are expected to express judgement about how much risk is involved and will involve the appropriate reviewers.

If the level of risk is determined to be low enough and/or there is clear precedent, further discussion may not be necessary and each reviewer may give a sign-off immediately; otherwise, mitigations should be incorporated and documentation updated once they have been addressed. Live discussion is often very helpful – and should be planned for – when there is significant risk involved.

Data collection may not be shipped to users until final sign-offs have been obtained.

Escalation

In the case of a dispute about sensitive data collection and/or which mitigations are appropriate, the proposer or any reviewer should work with one of the facilitators to escalate the decision to the VP/XLT member in charge of the product (e.g., Head of Firefox, Head of Pocket). Depending on the scope and nature of the risk, there may also be cases where escalation goes beyond the immediate product owner (i.e., to the CPO or CEO). When this happens, the facilitator and escalating party:

  • Give each party a chance to document their recommended approach in writing.
  • Share the document with all involved parties for asynchronous review/comment.
  • Schedule a meeting for discussion if necessary.
  • Record the final decision by the product owner.

Data Collection Categories

There are four "categories" of data collection:

Category 1 “Technical data”
This includes information about the machine or software application itself in which there is little or no risk of personal identification.
Examples include OS, crashes and errors, outcome of automated processes like updates, activation, version #s, etc. This also includes aggregated compatibility information about features and API usage by websites, addons, and other 3rd-party software that interact with the application during usage.
It also includes information about the user's settings that is necessary to provide functionality. For example, what applications users have connected to a service or what services users have logged into using a Mozilla account.
Category 2 “Interaction data”
This includes information about the user’s direct engagement with the service in which there is little or no risk of personal identification.
Examples include how many devices a user has synced, engagement with specific features like clicks, scroll position, audio and session length, status of user preferences, and account activity levels.
It also includes information about the user's in-product journeys and product choices helpful to understand engagement (attitudes). For example, selections of add-ons or tiles to determine potential interest categories etc.
Category 3 “Stored Content & Communications”
This includes information about what people store, sync, communicate or connect to where the information is generally considered to be more sensitive and personal in nature.
Examples include users' saved URLs or URL history, specific web browsing history, general information about their web browsing history (such as TLDs or categories of webpages visited over time) and potentially certain types of interaction data about specific web pages or stories visited (such as highlighted portions of a story).
It also includes information such as content saved by users to an individual account like saved URLs, tags, notes, passwords and files as well as communications that users have with one another through a Mozilla service.
Category 4 “Highly sensitive or clearly identifiable personal data”
Information that directly identifies a person, or if combined with other data could identify a person. This data may be embedded within specific website content, such as memory contents, dumps, captures of screen data, or DOM data.
Examples include account registration data like name, password, and email address associated with an account, payment data in connection with subscriptions or donations, contact information such as phone numbers or mailing addresses, email addresses associated with surveys, promotions and customer support contacts.
It also includes any data from different categories that, when combined, can identify a person, device, household or account. For example: Category 1 log data combined with Category 3 saved URLs.
Additional examples are: voice audio commands (including a voice audio file), speech-to-text or text-to-speech (including transcripts), biometric data, demographic information, and precise location data associated with a persistent identifier, individual or small population cohorts. This is location inferred or determined from mechanisms other than IP such as wi-fi access points, Bluetooth beacons, cell phone towers or provided directly to us, such as in a survey or a profile.

Eligibility for Default on Data Collection

At installation, Mozilla’s products and services include one or more preferences and settings. These preferences and settings typically belong to a data collection state: a status that describes whether data collection occurs by default or not.

State | What it Means
Default ON | Data may be collected automatically. Users must have a way to turn off data collection. Learn how to opt out of data collection in Firefox.
Default OFF | Data may be collected, but only if a user takes a clear, express action to opt in to the collection. This can be through a configuration option, a prompt, or an update through an account profile. Users must have a way to turn off data collection.

“Release” means products that are not experimental. These include Firefox, Pocket, Lockwise, Monitor, and others.

“Pre-release” means experimental products. They are typically identified by the words “Beta,” “Nightly,” “Preview,” “Reference Browser,” or “Developer Edition” in the name of the product.

Category 1 “Technical data”
Release & Pre-Release - eligible for Default ON.
Category 2 “Interaction data”
Release & Pre-Release - eligible for Default ON.
Category 3 “Stored Content and Communications”
Release: Default OFF. Default ON requires prior Trust approval.

Pre-Release: Default ON eligible

On a case-by-case basis, collections may be eligible to be "Default ON" if mitigations are identified. Mitigations may include UX changes that make users aware of additional risk, technical mechanisms that remove the risk, or a risk assessment done on a case-by-case basis that determines the risk is limited.

Category 4 “Highly Sensitive or Clearly identifiable personal data”
Release & Pre-Release: Default OFF

Any collection requires prior Trust approval and (i) advance user notice (ii) consent and (iii) an opt-out.
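As a quick reference, the eligibility rules above can be written as a lookup table. This is only an illustrative sketch: the names `Default`, `ELIGIBILITY`, `default_state`, and `needs_trust_approval` are hypothetical, and the case-by-case mitigations for Category 3 are not modeled.

```python
from enum import Enum


class Default(Enum):
    ON_ELIGIBLE = "eligible for Default ON"
    OFF = "Default OFF"


# (category, channel) -> baseline data collection state from the list above.
ELIGIBILITY = {
    (1, "release"): Default.ON_ELIGIBLE,
    (1, "pre-release"): Default.ON_ELIGIBLE,
    (2, "release"): Default.ON_ELIGIBLE,
    (2, "pre-release"): Default.ON_ELIGIBLE,
    (3, "release"): Default.OFF,  # Default ON requires prior Trust approval
    (3, "pre-release"): Default.ON_ELIGIBLE,
    (4, "release"): Default.OFF,
    (4, "pre-release"): Default.OFF,
}


def default_state(category: int, channel: str) -> Default:
    """Look up the baseline state; raises KeyError for unknown inputs."""
    return ELIGIBILITY[(category, channel)]


def needs_trust_approval(category: int, channel: str, wants_default_on: bool) -> bool:
    """True when the eligibility rules call for prior Trust approval."""
    if category == 4:
        return True  # any Category 4 collection
    return category == 3 and channel == "release" and wants_default_on
```

For instance, `default_state(3, "release")` is `Default.OFF`, and `needs_trust_approval(3, "release", True)` is `True` because turning such a collection on by default in Release requires prior Trust approval.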

Other Practices

Every year, the data collection owner and peers will survey all of the existing data collection systems for their product or project. This survey has the following goals:

  • To ensure that it is still necessary and useful to collect a piece of data.
  • To re-identify who is responsible for the collection, monitoring, and reporting of collected data.

Additional References

Data Publishing process