Data Collection

From MozillaWiki
Revision as of 20:11, 8 September 2020 by Tdsmith (talk | contribs) (One got away)

At Mozilla, like at many other organizations, we rely on data to make product decisions. But here, unlike many other organizations, we balance our goal of collecting useful, high-quality data with our goal to give users meaningful choice and control over their own data. The Mozilla data collection program was created to ensure we achieve both goals whenever we make a change to how we collect data in our products.

In November 2017, we revised the program to make our policies clearer and easier to understand and our processes simpler and easier to follow. These changes are designed to reflect our commitment to data collection grounded in:

  • Necessity - We collect only as much data as is necessary when we can demonstrate a clear business case for that data
  • Privacy - We give users meaningful choices and control over their own data
  • Transparency - We make our decisions about data collection public and accessible
  • Accountability - We assign accountability for the design, approval, and implementation of data collection

Owner: Alicia Gray

Data Stewards:

Data stewards come from a variety of teams within Mozilla, including data science, Firefox engineering, mobile products, Pocket, Common Voice, AMO, and Thunderbird. You are welcome to tag any steward for any collection request, without respect to the nature of your collection.

Contact Us on Matrix https://chat.mozilla.org/#/room/#data-stewards:mozilla.org

Note: The data stewards aren't responsible for showing teams how to collect data, although they might be able to provide some guidance if they have time. But the Firefox data engineering team has prepared data documentation which can help!

Most assets involved in data review can be found in this repository. References to who fills out a form when are covered in the documentation below.


Key Roles for Data Collection

While the number of people involved in data collection can vary by product or project, there are two roles necessary for any project:

  • Data requester - the person requesting data to be collected
  • Data steward - the person who ensures the data collection process is followed and that requested data complies with Mozilla policies

In some cases a data steward may escalate concerns to the Trust and Legal teams, which are responsible for defining Firefox data collection policies and can field questions about internal policy and laws governing user privacy.

Mozilla always strives to make data reviews public. However, in limited circumstances we may conduct a review in a private bug; for example, when a service is part of an agreement where the partnership is not yet public. These reviews will be made public once the actual data collection begins.

Requesting Data Collection

Step 1: Submit Request

To request a review for new or changed Data Collection in a Mozilla product, Data Review requesters are required to provide the following:

  • A completed Request Form, documenting what data is to be collected, why Mozilla needs to collect this data, how much data will be collected, and for how long it will be collected:
    • Take this request and fill it out completely.
      • (If you are renewing a previously-reviewed data collection, you may use this shorter form instead.)
  • A bug to attach the completed Request Form to:
    • If you already have a bug filed to add the collection code, attach the form to that one.
    • If you don't already have a bug, file a new one in your own component, or Firefox::Untriaged if you don't have a component (e.g. if your code's in GitHub).
    • Tell Bugzilla that your form's extension is .txt so it can render it inline and so your Data Steward can review it more easily.
  • A notification so the Data Steward knows it's time to review your Request Form:
    • Flag the attached, completed Request Form for data-review.
    • If a Data Steward doesn't get to your review within a couple of days, please reach out to us on Matrix.

Step 2: Request is reviewed

Data stewards review each request to ensure that it is documented fully and to assign the data collection to one of our four privacy categories (see Data Collection Categories below). The detailed steps in this process are:

  • Data stewards receive a data-review? on a file in a bug
  • Data stewards complete the data review form based on the information provided in the data collection request. They ensure that:
    • The request follows Lean Data Practices & Guidelines.
    • The basic mechanics of what is being measured are documented publicly.
    • Our need and justification for the data collection are documented for the record; e.g. there are complete and appropriate answers to questions on the request form.
    • The request aligns with user consent and control mechanisms outlined in the data collection categories listed below.

Data stewards document the outcome of their review in the bug with a data-review+ or data-review- and their completed form. Typical outcomes include:

  • Unapproved requests are returned to data requesters for changes or clarification.
  • Simple requests that fall within Category 1 or 2 are often approved quickly.
  • Complex requests that pose broader policy and legal implications may be escalated to the Trust and Legal teams. (See Step 3)

Step 3: (Optional) Escalated Response

More complex requests, like those that call for a new data collection mechanism or require changes to the privacy notice, often require one or more of the following additional reviews:

  • Privacy analysis: Feedback from the mozilla.dev.privacy mailing list and/or privacy experts within and outside of Mozilla to discuss the feature and its privacy impact.
  • Policy compliance review: An assessment from the Mozilla data compliance team to determine if the request matches the Mozilla data compliance policies and documents.
  • Legal review: An assessment from Mozilla’s legal team.

Data stewards participate in these discussions and will document the outcome in the same bug used for the collection request.

Data Collection Categories

There are four "categories" of data collection that apply to Firefox:

Category 1 “Technical data”
This includes information about the machine or software application itself in which there is no or little risk of personal identification.
Examples include OS, crashes and errors, outcome of automated processes like updates, activation, version #s, etc. This also includes aggregated compatibility information about features and API usage by websites, addons, and other 3rd-party software that interact with the application during usage.
It also includes information about the user's settings that is necessary to provide functionality. For example, what applications users have connected to a service or what services users have logged into using a Firefox Account.
Category 2 “Interaction data”
This includes information about the user’s direct engagement with the service in which there is no or little risk of personal identification.
Examples include how many devices a user has synced, engagement with specific features like clicks, scroll position, audio and session length, status of user preferences, and account activity levels.
It also includes information about the user's in-product journeys and product choices helpful to understand engagement (attitudes). For example, selections of add-ons or tiles to determine potential interest categories etc.
Category 3 “Stored Content & Communications”
This includes information about what people store, sync, communicate or connect to where the information is generally considered to be more sensitive and personal in nature.
Examples include users' saved URLs or URL history, specific web browsing history, general information about their web browsing history (such as TLDs or categories of webpages visited over time) and potentially certain types of interaction data about specific web pages or stories visited (such as highlighted portions of a story).
It also includes information such as content saved by users to an individual account like saved URLs, tags, notes, passwords and files as well as communications that users have with one another through a Mozilla service.
Category 4 “Highly sensitive or clearly identifiable personal data”
Information that directly identifies a person, or if combined with other data could identify a person. This data may be embedded within specific website content, such as memory contents, dumps, captures of screen data, or DOM data.
Examples include account registration data like name, password, and email address associated with an account, payment data in connection with subscriptions or donations, contact information such as phone numbers or mailing addresses, email addresses associated with surveys, promotions and customer support contacts.
It also includes any data from different categories that, when combined, can identify a person, device, household or account. For example: Category 1 log data combined with Category 3 saved URLs.
Additional examples are: voice audio commands (including a voice audio file), speech-to-text or text-to-speech (including transcripts), biometric data, demographic information, and precise location data associated with a persistent identifier, individual or small population cohorts. This is location inferred or determined from mechanisms other than IP, such as wi-fi access points, Bluetooth beacons, or cell phone towers, or provided directly to us, such as in a survey or a profile.

Eligibility for Default on Data Collection

  • Categories 1 & 2 (Technical & Interaction data)
    • Pre-Release & Release: Data may default on, provided the data is exclusively in these categories (it cannot be in any other category). In Release, an opt-out must be available for most types of Technical and Interaction data. Teams may limit data collection to pre-release populations if appropriate for testing/validation, cost reduction, or risk mitigation.
  • Category 3 (Stored Content & Communications)
    • Pre-Release: May be eligible for default on data collection, provided there is an opt-out.
    • Release: Default off.
      • On a case-by-case basis, collections may be eligible to be "default on" if mitigations are identified. Mitigations may include UX changes that make users aware of additional risk, technical mechanisms that remove the risk, or a risk assessment done on a case-by-case basis that determines the risk is limited.
  • Category 4 (Highly Sensitive data)
    • Pre-Release: Default off. May be eligible for opt-in data collection by specific users, provided there is (i) advance user notice (ii) consent and (iii) an opt-out.
    • Release: Default off. May be eligible for opt-in data collection by specific users, provided there is (i) advance user notice (ii) consent and (iii) an opt-out.
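
The eligibility rules above can be read as a small decision table. The following is a minimal illustrative sketch in Python, not an official Mozilla tool: the names Channel, Eligibility, and default_eligibility are hypothetical, and letting the most sensitive category govern is an assumption drawn from the "exclusively in these categories" wording. It does not model the case-by-case exceptions or the rule that combined data can itself become Category 4; those remain a data steward's judgment.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    PRE_RELEASE = "pre-release"
    RELEASE = "release"


@dataclass(frozen=True)
class Eligibility:
    default_on: bool  # may the collection be enabled by default?
    conditions: str   # conditions the policy text attaches to that answer


def default_eligibility(categories: set[int], channel: Channel) -> Eligibility:
    """Baseline eligibility for a collection touching the given categories.

    Hypothetical helper: a steward's review always overrides this result.
    """
    if not categories or not categories <= {1, 2, 3, 4}:
        raise ValueError("categories must be a non-empty subset of {1, 2, 3, 4}")

    # Assumption: the most sensitive category present governs the outcome.
    highest = max(categories)

    if highest <= 2:
        if channel is Channel.RELEASE:
            return Eligibility(True, "opt-out must be available for most data types")
        return Eligibility(True, "data must be exclusively Category 1 or 2")
    if highest == 3:
        if channel is Channel.PRE_RELEASE:
            return Eligibility(True, "an opt-out must be provided")
        return Eligibility(False, "default off; case-by-case exceptions need mitigations")
    # Category 4 is treated the same way in Pre-Release and Release.
    return Eligibility(False, "opt-in only: advance notice, consent, and an opt-out")
```

For example, default_eligibility({1, 3}, Channel.RELEASE) reports default off, because the Category 3 component means the data is no longer exclusively Technical or Interaction data.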

Other Practices

Every year, the data collection owner and peers will survey all of the existing data collection systems for their product or project. This survey has the following goals:

  • To ensure that it is still necessary and useful to collect a piece of data.
  • To re-identify who is responsible for the collection, monitoring, and reporting of collected data.