Data Collection


At Mozilla, like at many other organizations, we rely on data to make product decisions. But here, unlike many other organizations, we balance our goal of collecting useful, high-quality data with our goal to give users meaningful choice and control over their own data. The Mozilla data collection program was created to ensure we achieve both goals whenever we make a change to how we collect data in our products.

In November 2017, we revised the program to make our policies clearer and easier to understand and our processes simpler and easier to follow. These changes are designed to reflect our commitment to data collection grounded in:

  • Necessity - We collect only as much data as is necessary when we can demonstrate a clear business case for that data
  • Privacy - We give users meaningful choices and control over their own data
  • Transparency - We make our decisions about data collection public and accessible
  • Accountability - We assign accountability for the design, approval, and implementation of data collection

Owner: Nneka Soyinka

Data Stewards:

Data stewards come from a variety of teams within Mozilla, including data science, Firefox engineering, mobile products, Pocket, AMO, and Thunderbird. You are welcome to tag any steward for any collection request, without respect to the nature of your collection.

Contact Us on Matrix https://chat.mozilla.org/#/room/#data-stewards:mozilla.org

Note: The data stewards aren't responsible for showing teams how to collect data, although they might be able to provide some guidance if they have time. But the Firefox data engineering team has prepared data documentation which can help!

Most assets involved in data review can be found in this repository. The documentation below explains who fills out which form, and when.

Scope

These guidelines are required for data collection in products with an active user base and established privacy policies under the Firefox organization, but may be applied to any Mozilla product as needed. Changes to the policies themselves and the creation of a policy for a new product are out of scope of what is described here.

Key Roles for Data Collection

While the number of people involved in data collection can vary by product or project, there are two roles necessary for any project:

  • Data requester - the person requesting data to be collected
  • Data steward - the person who ensures the data collection process is followed and that requested data complies with Mozilla policies

In some cases a data steward may escalate concerns to the Trust and Legal teams. They are the teams responsible for defining data collection policies and can field questions about internal policy and laws governing user privacy.

Mozilla always strives to make data reviews public. However, in limited circumstances we may conduct a review in a private bug; for example, when a service is part of an agreement where the partnership is not yet public. These reviews will be made public once the actual data collection begins.

Adding or Modifying Data Collection

The process is slightly different for collections in mozilla-central code (Firefox Desktop, Firefox & Focus for Android, and Gecko) than it is elsewhere. Please consult the relevant section below.

Firefox Desktop, Firefox and Focus for Android, Gecko (from May 7, 2024)

When a developer uploads a change to Phabricator that adds or modifies any data collection, Phabricator will automatically add the needs-data-classification tag, and explain what happens next.

If you’re adding or modifying data collection in your Phabricator revision and this doesn’t happen automatically, please manually add this tag and then follow the same procedure.

Once this tag is in place, Herald will ask the patch author and reviewer to assess the correct category for the data collection:

  • If the data being collected fits in the “technical data” or “interaction data” categories described under Data Collection Categories below, use the data-classification-low tag.
  • If it’s any other category, or patch author and reviewer disagree about the right category, use the data-classification-high tag, and go through the sensitive data collection review process.
  • If you think that the data in question fits in “technical” or “interaction” data but would benefit from additional review, you can also explicitly choose to use the data-classification-high tag and thereby opt in to the sensitive data collection review process.
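
The tag choice above can be summarized as a small decision function. This is an illustrative sketch only, not part of Phabricator or Herald: the function name `choose_classification_tag` and its parameters are hypothetical, and only the tag names and category names come from the text.

```python
LOW_TAG = "data-classification-low"
HIGH_TAG = "data-classification-high"

# Categories that qualify for the low tag, per the list above.
LOW_CATEGORIES = {"technical data", "interaction data"}


def choose_classification_tag(
    author_category: str,
    reviewer_category: str,
    wants_extra_review: bool = False,
) -> str:
    """Suggest a tag from the rules above (hypothetical helper)."""
    # Disagreement between author and reviewer means the high tag.
    if author_category != reviewer_category:
        return HIGH_TAG
    # Any category other than technical or interaction data is high.
    if author_category not in LOW_CATEGORIES:
        return HIGH_TAG
    # Author or reviewer may opt in to the sensitive review voluntarily.
    if wants_extra_review:
        return HIGH_TAG
    return LOW_TAG
```

The function only suggests a tag; the comment explaining the choice and the tag itself still have to be added to the revision by a person.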

When using Glean for the data collection, the data classification of the new or expanded data collections should match the data_sensitivity property in the metric definitions. The entry in the data_reviews list should reflect the bug URL.
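
A consistency check between the chosen tag and a Glean metric definition might look like the following. It is a sketch under assumptions: the metric is represented as a plain dict mirroring a metric definition, the `data_sensitivity` values `technical`, `interaction`, `stored_content`, and `highly_sensitive` are assumed to be the Glean names for categories 1 to 4, and `check_metric` and the placeholder bug URL are hypothetical.

```python
from typing import List

LOW_TAG = "data-classification-low"
HIGH_TAG = "data-classification-high"

# Assumed Glean names for categories 1-2 and 3-4 respectively.
LOW_SENSITIVITY = {"technical", "interaction"}
HIGH_SENSITIVITY = {"stored_content", "highly_sensitive"}


def check_metric(metric: dict, tag: str, bug_url: str) -> List[str]:
    """Return the problems found; an empty list means no mismatch (hypothetical helper)."""
    problems = []
    sensitivity = set(metric.get("data_sensitivity", []))

    if not sensitivity:
        problems.append("data_sensitivity is missing or empty")
    unknown = sensitivity - LOW_SENSITIVITY - HIGH_SENSITIVITY
    if unknown:
        problems.append("unknown data_sensitivity values: " + ", ".join(sorted(unknown)))

    # A low tag must not be paired with a category 3 or 4 sensitivity.
    # A high tag on low-sensitivity data is allowed, because authors may
    # opt in to the sensitive review voluntarily.
    if tag == LOW_TAG and sensitivity & HIGH_SENSITIVITY:
        problems.append("tag is low but data_sensitivity is high")

    # The data_reviews list should contain the bug URL for this review.
    if bug_url not in metric.get("data_reviews", []):
        problems.append("data_reviews does not list " + bug_url)

    return problems


# Placeholder bug URL and metric, for illustration only.
example_url = "https://bugzilla.mozilla.org/show_bug.cgi?id=0000000"
example_metric = {"data_sensitivity": ["interaction"], "data_reviews": [example_url]}
print(check_metric(example_metric, LOW_TAG, example_url))
```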

If the reviewer is unsure or feels uncomfortable making this assessment themselves, they can email the data stewards group or contact them on Matrix for help.

Whichever tag you choose, please leave a comment explaining your choice. You will not be able to land the revision until it has one of these tags and you have removed the needs-data-classification tag. For low-sensitivity data collection, that is all that is required before landing. For high-sensitivity data collection, the data-stewards group will be added as a blocking reviewer on the patch; they will approve or request changes based on the sensitive data collection review process.
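
The landing conditions in the previous paragraph can be expressed as a single predicate. This is a hedged sketch, not how Phabricator enforces the rule: `can_land` and its parameters are hypothetical, and it assumes that a blocking review by the data-stewards group must be approved before a high-sensitivity revision can land.

```python
NEEDS_TAG = "needs-data-classification"
LOW_TAG = "data-classification-low"
HIGH_TAG = "data-classification-high"


def can_land(tags: set, data_stewards_approved: bool = False) -> bool:
    """Decide whether a revision flagged for data classification may land.

    Hypothetical helper; only meaningful for revisions that add or modify
    data collection.
    """
    # The needs-data-classification tag must be removed first.
    if NEEDS_TAG in tags:
        return False
    # High sensitivity: assumed to require approval from the blocking
    # data-stewards reviewer.
    if HIGH_TAG in tags:
        return data_stewards_approved
    # Low sensitivity: the classification tag alone is enough.
    return LOW_TAG in tags
```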

Patch authors are encouraged to add these tags themselves, but reviewers are responsible for making sure the right tag is used.

If you do not yet have a code change but are in the planning stages of a change and want to proactively discuss data collection options, reach out to the data stewards group.

Other Products

Step 1: Submit Request

To request a review for new or changed Data Collection in a Mozilla product, Data Review requesters are required to provide the following:

  • A completed Request Form, documenting what data is to be collected, why Mozilla needs to collect this data, how much data will be collected, and for how long it will be collected.
  • A bug to attach the completed Request Form to:
    • If you already have a bug filed to add the collection code, attach the form to that one.
    • If you don't already have a bug, file a new one in your own component, or Firefox::Untriaged if you don't have a component (e.g. if your code's in GitHub).
    • Tell Bugzilla that your form's extension is .txt so it can render it inline and so your Data Steward can review it more easily.
  • A notification so the Data Steward knows it's time to review your Request Form:
    • Flag the attached, completed Request Form for data-review by setting the data-review flag to ? and entering your chosen Data Steward in the "Requestee" field that appears.
    • If a Data Steward doesn't get to your review within a couple of days, please reach out to us on Element.

Step 2: Request is reviewed

Data stewards review each request to ensure that it is documented fully and to assign the data collection to one of our four privacy categories, as described under Data Collection Categories below. The detailed steps in this process are:

  • Data stewards receive a data-review? flag on a file attached to a bug.
  • Data stewards complete the data review form based on the information provided in the data collection request. They ensure that:
    • The request follows Lean Data Practices & Guidelines.
    • The basic mechanics of what is being measured are documented publicly.
    • Our need and justification for the data collection are documented for the record; e.g. there are complete and appropriate answers to questions on the request form.
    • The request aligns with the user consent and control mechanisms outlined in the data collection categories listed below.

Data stewards document the outcome of their review in the bug with a data-review+ or data-review- and their completed form. Typical outcomes include:

  • Unapproved requests are returned to data requesters for changes or clarification.
  • Simple requests that fall within Category 1 or 2 are often approved quickly.
  • Complex requests that pose broader policy and legal implications may be escalated to the Trust and Legal teams. (See Step 3)

Step 3: Sensitive Data Collection Review Process

Determine if you need to follow this process

For any data collection that is classified as category 3 or 4 (described below) – including in pre-release channels and experiments – we require additional review to be performed and an announcement to a mailing list. The reason for this is that while our privacy policies describe what we can do without additional user notice, this is an upper bound; even for collection which fits within the policy, we need to determine whether that collection is appropriate and conforms to our overall commitment to privacy and minimization. While a Data Steward may provide assistance with escalating a request or submitting it through the sensitive data review process, they are not part of the actual review of escalations. That is handled by a separate cross-functional team.

Create documentation and request review

As a first step, it is important that the details of the implementation, intended use, and value to users be clearly documented for future reference and efficient review. As soon as this is ready (we recommend as early as possible, before you move forward with the implementation), send an email to the data-review@mozilla.com mailing list.

The initial documentation from engineering/data stewardship and privacy/technical review should be completed as a prerequisite ahead of legal and security.

Risk Assessment           Owner                      Facilitator
Privacy/Technical Review  Office of the Firefox CTO  Martin Thomson
Legal/Trust Review        Legal                      Nneka Soyinka
Security Review           Office of the CSO          Marc Perreault
Data Review               Data                       Mark Reid

Facilitators (named above) are expected to express judgement about how much risk is involved and will involve the appropriate reviewers.

If the level of risk is determined to be low enough and/or there is clear precedent, further discussion may not be necessary and each reviewer may give a sign-off immediately; otherwise, mitigations should be incorporated and documentation updated once they have been addressed. Live discussion is often very helpful, and should be planned for, when there is significant risk involved. One reviewer, after consulting with the full group, is permitted to approve on the group's behalf.

Data collection may not be shipped to users until final sign-offs have been obtained.

Escalation

In the case of a dispute about sensitive data collection and/or which mitigations are appropriate, the proposer or any reviewer should work with one of the facilitators to escalate the decision to the VP/XLT member in charge of the product (e.g., Head of Firefox, Head of Pocket). Depending on the scope and nature of the risk, there may also be cases where escalation goes beyond the immediate product owner (i.e., to the CPO or CEO). When this happens, the facilitator and escalating party:

  • Give each party a chance to document their recommended approach in writing.
  • Share the document with all involved parties for asynchronous review/comment.
  • Schedule a meeting for discussion if necessary.
  • Record the final decision by the product owner.

Data Collection Categories

There are four "categories" of data collection:

Category 1 “Technical data”
This includes information about the machine or software application itself in which there is no or little risk of personal identification.
Examples include OS, crashes and errors, outcome of automated processes like updates, activation, version numbers, etc. This also includes aggregated compatibility information about features and API usage by websites, add-ons, and other third-party software that interact with the application during usage.
It also includes information about the user's settings that is necessary to provide functionality. For example, what applications users have connected to a service or what services users have logged into using a Mozilla account.
Category 2 “Interaction data”
This includes information about the user’s direct engagement with the service in which there is no or little risk of personal identification.
Examples include how many devices a user has synced, engagement with specific features like clicks, scroll position, audio and session length, status of user preferences, and account activity levels.
It also includes information about the user's in-product journeys and product choices helpful to understand engagement (attitudes). For example, selections of add-ons or tiles to determine potential interest categories etc.
Category 3 “Stored Content & Communications”
This includes information about what people store, sync, communicate or connect to where the information is generally considered to be more sensitive and personal in nature.
Examples include users' saved URLs or URL history, specific web browsing history, general information about their web browsing history (such as TLDs or categories of webpages visited over time) and potentially certain types of interaction data about specific web pages or stories visited (such as highlighted portions of a story).
It also includes information such as content saved by users to an individual account like saved URLs, tags, notes, passwords and files as well as communications that users have with one another through a Mozilla service.
Category 4 “Highly sensitive or clearly identifiable personal data”
Information that directly identifies a person, or if combined with other data could identify a person. This data may be embedded within specific website content, such as memory contents, dumps, captures of screen data, or DOM data.
Examples include account registration data like name, password, and email address associated with an account, payment data in connection with subscriptions or donations, contact information such as phone numbers or mailing addresses, email addresses associated with surveys, promotions and customer support contacts.
It also includes any data from different categories that, when combined, can identify a person, device, household or account. For example: Category 1 log data combined with Category 3 saved URLs.
Additional examples are: voice audio commands (including a voice audio file), speech-to-text or text-to-speech (including transcripts), biometric data, demographic information, and precise location data associated with a persistent identifier, individual or small population cohorts. Precise location here means location inferred or determined from mechanisms other than IP, such as Wi-Fi access points, Bluetooth beacons, or cell phone towers, or location provided directly to us, such as in a survey or a profile.

Eligibility for Default on Data Collection

At installation, Mozilla’s products and services include one or more preferences and settings. These preferences and settings typically belong to a data collection state: a status that describes whether data collection occurs by default or not.

State        What it Means
Default ON   Data may be collected automatically. Users must have a way to turn off data collection. Learn how to opt out of data collection in Firefox.
Default OFF  Data may be collected, but only if a user takes a clear, express action to opt in to the collection. This can be through a configuration option, a prompt, or an update through an account profile. Users must have a way to turn off data collection.

“Release” means products that are not experimental. These include Firefox, Pocket, Lockwise, Monitor, and others.

“Pre-release” means experimental products. They are typically identified by the words “Beta,” “Nightly,” “Preview,” “Reference Browser,” or “Developer Edition” in the name of the product.

Category 1 “Technical data”
Release & Pre-Release - eligible for Default ON.
Category 2 “Interaction data”
Release & Pre-Release - eligible for Default ON.
Category 3 “Stored Content and Communications”
Release: Default OFF. Default ON requires prior Trust approval.

Pre-Release: Default ON eligible

On a case-by-case basis, collections may be eligible to be "Default ON" if mitigations are identified. Mitigations may include UX changes that make users aware of additional risk, technical mechanisms that remove the risk, or a risk assessment done on a case-by-case basis that determines the risk is limited.

Category 4 “Highly Sensitive or Clearly identifiable personal data”
Release & Pre-Release: Default OFF

Any collection requires prior Trust approval and (i) advance user notice, (ii) consent, and (iii) an opt-out.
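
The eligibility rules above can be condensed into a lookup. The following is a minimal sketch of those rules as written, not an official tool: the `Category` enum, the function `default_on_eligible`, and the `trust_approved` parameter are hypothetical names introduced for illustration.

```python
from enum import IntEnum


class Category(IntEnum):
    TECHNICAL = 1          # Category 1 "Technical data"
    INTERACTION = 2        # Category 2 "Interaction data"
    STORED_CONTENT = 3     # Category 3 "Stored Content and Communications"
    HIGHLY_SENSITIVE = 4   # Category 4 "Highly sensitive or clearly identifiable personal data"


def default_on_eligible(
    category: Category,
    pre_release: bool,
    trust_approved: bool = False,
) -> bool:
    """Return True if a collection may be Default ON (hypothetical helper)."""
    # Categories 1 and 2 are eligible in both Release and Pre-Release.
    if category in (Category.TECHNICAL, Category.INTERACTION):
        return True
    # Category 3 is eligible in Pre-Release; in Release it needs prior
    # Trust approval.
    if category is Category.STORED_CONTENT:
        return pre_release or trust_approved
    # Category 4 is Default OFF in both Release and Pre-Release.
    return False
```

Eligibility is only an upper bound here: a collection that is eligible for Default ON still has to go through the review process described earlier.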

Other Practices

Every year, the data collection owner and peers will survey all of the existing data collection systems for their product or project. This survey has the following goals:

  • To ensure that it is still necessary and useful to collect a piece of data.
  • To re-identify who is responsible for the collection, monitoring, and reporting of collected data.

Additional References

Data Publishing process