No edit summary |
(Minor cleanup, including adding a link to the data blog and fixing the doc link.) |
Line 7: | Line 7: | ||
<big>'''Dataset Publishing Process'''</big> | <big>'''Dataset Publishing Process'''</big> | ||
We want our data publishing review process, as well as our review decisions, to be public and understandable, similar to our [[Firefox/Data_Collection|Mozilla Data Collection]] program. To that end, our full dataset publishing policy and details about what considerations we look at before determining what is safe to publish can be found below, including | We want our data publishing review process, as well as our review decisions, to be public and understandable, similar to our [[Firefox/Data_Collection|Mozilla Data Collection]] program. To that end, our full dataset publishing policy and details about what considerations we look at before determining what is safe to publish can be found below, including a summary of the critical pieces of that process. | ||
The goal of our data publishing process is to: | The goal of our data publishing process is to: | ||
Line 35: | Line 35: | ||
[Canada, Linux, “Other locales”, nightly] for rare locales | [Canada, Linux, “Other locales”, nightly] for rare locales | ||
|- | |- | ||
| 3 || '''Dimension-level aggregation w/o minimum bucket sizes''' || | | 3 || '''Dimension-level aggregation w/o minimum bucket sizes''' || Client ID count by country, OS, locale, and channel, where a bucket such as [Canada, Linux, PL, nightly] could contain only one client. | ||
|- | |- | ||
| 4 || '''Probabilistic Aggregates''' || [https://en.wikipedia.org/wiki/HyperLogLog HLL] for computing approximate unique client counts, [https://en.wikipedia.org/wiki/Bloom_filter bloom filter] for computing presence in a set | | 4 || '''Probabilistic Aggregates''' || [https://en.wikipedia.org/wiki/HyperLogLog HLL] for computing approximate unique client counts, [https://en.wikipedia.org/wiki/Bloom_filter bloom filter] for computing presence in a set. | ||
|- | |- | ||
| 5 || '''Anonymized individual-level data''' || | | 5 || '''Anonymized individual-level data''' || | ||
Line 95: | Line 95: | ||
* Schedule it to update at the desired frequency | * Schedule it to update at the desired frequency | ||
* Plumb it into the public-facing dataset infrastructure, including metadata that links the public data back to the above review bug. | * Plumb it into the public-facing dataset infrastructure, including metadata that links the public data back to the above review bug. | ||
* Once the dataset has been published, it will be announced on the | * Once the dataset has been published, it will be announced on the [https://blog.mozilla.org/data/ Data @ Mozilla blog]. Accessing the public data is described on the [https://docs.telemetry.mozilla.org/cookbooks/public_data.html data documentation page]. | ||
<big>'''Definitions'''</big> | <big>'''Definitions'''</big> | ||
'''Metric''' - A metric is anything we want to measure. | '''Metric''' - A metric is anything we want to measure. | ||
Examples: the number of clients that used the developer tools console, the number of active clients | Examples: the number of clients that used the developer tools console, the number of active clients. | ||
'''Dimension''' - A dimension is a qualitative value such as OS, channel, or date. In practice, a dimension often defines a sub-population on which we can calculate a metric, allowing us to segment the metric for further analysis. | '''Dimension''' - A dimension is a qualitative value such as OS, channel, or date. In practice, a dimension often defines a sub-population on which we can calculate a metric, allowing us to segment the metric for further analysis. | ||
Examples: if we have an OS dimension, we can analyze the number of active clients by OS | Examples: if we have an OS dimension, we can analyze the number of active clients by OS. | ||
'''Aggregate''' - A combined value of many measurements (metric values), typically grouped by dimension or sets of dimensions | '''Aggregate''' - A combined value of many measurements (metric values), typically grouped by dimension or sets of dimensions. | ||
'''Individual-level Data''' - Data containing a dimension which uniquely identifies a single profile, user, client, etc. | '''Individual-level Data''' - Data containing a dimension which uniquely identifies a single profile, user, client, etc. |
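The definitions above can be illustrated with a minimal sketch. It assumes hypothetical individual-level rows keyed by a <code>client_id</code> field, and it treats a minimum bucket size as simple suppression of small buckets. The function name, the sample rows, and the threshold are all assumptions made for illustration and are not part of the publishing policy itself.

```python
# Hypothetical individual-level data: each row carries a client_id,
# which uniquely identifies a single client.
ROWS = [
    {"client_id": "c1", "country": "Canada", "os": "Linux", "locale": "PL", "channel": "nightly"},
    {"client_id": "c2", "country": "US", "os": "Windows", "locale": "en-US", "channel": "release"},
    {"client_id": "c3", "country": "US", "os": "Windows", "locale": "en-US", "channel": "release"},
    {"client_id": "c4", "country": "US", "os": "Windows", "locale": "en-US", "channel": "release"},
]

# Dimensions define the sub-populations on which the metric is computed.
DIMENSIONS = ("country", "os", "locale", "channel")


def aggregate_client_counts(rows, dimensions, min_bucket_size=1):
    """Aggregate the metric "number of clients" by the given dimensions.

    Buckets with fewer than min_bucket_size distinct clients are dropped.
    Suppression is an assumed rule here; other strategies exist, such as
    folding rare values into an "Other" bucket.
    """
    clients_per_bucket = {}
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        clients_per_bucket.setdefault(key, set()).add(row["client_id"])
    return {
        key: len(ids)
        for key, ids in clients_per_bucket.items()
        if len(ids) >= min_bucket_size
    }


# Without a minimum bucket size, a single-client bucket is exposed.
unprotected = aggregate_client_counts(ROWS, DIMENSIONS)

# With an (arbitrary, illustrative) threshold of 2, that bucket is suppressed.
protected = aggregate_client_counts(ROWS, DIMENSIONS, min_bucket_size=2)
```

With the default threshold, the [Canada, Linux, PL, nightly] bucket appears with a count of one, which is effectively individual-level data even though the output is an aggregate. Raising the threshold removes that bucket while leaving the larger one intact.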