Data Publishing: Difference between revisions

Minor cleanup, including adding a link to the data blog and fixing the doc link.
Line 7:
<big>'''Dataset Publishing Process'''</big>


We want our data publishing review process, as well as our review decisions to be public and understandable, similar to our [[Firefox/Data_Collection|Mozilla Data Collection]] program. To that end, our full dataset publishing policy and details about what considerations we look at before determining what is safe to publish can be found below, including asummary of the critical pieces of that process.
We want our data publishing review process, as well as our review decisions to be public and understandable, similar to our [[Firefox/Data_Collection|Mozilla Data Collection]] program. To that end, our full dataset publishing policy and details about what considerations we look at before determining what is safe to publish can be found below, including a summary of the critical pieces of that process.


The goal of our data publishing process is to:
Line 35:
[Canada, Linux, “Other locales”, nightly] for rare locales
|-
| 3 || '''Dimension-level aggregation w/o minimum bucket sizes''' || Clientid count by country, os, locale, channel, where there could be: [Canada, Linux, PL, nightly] which has one client in it.
| 3 || '''Dimension-level aggregation w/o minimum bucket sizes''' || Client ID count by country, os, locale, channel, where there could be: [Canada, Linux, PL, nightly] which has one client in it.
|-
| 4 || '''Probabilistic Aggregates''' || [https://en.wikipedia.org/wiki/HyperLogLog HLL] for computing approximate unique client counts, [https://en.wikipedia.org/wiki/Bloom_filter bloom filter] for computing presence in a set
| 4 || '''Probabilistic Aggregates''' || [https://en.wikipedia.org/wiki/HyperLogLog HLL] for computing approximate unique client counts, [https://en.wikipedia.org/wiki/Bloom_filter bloom filter] for computing presence in a set.
|-
| 5 || '''Anonymized individual-level data''' ||  
Line 95:
*  Schedule it to update at the desired frequency.
*  Plumb it into the public-facing dataset infrastructure, including metadata that links the public data back to the above review bug.
*  Once the dataset has been published, it will be announced on the new Data @ Mozilla blog. It will also be added to https://docs.telemetry.mozilla.org/datasets/.
*  Once the dataset has been published, it will be announced on the [https://blog.mozilla.org/data/ Data @ Mozilla blog]. Accessing the public data is described on the [https://docs.telemetry.mozilla.org/cookbooks/public_data.html data documentation page].


<big>'''Definitions'''</big>


'''Metric''' - A metric is anything we want to measure.
Examples: the number of clients that used the developer tools console, the number of active clients
Examples: the number of clients that used the developer tools console, the number of active clients.


'''Dimension''' - A dimension is a qualitative value such as OS, channel, or date. In practice, a dimension often defines a sub-population on which we can calculate a metric, allowing us to segment the metric for further analysis.
Examples: if we have an OS dimension, we can analyze the number of active clients by OS;
Examples: if we have an OS dimension, we can analyze the number of active clients by OS.


'''Aggregate''' - A combined value of many measurements (metric values), typically grouped by dimension or sets of dimensions. See also Aggregate Data.
'''Aggregate''' - A combined value of many measurements (metric values), typically grouped by dimension or sets of dimensions.


'''Individual-level Data''' - Data containing a dimension which uniquely identifies a single profile, user, client, etc.
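
The definitions above can be illustrated with a small sketch. The rows, field names, and the <code>client_counts</code> helper below are hypothetical and are not part of any Mozilla dataset or tool. The sketch only shows how a metric (distinct client count) is aggregated over one or more dimensions, and how grouping by many dimensions at once can leave a bucket that contains a single client, as in the [Canada, Linux, PL, nightly] example.

```python
from collections import defaultdict

# Hypothetical individual-level rows: each row carries a client_id,
# a dimension that uniquely identifies a single client.
rows = [
    {"client_id": "a", "country": "Canada", "os": "Linux", "locale": "PL", "channel": "nightly"},
    {"client_id": "b", "country": "Canada", "os": "Windows", "locale": "en-CA", "channel": "release"},
    {"client_id": "c", "country": "Canada", "os": "Windows", "locale": "en-CA", "channel": "release"},
    # A second row for client "c"; it must not be counted twice.
    {"client_id": "c", "country": "Canada", "os": "Windows", "locale": "en-CA", "channel": "release"},
]


def client_counts(rows, dimensions):
    """Aggregate the metric 'number of distinct clients' by the given dimensions."""
    clients = defaultdict(set)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        clients[key].add(row["client_id"])
    return {key: len(ids) for key, ids in clients.items()}


# Aggregate over a single dimension (OS).
by_os = client_counts(rows, ["os"])

# Aggregate over several dimensions with no minimum bucket size.
by_all = client_counts(rows, ["country", "os", "locale", "channel"])

# Buckets holding exactly one client effectively describe an individual.
singletons = [key for key, count in by_all.items() if count == 1]
```

Here <code>by_os</code> is an aggregate in the sense defined above, while <code>singletons</code> lists the dimension combinations where an aggregate no longer hides the individual behind it.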