Template:Longitudinal data intro: Difference between revisions

Revision as of 00:33, 1 September 2016

The longitudinal dataset is a summary of main pings.

The longitudinal dataset differs from main_summary in two important ways:

The longitudinal dataset groups all data for a client-id in the same row. This makes it easy to report profile level metrics. Without this deduping, metrics would be weighted by the number of submissions instead of by clients.
The dataset uses a 1% of all recent profiles, which will reduce query computation time and save resources. The sample of clients will be stable over time.

Accordingly, one should prefer using the Longitudinal dataset except in the rare case where a 100% sample is strictly necessary.

As discussed in the Longitudinal Data Set Example Notebook:

The longitudinal dataset is logically organized as a table where rows represent profiles and columns the various metrics (e.g. startup time). Each field of the table contains a list of values, one per Telemetry submission received for that profile. [...]
The current version of the longitudinal dataset has been build with all main pings received from 1% of profiles across all channels with [...] up to 180 days of data.

@@ Line 1: / Line 1: @@
 The longitudinal dataset is a summary of main pings.
-In general, you should prefer using the longitudinal set to main_summary unless there are extenuating circumstances.
-The longitudinal dataset:
+The longitudinal dataset differs from main_summary in two important ways:
-* makes it easy to report profile level metrics by grouping all data for a client-id in the same row
+* The longitudinal dataset groups all data for a client-id in the same row. This makes it easy to report profile level metrics. Without this deduping, metrics would be weighted by the number of submissions instead of by clients.
-* samples to 1% of all recent profiles, which will reduce query computation time and save resources
+* The dataset uses a 1% of all recent profiles, which will reduce query computation time and save resources. The sample of clients will be stable over time.
+Accordingly, one should prefer using the Longitudinal dataset except in the rare case where a 100% sample is strictly necessary.
 As discussed in the [https://gist.github.com/vitillo/627eab7e2b3f814725d2 Longitudinal Data Set Example Notebook]:

Template:Longitudinal data intro: Difference between revisions

Revision as of 00:33, 1 September 2016

Navigation menu

Search