Telemetry/Available Telemetry Datasets and their Applications: Difference between revisions

From MozillaWiki
(Add Main Summary MVP)
(→‎Main Summary: Update main summary warnings)

Revision as of 03:52, 17 September 2016

Data Set Documentation

Longitudinal

Complete documentation

The longitudinal dataset is a summary of main pings. If you're not sure which dataset to use for your query, this is probably what you want. It differs from the main_summary table in two important ways:

  • The longitudinal dataset groups all data for a client_id in the same row. This makes it easy to report profile-level metrics. Without this grouping, metrics would be weighted by the number of submissions instead of by the number of clients.
  • The dataset uses a 1% sample of all recent profiles, which reduces query computation time and saves resources. The sample of clients is stable over time.

Accordingly, one should prefer using the Longitudinal dataset except in the rare case where a 100% sample is strictly necessary.
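
The weighting issue described above can be illustrated with a minimal Python sketch. The data is a hypothetical toy sample, not real Telemetry, and the function names are made up for this example: client "a" submits three pings and client "b" submits one, so averaging over pings lets client "a" dominate the result.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: one (client_id, startup_ms) tuple per ping.
pings = [("a", 100), ("a", 100), ("a", 100), ("b", 400)]


def per_ping_mean(pings):
    """Average over submissions: clients that submit more count more."""
    return mean(value for _, value in pings)


def per_client_mean(pings):
    """Average each client's values first, then average across clients."""
    by_client = defaultdict(list)
    for client_id, value in pings:
        by_client[client_id].append(value)
    return mean(mean(values) for values in by_client.values())


print(per_ping_mean(pings), per_client_mean(pings))
```

With this toy sample the per-ping average is 175 while the per-client average is 250, which is the kind of gap a one-row-per-client layout is meant to avoid.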

As discussed in the Longitudinal Data Set Example Notebook:

The longitudinal dataset is logically organized as a table where rows represent profiles and columns the various metrics (e.g. startup time). Each field of the table contains a list of values, one per Telemetry submission received for that profile. [...]

The current version of the longitudinal dataset has been built with all main pings received from 1% of profiles across all channels with [...] up to 180 days of data.
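
As a rough illustration of that layout, the sketch below reshapes per-ping records into one row per profile, where every field holds a list with one entry per submission. The field names (startup_ms) and the helper to_longitudinal are hypothetical; the real dataset is produced by its own pipeline, not by this function.

```python
from collections import defaultdict


def to_longitudinal(pings):
    """Group per-ping dicts into one row per client_id.

    Each field in a row is a list with one value per ping, in input order.
    """
    rows = defaultdict(lambda: defaultdict(list))
    for ping in pings:
        row = rows[ping["client_id"]]
        for field, value in ping.items():
            if field != "client_id":
                row[field].append(value)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {client_id: dict(fields) for client_id, fields in rows.items()}


# Hypothetical pings: two from client "a", one from client "b".
example = to_longitudinal([
    {"client_id": "a", "startup_ms": 100},
    {"client_id": "b", "startup_ms": 400},
    {"client_id": "a", "startup_ms": 120},
])
print(example)
```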

Main Summary

Complete Documentation

Like the longitudinal dataset, main summary summarizes main pings. Each row corresponds to a single ping. This table does no sampling and includes all desktop pings.

Caveats

Querying main summary from re:dash (sql.t.m.o) can impact performance for other users and can take a while to complete (around 30 minutes for simple queries). Because main summary includes a row for every ping, a query may scan a very large number of records and consume a lot of resources on the shared cluster.

Instead, we recommend using the longitudinal dataset where possible when querying from re:dash (sql.t.m.o). The longitudinal dataset samples 1% of all data and organizes the data by client_id. In the odd case where a main summary query is necessary, limit it to a short submission_date_s3 range and ideally filter on the sample_id field as well. Even better, try using Spark.
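
The advice above can be sketched as a small Python helper that builds a bounded query string. Several details here are assumptions rather than facts from this page: the table name main_summary, submission_date_s3 being a string formatted as YYYYMMDD, and sample_id being a string bucket from '0' to '99'. The helper name and the 7-day cap are hypothetical; check the complete documentation before relying on any of them.

```python
from datetime import date


def build_main_summary_query(start, end, sample_id="42", max_days=7):
    """Return a SQL string restricted to a short date range and one sample bucket.

    Assumptions (not confirmed by this page): submission_date_s3 is a
    'YYYYMMDD' string and sample_id is a string between '0' and '99'.
    """
    if end < start:
        raise ValueError("end must not be before start")
    if (end - start).days + 1 > max_days:
        raise ValueError("date range too long for the shared cluster")
    if sample_id not in {str(i) for i in range(100)}:
        raise ValueError("sample_id must be a string from '0' to '99'")
    return (
        "SELECT COUNT(DISTINCT client_id) AS clients\n"
        "FROM main_summary\n"
        f"WHERE submission_date_s3 BETWEEN '{start:%Y%m%d}' AND '{end:%Y%m%d}'\n"
        f"  AND sample_id = '{sample_id}'"
    )


query = build_main_summary_query(date(2016, 9, 1), date(2016, 9, 3))
print(query)
```

Refusing long ranges in the helper, instead of trusting the caller, is one way to keep an accidental full-table scan off the shared cluster.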

Cross Sectional

Client Count

Crash Aggregates

Mobile Metrics

The android_events, android_clients, android_addons, and mobile_clients tables are documented here: https://wiki.mozilla.org/Mobile/Metrics/Redash