Telemetry/Available Telemetry Datasets and their Applications: Difference between revisions

=Data Set Documentation=
This document describes a set of datasets that can be queried using re:dash at sql.telemetry.mozilla.org (s.t.m.o). They can also be queried from a Spark cluster; see [https://wiki.mozilla.org/Telemetry/Custom_analysis_with_spark#How_can_I_load_parquet_datasets_in_a_Jupyter_notebook.3F these directions]. The Longitudinal dataset is also available natively within Spark; see the [https://github.com/mozilla/emr-bootstrap-spark/blob/master/examples/Longitudinal%20Dataset%20Tutorial.ipynb longitudinal tutorial].


==Longitudinal==
[[Telemetry/LongitudinalExamples|Complete documentation]]

This document now lives here: https://github.com/mozilla/telemetry-batch-view/blob/master/docs/choosing_a_dataset.md


{{longitudinal data intro}}
[https://wiki.mozilla.org/api.php?action=query&list=backlinks&bltitle=Telemetry/Available_Telemetry_Datasets_and_their_Applications Wiki.mo pages linking to this dead page]
 
==Main Summary==
[https://github.com/mozilla/telemetry-batch-view/blob/master/docs/MainSummary.md Complete Documentation]
 
Like the longitudinal dataset, main summary summarizes [https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/main-ping.html main pings]. Each row corresponds to a single ping. This table does no sampling and includes all desktop pings.
 
===Caveats===
Querying main summary on re:dash/s.t.m.o can '''impact performance for other users''' and can '''take a while to complete''' (~30m for simple queries). Because main summary includes a row for every ping, it holds a very large number of records, and queries over them can consume a lot of resources on the shared cluster.
 
Instead, we recommend using the Longitudinal dataset where possible when querying from re:dash/s.t.m.o. The Longitudinal dataset is a 1% sample of all data, organized by client_id. In the rare case where queries against main summary are necessary, limit them to a short submission_date_s3 range and, where possible, filter on the sample_id field. Users who require this dataset should ideally use Spark.
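
As a rough illustration of the advice above, the following Python sketch builds a narrowly scoped SQL query string. It is not taken from the Mozilla tooling: the helper name <code>main_summary_query</code> is hypothetical, and the table name <code>main_summary</code>, the <code>YYYYMMDD</code> string format for submission_date_s3, and the 0 to 99 range and string type of sample_id are assumptions that should be checked against the Main Summary documentation before use.

```python
from datetime import date


def main_summary_query(start, end, sample_id, limit=1000):
    """Build a SQL string limited to a short date range and one sample_id.

    Hypothetical helper for illustration only. Assumptions (not confirmed
    by this page): the table is named main_summary, submission_date_s3 is
    a 'YYYYMMDD' string, and sample_id is a string holding 0 through 99.
    """
    if start > end:
        raise ValueError("start must not be after end")
    if not 0 <= sample_id <= 99:
        raise ValueError("sample_id is assumed to lie in 0..99")
    return (
        "SELECT client_id, submission_date_s3\n"
        "FROM main_summary\n"
        f"WHERE submission_date_s3 BETWEEN '{start:%Y%m%d}' AND '{end:%Y%m%d}'\n"
        f"  AND sample_id = '{sample_id}'\n"
        f"LIMIT {limit}"
    )


# Three days of pings from a single sample bucket.
query = main_summary_query(date(2016, 11, 1), date(2016, 11, 3), sample_id=42)
print(query)
```

Restricting both the date range and the sample bucket keeps the number of scanned rows small, which is the point of the recommendation; the resulting string can be pasted into re:dash once the assumed names have been verified.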
 
==Cross Sectional==
The Cross Sectional dataset is a simplified version of the Longitudinal dataset.
 
The majority of Longitudinal columns contain array values with one element for each ping, which are difficult to work with in SQL. The Cross Sectional dataset '''replaces these array-valued columns with summary statistics'''. For example, the Longitudinal dataset contains a column named "geo_country" where each row is an array of locales for one client (e.g. array<"en_US", "en_US", "en_GB">). The Cross Sectional dataset instead includes a column named "geo_country_mode" where each row contains a single string representing the mode (e.g. "en_US"). The Cross Sectional column is '''easier to work with''' in SQL and is more representative than a single value chosen from the Longitudinal array.
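
The reduction from an array-valued column to its mode can be sketched in a few lines of Python. This is only an illustration of the idea, not the CrossSectionalView implementation (which is written in Scala); the <code>mode</code> helper and the sample row are made up for the example.

```python
from collections import Counter


def mode(values):
    """Return the most frequent non-None value, or None for an empty input.

    Ties are broken in favor of the value seen first.
    """
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# A made-up Longitudinal-style row: one array element per ping.
longitudinal_row = {
    "client_id": "example-client",
    "geo_country": ["en_US", "en_US", "en_GB"],
}

# The Cross Sectional-style row keeps one summary value per client.
cross_sectional_row = {
    "client_id": longitudinal_row["client_id"],
    "geo_country_mode": mode(longitudinal_row["geo_country"]),
}

print(cross_sectional_row["geo_country_mode"])  # prints en_US
```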
 
Note that the Cross Sectional dataset is derived from the Longitudinal dataset, so it is also a '''1% sample of main pings'''.
 
This dataset is sometimes abbreviated as the '''xsec dataset'''. You can find the current version of the code [https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/CrossSectionalView.scala here]. This dataset is under active development; please '''contact rharter@mozilla.com with any questions'''.
 
==Client Count==
 
==Crash Aggregates==
 
==Mobile Metrics==
The android_events, android_clients, android_addons, and mobile_clients tables are documented here:
https://wiki.mozilla.org/Mobile/Metrics/Redash

Latest revision as of 21:49, 7 November 2016