Latest revision as of 21:49, 7 November 2016

Data Set Documentation

This document now lives here: https://github.com/mozilla/telemetry-batch-view/blob/master/docs/choosing_a_dataset.md

Wiki.mo pages linking to this dead page

@@ Line 1: / Line 1: @@
 =Data Set Documentation=
-This document describes a set of datasets which can be queried using re:dash/sql.telemetry.mozilla.org (s.t.m.o). In addition, they can be queried using a Spark cluster - see [https://wiki.mozilla.org/Telemetry/Custom_analysis_with_spark#How_can_I_load_parquet_datasets_in_a_Jupyter_notebook.3F these directions]. The Longitudinal dataset is also available natively within Spark, see the [https://github.com/mozilla/emr-bootstrap-spark/blob/master/examples/Longitudinal%20Dataset%20Tutorial.ipynb longitudinal tutorial].
-==Longitudinal==
+This document now lives here:
-[[Telemetry/LongitudinalExamples|Complete documentation]]
+https://github.com/mozilla/telemetry-batch-view/blob/master/docs/choosing_a_dataset.md
-{{longitudinal data intro}}
+[https://wiki.mozilla.org/api.php?action=query&list=backlinks&bltitle=Telemetry/Available_Telemetry_Datasets_and_their_Applications Wiki.mo pages linking to this dead page]
-==Main Summary==
-[https://github.com/mozilla/telemetry-batch-view/blob/master/docs/MainSummary.md Complete Documentation]
-Like the longitudinal dataset, main summary summarizes [https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/main-ping.html main pings]. Each row corresponds to a single ping. This table does no sampling and includes all desktop pings.
-===Caveats===
-Querying against main summary on SQL.t.m.o/re:dash can '''impact performance for other users''' and can '''take a while to complete''' (~30m for simple queries). Since main summary includes a row for every ping, there are a large number of records which can consume a lot of resources on the shared cluster.
-Instead, we recommend using the Longitudinal dataset where possible if querying from re:dash/s.t.m.o. The longitudinal dataset samples to 1% of all data and organizes the data by client_id. In the rare case where main summary queries are necessary, limit them to a short submission_date_s3 range and, where possible, filter on the sample_id field. Users who require this dataset should ideally use Spark.
-==Cross Sectional==
-The Cross Sectional dataset is a simplified version of the Longitudinal dataset.
-The majority of Longitudinal columns contain array values with one element for each ping, which is difficult to work with in SQL. The Cross Sectional dataset '''replaces these array-valued columns with summary statistics'''. To give an example, the Longitudinal dataset will contain a column named "geo_country" where each row is an array of locales for one client (e.g. array<"en_US", "en_US", "en_GB">). Instead, the Cross Sectional dataset includes a column named "geo_country_mode" where each row contains a single string representing the mode (e.g. "en_US"). The Cross Sectional column is '''easier to work with''' in SQL and is more representative than just choosing a single value from the Longitudinal array.
-Note that the Cross Sectional dataset is derived from the Longitudinal dataset, so the dataset is a '''1% sample of main pings'''.
-This dataset is sometimes abbreviated as the '''xsec dataset'''. You can find the current version of the code [https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/CrossSectionalView.scala here]. This dataset is under active development; please '''contact rharter@mozilla.com with any questions'''.
-==Client Count==
-==Crash Aggregates==
-==Mobile Metrics==
-The android_events, android_clients, android_addons, and mobile_clients tables are documented here:
-https://wiki.mozilla.org/Mobile/Metrics/Redash

Telemetry/Available Telemetry Datasets and their Applications: Difference between revisions
