Telemetry/Available Telemetry Datasets and their Applications: Difference between revisions

458 bytes added, 17 September 2016

→‎Main Summary: Update main summary warnings

@@ Line 10: / Line 10: @@
 Like the longitudinal dataset, main summary summarizes [https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/main-ping.html main pings]. Each row corresponds to a single ping. This table does no sampling and includes all desktop pings.
-Be aware that queries against this dataset can take a while to complete (~30 minutes for a simple query). Consider using the longitudinal dataset instead, which samples to 1% of all data and organizes the data by client_id.
+===Caveats===
+Querying main summary on sql.t.m.o/re:dash can '''impact performance for other users''' and can '''take a while to complete''' (~30 minutes for simple queries). Because main summary includes a row for every ping, a query scans a very large number of records, which can consume a lot of resources on the shared cluster.
+Instead, we recommend using the longitudinal dataset where possible when querying from sql.t.m.o/re:dash. The longitudinal dataset samples to 1% of all data and organizes the data by client_id. In the rare case where a query against main summary is necessary, limit it to a short submission_date_s3 range and ideally filter on the sample_id field. Even better, try using Spark.
 ==Cross Sectional==
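The caveat added in this revision advises limiting main summary queries to a short submission_date_s3 range and filtering on sample_id. A minimal sketch of building such a query is shown here. The helper name `build_main_summary_query`, the table name `main_summary`, the selected columns, the `YYYYMMDD` date format, and the comparison of sample_id as a quoted string are all assumptions made for illustration; check the actual schema before relying on any of them.

```python
from datetime import date, timedelta


def build_main_summary_query(end_day, days=3, sample_ids=(42,)):
    """Build a SQL string restricted to a short date range and a few sample_id values.

    Hypothetical helper: the table name, columns, date format, and the
    string-typed sample_id comparison are assumptions, not confirmed schema.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    if not sample_ids:
        raise ValueError("at least one sample_id is required")

    # Inclusive range of `days` days ending on `end_day`.
    start_day = end_day - timedelta(days=days - 1)

    # Coerce to int first so only plain numbers end up in the query text.
    buckets = ", ".join(f"'{int(s)}'" for s in sample_ids)

    return (
        "SELECT client_id, submission_date_s3\n"
        "FROM main_summary\n"
        f"WHERE submission_date_s3 BETWEEN '{start_day:%Y%m%d}' AND '{end_day:%Y%m%d}'\n"
        f"  AND sample_id IN ({buckets})\n"
        "LIMIT 100"
    )


query = build_main_summary_query(date(2016, 9, 17))
print(query)
```

Keeping both the date window and the set of sample_id values small is the point of the sketch: each filter reduces how many rows the shared cluster has to scan.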