Telemetry/Available Telemetry Datasets and their Applications: Difference between revisions

→‎Main Summary: Update main summary warnings
(Add Main Summary MVP)
Like the longitudinal dataset, main summary summarizes [https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/main-ping.html main pings]. Each row corresponds to a single ping. This table does no sampling and includes all desktop pings.


Be aware that queries against this dataset can take a while to complete (~30 minutes for a simple query). Consider using the longitudinal dataset instead, which samples to 1% of all data and organizes the data by client_id.
===Caveats===
Querying main summary on SQL.t.m.o/re:dash can '''impact performance for other users''' and can '''take a while to complete''' (~30 minutes for simple queries). Because main summary includes a row for every ping, it holds a very large number of records, and scanning them can consume a lot of resources on the shared cluster.
 
Instead, we recommend using the longitudinal dataset where possible when querying from re:dash/sql.t.m.o. The longitudinal dataset samples to 1% of all data and organizes the data by client_id. In the odd case where queries against main summary are necessary, limit them to a short submission_date_s3 range and ideally filter on the sample_id field. Even better, try using Spark.
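
As a rough illustration of the advice above, the following sketch builds a query restricted to a short submission_date_s3 range and a single sample_id value. It rests on several assumptions that are not stated on this page: that the table is exposed as <code>main_summary</code>, that submission_date_s3 is stored as a <code>YYYYMMDD</code> string, and that sample_id is a string bucket from 0 to 99. The helper name <code>build_main_summary_query</code> is hypothetical. Check the actual schema before relying on it.

```python
from datetime import date


def build_main_summary_query(start: date, end: date, sample_id: int = 42) -> str:
    """Build a narrow query against main summary (hypothetical helper).

    Assumptions: table name `main_summary`, `submission_date_s3` stored as a
    YYYYMMDD string, `sample_id` stored as a string bucket between 0 and 99.
    """
    if end < start:
        raise ValueError("end date must not be before start date")
    if not 0 <= sample_id <= 99:
        raise ValueError("sample_id is assumed to be in the range 0..99")

    # Restrict both the date range and the sample bucket so the query
    # only scans a small slice of the per-ping rows.
    return (
        "SELECT COUNT(DISTINCT client_id) AS clients\n"
        "FROM main_summary\n"
        f"WHERE submission_date_s3 BETWEEN '{start:%Y%m%d}' AND '{end:%Y%m%d}'\n"
        f"  AND sample_id = '{sample_id}'"
    )


if __name__ == "__main__":
    # Three days of data, one sample bucket.
    print(build_main_summary_query(date(2016, 9, 1), date(2016, 9, 3), sample_id=7))
```

Keeping the date range to a few days and pinning one sample_id value is what limits the work done on the shared cluster; widening either filter increases the number of rows scanned.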


==Cross Sectional==