Telemetry/LongitudinalExamples: Difference between revisions

← Older edit

Telemetry/LongitudinalExamples (view source)

Revision as of 22:11, 7 November 2016

5,120 bytes removed , 7 November 2016

Deprecate wiki based d10n

Harter

54

edits

@@ Line 1: / Line 1: @@
-===Introduction===
+This document is now hosted here:
-{{longitudinal data intro}}
+https://github.com/mozilla/telemetry-batch-view/blob/master/docs/longitudinal_examples.md
-=== Table structure ===
+Wiki pages linking to this dead page:
+https://wiki.mozilla.org/api.php?action=query&list=backlinks&bltitle=Telemetry/LongitudinalExamples
-To get an overview of the longitudinal data table:
- DESCRIBE longitudinal
-That table has a row for each client, with columns for the different parts of the ping.
-There are a lot of fields here, so I recommend downloading the results as a CSV if you want to search through these fields.
-Unfortunately, there's no way to filter the output of DESCRIBE in Presto.
-Because this table combines all rows for a given client id, most columns contain either Arrays or Maps (described below). A few properties are directly available to query on:
- SELECT count(*) AS count
- FROM longitudinal
- WHERE os = 'Linux'
-==== Arrays ====
-Most properties are arrays, which contain one entry for each submission from a given client (newest first). Note that indexing starts at 1:
- SELECT reason[1] AS newest_reason
- FROM longitudinal
- WHERE os = 'Linux'
-To expand arrays and maps and work on the data row-wise we can use <code>UNNEST(array)</code>.
- WITH lengths AS
-   (SELECT os, greatest(-1, least(31, sl / (24*60*60))) AS days
-    FROM longitudinal
-    CROSS JOIN UNNEST(session_length, reason) AS t(sl, r)
-    WHERE r = 'shutdown' OR r = 'aborted-session')
- SELECT os, days, count(*) AS count
- FROM lengths
- GROUP BY days, os ORDER BY days ASC
-However, it may be better to use a sample from the main_summary table instead.
-Links:
-* [https://prestodb.io/docs/current/functions/array.html Documentation on array functions]
-* [https://prestodb.io/docs/current/sql/select.html#unnest <code>UNNEST</code> documentation]
-==== Maps ====
-Some fields like <code>active_addons</code> or <code>user_prefs</code> are handled as maps, on which you can use the <code>[]</code> operator and special functions:
- WITH adp AS
-   (SELECT active_addons[1]['{d10d0bf8-f5b5-c8b4-a8b2-2b9879e08c5d}']
-             IS NOT null AS has_adblockplus
-    FROM longitudinal)
- SELECT has_adblockplus, count(*) AS count
- FROM adp GROUP BY 1 ORDER BY 2 DESC
-Links:
-* [https://prestodb.io/docs/current/functions/map.html Documentation on map functions]
-=== Sampling ===
-While composing queries, it can be helpful to work on small samples to reduce query runtimes:
- SELECT * FROM longitudinal LIMIT 1000 ...
-There's no need to use other sampling methods, such as TABLESAMPLE, on the longitudinal set. Rows are randomly ordered, so a LIMIT sample is expected to be random.
-=== Example Queries ===
-* Blocklist URLs (extensions.blocklist.url):
- SELECT bl, COUNT(bl)
- FROM
-   (SELECT element_at(settings, 1).user_prefs['extensions.blocklist.url'] AS bl
-    FROM longitudinal)
- GROUP BY bl
-* Blocklist enabled/disabled (extensions.blocklist.enabled) count:
- SELECT bl, COUNT(bl)
- FROM
-   (SELECT element_at(settings, 1).blocklist_enabled AS bl
-    FROM longitudinal)
- GROUP BY bl
-* Parsing most recent submission_date
- SELECT DATE_PARSE(submission_date[1], '%Y-%m-%dT00:00:00.000Z') as parsed_submission_date
- FROM longitudinal
-* Limiting to most recent ping in the last 7 days
- SELECT * FROM longitudinal
- WHERE DATE_DIFF('day', DATE_PARSE(submission_date[1], '%Y-%m-%dT00:00:00.000Z'), current_date) < 7
-* Scalar measurement (how many users with more than 100 tabs)
- WITH samples AS
-  (SELECT
-    client_id,
-    normalized_channel as channel,
-    mctc AS max_concurrent_tabs
-   FROM longitudinal
-   CROSS JOIN UNNEST(scalar_parent_browser_engagement_max_concurrent_tab_count) as t (mctc)
-   WHERE
-    scalar_parent_browser_engagement_max_concurrent_tab_count is not null and
-    normalized_channel = 'nightly')
- SELECT approx_distinct(client_id) FROM samples WHERE max_concurrent_tabs.value > 100
-=== Using Views ===
-If you find yourself copy/pasting SQL between different queries, consider using a Presto VIEW to allow for code reuse. Views create logical tables which you can reuse in other queries. For example, [https://sql.telemetry.mozilla.org/queries/776/source this view] defines some important filters and derived variables which are then used in [https://sql.telemetry.mozilla.org/queries/777/source#1311 this downstream query].
-You can define a view by prefixing your query with
- CREATE OR REPLACE VIEW view_name AS ...
-Be careful not to overwrite an existing view! Using a unique name is important.
-Find more information [https://prestodb.io/docs/current/sql/create-view.html here].
-=== Working offline ===
-It's often useful to keep a local sample of the l10l data when prototyping an analysis. The data is stored in s3://telemetry-parquet/longitudinal/. Once you have AWS credentials you can copy a shard of the parquet dataset to a local directory using `aws cp [filename] .`
-To request AWS credentials, see [https://mana.mozilla.org/wiki/display/SVCOPS/Requesting+A+Dev+IAM+account+from+Cloud+Operations this page]. To initiate your AWS config, try `aws configure`
-=== FAQ ===
-==== I'm getting an error, "... cannot be resolved" ====
-For some reason, re:dash has trouble parsing SQL strings with double quotes. Try using single quotes instead.
-=== Other Resources ===
-* [https://prestodb.io/docs/current/sql.html Presto Docs]
-* [https://docs.treasuredata.com/articles/presto-query-faq Helpful FAQ covering perf/distribution]
-* [https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/Longitudinal.scala#L194 Longitudinal schema definition]
-* [[Custom_dashboards_with_re:dash]]