Telemetry/LongitudinalExamples: Difference between revisions

502 bytes added, 2 August 2016

Fixing l10l sampling methodology.

@@ Line 24: / Line 24: @@
   SELECT * FROM longitudinal LIMIT 1000 ...
-Or to look at a 1% sample of the clients:
+For a statistically sound sample, use TABLESAMPLE:
+ SELECT * FROM longitudinal TABLESAMPLE BERNOULLI(xx)
- SELECT * FROM longitudinal WHERE sample_id[1] = 5 ...
+Here xx is an integer giving the percentage of rows to include in the sample (e.g. xx=10 for a 10% sample).
-The sample_id partitions the clients into stable ~1% samples.
+A couple of caveats:
+* This sampling method reduces query run time only if the query does a lot of processing on the sampled rows. Bernoulli sampling still has to read the whole table before any rows are discarded.
+* The sample is not deterministic, i.e. you will not get the same sample on every run. This can cause problems with Presto views or logical tables, because each evaluation may return different rows.
+* Unlike LIMIT, this method does not guarantee a fixed number of results.
 === Arrays ===
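
The caveats added to the sampling paragraph in this revision can be illustrated with a small simulation. This is a minimal sketch of Bernoulli sampling in general, not of Presto's implementation: the `bernoulli_sample` helper and the integer "rows" are hypothetical stand-ins. It keeps each row independently with probability xx/100, which is why every row is still visited, why the result size varies, and why two runs differ.

```python
import random


def bernoulli_sample(rows, percent, rng):
    """Keep each row independently with probability percent / 100.

    Every row is visited once, mirroring the fact that the whole
    table is still scanned even though most rows are dropped.
    """
    p = percent / 100.0
    return [row for row in rows if rng.random() < p]


# Hypothetical stand-in for the rows of a table.
rows = list(range(10_000))

# Two runs with different random states, as with two separate queries.
run_a = bernoulli_sample(rows, 10, random.Random(1))
run_b = bernoulli_sample(rows, 10, random.Random(2))

# Sizes land near 10% of the input but are not fixed, unlike LIMIT.
print(len(run_a), len(run_b))

# The two samples are not expected to contain the same rows.
print(run_a == run_b)
```

Seeding the generator makes this simulation repeatable, but nothing in the source suggests that TABLESAMPLE offers an equivalent control.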