Talk:MetricsDataPing: Difference between revisions

Revision as of 04:10, 3 February 2012

5,430 bytes added , 3 February 2012

no edit summary

DEinspanjer
@@ Line 10: / Line 10: @@
 * This is why European and German law *requires* opt-in for any gathering of data about the user.
+= Discussion of using random sampling to enroll installations into the data submission vs. enrolling all installations by default =
+== Comments from [[User:DEinspanjer|DEinspanjer]] 20:10, 2 February 2012 (PST) ==
+[[User:BenB]] brought up the idea of using random sampling during the security review meeting.  The Metrics team had discussed this previously.  It is a viable option, but it has some drawbacks that could be moderate.  Anyone who manually opts in to the system must be flagged as such so that their self-selection bias does not skew the analysis.  The currently proposed system generates aggregate views of the data that roll up any high-cardinality groups to an acceptable level (the initial threshold was set at 1000).  It is reasonable to assume there will be many long-tail groups at the minimum threshold aggregation level.  Heavy sampling is likely to make that long tail useless for comparison analysis.  For example, even a 10% sample might well not allow Mozilla, or an individual user doing local analysis, to compare the performance of their installation with other installations that have a particular add-on installed.
+I do not consider this a closed topic by any means, but my personal preference is to make sure the system has adequate privacy controls and can handle the load of the full installation base.  That avoids potential issues with sampling error and with reduced analytic capability for both the user and Mozilla.
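To make the long-tail concern concrete, here is a minimal sketch of how a 10% sample interacts with the 1000-record threshold. The add-on names and group sizes are invented for illustration, and the calculation uses expected counts only (no sampling noise); it is not part of the proposal.

```python
THRESHOLD = 1000  # initial roll-up threshold mentioned in the comment above


def expected_sampled_count(group_size: int, sample_percent: int) -> int:
    """Expected number of installations of a group that end up in the sample."""
    return group_size * sample_percent // 100


def min_population_for_threshold(sample_percent: int, threshold: int = THRESHOLD) -> int:
    """Smallest real group size whose expected sampled size reaches the threshold."""
    # Ceiling division using integers only, to avoid floating-point rounding.
    return -(-threshold * 100 // sample_percent)


def surviving_groups(group_sizes: dict, sample_percent: int, threshold: int = THRESHOLD) -> dict:
    """Groups that would still be reported on their own after sampling."""
    sampled = {name: expected_sampled_count(size, sample_percent)
               for name, size in group_sizes.items()}
    return {name: count for name, count in sampled.items() if count >= threshold}


# Hypothetical add-on groups: number of installations that have each add-on.
addon_groups = {"addon-a": 250_000, "addon-b": 12_000, "addon-c": 4_000, "addon-d": 1_500}

print(surviving_groups(addon_groups, 100))  # all four groups clear the threshold
print(surviving_groups(addon_groups, 10))   # only the two largest groups remain
print(min_population_for_threshold(10))     # real group size needed at a 10% sample
```

Under these assumed numbers, a group needs ten times as many real installations to stay visible at a 10% sample, which is where the loss of long-tail comparisons comes from.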
 = Discussion of old UUID method for collecting longitudinal data =
@@ Line 105: / Line 111: @@
 At least on the server side, not on the submission side. I would have to trust you, and anything between you and me. It would be possible to separate the calls and submit each value separately, but I think that would be overdoing it.
+==== Response from [[User:DEinspanjer|DEinspanjer]] 20:10, 2 February 2012 (PST) ====
+The example above initially walks through only a couple of measures (age of installation and crash count) and a single dimension (date), but it states that it is intended to handle the several data points and dimensions that are needed.
+First, I do not think it matters whether the data points are sent in multiple requests or in a single request.  The only reason to use multiple calls would be to obscure the fingerprint of the request from the recipient.  That only works if the calls are spread over a long period of time (which would greatly inhibit useful analysis), the IP address changes frequently, and there is no fingerprintable user agent string or other HTTP header (not a reasonable assumption).  Otherwise, the sender must still ultimately trust Mozilla not to reconstruct the request chain and use the reconstructed fingerprint to identify the user by permanently linking it to PII such as the IP address or personal information.
+Second, I believe the same reasoning holds for the idea of splitting the data across multiple tables to avoid fingerprinting.  Ultimately, the user needs to trust that the company will follow the practices it commits to, because the work takes place outside of the user's view.  Mitigating factors can increase that trust, such as a simple and sound privacy policy, releasing the server source code, and sharing the aggregated data, but '''it still comes down to the community reviewing the data that is being sent, agreeing that the data itself isn't harmful to the user's privacy (i.e. it doesn't contain their e-mail address, sites visited, demographic data, etc.), and then trusting that the transport mechanism is not likely to be compromised and that the company will do what it says it is going to do.'''  If the system does not keep PII such as the IP address and does not longitudinally track sensitive information such as geolocation (both of which are strictly forbidden in the current proposal), then even an external party such as a government agent would not get anything more interesting from requesting the data than what they already have access to.  They would have to force the company to modify the source code of either the client or the server in order to collect information useful to them, and if they could actually force that, all bets are already off.
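As a rough illustration of why separate calls add little protection, the sketch below shows how a recipient that sees the source IP and user agent could regroup individually submitted values into one record. The request fields (`ip`, `user_agent`, `name`, `value`) and the sample data are hypothetical; nothing here describes what the actual server does or stores.

```python
from collections import defaultdict


def reconstruct_records(requests):
    """Group single-value requests by (ip, user_agent) to rebuild full records."""
    merged = defaultdict(dict)
    for req in requests:
        key = (req["ip"], req["user_agent"])  # transport-level fingerprint
        merged[key][req["name"]] = req["value"]
    return dict(merged)


# Hypothetical log: two clients, each sending every value in its own request.
requests = [
    {"ip": "192.0.2.10", "user_agent": "UA-1", "name": "AgeOfProfile", "value": 11},
    {"ip": "192.0.2.77", "user_agent": "UA-2", "name": "AgeOfProfile", "value": 400},
    {"ip": "192.0.2.10", "user_agent": "UA-1", "name": "NumberOfCrashes", "value": 2},
    {"ip": "192.0.2.77", "user_agent": "UA-2", "name": "NumberOfCrashes", "value": 0},
    {"ip": "192.0.2.10", "user_agent": "UA-1", "name": "AddonCount", "value": 7},
]

for fingerprint, record in reconstruct_records(requests).items():
    print(fingerprint, record)
```

Unless the IP address and headers change between calls, splitting the submission only changes how much work the recipient has to do, not what the recipient can learn.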
+So I want to walk through a more complete example with 16 data points:
+  DataName          Prev    Curr
+  SubmissionDate    00      01
+  AgeOfProfile      10      11
+  NumberOfCrashes   20      21
+  MainStartup       30      31
+  FirstPaint        40      41
+  OSNameVersion     50      51
+  AppNameVersion    60      61
+  AppBuildNumber    70      71
+  AppABI            80      81
+  AppUpdateChannel  90      91
+  Locale            A0      A1
+  NumberOfSearches  B0      B1
+  NumberOfSessions  C0      C1
+  ActiveSessionT    D0      D1
+  AddonCount        E0      E1
+  SystemMemory      F0      F1
+Some of these data points would likely be fairly constant, and some would have very low cardinality, which means large groups of records with the same values.  That said, others would change a lot and/or have very high cardinality.
+If you string together those previous and current tokens, they represent lots of bits that can be used for fingerprinting.  To me, it looks equivalent to the following:
+Header: prevID=00102030-4050-6070-8090-A0B0C0D0E0F0&currID=01112131-4151-6171-8191-A1B1C1D1E1F1  Payload: {expanded data}
+In my opinion, the two proposals are functionally equivalent at the transport layer.  At the data storage layer, both require the user to trust that the company will not attempt to misuse the collected data by linking it to PII.  If this system is acceptable given those requirements, the first should be as well.
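The equivalence claimed above can be checked mechanically. This small sketch concatenates the sixteen placeholder tokens from the table and formats them as UUID-shaped identifiers; the function name `tokens_to_id` is made up for the illustration, and the two-character tokens are the placeholders from the table, not real values.

```python
import uuid

# (DataName, Prev, Curr) placeholder tokens copied from the table above.
DATA_POINTS = [
    ("SubmissionDate", "00", "01"), ("AgeOfProfile", "10", "11"),
    ("NumberOfCrashes", "20", "21"), ("MainStartup", "30", "31"),
    ("FirstPaint", "40", "41"), ("OSNameVersion", "50", "51"),
    ("AppNameVersion", "60", "61"), ("AppBuildNumber", "70", "71"),
    ("AppABI", "80", "81"), ("AppUpdateChannel", "90", "91"),
    ("Locale", "A0", "A1"), ("NumberOfSearches", "B0", "B1"),
    ("NumberOfSessions", "C0", "C1"), ("ActiveSessionT", "D0", "D1"),
    ("AddonCount", "E0", "E1"), ("SystemMemory", "F0", "F1"),
]


def tokens_to_id(tokens):
    """Join 16 one-byte hex tokens and format them in the 8-4-4-4-12 UUID layout."""
    return str(uuid.UUID(hex="".join(tokens))).upper()


prev_id = tokens_to_id(prev for _, prev, _ in DATA_POINTS)
curr_id = tokens_to_id(curr for _, _, curr in DATA_POINTS)

print(f"prevID={prev_id}&currID={curr_id}")
```

Sixteen one-byte tokens carry 128 bits, the same size as a UUID, so the per-value previous/current tokens identify a submission chain just as well as an explicit document ID would.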