Talk:MetricsDataPing: Difference between revisions
DEinspanjer (talk | contribs) |
Revision as of 18:01, 2 February 2012
Discussion of validity of opt-out approach
Opinions from User:BenB
What difference does it make for the user
- The argument "if they don't want to, they can opt-out" is a fallacy, because most users will not know about this data gathering. They cannot opt-out, if they don't know about it, because they have never been asked.
- The difference between opt-in and opt-out is that opt-out includes many of those users who do not wish to participate, therefore violating their wishes and rights.
- So, if the argument is that the result data will be different, then yes, it will be different, because it includes those users who do not wish to be included but are included anyway in an opt-out scheme. In fact, if they actually did opt out, the data would be different again, so the same "the statistic is biased" argument applies. If the argument is that opt-out yields average data, then that is only because many users' wishes are violated.
- This is why European and German law *requires* opt-in for any gathering of data about the user.
Discussion of old UUID method for collecting longitudinal data
Update from User:DEinspanjer
Our original spec called for a UUID to be used per Firefox Profile/installation to allow longitudinal collection and analysis of the data. There were some worries we had about the possible loss of privacy in the event of a user sharing this UUID (discussion here)
This method allows us to implement the cumulative installation data as a write-only service which is better for user privacy. The previous document ID can be used by the client to issue the delete request. When that document is deleted, we will have no further data associated with that ID or Firefox profile/installation. If a document ID is ever made known to other people, they cannot use that ID to retrieve data from our system, and the ID would become orphaned within a day of subsequent use of that Firefox profile which means it would not be useful to anyone as a long term identifier.
Old content
Document Identifier Strategy
- Each profile will generate a UUID to be used as the document key. Each day's submission will use that UUID, and this will also be the key for that profile's cumulative data on the server. When each submission is received, the server merges it on the fly with the cumulative data, not persisting the individual documents.
Opinions from User:BenB
Privacy
A UUID is PII. Definition:
"Personally Identifiable Information (PII), as used in information security, is information that can be used to uniquely identify, contact, or locate a single person or can be used with other sources to uniquely identify a single individual."
A stable UUID for a user or user device is by definition always PII and never anonymous.
It is therefore regulated by European and German data protection laws and normally forbidden.
From a user standpoint, it is irrelevant whether and how Mozilla uses the data, only that the data is sent. There can be
- interceptions during transmission
- other logging server components before the server component discussed here
- legal requests by various governments
- server break-ins, or
- policy changes on the Mozilla side.
Having a UUID would, for example, make it possible to track all my dynamic IP addresses over time and, combined with access logs, to build a profile. If I have a notebook or mobile browser, it would even make it possible to track the places where I go, based on IP geolocation / whois data.
The user has no way to verify whether any of the above (break-in, intercept, intended or lawful or not) is happening or not, and that already is a privacy violation. So, it's irrelevant what the intended usage was, only what is theoretically possible. The above must be impossible - not just "We won't do it, we promise!", but impossible.
Google Chrome
Google Chrome did use a UUID for each browser. It was perceived as a serious privacy threat and was covered by the mainstream press in Germany, including the largest newspapers. Eventually, Google dropped the UUID because of the PR problems it caused.
This question of whether a UUID is used by Firefox /will/ be picked up by the press, and the result will be negative for Firefox. This is not a guess, as history shows.
Perception
People in Germany and Europe are very privacy-aware, much more so than people in the US. Firefox has a big and loyal following there, in large part because Firefox claims to do what users want and is privacy-aware. A UUID will be considered highly offensive in these countries and will cost Firefox market share.
Alternative
Instead of building the history on the server, the client should build the history and only submit results. E.g. if you need to know whether things improved, you can let the client keep some old data and submit "12 crashes last week. One week before: 12% more. One year before: 50% less."
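To illustrate the idea, here is a minimal sketch of how a client could turn its locally stored history into relative figures before submitting. The function and field names are hypothetical and not part of any spec discussed here. Note that percentages next to an exact current count can still reveal the old numbers, so a real client would probably have to round more coarsely.

```python
def percent_change(current, earlier):
    """Difference of `earlier` relative to `current`, in whole percent.

    Returns None when `current` is 0, because the ratio is undefined.
    """
    if current == 0:
        return None
    return round((earlier - current) * 100 / current)


def summarize(crashes_last_week, crashes_week_before, crashes_year_before):
    """Build the payload on the client; the raw history never leaves it.

    The field names are assumptions made for this example.
    """
    return {
        "crashes_last_week": crashes_last_week,
        "week_before_pct": percent_change(crashes_last_week, crashes_week_before),
        "year_before_pct": percent_change(crashes_last_week, crashes_year_before),
    }


payload = summarize(crashes_last_week=12, crashes_week_before=18, crashes_year_before=6)
```

A positive value means "more crashes back then", a negative one "fewer crashes back then".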
What to avoid
The submission should not include exact historic numbers either, because they, too, would make it possible to piece the numbers together and again build a history of IP addresses for a given user. Similarly, the exact time of the previous submission would make it possible to link submissions and must not be submitted; only the day should be sent (2011-02-12, not minutes or seconds). It must be impossible to match 2 submissions together, even when considering several parameters as a collection, see http://panopticlick.eff.org/ .
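A small sketch of the "only the day" rule, assuming the client holds a full timestamp internally. The helper name is made up for this example.

```python
from datetime import datetime


def submission_day(moment):
    """Drop hours, minutes and seconds; keep only the calendar day."""
    return moment.date().isoformat()


day = submission_day(datetime(2011, 2, 12, 14, 37, 5))
```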
NonIdAlternative
For simplicity, I will take the number of crashes (e.g. in the last week or overall) as the data point that you want to gather. The data itself is anonymous and cannot identify a single user (apart from fingerprinting, more on that later).
Avoiding UUID
You wanted to know which profiles are not used anymore (dormant, retention problem) and which characteristics they have. This is inherently difficult without tracking individual users (installations), but it is possible with the following algorithm:
The client submits:
- Date of last submission - e.g. 2012-01-18
- Current date (from client perspective) - only date, not time - e.g. 2012-01-20
- Age of profile (Firefox installation) in days - e.g. 500
- (Last submitted age is implied or explicit - e.g. 498 )
- Number of crashes - e.g. 15
- Number of crashes submitted last time - e.g. 10
Then, on the server, you write that information in a database, as such:
Date of submission | Age of installation | Crash count | Number of users
2012-01-20 | 500 | 15 | 100000
Any additional user also submitting today the same combination "age 500, crash count 15" increases the "number of users" column by 1, new value is 100001. Also, you look up the row for the last submission, namely
2012-01-18 | 498 | 10 | 20000
and decrease the number of users by 1, new value is 19999.
If the user later that day decides that there were too many crashes and switches to Chrome, he will now be stranded on the row
2012-01-20 | 500 | 15 | 5000
while other users who have continued to use FF have been subtracted from that row over time, as they submitted again. So, you can say with certainty that there were 5000 users who used Firefox the last time on 2012-01-20, after having used Firefox for 500 days, and they had 15 crashes (per day/week/total, whatever you submit) when they stopped using Firefox.
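The counting scheme above can be sketched roughly as follows. This is only an illustration using an in-memory counter instead of a database. The function name, the parameter names and the guard against negative counts are my assumptions, not part of the proposal.

```python
from collections import Counter

# (date of submission, age of installation, crash count) -> number of users
counts = Counter()
counts[("2012-01-18", 498, 10)] = 20000
counts[("2012-01-20", 500, 15)] = 100000


def record_submission(counts, today, age, crashes,
                      last_date=None, last_age=None, last_crashes=None):
    """Apply one ping: leave the row of the previous submission, join today's row."""
    if last_date is not None:
        previous = (last_date, last_age, last_crashes)
        # Assumed safeguard: never let a row drop below zero.
        if counts[previous] > 0:
            counts[previous] -= 1
    counts[(today, age, crashes)] += 1


# The example user from the text: last seen 2012-01-18, submitting again on 2012-01-20.
record_submission(counts, "2012-01-20", 500, 15,
                  last_date="2012-01-18", last_age=498, last_crashes=10)
```

No per-user key is stored anywhere. A user who never submits again simply stays counted in the last row he joined.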
That is exactly the information you are so desperately seeking. There, you have it. Without tracking any individual user: it's completely anonymous.
Avoiding Fingerprinting
Now, what about all the other information that you need: startup times, addons, etc.? If we just add all that information to the same table and row, it would allow fingerprinting. But that is not necessary. You merely make one table per atomic piece of information, i.e.
Table A
Date of submission | Age of installation | Crash count | Number of users
Table B
Date of submission | Age of installation | Startup time | Number of users
or of course whatever other database schema you want, as long as each value is separate. That takes care of the fingerprinting.
That holds at least on the server side, though not on the submission side, where I would have to trust you and anything between you and me. It would be possible to separate the calls and submit each value separately, but I think that would be overdoing it.
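As a rough sketch of the "one table per value" idea: the server splits an incoming submission immediately and stores each value in its own counter, so that no stored row ever combines two metrics. The metric names and the function are hypothetical, and the decrement of the previous row is left out to keep the example short.

```python
from collections import Counter, defaultdict

# One separate counter ("table") per metric name.
tables = defaultdict(Counter)


def record_metrics(tables, date, age, metrics):
    """Store every metric in its own table, keyed only by date, age and that one value."""
    for name, value in metrics.items():
        tables[name][(date, age, value)] += 1


record_metrics(tables, "2012-01-20", 500,
               {"crash_count": 15, "startup_time_ms": 1200})
```

After the split, the crash count table contains nothing about startup times and vice versa, so the combination of values cannot be read back from storage.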