Summary
Currently, Input offers two exports of the user feedback data. The data is exported in the form of TSV-coded tables:
- opinions.tsv.bz2 contains everything but the ratings
- ratings.tsv.bz2 has the ratings data
Both tables form a 1:n relationship and can be joined on the first column (the opinion ID). Both files are compressed with bzip2, so decompress them first, e.g. using bunzip2.
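As an alternative to running bunzip2 by hand, the files can be decompressed while reading. The following is a minimal sketch using Python's standard bz2 module; the function name `read_export` is made up for illustration and the path is whatever location the export was downloaded to.

```python
import bz2


def read_export(path):
    """Return the decompressed contents of a .tsv.bz2 export as a str.

    The file is read in binary mode and decoded explicitly so that no
    newline translation is applied to the TSV stream.
    """
    with bz2.open(path, "rb") as fh:
        return fh.read().decode("utf-8")
```

The returned text is still TSV-coded, so escaped characters have to be decoded before the fields are used.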
TSV Coding
The data is a UTF-8 encoded Unicode stream. Lines (= records) are separated by LF (newline, U+000A). There are no header/title records. Fields (= columns) are separated by TAB (U+0009). In this scheme, two characters within column values need to be escaped: TAB and LF. When they are part of the cell content rather than of the TSV coding, they are preceded by a backslash (U+005C). This means that backslashes in content have to be escaped the same way.
- Example FSM to parse input data
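The coding described above can be decoded with a small two-state machine (normal and escaped). The sketch below is one possible reading of the rules, not the reference implementation: it assumes that the character following a backslash is always taken literally, and the name `parse_tsv` is hypothetical.

```python
def parse_tsv(text):
    """Split backslash-escaped TSV text into records (lists of str)."""
    records, fields, cell = [], [], []
    escaped = False  # True right after an unescaped backslash
    for ch in text:
        if escaped:
            # Escaped TAB, LF or backslash: part of the cell content.
            cell.append(ch)
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == "\t":
            # Unescaped TAB ends the current field.
            fields.append("".join(cell))
            cell = []
        elif ch == "\n":
            # Unescaped LF ends the current record.
            fields.append("".join(cell))
            records.append(fields)
            fields, cell = [], []
        else:
            cell.append(ch)
    # Flush a final record that has no trailing LF.
    if cell or fields:
        fields.append("".join(cell))
        records.append(fields)
    return records
```

Note that splitting the stream on LF before unescaping would break records whose content contains an escaped newline, which is why the text is scanned character by character.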
Opinions
Fields
- 1. Opinion ID
- coded as a base10 integer, used to look up ratings or items on the Input website
- 2. Time of feedback
- base10 integer; note that this is UNIX time (i.e. seconds since 1970-01-01T00:00:00Z, so always UTC+0)
- 3. Type
- one of issue, praise, suggestion, rating
- 4. Product
- one of firefox, mobile
- 5. Version
- a version identifier such as 4.0b11 or 3.6.13
- 6. Platform
- one of mac, windows, linux, android, maemo
- 7. Locale
- a locale identifier such as en-US
- 8. Manufacturer
- for product:mobile only, the device manufacturer
- 9. Device
- for product:mobile only, a device identifier
- 10. URL
- an http, https, chrome or about URL given by the user with their feedback
- 11. Description
- Free text entered by the user. Limited to 140 Unicode characters (not bytes).
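To illustrate the field list above, an already decoded opinions record (a list of eleven strings) can be turned into a named structure. This is only a sketch: the names `Opinion` and `to_opinion` and the attribute names are assumptions chosen for illustration, and empty manufacturer/device values for non-mobile records are assumed to be empty strings.

```python
from collections import namedtuple
from datetime import datetime, timezone

# Attribute names follow the field order listed above (hypothetical names).
Opinion = namedtuple(
    "Opinion",
    "opinion_id time type product version platform locale "
    "manufacturer device url description",
)


def to_opinion(fields):
    """Convert one decoded opinions record into an Opinion tuple."""
    if len(fields) != 11:
        raise ValueError("expected 11 fields, got %d" % len(fields))
    return Opinion(
        int(fields[0]),
        # UNIX time is seconds since the epoch in UTC.
        datetime.fromtimestamp(int(fields[1]), tz=timezone.utc),
        *fields[2:]
    )
```

For example, `to_opinion(["42", "1300000000", "praise", "firefox", "4.0b11", "mac", "en-US", "", "", "http://example.com/", "Great"])` yields an opinion whose `time` is a timezone-aware datetime in UTC.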
Ratings
One line per (opinion × rating category) pair. Keyed to the opinions table by opinion ID.
Fields
- 1. Opinion ID
- base10 integer, used to group related ratings
- 2. Rating Type
- one of performance, startup
- 3. Rating Value
- base10 integer ranging from 0 to 100.
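Because several ratings lines can refer to the same opinion, a common first step is to group them by opinion ID so they can be looked up when walking the opinions table. The following sketch assumes the ratings records have already been decoded into lists of three strings, and that the range 0 to 100 includes both ends; `group_ratings` is a hypothetical helper name.

```python
from collections import defaultdict


def group_ratings(rating_records):
    """Map opinion ID -> {rating type: value} from decoded ratings records."""
    by_opinion = defaultdict(dict)
    for opinion_id, rating_type, value in rating_records:
        value = int(value)
        # Assumption: both 0 and 100 are valid rating values.
        if not 0 <= value <= 100:
            raise ValueError("rating out of range: %d" % value)
        by_opinion[int(opinion_id)][rating_type] = value
    return dict(by_opinion)
```

With the grouped mapping, the ratings for an opinion with ID `n` are available as `grouped.get(n, {})`, which is empty for opinions that carry no ratings.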