Firefox/Input/Sites
Sites Entities
This is a snapshot of what the sites tables look like, primarily intended for discussing changes to the clustering process.
SiteSummary
A set of parameters (url, Firefox version, OS, feedback type: happy or sad) with an ID.
All comments matching the same set of parameters are clustered together.
The parameters are also used to browse clusters on the dashboard.
Fields:
- id (1, 2, 3...)
- url ("http://google.com", "https://ebay.com")
- version ("4.0b7", "4.0b9pre"...)
- os ("mac", "vista", "android"..., and "<mobile>", "<desktop>" that match comments from a union of those)
- positive (1=happy, 2=sad, optional: NULL=both)
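Since all comments sharing the same parameter set are clustered together, the grouping step can be sketched as keying comments on that tuple. This is an illustrative sketch with made-up sample data, not the production code:

```python
from collections import defaultdict

# Hypothetical comment records; in production these come from the feedback DB.
comments = [
    {"url": "http://google.com", "version": "4.0b7", "os": "mac",
     "positive": 1, "text": "fast!"},
    {"url": "http://google.com", "version": "4.0b7", "os": "mac",
     "positive": 1, "text": "very fast"},
    {"url": "https://ebay.com", "version": "4.0b9pre", "os": "vista",
     "positive": 2, "text": "broken layout"},
]

def summary_key(c):
    """The SiteSummary parameter set: comments sharing it cluster together."""
    return (c["url"], c["version"], c["os"], c["positive"])

by_summary = defaultdict(list)
for c in comments:
    by_summary[summary_key(c)].append(c)
```

Each key in `by_summary` corresponds to one SiteSummary row; the special `<mobile>`/`<desktop>` OS values would instead match a union of keys.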
Cluster
Comments that have been put together by the clustering algorithm.
Points to the SiteSummary it belongs to.
Also points to a main comment that represents the cluster on overview pages.
- id
- site_id
- site_summary_id
- primary_description (denormalized, same as in comments table)
- primary_comment_id (a pointer to the comment that is source of the primary description)
- positive (1=happy, 2=sad)
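The page does not spell out how the primary comment is chosen, but given the per-comment `score` field (higher = more representative), one plausible rule is to take the highest-scoring comment in the cluster and denormalize it into the cluster row. A hypothetical sketch:

```python
# Hypothetical cluster members; score (0..10.0) is assigned by the
# clustering algorithm, higher meaning more representative of the cluster.
cluster_comments = [
    {"id": 11, "description": "crashes on flash video", "score": 4.2},
    {"id": 12, "description": "flash videos crash the browser", "score": 9.1},
    {"id": 13, "description": "crash", "score": 2.0},
]

# Pick the most representative comment as the cluster's primary comment.
primary = max(cluster_comments, key=lambda c: c["score"])

# primary["id"] and primary["description"] would then be stored on the
# cluster row as primary_comment_id and primary_description.
```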
Comment
Contains the actual text of the comment.
Points to the cluster it belongs to.
Also contains the ID of the original feedback item (which contains more info).
- id
- cluster_id
- description (the comment text)
- opinion_id (a link into the actual source table)
- score (0...10.0, assigned by the clustering algorithm; higher means more representative of the cluster)
Structure
site_summary 1:n cluster 1:n comment
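The 1:n:n chain can be written out as plain dataclasses carrying the foreign keys from the field lists above (a schema sketch, not the actual ORM models):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SiteSummary:
    id: int
    url: str
    version: str
    os: str
    positive: Optional[int]  # 1=happy, 2=sad, None=both

@dataclass
class Cluster:
    id: int
    site_summary_id: int  # n clusters per site_summary
    primary_comment_id: int
    primary_description: str  # denormalized from the primary comment
    positive: int

@dataclass
class Comment:
    id: int
    cluster_id: int  # n comments per cluster
    description: str
    opinion_id: int  # link into the actual source table
    score: float
```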
Considerations for map/reduce implementation of clustering (bug 621725)
- time is mainly spent on clustering; IO is not a significant factor
- the old algorithm used a processing DB with a lot of chatter, making it very slow if the DB is not on localhost
- (web-)developers should be able to start a clustering job without a Hadoop installation
What has been done for Input 3.0
- used Dumbo + Hadoop Streaming
- clustering itself is parallelized
- import is sequential (id generation requires shared counter)...
- ...but efficient (via TSV + "load data infile")
- for the 1M comments in the DB, the new approach takes under 10 minutes even on one CPU (old script: ~2h, even slower with a non-local processing DB)
- expected: about 80-90% of the time is spent on clustering and can be parallelized
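The sequential-but-efficient import step could look roughly like the sketch below: IDs are generated from a shared counter (which is why this step cannot be parallelized), rows are written to a TSV, and the TSV is bulk-loaded in one statement. File name, column layout, and sample rows are assumptions; `LOAD DATA INFILE` is the MySQL syntax the page mentions:

```python
import csv

# Hypothetical clustered rows: (cluster_id, description, opinion_id, score).
rows = [
    (1, "flash videos crash the browser", 101, 9.1),
    (1, "crashes on flash video", 102, 4.2),
]

# Sequential ID generation: the shared counter forces this loop to be serial.
next_id = 1
with open("comments.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for cluster_id, description, opinion_id, score in rows:
        writer.writerow([next_id, cluster_id, description, opinion_id, score])
        next_id += 1

# The TSV is then bulk-loaded in a single MySQL statement:
load_sql = """
LOAD DATA INFILE 'comments.tsv' INTO TABLE comment
FIELDS TERMINATED BY '\\t'
(id, cluster_id, description, opinion_id, score);
"""
```

Bulk-loading a TSV avoids the per-row round trips that made the old processing DB so slow over the network.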