Firefox/Input/Sites
Sites Entities
This is a snapshot of what the sites tables look like (Input 3.0).
At the moment, this page exists primarily to discuss possible changes to the clustering process.
SiteSummary
A set of parameters (URL, Firefox version, OS, feedback type: happy or sad) with an ID.
All comments matching the same set of parameters are clustered together.
The parameters are also used to browse clusters on the dashboard (see the sketch after the field list).
Fields:
- id (1, 2, 3...)
- url ("http://example.com", "https://example.net")
- version ("4.0b7", "4.0b9pre"...)
- os ("mac", "vista", "android"..., and "<mobile>", "<desktop>" that match comments from a union of those)
- positive (1=happy, 2=sad, optional: NULL=both)
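For illustration only, here is a minimal Python sketch of the parameter tuple that groups comments into a SiteSummary. The function names are hypothetical, and the idea that the "<mobile>"/"<desktop>" and NULL variants are expanded into extra keys is just one possible reading of the description above, not the actual Input implementation.

 # Hypothetical sketch, not the actual Input code: the tuple of parameters
 # that all comments in one SiteSummary share.
 MOBILE_OSES = {"android", "maemo"}   # assumption: which os values count as mobile
 
 def site_summary_key(url, version, os, positive):
     """Comments sharing this exact tuple belong to the same SiteSummary."""
     return (url, version, os, positive)
 
 def union_keys(url, version, os, positive):
     """Possible expansion into the "<mobile>"/"<desktop>" and NULL (=both)
     variants used for browsing clusters on the dashboard (assumption)."""
     group = "<mobile>" if os in MOBILE_OSES else "<desktop>"
     for o in (os, group):
         for p in (positive, None):   # None stands in for NULL = both
             yield (url, version, o, p)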
Cluster
Comments that have been put together by the clustering algorithm.
Points to the sitesummary it belongs to.
Also points to a main comment that represents the cluster on overview pages.
- id
- site_id
- site_summary_id
- primary_description (denormalized, same as in comments table)
- primary_comment_id (a pointer to the comment that is source of the primary description)
- positive (1=happy, 2=sad)
Comment
Contains the actual text of the comment.
Points to the cluster it belongs to.
Also contains the ID of the original feedback item (which contains more info).
- id
- cluster_id
- description (the comment text)
- opinion_id (a link into the actual source table)
- score (0.0-10.0, assigned by the clustering algorithm; higher means more representative of the cluster)
Structure
site_summary 1:n cluster 1:n comment
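As a rough sketch of these three entities and the 1:n structure above, here are plain Python dataclasses rather than the actual Input models; pick_primary assumes the primary comment is simply the highest-scoring one, which the field descriptions suggest but do not state.

 from dataclasses import dataclass, field
 from typing import List, Optional
 
 @dataclass
 class Comment:
     id: int
     cluster_id: int
     description: str        # the comment text
     opinion_id: int         # link into the actual source table
     score: float            # 0.0-10.0, higher = more representative of the cluster
 
 @dataclass
 class Cluster:
     id: int
     site_id: int
     site_summary_id: int
     positive: int                                # 1=happy, 2=sad
     primary_comment_id: Optional[int] = None
     primary_description: Optional[str] = None    # denormalized copy of the primary comment text
     comments: List[Comment] = field(default_factory=list)
 
 @dataclass
 class SiteSummary:
     id: int
     url: str
     version: str
     os: str
     positive: Optional[int]                      # 1=happy, 2=sad, None=both
     clusters: List[Cluster] = field(default_factory=list)
 
 def pick_primary(cluster: Cluster) -> None:
     # assumption: the primary comment is simply the highest-scoring comment
     best = max(cluster.comments, key=lambda c: c.score)
     cluster.primary_comment_id = best.id
     cluster.primary_description = best.description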
Considerations for map/reduce implementation of clustering (bug 621725)
- time is mainly spent doing the clustering; I/O is not that big of a deal
- the old algorithm used a processing DB with a lot of chatter and was very slow if the DB was not on localhost
- (web) developers should be able to start a clustering job without a Hadoop installation
- the clustering algorithm and other (smaller) parts of the Input app are written in Python (see the job sketch below)
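A minimal sketch of what such a job could look like with Dumbo on Hadoop Streaming; the TSV record layout and the inlined cluster_comments placeholder are assumptions, not the actual Input job.

 # Hypothetical Dumbo/Hadoop Streaming skeleton; the record layout and
 # cluster_comments() are placeholders, not the real Input clustering job.
 
 def cluster_comments(comments):
     # placeholder: the real Python clustering algorithm would run here
     return [comments]
 
 def mapper(key, value):
     # value: one normalized feedback record as a TSV line (assumed layout)
     url, version, os, positive, description = value.split("\t")
     # key by the site-summary parameters so each reducer sees one site summary
     yield (url, version, os, positive), description
 
 def reducer(key, values):
     # the expensive pairwise clustering happens here, in parallel per key
     for cluster in cluster_comments(list(values)):
         yield key, cluster
 
 if __name__ == "__main__":
     import dumbo
     dumbo.run(mapper, reducer)

Dumbo can also run the same script locally through its UNIX backend (no Hadoop cluster needed), which would be one way to address the point about web developers starting clustering jobs without a Hadoop installation.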
Numbers
- Over 6 months (beta) we accumulated 1M pieces of feedback
- There are about 300k site summaries now with
- 683k clusters comprised of
- 820k comments
- there are a few larger clusters with 20 to 4000 comments (~2k, < 1% of all clusters). These take the bulk of the clustering time (n**2 algorithm; see the sketch after this list), as they come from large site summaries.
- there are ~35k small clusters (2-20 comments, mostly 2 or 3)
- and a very long tail of single-comment "clusters"
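To illustrate why those few large site summaries dominate the runtime, here is a toy quadratic clustering pass over one site summary; the similarity measure and the threshold are placeholders, not the actual algorithm.

 import difflib
 
 def similar(a, b, threshold=0.6):
     # placeholder similarity measure; the real algorithm differs
     return difflib.SequenceMatcher(None, a, b).ratio() >= threshold
 
 def cluster_site_summary(comments):
     """Naive clustering: compare each comment against a representative of
     every existing cluster, which is O(n**2) in the worst case."""
     clusters = []
     for comment in comments:
         for cluster in clusters:
             if similar(cluster[0], comment):
                 cluster.append(comment)
                 break
         else:
             clusters.append([comment])
     return clusters

For a 4000-comment site summary this is on the order of millions of comparisons, while the ~35k small clusters of 2-3 comments cost almost nothing, which matches the time distribution described above.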
What has been done for Input 3.0
- used Dumbo + Hadoop Streaming
- clustering itself is parallelized
- import (normalization) is sequential (ID generation requires a shared counter)...
- ...but efficient (via TSV + "LOAD DATA INFILE"; see the sketch after this list)
- for the 1M comments in the DB, the new approach takes under 10 minutes even on one CPU (old script: ~2h, even slower with a non-local processing DB)
- expected: about 80%-90% of the time is spent doing clustering and can be parallelized.
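A sketch of the import path described above, under the assumption of a MySQL database and the MySQLdb driver; the table and column names and connection settings are illustrative only.

 import csv
 import MySQLdb   # assumption: MySQL with the MySQLdb driver
 
 def bulk_import(rows, path="/tmp/comments.tsv"):
     # sequential step: assign IDs from a single shared counter while writing TSV
     with open(path, "w", newline="") as tsv:
         writer = csv.writer(tsv, delimiter="\t", lineterminator="\n")
         for comment_id, (cluster_id, description, opinion_id, score) in enumerate(rows, 1):
             writer.writerow([comment_id, cluster_id, description, opinion_id, score])
 
     # bulk load is far faster than row-by-row INSERTs
     db = MySQLdb.connect(db="input")     # illustrative connection settings
     cur = db.cursor()
     # path is a trusted local filename here, not user input
     cur.execute(
         "LOAD DATA INFILE '%s' INTO TABLE comments "
         "FIELDS TERMINATED BY '\\t' "
         "(id, cluster_id, description, opinion_id, score)" % path
     )
     db.commit()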