Firefox/Input/Sites
Sites Entities
This is a snapshot of what the sites tables look like, primarily intended for discussing changes to the clustering process.
SiteSummary
A set of parameters (url, Firefox version, OS, feedback type: happy or sad) with an ID.
All comments matching the same set of parameters are clustered together.
The parameters are also used to browse clusters on the dashboard.
Fields:
- id (1, 2, 3...)
- url ("http://google.com", "https://ebay.com")
- version ("4.0b7", "4.0b9pre"...)
- os ("mac", "vista", "android"..., and "<mobile>", "<desktop>" that match comments from a union of those)
- positive (1=happy, 2=sad, optional: NULL=both)
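Since all comments sharing the same parameter set are clustered together, the grouping step can be sketched as keying comments on that tuple. This is an illustrative sketch with made-up sample data, not the production code:

```python
from collections import defaultdict

# Hypothetical comment records; in production these come from the feedback DB.
comments = [
    {"url": "http://google.com", "version": "4.0b7", "os": "mac",
     "positive": 1, "text": "fast!"},
    {"url": "http://google.com", "version": "4.0b7", "os": "mac",
     "positive": 1, "text": "very fast"},
    {"url": "https://ebay.com", "version": "4.0b9pre", "os": "vista",
     "positive": 2, "text": "broken layout"},
]

def summary_key(c):
    """The SiteSummary parameter set: comments sharing it cluster together."""
    return (c["url"], c["version"], c["os"], c["positive"])

by_summary = defaultdict(list)
for c in comments:
    by_summary[summary_key(c)].append(c)
```

Each key in `by_summary` corresponds to one SiteSummary row; the special `<mobile>`/`<desktop>` OS values would instead match a union of keys.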
Cluster
Comments that have been put together by the clustering algorithm.
Points to the SiteSummary it belongs to.
Also points to a main comment that represents the cluster on overview pages.
- id
- site_id
- site_summary_id
- primary_description (denormalized, same as in comments table)
- primary_comment_id (a pointer to the comment that is source of the primary description)
- positive (1=happy, 2=sad)
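The page does not spell out how the primary comment is chosen, but given the per-comment `score` field (higher = more representative), one plausible rule is to take the highest-scoring comment in the cluster and denormalize it into the cluster row. A hypothetical sketch:

```python
# Hypothetical cluster members; score (0..10.0) is assigned by the
# clustering algorithm, higher meaning more representative of the cluster.
cluster_comments = [
    {"id": 11, "description": "crashes on flash video", "score": 4.2},
    {"id": 12, "description": "flash videos crash the browser", "score": 9.1},
    {"id": 13, "description": "crash", "score": 2.0},
]

# Pick the most representative comment as the cluster's primary comment.
primary = max(cluster_comments, key=lambda c: c["score"])

# primary["id"] and primary["description"] would then be stored on the
# cluster row as primary_comment_id and primary_description.
```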
Comment
Contains the actual text of the comment.
Points to the cluster it belongs to.
Also contains the ID of the original feedback item (which contains more info).
- id
- cluster_id
- description (the comment text)
- opinion_id (a link into the actual source table)
- score (0...10.0, assigned by the clustering algorithm; higher means more representative of the cluster)
Structure
site_summary 1:n cluster 1:n comment
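The 1:n:n chain can be written out as plain dataclasses carrying the foreign keys from the field lists above (a schema sketch, not the actual ORM models):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SiteSummary:
    id: int
    url: str
    version: str
    os: str
    positive: Optional[int]  # 1=happy, 2=sad, None=both

@dataclass
class Cluster:
    id: int
    site_summary_id: int  # n clusters per site_summary
    primary_comment_id: int
    primary_description: str  # denormalized from the primary comment
    positive: int

@dataclass
class Comment:
    id: int
    cluster_id: int  # n comments per cluster
    description: str
    opinion_id: int  # link into the actual source table
    score: float
```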
Considerations for map/reduce implementation of clustering (bug 621725)
- time is mainly spent on clustering; IO is not a significant factor
- the old algorithm used a processing DB with a lot of chatter, making it very slow if the DB is not on localhost
- (web-)developers should be able to start a clustering job without a Hadoop installation
What has been done for Input 3.0
- used Dumbo + Hadoop Streaming
- clustering itself is parallelized
- import is sequential (id generation requires shared counter)...
- ...but efficient (via TSV + "load data infile")
- for the 1M comments in the DB, the new approach takes under 10 minutes even on one CPU (old script: ~2h, even slower with a non-local processing DB)
- expected: about 80-90% of the time is spent on clustering and can be parallelized
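The sequential-but-efficient import step could look roughly like the sketch below: IDs are generated from a shared counter (which is why this step cannot be parallelized), rows are written to a TSV, and the TSV is bulk-loaded in one statement. File name, column layout, and sample rows are assumptions; `LOAD DATA INFILE` is the MySQL syntax the page mentions:

```python
import csv

# Hypothetical clustered rows: (cluster_id, description, opinion_id, score).
rows = [
    (1, "flash videos crash the browser", 101, 9.1),
    (1, "crashes on flash video", 102, 4.2),
]

# Sequential ID generation: the shared counter forces this loop to be serial.
next_id = 1
with open("comments.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for cluster_id, description, opinion_id, score in rows:
        writer.writerow([next_id, cluster_id, description, opinion_id, score])
        next_id += 1

# The TSV is then bulk-loaded in a single MySQL statement:
load_sql = """
LOAD DATA INFILE 'comments.tsv' INTO TABLE comment
FIELDS TERMINATED BY '\\t'
(id, cluster_id, description, opinion_id, score);
"""
```

Bulk-loading a TSV avoids the per-row round trips that made the old processing DB so slow over the network.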