Firefox/Input/Sites
Sites Entities
This is a snapshot of what the sites tables look like (Input 3.0).
At the moment, this page exists primarily to discuss possible changes to the clustering process.
SiteSummary
A set of parameters (URL, Firefox version, OS, feedback type: happy or sad) with an ID.
All comments matching the same set of parameters are clustered together.
The parameters are also used to browse clusters on the dashboard (see the sketch after the field list).
Fields:
- id (1, 2, 3...)
- url ("http://example.com", "https://example.net")
- version ("4.0b7", "4.0b9pre"...)
- os ("mac", "vista", "android"..., and "<mobile>", "<desktop>" that match comments from a union of those)
- positive (1=happy, 2=sad, optional: NULL=both)
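For illustration only, here is a minimal Python sketch of the parameter tuple that groups comments into a SiteSummary. The function names are hypothetical, and the idea that the "<mobile>"/"<desktop>" and NULL variants are expanded into extra keys is just one possible reading of the description above, not the actual Input implementation.

 # Hypothetical sketch, not the actual Input code: the tuple of parameters
 # that all comments in one SiteSummary share.
 MOBILE_OSES = {"android", "maemo"}   # assumption: which os values count as mobile
 
 def site_summary_key(url, version, os, positive):
     """Comments sharing this exact tuple belong to the same SiteSummary."""
     return (url, version, os, positive)
 
 def union_keys(url, version, os, positive):
     """Possible expansion into the "<mobile>"/"<desktop>" and NULL (=both)
     variants used for browsing clusters on the dashboard (assumption)."""
     group = "<mobile>" if os in MOBILE_OSES else "<desktop>"
     for o in (os, group):
         for p in (positive, None):   # None stands in for NULL = both
             yield (url, version, o, p)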
Cluster
Comments that have been put together by the clustering algorithm.
Points to the sitesummary it belongs to.
Also points to a main comment that represents the cluster on overview pages.
- id
- site_id
- site_summary_id
- primary_description (denormalized, same as in comments table)
- primary_comment_id (a pointer to the comment that is source of the primary description)
- positive (1=happy, 2=sad)
Comment
Contains the actual text of the comment.
Points to the cluster it belongs to.
Also contains the ID of the original feedback item (which contains more info).
- id
- cluster_id
- description (the comment text)
- opinion_id (a link into the actual source table)
- score (0.0-10.0, assigned by the clustering algorithm; higher means more representative of the cluster)
Structure
site_summary 1:n cluster 1:n comment
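As a rough sketch of these three entities and the 1:n structure above, here are plain Python dataclasses rather than the actual Input models; pick_primary assumes the primary comment is simply the highest-scoring one, which the field descriptions suggest but do not state.

 from dataclasses import dataclass, field
 from typing import List, Optional
 
 @dataclass
 class Comment:
     id: int
     cluster_id: int
     description: str        # the comment text
     opinion_id: int         # link into the actual source table
     score: float            # 0.0-10.0, higher = more representative of the cluster
 
 @dataclass
 class Cluster:
     id: int
     site_id: int
     site_summary_id: int
     positive: int                                # 1=happy, 2=sad
     primary_comment_id: Optional[int] = None
     primary_description: Optional[str] = None    # denormalized copy of the primary comment text
     comments: List[Comment] = field(default_factory=list)
 
 @dataclass
 class SiteSummary:
     id: int
     url: str
     version: str
     os: str
     positive: Optional[int]                      # 1=happy, 2=sad, None=both
     clusters: List[Cluster] = field(default_factory=list)
 
 def pick_primary(cluster: Cluster) -> None:
     # assumption: the primary comment is simply the highest-scoring comment
     best = max(cluster.comments, key=lambda c: c.score)
     cluster.primary_comment_id = best.id
     cluster.primary_description = best.description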
Considerations for map/reduce implementation of clustering (bug 621725)
- time is mainly spent doing the clustering; I/O is not that big of a deal
- the old algorithm used a processing DB with a lot of chatter and was very slow if the DB was not on localhost
- (web) developers should be able to start a clustering job without a Hadoop installation
- the clustering algorithm and other (smaller) parts of the Input app are written in Python (see the job sketch below)
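A minimal sketch of what such a job could look like with Dumbo on Hadoop Streaming; the TSV record layout and the inlined cluster_comments placeholder are assumptions, not the actual Input job.

 # Hypothetical Dumbo/Hadoop Streaming skeleton; the record layout and
 # cluster_comments() are placeholders, not the real Input clustering job.
 
 def cluster_comments(comments):
     # placeholder: the real Python clustering algorithm would run here
     return [comments]
 
 def mapper(key, value):
     # value: one normalized feedback record as a TSV line (assumed layout)
     url, version, os, positive, description = value.split("\t")
     # key by the site-summary parameters so each reducer sees one site summary
     yield (url, version, os, positive), description
 
 def reducer(key, values):
     # the expensive pairwise clustering happens here, in parallel per key
     for cluster in cluster_comments(list(values)):
         yield key, cluster
 
 if __name__ == "__main__":
     import dumbo
     dumbo.run(mapper, reducer)

Dumbo can also run the same script locally through its UNIX backend (no Hadoop cluster needed), which would be one way to address the point about web developers starting clustering jobs without a Hadoop installation.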
Numbers
- Over 6 months (beta) we accumulated 1M pieces of feedback
- There are about 300k site summaries now with
- 683k clusters comprised of
- 820k comments
- there are a few larger clusters with 20 to 4000 comments (~2k, < 1% of all clusters). These take the bulk of the clustering time (n**2 algorithm; see the sketch after this list), as they come from large site summaries.
- there are ~35k small clusters (2-20 comments, mostly 2 or 3)
- and a very long tail of single-comment "clusters"
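To illustrate why those few large site summaries dominate the runtime, here is a toy quadratic clustering pass over one site summary; the similarity measure and the threshold are placeholders, not the actual algorithm.

 import difflib
 
 def similar(a, b, threshold=0.6):
     # placeholder similarity measure; the real algorithm differs
     return difflib.SequenceMatcher(None, a, b).ratio() >= threshold
 
 def cluster_site_summary(comments):
     """Naive clustering: compare each comment against a representative of
     every existing cluster, which is O(n**2) in the worst case."""
     clusters = []
     for comment in comments:
         for cluster in clusters:
             if similar(cluster[0], comment):
                 cluster.append(comment)
                 break
         else:
             clusters.append([comment])
     return clusters

For a 4000-comment site summary this is on the order of millions of comparisons, while the ~35k small clusters of 2-3 comments cost almost nothing, which matches the time distribution described above.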
What has been done for Input 3.0
- used Dumbo + Hadoop Streaming
- clustering itself is parallelized
- import (normalization) is sequential (ID generation requires a shared counter)...
- ...but efficient (via TSV + "LOAD DATA INFILE"; see the sketch after this list)
- for the 1M comments in the DB, the new approach takes under 10 minutes even on one CPU (old script: ~2h, even slower with a non-local processing DB)
- expected: about 80%-90% of the time is spent doing clustering and can be parallelized.
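A sketch of the import path described above, under the assumption of a MySQL database and the MySQLdb driver; the table and column names and connection settings are illustrative only.

 import csv
 import MySQLdb   # assumption: MySQL with the MySQLdb driver
 
 def bulk_import(rows, path="/tmp/comments.tsv"):
     # sequential step: assign IDs from a single shared counter while writing TSV
     with open(path, "w", newline="") as tsv:
         writer = csv.writer(tsv, delimiter="\t", lineterminator="\n")
         for comment_id, (cluster_id, description, opinion_id, score) in enumerate(rows, 1):
             writer.writerow([comment_id, cluster_id, description, opinion_id, score])
 
     # bulk load is far faster than row-by-row INSERTs
     db = MySQLdb.connect(db="input")     # illustrative connection settings
     cur = db.cursor()
     # path is a trusted local filename here, not user input
     cur.execute(
         "LOAD DATA INFILE '%s' INTO TABLE comments "
         "FIELDS TERMINATED BY '\\t' "
         "(id, cluster_id, description, opinion_id, score)" % path
     )
     db.commit()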