Firefox/Input/Sites

From MozillaWiki

Latest revision as of 02:28, 14 January 2011

Sites Entities

This is a snapshot of what the sites tables look like (Input 3.0).

At the moment, it serves primarily as a basis for discussing possible changes to the clustering process.


SiteSummary

A set of parameters (url, Firefox version, OS, feedback type: happy or sad) with an ID.

All comments matching the same set of parameters are clustered together.

The parameters are also used to browse clusters on the dashboard.

Fields:

  • id (1, 2, 3...)
  • url ("http://example.com", "https://example.net")
  • version ("4.0b7", "4.0b9pre"...)
  • os ("mac", "vista", "android"..., plus the pseudo-values "<mobile>" and "<desktop>", which match comments from the union of the corresponding operating systems)
  • positive (1=happy, 2=sad, optional: NULL=both)
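The grouping rule described above (comments sharing the same parameters end up in the same site summary) can be sketched as a simple keyed bucket. This is only an illustration under assumptions: the record layout, the sample comments, and the function name `group_by_site_summary` are hypothetical, and the sketch ignores the "<mobile>"/"<desktop>" unions and the NULL=both case.

```python
from collections import defaultdict

# Hypothetical feedback records: (url, version, os, positive, text).
feedback = [
    ("http://example.com", "4.0b7", "mac", 1, "Loads fast"),
    ("http://example.com", "4.0b7", "mac", 1, "Very fast now"),
    ("http://example.com", "4.0b7", "mac", 2, "Video stutters"),
    ("https://example.net", "4.0b9pre", "vista", 2, "Crashes on login"),
]


def group_by_site_summary(records):
    """Bucket comment texts by their (url, version, os, positive) key."""
    groups = defaultdict(list)
    for url, version, os_name, positive, text in records:
        groups[(url, version, os_name, positive)].append(text)
    return dict(groups)


groups = group_by_site_summary(feedback)
for key, texts in groups.items():
    print(key, len(texts))
```

Each resulting bucket corresponds to one site summary; the clustering algorithm would then run separately inside each bucket.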

Cluster

Comments that have been put together by the clustering algorithm.

Points to the sitesummary it belongs to.

Also points to a main comment that represents the cluster on overview pages.

  • id
  • site_id
  • site_summary_id
  • primary_description (denormalized, same as in comments table)
  • primary_comment_id (a pointer to the comment that is source of the primary description)
  • positive (1=happy, 2=sad)

Comment

Contains the actual text of the comment.

Points to the cluster it belongs to.

Also contains the ID of the original feedback item (which contains more info).

  • id
  • cluster_id
  • description (the comment text)
  • opinion_id (a link into the actual source table)
  • score (0 to 10.0, assigned by the clustering algorithm; a higher score means the comment is more representative of the cluster)

Structure

site_summary 1:n cluster 1:n comment
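As a rough illustration of this 1:n:n structure, the three entities could be modelled as nested Python dataclasses. The field names follow the lists above; the nesting via `clusters` and `comments` lists, the sample values, and the helper `comment_count` are assumptions made for this sketch, not the actual Input models. The `site_id` field of the cluster is left out because its meaning is not described here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Comment:
    id: int
    cluster_id: int
    description: str   # the comment text
    opinion_id: int    # link to the original feedback item
    score: float       # 0 to 10.0, higher is more representative


@dataclass
class Cluster:
    id: int
    site_summary_id: int
    primary_comment_id: int
    primary_description: str   # denormalized copy of the primary comment text
    positive: int              # 1=happy, 2=sad
    comments: List[Comment] = field(default_factory=list)


@dataclass
class SiteSummary:
    id: int
    url: str
    version: str
    os: str
    positive: Optional[int]    # None stands in for NULL (both)
    clusters: List[Cluster] = field(default_factory=list)


def comment_count(summary: SiteSummary) -> int:
    """Total number of comments across all clusters of a site summary."""
    return sum(len(cluster.comments) for cluster in summary.clusters)


# Hypothetical sample data: one site summary, one cluster, two comments.
summary = SiteSummary(id=1, url="http://example.com", version="4.0b7",
                      os="mac", positive=1)
cluster = Cluster(id=10, site_summary_id=summary.id, primary_comment_id=100,
                  primary_description="Loads fast", positive=1)
cluster.comments.append(Comment(100, cluster.id, "Loads fast", 5000, 9.5))
cluster.comments.append(Comment(101, cluster.id, "Very fast now", 5001, 7.0))
summary.clusters.append(cluster)

print(comment_count(summary))
```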

Considerations for a map/reduce implementation of clustering (bug 621725)

  • time is mainly spent on the clustering itself; IO is not that big of a deal
  • the old algorithm used a processing DB with a lot of chatter, which is very slow if the DB is not on localhost
  • (web) developers should be able to start a clustering job without a Hadoop installation
  • we use a clustering algorithm and other (smaller) parts of the Input app that are written in Python

Numbers

  • Over 6 months (beta) we accumulated 1M pieces of feedback
  • There are now about 300k site summaries with 683k clusters, comprising 820k comments
  • there are a few larger clusters with 20 to 4000 comments (~2k, < 1% of all clusters). These take the bulk of the clustering time (the algorithm is n**2), as they come from large site summaries.
  • there are ~35k small clusters (2-20 comments, mostly 2 or 3)
  • and a very long tail of single-comment "clusters"
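To see why the few large groups dominate the run time, consider how an n**2 algorithm scales. The sketch below assumes that "n**2" means comparing every pair of comments within a group of n comments; that is an assumption about the algorithm, and the function name `pairwise_comparisons` is made up for this example.

```python
def pairwise_comparisons(n):
    """Number of distinct comment pairs in a group of n comments."""
    return n * (n - 1) // 2


for size in (1, 3, 20, 4000):
    print(size, pairwise_comparisons(size))
# 1 0
# 3 3
# 20 190
# 4000 7998000
```

Under that assumption, a single group of 4000 comments needs roughly as many comparisons as 2.6 million groups of 3 comments, which fits the observation that the long tail of tiny clusters costs very little.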

What has been done for Input 3.0

  • used Dumbo + Hadoop streaming
  • clustering itself is parallelized
  • import (normalization) is sequential (ID generation requires a shared counter)...
  • ...but efficient (via TSV + "load data infile")
  • for the 1M comments in the DB, the new approach takes under 10 minutes even on one CPU (old script: ~2h, and even slower with a non-local processing DB)
  • expected: about 80%-90% of that time is spent on clustering, which can be parallelized.
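The sequential import step can be pictured roughly as follows: clusters arrive from the (parallel) clustering stage without IDs, a single pass assigns IDs from shared counters, and the rows are written as TSV so the database can bulk-load them. This is a minimal sketch, not the actual Input code: the function name `write_import_tsv`, the column order, and the input shape are assumptions, and it presumes comment texts contain no tabs or newlines.

```python
import io
from itertools import count


def write_import_tsv(clustered, cluster_out, comment_out):
    """Assign sequential IDs and write one TSV row per cluster and comment.

    clustered: iterable of (site_summary_id, [comment texts]), one per cluster.
    """
    cluster_ids = count(1)   # shared counters: the reason this step is sequential
    comment_ids = count(1)
    for site_summary_id, texts in clustered:
        cluster_id = next(cluster_ids)
        cluster_out.write(f"{cluster_id}\t{site_summary_id}\n")
        for text in texts:
            comment_out.write(f"{next(comment_ids)}\t{cluster_id}\t{text}\n")


# Hypothetical output of the clustering stage for one site summary.
clustered = [
    (1, ["Loads fast", "Very fast now"]),
    (1, ["Video stutters"]),
]
cluster_tsv, comment_tsv = io.StringIO(), io.StringIO()
write_import_tsv(clustered, cluster_tsv, comment_tsv)
print(comment_tsv.getvalue(), end="")
```

Files written this way could then be loaded with one "load data infile" statement per table, which avoids per-row chatter with the database.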