Firefox/Input/Sites

From MozillaWiki
Revision as of 01:07, 14 January 2011 by Michel

Sites Entities

This is a snapshot of what the sites tables look like. At the moment, it serves primarily as a basis for discussing changes to the clustering process.


SiteSummary

A set of parameters (url, Firefox version, OS, feedback type: happy or sad) with an ID.

All comments matching the same set of parameters are clustered together.

The parameters are also used to browse clusters on the dashboard.

Fields:

  • id (1, 2, 3...)
  • url ("http://google.com", "https://ebay.com")
  • version ("4.0b7", "4.0b9pre"...)
  • os ("mac", "vista", "android"..., plus "<mobile>" and "<desktop>", which match comments from the union of the corresponding platforms)
  • positive (1=happy, 2=sad, optional: NULL=both)
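The grouping described above can be sketched in a few lines. This is a minimal illustration, not the Input code: the `SiteSummaryKey` type, the `group_by_summary` helper, and the dictionary layout of a feedback item are assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class SiteSummaryKey(NamedTuple):
    """Hypothetical key type: the parameter set that identifies a SiteSummary."""
    url: str
    version: str
    os: str
    positive: Optional[int]  # 1=happy, 2=sad, None=both


def group_by_summary(feedback):
    """Collect comment texts that share the same (url, version, os, positive)."""
    groups = defaultdict(list)
    for item in feedback:
        key = SiteSummaryKey(item["url"], item["version"], item["os"], item["positive"])
        groups[key].append(item["text"])
    return dict(groups)


# Made-up feedback items, for illustration only.
feedback = [
    {"url": "http://google.com", "version": "4.0b7", "os": "mac",
     "positive": 2, "text": "page loads slowly"},
    {"url": "http://google.com", "version": "4.0b7", "os": "mac",
     "positive": 2, "text": "search is slow"},
    {"url": "https://ebay.com", "version": "4.0b9pre", "os": "vista",
     "positive": 1, "text": "works great"},
]
groups = group_by_summary(feedback)
```

Each resulting group is the set of comments that the clustering step would then work on for one SiteSummary.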

Cluster

Comments that have been put together by the clustering algorithm.

Points to the SiteSummary it belongs to.

Also points to a main comment that represents the cluster on overview pages.

Fields:

  • id
  • site_id
  • site_summary_id
  • primary_description (denormalized, same as in comments table)
  • primary_comment_id (a pointer to the comment that is source of the primary description)
  • positive (1=happy, 2=sad)

Comment

Contains the actual text of the comment.

Points to the cluster it belongs to.

Also contains the ID of the original feedback item (which contains more info).

Fields:

  • id
  • cluster_id
  • description (the comment text)
  • opinion_id (a link into the actual source table)
  • score (0 to 10.0, assigned by the clustering algorithm; a higher score means the comment is more representative of the cluster)
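As an illustration of how the score could be used, the sketch below picks a representative comment for a cluster. It assumes that the primary comment is simply the highest-scoring one, which this page does not state; the `Comment` dataclass and the `pick_primary` helper are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Comment:
    id: int
    cluster_id: int
    description: str
    opinion_id: int
    score: float  # 0 to 10.0, higher = more representative of the cluster


def pick_primary(comments: List[Comment]) -> Comment:
    """Return the highest-scoring comment; on a tie, the lowest id wins.

    Assumption: "primary comment" means "best score". The real rule may differ.
    """
    return max(comments, key=lambda c: (c.score, -c.id))


# Made-up comments belonging to one cluster.
cluster_comments = [
    Comment(id=1, cluster_id=1, description="page loads slowly", opinion_id=101, score=9.5),
    Comment(id=2, cluster_id=1, description="so slow", opinion_id=102, score=4.0),
]
primary = pick_primary(cluster_comments)
```

Under this assumption, `primary.id` and `primary.description` would populate the cluster's primary_comment_id and primary_description fields.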

Structure

site_summary 1:n cluster 1:n comment
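
The 1:n relations can be tried out with an in-memory SQLite database. This is only a sketch: the column types and constraints are assumptions, the production tables are not SQLite, and the cluster table's site_id column is omitted because its meaning is not described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the 1:n references
conn.executescript("""
CREATE TABLE site_summary (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL,
    version TEXT NOT NULL,
    os TEXT NOT NULL,
    positive INTEGER               -- 1=happy, 2=sad, NULL=both
);
CREATE TABLE cluster (
    id INTEGER PRIMARY KEY,
    site_summary_id INTEGER NOT NULL REFERENCES site_summary(id),
    primary_description TEXT,
    primary_comment_id INTEGER,
    positive INTEGER               -- 1=happy, 2=sad
);
CREATE TABLE comment (
    id INTEGER PRIMARY KEY,
    cluster_id INTEGER NOT NULL REFERENCES cluster(id),
    description TEXT NOT NULL,
    opinion_id INTEGER,
    score REAL
);
""")

# Made-up rows: one site summary, two clusters, three comments.
conn.execute("INSERT INTO site_summary VALUES (1, 'http://google.com', '4.0b7', 'mac', 2)")
conn.executemany(
    "INSERT INTO cluster VALUES (?, ?, ?, ?, ?)",
    [(1, 1, "page loads slowly", 1, 2), (2, 1, "crashes on search", 3, 2)],
)
conn.executemany(
    "INSERT INTO comment VALUES (?, ?, ?, ?, ?)",
    [(1, 1, "page loads slowly", 101, 9.5),
     (2, 1, "so slow", 102, 4.0),
     (3, 2, "crashes on search", 103, 8.0)],
)

# Walk site_summary -> cluster -> comment and count comments per cluster.
rows = conn.execute("""
    SELECT s.url, c.id, COUNT(m.id)
    FROM site_summary s
    JOIN cluster c ON c.site_summary_id = s.id
    JOIN comment m ON m.cluster_id = c.id
    GROUP BY c.id
    ORDER BY c.id
""").fetchall()
```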


Considerations for a map/reduce implementation of clustering (bug 621725)

  • time is mainly spent on clustering; I/O is not a significant cost
  • the old algorithm used a processing DB with a lot of chatter, which is very slow if the DB is not on localhost
  • (web) developers should be able to start a clustering job without a Hadoop installation
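
One way to meet the last point is to write the job as plain generator functions and provide an in-process runner for local use. The sketch below is an assumption about how that could look, not the code from the bug: `run_local`, `mapper`, and `reducer` are hypothetical names, and the reducer only collects comments where the real job would cluster them.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    """Emit the site-summary parameters as the key and the comment text as the value."""
    yield (value["url"], value["version"], value["os"], value["positive"]), value["text"]


def reducer(key, values):
    """Stand-in for the clustering step: it only collects and sorts the comments."""
    yield key, sorted(values)


def run_local(records, mapper, reducer):
    """Run map, sort, and reduce in-process, without Hadoop.

    Keys must be mutually comparable, because the shuffle is emulated with a sort.
    """
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    mapped.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output


# Made-up input records of the form (id, feedback item).
records = [
    (1, {"url": "http://google.com", "version": "4.0b7", "os": "mac",
         "positive": 2, "text": "page loads slowly"}),
    (2, {"url": "https://ebay.com", "version": "4.0b9pre", "os": "vista",
         "positive": 1, "text": "works great"}),
    (3, {"url": "http://google.com", "version": "4.0b7", "os": "mac",
         "positive": 2, "text": "search is slow"}),
]
result = run_local(records, mapper, reducer)
```

Because the mapper and reducer do not depend on the runner, the same functions could in principle be handed to a distributed runner later.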

What has been done for Input 3.0

  • used Dumbo + Hadoop Streaming
  • clustering itself is parallelized
  • import is sequential (ID generation requires a shared counter)...
  • ...but efficient (via TSV + "load data infile")
  • for the 1M comments in the DB, the new approach takes less than 10 minutes even on one CPU (old script: ~2h, and even slower with a non-local processing DB)
  • expected: about 80%-90% of the time is spent on clustering, which can be parallelized
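
The sequential import step might look roughly like the following sketch: a single process hands out IDs from counters and writes a TSV file that MySQL can bulk-load. The function name, the input layout, the file path, and the table name in the SQL string are all assumptions, and real data may need different escaping rules.

```python
import csv
import io
from itertools import count

# Hypothetical bulk-load statement; path and table name are placeholders.
LOAD_SQL = r"""
LOAD DATA INFILE '/tmp/comments.tsv' INTO TABLE comment
FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '"'
(id, cluster_id, description, opinion_id, score);
"""


def write_comment_rows(clusters, out, first_cluster_id=1, first_comment_id=1):
    """Assign sequential IDs and write one TSV row per comment.

    `clusters` is an iterable of lists of (description, opinion_id, score).
    The counters are the shared state that makes this step sequential.
    """
    cluster_ids = count(first_cluster_id)
    comment_ids = count(first_comment_id)
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    rows = 0
    for comments in clusters:
        cluster_id = next(cluster_ids)
        for description, opinion_id, score in comments:
            writer.writerow([next(comment_ids), cluster_id, description, opinion_id, score])
            rows += 1
    return rows


# Made-up clustering output: two clusters with three comments in total.
clustered = [
    [("page loads slowly", 101, 9.5), ("so slow", 102, 4.0)],
    [("crashes on search", 103, 8.0)],
]
buffer = io.StringIO()
written = write_comment_rows(clustered, buffer)
```

Writing to a file object keeps the ID assignment testable in memory; a real import would write to a file on disk and then run the bulk-load statement once.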