Firefox/Projects/Places DB Creation Scripts

Overview

Sprint lead: ddahl
Sprinters: adw

Description
Create python/perl scripts to generate Places DBs with various characteristics such as "many visits within the same domain", "visits across many domains", "many tags", "many bookmarks", etc.

Goals / Use Cases

The sample data set should actually be quite huge (according to Beltzner and Shaver). We should collect stats from others with Dietrich's extension to see what the average data set looks like at Mozilla.

The chief goal is to be able to automate the generation of these sample sqlite databases for a continuous test to run on Places. We want to be able to reliably set some benchmarks and see what code changes either slow down or speed up queries in Places.

Non Goals

tbd

Design

We should try to use the Django ORM to reverse-engineer the Places database schema into Django Models so creating rows will be easy and we can concentrate on url data collection.

Data collection:

Beltzner envisions a huge dataset made up of perhaps 10k unique urls in bookmarks and a similar data set in history, etc...

We need to brainstorm a method for getting this raw data. Spider/bot? There are many python libs for this.

What are the variables we need to keep in mind when creating this data sample for performance testing? ASK Dietrich and Shawn.

Potential exemplar datasets:

  • "Grandma": Very few visits per month, mostly to the same sites. Very few bookmarks.
  • "Nerd": Very many visits per month across a wide range of sites with a core of often visited sites. Tons o' bookmarks, maybe lots of tags, too.
  • "Random Walk": Many visits to many different sites with no discernible most often visited sites.
  • "News Hound": Many visits per month, mostly to the same sites.

Or a more general way to think about it, we have these dimensions:

  • Number of places (unique URLs)
  • Number of visits
  • Nature of visits (visiting same URLs often to the exclusion of others, or visiting all places equally? Visiting same domains often? (Does that matter?) Type of transition?)
  • Number of bookmarks
  • Number of tags
  • Nature of tags (each bookmark has tons of tags, few tags, or varied?)
  • Keywords

We can come up with different data points in each dimension, take cartesian product across all dimensions to get a full suite of databases...

How should AutoComplete be stressed? Shawn says:

Some notes on the above funcs and SQL:

GetAutoCompleteBaseQuery() boils down to:

SELECT h.url,
       h.title,
       f.url,
       (SELECT b.parent
        FROM moz_bookmarks b
        JOIN moz_bookmarks t ON t.id = b.parent AND t.parent != ?1
        WHERE b.type = nsINavBookmarksService::TYPE_BOOKMARK AND
              b.fk = h.id
        ORDER BY b.lastModified DESC LIMIT 1
       ) AS parent,
       (SELECT b.title
        FROM moz_bookmarks b
        JOIN moz_bookmarks t ON t.id = b.parent AND t.parent != ?1
        WHERE b.type = nsINavBookmarksService::TYPE_BOOKMARK AND
              b.fk = h.id
        ORDER BY b.lastModified DESC LIMIT 1
       ) AS bookmark,
      (SELECT GROUP_CONCAT(t.title, ',')
       FROM moz_bookmarks b
       JOIN moz_bookmarks t ON t.id = b.parent AND t.parent = ?1
       WHERE b.type = nsINavBookmarksService::TYPE_BOOKMARK AND
             b.fk = h.id
      ) AS tags,
      h.visit_count, h.typed, h.frecency
FROM moz_places_temp h 
LEFT OUTER JOIN moz_favicons f ON f.id = h.favicon_id 
                WHERE h.frecency <> 0 
                {ADDITIONAL_CONDITIONS} 
UNION ALL 
SELECT h.url,
       h.title,
       f.url,
       (SELECT b.parent
        FROM moz_bookmarks b
        JOIN moz_bookmarks t ON t.id = b.parent AND t.parent != ?1
        WHERE b.type = nsINavBookmarksService::TYPE_BOOKMARK AND
              b.fk = h.id
        ORDER BY b.lastModified DESC LIMIT 1
       ) AS parent,
       (SELECT b.title
        FROM moz_bookmarks b
        JOIN moz_bookmarks t ON t.id = b.parent AND t.parent != ?1
        WHERE b.type = nsINavBookmarksService::TYPE_BOOKMARK AND
              b.fk = h.id
        ORDER BY b.lastModified DESC LIMIT 1
       ) AS bookmark,
      (SELECT GROUP_CONCAT(t.title, ',')
       FROM moz_bookmarks b
       JOIN moz_bookmarks t ON t.id = b.parent AND t.parent = ?1
       WHERE b.type = nsINavBookmarksService::TYPE_BOOKMARK AND
             b.fk = h.id
      ) AS tags,
      h.visit_count,
      h.typed,
      h.frecency 
FROM moz_places h 
LEFT OUTER JOIN moz_favicons f ON f.id = h.favicon_id 
WHERE h.id NOT IN (SELECT id FROM moz_places_temp) AND
      h.frecency <> 0 
      {ADDITIONAL_CONDITIONS} 
-- ORDER BY h.frecency, the 9th column
ORDER BY 9 DESC LIMIT ?2 OFFSET ?3);

mDBAdaptiveQuery:

SELECT IFNULL(h_t.url, h.url),
       IFNULL(h_t.title, h.title),
       f.url,
       (SELECT b.parent
        FROM moz_bookmarks b
        JOIN moz_bookmarks t ON t.id = b.parent AND t.parent != ?1
        WHERE b.type = nsINavBookmarksService::TYPE_BOOKMARK AND
              b.fk = h.id
        ORDER BY b.lastModified DESC LIMIT 1
       ) AS parent,
       (SELECT b.title
        FROM moz_bookmarks b
        JOIN moz_bookmarks t ON t.id = b.parent AND t.parent != ?1
        WHERE b.type = nsINavBookmarksService::TYPE_BOOKMARK AND
              b.fk = h.id
        ORDER BY b.lastModified DESC LIMIT 1
       ) AS bookmark,
      (SELECT GROUP_CONCAT(t.title, ',')
       FROM moz_bookmarks b
       JOIN moz_bookmarks t ON t.id = b.parent AND t.parent = ?1
       WHERE b.type = nsINavBookmarksService::TYPE_BOOKMARK AND
             b.fk = h.id
      ) AS tags,
      IFNULL(h_t.visit_count, h.visit_count),
      IFNULL(h_t.typed, h.typed),
      rank
FROM (SELECT ROUND(
               MAX(
                 ((i.input = ?2) + (SUBSTR(i.input, 1, LENGTH(?2)) = ?2)) *
                 i.use_count
               ), 1
             ) AS rank,
             place_id
      FROM moz_inputhistory i
      GROUP BY i.place_id HAVING rank > 0
     ) AS i
LEFT JOIN moz_places h ON h.id = i.place_id
LEFT JOIN moz_places_temp h_t ON h_t.id = i.place_id
LEFT JOIN moz_favicons f ON f.id = IFNULL(h_t.favicon_id, h.favicon_id)
WHERE IFNULL(h_t.url, h.url) NOTNULL
ORDER BY rank DESC, IFNULL(h_t.frecency, h.frecency) DESC

mDBKeywordQuery:

SELECT IFNULL(
         (SELECT REPLACE(url, '%s', ?2) FROM moz_places_temp WHERE id = b.fk),
         (SELECT REPLACE(url, '%s', ?2) FROM moz_places WHERE id = b.fk)
       ) AS search_url,
       IFNULL(h_t.title, h.title),
       COALESCE(
         f.url,
         (SELECT f.url FROM moz_places_temp
          JOIN moz_favicons f ON f.id = favicon_id
            WHERE rev_host = IFNULL(
              (SELECT rev_host FROM moz_places_temp WHERE id = b.fk),
              (SELECT rev_host FROM moz_places WHERE id = b.fk)
            )
          ORDER BY frecency DESC LIMIT 1),
         (SELECT f.url FROM moz_places
          JOIN moz_favicons f ON f.id = favicon_id
            WHERE rev_host = IFNULL(
              (SELECT rev_host FROM moz_places_temp WHERE id = b.fk),
              (SELECT rev_host FROM moz_places WHERE id = b.fk)
            )
          ORDER BY frecency DESC LIMIT 1)
       ),
       b.parent,
       b.title,
       NULL,
       IFNULL(h_t.visit_count, h.visit_count),
       IFNULL(h_t.typed, h.typed)
FROM moz_keywords k
JOIN moz_bookmarks b ON b.keyword_id = k.id
LEFT JOIN moz_places AS h ON h.url = search_url
LEFT JOIN moz_places_temp AS h_t ON h_t.url = search_url
LEFT JOIN moz_favicons f ON f.id = IFNULL(h_t.favicon_id, h.favicon_id)
  WHERE LOWER(k.keyword) = LOWER(?1)
ORDER BY IFNULL(h_t.frecency, h.frecency) DESC")

AutoComplete is definitely important, but we'd like our database construction scripts/methodology to be general enough to generate places databases for any kind of testing context.

Implementation

set up django:

http://www.djangoproject.com/download/1.0.2/tarball/

uncompress and run:

sudo python setup.py install

add django bin to your path

export PATH=$PATH:~/code/python/django/bin:~/code/python

cd ~/code/python

run this:

django-admin.py startproject places

django-admin.py startapp builddb

copy a places.sqlite file to ~/code/python/places

export PLACES_DB_PATH=~/code/python/places/places.sqlite

export DJANGO_SETTINGS_MODULE=places.settings

export PYTHONPATH=$PYTHONPATH:~/code/python

edit the places/settings.py:

import os

DATABASE_ENGINE = 'sqlite3'

DATABASE_NAME = os.environ['PLACES_DB_PATH']

reverse engineer the Django Models from the schema:

cd ~/code/python/places

python manage.py inspectdb >> builddb/models.py

Now, we need to clean up the foreign keys.

Bugs

tbd