Places SQL queries best practices: Difference between revisions

Places SQL queries best practices (view source)

Revision as of 22:03, 26 January 2009

8,733 bytes removed , 26 January 2009

Removing all content from page

Mak77

Confirmed users

595

edits

@@ Line 1: / Line 1: @@
-== Places SQL queries best practices ==
-This is a collection of suggestions and hints to help internal developers hack into the Places database correctly and with common code style.  Add-on developers are strongly recommended to use the Places API to query the database, since direct queries could end up corrupting the database when schema changes [https://developer.mozilla.org/Querying_Places Querying Places].
-For a better comprehension please look at the [https://wiki.mozilla.org/Places#Quick_Links Database schema ERD]
-=== Table aliases ===
-When querying Places tables and in need of aliasing a table for joins or complext queries, is always better to use the same aliases in all queries.
-This is needed not only for our internal code style, but also to avoid common perf-killer paths in complex <i>DELETE</i>s or <i>UPDATE</i>s, indeed referring to <i>table_name.column_name</i> in such a query could be different than referring to h.id, causing a full lookup of the table.
-For the above reasons, we always try to avoid the <i>table_name.column_name</i> path.
-<center>
-{| border="1" cellpadding="2"
-|+Commonly used table aliases
-! Table || Alias || On-conflict alias
-|-
-| moz_anno_attributes || n ||
-|-
-| moz_annos || a ||
-|-
-| moz_bookmarks || b ||
-|-
-| moz_bookmarks_roots || r ||
-|-
-| moz_favicons || f ||
-|-
-| moz_historyvisits || v ||
-|-
-| moz_historyvisits_temp || v || v_t
-|-
-| moz_inputhistory || i ||
-|-
-| moz_items_annos || a || t
-|-
-| moz_keywords || k ||
-|-
-| moz_places || h || p
-|-
-| moz_places_temp || h || h_t
-|}
-</center>
-When joining a table with itself could make sense define an alias based on the <i>JOIN</i> meaning, for example selecting all bookmarks with a parent called "folder":
-<pre>
-SELECT b.id FROM moz_bookmarks b
-JOIN moz_bookmarks parents ON parents.id = b.parent
-WHERE parents.title = 'folder'
-</pre>
-The code style for using aliases is:
-* if the alias is on a real(temp) table from places.sqlite use <i>TABLE_NAME ALIAS</i>
-* if the alias is on a subquery use <i>AS</i> construct, like <i>SUBQUERY AS ALIAS</i>
-* when referring to a column of a table having an alias, always specify the alias, even if there's no risk of column names conficts.
-=== Querying the database in the temp tables era ===
-Starting from Firefox 3.1 <i>moz_places</i> and <i>moz_historyvisits</i> tables are partitioned between disk and memory, to avoid common locking issues with OS's fsyncs calls ({{Bug|442967}}).
-New memory tables are called <i>moz_places_temp</i> and <i>moz_historyvisits_temp</i>, with the same columns and indexes as the disk tables.
-NOTE: while for disk tables you can query <i>sqlite_master</i> table to get existant tables, indexes and triggers, for temporary tables you should instead use <i>sqlite_temp_master</i>.  Notice mozStorage functions indexExists and tableExists are not correct atm ({{bug|472963}}), so if you are looking for existance of moz_places_temp, they will return false even if the table exists.
-The tables are abstracted through a view, actual views are <i>moz_places_view</i> and <i>moz_historyvists_view</i>.
-Since SQLite as of now (3.6.10) does not support indices on views, we have to follow two different paths based on the type of query we are executing:
-* <i>INSERT</i>s/<i>UPDATE</i>s/<i>DELETE</i>s must always act on the views
-** Real tables have triggers attached to them to allow for syncing so in no case an implementer should try to execute one of these operations on the temp tables.
-** Views have triggers attached to them that ensure data are written/moved to memory tables when needed.
-* <i>SELECT</i> must always act on real tables, since due to the lack of indices querying the view would be slow. We can <i>SELECT</i> from the views in tests though, when we are sure perf is not an issue.
-So this is correct:
-<pre>
-DELETE FROM moz_places_view
-WHERE id = 1337
-</pre>
-While this is completely <strong>WRONG</strong>:
-<pre>
-DELETE FROM moz_places_temp
-WHERE id = 1337
-</pre>
-When using <i>SELECT</i> we must always be sure to query both tables, new changed data will be in the temp table, while the disk table will be synced in background or when a new bookmarks is added (to prevent bookmarks corruption). Still some place/visit won't never be synced to disk, for example visits with a <i>visit_type = TRANSITION_EMBED</i>.
-To query both tables we can use a special SQL construct:
-* UNION merges the resultsets of 2 queries discarding duplicates
-* UNION ALL merges the resultsets of 2 queries
-We use one of these based on the query, usually <i>UNION</i> is slower than <i>UNION ALL</i>, so the actual way we query both tables for a place with a certain url may be:
-<pre>
-SELECT id FROM moz_places_temp
-WHERE url = 'http://www.mozilla.org'
-UNION ALL
-SELECT id FROM moz_places
-WHERE url = 'http://www.mozilla.org'
-  AND id NOT IN(SELECT id FROM moz_places_temp)
-</pre>
-Due to the fact tables are partitioned, we cannot get the last inserted id from one of those, instead we should get the max between both tables, to do that we may, for example, execute:
-<pre>
-SELECT MAX(
-  (SELECT MAX(id) FROM moz_places_temp),
-  (SELECT MAX(id) FROM moz_places)
-)
-</pre>
-A similar trick can be applied to <i>IFNULL</i>, this, in many complex queries, can help removing an expensive JOIN, to only get a single value:
-<pre>
-SELECT IFNULL(
-  (SELECT id FROM moz_places_temp WHERE url = 'http://www.mozilla.org'),
-  (SELECT id FROM moz_places WHERE url = 'http://www.mozilla.org')
-) AS id
-</pre>
-Notice SQLite does support only one definition of <i>GROUP BY</i>, <i>ORDER BY</i>, <i>LIMIT</i> for <i>UNION</i>s, but sometimes we don't need to <i>UNION</i> all results from both queries. In this case we can use a small trick:
-<pre>
-SELECT * FROM (
-  SELECT id, visit_date FROM moz_historyvisits_temp
-  WHERE visit_type = 1
-  ORDER BY visit_date DESC
-  LIMIT 10
-)
-UNION ALL
-SELECT * FROM (
-  SELECT id, visit_date FROM moz_historyvisits
-  WHERE visit_type = 1
-  ORDER BY visit_date DESC
-  LIMIT 10
-)
-ORDER BY 2 DESC
-LIMIT 10
-</pre>
-This query gets the 10 most recent visit ids, this involves some code duplication, but is really faster than simply <i>UNION</i> all results when there are a lot of records.  Notice however in the external query we can't use column names, instead we have to use the column indices (2: eg visit_date).
-=== Querying multiple tables ===
-When querying multiple tables involving foreign keys, please always avoid cartesian products like:
-<pre>
-SELECT b.id, h.url
-FROM moz_places h, moz_bookmarks b
-WHERE h.id = b.fk
-</pre>
-Instead use a <i>JOIN</i>:
-<pre>
-SELECT b.id, h.url
-FROM moz_places h
-JOIN moz_bookmarks b ON h.id = b.fk
-</pre>
-That ensures the query is easier to expand, is less error-prone for future developers and more perf-friendly.
-We actually use 2 types of <i>JOIN</i>s:
-* <i>LEFT JOIN</i> is used when the right table values could not exist, but we still want a record containing the left table values.
-* <i>JOIN</i> is used when the right table must obey the <i>JOIN</i> condition, so records without a match in the right table won't be returned.
-=== Places database entry points ===
-This section contains example queries for common tasks, notice most of this can already be done through APIs, but in some case toolkit code could directly query to avoid a perf hit.
-<i>Coming soon...</i>
-=== Perf hints ===
-* Use SQLite command <i>EXPLAIN QUERY PLAN statement</i> to check what tables and indices you're using, and try to optimize it.
-* When possible don't rely on SQLite internal optimizer, it can do a very good job, but we can do also better, so always try to optimize your queries.
-* Test queries against a big database, that can make a lot of difference.
-* Use <i>LIMIT</i> every time is possible, especially with partitioned tables, since when working without indices, having small intermediate tables is really important.
-* Use subqueries when you don't really need a complete JOIN but only check for existance of a record, or get at least one record of a certain type.
-* When a query is too slow, try rebuilding it to do the opposite and exclude results (<i>IN</i>, <i>NOT IN</i> constructs can be useful).
-* Virtual machines have slower disk access, so take that in count when testing write performances in a vm.
-* Use transactions every time you need to do multiple writes.
-* Since each query can only use an index, the optimizer has to make a choice when the same query can be executed using two different indices. You can avoid using a certain index, and force the optimizer to use the another one, prepending the column name with a +. For example the following query will use the url index.<pre>SELECT id, url FROM moz_places WHERE +id = 5 AND url = 'http://www.mozilla.org/'</pre>