User:Andrew Sutherland/MailNews/GlobalDatabase: Difference between revisions

Revision as of 08:46, 10 June 2008

5,592 bytes added , 10 June 2008

Andrew Sutherland

@@ Line 27: / Line 27: @@
 == A Normalized Start ==
-Because we are building on the
+Because we are building a layer on top of the folder-based storage, each message in the global database only needs to store 1) enough information to locate the message in folder-centric storage and 2) whatever information we need to act on, filter, or group the message in the global database without touching the message store.
+For the first case, we store the folder a message is in (using the URI, although it may turn out we should break this down further), and its message key.  The folder URI is normalized into its own table for storage efficiency and less work on folder changes (renames, etc.).  Because we expect the number of folders to be reasonable, we may in fact keep the contents of the table cached in memory at all times, avoiding the need for joins.
+For the second case, we break the query-able aspects of the message up into attributes which are associated with their message (and the conversation it belongs to).
+All messages belong to a conversation, which stores the originating subject of the thread.  Messages do not store their subjects, although it might be appropriate to try to concisely indicate when a subject change occurs through use of a flag for a hash of the subject content (no 're:').
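As a rough sketch of the first case (locating a message through a normalized folder table that is cached in memory), something like the following could work. The column names `folderURI` and `messageKey`, the example URIs, and the helpers `locate` and `rename_folder` are assumptions made for illustration; only the `folderLocations` and `messages` table names and the `folderID` column come from the schema on this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE folderLocations (
        id INTEGER PRIMARY KEY,
        folderURI TEXT NOT NULL          -- assumed column name
    );
    CREATE TABLE messages (
        id INTEGER PRIMARY KEY,
        folderID INTEGER REFERENCES folderLocations(id),
        messageKey INTEGER               -- assumed column name
    );
    INSERT INTO folderLocations VALUES (1, 'mailbox://example/Inbox');
    INSERT INTO messages VALUES (10, 1, 4242);
""")

# The folder table is expected to stay small, so keep it all in memory.
folder_cache = dict(conn.execute("SELECT id, folderURI FROM folderLocations"))

def locate(message_id):
    """Return (folder URI, message key) without joining against folderLocations."""
    row = conn.execute(
        "SELECT folderID, messageKey FROM messages WHERE id = ?", (message_id,)
    ).fetchone()
    if row is None:
        return None
    folder_id, message_key = row
    return folder_cache[folder_id], message_key

def rename_folder(folder_id, new_uri):
    """A rename touches one folderLocations row (and the cache), not every message."""
    conn.execute(
        "UPDATE folderLocations SET folderURI = ? WHERE id = ?", (new_uri, folder_id)
    )
    folder_cache[folder_id] = new_uri
```

The point of the normalization shows up in `rename_folder`: the message rows never change when a folder moves.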
 * folderLocations TABLE
@@ Line 50: / Line 56: @@
 ** parentID INTEGER REFERENCES messages(id)
 ** headerMessageID TEXT ''the Message-ID header from the message''
-** ?? bodySnippet ?? -- clarkbw's little snippet of text to display
+** bodySnippet TEXT ''clarkbw's little snippet of text to display''
 * folderID index: (folderID)
 * headerMessageID index: (headerMessageID)
 * conversationID index: (conversationID)
+Both because people/e-mail addresses are first class to us, and because it would be horrible for performance if we didn't, people/e-mail addresses get normalized into their own table.  However, we then have an interesting modeling difficulty.  People, whom we will call 'contacts', frequently have more than one e-mail address; we will call each of these addresses an 'identity'.  The difficulty is that sometimes we want to deal with just the contact, ignoring the fact that they have multiple e-mail addresses, and other times we may want to deal only with mail from one specific e-mail address.
+There are a few obvious alternatives.  We could track both concepts throughout the database; anytime there is a to/from/cc, we mention both the contact and the identity.  We can also only use one of the two.  In the case of using the contact, we would have to post-process based on the textual e-mail address if we only care about one (or more) specific identities.  In the case of using the identity, if we want to talk about the contact, we need to query for all of the identities associated with the contact.
+The choice we have made is to only cite the identity, requiring us to issue queries for all of a contact's identities when we want to interact on a contact level.  The rationale for not using the contact+identity option, based on previous prototyping experience, is that the extra permutations lead to ugliness.
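To make the identity-only choice concrete, here is a minimal sketch of a contact-level query that first expands the contact into its identities. The `identities` table layout, the `contactID` column, and the `messageSenders` table are hypothetical stand-ins (the real attribute storage is described elsewhere on this page); the example data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE identities (
        id INTEGER PRIMARY KEY,
        contactID INTEGER REFERENCES contacts(id),   -- assumed column name
        email TEXT
    );
    -- Hypothetical simplification: one row per (message, sending identity).
    CREATE TABLE messageSenders (
        messageID INTEGER,
        identityID INTEGER REFERENCES identities(id)
    );
    INSERT INTO contacts VALUES (1, 'Alice');
    INSERT INTO identities VALUES (1, 1, 'alice@work.example');
    INSERT INTO identities VALUES (2, 1, 'alice@home.example');
    INSERT INTO identities VALUES (3, NULL, 'bob@example.com');
    INSERT INTO messageSenders VALUES (10, 1);
    INSERT INTO messageSenders VALUES (11, 2);
    INSERT INTO messageSenders VALUES (12, 3);
""")

def messages_from_identity(identity_id):
    """Identity-level query: a single equality test."""
    rows = conn.execute(
        "SELECT messageID FROM messageSenders WHERE identityID = ? ORDER BY messageID",
        (identity_id,),
    )
    return [r[0] for r in rows]

def messages_from_contact(contact_id):
    """Contact-level query: expand the contact into all of its identities first."""
    rows = conn.execute(
        """SELECT messageID FROM messageSenders
           WHERE identityID IN (SELECT id FROM identities WHERE contactID = ?)
           ORDER BY messageID""",
        (contact_id,),
    )
    return [r[0] for r in rows]
```

Only the identity appears in the per-message data; the contact is recovered through the `identities` table when it is needed.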
+The identity approach also potentially allows for tree-like hierarchies in the future.  For example, it is possible that I have a contact who has multiple work identities (2 work e-mail addresses, 1 work IM account, 1 work IRC account), multiple home identities (1 home e-mail for spam, 1 home e-mail for good friends, 1 home IM account, 1 text message address (phone number)), etc.  The hierarchy approach, where we are just building a list of identities, arguably scales better permutation-wise than potentially having multiple contacts to deal with too (one for each non-leaf node).  The downside is, of course, that the number of low-level attribute queries scales linearly with the number of identities.  This can be dealt with by numbering a contact's identities consecutively, so that masking off some number of low-order bits (and potentially right-shifting) yields the contact id, but that requires some cleverness.
+Suffice it to say, input is greatly appreciated on this area.
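The bit-masking idea mentioned above could look roughly like this. The bit width `IDENTITY_BITS` and all function names are assumptions chosen for illustration, not part of the current design.

```python
IDENTITY_BITS = 4  # assumed: room for 16 identities per contact

def make_identity_id(contact_id, slot):
    """Pack a contact id and a per-contact slot number into one identity id."""
    if not 0 <= slot < (1 << IDENTITY_BITS):
        raise ValueError("slot does not fit in IDENTITY_BITS")
    return (contact_id << IDENTITY_BITS) | slot

def contact_of(identity_id):
    """Drop the low-order slot bits to recover the contact id."""
    return identity_id >> IDENTITY_BITS

def identity_range(contact_id):
    """Inclusive (low, high) range covering every identity of a contact."""
    low = contact_id << IDENTITY_BITS
    return low, low + (1 << IDENTITY_BITS) - 1
```

With ids laid out this way, a contact-level query can become a single `BETWEEN low AND high` range test instead of one lookup per identity. The cost is a fixed cap on identities per contact and ids that can no longer be assigned freely.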
 * contacts TABLE
@@ Line 65: / Line 81: @@
 ** email TEXT
 ** description TEXT
+See the examples section for more information on attributes and aggregate data.
 * attributeDefinitions TABLE
@@ Line 100: / Line 118: @@
 === Query Strategy ===
+We optimize our attribute queries by generating indices for each of the potential usage patterns.  We also cause the indices to be covering indices; that is, the index contains all of the information in the row, so that any query can be answered using only the data in the index.  This avoids the need for the database engine to perform any lookup for the actual row or its data.
+When performing attribute queries, we perform complex queries by INTERSECTing the multiple simple queries involved.  This does result in SQLite having to create an 'ephemeral' intermediary table for each of the results, but that is unavoidable, and it is far better for compiled code to be doing that work than our JavaScript code.
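A small sketch of the strategy described above, run against SQLite from Python. The `messageAttributes` table, its columns, the index name `attribQuery`, and the sample attribute ids are hypothetical simplifications for illustration; the real attribute tables are defined in the schema section.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified attribute table.
    CREATE TABLE messageAttributes (
        conversationID INTEGER NOT NULL,
        messageID INTEGER NOT NULL,
        attributeID INTEGER NOT NULL,
        value INTEGER
    );
    -- Covering index: it holds every column of the row, so a lookup by
    -- (attributeID, value) never has to visit the table itself.
    CREATE INDEX attribQuery
        ON messageAttributes (attributeID, value, conversationID, messageID);
""")

conn.executemany(
    "INSERT INTO messageAttributes VALUES (?, ?, ?, ?)",
    [
        (1, 10, 1, 100),  # message 10: attribute 1 (say, "from") is identity 100
        (1, 10, 2, 7),    # message 10: attribute 2 (say, "tag") is tag 7
        (1, 11, 1, 100),  # message 11: from identity 100, but no tag
        (2, 20, 2, 7),    # message 20: tag 7, but a different sender
    ],
)

# A complex query is the INTERSECT of simple single-attribute queries.
QUERY = """
    SELECT messageID FROM messageAttributes WHERE attributeID = 1 AND value = 100
    INTERSECT
    SELECT messageID FROM messageAttributes WHERE attributeID = 2 AND value = 7
"""

matches = [row[0] for row in conn.execute(QUERY)]

# The last column of each EXPLAIN QUERY PLAN row is a human-readable detail string.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]
```

Here `matches` contains only message 10, the one message carrying both attributes. Inspecting `plan` is a cheap way to check that each simple query is answered from the covering index rather than from the table.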
 === Examples ===
@@ Line 251: / Line 272: @@
 = API =
+I have some doodles on this, but basically, I think it would be swell to:
+* Structure things so that results are provided asynchronously, as we are eventually going to get our results asynchronously.
+* Have queries resembling jQuery or Django; successive calls conceptually successively filter the context (be it the set of all conversations/messages, a single conversation, or a contact).  In actuality, of course, the calls are just setting constraints and the query is not kicked off until explicitly requested/needed.
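To illustrate the two bullets above, here is a toy sketch of a chainable, lazy query whose results arrive asynchronously. The class `MessageQuery`, its methods, and the in-memory message list are all hypothetical; this shows the shape of the idea, not a proposed API.

```python
import asyncio

class MessageQuery:
    """Each filter() call only records a constraint; nothing runs until fetch()."""

    def __init__(self, messages, constraints=()):
        self._messages = messages
        self._constraints = tuple(constraints)

    def filter(self, **criteria):
        # Return a new, narrower query; the original is left untouched.
        return MessageQuery(self._messages, self._constraints + (criteria,))

    async def fetch(self):
        await asyncio.sleep(0)  # stand-in for real asynchronous database I/O
        return [
            m for m in self._messages
            if all(m.get(key) == value
                   for criteria in self._constraints
                   for key, value in criteria.items())
        ]

MESSAGES = [
    {"id": 10, "conversation": 1, "sender": "a@example.com"},
    {"id": 11, "conversation": 1, "sender": "b@example.com"},
    {"id": 20, "conversation": 2, "sender": "a@example.com"},
]

everything = MessageQuery(MESSAGES)
query = everything.filter(conversation=1).filter(sender="a@example.com")
results = asyncio.run(query.fetch())
```

Because `filter` returns a new object, a partially constrained query (for example, "everything in conversation 1") can be kept around and refined in several different directions.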
+In terms of STEEL, the plan is that, for now, we will be a consumer of STEEL, if possible.  To the extent possible we will mimic STEEL's API, but since we are trying to build a layer on top of the folder-centric interface, some parts of our API will necessarily differ.
+Our relationship to/interaction with the address book is going to be non-existent at first (we will have our own concept of contacts/identities), but will subsequently need to be integrated.  Because our contacts/identities are first and foremost normalizations, we do not need or want a 1:1 mapping with the address book.  Our goal is to make sure that we can get from our contacts/identities to address book cards and back again.
 = Implementation Plan =