Mozilla2:Unified Storage

From MozillaWiki
Revision as of 03:04, 26 November 2004

Author

  • Vladimir Vukicevic <vladimir@pobox.com>

Related Links

Goals

  1. Provide a unified interface for storing and searching through data for all Mozilla components and extensions
  2. Provide Mozilla core components and extension authors with tools to enable richer interaction with user data
  3. Eliminate the need for components to write their own file serialization/deserialization
  4. Eliminate multiple file formats from Mozilla profiles (db1.85, mork, rdf/xml, xml, various text formats)

Requirements

  1. Low-complexity API suitable for use with simple stores (bookmarks, history)
  2. Direct SQL access for richer interaction (mail header cache)
  3. API for consumers to be able to expose triggers, etc. as notifications (RDF and others; this is tied in with the data source story for moz going forward, whether it stays an RDF-centered world or whether we branch out a bit)
  4. Cross-application notification
  5. Straightforward upgrades from an older version of a particular store's schema to a newer version.

Design Ramblings

Current data stores in Mozilla

Common:

  • Prefs

prefs.js (evaluable JS)

  • SSL Certificates

cert8.db (db1.85)

  • Local Store

localstore.rdf (RDF/XML)

  • MIME Types

mimeTypes.rdf (RDF/XML)

Firefox:

  • History

history.dat (Mork) Implements RDF data source on top of mork.

  • Bookmarks

bookmarks.html (NETSCAPE-Bookmark-file-1) Kept in-memory as RDF, only written out periodically.

  • Cookies

cookies.txt (HTTP Cookie File) Kept in-memory as a hash table by host

  • Saved passwords

signons.txt

  • Form History

formhistory.dat (Mork)

  • Host Permissions

hostperm.1 (text)

  • Download Manager History

downloads.rdf (RDF/XML)

Thunderbird:

  • Addressbook

abook.mab (Mork)

  • Mail Views (Saved Searches)

mailViews.dat (custom, text)

  • secmod.db (?)

secmod.db (db1.85)

  • Folder Cache

panacea.dat (Mork)

  • Junk Bayesian filter training file

training.dat (unknown)

  • ImapMail/<server>.dat

server info (Mork)

  • *Mail*/*/*.msf

folder summary data (Mork)

  • filters (?)

mail filter data

Notes

Unified store doesn't imply just one store, but more a unified way of accessing information. I can initially see possibly two stores -- one a more secure one that would store SSL certs and saved passwords, and another for everything else. I'm not sure what "more secure" means here, though; one store might be enough, and might make more interesting relationships possible.

Not all items need/should migrate to unified store: prefs probably wants to stay as JS. Junk mail bayesian training bits probably also want to stay as whatever format they currently are. Everything else should go into the unified store.

Most of the other data we have is fairly simple to store: history, bookmarks, cookies, form history, download manager bits, addressbook, etc. Some of this data really wants to be hierarchical, e.g. bookmarks, while other data, such as tbird folder cache, wants to be slightly hierarchical. A simple child_of column can work well there, especially if the data is being exposed as RDF.
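As a rough illustration of the `child_of` idea, here is a minimal sketch using Python's `sqlite3` module. The `folders` table and the helper names are made up for the example; only the `child_of` column comes from the note above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: only the child_of column is taken from the design notes.
conn.execute("""
    CREATE TABLE folders (
        id       INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        child_of INTEGER REFERENCES folders(id)  -- NULL marks a top-level item
    )
""")
conn.executemany(
    "INSERT INTO folders (id, name, child_of) VALUES (?, ?, ?)",
    [(1, "Toolbar", None), (2, "News", 1), (3, "Tech", 2), (4, "Personal", None)],
)


def children_of(conn, parent_id):
    """Return the names of the direct children of one item."""
    rows = conn.execute(
        "SELECT name FROM folders WHERE child_of = ? ORDER BY id", (parent_id,)
    ).fetchall()
    return [name for (name,) in rows]


def descendants_of(conn, root_id):
    """Follow child_of links breadth-first and collect every descendant name."""
    found, frontier = [], [root_id]
    while frontier:
        parent = frontier.pop(0)
        rows = conn.execute(
            "SELECT id, name FROM folders WHERE child_of = ? ORDER BY id", (parent,)
        ).fetchall()
        for child_id, name in rows:
            found.append(name)
            frontier.append(child_id)
    return found
```

A shallow hierarchy such as the folder cache needs only `children_of`; deeper trees such as bookmarks pay one query per level with this approach.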

We'll need an IPC interface/service to handle cross-application notifications. SQLite won't deliver trigger callbacks across multiple SQLite instances. There are two solutions here. One: turn mozStorage into a server that's shared by all the apps. This could suck from a performance standpoint, especially for things like generating content directly from SQL. Two: require that each table have an owner component in just one app; this component would be responsible for managing SQLite triggers, and would in turn re-publish interesting bits in some xpcom/ipc-esque way.

Interfaces

Two interfaces:

Full SQL

Core interfaces `mozIStorageService`, `mozIStorageConnection`, `mozIStorageStatement`, `mozIStorageValueArray`. Auxiliary interfaces `mozIStorageFunction`, `mozIStorageSchema`. See the interfaces in bug 261861 for descriptions and available methods on these.

  • How do upgrades work? We could have schema versions -- a `moz_schemas` table that holds a simple string -> integer map, with the integer getting incremented each time there's a schema update. `mozStorage` can provide a `getSchemaVersion` call to get the version, and a `setSchemaVersion` call to update it.

We can also just provide a CreateTable call that would compare the schema given with the schema in the database, and notify if the schemas differ.
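The schema-version idea could look roughly like the sketch below, written against Python's `sqlite3` module rather than the real `mozStorage` interfaces. The `moz_schemas` table name comes from the text; its column names, the Python function names, and the `moz_bookmarks` upgrade steps are assumptions for illustration.

```python
import sqlite3


def _ensure_schema_table(conn):
    # Assumed layout: one row per store name, holding its integer version.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS moz_schemas ("
        " name TEXT PRIMARY KEY, version INTEGER NOT NULL)"
    )


def get_schema_version(conn, name):
    """Return the recorded version for a store, or 0 if none is recorded."""
    _ensure_schema_table(conn)
    row = conn.execute(
        "SELECT version FROM moz_schemas WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else 0


def set_schema_version(conn, name, version):
    """Record the version for a store, replacing any previous value."""
    _ensure_schema_table(conn)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO moz_schemas (name, version) VALUES (?, ?)",
            (name, version),
        )


def upgrade_bookmarks(conn):
    """Hypothetical upgrade path: apply each missing step in order."""
    version = get_schema_version(conn, "bookmarks")
    if version < 1:
        conn.execute("CREATE TABLE moz_bookmarks (url TEXT, title TEXT)")
        version = 1
    if version < 2:
        conn.execute("ALTER TABLE moz_bookmarks ADD COLUMN keyword TEXT")
        version = 2
    set_schema_version(conn, "bookmarks", version)
    return version
```

Running `upgrade_bookmarks` twice is harmless, since the second call finds the stored version already current and skips every step.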

RDF Data Store interface

Would need two tables per db-backed data store:

   CREATE TABLE foo__resource_store (
       resource_id   INTEGER PRIMARY KEY, -- SQLite assigns ids automatically
       resource_name TEXT                 -- STRING would get numeric affinity in SQLite
   );
   CREATE TABLE foo__triple_store (
       triple_id     INTEGER PRIMARY KEY,
       subject       INTEGER, -- corresponds to resource_id in the resource_store
       predicate     INTEGER, -- corresponds to resource_id in the resource_store
       object        BLOB,    -- string, integer, datetime, etc. -- depending on lit type!
       object_is_literal BOOLEAN
   );
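A minimal sketch of how the two tables might be used together, again via Python's `sqlite3`. The column types are adapted to SQLite (`INTEGER PRIMARY KEY`, `TEXT`), and the helper names, the in-memory id cache, and the example URIs are assumptions, not part of the proposal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo__resource_store (
        resource_id   INTEGER PRIMARY KEY,
        resource_name TEXT UNIQUE NOT NULL
    );
    CREATE TABLE foo__triple_store (
        triple_id         INTEGER PRIMARY KEY,
        subject           INTEGER NOT NULL,
        predicate         INTEGER NOT NULL,
        object            BLOB,
        object_is_literal BOOLEAN NOT NULL
    );
""")

_id_cache = {}  # resource name -> resource_id, so repeated lookups skip the db


def resource_id(name):
    """Map a resource name to its integer id, creating the row on first use."""
    if name not in _id_cache:
        conn.execute(
            "INSERT OR IGNORE INTO foo__resource_store (resource_name) VALUES (?)",
            (name,),
        )
        (rid,) = conn.execute(
            "SELECT resource_id FROM foo__resource_store WHERE resource_name = ?",
            (name,),
        ).fetchone()
        _id_cache[name] = rid
    return _id_cache[name]


def add_triple(subject, predicate, obj, is_literal):
    """Store one triple; a resource object is stored as its integer id."""
    value = obj if is_literal else resource_id(obj)
    conn.execute(
        "INSERT INTO foo__triple_store"
        " (subject, predicate, object, object_is_literal) VALUES (?, ?, ?, ?)",
        (resource_id(subject), resource_id(predicate), value, int(is_literal)),
    )


def literals_for(subject, predicate):
    """Return all literal objects for a subject/predicate pair."""
    rows = conn.execute(
        "SELECT object FROM foo__triple_store"
        " WHERE subject = ? AND predicate = ? AND object_is_literal = 1",
        (resource_id(subject), resource_id(predicate)),
    ).fetchall()
    return [value for (value,) in rows]
```

Queries against the triple table compare integers only; strings are touched once per resource, when the id is first resolved.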

Shared profiles

See also: Mozilla2:Profile Sharing, Mozilla2:Multi User Sharing

Being able to unify things like the SSL cert database using anything less than a kludge will require a profile directory shared amongst the aviary members.

Ideally, the profile directory will contain the shared data stores (which can be opened multiple times through sqlite), and an individual directory for each application. The application directories would contain things like application-specific extensions -- this scheme would also allow firefox to install thunderbird extensions.

Various things that need to be shared with toolkit/xulrunner-wide apps (cache, cookies, etc.) will all go into the shared profile directory.

Other things should probably be kept in the unified store, just in case an extension wants to do something with them. For example, exposing bookmarks to Thunderbird would allow forumzilla to figure out which feeds you have livemarks for, as well as enable the mailing of bookmarks that can be accepted within the mailer. Firefox could take advantage of addressbook lookup. Both apps would share the same mime type settings, such as default applications.

However: numerous people have indicated that doing yet another profile directory change would be... less than desirable. So, an alternate approach is to just add a `common` directory at the same level as `firefox` and `thunderbird`. Creating a profile should also create a common profile with the same name. I'm not sure what happens if you create a firefox and thunderbird profile with the same name, but you don't expect them to share any common bits. This probably isn't an issue.

Notifications

To support the existing RDF observer interfaces, changes in unified store data need to be exposed using some sort of notification mechanism -- we can probably install SQLite callbacks for every operation, which shouldn't slow us down too much.

Given a triple store, it should be easy to translate this into RDF. Given an arbitrary store, it'll be up to the store user to figure out what changed and what notifications need to be emitted. Helpers for creating triggers would be a plus.

For non-RDF notifications, the trigger mechanism will be used, with the following limitation: triggers are only emitted if the trigger-causing actions occur on the same database connection. This means that different database connections (e.g. different apps) will not receive each other's triggers through the database. The planned solution is to wrap triggers in such a way that they use the xpcom IPC subsystem to notify any other storage-using applications of the trigger asynchronously -- to the app, the triggers would be delivered just like a database-originated trigger.
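The same-connection limitation can be reproduced with a small experiment in Python's `sqlite3`. This is only a sketch: the `moz_history` table, the `history_added` trigger, and the `notify` SQL function are invented for the demonstration, and two connections in one process stand in for two applications.

```python
import os
import sqlite3
import tempfile


def open_with_listener(path, received):
    """Open a connection whose notify() SQL function appends to `received`."""
    conn = sqlite3.connect(path)
    # User-defined functions are registered per connection, not per database.
    conn.create_function("notify", 1, lambda url: received.append(url))
    return conn


def demo():
    """Insert through two connections and report what each listener saw."""
    seen_by_a, seen_by_b = [], []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "store.sqlite")
        conn_a = open_with_listener(path, seen_by_a)
        conn_b = open_with_listener(path, seen_by_b)

        conn_a.execute("CREATE TABLE moz_history (url TEXT)")
        conn_a.execute(
            "CREATE TRIGGER history_added AFTER INSERT ON moz_history"
            " BEGIN SELECT notify(NEW.url); END"
        )
        with conn_a:  # commit so the other connection can write afterwards
            conn_a.execute("INSERT INTO moz_history VALUES ('http://a.example/')")
        with conn_b:
            conn_b.execute("INSERT INTO moz_history VALUES ('http://b.example/')")

        conn_a.close()
        conn_b.close()
    return seen_by_a, seen_by_b
```

Each listener sees only the insert made through its own connection, even though both rows end up in the same file. Forwarding the callback to the other side is the part the IPC wrapper would have to supply.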

Database schema freezability

One of the goals is to provide the ability for app and extension authors to have much more fine-grained control over data queries than would normally be possible. However, we also want to retain the ability to change database schemas as necessary. To this end, table schemas will be frozen at column granularity; e.g. for `moz_bookmarks`, the `url`, `title`, `keyword`, and `last_visited` columns might be frozen, but the others might not be. This means that extension authors need to refrain from using "*" in their SELECT queries, but instead should get only the exact columns that they're interested in. (This is good practice in any case.)
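A short sketch of why explicit column lists matter, using Python's `sqlite3`. The `internal_flags` and `visit_count` columns are hypothetical stand-ins for unfrozen parts of `moz_bookmarks`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE moz_bookmarks (url TEXT, title TEXT, internal_flags INTEGER)"
)
conn.execute(
    "INSERT INTO moz_bookmarks VALUES ('http://example.org/', 'Example', 0)"
)

FROZEN_QUERY = "SELECT url, title FROM moz_bookmarks"  # names frozen columns only
FRAGILE_QUERY = "SELECT * FROM moz_bookmarks"          # depends on the full schema


def run_both(conn):
    """Return the result rows of the explicit query and of the '*' query."""
    return (
        conn.execute(FROZEN_QUERY).fetchall(),
        conn.execute(FRAGILE_QUERY).fetchall(),
    )


before = run_both(conn)
# Simulate a later release changing an unfrozen part of the table.
conn.execute("ALTER TABLE moz_bookmarks ADD COLUMN visit_count INTEGER DEFAULT 0")
after = run_both(conn)
```

The explicit query returns the same rows before and after the change, while the `*` query grows an extra column that positional code in an extension would not expect.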

Back-end

SQLite will be the back end for the unified store. Because it implements a SQL engine, we get querying "for free", without having to invent our own query language or query execution system. Its code-size footprint is moderate (250k), but it will hopefully simplify much existing code so that the net code-size change should be smaller. It has exceptional performance, and supports concurrent access to the database. Finally, it is released into the public domain, meaning that we will have no licensing issues.

Other options exist, though an overriding concern is one of license for a piece that is to be a core Mozilla component. For example, Berkeley DB's license is incompatible with Mozilla's, as is Firebird's. Beyond licensing, Berkeley DB lacks a query language, meaning we'd have to write our own; Firebird has similar goals to SQLite, but is based upon a codebase unproven in the open source world. Using RDF as a back end is infeasible due to performance; and finally, using a home-grown solution makes no sense.


Comments from Darin:

This is great stuff. Have you seen Mozilla2:Profile Sharing (which should really be made public)?

I agree that we should move profile data that is shared across toolkit apps into a common profile directory. For example, ~/.mozilla/toolkit/ or ~/.mozilla/common/. It would make sense to have separate shared profile directories per profile name. So, you'd then have ~/.mozilla/toolkit/default.xyz/ available to any other toolkit app that is using the profile named "default".

If we introduce a major change in the way we store profile data such as the sqlite based solution you are proposing here, then we should definitely include any changes needed to support profile sharing across toolkit apps.

Notifications need to be interprocess, no? How does sqlite solve the problem of overlapped access? How do we keep the state of the shared database synchronized between processes? Would transactions made on the database be reflected as notifications to other processes?

As for the SSL DBs, have you seen the rdb mechanism? All we have to do is provide a DLL that implements the Berkeley DB API, name it rdb.dll and NSS will use it instead.

Is there any benefit to using the IPC daemon for managing interprocess transaction synchronization since it already has support for that? I guess I'm not clear on how much of this sqlite already does for us. Also, I envision including the IPC daemon as a core mozilla component since it enables support for distributed XPCOM, so we would potentially have it at our disposal.

Comments from Vlad:

Darin wrote: > Notifications need to be interprocess, no? How does sqlite solve the problem of overlapped access? How do we keep the state of the shared database synchronized between processes? Would transactions made on the database be reflected as notifications to other processes?

Just did some testing with this. SQLite does not give us cross-instance triggers, which is really too bad. We'll have to work out an IPC solution here.

> As for the SSL DBs, have you seen the rdb mechanism? All we have to do is provide a DLL that implements the Berkeley DB API, name it rdb.dll and NSS will use it instead.

Interesting.. I have not. That might be the sanest and safest approach; I was looking at hacking support into NSS directly, and that code scared the crap out of me. I'd much rather work with an implementation of a known API than have to audit random code changes in NSS.

Comments from Axel

I would veto against storing resources as strings for RDF, there should be a table mapping strings to numbers, so that you'd end up querying for numbers instead of strings, should be much more performant. Same for literals, probably. Uniqueness of literals is harder than for resources, btw, as typed literals have redundant string representations.

What about separate RDF DataSources? This is a common shortcoming of RDF, that you can't identify a triple once it is in the wild, but there are good use cases for identifying sets of triples, such as previous settings or stuff like that.

Comments from Vlad

Axel wrote: > I would veto against storing resources as strings for RDF, there should be a table mapping strings to numbers, so that you'd end up querying for numbers instead of strings, should be much more performant.

Hmm.. it would mean two tables per database-backed store rather than one, but that's not a big deal. I'm wondering whether having to do multiple database queries per RDF query (to figure out integer mappings for resources) will hurt more or less than storing strings; probably much less, especially since the results of the resource->id queries can easily be cached. I've updated the schema above.. does that look better?

> Same for literals, probably. Uniqueness of literals is harder than for resources, btw, as typed literals have redundant string representations.

Hmm.. can you give me some examples of this? The literals that we currently have are strings, dates, ints, and blobs; does rdf define more? If it's just those 4, they map nicely to sql types that we can store in that field (sqlite does manifest typing of data values, so we can figure out whether the thing we stored was an int, string, date, or whatever).
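The manifest typing mentioned here can be checked directly with SQLite's `typeof()` function. The sketch below uses Python's `sqlite3` and a throwaway `literals` table. One caveat worth stating: SQLite has no dedicated date storage class, so a date would come back as text, integer, or real depending on how it was written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A BLOB-declared column performs no type conversion on the values it stores.
conn.execute("CREATE TABLE literals (object BLOB)")
conn.executemany(
    "INSERT INTO literals (object) VALUES (?)",
    [(42,), (2.5,), ("hello",), (b"\x00\x01",), (None,)],
)

# typeof() reports the storage class of each stored value.
stored_types = [
    kind
    for (kind,) in conn.execute(
        "SELECT typeof(object) FROM literals ORDER BY rowid"
    ).fetchall()
]
```

So the int/string/blob distinction survives a round trip through one column, but telling a date from a plain string or number would need an extra marker of some kind.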


Comments from Axel

Vlad wrote: > Hmm.. can you give me some examples of this?

The RDF primer mentions all of schema2 datatypes for typed literals. Prominent are numbers, like 2.5 vs 2.500 or dates from different timezones.
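To make the redundancy concrete, here is a small Python sketch that reduces equal decimals and equal timestamps to one canonical string each. The function names and the choice of canonical form are assumptions; the proposal itself does not say how literals would be normalized.

```python
from datetime import datetime, timezone
from decimal import Decimal


def canonical_decimal(lexical):
    """Return one string for every lexical form of the same decimal value."""
    # normalize() strips trailing zeros; the 'f' format avoids exponent notation.
    return format(Decimal(lexical).normalize(), "f")


def canonical_datetime(lexical):
    """Convert an ISO 8601 timestamp carrying a UTC offset to a UTC string."""
    return datetime.fromisoformat(lexical).astimezone(timezone.utc).isoformat()


# Different strings, same values:
same_number = canonical_decimal("2.5") == canonical_decimal("2.500")
same_instant = (
    canonical_datetime("2004-11-26T03:04:00+01:00")
    == canonical_datetime("2004-11-25T21:04:00-05:00")
)
```

Comparing the raw strings would treat these as distinct literals, so uniqueness checks would have to run on a canonical form like this rather than on the stored text.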

Another important issue I miss is the safeness of the data. If we share application data, the backing store needs to be crash and power-off proof, IMHO.

Is there any method to prune data, like, what happens if application data changes the scheme? We probably don't want to bloat our database with left-overs for all days. What about extension un-installs?

One thing that I still have on my plate is profile migration, like, how would apps like Safari or Opera migrate firefox profiles? We blame MS for obfuscating their internal data and being non-interoperable; we don't wanna end up in the same hot spot, right?

Comments from Vlad

Axel wrote: > Is there any method to prune data, like, what happens if application data changes the scheme?

It will be up to the component to clean up after itself in this case; it can also provide views for backwards-compatibility as necessary.

> We probably don't want to bloat our database with left overs for all days. What about extension uninstalls?

Extensions will hopefully clean up after themselves as well. However, the plan is to give extension authors an API to go through for table creation that will involve creating a table name that includes the extension's GUID; this way we can identify if an extension is still installed or not for any given table.
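One way the GUID-in-the-table-name plan might work is sketched below with Python's `sqlite3`. The `ext_` prefix, the double-underscore separator, the function names, and the GUIDs in the usage note are all made up; the text only says that the table name would include the extension's GUID.

```python
import re
import sqlite3


def prefix_for(guid):
    """Turn a GUID such as '{1111-...}' into a safe table-name prefix."""
    safe = re.sub(r"[^0-9A-Za-z]", "_", guid.strip("{}"))
    return "ext_%s__" % safe


def table_name_for(guid, local_name):
    """Build the full table name for an extension's own table."""
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", local_name):
        raise ValueError("invalid table name: %r" % local_name)
    return prefix_for(guid) + local_name


def create_extension_table(conn, guid, local_name, columns_sql):
    """Create a table on behalf of an extension and return its real name."""
    name = table_name_for(guid, local_name)
    conn.execute('CREATE TABLE "%s" (%s)' % (name, columns_sql))
    return name


def orphaned_tables(conn, installed_guids):
    """List extension tables whose GUID is not among the installed ones."""
    live = tuple(prefix_for(guid) for guid in installed_guids)
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    return sorted(
        name
        for (name,) in rows
        if name.startswith("ext_") and not name.startswith(live)
    )
```

For example, after creating tables for `{11111111-2222-3333-4444-555555555555}` and `{aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee}`, calling `orphaned_tables` with only the first GUID installed returns the second extension's tables, which could then be dropped.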

> One thing that I still have on my plate is profile migration, like, how would apps like Safari or Opera migrate firefox profiles? We blame MS for obfuscating their internal data and being not interoperable, we don't wanna end up in the same hot spot, right?

The data won't be obfuscated in any way. sqlite is fully in the public domain; as such, if anyone wants to get at Firefox profile data, they'll be welcome to use sqlite and get the data out. Migration -from- Firefox (or any other xul/toolkit app) is not our problem, provided that we don't intentionally make it difficult.

Comments from shaver

> Migration -from- Firefox (or any other xul/toolkit app) is not our problem, provided that we don't intentionally make it difficult.

sqlite is _worlds_ more usable for other apps than our current mix of ad-hoc formats and mork chicanery.

Comments from ago

Any plan to have multiuser access (even coarse-grained with full file locks during writes) coupled with offline (i.e. a syncing algorithm, possibly sql based) capabilities? Calendars, addresscards etc. would benefit from it. Typical use: network with a shared contacts storage, which can be accessed directly (rw), or dumped locally for offline use and subsequent sync operation. One way to do syncing might be to store a sequential number (not a timestamp, which depends on the clock, actually the clocks, plural...) for each record/group of records (i.e. a vcard). Every time the record/group is modified, the next sequential number is used. Knowing the max value of this field at the time you go offline, it should be possible to sync the contents of two tables. I am thinking only of a single function to address simple situations (which are also quite common), leaving more complex conflict resolution issues to the clients. Assuming identical table structures with an Index field, something like:

sync(recordsetA, recordsetB, indexField, conflictResolution)

conflictResolution = priority_to_A / priority_to_B / duplicate / only_return_a_list_of_conflicts

So defining two items: A and B, and I(A), I(B) their respective sequential indices, and I(O) the max value of the sequential index of tableA at the time of the last connection, the two recordsets are merged according to the following rules:

  • I(A) <= I(O) && I(B) <= I(O) : nothing to do
  • I(A) > I(O) && I(B) <= I(O) : A->B
  • I(A) > I(O) && B missing : A->B
  • A missing && I(B) <= I(O) : delete B
  • I(A) > I(O) && I(B) > I(O) : conflict -> resolve according to conflictResolution

Same thing inverting letters. More complex conflict resolution schemes can and should be implemented by the clients. Syncing two apps based on Unified Storage should not be a problem. Syncing with an external application (which does not use the sequential index) is still possible, but it will require an intermediate step (client-side). In this case, a copy of the recordset as of the time of the last connection must be stored. On the next connection, by comparing the old recordset to the new data, it is possible to establish which items were edited offline (i.e. I(B) > I(O))...

Having such syncing functionality within Unified Storage might avoid a lot of code duplication since it is a common task. Moreover it would be more elegant, safer and faster to assign the sequential index internally rather than letting the clients do so.
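The merge rules above could be sketched in Python as follows. This is a simplified reading, not a finished design: each recordset is modelled as a dict from record key to a (sequence index, payload) pair, both sides are assumed to compare against a single `last_sync` value standing in for I(O), and the `duplicate` conflict mode is left out.

```python
def sync(records_a, records_b, last_sync, resolution="report"):
    """Merge two recordsets and return (new_a, new_b, conflicting_keys).

    resolution is "priority_to_a", "priority_to_b", or "report" (list only).
    """
    a, b = dict(records_a), dict(records_b)
    conflicts = []
    for key in sorted(set(a) | set(b)):
        rec_a, rec_b = a.get(key), b.get(key)
        a_changed = rec_a is not None and rec_a[0] > last_sync
        b_changed = rec_b is not None and rec_b[0] > last_sync
        if a_changed and b_changed:
            # Edited on both sides since the last sync.
            if resolution == "priority_to_a":
                b[key] = rec_a
            elif resolution == "priority_to_b":
                a[key] = rec_b
            else:
                conflicts.append(key)
        elif a_changed:
            b[key] = rec_a  # A -> B, also covers "B missing"
        elif b_changed:
            a[key] = rec_b  # B -> A, the mirrored rule
        elif rec_a is None:
            del b[key]      # old record deleted on the A side
        elif rec_b is None:
            del a[key]      # old record deleted on the B side
        # Otherwise both copies predate the last sync: nothing to do.
    return a, b, conflicts
```

With `last_sync = 4`, a record at index 5 on side A and index 2 on side B is copied from A to B, while a record at index 7 on A and 8 on B is reported as a conflict and left untouched under the default `report` mode.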