Identity/AttachedServices/StorageServerProtocol: Difference between revisions

Line 1: Line 1:


== Status: Obsolete ==


This document describes a protocol for the Delta-Sync storage model, which we're hoping to replace with the new Queue-Sync storage model.  It is being kept around for reference, but will not see implementation.  Eventually this document will be replaced with one describing the Queue-Sync protocol.
== Summary ==
 
This is a working proposal for the PiCL Storage API, to implement the concepts described in [[Identity/CryptoIdeas/05-Queue-Sync]]. 
 
It's a work in progress that will eventually obsolete [[Identity/AttachedServices/StorageProtocolZero]].
 
 
== Queue-Sync Data Model ==
 
More details at [[Identity/CryptoIdeas/05-Queue-Sync]].
 
Data is stored in independent named '''collections'''.  A collection is a key-value store mapping keys to '''records'''.  Each collection has a monotonically-increasing '''sequence number''' which is incremented whenever a record is changed, and provides the ability to request all '''changes''' since a given sequence number.
 
 
'''Collection''' objects have the following fields:
 
<table>
<tr><th>Parameter</th><th>Type</th><th>Description</th></tr>
 
<tr><td>name</td><td>urlsafe string, 64 bytes</td><td>A unique identifier for this collection amongst all the user's data.  Collection names may only contain characters from the urlsafe-base64 alphabet (i.e. alphanumerics, underscore and hyphen).</td></tr>
 
<tr><td>seqnum</td><td>integer, 8 bytes</td><td>A monotonically-increasing integer that is incremented with each change to the
 
contents of the collection.</td></tr>
 
<tr><td>changeid</td><td>urlsafe string, XXX bytes</td><td>A hash that uniquely identifies the last change to this collection.  It is
 
derived from the new sequence number, the previous changeid, and the details of the change that was made.</td></tr>
 
<tr><td>signature</td><td>urlsafe string, XXX bytes</td><td>A client-generated HMAC signature of the current changeid.  Not used or
 
verified by the server, since it doesn't have the secret key.</td></tr>
 
</table>
 
 
 
'''Record''' objects have the following fields:
 
<table>
<tr><th>Parameter</th><th>Type</th><th>Description</th></tr>
 
<tr><td>key</td><td>urlsafe string, 64 bytes</td><td>A unique identifier for this record within the collection.  Keys may only contain
 
characters from the urlsafe-base64 alphabet (i.e. alphanumerics, underscore and hyphen).</td></tr>
 
<tr><td>payload</td><td>urlsafe string, 256 KB</td><td>The value currently stored in this record.  Typically this would be encrypted and signed by the client.</td></tr>
 
<tr><td>seqnum</td><td>integer, 8 bytes</td><td>The collection-level sequence number at which this record was last modified.</td></tr>
 
<tr><td>changeid</td><td>urlsafe string, XXX bytes</td><td>The collection-level changeid corresponding to the modification of this
 
record.  It is derived from the new sequence number, the previous changeid, the record key, and the new record payload.</td></tr>
 
<tr><td>signature</td><td>urlsafe string, XXX bytes</td><td>A client-generated HMAC signature of the changeid for this record.  Not
 
used or verified by the server, since it doesn't have the secret key.</td></tr>
 
</table>
 
 
 
'''Change''' objects are identical to '''record''' objects, except their payload field may have the value NULL to indicate a deletion
 
rather than an update:
 
<table>
<tr><th>Parameter</th><th>Type</th><th>Description</th></tr>
 
<tr><td>key</td><td>urlsafe string, 64 bytes</td><td>A unique identifier for the changed record within the collection.  Keys may only
 
contain characters from the urlsafe-base64 alphabet (i.e. alphanumerics, underscore and hyphen).</td></tr>
 
<tr><td>payload</td><td>urlsafe string or null, 256 KB</td><td>The new value to be stored in the record, or null if the record is to be deleted.  Typically this would be encrypted and signed by the client.</td></tr>
 
<tr><td>seqnum</td><td>integer, 8 bytes</td><td>The new collection-level sequence number after this change is applied.</td></tr>

<tr><td>changeid</td><td>urlsafe string, XXX bytes</td><td>The new collection-level changeid corresponding to this change.  It is derived from the new sequence number, the previous changeid, the record key, and the new record payload.</td></tr>

<tr><td>signature</td><td>urlsafe string, XXX bytes</td><td>A client-generated HMAC signature of the changeid.  Not used or verified by the server, since it doesn't have the secret key.</td></tr>

</table>


== Summary ==

This is a working proposal for the PiCL Storage API, to implement the concepts described in [[Identity/CryptoIdeas/04-Delta-Sync]]. It's a work in progress that will eventually obsolete [[Identity/AttachedServices/StorageProtocolZero]].


== Delta-Sync Data Model ==

The storage server hosts a number of independent named '''collections''' for each user.  Each collection is a key-value store whose contents can be atomically modified by the client.

Each modification of a collection creates a new '''version''' with corresponding version identifier in the format <seqnum>:<hash>:<hmac>, giving a signed hash of the contents of the collection at that version.  The server ensures that versions can only be created with monotonically-increasing sequence numbers.

A collection can be marked as '''obsolete'''.  This will cause any further attempts to access it to return an error code.  Obsolete collections may be garbage-collected by the storage server after 24 hours.

More details at [[Identity/CryptoIdeas/04-Delta-Sync]].


== Authentication ==
== Authentication ==


To access the storage service, a client device must authenticate by providing a BrowserID assertion and a Device ID.  It will receive in exchange:
To access the storage service, a client device must authenticate by providing a BrowserID assertion and a Device ID.  It will receive in exchange:


* the current version number of each collection
* a short-lived id/key pair that can be used to authenticate subsequent requests using the Hawk request-signing scheme
* a short-lived id/key pair that can be used to authenticate subsequent requests using the Hawk request-signing scheme
* a URL to which further requests should be directed
* a mapping of collection names to access URLs


You can think of this as establishing a "login session" with the server.  Access requests for a specific collection should be directed to the appropriate URL.


You can think of this as establishing a "login session" with the server, although we're also tunneling some basic metadata in order to reduce the number of round-trips.


Example:
Example:
Line 40: Line 124:
     <  Content-Type: application/json
     <  Content-Type: application/json
     <  {
     <  {
    <  "base_url": <user-specific access url>,
     <  "id": <hawk auth id>,
     <  "id": <hawk auth id>,
     <  "key": <hawk auth secret key>,
     <  "key": <hawk auth secret key>,
     <  "collections": {
     <  "collections": {
     <    "XXXXX": <version id for this collection>,
     <    "history": <access url for history collection>,
     <    "YYYYY": <version id for this collection>,
     <    "bookmarks": <access url for bookmarks collection>,
     <    <...etc...>
     <    <...etc...>
     <  }
     <  }
     <  }
     <  }


The user and device identity information is encoded in the hawk auth id, to avoid re-sending it on each request.  The server may also include additional state in this value, depending on the implementation.  It's opaque to the client.
The user and device identity information is encoded in the hawk auth id, to avoid re-sending it on each request.  The server may also include additional state in this value, depending on the implementation.  It's opaque to the client.

The collection-specific access URLs may include a unique identifier for the user, in order to improve RESTful-icity of the API.  Or they might point the client to a specific data-center which houses their write master for each collection.  It's opaque to the client.


The base_url may include a unique identifier for the user, in order to improve RESTful-icity of the API.  Or it might point the client to a specific data-center which houses their write master.  It's opaque to the client.


== Data Access ==
== Data Access ==


The client now makes Hawk-authenticated requests to the storage API under its assigned base_url. The following operations are available.
The client now makes Hawk-authenticated requests to a specific collection at its assigned access url.
The following operations are available on each collection.
 
 
=== GET <collection-url> ===
 
Get the current metadata for a collection: its name, seqnum and changeid.
Example:
 
    >  GET <collection-url>
    >  Authorization:  <hawk auth parameters>
    .
    <  200 OK
    <  Content-Type: application/json
    <  {
    <  "name": "history",
    <  "seqnum": 123,
    <  "changeid": "HASH_OF_DETAILS_OF_THE_MOST_RECENT_CHANGE",
    <  "signature": "HMAC_SIGNATURE_OF_CHANGEID"
    <  }
 
 
=== GET <collection-url>/records ===
 
Query parameters:  start, end, limit.
 
Request headers: If-Match, If-None-Match
 
Response headers: ETag


=== GET <base-url> ===


Get the current version id for all collections.  This is the same data as returned in the session-establishment call above, but it may be useful if the client wants to refresh its view.  Example:
Get the set of records currently contained in the collection.  For small collections, the full set of records will be returned like so:


     >  GET <base-url>
     >  GET <collection-url>/records
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
     .
     .
Line 68: Line 184:
     <  Content-Type: application/json
     <  Content-Type: application/json
     <  {
     <  {
     <  "collections": {
     <  "records": {
     <     "XXXXX": <version id for this collection>,
     <   "key1": { "payload": "payload1", "seqnum": 123, "changeid": "HASH1", "signature": "sig1" },
     <     "YYYYY": <version id for this collection>,
     <   "key2": { "payload": "payload2", "seqnum": 124, "changeid": "HASH2", "signature": "sig2" }
    <    <...etc...>
     <  }
     <  }
     <  }
     <  }


=== GET <base-url>/<collection> ===


Get the current metadata for a specific collection: its version id, obsolete status, and last read/write times.  Example:
If there are a large number of records in the collection then the server may choose to paginate the result, returning only some of the records in the initial response.  It will include the key "next" in the output to indicate that more records are available:


     >  GET <base-url>/<collection>
     >  GET <collection-url>/records
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
     .
     .
Line 85: Line 201:
     <  Content-Type: application/json
     <  Content-Type: application/json
     <  {
     <  {
     <  "version": <version id for this collection>,
     <  "next": "key3",
     <  "obsolete": false,
     <  "items": {
     <   "atime": <last-accessed timestamp for this collection>,
     <     "key1": <record1>,
     <   "mtime": <last-modified timestamp for this collection>
     <     "key2": <record2>
    <  }
     <  }
     <  }


=== POST <base-url>/<collection> ===
Clients can request the next batch using the 'start' query parameter:


Update writeable metadata for a specific collection.  Currently the only piece of metadata that can be updated is the "obsolete" flag,
     >  GET <collection-url>/records?start=key3
which can be flipped to true:
 
     >  POST <base-url>/<collection>
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
    >  {
    >    "obsolete": true
    >  }
     .
     .
     <  200 OK
     <  200 OK
     <  Content-Type: application/json
     <  Content-Type: application/json
     <  {
     <  {
     <  "version": <version id for this collection>,
     <  "items": {
     <   "obsolete": true,
     <     "key3": <record3>,
     <   "atime": <last-accessed timestamp for this collection>,
     <     "key4": <record4>
     <  "mtime": <last-modified timestamp for this collection>
     <  }
     <  }
     <  }


=== GET <base-url>/<collection>/<version> ===
When no "next" value is included in the response, the client knows that all available records have
been fetched.
 
Records are always batched in lexicographic order of their keys, and clients are free to request an arbitrary key range using the


Get the contents of a specific version of a specific collection.  In the simplest case, we GET the full contents like so:
'start' and 'end' parameters:


     >  GET <base-url>/<collection>/<version>
     >  GET <collection-url>/records?start=key2&end=key3
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
     .
     .
Line 122: Line 236:
     <  {
     <  {
     <  "items": {
     <  "items": {
     <   "key1": "value1",
     <     "key2": <record2>,
     <   "key2": "value2",
     <     "key3": <record3>
    <    <..etc..>
     <  }
     <  }
     <  }
     <  }


However, clients will usually want to request a delta from a previous version.  They can do this by specifying the "from" parameter.  New or updated keys are represented with their value, while deleted keys are represented with a value of null.  Like so:
Clients may also choose to batch their requests by using the 'limit' query parameter.  As with server-driven batching, the output key


     >  GET <base-url>/<collection>/<version>?from=<previous version>
"next" will be used to indicate that more data is available:
 
     >  GET <collection-url>/records?start=key2&limit=2
    >  Authorization:  <hawk auth parameters>
    .
    <  200 OK
    <  Content-Type: application/json
    <  {
    <  "next": "key4",
    <  "items": {
    <    "key2": <record2>,
    <    "key3": <record3>
    <  }
    <  }
    .
    .
    >  GET <collection-url>/records?start=key4&limit=2
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
     .
     .
Line 137: Line 266:
     <  {
     <  {
     <  "items": {
     <  "items": {
     <    "key1": "value1",  // a key that was updated
     <    "key4": <record4>
    <     "key2": null      // a key that was deleted
     <  }
     <  }
     <  }
     <  }


To allow reliable transfer of a large number of items, both client and server may choose to paginate responses to this query.


The client may specify "first" as the key at which to (lexicographically) start the listing, and "upto" as the key at which to stop the listing.  It may also specify an integer "limit" to restrict the total number of keys sent at once.  The server may enforce a default value and/or upper-bound on "limit".
Each server response will include an "ETag" header, formed from the combination of the current seqnum and changeid of the collection.   


If the set of items is truncated, the server will send the response argument "next" to give the next available key in iteration order.  The client should make another request setting "first" equal to the provided value of "next" in order to fetch additional items.
Clients can use this in combination with standard If-Match and If-None-Match headers to ensure that they're getting a consistent view


As an example, suppose that the client requests at most two items per response, and the collection contains items "key1", "key2" and "key3".  It would need to fetch them in two batches like so:
of the collection:


     >  GET <base-url>/<collection>/<version>?limit=2
     >  GET <collection-url>/records?start=key2&limit=2
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
     .
     .
     <  200 OK
     <  200 OK
     <  Content-Type: application/json
     <  Content-Type: application/json
    <  ETag: 124-HASH2
     <  {
     <  {
     <  "next": "key3",
     <  "next": "key4",
     <  "items": {
     <  "items": {
     <    "key1": "value1",
     <    "key2": <record2>,
     <    "key2": "value2"
     <    "key3": <record3>
     <  }
     <  }
     <  }
     <  }
     .
     .
     .
     .
     >  GET <base-url>/<collection>/<version>?first=key3&limit=2
     >  GET <collection-url>/records?start=key4&limit=2
    >  Authorization:  <hawk auth parameters>
    >  If-Match: 123-HASH
    .
    <  412 Precondition Failed
    <  ETag: 125-HASH3
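
As a rough sketch of the ETag handling shown above, the helpers below build and parse values of the form <code>seqnum-changeid</code>.  The exact ETag syntax is not pinned down by this proposal; the hyphen-joined form is inferred from the examples, and the function names are hypothetical.

```python
from typing import Tuple


def make_etag(seqnum: int, changeid: str) -> str:
    """Combine the collection's current seqnum and changeid, e.g. '124-HASH2'."""
    return f"{seqnum}-{changeid}"


def parse_etag(etag: str) -> Tuple[int, str]:
    """Split at the first hyphen only: changeids are urlsafe strings and
    may themselves contain hyphens, while the seqnum is plain digits."""
    seqnum, _, changeid = etag.partition("-")
    return int(seqnum), changeid


def if_match_ok(if_match: str, seqnum: int, changeid: str) -> bool:
    """True when the client's If-Match value names the current collection
    state; otherwise the server would answer 412 Precondition Failed."""
    return if_match == make_etag(seqnum, changeid)
```

For instance, <code>if_match_ok("123-HASH", 125, "HASH3")</code> is false, which corresponds to the 412 response in the transcript above.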
 
 
XXX TODO: use of headers, versus returning seqnum/changeid in the response body?
 
 
=== GET <collection-url>/records/<key> ===
 
Request headers: If-Match, If-None-Match
 
Response headers: ETag
 
 
Get the specific record stored under the given key:
 
    >  GET <collection-url>/records/<key>
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
     .
     .
     <  200 OK
     <  200 OK
     <  Content-Type: application/json
     <  Content-Type: application/json
    <  ETag: 123-HASH1
     <  {
     <  {
     <  "items": {
     <  "key": <key>,
     <    "key3": "value3"
    <  "seqnum": 123,
     <   "changeid": "HASH1",
     <  "payload": "payload1"
     <  }
     <  }
     <  }
     <  }
This request supports standard ETag behaviour to ensure that a consistent view of the collection is being read.




XXX TODO: There are several error cases that need to be distinguished, possibly by HTTP status code or possibly by some information in the error response body:
=== GET <collection-url>/changes ===


* The requested version is not known or no longer present on the server
Query parameters: since, limit.
* We can't generate a delta from the specified "from" version to the request version
* The specified "from" version is invalid (e.g. due to lost writes during a rollback)


=== POST <base-url>/<collection>/<version> ===
Get the sequence of changes that have been made to the collection.  If the number of changes to be returned is small, they will be


Creates a new version of a specific collection.  In the simplest case, we POST up the full contents of the new version like so:
returned all at once like so:


     >  POST <base-url>/<collection>/<version>
     >  GET <collection-url>/changes
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
    >  {
    >  "items": {
    >    "key1": "value1",
    >    "key2": "value2",
    >    <..etc..>
    >  }
    >  }
     .
     .
     <  201 Created
     <  200 OK
    <  Content-Type: application/json
    <  {
    <  "changes": [
    <    { "seqnum": 0, "changeid": "HASH1", "signature": "sig1", "key": "key1", "payload": "payload1" },
    <    { "seqnum": 1, "changeid": "HASH2", "signature": "sig2", "key": "key2", "payload": "payload2" }
    <  ]
    <  }
 
The changeids and signatures on these changes form a hash chain which can be verified by the client.
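
One way a client might verify such a chain is sketched below.  The proposal does not fix the hash function, the serialisation, or the field sizes (they are still "XXX bytes"), so everything concrete here is an assumption made for illustration: SHA-256 over a compact JSON array, HMAC-SHA256 for signatures, unpadded urlsafe-base64 output, and sequence numbers that advance by exactly one per change.  All function names are hypothetical.

```python
import base64
import hashlib
import hmac
import json
from typing import List, Optional


def _b64(raw: bytes) -> str:
    """Unpadded urlsafe-base64, so the result fits the 'urlsafe string' type."""
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def compute_changeid(seqnum: int, prev_changeid: str, key: str,
                     payload: Optional[str]) -> str:
    """ASSUMED derivation: SHA-256 over [seqnum, prev changeid, key, payload]."""
    material = json.dumps([seqnum, prev_changeid, key, payload],
                          separators=(",", ":")).encode("utf-8")
    return _b64(hashlib.sha256(material).digest())


def sign_changeid(secret: bytes, changeid: str) -> str:
    """ASSUMED signature: HMAC-SHA256 of the changeid under the client secret."""
    return _b64(hmac.new(secret, changeid.encode("ascii"), hashlib.sha256).digest())


def make_change(secret: bytes, prev_seqnum: int, prev_changeid: str,
                key: str, payload: Optional[str]) -> dict:
    """Build the next change object in the chain."""
    seqnum = prev_seqnum + 1
    changeid = compute_changeid(seqnum, prev_changeid, key, payload)
    return {"key": key, "payload": payload, "seqnum": seqnum,
            "changeid": changeid, "signature": sign_changeid(secret, changeid)}


def verify_chain(changes: List[dict], prev_seqnum: int, prev_changeid: str,
                 secret: bytes) -> bool:
    """Check that each change links to its predecessor and is correctly signed."""
    for change in changes:
        if change["seqnum"] != prev_seqnum + 1:
            return False
        expected = compute_changeid(change["seqnum"], prev_changeid,
                                    change["key"], change["payload"])
        if change["changeid"] != expected:
            return False
        if not hmac.compare_digest(change["signature"],
                                   sign_changeid(secret, expected)):
            return False
        prev_seqnum, prev_changeid = change["seqnum"], change["changeid"]
    return True
```

Because each changeid covers the previous one, altering or dropping any change in the middle of the sequence invalidates everything after it.  The server can recompute the changeids but not the signatures, since it does not hold the secret key.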


If there are a large number of changes to be fetched then the server may choose to paginate the result, returning only some of the


However, clients will usually want to send just the changes from a previous version.  They can do this by specifying the "from" parameter.  New or updated keys are represented with their value, while deleted keys are represented with a value of null.  Like so:
changes in the initial response.  It will include the key "next" in the output to indicate that more changes are available:


     >  POST <base-url>/<collection>/<version>?from=<previous version>
     >  GET <collection-url>/changes
    >  Authorization:  <hawk auth parameters>
    .
    <  200 OK
    <  Content-Type: application/json
    <  {
    <  "next": 3,
    <  "changes": [
    <    <change1>,
    <    <change2>
    <  ]
    <  }
 
Clients can request the next batch using the 'since' query parameter:
 
    >  GET <collection-url>/changes?since=3
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
    >  {
    >  "items": {
    >    "key1": "value1",  // a key to be updated
    >    "key2": null      // a key to be deleted
    >  }
    >  }
     .
     .
     <  201 Created
     <  200 OK
    <  Content-Type: application/json
    <  {
    <  "changes": [
    <    <change3>,
    <    <change4>
    <  ]
    <  }


Changes are always batched in sequence number order.  Clients are free to request changes starting at an arbitrary sequence number,


To guard against intermittent or unreliable connections, the client can also send data in batches.  It can specify the argument "first" to indicate a key offset at which this batch begins, and the argument "upto" to specify a key offset at which this batch ends.  The server will spool all the incoming items until it sees a batch with no "upto" argument, then create the new version as an atomic unit.
which is useful for pulling in just the things that have changed since a previous sync.
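
A minimal sketch of how a client might fold a batch of changes into its local copy of a collection follows.  The in-memory dict and the function name are hypothetical; the only behaviour taken from this document is that a null payload means deletion.

```python
from typing import Dict, List, Optional


def apply_changes(records: Dict[str, dict], changes: List[dict]) -> Optional[int]:
    """Apply changes in order to a local key -> record mapping.

    Returns the seqnum of the last change applied, or None for an empty batch.
    """
    last_seqnum: Optional[int] = None
    for change in changes:
        if change["payload"] is None:
            # A null payload marks a deletion rather than an update.
            records.pop(change["key"], None)
        else:
            records[change["key"]] = {
                "payload": change["payload"],
                "seqnum": change["seqnum"],
                "changeid": change["changeid"],
                "signature": change["signature"],
            }
        last_seqnum = change["seqnum"]
    return last_seqnum
```

The examples in this section suggest that 'since' is inclusive, so a client that remembers the returned seqnum would presumably ask for <code>since=last_seqnum + 1</code> on its next sync; that reading is an inference from the transcripts rather than something the proposal states.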


As an example, here is how the client might create a new version by sending items one at a time:
Clients may also choose to batch their requests by using the 'limit' query parameter.  As with server-driven batching, the output key


     >  POST <base-url>/<collection>/<version>?upto=key2
"next" will be used to indicate that more data is available:
 
     >  GET <collection-url>/changes?since=2&limit=2
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
    >  {
    >  "items": {
    >    "key1": "value1"
    >  }
    >  }
     .
     .
     <  202 Accepted
     <  200 OK
    <  Content-Type: application/json
    <  {
    <  "next": 4,
    <  "changes": [
    <    <change2>,
    <    <change3>
    <  ]
    <  }
    .
    .
    >  GET <collection-url>/changes?since=4&limit=2
    >  Authorization:  <hawk auth parameters>
     .
     .
    <  200 OK
    <  Content-Type: application/json
    <  {
    <  "changes": [
    <    <change4>
    <  ]
    <  }
The server is not required to keep the full change history from seqnum zero, and may periodically compact and garbage-collect the
stored data.  If the client requests changes since a seqnum that is no longer known to the server, it will receive an error:
    >  GET <collection-url>/changes?since=1
    >  Authorization:  <hawk auth parameters>
     .
     .
     > POST <base-url>/<collection>/<version>?start=key2&upto=key3
     < 416 Requested Range Not Satisfiable
 
 
XXX TODO: seriously, is there a good error code for this, or should we just tunnel errors in the body?
 
 
=== POST <collection-url>/records ===
 
Request headers: If-Match, If-None-Match
 
Response headers: ETag
 
Update or delete records in the collection.  The request body must contain an array of change objects with properly-formed sequence
numbers and changeids, and it must be preconditioned with an If-Match or If-None-Match header:
 
    >  POST <collection-url>/records
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
    >  If-Match: 125-HASH1
     >  {
     >  {
     >   "items": {
     >   "changes": [
     >   "key2": "value2"
    >      {"key": "key1", "payload": "newpayload1", "seqnum": 126, "changeid": "NEWHASH1", "signature": "newsig1"},
     >   }
     >     {"key": "key2", "payload": null, "seqnum": 127, "changeid": "NEWHASH2", "signature": "newsig2"}
     >  }
     >    ]
     >  }  
     .
     .
     <  202 Accepted
     <  204 No Content
 
The server will apply each change in turn, checking that the seqnum and changeid hash chains are properly formed.  If they are not then an error will be reported:
 
    >  POST <collection-url>/records
    >  Authorization:  <hawk auth parameters>
    >  If-Match: 120-OLD-HASH
    >  {
    >    "changes": [
    >      {"key": "key1", "payload": "newpayload1", "seqnum": 121, "changeid": "NEWHASH1", "signature": "newsig1"},
    >      {"key": "key2", "payload": null, "seqnum": 122, "changeid": "NEWHASH2", "signature": "newsig2"}
    >    ]
    >  }
     .
     .
     .
     <  412 Precondition Failed
     > POST <base-url>/<collection>/<version>?start=key3
     < ETag: 125-HASH1
 
 
No content is returned in response to a POST.  The client has already calculated the new seqnum and changeid for the collection, so
 
there is no more useful information that the server can provide.
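
The server-side handling of this request might look roughly like the sketch below.  It covers only the If-Match precondition and the sequence-number check; recomputing changeids is left out because the hash derivation is not yet specified.  The class name, the 400 status for a malformed chain, and the choice to validate the whole batch before applying any of it are all assumptions, not part of this proposal.

```python
from typing import Dict, List


class Collection:
    """Hypothetical in-memory stand-in for one stored collection."""

    def __init__(self, seqnum: int, changeid: str) -> None:
        self.seqnum = seqnum
        self.changeid = changeid
        self.records: Dict[str, dict] = {}

    def etag(self) -> str:
        return f"{self.seqnum}-{self.changeid}"

    def post_records(self, if_match: str, changes: List[dict]) -> int:
        """Return the HTTP status the server would send for this POST."""
        if if_match != self.etag():
            return 412  # Precondition Failed: client is out of date
        # Validate the whole batch first so a bad change applies nothing.
        expected = self.seqnum
        for change in changes:
            expected += 1
            if change["seqnum"] != expected:
                return 400  # ASSUMED status for a malformed seqnum chain
        for change in changes:
            if change["payload"] is None:
                self.records.pop(change["key"], None)
            else:
                self.records[change["key"]] = change
            self.seqnum, self.changeid = change["seqnum"], change["changeid"]
        return 204  # No Content
```

With a collection at <code>125-HASH1</code>, posting changes numbered 126 and 127 under <code>If-Match: 125-HASH1</code> yields 204 and moves the ETag to <code>127-NEWHASH2</code>, while the same request under <code>If-Match: 120-OLD-HASH</code> yields 412.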
 
 
=== POST <collection-url>/records/<key> ===
 
 
Update or delete a specific record in the collection.  The request body must contain a change object with properly-formed sequence
number and changeid, and it must be preconditioned with an If-Match or If-None-Match header:
 
    >  POST <collection-url>/records/<key>
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
    >  If-Match: 125-HASH1
     >  {
     >  {
     >   "items": {
     >   "payload": "newpayload1",
     >    "key3": "value3"
    >    "seqnum": 126,
     >   }
     >    "changeid": "NEWHASH1",
     >  }
     >   "signature": "newsig1"
     >  }  
     .
     .
     <  201 Created
     <  204 No Content




The server will check that the seqnum and changeid hash chains are properly formed before applying the change.  If they are not then an error will be reported:

    >  POST <collection-url>/records/<key>
    >  Authorization:  <hawk auth parameters>
    >  If-Match: 120-OLD-HASH
    >  {
    >    "payload": "newpayload1",
    >    "seqnum": 126,
    >    "changeid": "NEWHASH1",
    >    "signature": "newsig1"
    >  }
    .
    <  412 Precondition Failed
    <  ETag: 125-HASH1

No content is returned in response to a POST.  The client has already calculated the new seqnum and changeid for the collection, so there is no more useful information that the server can provide.


XXX TODO: There are several error cases that need to be distinguished, possibly by HTTP status code or possibly by some information in the error response body:

* There was a conflicting write, so you can no longer create the requested version
* The requested version is invalid, e.g. wrong sequence number
* The specified "from" version is too old, so we can't use it as the start point of a delta
* The specified "from" version is invalid (e.g. due to lost writes during a rollback)
* The provided batches had holes, or were otherwise invalid
* The server forgot a previous batch and you'll have to start again


== Things To Think About ==

* Currently there's no explicit way for the server to track the current version held by each client.  We could add this in the initial handshake, or intuit it based on their activity.
* Is json the best format for this transfer, or could we come up with a more efficient representation?
* Should we add a way to retrieve specific keys, for real-time updating of just the important bits?