Raindrop/Strawman: Difference between revisions

(Diff against the previous revision; neither edit has an edit summary. Lines marked "-" were removed, lines marked "+" were added, unmarked lines are unchanged context.)

Line 23:

  * APIs that make it easier to do front-ends
  * an architecture that takes hosted scaling into consideration
- ** minimal delta of work that make it possible to get the existing front-ends working w/ a back-end that's a) faster, b) closer to where we think we'll need to go.
+ * use existing battle-tested technologies when possible

  -----

Line 29:

  After talking to a bunch of people, I'm proposing the follow strawman proposal:

- 1) We stop using couchdb as a queue, and use a queue instead.  Specifically, we use a message queue (rabbit-mq or apache-mq).  This would enable:
+ 1) We stop using couchdb as a queue, and use a queue instead.  Specifically, we use a message queue (rabbit-mq gets consensus).  This would enable:

  * understanding the performance cost of message fetching, and allocating those to specific processing units (processes, nodes, etc.)
+ * better horizontal scaling

  2) We define a clear HTTP API for use by inflow & other front-ends.

- 3) We use a fast key-value storage to keep track of which remote messages we already have (by IMAP UIDs, twitter timestamps, whatever)
+ 3) We use a blob storage to keep raw messages and JSON-normalized messages.  MogileFS is a candidate for the hosted version, but we'd probably want a trivial python equivalent for localhost dev.

- 4) We use a blob storage to keep raw messages and JSON-normalized messages.
+ 4) We optimize the pipeline to do all processing of messages in memory, only writing the final processed objects to disk at the end of processing a message, to save massively on DB use.

- 5) We evaluate merging bits of gloda and porting some of the existing Raindrop extensions to create a pipeline.  This bit of code would likely be the most "custom" bit of the Raindrop backend (along w/ the API handling), with everything else quite "stock".
+ 5) We use a mature ORM (specifically SQLalchemy) as the Raindrop equivalent of Gloda.
- * The primary reason for using Gloda here are:
- ** it's a message-aware ORM that knows about conversations, mailing lists, contacts vs. identities, etc.
- ** it's got known worst-case performance characteristics which are reasonable (about 15 msgs/sec including full-text-search)
- * Note: I'm not implying a commitment to using Gloda in the long term for Raindrop, but just as a way to get us to the next stage in the architecture.
- * There are interesting complications due to using Gloda, including at least:
- ** need to interface w/ the message queue
- ** need to run on xulrunner/xpcshell+event loop
- ** currently tied to sqlite
- * Note: It's likely that long term we'd move from sqlite to a more scalable DB.

Revision as of 22:53, 8 June 2010

Whereas

  • the couchdb version of raindrop scales very poorly, for a variety of reasons which we understand more or less well.

and

  • even if we could get an order of magnitude increase in the performance of couchdb in our architecture, it'd still be unaffordable.

and

  • our data model abstractions currently leak too much to the HTTP APIs

and

  • our "spread out" data model makes it too hard for newcomers to understand and work with raindrop

It is resolved that

  • We will do a Raindrop reset.

This reset prioritizes:

  • APIs that make it easier to do front-ends
  • an architecture that takes hosted scaling into consideration
  • use existing battle-tested technologies when possible

After talking to a bunch of people, I'm proposing the following strawman proposal:

1) We stop using couchdb as a queue, and use a real queue instead. Specifically, we use a message queue (rabbit-mq gets consensus; see the sketch after this list). This would enable:

  • understanding the performance cost of message fetching, and allocating those costs to specific processing units (processes, nodes, etc.)
  • better horizontal scaling
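
To make (1) a bit more concrete, here is a minimal sketch of what the fetch side and the processing side could look like on top of RabbitMQ. It assumes the pika AMQP client, and the queue name "raindrop.incoming" is made up for illustration; none of this is existing Raindrop code.

 # Sketch only: a fetcher publishes each newly fetched message onto a
 # RabbitMQ queue, and separate worker processes consume from that queue.
 # Assumes the "pika" client library; the queue name is hypothetical.
 import json
 import pika

 QUEUE = "raindrop.incoming"

 def publish_message(raw_msg_dict):
     """Called by an IMAP/Twitter/... fetcher for each new message."""
     conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
     channel = conn.channel()
     channel.queue_declare(queue=QUEUE, durable=True)
     channel.basic_publish(
         exchange="",
         routing_key=QUEUE,
         body=json.dumps(raw_msg_dict),
         properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
     )
     conn.close()

 def run_worker(process_one):
     """A processing node: pull messages off the queue one at a time."""
     conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
     channel = conn.channel()
     channel.queue_declare(queue=QUEUE, durable=True)

     def handle(ch, method, properties, body):
         process_one(json.loads(body))
         ch.basic_ack(delivery_tag=method.delivery_tag)

     channel.basic_consume(queue=QUEUE, on_message_callback=handle)
     channel.start_consuming()

Because fetching and processing only meet at the queue, the cost of each side can be measured separately, and horizontal scaling amounts to starting more worker processes (or nodes) against the same queue.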

2) We define a clear HTTP API for use by inflow & other front-ends.
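
As a sketch of the kind of API (2) is asking for, here is one possible shape using Flask as an example WSGI framework. The endpoint paths, JSON fields, and the in-memory stand-in data are all invented for illustration, not a proposed contract.

 # Hypothetical front-end-facing HTTP API.  The in-memory dicts stand in for
 # the real backend (the SQL layer and blob store described below).
 from flask import Flask, abort, jsonify, request

 app = Flask(__name__)

 CONVERSATIONS = [
     {"id": "c1", "subject": "Raindrop reset", "message_count": 4},
     {"id": "c2", "subject": "Strawman feedback", "message_count": 2},
 ]
 MESSAGES = {
     "m1": {"id": "m1", "from": "someone@example.com", "subject": "Raindrop reset"},
 }

 @app.route("/api/conversations")
 def list_conversations():
     # Front-ends such as inflow page through recent conversations.
     limit = request.args.get("limit", default=20, type=int)
     return jsonify({"conversations": CONVERSATIONS[:limit]})

 @app.route("/api/messages/<msg_id>")
 def get_message(msg_id):
     # Return the JSON-normalized form of one message.
     msg = MESSAGES.get(msg_id)
     if msg is None:
         abort(404)
     return jsonify(msg)

 if __name__ == "__main__":
     app.run(port=8080)

A front-end would then only ever talk JSON over HTTP (e.g. GET /api/conversations?limit=10), which is what keeps the data model abstractions from leaking through to the clients.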

3) We use a blob storage to keep raw messages and JSON-normalized messages. MogileFS is a candidate for the hosted version, but we'd probably want a trivial python equivalent for localhost dev.
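
The "trivial python equivalent for localhost dev" in (3) could be as small as a content-addressed directory of files. The class below is a guess at that shape (names and layout are illustrative), with MogileFS filling the same role in a hosted deployment.

 # Minimal local stand-in for blob storage.  Blobs (raw rfc822 bytes or
 # JSON-normalized messages) are addressed by the SHA-1 of their content and
 # written as plain files under a base directory.
 import hashlib
 import json
 import os

 class LocalBlobStore:
     def __init__(self, base_dir="./blobs"):
         self.base_dir = base_dir
         os.makedirs(base_dir, exist_ok=True)

     def _path(self, key):
         # Shard by the first two hex characters to keep directories small.
         return os.path.join(self.base_dir, key[:2], key)

     def put(self, data):
         """Store raw bytes and return the content-addressed key."""
         key = hashlib.sha1(data).hexdigest()
         path = self._path(key)
         os.makedirs(os.path.dirname(path), exist_ok=True)
         with open(path, "wb") as f:
             f.write(data)
         return key

     def put_json(self, obj):
         """Store a JSON-normalized message."""
         return self.put(json.dumps(obj, sort_keys=True).encode("utf-8"))

     def get(self, key):
         with open(self._path(key), "rb") as f:
             return f.read()

Since raw messages and their JSON-normalized forms go through the same small put/get interface, swapping in a MogileFS-backed implementation for the hosted version shouldn't require touching the rest of the pipeline.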

4) We optimize the pipeline to do all processing of messages in memory, only writing the final processed objects to disk at the end of processing a message, to save massively on DB use.
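
Read as code, (4) means each message flows through a chain of in-memory steps and storage is only touched once, at the very end. The step functions below are placeholders, not real Raindrop extensions.

 # Sketch of the in-memory pipeline: every step takes and returns a plain
 # dict, and nothing is persisted until the whole chain has run.

 def parse_raw(raw_bytes):
     # rfc822/JSON parsing into a normalized dict would happen here
     return {"raw_size": len(raw_bytes), "body": raw_bytes.decode("utf-8", "replace")}

 def extract_identities(msg):
     msg["identities"] = []          # sender/recipient extraction placeholder
     return msg

 def thread_into_conversation(msg):
     msg["conversation_id"] = None   # conversation lookup placeholder
     return msg

 PIPELINE_STEPS = [extract_identities, thread_into_conversation]

 def process_message(raw_bytes, persist):
     """Run all steps in memory, then call `persist` exactly once."""
     msg = parse_raw(raw_bytes)
     for step in PIPELINE_STEPS:
         msg = step(msg)             # intermediate state never hits disk
     persist(msg)                    # single write: blob + DB row in the real thing
     return msg

 if __name__ == "__main__":
     process_message(b"Subject: hello\n\nhi there", persist=print)

The single persist call is where the "save massively on DB use" comes from: intermediate state is recomputable from the raw blob, so it never needs to be written.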

5) We use a mature ORM (specifically SQLalchemy) as the Raindrop equivalent of Gloda.
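
To give (5) some shape, here is a toy SQLalchemy model of the kind of message-aware schema Gloda covers (conversations, messages, identities). Table and column names are invented; this is not the actual Raindrop or Gloda schema.

 # Toy SQLAlchemy models for the Gloda-like role described in (5).
 # Names are illustrative only.  sqlite is used for local development; a more
 # scalable database could be swapped in behind the same models later.
 from sqlalchemy import create_engine, Column, Integer, String, Text, ForeignKey
 from sqlalchemy.orm import declarative_base, relationship, sessionmaker

 Base = declarative_base()

 class Conversation(Base):
     __tablename__ = "conversations"
     id = Column(Integer, primary_key=True)
     subject = Column(String(255))
     messages = relationship("Message", back_populates="conversation")

 class Identity(Base):
     __tablename__ = "identities"
     id = Column(Integer, primary_key=True)
     kind = Column(String(32))       # e.g. "email", "twitter"
     value = Column(String(255))     # the address, handle, etc.

 class Message(Base):
     __tablename__ = "messages"
     id = Column(Integer, primary_key=True)
     conversation_id = Column(Integer, ForeignKey("conversations.id"))
     sender_id = Column(Integer, ForeignKey("identities.id"))
     blob_key = Column(String(64))   # key of the raw/normalized blobs (see 3)
     snippet = Column(Text)
     conversation = relationship("Conversation", back_populates="messages")
     sender = relationship("Identity")

 engine = create_engine("sqlite:///raindrop.db")
 Base.metadata.create_all(engine)
 Session = sessionmaker(bind=engine)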