Raindrop/Strawman: Difference between revisions

Revision as of 22:56, 8 June 2010

655 bytes removed, 8 June 2010

no edit summary

DavidA

@@ Line 1: / Line 1: @@
 Whereas
-the couchdb version of raindrop scales very poorly, for a variety of reasons which we understand more or less well.
+* the couchdb version of raindrop scales very poorly, for a variety of reasons which we understand more or less well.
 and
-even if we could get an order of magnitude increase in the performance of couchdb in our architecture, it'd still be very unaffordable.
+* even if we could get an order of magnitude increase in the performance of couchdb in our architecture, it'd still be very unaffordable.
 and
-our data model abstractions currently leak too much to the HTTP APIs
+* our data model abstractions currently leak too much to the HTTP APIs
 and
-our "spread out" data model makes it too hard for newcomers to understand and work with raindrop
+* our "spread out" data model makes it too hard for newcomers to understand and work with raindrop
 It is resolved that
-We will do a Raindrop reset.
+* We will do a Raindrop reset.
 This reset prioritizes:
-a) APIs that make it easier to do front-ends
+* APIs that make it easier to do front-ends
-b) an architecture that takes hosted scaling into consideration
+* an architecture that takes hosted scaling into consideration
-c) Minimal deltas that make it possible to avoid regressing the accomplishments made so far.
+* use existing battle-tested technologies when possible
 -----
@@ Line 29: / Line 29: @@
 After talking to a bunch of people, I'm proposing the following strawman proposal:
-) We stop using couchdb as a queue, and use a queue instead.  Specifically, we use a message queue (rabbit-mq or apache-mq).  This would enable:
+) We stop using couchdb as a queue, and use a queue instead.  Specifically, we use a message queue (rabbit-mq gets consensus).  This would enable:
 * understanding the performance cost of message fetching, and allocating those to specific processing units (processes, nodes, etc.)
+* better horizontal scaling
 ) We define a clear HTTP API for use by inflow & other front-ends.
-) We use a fast key-value storage to keep track of which remote messages we already have (by IMAP UIDs, twitter timestamps, whatever)
+) We use a blob storage to keep raw messages and JSON-normalized messages.  MogileFS is a candidate for the hosted version, but we'd probably want a trivial python equivalent for localhost dev.
-) We use a blob storage to keep raw messages and JSON-normalized messages.
+) We optimize the pipeline to do all processing of messages in memory, only writing the final processed objects to disk at the end of processing a message, to save massively on DB use.
-) We evaluate merging bits of gloda and porting some of the existing Raindrop extensions to create a pipeline.  This bit of code would likely be the most "custom" bit of the Raindrop backend (along w/ the API handling), with everything else quite "stock".
+) We use a mature ORM (specifically SQLAlchemy) as the Raindrop equivalent of Gloda.
-* The primary reasons for using Gloda here are:
+[[File:Raindrop_reset_large.png|200px|thumb|left]]
-** it's a message-aware ORM that knows about conversations, mailing lists, contacts vs. identities, etc.
-** it's got known worst-case performance characteristics which are reasonable (about 15 msgs/sec including full-text-search)
-* Note: I'm not implying a commitment to using Gloda in the long term for Raindrop, but just as a way to get us to the next stage in the architecture.
-* There are interesting complications due to using Gloda, including at least:
-** need to interface w/ the message queue
-** need to run on xulrunner/xpcshell+event loop
-** currently tied to sqlite
-* Note: It's likely that long term we'd move from sqlite to a more scalable DB.
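
The added lines above mention a "trivial python equivalent" of blob storage for localhost dev, and a pipeline that processes each message in memory and writes only the final object. A minimal sketch of how those two ideas could fit together follows. Everything in it is an assumption for illustration: `LocalBlobStore`, `normalize`, `tag_length`, and `process_message` are hypothetical names, not Raindrop code, and keying blobs by SHA-256 of their content is just one possible scheme. Storing the raw message alongside the normalized one is left out for brevity.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class LocalBlobStore:
    """Hypothetical filesystem-backed blob store for localhost development."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.writes = 0  # counts disk writes, to show the pipeline writes once

    def put(self, data: bytes) -> str:
        # Assumption: blobs are keyed by the SHA-256 of their content.
        key = hashlib.sha256(data).hexdigest()
        (self.root / key).write_bytes(data)
        self.writes += 1
        return key

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


def normalize(raw: bytes) -> dict:
    # Hypothetical first stage: turn a raw message into a JSON-friendly dict.
    headers, _, body = raw.decode("utf-8").partition("\n\n")
    doc = {"body": body}
    for line in headers.splitlines():
        name, _, value = line.partition(":")
        doc[name.strip().lower()] = value.strip()
    return doc


def tag_length(doc: dict) -> dict:
    # Hypothetical extension: annotate the document without touching disk.
    return {**doc, "body_length": len(doc["body"])}


def process_message(raw: bytes, store: LocalBlobStore, extensions) -> str:
    # Run every stage in memory, then write the final object exactly once.
    doc = normalize(raw)
    for extension in extensions:
        doc = extension(doc)
    return store.put(json.dumps(doc, sort_keys=True).encode("utf-8"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = LocalBlobStore(tmp)
        key = process_message(b"Subject: hi\n\nhello", store, [tag_length])
        print(json.loads(store.get(key)))
```

In a hosted setup the same `put`/`get` interface could presumably be backed by something like MogileFS instead of the local filesystem, which is the reason for keeping the store behind a small class here.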