Raindrop/Strawman: Difference between revisions

(Diff against the previous revision; neither edit has an edit summary. Lines marked "-" were removed, lines marked "+" were added, unmarked lines are unchanged context.)

Line 23:

  * APIs that make it easier to do front-ends
  * an architecture that takes hosted scaling into consideration
- ** minimal delta of work that make it possible to get the existing front-ends working w/ a back-end that's a) faster, b) closer to where we think we'll need to go.
+ * use existing battle-tested technologies when possible

  -----

Line 29:

  After talking to a bunch of people, I'm proposing the follow strawman proposal:

- 1) We stop using couchdb as a queue, and use a queue instead.  Specifically, we use a message queue (rabbit-mq or apache-mq).  This would enable:
+ 1) We stop using couchdb as a queue, and use a queue instead.  Specifically, we use a message queue (rabbit-mq gets consensus).  This would enable:

  * understanding the performance cost of message fetching, and allocating those to specific processing units (processes, nodes, etc.)
+ * better horizontal scaling

  2) We define a clear HTTP API for use by inflow & other front-ends.

- 3) We use a fast key-value storage to keep track of which remote messages we already have (by IMAP UIDs, twitter timestamps, whatever)
+ 3) We use a blob storage to keep raw messages and JSON-normalized messages.  MogileFS is a candidate for the hosted version, but we'd probably want a trivial python equivalent for localhost dev.

- 4) We use a blob storage to keep raw messages and JSON-normalized messages.
+ 4) We optimize the pipeline to do all processing of messages in memory, only writing the final processed objects to disk at the end of processing a message, to save massively on DB use.

- 5) We evaluate merging bits of gloda and porting some of the existing Raindrop extensions to create a pipeline.  This bit of code would likely be the most "custom" bit of the Raindrop backend (along w/ the API handling), with everything else quite "stock".
+ 5) We use a mature ORM (specifically SQLalchemy) as the Raindrop equivalent of Gloda.
- * The primary reason for using Gloda here are:
- ** it's a message-aware ORM that knows about conversations, mailing lists, contacts vs. identities, etc.
- ** it's got known worst-case performance characteristics which are reasonable (about 15 msgs/sec including full-text-search)
- * Note: I'm not implying a commitment to using Gloda in the long term for Raindrop, but just as a way to get us to the next stage in the architecture.
- * There are interesting complications due to using Gloda, including at least:
- ** need to interface w/ the message queue
- ** need to run on xulrunner/xpcshell+event loop
- ** currently tied to sqlite
- * Note: It's likely that long term we'd move from sqlite to a more scalable DB.

Revision as of 22:53, 8 June 2010

Whereas

  • the couchdb version of raindrop scales very poorly, for a variety of reasons which we understand more or less well.

and

  • even if we could get an order of magnitude increase in the performance of couchdb in our architecture, it'd still be unaffordable.

and

  • our data model abstractions currently leak too much to the HTTP APIs

and

  • our "spread out" data model makes it too hard for newcomers to understand and work with raindrop

It is resolved that

  • We will do a Raindrop reset.

This reset prioritizes:

  • APIs that make it easier to do front-ends
  • an architecture that takes hosted scaling into consideration
  • use existing battle-tested technologies when possible

After talking to a bunch of people, I'm proposing the following strawman proposal:

1) We stop using couchdb as a queue, and use a real queue instead. Specifically, we use a message queue (rabbit-mq gets consensus; see the sketch after this list). This would enable:

  • understanding the performance cost of message fetching, and allocating those costs to specific processing units (processes, nodes, etc.)
  • better horizontal scaling
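
To make (1) a bit more concrete, here is a minimal sketch of what the fetch side and the processing side could look like on top of RabbitMQ. It assumes the pika AMQP client, and the queue name "raindrop.incoming" is made up for illustration; none of this is existing Raindrop code.

 # Sketch only: a fetcher publishes each newly fetched message onto a
 # RabbitMQ queue, and separate worker processes consume from that queue.
 # Assumes the "pika" client library; the queue name is hypothetical.
 import json
 import pika

 QUEUE = "raindrop.incoming"

 def publish_message(raw_msg_dict):
     """Called by an IMAP/Twitter/... fetcher for each new message."""
     conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
     channel = conn.channel()
     channel.queue_declare(queue=QUEUE, durable=True)
     channel.basic_publish(
         exchange="",
         routing_key=QUEUE,
         body=json.dumps(raw_msg_dict),
         properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
     )
     conn.close()

 def run_worker(process_one):
     """A processing node: pull messages off the queue one at a time."""
     conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
     channel = conn.channel()
     channel.queue_declare(queue=QUEUE, durable=True)

     def handle(ch, method, properties, body):
         process_one(json.loads(body))
         ch.basic_ack(delivery_tag=method.delivery_tag)

     channel.basic_consume(queue=QUEUE, on_message_callback=handle)
     channel.start_consuming()

Because fetching and processing only meet at the queue, the cost of each side can be measured separately, and horizontal scaling amounts to starting more worker processes (or nodes) against the same queue.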

2) We define a clear HTTP API for use by inflow & other front-ends.
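
As a sketch of the kind of API (2) is asking for, here is one possible shape using Flask as an example WSGI framework. The endpoint paths, JSON fields, and the in-memory stand-in data are all invented for illustration, not a proposed contract.

 # Hypothetical front-end-facing HTTP API.  The in-memory dicts stand in for
 # the real backend (the SQL layer and blob store described below).
 from flask import Flask, abort, jsonify, request

 app = Flask(__name__)

 CONVERSATIONS = [
     {"id": "c1", "subject": "Raindrop reset", "message_count": 4},
     {"id": "c2", "subject": "Strawman feedback", "message_count": 2},
 ]
 MESSAGES = {
     "m1": {"id": "m1", "from": "someone@example.com", "subject": "Raindrop reset"},
 }

 @app.route("/api/conversations")
 def list_conversations():
     # Front-ends such as inflow page through recent conversations.
     limit = request.args.get("limit", default=20, type=int)
     return jsonify({"conversations": CONVERSATIONS[:limit]})

 @app.route("/api/messages/<msg_id>")
 def get_message(msg_id):
     # Return the JSON-normalized form of one message.
     msg = MESSAGES.get(msg_id)
     if msg is None:
         abort(404)
     return jsonify(msg)

 if __name__ == "__main__":
     app.run(port=8080)

A front-end would then only ever talk JSON over HTTP (e.g. GET /api/conversations?limit=10), which is what keeps the data model abstractions from leaking through to the clients.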

3) We use a blob storage to keep raw messages and JSON-normalized messages. MogileFS is a candidate for the hosted version, but we'd probably want a trivial python equivalent for localhost dev.
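
The "trivial python equivalent for localhost dev" in (3) could be as small as a content-addressed directory of files. The class below is a guess at that shape (names and layout are illustrative), with MogileFS filling the same role in a hosted deployment.

 # Minimal local stand-in for blob storage.  Blobs (raw rfc822 bytes or
 # JSON-normalized messages) are addressed by the SHA-1 of their content and
 # written as plain files under a base directory.
 import hashlib
 import json
 import os

 class LocalBlobStore:
     def __init__(self, base_dir="./blobs"):
         self.base_dir = base_dir
         os.makedirs(base_dir, exist_ok=True)

     def _path(self, key):
         # Shard by the first two hex characters to keep directories small.
         return os.path.join(self.base_dir, key[:2], key)

     def put(self, data):
         """Store raw bytes and return the content-addressed key."""
         key = hashlib.sha1(data).hexdigest()
         path = self._path(key)
         os.makedirs(os.path.dirname(path), exist_ok=True)
         with open(path, "wb") as f:
             f.write(data)
         return key

     def put_json(self, obj):
         """Store a JSON-normalized message."""
         return self.put(json.dumps(obj, sort_keys=True).encode("utf-8"))

     def get(self, key):
         with open(self._path(key), "rb") as f:
             return f.read()

Since raw messages and their JSON-normalized forms go through the same small put/get interface, swapping in a MogileFS-backed implementation for the hosted version shouldn't require touching the rest of the pipeline.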

4) We optimize the pipeline to do all processing of messages in memory, only writing the final processed objects to disk at the end of processing a message, to save massively on DB use.
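
Read as code, (4) means each message flows through a chain of in-memory steps and storage is only touched once, at the very end. The step functions below are placeholders, not real Raindrop extensions.

 # Sketch of the in-memory pipeline: every step takes and returns a plain
 # dict, and nothing is persisted until the whole chain has run.

 def parse_raw(raw_bytes):
     # rfc822/JSON parsing into a normalized dict would happen here
     return {"raw_size": len(raw_bytes), "body": raw_bytes.decode("utf-8", "replace")}

 def extract_identities(msg):
     msg["identities"] = []          # sender/recipient extraction placeholder
     return msg

 def thread_into_conversation(msg):
     msg["conversation_id"] = None   # conversation lookup placeholder
     return msg

 PIPELINE_STEPS = [extract_identities, thread_into_conversation]

 def process_message(raw_bytes, persist):
     """Run all steps in memory, then call `persist` exactly once."""
     msg = parse_raw(raw_bytes)
     for step in PIPELINE_STEPS:
         msg = step(msg)             # intermediate state never hits disk
     persist(msg)                    # single write: blob + DB row in the real thing
     return msg

 if __name__ == "__main__":
     process_message(b"Subject: hello\n\nhi there", persist=print)

The single persist call is where the "save massively on DB use" comes from: intermediate state is recomputable from the raw blob, so it never needs to be written.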

5) We use a mature ORM (specifically SQLalchemy) as the Raindrop equivalent of Gloda.
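
To give (5) some shape, here is a toy SQLalchemy model of the kind of message-aware schema Gloda covers (conversations, messages, identities). Table and column names are invented; this is not the actual Raindrop or Gloda schema.

 # Toy SQLAlchemy models for the Gloda-like role described in (5).
 # Names are illustrative only.  sqlite is used for local development; a more
 # scalable database could be swapped in behind the same models later.
 from sqlalchemy import create_engine, Column, Integer, String, Text, ForeignKey
 from sqlalchemy.orm import declarative_base, relationship, sessionmaker

 Base = declarative_base()

 class Conversation(Base):
     __tablename__ = "conversations"
     id = Column(Integer, primary_key=True)
     subject = Column(String(255))
     messages = relationship("Message", back_populates="conversation")

 class Identity(Base):
     __tablename__ = "identities"
     id = Column(Integer, primary_key=True)
     kind = Column(String(32))       # e.g. "email", "twitter"
     value = Column(String(255))     # the address, handle, etc.

 class Message(Base):
     __tablename__ = "messages"
     id = Column(Integer, primary_key=True)
     conversation_id = Column(Integer, ForeignKey("conversations.id"))
     sender_id = Column(Integer, ForeignKey("identities.id"))
     blob_key = Column(String(64))   # key of the raw/normalized blobs (see 3)
     snippet = Column(Text)
     conversation = relationship("Conversation", back_populates="messages")
     sender = relationship("Identity")

 engine = create_engine("sqlite:///raindrop.db")
 Base.metadata.create_all(engine)
 Session = sessionmaker(bind=engine)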