Services/F1/Server/Architecture

Goals

  • provide a scalable architecture to serve up to 100M users
  • make sure the server is resilient to third-party failures (e.g., Twitter is down)
  • isolate the parts that need stronger security

Expected Load

  • 50k users: 1,600 transactions per day
  • 16 bytes / user / day
  • 1.5 GB / day

For 100M users that scales linearly to about 3.2 million transactions per day, i.e. roughly 37 transactions per second.
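
A quick back-of-the-envelope check of those numbers (a sketch; it assumes the 1,600 transactions per 50k users are per day, which is what makes the figures consistent):

  # Back-of-the-envelope check of the expected load.
  users = 100_000_000
  tx_per_user_per_day = 1_600 / 50_000           # 0.032 tx / user / day
  tx_per_day = users * tx_per_user_per_day       # 3,200,000 tx / day
  tx_per_second = tx_per_day / 86_400            # ~37 tx / s
  bytes_per_day = users * 16                     # 1.6e9 bytes, ~1.5 GB / day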

XXX

Synchronous Architecture

XXX

Asynchronous Architecture

[Image: Eff-ouane-two.png]

Big Picture

  • The Post office Front server (PF) receives a send request.
  • The request is pushed into a queue.
  • A worker (W) picks up the job and pushes the result into a key/value store.
  • The client retrieves the result via the Receiver Front server (RF).
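
From the client's point of view the exchange is fire-and-poll. A minimal sketch of that interaction, assuming hypothetical /send and /result endpoints and the requests library (none of these names come from the actual F1 API):

  import time
  import requests

  # Submit the job to the Post office Front server (PF).
  resp = requests.post("https://pf.example.com/send",
                       data={"service": "twitter", "message": "hello"})
  assert resp.status_code == 202           # accepted for asynchronous processing
  session_id = resp.json()["session_id"]   # hypothetical response field

  # Poll the Receiver Front server (RF) until the result is ready.
  while True:
      result = requests.get("https://rf.example.com/result/" + session_id)
      if result.status_code == 200:
          break
      time.sleep(1)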

Detailed transaction

1. PF checks the status of a service by looking at the Services Status DB. Each service has a (GR / BR) ratio stored in that DB, where GR is the number of good responses and BR the number of bad responses. What counts as a "good" or "bad" response is at the worker's discretion.

Given a threshold on that ratio, PF can decide to return a 503 immediately, together with a Retry-After header.
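
A minimal sketch of that check, with a placeholder key/value client and an assumed 0.5 cut-off (the real threshold would be a tuning knob):

  # Hypothetical health check performed by PF before accepting work.
  RATIO_THRESHOLD = 0.5   # assumed value, not from the spec

  def service_is_healthy(status_db, service):
      good = int(status_db.get(service + ":GR", 0))
      bad = int(status_db.get(service + ":BR", 0))
      if good + bad == 0:
          return True                      # no data yet, assume healthy
      return good / (good + bad) >= RATIO_THRESHOLD

  # When unhealthy, PF answers immediately with:
  #   503 Service Unavailable
  #   Retry-After: 600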

2. PF creates a unique Session ID for the transaction and stores it into the Session DB.

3. PF pushes the work to be done, together with the session ID, into the right queue. It then returns a 202 code to the client.
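
Steps 2 and 3 together could look like the following sketch; session_db and queue are placeholder clients:

  import json
  import uuid

  def accept_job(session_db, queue, service, payload):
      # Step 2: create a unique session ID and record the transaction.
      session_id = uuid.uuid4().hex
      session_db.set(session_id, "pending")
      # Step 3: enqueue the job for that service's workers, then 202.
      queue.publish(service, json.dumps({"session_id": session_id,
                                         "payload": payload}))
      return 202, session_id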

4. A worker picks up the job and executes it. Once it gets the result back, it pushes the result into a response queue.

5. A consumer picks up the finished jobs from the response queue.
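
The worker side of step 4 as a sketch, using the same placeholder queue client; subscribe is the Publisher/Subscriber interface described under Databases below:

  import json

  def worker_loop(queue, service, call_service):
      # Subscribe to one service's queue and process its jobs.
      for raw in queue.subscribe(service):
          job = json.loads(raw)
          # call_service returns a "G" (good) or "F" (failed) status
          # plus the third-party service's response.
          status, result = call_service(job["payload"])
          queue.publish("responses", json.dumps({
              "session_id": job["session_id"],
              "status": status,
              "result": result,
          }))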

6. The consumer selects a Result DB server using a simple modulo-based shard algorithm, and pushes the result into the selected Result DB.
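
The shard selection in step 6 can be as small as a stable hash of the session ID modulo the number of Result DB servers; a sketch, with a made-up server list:

  import zlib

  RESULT_DB_SERVERS = ["result-db-0", "result-db-1", "result-db-2"]  # placeholder

  def pick_result_db(session_id):
      # Stable hash of the session ID, modulo the number of shards.
      shard = zlib.crc32(session_id.encode("utf-8")) % len(RESULT_DB_SERVERS)
      return shard, RESULT_DB_SERVERS[shard]

Note that a plain modulo scheme reshuffles almost every key when a server is added or removed; that is presumably acceptable here because each result only lives for the duration of one transaction.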

7. The consumer updates the Session DB, writing there the Result DB server ID associated with the Session ID. Once this is done, the job is removed from the queue.

8. The consumer updates the Services Status DB with the status (G or F) provided by the worker.

9. When the user asks for the result via RF, RF looks up in the Session DB which Result DB holds the stored result.

10. RF then fetches the result and returns it.
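
Steps 9 and 10 as a sketch, again with placeholder DB clients:

  def fetch_result(session_db, result_dbs, session_id):
      # Step 9: find which Result DB shard holds this session's result.
      server_id = session_db.get(session_id)
      if server_id in (None, "pending"):
          return None                       # not ready yet; the client retries
      # Step 10: fetch the result from that shard and return it.
      return result_dbs[server_id].get(session_id)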

Databases

The Session DB and Services Status DB are key/value stores global to the cluster and replicated in several places. They can be eventually consistent.

The Result DB is a key/value store local to a single server. It holds the result for a specific transaction.

The queues provide a Publisher/Subscriber pattern, where each worker picks up jobs to be executed against a specific service.

Failure Scenarios and persistency

The general idea is that a transaction can be retried manually by the user if there's an issue in our system. However, the system should prevent clients from sending more work when a third-party service is down.


  • if PF cannot reach one of the following, a 503 with a Retry-After header is sent back:
    • the Services Status DB
    • the Session DB
    • the queue
  • if the queue crashes, the stored data should not be lost, and workers should be able to resume their subscriptions
  • the job should be left in the queue until the transaction is over:
    • a worker that picks up a job marks it as being processed
    • if a worker crashes while doing the job, the marker should decay after a TTL so another worker can pick up the job (see the sketch after this list)
    • if a consumer cannot reach the Result DB, the job stays in the queue and eventually goes away after a TTL
    • if a consumer cannot reach the Session DB, the job stays in the queue and eventually goes away after a TTL
  • if a consumer cannot reach the Services Status DB, the failure can safely be ignored; the status update is simply dropped
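
A sketch of the TTL-decaying "being processed" marker, assuming a store with atomic set-if-absent and per-key expiry (Redis-style NX/EX semantics; the clients and TTL value are placeholders):

  PROCESSING_TTL = 300   # seconds; assumed lease length

  def try_claim_job(marker_db, job_id):
      # Mark the job as being processed, but only if nobody else has.
      # The marker expires after PROCESSING_TTL, so if this worker
      # crashes mid-job, another worker can claim the job later.
      return marker_db.set("processing:" + job_id, "1",
                           nx=True, ex=PROCESSING_TTL)

  def finish_job(queue, marker_db, job_id):
      # The job is only removed from the queue once the whole
      # transaction is over (result stored, Session DB updated).
      queue.ack(job_id)
      marker_db.delete("processing:" + job_id)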


XXX

Security

XXX explains the level of security of each piece -- who holds the app key, the tokens, etc.

OAuth Dance

XXX


[Image: Effe-ouane.png]