Big Picture
- The Post Office Front server (PF) receives a send request.
- The request is pushed into a queue.
- A worker (W) picks up the job and pushes the result into a key/value store.
- The client retrieves the result via the Receiver Front server (RF).
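The hand-offs above can be sketched in a few lines, assuming in-memory stand-ins for every component. All names here (`post_office_front`, `worker`, `receiver_front`, `job_queue`, `results`) are hypothetical and only illustrate the order of operations, not the real servers.

```python
import uuid
from collections import deque

job_queue = deque()   # stand-in for the real queue
results = {}          # stand-in for the key/value store


def post_office_front(payload):
    """PF: accept a send request, enqueue it, and return a job ID."""
    job_id = uuid.uuid4().hex
    job_queue.append((job_id, payload))
    return job_id


def worker(handler):
    """W: pick one job, run it, and store the result."""
    job_id, payload = job_queue.popleft()
    results[job_id] = handler(payload)


def receiver_front(job_id):
    """RF: return the stored result, or None if it is not ready yet."""
    return results.get(job_id)


job_id = post_office_front("hello")
assert receiver_front(job_id) is None   # accepted, but not processed yet
worker(str.upper)                       # str.upper plays the third-party service
print(receiver_front(job_id))           # HELLO
```

The client never talks to the worker directly: it only sees PF when sending and RF when reading, which is what lets the work happen asynchronously.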
Detailed transaction
1. PF checks the status of a service by looking at the Services Status DB. Each service has a (GR / BR) ratio stored in that DB, where GR is the number of good responses and BR is the number of bad responses. What counts as a "good" or "bad" response is left to the worker's discretion.
Given a threshold on that ratio, PF can decide to return a 503 immediately, together with a Retry-After header.
2. PF creates a unique Session ID for the transaction and stores it in the Session DB.
3. PF pushes the work to be done, together with the Session ID, into the right queue. It then returns a 202 code to the client.
4. A worker picks up the job and executes it. Once it has the result, it sends the result back to the queue.
5. A consumer picks up the jobs that are done.
6. The consumer selects a Result DB server by using a simple modulo-based shard algorithm. It pushes the result into the selected Result DB.
7. The consumer updates the Session DB by writing the Result Server ID associated with the Session ID. Once this is done, the job is removed from the queue.
8. The consumer updates the Services Status DB with the status (G or B) provided by the worker.
9. When the user asks for the result via RF, RF looks up in the Session DB which Result DB holds the stored result.
10. RF then fetches the result and returns it.
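The transaction above can be sketched as follows, using plain dicts and lists in place of the real databases and queue. Several details are assumptions, not part of the design: the threshold value and the rule that a ratio below it means "unhealthy", the Retry-After delay, the `X-Session-Id` header used to hand the Session ID to the client, the server names in `RESULT_DBS`, and the choice to shard on the Session ID itself.

```python
import uuid

GOOD_BAD_THRESHOLD = 1.0   # assumed: reject when GR / BR drops below this
RETRY_AFTER_SECONDS = 30   # assumed Retry-After value
RESULT_DBS = ["result-db-0", "result-db-1", "result-db-2"]  # hypothetical names


def service_is_healthy(good, bad, threshold=GOOD_BAD_THRESHOLD):
    """Compare the GR / BR ratio to the threshold; no bad responses is healthy."""
    if bad == 0:
        return True
    return good / bad >= threshold


def handle_send(service, status_db, session_db, queue):
    """PF: check the service status, create a session, enqueue, answer 202."""
    good, bad = status_db.get(service, (0, 0))
    if not service_is_healthy(good, bad):
        return 503, {"Retry-After": str(RETRY_AFTER_SECONDS)}
    session_id = uuid.uuid4().hex
    session_db[session_id] = None            # Result DB not known yet
    queue.append((session_id, service))
    return 202, {"X-Session-Id": session_id}  # header name is made up


def pick_result_db(session_id, servers=RESULT_DBS):
    """Consumer: modulo-based shard selection on the (hex) Session ID."""
    return servers[int(session_id, 16) % len(servers)]


def store_result(session_id, result, session_db, result_dbs):
    """Consumer: write to the chosen Result DB, then record it in the Session DB."""
    server = pick_result_db(session_id)
    result_dbs[server][session_id] = result
    session_db[session_id] = server


def fetch_result(session_id, session_db, result_dbs):
    """RF: find the Result DB through the Session DB, then read the result."""
    server = session_db.get(session_id)
    if server is None:
        return None   # unknown session, or result not stored yet
    return result_dbs[server].get(session_id)


status_db = {"sms": (90, 10), "email": (1, 20)}   # service -> (GR, BR)
session_db, queue = {}, []
print(handle_send("sms", status_db, session_db, queue)[0])     # 202
print(handle_send("email", status_db, session_db, queue)[0])   # 503
```

Because the Session DB records which Result DB holds each result, RF does not need to know the shard algorithm; only the consumer does.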
Databases
The Session DB and Services Status DB are key/value stores global to the cluster and replicated in several places. They can be eventually consistent.
The Result DB is a key/value store local to a single server. It holds the result for a specific transaction.
The queues provide a publisher/subscriber pattern, where each worker picks up jobs to be executed on a specific service.
Failure Scenarios and Persistence
The general idea is that a transaction can be retried manually by the user if there's an issue in our system. However, our system should prevent clients from sending more work if a third-party service is down.
- If PF cannot reach one of the following, a 503 with a Retry-After header is sent back:
  - the Services Status DB
  - the Session DB
  - the queue
- If the queue crashes, the stored data should not be lost. Workers should be able to resume their subscriptions.
- The job should be left in the queue while a worker runs it:
  - a marker (M) is added so it's not picked up by another worker
  - the job is deleted upon completion and successful push of the result
- If a worker crashes while doing the job, M should decay after a TTL so another worker can pick up the job.
- If a worker cannot reach the Result DB, it retries a few times, then drops the result and removes the job from the queue.
- If a worker cannot reach the Session DB, it retries a few times, then drops the result and removes the job from the queue.
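The marker and retry rules above can be sketched with an in-memory queue. This is only an illustration under stated assumptions: `LeasedQueue`, `push_with_retries` and `run_one` are hypothetical names, a `ConnectionError` stands for "cannot reach the DB", three attempts stand for "a few times", and no delay is applied between attempts.

```python
import time


class LeasedQueue:
    """Jobs stay queued while claimed; a marker with a TTL hides them."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.jobs = {}      # job_id -> payload
        self.markers = {}   # job_id -> time at which the marker expires

    def push(self, job_id, payload):
        self.jobs[job_id] = payload

    def claim(self):
        """Mark and return a job with no live marker, or None if there is none."""
        now = self.clock()
        for job_id, payload in self.jobs.items():
            expiry = self.markers.get(job_id)
            if expiry is None or expiry <= now:
                self.markers[job_id] = now + self.ttl
                return job_id, payload
        return None

    def complete(self, job_id):
        """Delete the job and its marker."""
        self.jobs.pop(job_id, None)
        self.markers.pop(job_id, None)


def push_with_retries(store, key, value, attempts=3):
    """Try to write the result a few times; report whether it succeeded."""
    for _ in range(attempts):
        try:
            store(key, value)
            return True
        except ConnectionError:
            continue
    return False


def run_one(queue, do_job, store):
    """Claim one job, run it, push the result, then remove the job."""
    claimed = queue.claim()
    if claimed is None:
        return None
    job_id, payload = claimed
    stored = push_with_retries(store, job_id, do_job(payload))
    queue.complete(job_id)   # removed whether the result was stored or dropped
    return stored


# A fake clock makes the TTL behaviour visible without sleeping.
now = [0.0]
q = LeasedQueue(ttl=10, clock=lambda: now[0])
q.push("j1", "payload")
assert q.claim() == ("j1", "payload")   # first worker takes the job
assert q.claim() is None                # marker hides it from other workers
now[0] = 11.0                           # first worker crashed; marker decayed
assert q.claim() == ("j1", "payload")   # another worker picks it up
```

Note that if a job takes longer than the TTL, a second worker may start it while the first is still running, so the TTL should be chosen well above the expected job duration.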
XXX
Security
XXX explains the level of security of each piece -- who holds the app key, the tokens, etc
OAuth Dance
XXX