Services/F1/Server/ServicesStatusDB: Difference between revisions

From MozillaWiki
Revision as of 14:08, 29 April 2011

Goal

When a third-party service like Twitter goes down or becomes very slow, we need to inform the client of the outage and ask it to retry later. If possible, we should tell the client what is going on.

Outage definition

The Share server can have three types of outages:

1. a Mozilla infrastructure outage

2. a third-party outage, like "Twitter is down"

3. a scheduled maintenance window

Client UX on outage

When an outage happens, the server returns a 503 with a Retry-After header, optionally an X-Strict-Retries header, and possibly an explanation in the response body.

When a client sends a request and gets back a 503, a bar pops up:

  Twitter seems to be unresponsive, we will try again automatically in 5 min. [Force retry]

The "Force retry" button in the pop-up bar lets the end user retry immediately instead of waiting.

When possible, the server informs the client why the service is down via the body, and the client can display the information:

  As scheduled, our system is currently in a maintenance window that will be over in 25 min.

If the X-Strict-Retries header is present, as in this example, the "Force retry" button does not appear.

After three attempts, automatic or forced, the message to be sent is discarded and the user gets a pop-up bar:

  Failed to send the message after 3 attempts, Discarded.
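
The client behaviour described above can be sketched as follows. This is only an illustration, not the actual client code: the function name next_action, the MAX_ATTEMPTS constant and the shape of the returned dictionary are assumptions. Only the 503 status, the Retry-After and X-Strict-Retries headers and the three-attempt limit come from this page.

```python
MAX_ATTEMPTS = 3  # from this page: the message is discarded after three attempts


def next_action(status, headers, body, attempts):
    """Decide what the client does after a response (hypothetical helper).

    `attempts` is the number of attempts made so far, including this one.
    """
    if status != 503:
        return {"action": "done"}
    if attempts >= MAX_ATTEMPTS:
        return {
            "action": "discard",
            "message": "Failed to send the message after 3 attempts, Discarded.",
        }
    return {
        "action": "retry",
        # Assumption: Retry-After carries a number of seconds, not an HTTP date.
        "delay": int(headers.get("Retry-After", "0")),
        # The "Force retry" button is hidden when X-Strict-Retries is present.
        "allow_force_retry": "X-Strict-Retries" not in headers,
        # The body, when present, explains the outage.
        "reason": body or None,
    }
```

For example, a 503 carrying "Retry-After: 300" and no X-Strict-Retries header on the first attempt yields a retry in 300 seconds with the "Force retry" button shown.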


Principle

This section describes the flow of a request:

1. On every request, the client adds an X-Target-Service header containing the domain of the service it wants to reach. For example, if the client wants to share on Twitter, an "X-Target-Service: twitter.com" header is added.

2. The web server (Nginx) that receives the request asks the Services DB for the status of the service (as described later) and decides whether the request should go through.

3. If the request is rejected, the client receives a 503 with a Retry-After header and has to wait before it retries. It may also get a reason in the body and an X-Strict-Retries header.

4. If the request is accepted, it is passed to the upstream server (Python) that does the job.

5. If the upstream server succeeds, it asynchronously notifies the Services DB.

6. If the upstream server fails, e.g. if the third-party service is considered down, it asynchronously notifies the Services DB and sends back a 503 + Retry-After.
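
The six steps can be walked through with a small sketch. Everything named here is hypothetical: handle_request, ThirdPartyError and the three callbacks stand in for Nginx, the Python upstream server and the Services DB. The default of 300 seconds is a placeholder, and the notification is a plain synchronous call here, whereas the design above makes it asynchronous.

```python
class ThirdPartyError(Exception):
    """Raised by the upstream worker when the third-party call fails (hypothetical)."""


def handle_request(headers, is_backed_off, send_upstream, notify, retry_after=300):
    """Walk one request through steps 1 to 6 and return (status, headers, body)."""
    service = headers["X-Target-Service"]  # step 1: header set by the client
    if is_backed_off(service):  # step 2: ask the Services DB
        return 503, {"Retry-After": str(retry_after)}, ""  # step 3: rejected
    try:
        body = send_upstream(service)  # step 4: the upstream server does the job
    except ThirdPartyError:
        notify(service, False)  # step 6: report the failure
        return 503, {"Retry-After": str(retry_after)}, ""
    notify(service, True)  # step 5: report the success
    return 200, {}, body
```

Passing the three collaborators as callables keeps the sketch independent of how the Services DB is actually queried.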


Database

The DB is a Membase key/value store that holds, for each service:

  • GR: the number of good requests - TTL-ed
  • BR: the number of bad requests - TTL-ed
  • Disable: a disabled flag
  • Retry-After: the value of the Retry-After header
  • TTL: the time after which GR and BR are reset to 0
  • MinReqs: the minimum number of requests (GR+BR) before the ratio is considered meaningful
  • MinRatio: the value between 0 and 1 under which the service is considered unreliable

The DB is replicated in several places and is eventually consistent.
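
As an illustration of the record kept for each service, here is an in-memory stand-in. It is a sketch under assumptions, not the Membase schema: the class and field names are invented, times are assumed to be in seconds, and the TTL is modelled as one window that resets both counters together.

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    """In-memory stand-in for the per-service record (names are assumptions)."""

    retry_after: int  # Retry-After: value sent back in the header
    ttl: int  # TTL: time after which GR and BR go back to 0
    min_reqs: int  # MinReqs: GR + BR needed before the ratio is meaningful
    min_ratio: float  # MinRatio: ratio under which the service is unreliable
    gr: int = 0  # GR: good requests in the current TTL window
    br: int = 0  # BR: bad requests in the current TTL window
    disabled: bool = False  # Disable: the disabled flag
    window_start: float = 0.0  # bookkeeping for this sketch only

    def record(self, good, now):
        """Count one request, resetting both counters once the TTL has elapsed."""
        if now - self.window_start >= self.ttl:
            self.gr = self.br = 0
            self.window_start = now
        if good:
            self.gr += 1
        else:
            self.br += 1
```

The caller passes the current time explicitly, which keeps the expiry behaviour easy to reason about without a real clock.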

Back-off decision process

For every service, the parameters that can be configured are:

  • Retry-After: the value of the Retry-After header
  • TTL: the time after which GR and BR are reset to 0
  • MinReqs: the minimum number of requests (GR+BR) before the ratio is considered meaningful
  • MinRatio: the value between 0 and 1 under which the service is considered unreliable

These parameters have default values that are stored in configuration files and pushed into the DB. In other words, they can be changed dynamically by the workers or by an admin application.

Every time a request comes in, the web server gets the Disable, GR and BR values and calculates the ratio of successes, GR / (GR + BR), to decide whether the request should be rejected.

In pseudo-code:

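 if disabled:
     # oops, the service was completely disabled
     res = 503("Retry-After: " + retry_after)
     if disabling_reason:
         # X-Strict-Retries hides the client's "Force retry" button
         res.append("X-Strict-Retries")
     raise res
 
 num_reqs = BR + GR
 
 if num_reqs == 0 or num_reqs < min_reqs:
     # not meaningful: too few requests to compute a ratio
     return
 
 # computed only after the check above, so num_reqs cannot be 0 here
 ratio = GR / num_reqs
 
 if ratio < min_ratio:
     # oops, too many failures
     raise 503("Retry-After: " + retry_after)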

What counts as a "Good" or a "Bad" response from a third-party service is left to the workers' discretion.
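
A runnable version of the decision process might look like the sketch below. It follows the pseudo-code, but BackOff, check_service and the "1" value given to X-Strict-Retries are assumptions made for the example. The real check runs inside the web server, not in Python.

```python
class BackOff(Exception):
    """Stands in for the 503 reply sent by the web server (hypothetical)."""

    def __init__(self, retry_after, strict=False):
        super().__init__("503 Service Unavailable")
        self.headers = {"Retry-After": str(retry_after)}
        if strict:
            # Assumption: the header value is not specified on this page.
            self.headers["X-Strict-Retries"] = "1"


def check_service(gr, br, disabled, retry_after, min_reqs, min_ratio,
                  disabling_reason=None):
    """Return None if the request may go through, raise BackOff otherwise."""
    if disabled:
        # The service was completely disabled.
        raise BackOff(retry_after, strict=bool(disabling_reason))
    num_reqs = gr + br
    if num_reqs == 0 or num_reqs < min_reqs:
        # Too few requests for the ratio to be meaningful.
        return
    if gr / num_reqs < min_ratio:
        # Too many failures.
        raise BackOff(retry_after)
```

With GR = 2, BR = 8, MinReqs = 10 and MinRatio = 0.5, the ratio is 2 / 10 = 0.2, which is under 0.5, so the request is rejected. With only 5 requests recorded, the same counters would let the request through because the ratio is not yet considered meaningful.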

Rationale

When a third-party service like Twitter goes down or becomes very slow, clients retry and send more and more requests to our server, so our infrastructure becomes overloaded and potentially unresponsive.

The goal of the Services Status DB is to give every front web server in our infrastructure the status of every third-party service in use.

The web server can then decide to preemptively back off, for a limited time, any new request to a service that is down, in order to avoid piling up unnecessary work.