Services/F1/Server/ServicesStatusDB: Difference between revisions

Revision as of 13:39, 29 April 2011

Goal

When a third-party service such as Twitter goes down or becomes very slow, clients retry and send more and more requests to our server, which can overload our infrastructure and make it unresponsive. See the rationale section.

The goal of the Services Status DB is to give every front web server in our infrastructure the status of every third-party service in use. A web server can then preemptively back off any new request to a service that is down, to avoid piling up unnecessary work.

Principle

1. On every request, the client adds an X-Target-Service header containing the domain of the service it wants to reach. For example, if the client wants to share on Twitter, an "X-Target-Service: twitter.com" header is added.

2. The web server (Nginx) that receives the request asks the Services DB for the status of the service (as described later) and decides whether the request should go through.

3. If the request is rejected, the client receives a 503 response with a Retry-After header and must wait the indicated time before retrying.

4. If the request is accepted, it is passed to the upstream server (Python), which does the job.

5. If the upstream server succeeds, it asynchronously notifies the Services DB.

6. If the upstream server fails, e.g. because the third-party service is considered down, it asynchronously notifies the Services DB.
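
The client side of steps 1 and 3 can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and it assumes the Retry-After value is given as a number of seconds (HTTP also allows a date form, which this sketch does not handle).

```python
def build_headers(service_domain):
    """Step 1: tag the request with the third-party service it targets."""
    return {"X-Target-Service": service_domain}


def retry_delay(status_code, headers):
    """Step 3: return the number of seconds to wait before retrying.

    Returns None when the response is not a back-off (anything other
    than a 503 carrying a Retry-After header).
    Assumption: Retry-After holds an integer number of seconds.
    """
    if status_code != 503:
        return None
    value = headers.get("Retry-After")
    if value is None:
        return None
    return int(value)


# A client about to share on Twitter:
headers = build_headers("twitter.com")
# A rejected request tells the client how long to wait:
delay = retry_delay(503, {"Retry-After": "120"})
```

A client built this way waits `delay` seconds before sending the request again, instead of retrying immediately.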

Client UX on outage

Database

The DB is a Membase key/value store that holds, for each service:

  • the number of good requests [GR] - TTL-ed
  • the number of bad requests [BR] - TTL-ed
  • a disabled flag

The DB is replicated in several places and is eventually consistent.
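
To make the per-service record concrete, here is a minimal in-memory stand-in written in Python. It is not the Membase schema: the class name, the injectable clock, and the choice to expire both counters together in one TTL window are assumptions made for illustration.

```python
import time


class ServiceStatus:
    """In-memory stand-in for one service's record (hypothetical layout)."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl            # seconds before GR and BR are reset
        self.clock = clock        # injectable so the TTL can be tested
        self.disabled = False     # manual kill switch
        self._reset()

    def _reset(self):
        self.good = 0             # GR
        self.bad = 0              # BR
        self.expires_at = self.clock() + self.ttl

    def _expire_if_needed(self):
        # Assumption: both counters share a single TTL window.
        if self.clock() >= self.expires_at:
            self._reset()

    def record(self, success):
        """Called by the upstream server after each third-party call."""
        self._expire_if_needed()
        if success:
            self.good += 1
        else:
            self.bad += 1

    def counters(self):
        """Return (GR, BR), after applying the TTL."""
        self._expire_if_needed()
        return self.good, self.bad
```

With a fake clock, recording one success and two failures yields `(1, 2)`, and once the TTL has elapsed `counters()` returns `(0, 0)` again.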

Back-off decision process

For each service, the following parameters can be configured:

  • Retry-After: the value of the header
  • TTL: the time after which GR and BR are reset to 0
  • MinReqs: the minimum number of requests (GR+BR) before the ratio is considered meaningful
  • MinRatio: the value between 0 and 1 under which the service is considered unreliable

These parameters are stored in configuration files but pushed into the DB. In other words, they can be changed dynamically by the workers or an admin application.
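
One possible way to represent these four parameters in Python is shown below. The class name, the field names, the units (seconds), and the sample values are all assumptions; the original design does not specify them.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ServiceParams:
    """Per-service back-off parameters (hypothetical representation)."""
    retry_after: int   # Retry-After header value, assumed to be seconds
    ttl: int           # seconds after which GR and BR are reset to 0
    min_reqs: int      # minimum GR+BR before the ratio is meaningful
    min_ratio: float   # below this success ratio, the service is unreliable

    def __post_init__(self):
        if not 0.0 <= self.min_ratio <= 1.0:
            raise ValueError("min_ratio must be between 0 and 1")
        if self.retry_after < 0 or self.ttl <= 0 or self.min_reqs < 0:
            raise ValueError("retry_after, ttl and min_reqs must be positive")


# Illustrative values only, as if loaded from a configuration file.
params = ServiceParams(retry_after=60, ttl=300, min_reqs=20, min_ratio=0.5)

# A dynamic change (by a worker or an admin application) produces a new,
# re-validated value rather than mutating the old one.
stricter = replace(params, min_ratio=0.3)
```

Making the record immutable is a design choice of this sketch: every change goes through validation again, so an out-of-range MinRatio cannot be pushed by accident.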

Every time a request comes in, the web server gets GR and BR and calculates the ratio of successes.

In pseudo-code:

 num_reqs = BR + GR
 
 if num_reqs < min_reqs:
     # not meaningful
     return
 
 # computed after the check above, so num_reqs is non-zero whenever min_reqs >= 1
 ratio = GR / num_reqs
 
 if ratio < min_ratio:
     # oops, too many failures
     raise 503("Retry-After: " + retry_after)


What counts as a "Good" or "Bad" response from a third-party service is at the workers' discretion.
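
The decision process described in this section could be written as runnable Python along these lines. The function names and the `(status, headers)` return shape are assumptions for illustration; the real check runs inside the web server. The sketch also treats zero recorded requests as "not meaningful", so it never divides by zero.

```python
def should_back_off(good, bad, min_reqs, min_ratio):
    """Return True when the success ratio says the service is unreliable."""
    num_reqs = good + bad
    if num_reqs == 0 or num_reqs < min_reqs:
        # Not enough data for the ratio to be meaningful: let it through.
        return False
    ratio = good / num_reqs
    return ratio < min_ratio


def backoff_response(retry_after):
    """Build the 503 answer as a (status, headers) pair (hypothetical shape)."""
    return 503, {"Retry-After": str(retry_after)}


# Worked example with illustrative numbers: GR=2, BR=8 gives a ratio of 0.2,
# which is below a MinRatio of 0.3, and 10 requests exceed a MinReqs of 5.
if should_back_off(good=2, bad=8, min_reqs=5, min_ratio=0.3):
    status, headers = backoff_response(60)
```

With only one recorded request and a MinReqs of 5, the same function returns False and the request goes through, whatever the ratio.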

Disabled

The DB also stores a disabled flag for each service, which can be toggled manually to shut down a service if needed.

In that case, every request is backed off.