Services/F1/Server/ServicesStatusDB

* TTL: the time after which GR and GB are initialized to 0
* MinReqs: the minimum number of requests (GR+GB) before the ratio is considered meaningful
* Threshold: the value between 0 and 1 under which the service is considered unreliable


These parameters have default values stored in configuration files, but they are also pushed into the DB. In other words, they can be changed dynamically by the workers or an admin application.
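
As a rough illustration, here is a minimal sketch of how a worker could seed and read these parameters. It assumes a simple key/value client for the Services Status DB; the db.get/db.set calls and the default values are illustrative assumptions, not the project's actual API.

 # Minimal sketch (hypothetical db.get/db.set key/value API, example defaults).
 CONFIG_DEFAULTS = {"ttl": 300,        # seconds before GR and GB are reset to 0
                    "min_reqs": 3,     # MinReqs
                    "threshold": 0.3}  # Threshold
 
 def load_params(db):
     """Push the configured defaults into the DB if they are not there yet,
     then read the live values back, so workers or an admin application
     can change them dynamically."""
     params = {}
     for name, default in CONFIG_DEFAULTS.items():
         if db.get(name) is None:      # first run: seed the DB with the default
             db.set(name, default)
         params[name] = db.get(name)   # afterwards the value stored in the DB wins
     return params
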
       return
 
   if ratio < threshold:
       # oops, too many failures
       res = 503()
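
For context, here is a hedged sketch of the kind of check this fragment belongs to. It is framework-agnostic: instead of building a real response object it returns a status code plus headers, and get_counters, min_reqs, threshold and retry_after are illustrative names rather than the page's actual identifiers.

 # Sketch of the front-end check (illustrative names, not the real implementation).
 def check_service(db, service, params):
     gr, gb = db.get_counters(service)   # good/bad counters, reset when the ttl expires
     total = gr + gb
     if total < params["min_reqs"]:
         return None                     # not enough data yet: let the request through
 
     ratio = gr / float(total)
     if ratio < params["threshold"]:
         # oops, too many failures: back the client off for a while
         return 503, {"Retry-After": str(params["retry_after"])}
     return None                         # the service looks healthy enough
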
What counts as a "Good" or a "Bad" response from the third-party services is left to the workers' discretion.
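
For example, a worker could classify responses along these lines; treating timeouts and 5xx answers as "Bad" is just one possible policy, and db.incr is a hypothetical counter-increment call.

 # One possible (hypothetical) policy for updating the GR/GB counters.
 def record_result(db, service, status_code, timed_out=False):
     if timed_out or status_code >= 500:
         db.incr(service + ":GB")   # bad: timeout or server error from the third party
     else:
         db.incr(service + ":GR")   # good: anything else counts as a success here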


= Rationale for the threshold =


When a third-party service like Twitter goes down or becomes very slow, clients keep retrying, sending more and more requests to our servers, and our infrastructure becomes overloaded and potentially unresponsive.


The goal of the threshold is to give every front web server a way to pre-emptively back off, for a limited time, any new request to the services that are down, in order to avoid piling up unnecessary work and triggering infrastructure alarms. This regulation lets our servers get back to a normal flow once Twitter is up again.


== What happens without a threshold ==
 
Let's say each server gets 100 requests per second (RPS), and that handling a request includes the time Twitter takes to answer. Each server is able to handle a maximum of 500 concurrent requests.
 
Since Twitter is down, all of these requests are timing out after 30 s. After 30 s, each server is therefore handling 3000 concurrent requests, and the system has already started to send back errors because it's unable to hold that many concurrent requests.
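
To make the arithmetic explicit, here is a back-of-the-envelope sketch using the numbers assumed above (a rough estimate, not a measurement):

 # Requests held in flight ~= arrival rate x time each request is held open.
 rps = 100              # requests per second hitting one server
 timeout = 30           # seconds before a call to Twitter gives up
 print(rps * timeout)   # 3000 concurrent requests, far more than a server can hold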
 
Since we don't do any monitoring on Twitter, we're unable to determine whether the outage is a problem on our side (like a sudden spike of activity) or on Twitter's side.
 
We're just backing off all requests, and the client gets a message saying that it did not work; we are unable to tell the client why it is happening. All the Nagios checks turn RED on those servers because the heartbeat is not responding anymore. Ops have to intervene and check what's going on: "Grrr, Twitter is down... not my fault."
 
== What happens with a threshold ==
 
Under the same conditions.
 
Since Twitter is down, requests are timing out after 30 s. After a few seconds, the success/failure ratio drops below the threshold, and we're able to tell the end-users that there's a Twitter outage. Only a limited number of requests to Twitter keep flowing from our servers, so we don't trigger any infrastructure alerts.
 
We're just backing off all requests, and the client gets a message saying that it did not work. We can regulate and handle this automatically, and we can tell the end-user that Twitter is down.
 
== Scenario ==
 
Here's a full example. Let's say we have these values:
 
* ttl = 5 minutes
* retry-after = 5 minutes 1 second
* [R] ratio under which we back off clients: 0.3
* [T] total number of requests before we check the ratio: 3
* [G] good requests
* [B] bad requests
 
Let's make the assumption that we start with G at 1 and B at 0, and that 10 requests are made while Twitter is down.
 
Sequence:
 
0s: Twitter is down. G = 1, B = 0. The ttl starts.
2s: G = 1, B = 1, total = 2 (below T, ratio not checked yet)
3s: G = 1, B = 2, total = 3, ratio = 0.33 (above R)
4s: G = 1, B = 3, total = 4, ratio = 0.25 (below R)
--- at this point we start backing off requests ---
5s: client 4 backed-off, retry at 5m6s
15s: client 5 backed-off, retry at 5m16s
30s: client 6 backed-off, retry at 5m31s
45s: client 7 backed-off, retry at 5m46s
50s: client 8 backed-off, retry at 5m51s
52s: client 9 backed-off, retry at 5m53s
---
60s: Twitter is back online
 
-- calls still being rejected --
 
4m59s: client 10 backed-off, retry at 10m
 
5m: the ttl expires, G = 0, B = 0, and a new ttl starts.
5m6s: client 4 retries, success, G = 1, B = 0
5m16s: client 5 retries, success, G = 2, B = 0
5m31s: client 6 retries, success, G = 3, B = 0
5m46s: client 7 retries, success, G = 4, B = 0
5m51s: client 8 retries, success, G = 5, B = 0
5m53s: client 9 retries, success, G = 6, B = 0
...
10m: client 10 retries, success, G = 7, B = 0
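
As a sanity check on the first few steps of this timeline, here is a small self-contained sketch that replays the three failed requests with the values assumed above (an illustration of the back-off decision, not the real implementation):

 # Replay the first three failures of the scenario (R = 0.3, T = 3, G starts at 1).
 R, T = 0.3, 3
 G, B = 1, 0
 for t in (2, 3, 4):              # seconds at which a request fails
     B += 1
     total = G + B
     ratio = G / float(total)
     backing_off = total >= T and ratio < R
     print("%ds: G=%d B=%d ratio=%.2f backing off: %s" % (t, G, B, ratio, backing_off))
 # 2s: G=1 B=1 ratio=0.50 backing off: False
 # 3s: G=1 B=2 ratio=0.33 backing off: False
 # 4s: G=1 B=3 ratio=0.25 backing off: True   <- back-off starts after the 4s request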
 
 
As you can see, one particularity of the system is that once Twitter is back online, there is a bit of inertia before all clients start hitting our servers again.
 
One small caveat: when the ttl-ed values are reset to 0, if the service is still down, a small number of requests will make it through before the threshold is computed correctly again.
 
The configuration values really depend on what kind of outage we want to target: long ones (hours) or medium ones (a few minutes).