CloudServices/Sagrada/TokenServer: Difference between revisions

From MozillaWiki
(45 intermediate revisions by 5 users not shown)
Line 1: Line 1:
= Goals =
= Goals =


So here's the challenge we face. Current login for sync looks like this:
tldr: having a centralized login service.


# provide username and password
See: http://docs.services.mozilla.com/token/index.html#goal-of-the-service
# we log into ldap with that username and password and grab your sync node
# we check the sync node against the url you've accessed, and use that to configure where your data is stored.


This solution works great for centralized login. It's fast, has a minimum number of steps, and caches the data centrally. The system that does node-assignment is lightweight, since the client and server both cache the result, and has support for multiple applications with the /node/<app> API protocol.
= APIS =


However, this breaks horribly when we don't have centralized login. And adding support for browserid to the SyncStorage protocol means that we're now there. We're going to get valid requests from users who don't have an account in LDAP. We won't even know, when they make a first request, if the node-assignment server has ever heard of them.
see http://docs.services.mozilla.com/token/apis.html


So, we have a bunch of requirements for the system. Not all of them are must-haves, but they're all things we need to think about trading off in whatever system gets designed:
* need to support multiple services (not necessarily centrally)
* need to be able to assign users to different machines as a service scales out, or somehow distribute them
* need to consistently send a user back to the same server once they've been assigned
* need to give operations some level of control over how users are allocated
* need to provide some recourse if a particular node dies
* need to handle exhaustion attacks. For example, I could set up a primary that just auto-approved any username, then loop through usernames until all nodes were full.
* need support for future developments like bucketed assignment
* Needs to be a system that scales infinitely.


= Proposed Design =
= Proposed Design =
Line 33: Line 21:
== Definitions and assumptions ==
== Definitions and assumptions ==


First, a few definitions.  The major players in the network topology are:
See http://docs.services.mozilla.com/token/index.html#assumptions
 
* '''Service''': a service Mozilla provides, like '''Sync''' or '''Easy Setup'''.
* '''Login Server''': used to authenticate the user; returns tokens that can be used to authenticate to our services.
* '''Node''': a URL that identifies a service, like http://phx345
* '''Service Node''': a server that contains the service, and can be mapped to several Nodes (URLs)
* '''Node Assignment Server''': a service that can assign a node to a user.
* '''User DB''': a database that keeps the user/node relation
* '''Cluster''': Group of webheads and storage devices that make up a set of Service Nodes.
* '''Colo''': physical datacenter, may contain multiple clusters
 
Cryptographically, we have the following terms:
 
* '''HKDF''':  HMAC-based Key Derivation Function, a method for deriving multiple secret keys from a single master secret (https://tools.ietf.org/html/rfc5869).
* '''Two-Legged OAuth''':  an authentication scheme for HTTP requests, based on a HMAC signature over the request metadata.  (http://tools.ietf.org/html/rfc5849#section-3)
* '''Auth Token''': used to identify the user after starting a session.  Contains the user application id and the expiration date.
* '''Metadata Token''': used to send application-specific metadata for the Service.
* '''Master Secret''':  a secret shared between Login Server and Service Node. Never used directly, only for deriving other secrets.
* '''Signing Secret''': derived from the master secret, used to sign auth and metadata tokens. For example: sig-secret = HKDF_Expand(master-secret, "SIGN")
* '''Encryption Secret''': derived from the master secret, used to encrypt the metadata token.  For example: enc-secret = HKDF_Expand(master-secret, "ENCRYPT")
* '''Token Secret''':  derived from the master secret and auth token, used as '''oauth_consumer_secret'''. This is the only secret shared with the client and is different for each token. For example: token-secret = HKDF_Expand(master-secret, auth-token)
 
 
 
Some assumptions:
 
* A Login Server holds the secrets for all the Service Nodes of a given Service.
* Any given webhead in a cluster can receive calls to all service nodes in the cluster.
* The Login Server will support only BrowserID at first, but could support any authentication protocol in the future, as long as it can be done with a single call.
* All servers are time-synced


== Flow ==
== Flow ==


Here's the proposed two-step flow (with Browser ID):
see http://docs.services.mozilla.com/token/user-flow.html


# the client trades a browser id assertion for an auth token and corresponding secret
== Authorization token ==
# the client uses the auth token to sign subsequent requests using two-legged oauth


Getting an auth token:
A token is a json encoded mapping. The keys of the Authorization Token are:


<pre>
* '''expires''': an expiration timestamp (UTC); defaults to the current time + 30 minutes
Client                      Login Server                  BID        User DB          Node Assignment Server 
* '''uid''': the app-specific user id (the user id integer in the case of sync)
===========================================================================================================
* '''salt''': a randomly-generated salt for use in the calculation of the Token Secret (''optional'')
                                |                          |            |                    |               
* '''node''': the name of the service node to which the user is assigned
request token ---- [1] --------->|------> verify --- [2] -->|            |                    |               
                                |      get node -- [3] ---|------------>|--> lookup          |               
                                |                          |            |<-- return node     |                 
                                |  attribute node --[4]----|-------------|------------------->|--> set node   
                                |                          |            |                    |<-- node       
                                |<--- build token  [5]    |            |                    |             
keep token <-------- [6] --------|                          |            |                    |               
</pre>


Calling the service:
Example:


<pre>
  auth_token = {"uid": 123, "node": "https://sync-1.services.mozilla.com", "expires": 1324654308.907832, "salt": "sghfwq6875765..UYgs"} 


Client                                          Service Node
============================================================
The token is signed using the Signing Secret and base64-ed. The signature is HMAC-SHA256:
create signed auth header [7]    |                | 
call node --------------- [8] ---|---------------->|--> verify token [9]
                                |                |    verify request signature [10]
                                |                |<-- process request [11]
get response <-------------------|-----------------|
</pre>


  auth_token, signature = HMAC-SHA256(auth_token, sig_secret)
  auth_token = b64encode(auth_token, signature)


* the client requests a token, giving its browser id assertion [1]
'''The authorization token is not encrypted'''


    POST /request_token HTTP/1.1
== Secrets ==
    Host: token.services.mozilla.com
    Content-Type: application/json
    X-Authentication-Method: Browser-ID  (optional header since Browser-ID is the default)
   
    {"audience":XXX,"assertion":XXX}   


* the Login Server checks the browser id assertion [2] '''this step will be done locally without calling an external browserid server -- but this could potentially happen''' (we can use pyvep + use the BID.org certificate)
Each Service Node has a unique Master Secret that it shares with the Login Server, which is used to sign and validate authentication tokens. Multiple secrets can be active at any one time to support graceful rolling over to a new secret.
* the Login Server asks the Users DB if the user is already allocated to a node. [3]
* if the user is not allocated to a node, the Login Server requests a new one from the Node Assignment Server [4]
* the Login Server creates a response with an auth token and corresponding token secret [5] and sends it back to the user.  The auth token contains the user id and a timestamp, and is signed using the signing secret.  The token secret is derived from the master secret and auth token using HKDF. It also adds the node url in the response, and optionally a metadata token. [6]


  HTTP/1.1 200 OK
To simplify management of these secrets, the tokenserver maintains a single list of master secrets and derives a secret specific to each node using HKDF:
  Content-Type: application/json
 
  {'oauth_consumer_key': <auth-token>,
    'oauth_consumer_secret': <token-secret>,
    'service_entry': <node>,
    'metadata': <metadata-token>
    }


* the client saves the node location and oauth parameters to use in subsequent requests. [6]
* node-info = "services.mozilla.com/mozsvc/v1/node_secret/" + node-name
* for each subsequent request to the Service, the client calculates a special Authorization header using two-legged OAuth [7] and sends the request to the allocated node location [8] along with the metadata token if provided
* node-master-secret = HKDF(master-secret, salt=None, info=node-info, size=digest-length)


    POST /request HTTP/1.1
The node-specific Master Secret is used to derive keys for various cryptographic routines. At startup time, the Login Server and Node should pre-calculate and cache the signing key as follows:
    Host: some.node.services.mozilla.com
    Authorization: OAuth realm="Example",
                    metadata=<metadata-token>,
                    oauth_consumer_key=<auth-token> 
                    oauth_signature_method="HMAC-SHA1",
                    oauth_timestamp="137131201",  (client timestamp)
                    oauth_nonce="7d8f3e4a",
                    oauth_signature="bYT5CMsGcbgUdFHObYMEfcx6bsw%3D"


* sig-secret:  HKDF(node-master-secret, salt=None, info="SIGNING", size=digest-length)


* the node uses the Signing Secret to validate the Auth Token [9]. If it is invalid or expired, the node returns a 401.
By using no salt (or a fixed salt), these secrets can be calculated once and then used for each request.
* the node calculates the Token Secret from its Master Secret and the Auth Token, and checks whether the signature in the Authorization header is valid [10]. If it is invalid, the node returns a 401.
* the node processes the request as defined by the Service [11]


== Tokens ==
When issuing or checking an Auth Token, the corresponding Token Secret is calculated as:


A token is a json encoded mapping. There are two tokens:
* token-secret: b64encode(HKDF(node-master-secret, salt=token-salt, info=auth-token, size=digest-length))


* the authorization token: contains the user application id and the expiration date.
Note that the token-secret is base64-encoded for ease of transmission back to the client.
* the metadata token: contains app-specific data ''(optional)''




=== Authorization Token ===
=== Configuring Secrets ===


The keys of the Authorization Token are:
The tokenserver should be configured to use the DerivedSecrets class with the list of master secrets:


* '''expires''': an expiration timestamp (UTC); defaults to the current time + 30 minutes
    [tokenserver]
* '''uid''': the app-specific user id (the user id integer in the case of sync)
    secrets.backend = mozsvc.secrets.DerivedSecrets
    secrets.master_secrets = master-secret-one master-secret-two


Example:
A suitable master secret can be generated using mozsvc as follows:


  auth_token = {'uid': '123', 'expires': 1324654308.907832} 
    python -m mozsvc.secrets new


The token is signed using the signing secret and base64-ed. The signature is HMAC-SHA1:
Each node should be configured to use the FixedSecrets class and its corresponding derived secret:


  auth_token, signature = HMAC-SHA1(auth_token, sig_secret)
    [hawkauth]
  auth_token = b64encode(auth_token, salt, signature)
    secrets.backend = mozsvc.secrets.FixedSecrets
    secrets.secrets = node-master-secret-one, node-master-secret-two


'''The authorization token is not encrypted'''
This prevents a compromise on one service node from leaking the secrets on all nodes.  A suitable node-specific secret can be derived from the master secret as follows:


=== Metadata token (optional) ===
    python -m mozsvc.secrets derive <master_secret> https://<node_name>


The keys of the Metadata token are free-form. This token can include anything needed by the application to function.


It's passed as-is by the client to the Service Node
=== Secret Update Process ===
 
Example:
 
  app_token = {'email': 'my@email.com', 'someparam': 1324654308.907832} 
 
To avoid information leakage, the token is encrypted and signed using the shared secret and base64-ed. The encryption is AES-CBC and signature is HMAC-SHA1:
 
  app_token, signature = AES-CBC+HMAC-SHA1(app_token, secret_key)
  app_token = b64encode(app_token, signature)
 
 
'''The metadata token is encrypted'''
 
== Shared Secrets File ==


Each Service Node shares with the Login Server a unique secret for each Node it serves. A secret is a hex string of 256 characters from [a-f0-9].
To revoke the secrets for a specific node, simply rename it so that its derived secret will be different.


Example of generating such string:
To update the master secrets, the following procedure should be used:


  >>> import binascii, os
1) Generate the new master secret, but keep the old one as well for now
  >>> print(binascii.b2a_hex(os.urandom(128)).decode("ascii"))
  21c100e75c02af215e2bf523b0...0505ff951


Ops create secrets for each Node, and maintain for each cluster a file containing all secrets. The file is deployed on the Login Server and on each Service Node. The Login Server has all clusters files.
2) For each storage node, derive both the new and old node-specific secrets
and push them out, so that its config file looks like this:


Each file is a CSV file called '''/var/moz/shared_secrets/CLUSTER''', where CLUSTER is the name of the cluster,
    [hawkauth]
    secrets.backend = mozsvc.secrets.FixedSecrets
    secrets.secrets = <old-derived-node-secret-as-hex> <new-derived-node-secret-as-hex>


Example:
Restart it.  It is now able to accept tokens signed with either secret.


    phx1,secret
3) For each tokenserver webhead, update it with the new master secret, removing
    phx2,secret
the old one. Its config file will look like:
    ...


    [tokenserver]
    secrets.backend = mozsvc.secrets.DerivedSecrets
    secrets.master_secrets = <new-master-secret-as-hex>


=== Secret Update Process ===
Restart it.  It now generates tokens signed with the new derived secrets.


When an existing secret needs to be changed for whatever reason, the current secret becomes the ''old'' secret. This avoids existing tokens being rejected when the secret is changed.
4) Discard the old master secret.


The new secret is inserted to the Node's line on each file :
5) Wait for one token expiration period, e.g. five minutes.


    phx1,new secret,oldsecret
6) For each storage node, derive just the new node-specific secret and push
    phx2,secret
it out, so that its config file looks like this:
    ...


The Service Nodes are updated first, then the Login Server, so that new tokens are immediately recognized by the Nodes. In the interim, a Service Node falls back to the old secret when a token verification fails and there's an old secret in the file.
=== Pulling a secret ===


The Login Server only works with a single secret, so ignores the old secret when it creates tokens.
To invalidate a secret immediately, we add a new secret as described before but prune the old secrets right away, so that all outstanding tokens are rejected at once.
 
The old secret is pruned eventually. Updating the files should ping the app so we reload them


== Backward Compatibility ==
== Backward Compatibility ==
Line 259: Line 167:
# the server processes the request [Sync = I/O Bound]
# the server processes the request [Sync = I/O Bound]


= APIS v1.0 =
'''Unless stated otherwise, all APIs are using application/json for the requests and responses content types.'''
'''POST /1.0/request_token'''
Asks for new token given some credentials. By default, the authentication mechanism is Browser ID
but the '''X-Authentication-Protocol''' can be used to explicitly pick a protocol. If the server does not
support the authentication protocol provided, a 400 is returned.
 
When the authentication protocol requires something other than an Authorization header, the data is provided in
the request body.
Example for Browser-Id:
 
<pre>
POST /request_token
Host: token.services.mozilla.com
Content-Type: application/json
{'audience': XXX,
'assertion': XXX}
</pre>
This API returns several values in a json mapping:
* '''oauth_consumer_key''' - a signed authorization token, containing the user's id and expiration
* '''oauth_consumer_secret''' - a secret derived from the shared secret
* '''service_entry''' - a node url
* '''metadata''' - a signed and encrypted token, containing app-specific metadata - '''optional'''
Example:
<pre>
HTTP/1.1 200 OK
Content-Type: application/json
{'oauth_consumer_key': <token>,
'oauth_consumer_secret': <derived-secret>,
'service_entry': <node>,
'metadata': <metadata-token>,
}
</pre>


= Phase 1 =
= Phase 1 =
Line 323: Line 188:
* Operational support scripts (TBD)
* Operational support scripts (TBD)
* Logging and Metrics
* Logging and Metrics
= Implementation details =
* The Token Server web service is implemented using Cornice and Pyramid, and sends crypto work to a crypto service via zmq.
* The Crypto worker is a C++ program using cryptopp
http://ziade.org/token.png

Latest revision as of 03:32, 12 June 2014

Goals

tldr: having a centralized login service.

See: http://docs.services.mozilla.com/token/index.html#goal-of-the-service

APIS

see http://docs.services.mozilla.com/token/apis.html


Proposed Design

This solution proposes a token-based authentication system. A user who wants to connect to one of our services asks a central server for an access token.

The central server, a.k.a. the Login Server, verifies the user's identity with a supported authentication method and assigns the user a server to use with that token.

That server, a.k.a. the Service Node, checks the validity of the token included in each request. Tokens have a limited lifespan.


Definitions and assumptions

See http://docs.services.mozilla.com/token/index.html#assumptions

Flow

see http://docs.services.mozilla.com/token/user-flow.html

Authorization token

A token is a json encoded mapping. The keys of the Authorization Token are:

  • expires: an expiration timestamp (UTC); defaults to the current time + 30 minutes
  • uid: the app-specific user id (the user id integer in the case of sync)
  • salt: a randomly-generated salt for use in the calculation of the Token Secret (optional)
  • node: the name of the service node to which the user is assigned

Example:

 auth_token = {"uid": 123, "node": "https://sync-1.services.mozilla.com", "expires": 1324654308.907832, "salt": "sghfwq6875765..UYgs"}  


The token is signed using the Signing Secret and then base64-encoded. The signature is HMAC-SHA256:

 auth_token, signature = HMAC-SHA256(auth_token, sig_secret)
 auth_token = b64encode(auth_token, signature)

The authorization token is not encrypted
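
As a rough illustration of the signing step above, here is a minimal Python sketch. The function names, the sorted-key JSON serialization, and the layout (JSON body followed by the raw 32-byte signature, then URL-safe base64) are assumptions made for the example; the page does not specify the exact wire format.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def make_token(payload: dict, sig_secret: bytes) -> str:
    """Serialize the payload, sign it with HMAC-SHA256, and base64-encode it."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    signature = hmac.new(sig_secret, body, hashlib.sha256).digest()
    # Assumed layout: JSON body immediately followed by the raw signature.
    return base64.urlsafe_b64encode(body + signature).decode("ascii")


def parse_token(token: str, sig_secret: bytes,
                now: Optional[float] = None) -> dict:
    """Return the payload if the signature matches and the token is not expired."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    body, signature = raw[:-DIGEST_SIZE], raw[-DIGEST_SIZE:]
    expected = hmac.new(sig_secret, body, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    payload = json.loads(body.decode("utf-8"))
    if payload["expires"] <= (time.time() if now is None else now):
        raise ValueError("token expired")
    return payload
```

Because the token is only signed, anyone holding it can read the uid and node; the signature only guarantees that the Login Server issued it.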

Secrets

Each Service Node has a unique Master Secret that it shares with the Login Server, which is used to sign and validate authentication tokens. Multiple secrets can be active at any one time to support graceful rolling over to a new secret.

To simplify management of these secrets, the tokenserver maintains a single list of master secrets and derives a secret specific to each node using HKDF:

  • node-info = "services.mozilla.com/mozsvc/v1/node_secret/" + node-name
  • node-master-secret = HKDF(master-secret, salt=None, info=node-info, size=digest-length)

The node-specific Master Secret is used to derive keys for various cryptographic routines. At startup time, the Login Server and Node should pre-calculate and cache the signing key as follows:

  • sig-secret: HKDF(node-master-secret, salt=None, info="SIGNING", size=digest-length)

By using no salt (or a fixed salt), these secrets can be calculated once and then used for each request.

When issuing or checking an Auth Token, the corresponding Token Secret is calculated as:

  • token-secret: b64encode(HKDF(node-master-secret, salt=token-salt, info=auth-token, size=digest-length))

Note that the token-secret is base64-encoded for ease of transmission back to the client.
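
The derivation chain described in this section can be sketched in Python with only the standard library. This is an illustrative sketch, not the mozsvc implementation: the `hkdf` helper follows RFC 5869 with SHA-256, and the function names, the choice of SHA-256, and treating `salt=None` as a block of zero bytes (the RFC default) are assumptions.

```python
import base64
import hashlib
import hmac
from typing import Optional

HASH = hashlib.sha256
DIGEST_LENGTH = HASH().digest_size  # 32 bytes
NODE_INFO_PREFIX = "services.mozilla.com/mozsvc/v1/node_secret/"


def hkdf(ikm: bytes, salt: Optional[bytes], info: bytes, size: int) -> bytes:
    """HKDF (RFC 5869): extract a pseudorandom key, then expand to `size` bytes."""
    if not salt:
        salt = b"\x00" * DIGEST_LENGTH  # RFC 5869 default when no salt is given
    prk = hmac.new(salt, ikm, HASH).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < size:
        block = hmac.new(prk, block + info + bytes([counter]), HASH).digest()
        okm += block
        counter += 1
    return okm[:size]


def node_master_secret(master_secret: bytes, node_name: str) -> bytes:
    """Derive the per-node secret from the tokenserver's master secret."""
    info = (NODE_INFO_PREFIX + node_name).encode("utf-8")
    return hkdf(master_secret, None, info, DIGEST_LENGTH)


def signing_secret(node_secret: bytes) -> bytes:
    """Salt-free derivation, so it can be computed once and cached."""
    return hkdf(node_secret, None, b"SIGNING", DIGEST_LENGTH)


def token_secret(node_secret: bytes, token_salt: bytes, auth_token: bytes) -> str:
    """Per-token secret, base64-encoded for transmission to the client."""
    raw = hkdf(node_secret, token_salt, auth_token, DIGEST_LENGTH)
    return base64.b64encode(raw).decode("ascii")
```

Since the node name is part of the HKDF `info` string, two nodes never share a derived secret, which is what limits the damage if a single node is compromised.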


Configuring Secrets

The tokenserver should be configured to use the DerivedSecrets class with the list of master secrets:

   [tokenserver]
   secrets.backend = mozsvc.secrets.DerivedSecrets
   secrets.master_secrets = master-secret-one master-secret-two

A suitable master secret can be generated using mozsvc as follows:

   python -m mozsvc.secrets new

Each node should be configured to use the FixedSecrets class and its corresponding derived secret:

   [hawkauth]
   secrets.backend = mozsvc.secrets.FixedSecrets
   secrets.secrets = node-master-secret-one, node-master-secret-two

This prevents a compromise on one service node from leaking the secrets on all nodes. A suitable node-specific secret can be derived from the master secret as follows:

   python -m mozsvc.secrets derive <master_secret> https://<node_name>


Secret Update Process

To revoke the secrets for a specific node, simply rename it so that its derived secret will be different.

To update the master secrets, the following procedure should be used:

1) Generate the new master secret, but keep the old one as well for now

2) For each storage node, derive both the new and old node-specific secrets and push them out, so that its config file looks like this:

    [hawkauth]
    secrets.backend = mozsvc.secrets.FixedSecrets
    secrets.secrets = <old-derived-node-secret-as-hex> <new-derived-node-secret-as-hex>

Restart it. It is now able to accept tokens signed with either secret.

3) For each tokenserver webhead, update it with the new master secret, removing the old one. Its config file will look like:

    [tokenserver]
    secrets.backend = mozsvc.secrets.DerivedSecrets
    secrets.master_secrets = <new-master-secret-as-hex>

Restart it. It now generates tokens signed with the new derived secrets.

4) Discard the old master secret.

5) Wait for one token expiration period, e.g. five minutes.

6) For each storage node, derive just the new node-specific secret and push it out, so that its config file looks like this:
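
The listing for this last step is cut off in this revision. By analogy with step 2, the file would presumably contain only the new derived secret, followed by a restart of the node. The fragment below is an assumption based on that analogy, not text from the original page:

```ini
[hawkauth]
secrets.backend = mozsvc.secrets.FixedSecrets
secrets.secrets = <new-derived-node-secret-as-hex>
```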

Pulling a secret

To invalidate a secret immediately, we add a new secret as described before but prune the old secrets right away, so that all outstanding tokens are rejected at once.
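
The difference between a graceful rollover and pulling a secret comes down to which secrets a node still has configured when it checks a signature. A minimal sketch, assuming HMAC-SHA256 signatures and hypothetical helper names (`sign`, `find_valid_secret`) that are not part of mozsvc:

```python
import hashlib
import hmac
from typing import Iterable, Optional


def sign(message: bytes, secret: bytes) -> bytes:
    """HMAC-SHA256 signature of a message."""
    return hmac.new(secret, message, hashlib.sha256).digest()


def find_valid_secret(message: bytes, signature: bytes,
                      secrets: Iterable[bytes]) -> Optional[bytes]:
    """Return the first configured secret that validates the signature, if any."""
    for secret in secrets:
        if hmac.compare_digest(sign(message, secret), signature):
            return secret
    return None


old_secret, new_secret = b"old-derived-secret", b"new-derived-secret"
old_signature = sign(b"token-body", old_secret)

# Graceful rollover: both secrets are configured, so old tokens still verify.
rollover = [old_secret, new_secret]
# Pulled secret: only the new one remains, so old tokens are rejected at once.
pulled = [new_secret]
```

With `rollover`, a token signed under the old secret is still accepted until it expires; with `pulled`, the same token fails verification immediately.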

Backward Compatibility

The Login Server uses the same snode and LDAP servers, so both authentication systems can coexist during a transition period.

Infra/Scaling

On the Login Server

The flow is:

  1. the user asks for a token, with a browser id assertion
  2. the server verifies the assertion locally [CPU bound]
  3. the server calls the User DB [I/O Bound]
  4. the server calls the Node Assignment Server [I/O Bound] (optional)
  5. the server builds the token and sends it back [CPU bound]
  6. the user uses the node for the duration of the TTL (30 minutes)

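With a 30-minute TTL, each user requests two tokens per hour. So 100k users generate 200k requests per hour on the Login Server, roughly 50 RPS (200,000 / 3,600 ≈ 56). Scaling linearly, that is roughly 500 RPS for 1M users, 5,000 RPS for 10M users, and 50,000 RPS for 100M users.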
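
The load estimate can be reproduced with a few lines of Python. The sketch assumes every user is continuously active and renews exactly once per TTL, so it gives an average rather than a peak; the function name is illustrative.

```python
TOKEN_TTL_SECONDS = 30 * 60  # default token lifetime: 30 minutes


def login_rps(active_users: int, ttl_seconds: int = TOKEN_TTL_SECONDS) -> float:
    """Average token requests per second if every user renews once per TTL."""
    return active_users / ttl_seconds


for users in (100_000, 1_000_000, 10_000_000, 100_000_000):
    print(f"{users:>11,} users -> {login_rps(users):,.0f} RPS")
```

This gives about 56, 556, 5,556 and 55,556 requests per second, in line with the rounded figures of 50 to 50,000 RPS.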


Deployment

  • A Login Server is stateless, so we can deploy as many as we want and have Zeus load balance over them
  • A Login Server sees all secrets, so it can be cross-cluster / cross-datacenter
  • The shared secrets files can stay in memory -- updating the files should ping the app so we reload them
  • The User DB is the current LDAP, and may evolve into a more specialised metadata DB later

On each Service Node

Flow:

  1. the server checks the token [CPU Bound]
  2. the server processes the request [Sync = I/O Bound]


Phase 1

[End of January? Need to check with ally]

End to end prototype with low-level scaling

  • Fully defined API, including headers and errors
  • Assigns Nodes
  • Maintains Node state for a user (in the existing LDAP)
  • Issues valid tokens
  • Downs nodes if needed

Phase 2

[End of Q1?]

Scalable implementation of the above in place.

  • Migration
  • Operational support scripts (TBD)
  • Logging and Metrics


Implementation details

  • The Token Server web service is implemented using Cornice and Pyramid, and sends crypto work to a crypto service via zmq.
  • The Crypto worker is a C++ program using cryptopp


token.png