Identity/DevOps

Revision as of 18:36, 5 February 2013 by Gene wood

Roadmap

Q1 2013 roadmap

In Q1 2013 the Identity DevOps team will be moving services out of the physical datacenter SCL2 and into Amazon Web Services.

2/4

  • 1/28 - 2/4 : roadmap defined and signed off
  • 1/28 - 2/4 : technology stack justification written and shared

2/11

  • 2/4 - 2/11 : chef server built and working
  • 2/4 - 2/11 : established 1 region VPC

2/18

  • 2/4 - 2/18 : completed a mini provisioning test and plan
  • 2/11 - 2/18 : written the webhead chef provisioning logic
  • 2/11 - 2/18 : written the nginx chef provisioning logic and carried over existing nginx routing logic
  • milestone : chef can fully provision webheads

2/25

  • 2/18 - 2/25 : zeus routing logic is converted into nginx logic
  • 2/18 - 2/25 : written the nagios chef provisioning logic
  • 2/18 - 2/25 : basic webhead nagios checks created
  • milestone : admin can see the monitored availability and performance of the webhead

3/4

  • 2/25 - 3/4 : ELB is set up and sending traffic to the webhead
  • 2/18 - 2/25 : basic webhead nagios checks against the ELB created
  • milestone : internet client can fetch persona main page from AWS traversing ELB

3/11

  • 3/4 - 3/11 : written swebhead chef provisioning logic
  • 3/4 - 3/11 : ELB configured for swebhead cluster
  • 3/4 - 3/11 : written db chef provisioning logic
  • 3/4 - 3/11 : ELB configured for db cluster
  • 3/4 - 3/11 : written keysign chef provisioning logic
  • 3/4 - 3/11 : ELB configured for keysign cluster
  • 3/4 - 3/11 : established VPN to PHX1
  • 3/4 - 3/11 : basic swebhead, db, and keysign nagios checks created
  • milestone : internet client can login using persona in AWS

3/18

  • 3/11 - 3/18 : written bigtent and squid proxy chef provisioning logic
  • 3/11 - 3/18 : ELB configured for bigtent and squid clusters
  • 3/11 - 3/18 : basic bigtent and squid proxy nagios checks created
  • milestone : internet client can login with a yahoo address using yahoo bigtent

3/25

  • 3/18 - 3/25 : load tested/validated that region 1 is ready for prod traffic
  • 3/18 - 3/25 : full security group logic is in place replicating existing physical network
  • milestone : security : network security is in place and all tiers use proxies for communication
  • milestone : load testing complete for region 1

4/1

  • 3/25 - 4/1 : moved master from PHX1 to region 1 AWS
  • dynect is changed to balance between AWS region 1 and PHX1. SCL2 sits running as a backup
  • milestone : all db writes are now going to AWS
  • milestone : AWS region 1 is live in production, SCL2 no longer receives traffic

State at end of Q1 2013

  • SCL2 is dark
  • Production is running off of 1 AWS region and 1 physical datacenter
  • runbooks for AWS deployments & core troubleshooting have been developed
  • The staging environment has been moved to AWS
    • key differences between production and staging AWS areas: server localization & access
  • monitoring: existing monitoring, minus some cepmon rate-of-change monitors, has been moved into a new nagios deployment in AWS
  • alerting: existing alerting, minus cepmon-triggered alerts, has been migrated

Q2 2013 roadmap

In Q2, DevOps will bring up the second AWS region and complete the remaining tasks needed to reach a truly highly available architecture, ready to graduate from beta.

4/8

  • 3/25 - 4/8 : spun up region 2 AWS
  • 3/25 - 4/8 : determined how to do log processing (logstash?) and pump data into zenoss

4/15

  • 4/8 - 4/15 : load tested/validated that region 2 is ready for prod traffic
  • 4/1 - 4/15 : written auto provisioning logic to call AWS and spin up instances, assign them roles, and pass them to chef for provisioning
  • dynect is changed to balance between AWS region 1 and AWS region 2
  • milestone : persona is fully hosted in AWS multi-region
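
The auto provisioning item above describes three steps: spin up an instance, assign it a role, and pass it to chef. A minimal sketch of that flow is shown here, not the team's actual implementation. All names (`Instance`, `provision_cluster`, `make_fake_launcher`) are hypothetical, and the AWS call and chef hand-off are injected as plain callables so the sketch runs without any cloud access. A real version would presumably replace them with calls to the AWS API and chef's bootstrap tooling.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Instance:
    """Hypothetical record of one launched server."""
    instance_id: str
    role: str
    region: str


def provision_cluster(
    roles: Dict[str, int],
    region: str,
    launch: Callable[[str], str],
    bootstrap: Callable[[Instance], None],
) -> List[Instance]:
    """Launch the requested number of instances per role and hand each to chef.

    `launch` and `bootstrap` are assumptions standing in for the real
    AWS and chef integrations.
    """
    provisioned = []
    for role, count in sorted(roles.items()):
        for _ in range(count):
            instance_id = launch(region)                # step 1: spin up an instance
            inst = Instance(instance_id, role, region)  # step 2: assign it a role
            bootstrap(inst)                             # step 3: pass it to chef
            provisioned.append(inst)
    return provisioned


def make_fake_launcher() -> Callable[[str], str]:
    """Return a stand-in launcher that hands out sequential fake instance IDs."""
    counter = {"n": 0}

    def launch(region: str) -> str:
        counter["n"] += 1
        return f"i-{region}-{counter['n']:04d}"

    return launch
```

Calling `provision_cluster({"webhead": 2, "db": 1}, "region2", make_fake_launcher(), print)` would walk through all three steps for each instance, printing each `Instance` in place of a real chef run. Keeping the launcher and the chef hand-off as parameters is one way to let the same logic be exercised in a test before it touches a live region.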

4/30

  • Final day to turn down servers at SCL2

5/13

  • Modify DB architecture to remove single point of failure (single write master)
    • This is not re-evaluating our choice of persistence. It's just making our existing architecture truly fault-tolerant and highly available.
  • Add more performance monitoring to enable later platform improvements
    • There are many ways we could further scale. To make intelligent choices, we need to gather information about the performance and behavior of our servers.

Beyond

Additional Operational Improvements