Identity/DevOps
Roadmap
Q1 2013 roadmap
In Q1 2013 the Identity DevOps team will be moving services out of the physical datacenter SCL2 and into Amazon Web Services.
2/4
- 1/28 - 2/4 : roadmap defined and signed off
- 1/28 - 2/4 : technology stack justification written and shared
2/11
- 2/4 - 2/11 : chef server built and working
- 2/4 - 2/11 : established 1 region VPC
2/18
- 2/4 - 2/18 : completed a mini provisioning test and plan
- 2/11 - 2/18 : written the webhead chef provisioning logic
- 2/11 - 2/18 : written the nginx chef provisioning logic and carried over existing nginx routing logic
- milestone : chef can fully provision webheads
2/25
- 2/18 - 2/25 : zeus routing logic is converted into nginx logic
- 2/18 - 2/25 : written the nagios chef provisioning logic
- 2/18 - 2/25 : basic webhead nagios checks created
- milestone : admin can see the monitored availability and performance of the webhead
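A basic webhead check of the kind listed above could look roughly like the sketch below. It relies on the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names, the thresholds, and the URL passed in are assumptions made for illustration, not the team's actual checks.

```python
import time
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(status_code, elapsed_s, warn_s=1.0, crit_s=3.0):
    """Map an HTTP status and response time to a Nagios (code, message) pair.

    The warn_s / crit_s thresholds are illustrative defaults, not agreed values.
    """
    perf = f"time={elapsed_s:.2f}s"  # Nagios performance data: label=value[unit]
    if status_code != 200:
        return CRITICAL, f"CRITICAL - HTTP {status_code} | {perf}"
    if elapsed_s >= crit_s:
        return CRITICAL, f"CRITICAL - {elapsed_s:.2f}s response | {perf}"
    if elapsed_s >= warn_s:
        return WARNING, f"WARNING - {elapsed_s:.2f}s response | {perf}"
    return OK, f"OK - {elapsed_s:.2f}s response | {perf}"


def check_webhead(url, timeout_s=5.0):
    """Fetch url once and classify the result; any fetch error is CRITICAL."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except Exception as exc:  # includes HTTP 4xx/5xx, timeouts, DNS failures
        return CRITICAL, f"CRITICAL - {exc}"
    return classify(status, time.monotonic() - start)
```

A wrapper script would call `check_webhead` with the webhead's URL, print the message, and exit with the returned code so that nagios can record both availability and response time.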
3/4
- 2/25 - 3/4 : ELB is set up and sending traffic to the webhead
- 2/18 - 2/25 : basic webhead nagios checks against the ELB created
- milestone : internet client can fetch persona main page from AWS traversing ELB
3/11
- 3/4 - 3/11 : written swebhead chef provisioning logic
- 3/4 - 3/11 : ELB configured for swebhead cluster
- 3/4 - 3/11 : written db chef provisioning logic
- 3/4 - 3/11 : ELB configured for db cluster
- 3/4 - 3/11 : written keysign chef provisioning logic
- 3/4 - 3/11 : ELB configured for keysign cluster
- 3/4 - 3/11 : established VPN to PHX1
- 3/4 - 3/11 : basic swebhead, db, and keysign nagios checks created
- milestone : internet client can log in using persona in AWS
3/18
- 3/11 - 3/18 : written bigtent and squid proxy chef provisioning logic
- 3/11 - 3/18 : ELB configured for bigtent and squid clusters
- 3/11 - 3/18 : basic bigtent and squid proxy nagios checks created
- milestone : internet client can log in with a yahoo address using yahoo bigtent
3/25
- 3/18 - 3/25 : load tested/validated that region 1 is ready for prod traffic
- 3/18 - 3/25 : full security group logic is in place replicating existing physical network
- milestone : security : network security is in place and all tiers use proxies for communication
- milestone : load testing complete for region 1
4/1
- 3/25 - 4/1 : moved master from PHX1 to region 1 AWS
- dynect is changed to balance between AWS region 1 and PHX1. SCL2 sits running as a backup
- milestone : all db writes are now going to AWS
- milestone : AWS region 1 is live in production, SCL2 no longer receives traffic
State at end of Q1 2013
- SCL2 is dark
- Production is running off of 1 AWS region and 1 physical datacenter
- runbooks for AWS deployments & core troubleshooting have been developed
- The staging environment has been moved to AWS
- key differences between production and staging AWS areas: server localization and access
- monitoring: existing monitoring, minus some cepmon rate-of-change monitors, has been moved into a new nagios deployment in AWS
- alerting: existing alerting, minus cepmon-triggered alerts, has been migrated
Q2 2013 roadmap
In Q2, DevOps will bring up the second AWS region and complete the remaining tasks needed to reach a truly highly available architecture, ready to graduate from beta.
4/8
- 3/25 - 4/8 : spun up region 2 AWS
- 3/25 - 4/8 : determined how to do log processing (logstash?) and pump data into zenoss
4/15
- 4/8 - 4/15 : load tested/validated that region 2 is ready for prod traffic
- 4/1 - 4/15 : written auto provisioning logic to call AWS and spin up instances, assign them roles, and pass them to chef for provisioning
- dynect is changed to balance between AWS region 1 and AWS region 2
- milestone : persona is fully hosted in AWS multi-region
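The auto provisioning flow described above (spin up instances, assign them roles, pass them to chef) can be sketched as follows. This is only an outline under stated assumptions: `launch_instance` and `bootstrap` are hypothetical callables standing in for the real AWS call and the real chef hand-off, and the role list is taken from the tiers named in the Q1 roadmap. The only chef-specific detail is the `role[...]` run-list syntax.

```python
from dataclasses import dataclass

# Tiers named in the Q1 roadmap; "squid" stands for the squid proxy tier.
ROLES = ("webhead", "swebhead", "db", "keysign", "bigtent", "squid")


@dataclass
class Instance:
    instance_id: str
    role: str
    run_list: list  # chef run list, e.g. ["role[webhead]"]


def provision(plan, launch_instance, bootstrap):
    """Provision every instance requested in plan.

    plan            -- dict mapping role name to the number of instances wanted
    launch_instance -- hypothetical callable(role) -> instance id (the AWS call)
    bootstrap       -- hypothetical callable(Instance) that hands off to chef
    """
    provisioned = []
    for role, count in plan.items():
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        for _ in range(count):
            instance_id = launch_instance(role)                    # spin up
            inst = Instance(instance_id, role, [f"role[{role}]"])  # assign role
            bootstrap(inst)                                        # pass to chef
            provisioned.append(inst)
    return provisioned
```

Keeping the AWS and chef steps behind injected callables lets the ordering and role validation be exercised with fakes before any real instances are launched.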
4/30
- Final day to turn down servers at SCL2
5/13
- Modify DB architecture to remove single point of failure (single write master)
- This is not re-evaluating our choice of persistence. It's just making our existing architecture truly fault-tolerant and highly available.
- Add more performance monitoring to enable later platform improvements
- There are many ways we could further scale. To make intelligent choices, we need to gather information about the performance and behavior of our servers.