Identity/DevOps

From MozillaWiki
Revision as of 18:36, 5 February 2013 by Gene wood

Roadmap

Q1 2013 roadmap

In Q1 2013 the Identity DevOps team will be moving services out of the physical datacenter SCL2 and into Amazon Web Services.

2/4

  • 1/28 - 2/4 : roadmap defined and signed off
  • 1/28 - 2/4 : technology stack justification written and shared

2/11

  • 2/4 - 2/11 : chef server built and working
  • 2/4 - 2/11 : established 1 region VPC

2/18

  • 2/4 - 2/18 : completed a mini provisioning test and plan
  • 2/11 - 2/18 : written the webhead chef provisioning logic
  • 2/11 - 2/18 : written the nginx chef provisioning logic and carried over existing nginx routing logic
  • milestone : chef can fully provision webheads

2/25

  • 2/18 - 2/25 : zeus routing logic is converted into nginx logic
  • 2/18 - 2/25 : written the nagios chef provisioning logic
  • 2/18 - 2/25 : basic webhead nagios checks created
  • milestone : admin can see the monitored availability and performance of the webhead
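A basic webhead check of the kind listed above could be sketched as follows. This is a minimal illustration, not the team's actual check: the function names, the default thresholds, and the choice of a plain HTTP probe are assumptions. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the `label=value` performance data after the `|` follow the standard Nagios plugin conventions.

```python
import time
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(http_status, elapsed_s, warn_s=1.0, crit_s=3.0):
    """Map an HTTP status and response time to (exit_code, message).

    warn_s and crit_s are illustrative thresholds, not values from the roadmap.
    """
    if http_status is None:
        return CRITICAL, "CRITICAL - no response"
    perf = f"|time={elapsed_s:.2f}s"
    if http_status >= 400:
        return CRITICAL, f"CRITICAL - HTTP {http_status}{perf}"
    if elapsed_s >= crit_s:
        return CRITICAL, f"CRITICAL - slow response ({elapsed_s:.2f}s){perf}"
    if elapsed_s >= warn_s:
        return WARNING, f"WARNING - slow response ({elapsed_s:.2f}s){perf}"
    return OK, f"OK - HTTP {http_status} in {elapsed_s:.2f}s{perf}"


def probe(url, timeout_s=10.0):
    """Fetch url once and return (http_status or None, elapsed seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        status = err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        status = None
    return status, time.monotonic() - start


def main(argv):
    """Entry point: argv[1] is the URL to check. Returns the exit code."""
    if len(argv) != 2:
        print("UNKNOWN - usage: check_webhead.py URL")
        return UNKNOWN
    code, message = classify(*probe(argv[1]))
    print(message)
    return code
```

To use this as a plugin, a wrapper script would call `sys.exit(main(sys.argv))`. The same check pointed at the ELB hostname instead of a single webhead would cover the later "checks against the ELB" item.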

3/4

  • 2/25 - 3/4 : ELB is set up and sending traffic to the webhead
  • 2/18 - 2/25 : basic webhead nagios checks against the ELB created
  • milestone : internet client can fetch persona main page from AWS traversing ELB

3/11

  • 3/4 - 3/11 : written swebhead chef provisioning logic
  • 3/4 - 3/11 : ELB configured for swebhead cluster
  • 3/4 - 3/11 : written db chef provisioning logic
  • 3/4 - 3/11 : ELB configured for db cluster
  • 3/4 - 3/11 : written keysign chef provisioning logic
  • 3/4 - 3/11 : ELB configured for keysign cluster
  • 3/4 - 3/11 : established VPN to PHX1
  • 3/4 - 3/11 : basic swebhead, db, and keysign nagios checks created
  • milestone : internet client can log in using persona in AWS

3/18

  • 3/11 - 3/18 : written bigtent and squid proxy chef provisioning logic
  • 3/11 - 3/18 : ELB configured for bigtent and squid clusters
  • 3/11 - 3/18 : basic bigtent and squid proxy nagios checks created
  • milestone : internet client can log in with a yahoo address using yahoo bigtent

3/25

  • 3/18 - 3/25 : load tested/validated that region 1 is ready for prod traffic
  • 3/18 - 3/25 : full security group logic is in place replicating existing physical network
  • milestone : security : network security is in place and all tiers use proxies for communication
  • milestone : load testing complete for region 1

4/1

  • 3/25 - 4/1 : moved the db write master from PHX1 to region 1 AWS
  • dynect is changed to balance between AWS region 1 and PHX1. SCL2 keeps running as a backup
  • milestone : all db writes are now going to AWS
  • milestone : AWS region 1 is live in production, SCL2 no longer receives traffic

State at end of Q1 2013

  • SCL2 is dark
  • Production is running off of 1 AWS region and 1 physical datacenter
  • runbooks for AWS deployments & core troubleshooting have been developed
  • The staging environment has been moved to AWS
    • key differences between production and staging AWS areas: server localization & access
  • monitoring: existing monitoring, minus some cepmon rate-of-change monitors, has been moved into a new nagios deployment in AWS
  • alerting: existing alerting, minus cepmon-triggered alerts, has been migrated

Q2 2013 roadmap

In Q2 2013 the Identity DevOps team will bring up the second AWS region and complete the remaining tasks needed to reach a truly highly available architecture, ready to graduate from beta.

4/8

  • 3/25 - 4/8 : spun up region 2 AWS
  • 3/25 - 4/8 : determined how to do log processing (logstash?) and pump data into zenoss

4/15

  • 4/8 - 4/15 : load tested/validated that region 2 is ready for prod traffic
  • 4/1 - 4/15 : written auto provisioning logic to call AWS and spin up instances, assign them roles, and pass them to chef for provisioning
  • dynect is changed to balance between AWS region 1 and AWS region 2
  • milestone : persona is fully hosted in AWS multi-region
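The auto provisioning item above (spin up instances, assign them roles, pass them to chef) could be sketched roughly as below. This is an assumption-laden outline rather than the planned implementation: `InstanceSpec`, `plan_instances`, `provision`, the naming scheme, and the `launch` and `bootstrap` callbacks are all hypothetical, and no real AWS or chef API is called. The only borrowed convention is chef's `role[name]` run list syntax.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class InstanceSpec:
    """One instance to create (hypothetical structure for illustration)."""
    name: str
    role: str
    region: str
    run_list: Tuple[str, ...]


def plan_instances(role_counts: Dict[str, int], region: str) -> List[InstanceSpec]:
    """Expand {role: count} into one spec per instance, each tagged with its role."""
    specs = []
    for role in sorted(role_counts):
        for index in range(1, role_counts[role] + 1):
            specs.append(
                InstanceSpec(
                    name=f"{role}{index}.{region}",  # naming scheme is an assumption
                    role=role,
                    region=region,
                    run_list=(f"role[{role}]",),  # chef run list entry for the role
                )
            )
    return specs


def provision(
    specs: List[InstanceSpec],
    launch: Callable[[InstanceSpec], str],
    bootstrap: Callable[[str, InstanceSpec], None],
) -> Dict[str, str]:
    """Launch each spec, then hand the new instance to chef.

    launch(spec) stands in for the AWS call and returns an instance id;
    bootstrap(instance_id, spec) stands in for the chef hand-off.
    """
    launched = {}
    for spec in specs:
        instance_id = launch(spec)
        bootstrap(instance_id, spec)
        launched[spec.name] = instance_id
    return launched
```

Keeping the plan separate from the side effects means the role assignment can be reviewed or tested with fake callbacks before any instance is actually started.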

4/30

  • Final day to turn down servers at SCL2

5/13

  • Modify DB architecture to remove single point of failure (single write master)
    • This is not re-evaluating our choice of persistence. It's just making our existing architecture truly fault-tolerant and highly available.
  • Add more performance monitoring to enable later platform improvements
    • There are many ways we could further scale. To make intelligent choices, we need to gather information about the performance and behavior of our servers.

Beyond

Additional Operational Improvements