User:Ckoehler/aws deployment


Migrating MozillaWiki to AWS

Application UAT

Load main page and verify it has most recent content. Main_Page
Create an account (sign-up, receive email, verify email via link, wait for admin confirmation, change password). Special:RequestAccount
View and approve account requests: Special:ConfirmAccounts
Login to existing user account. index.php?title=Special:UserLogin&returnto=Main+Page
Reset password: Special:PasswordReset
Create a wiki page in main namespace (requires account with edit permissions).
Edit page in main namespace other than Main Page.
Verify edits are reflected in RecentChanges: Special:RecentChanges
Upload file. Special:Upload
Load page with GraphViz content: Category:GraphViz
Load page with Bugzilla extension content. (Any subpage here: MozillaWiki:Releases)
Verify static assets are available and loading: Special:ListFiles
View an svg file. Does png preview work? index.php?title=Special:ListFiles&limit=200&ilsearch=.svg&user=
Import an etherpad (verifies pandoc dependency). Special:ImportFromEtherpad
Load a page with Google Calendar widget: Widget:Google_Calendar
Load a page with Google Docs widget: Widget:Google_Document
Load a page with Google Spreadsheet widget: Widget:Google_Spreadsheet
Load a page with Google Presentation widget: Widget:Google_Presentation
Verify version of Mediawiki and extensions match expected values: Special:Version
Verify user group rights have expected values: Special:ListGroupRights
Add a page to your watchlist by starring it (verifies javascript is loading and api is working).
Load Special:Random a few times and verify different pages are loaded each time.
Verify pages are being loaded from cache. (Zeus cache... look for X-Cache headers)
Verify page cache is purged when requested. (Does a hard refresh work as expected)
Verify write api is available and responsive.
Verify read api is available and responsive.
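
The last few checks (cache headers, read API availability) can be scripted. A minimal sketch in Python, assuming a placeholder base URL; point it at the AWS stage host being tested:

```python
# Sketch of automating the cache-header and read-API UAT checks above.
# The base URL is an assumption, not the real stage host.
import json
import urllib.request

BASE = "https://wiki.mozilla.org"  # placeholder; use the stage host

def is_cache_hit(headers):
    """True if the X-Cache header reports a hit (Zeus/Varnish style)."""
    return "HIT" in headers.get("X-Cache", "").upper()

def read_api_url(base):
    """Read-API smoke test URL: siteinfo via the standard MediaWiki action API."""
    return base + "/api.php?action=query&meta=siteinfo&format=json"

def check(base=BASE):
    """Fetch siteinfo and report (api_ok, served_from_cache)."""
    resp = urllib.request.urlopen(read_api_url(base), timeout=10)
    data = json.loads(resp.read())
    api_ok = "general" in data.get("query", {})
    return api_ok, is_cache_hit(dict(resp.headers))
```

A similar request against `api.php?action=edit` (with a token) would cover the write-API check.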


Notes on documentation:

  • Documentation should be consolidated so that there is one single guide to getting set up.
  • As it is now, most parts are documented (though not all) but across multiple docs and sources.
  • Required information is not readily available.

Concerns with new deployment methodology:

  • Introduces a new toolchain (nubis-builder, aws cli, etc.) as well as extra work, complexity, and possible points of failure for the developer and app owner.
  • nubis-builder has several dependencies that are not clearly documented and take a lot of effort to install. This could be improved by scripting.
  • Building AMIs, and therefore deployments, is tied to specific versions of nubis-builder, introducing another dependency layer. Releases of nubis-builder will have to be coordinated with releases of our hosted app.
  • Building AMIs from scratch takes a long and variable amount of time, and requires a continuous internet connection.
  • Setting up nubis-builder and aws cli requires credentials and other information that is not available in a self-service way. (E.g. there are manual steps not currently documented and you must ask IT staff for certain details.)
  • Inability to run scripts via command line means we can't fully manage our apps.


Generally, we're on board with the architectural decisions made. There will be greater complexity to our deployments, but also more features available to us. To make best use of this new model, we request a few changes:

  • The ability to continuously deploy app code to stage and production without having to rebuild and deploy a new/updated AMI. (System and service-level changes would require re-build and deployment of new AMI.)
      • deploying app code needs to be less complex and less resource-intensive
  • Ability to run tasks locally in stage and production that modify the datastore.
  • Decouple code that governs AMI deployment from app code (we can help with this).

Additionally, there are essential parts of the new AWS deployment that I haven't been able to test or demo yet:

  • A complete deployment from scratch, including datastore.
  • A fully working mozillawiki instance, including datastore.
  • Data backup and recovery, which includes both database and static files.
  • Deployment of updated app code.
  • Deployment of updated system code (eg installing new packages, changing system-level config).
  • Jenkins CI integration.



continuous deployment service
cannot be viewed as an application that is released
not a static machine image

A requirement is to be able to update directly from git without having to re-build and deploy an ami. Must be able to run scripts locally that modify datastore.

Okay to rebuild and deploy an ami to make system and service-level changes. ("Non-traditional" dependencies)
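
The git-based update path could look something like this hypothetical sketch; the install path is an assumption, and `maintenance/update.php` is MediaWiki's standard schema-update script (an example of a datastore-modifying task that must run locally on the instance):

```python
# Hypothetical sketch of the requested git-based deploy path: update app
# code in place and run a datastore-modifying maintenance task, without
# rebuilding or redeploying an AMI. The install path is an assumption.
import subprocess

WIKI_ROOT = "/var/www/wiki"  # assumed install path

def deploy_commands(ref="origin/production", root=WIKI_ROOT):
    """The commands a deploy would run, kept inspectable for testing."""
    return [
        ["git", "-C", root, "fetch", "--all"],
        ["git", "-C", root, "checkout", ref],
        # Datastore-modifying step that must run locally on the instance:
        ["php", root + "/maintenance/update.php", "--quick"],
    ]

def deploy_from_git(ref="origin/production"):
    for cmd in deploy_commands(ref):
        subprocess.run(cmd, check=True)
```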

UAT Requirements

  • simulate data loss and restore from backup

agreements:

  • logic that governs ami creation needs to be separate from wiki app code
  • we need shell access. (!!!) happy to take responsibility but we need access. >:()
  • need continuous deployment of wiki code without having to rebuild ami
  • okay to rebuild + deploy new ami for system changes
  • datastore - what is in it, how is it backed up and restored (and deployed?)

application and webserver are decoupled in wiki deployment (any app written in PHP or other similar server-side scripting languages)

another issue: database and pii (the 3rd, and MOST IMPORTANT, gear in the machine)

===Begin Meeting===
14 April 2015

  • Migration window (Christie)
      • Sunday morning
      • Saturday morning
      • Late weeknight evenings (8pm-midnight)
      • Christie prefers weekend mornings
      • Target to be in Production - but not cut over - around the end of the month
  • Marian and Richard to edit migration plan and send to ckoehler for review
  • Need a load testing plan
      • JD and ckoehler to track down tool that Metrics was using.
      • Don't currently have a good test suite for mediawiki
      • Prefer something that integrates with jenkins
      • Wikimedia Foundation has tools that are set up to work with CI
  • Christie wants to see a round trip deployment
      • Will look very different in AWS from how it looks today
  • Wikimo is supported by the wiki working group
      • self motivated, cross organizational team, including community members
      • We will not provide access to AWS for non-employees as part of this effort
  • Marian will schedule next meeting

===Begin Meeting===
nn March 2015

  • Migration window (Christie)
      • Sunday morning
      • Saturday morning
      • Late weeknight evenings (8pm-midnight)
      • Christie prefers weekend mornings
      • She will be away 3/21-22
  • Monitoring
      • New Relic? (Richard)
        • Not part of Q1
        • Could be added later, if requirements dictate
      • Requirements (Christie and JD)
  • File System (JD and Corey)

===End Meeting===

===Begin Meeting===
10 March 2015

  • Migration Planning
      • Put MediaWiki in read-only mode
      • Copy DB and file contents
      • MySQL dump, copy over, bring up: Under 2 hours (JD to test)
      • Timing: Christie to give us a few windows
  • S3 backups (if Christie has had time to look at plugin)
      • Not an option for MediaWiki because it assumes immediate consistency
      • Assume nightly snapshots
  • Application monitoring
      • Richard to check with Shyam to see if NewRelic is part of current plan
      • Christie and JD to bring application monitoring requirements
  • I'd also like to discuss the subnet architecture. I think JD was looking into this.
      • Security groups for
          • Web Server
          • RDS
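
The dump-and-restore timing test above (JD to run) could be sketched roughly as follows; host names, paths, and the database name are placeholders, not real values:

```python
# Rough sketch of the migration timing test: read-only mode, mysqldump,
# copy, import. All host names, paths, and DB names are placeholders.
import subprocess

def run(cmd):
    """Run a shell pipeline, raising CalledProcessError on failure."""
    subprocess.run(cmd, shell=True, check=True)

def migrate(db="wikidb", dump="/tmp/wikidb.sql.gz", target="aws-app-host"):
    # 0. First put MediaWiki in read-only mode, e.g. by setting
    #    $wgReadOnly = 'Migrating to AWS'; in LocalSettings.php (manual step).
    # 1. Dump the database; --single-transaction keeps InnoDB tables
    #    consistent without locking out readers.
    run(f"mysqldump --single-transaction {db} | gzip > {dump}")
    # 2. Copy the dump to the new environment.
    run(f"scp {dump} app@{target}:/tmp/")
    # 3. Import on the AWS side, then bring the new wiki up.
    run(f"ssh app@{target} 'gunzip -c /tmp/wikidb.sql.gz | mysql {db}'")
```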

===End Meeting===
===Begin Meeting===
5 March 2015

  • monitoring & logging
      • DataDog
        • Basic AWS monitoring will be covered by Shyam
        • Server load spikes and service availability are most critical
        • Auto scaling should not mask DDOS attacks.
      • Use fluentd to put application and system log data into an S3 bucket
        • Will handle both realtime and forensic analysis
        • will be enhanced in Q2
      • Shyam: Is NewRelic part of the solution? If not, what will fill that level of application monitoring?
      • Christie and JD to come up with application monitoring requirements.
  • Database Backups
      • Database stores user accts, preferences, page content (everything but static content, like images)
      • 2-3 days of continuous backups
      • Nightly snapshots kept for 30 days (provisionally - Christie to consider)
      • DB is ~7.5GB
  • S3 backups
      • Christie to look into MediaWiki S3 plug-in capabilities before we decide
  • Migration Planning
      • Put MediaWiki in read-only mode
      • Copy DB and content to S3
      • MySQL dump, copy over, bring up: Under 2 hours (JD to test)
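
The retention policy discussed above (nightly snapshots kept for 30 days, provisionally) can be sketched as a small pruning helper; the snapshot bookkeeping here is a made-up illustration, not the real scheme:

```python
# Sketch of the proposed snapshot retention policy: keep nightly
# snapshots for 30 days, delete anything older. Illustrative only.
from datetime import date, timedelta

RETENTION_DAYS = 30

def snapshots_to_delete(snapshot_dates, today):
    """Given dates of existing nightly snapshots, return those that
    fall outside the retention window, oldest first."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return sorted(d for d in snapshot_dates if d < cutoff)

# Example: with 40 nightly snapshots, the oldest 9 fall outside the window.
today = date(2015, 3, 10)
dates = [today - timedelta(days=i) for i in range(40)]
stale = snapshots_to_delete(dates, today)
```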

===End Meeting===
===Begin Meeting===
4 March 2015

== Notes ==
What are the service level / availability requirements?

  • Gets substantial traffic and is used by staff and contributors alike
  • What kind of outage can you tolerate?
  • Users take meeting notes and use the wiki extensively; cannot take downtime during business hours
  • Recovery time of a few hours is reasonable in a disaster scenario
      • An entire day is not reasonable
  • How much data loss is acceptable
      • ideally minutes or hours, not an entire day
  • Application Components
      • MySql
      • Apache + PHP (web head)
      • S3 for content (file store)
        • Bugzilla extension
        • Smarty Extension
      • ElastiCache (Memcache)
      • Caching (Varnish and/or CDN)
      • Route53

  • All data needs to be replicated across regions
  • Password reset and cli tools node

Future service needs:
elastic search
parsoid (nodejs)
multiple domains (for mobile front end)

Action Items


  • test / vet s3 extension for use with currently installed extensions
  • double check mw will work with elasticache
  • will amazon cdn service work in place of front-end caching with something like varnish?
  • christie: review how mediawiki handles file versioning including deletions

Constraints / Requirements for moving forward:


Identify service dependencies (e.g. MySQL)
Identify external dependencies (e.g. LDAP)
Confirm business owner and cost center
Identify service level requirements (availability, recoverability, etc.)
Plan migration (downtime vs. RO, scheduling, communication)

== Next Meeting Items ==
monitoring
logging
Database Backups
Migration Planning

Issues with using s3:

s3 is eventually consistent
apps need to be designed with this in mind
mediawiki is designed for immediate consistency, and the extensions that provide s3 functionality don't account for eventual consistency
approaches that mount s3 as a filesystem are not designed to work with multiple webheads
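
The read-after-write problem above can be illustrated with a toy mock; this is not the S3 API, just a model of a store where a write may not be visible to the very next read:

```python
# Toy illustration of why eventual consistency breaks MediaWiki's
# immediate-consistency assumption: after an update, reads can return
# the old value for a while. Mock store, not the real S3 API.
class EventuallyConsistentStore:
    """Object store where updates take a few reads to propagate."""
    def __init__(self, lag_reads=2):
        self.current = {}
        self.stale = {}
        self.pending = {}          # key -> stale reads left before convergence
        self.lag_reads = lag_reads

    def put(self, key, value):
        if key in self.current:
            self.stale[key] = self.current[key]
        self.current[key] = value
        self.pending[key] = self.lag_reads

    def get(self, key):
        if self.pending.get(key, 0) > 0 and key in self.stale:
            self.pending[key] -= 1
            return self.stale[key]  # stale read: propagation not finished
        self.pending[key] = 0
        return self.current.get(key)

# MediaWiki-style flow: upload a new file version, then immediately render it.
store = EventuallyConsistentStore()
store.put("File:Logo.png", "v1")
store.get("File:Logo.png")                 # first version: no old copy, fine
store.put("File:Logo.png", "v2")
first_read = store.get("File:Logo.png")    # stale: returns "v1" here
```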

softnas
gluster
ceph