ReleaseEngineering/How To/Work with Golden AMIs



Golden AMI

Background

Puppet used to be one of the bottlenecks for Releng infra, especially for EC2 instances.

Puppet requires DNS (A and PTR entries) to work properly. Assigning a static IP to a spot instance required us to pre-create network interfaces and specify them as part of the spot request. Even worse, to get an external IP assigned to a network interface we had to launch an instance and then terminate it, because the API offers no direct way to do this.

This approach didn't scale for us:

  • Adding new instances requires adding new network interfaces by creating and terminating temporary instances.
  • Once a network interface is created, it is bound to a subnet (and therefore to an Availability Zone). Because of this, slave names cannot be reused in different AZs with better spot prices.
  • Regular puppet checks add load on the puppet infra. Puppet errors may bring down the whole infra easily.

Puppetless/DNSless Concept

To avoid the issues stated above, we came up with the following architecture.

  • Regularly create “golden” AMIs:
    • Puppetize a “base” AMI
    • Disable puppet
    • Strip host specific files
    • Generate a “golden” AMI
    • Copy the “golden” AMI to other regions
  • Use the “golden” AMIs for spot instance requests
  • On boot, make sure the instance is running a fresh AMI; otherwise the instance terminates itself.

How it works

Base AMI

Base AMIs are created manually with the aws_create_ami.py script and published in the corresponding config files, e.g. configs/bld-linux64. These AMIs are used as the base image for “golden” AMIs.

Golden AMI

“Golden” AMIs are generated on aws-manager1 by multiple daily cronjobs; see e.g. modules/aws_manager/manifests/cron.pp.

Instances used to generate “golden” AMIs use DNS to make puppet work and rely on proper DNS entries. To reduce the possibility of IP collisions, they live in a different subnet and require --ignore-subnet-check as a parameter.

If the process gets stuck for some reason, you need to terminate the instances and kill the process.

The generated AMIs are published to https://s3.amazonaws.com/mozilla-releng-amis/amis.json by scripts/aws_publish_amis.py, scheduled by modules/aws_manager/manifests/cron.pp. Running spot instances use the published file to determine which AMIs are available and whether they need to terminate because their AMI is out of date.
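The freshness check a spot instance performs against the published file could be sketched roughly as follows. This is only an illustration: the JSON layout used here (a list of objects with "id", "config" and "created" keys), the function name is_ami_fresh, and the choice to treat an unlisted config as stale are all assumptions, not the real amis.json schema or the real check.

```python
import json


def is_ami_fresh(current_ami_id, config_name, published):
    """Return True if current_ami_id is the newest published AMI for config_name.

    `published` is the parsed content of the AMI list. The keys "id",
    "config" and "created" are hypothetical, chosen for this sketch only.
    """
    candidates = [ami for ami in published if ami["config"] == config_name]
    if not candidates:
        # Assumption: no published AMI for this config means "not fresh".
        return False
    # ISO 8601 timestamps in the same format sort correctly as strings.
    newest = max(candidates, key=lambda ami: ami["created"])
    return newest["id"] == current_ami_id


# Hypothetical sample data standing in for the downloaded file.
sample = json.loads("""[
  {"id": "ami-old", "config": "bld-linux64", "created": "2015-01-01T00:00:00Z"},
  {"id": "ami-new", "config": "bld-linux64", "created": "2015-01-02T00:00:00Z"}
]""")
```

An instance that gets False from such a check would then terminate itself, as described in the concept above.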

Once a day, scripts/delete_old_spot_amis.py, scheduled at modules/aws_manager/manifests/cron.pp#l71, deletes old AMIs, keeping only the 10 most recent.
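The retention rule can be illustrated with a small sketch. It is not the code of delete_old_spot_amis.py: the function name select_amis_to_delete and the (ami_id, created) tuple layout are hypothetical, and the sketch treats all AMIs as one group, whereas the real script may group them by config or region.

```python
def select_amis_to_delete(amis, keep=10):
    """Return the IDs of all AMIs except the `keep` most recent ones.

    `amis` is a list of (ami_id, created) tuples, where `created` is any
    sortable value such as a timestamp. Both names are hypothetical.
    """
    # Newest first, so everything past index `keep` is old enough to delete.
    ordered = sorted(amis, key=lambda ami: ami[1], reverse=True)
    return [ami_id for ami_id, _created in ordered[keep:]]
```

For example, with 12 AMIs the two oldest are selected for deletion, and with 10 or fewer nothing is selected.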

Troubleshooting

Something is wrong with the new AMIs

See https://wiki.mozilla.org/ReleaseEngineering/How_To/Manage_spot_AMIs for details on how to determine which AMIs are in use, how to delete them, and how to terminate instances based on the broken AMIs.

How to determine the IP of a spot instance which is not in DNS

See https://wiki.mozilla.org/ReleaseEngineering/How_To/Manage_AWS_slaves. Running the script with the “status” sub-command shows the IPs used by the instances.

Something is happening to the spot instances

  • Search for the slave name in the Instances section of the AWS console. “State transition reason” describes the most recent state change of the instance and may give a clue as to why it was terminated.
  • Search for the slave name in the Spot Requests section of the AWS console. There may be multiple entries; choose the one that matches your time frame. “Status” and “Status message” contain some information about why the instance was terminated.
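The same “State transition reason” field is also available through the EC2 API, which can help when many instances need checking. The sketch below only parses a response dictionary shaped like the output of the EC2 DescribeInstances call (for example as returned by boto3's describe_instances); using boto3 at all is an assumption here, and the helper name termination_reasons is hypothetical.

```python
def termination_reasons(response):
    """Collect (instance_id, state, reason) tuples from a
    DescribeInstances-style response dictionary."""
    rows = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            rows.append((
                instance["InstanceId"],
                instance["State"]["Name"],
                # The field may be absent or empty for running instances.
                instance.get("StateTransitionReason", ""),
            ))
    return rows


# Hypothetical response fragment, for illustration only.
example_response = {
    "Reservations": [{
        "Instances": [{
            "InstanceId": "i-0123456789abcdef0",
            "State": {"Name": "terminated"},
            "StateTransitionReason": "Server.SpotInstanceTermination: Spot instance termination",
        }]
    }]
}
```

Calling termination_reasons(example_response) yields one tuple per instance, which can then be filtered by state or time frame.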

How to force AMI generation

There are cronjobs on aws-manager1 responsible for this. Running the underlying scripts regenerates the “golden” AMIs and copies them to other regions.

The cronjobs can be listed by running

ls /etc/cron.d/*golden*

The underlying wrapper scripts live in

/builds/aws_manager/bin/*golden*

To regenerate AMIs, run those scripts as the “buildduty” user. Consider using screen/tmux, because generating some AMIs may take up to 2 hours.