Balrog

From MozillaWiki
Revision as of 22:12, 8 September 2016 by NThomas (talk | contribs) (Rework hacking section)

Overview

Balrog is the software that runs the server-side component of the update system used by Firefox and other Mozilla products. It is the successor to AUS (Application Update Service), which neither scaled to our current needs nor allowed us to adapt to more recent business requirements. Balrog helps us ship updates faster and with much more flexibility than we've had in the past.

Bugs and feature requests should be filed in the Backend or Frontend components of Bugzilla. You can find the Balrog developers in #balrog.

Domains

The Balrog admin interface is accessible at aus4-admin.mozilla.org (VPN required).

The public interface that Firefox and other applications talk to is at aus5.mozilla.org. Many older update domains are also served by Balrog, including aus4.mozilla.org, aus3.mozilla.org, and aus2.mozilla.org. More details on these can be found on the Client Domains page.

Database Model

Balrog's model centres around two concepts: Rules and Releases. When an update request from an application is received, it is matched against the rules. The rule that is selected contains a pointer to a Release, which holds all of the metadata needed to construct a proper update response. Rules and Releases are described in greater detail below:

Rules

The most important part of Balrog to understand is its rules. When a request comes in, it is matched against each of Balrog's rules to find the one that best suits it (more on this below). Once found, Balrog looks at that rule's "mapping", which points at a release that has the required information to serve an update back to the client. Without any rules, Balrog will never serve an update. With badly configured rules, Balrog could do bad things like serve Firefox updates to B2G devices.

In addition to the information found in the request, each rule also has a priority, which allows us to override updates for specific things while letting the rest "fall down" to a more general rule. For example, we block unsupported OS versions with a rule of the highest priority but continue to serve updates to users of other OSes with a rule of slightly lower priority. The ability to override and fall back by setting the priority is one of the reasons Balrog is so flexible, and a key way that it distinguishes itself from its predecessor.

What's in a rule?

Each rule has quite a lot of columns, but they all fall into one of the buckets below:

  • Matchable - these correspond to information provided in the update request, and are used to filter out rules that don't apply to the request.
  • Decision - these are also used to filter rules, but do not correspond to information in the request.
  • Response - these contain information that ends up in the response.
  • Info - informational columns, not used as part of serving updates

Individual columns are detailed in the table below:

Attribute | Category | Description | Matching logic | Examples
Product | Matchable | The name of the application requesting an update. | Exact string match only | "Firefox" or "B2G"
Version | Matchable | The version of the application requesting an update. | Exact string match, or an operator plus a version to compare the incoming one against | "36.0" or ">=38.0a1"
Channel | Matchable | The update channel of the application requesting an update. | Exact string match, or a string with a "*" character to glob | "nightly" or "beta*"
buildTarget | Matchable | The "build target" of the application requesting an update. This is usually related to the target platform the app was built for. | Exact string match only | "Darwin_x86_64-gcc3-u-i386-x86_64" or "flame-kk-userdebug"
buildID | Matchable | The build ID of the application requesting an update. | Exact string match, or an operator plus a build ID to compare the incoming one against | "201410010830" or "<201512010830"
Locale | Matchable | The locale of the application requesting an update. | Exact string match, or a comma separated list of locales to do an exact match on | "de" or "en-US,en-GB,id"
osVersion | Matchable | The OS version of the application requesting an update. This field is primarily used to point desupported operating systems to their last supported build. | Partial string match, or a comma separated list of partial strings to match on | "Windows_NT 5.0" or "Darwin 6,Darwin 7,Darwin 8"
systemRequirements | Matchable | The supported hardware features of the application requesting an update. This field is primarily used to handle users who are desupported based on their hardware, e.g. users whose hardware does not support SSE2. | Full string match, or a comma separated list of full strings to match on | "SSE" or "MMX,SSE"
distribution | Matchable | The partner distribution name of the application requesting an update, or "default" if the application is not a partner build. | Exact string match only | "default" or "yahoo"
distVersion | Matchable | The version of the partner distribution of the application requesting an update, or "default" if the application is not a partner build. | Exact string match only | "default" or "1.19"
headerArchitecture | Matchable | The architecture of the OS of the client, as guessed from the build target. This field is mostly deprecated now that this information is included in the build target. | Exact string match only | "PPC" and "Intel" are the only possible values
Whitelist | Matchable | A pointer to a Whitelist blob (stored in the Releases table) that can determine whether an update request is authorized to have the Release this rule is mapped to. Commonly used in whitelisting IMEIs for FirefoxOS updates. | If a whitelist is present, its shouldServeUpdate is called. If it returns True, this rule is considered to be matching. If it returns False, this rule is thrown out. | Any valid release name, or NULL
Priority | Decision | The priority of the rule, relative to other rules. If multiple rules match an incoming request based on the Matchable columns, the rule with the highest priority is chosen. | N/A | Any number, by convention positive integers
backgroundRate | Decision | The percentage of background update requests that should receive an update if they match this rule. Generally this is used as a throttle to increase or decrease the rate at which the majority of users receive an update. | N/A | Any number from 0 to 100
Mapping | Response | The Release to construct an update out of. This is a foreign key to the "name" column of the Releases table. | N/A | Any valid release name, or NULL
update_type | Response | The update_type to use in the XML response. It's very rare for a rule to use anything other than "minor" these days. | N/A | "minor" or "major"
id | Info | The id of the rule. This id is necessary to make changes to the rule through the REST API. | N/A | Autoincrementing integer
Alias | Info | A unique alias for the rule. Can be used in place of id in any REST operation that doesn't involve rule history. | N/A | "firefox-release-betatest", "firefox-nightly"
Comment | Info | A string describing the purpose of the rule. Not always necessary for obvious rules. | N/A | Any string

How are requests matched up to rules?

The parts of an incoming request correspond directly to parts of the incoming URL. For example, most update requests will send a URL in the following format:

/update/3/<product>/<version>/<buildID>/<buildTarget>/<locale>/<channel>/<osVersion>/<distribution>/<distVersion>/update.xml?force=1 # force can also be left off
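As a rough illustration of how the URL parts map onto the Matchable fields, here is a minimal Python sketch. The function name parse_update_url and the returned dictionary layout are assumptions made for this example; they are not taken from Balrog's code.

```python
from urllib.parse import parse_qs, unquote, urlsplit

# Field names in the order they appear in the URL format shown above.
FIELDS = ["product", "version", "buildID", "buildTarget", "locale",
          "channel", "osVersion", "distribution", "distVersion"]


def parse_update_url(url):
    """Hypothetical helper: split an update URL into named request fields."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: update / 3 / <nine fields> / update.xml
    if (len(segments) != len(FIELDS) + 3
            or segments[0] != "update"
            or segments[-1] != "update.xml"):
        raise ValueError("unrecognised update URL: %r" % url)
    # Percent-decode each field (osVersion often contains encoded spaces).
    request = {name: unquote(value)
               for name, value in zip(FIELDS, segments[2:-1])}
    # "force" is optional; treat it as set only when it is exactly "1".
    request["force"] = parse_qs(parts.query).get("force") == ["1"]
    return request
```

Calling parse_update_url on a path in the format above yields a dictionary keyed by the field names, plus a boolean "force" entry.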

The following logic is used to figure out which rule an update matches and what to respond with:

  1. If a rule specifies one of these fields and a request's field doesn't match it, the rule is considered not to be a match and the rule is ignored for that request. See above for details on how specific columns perform matching.
  2. If more than one rule still matches, the one with the highest priority is selected. If "force" wasn't specified, the backgroundRate of the selected rule is looked at.
  3. If we still choose to serve an update after accounting for backgroundRate, we look at the rule's mapping. This is a foreign key that points at an entry in the releases table. That row has most of the information we need to construct the update.
  4. Using the update_type and release that the mapping points to, construct and return an XML response with the details of the update for the client.
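The selection logic above can be sketched in Python as follows. This is a simplified illustration, not Balrog's implementation: the rule dictionaries, the matches and select_rule names, and the roll parameter are all assumptions. Only channel globbing is modelled; every other field uses an exact match, and a throttled request is assumed to receive no update. Note that fnmatch also treats "?" and "[" as special characters, which goes beyond the "*" globbing described in the table.

```python
import fnmatch
import random


def matches(rule, request):
    """Return True if every field the rule specifies matches the request."""
    for field, wanted in rule["match"].items():
        actual = request.get(field)
        if field == "channel":
            # Channel allows "*" globbing, e.g. "beta*".
            if not fnmatch.fnmatchcase(actual or "", wanted):
                return False
        elif actual != wanted:
            # Simplification: all other fields use exact matching here.
            return False
    return True


def select_rule(rules, request, roll=None):
    """Pick the highest priority matching rule, then apply backgroundRate."""
    candidates = [rule for rule in rules if matches(rule, request)]
    if not candidates:
        return None
    rule = max(candidates, key=lambda r: r["priority"])
    if not request.get("force"):
        # backgroundRate is a percentage: serve roughly that share of requests.
        if roll is None:
            roll = random.randint(0, 99)
        if roll >= rule["backgroundRate"]:
            return None
    return rule
```

The roll parameter exists only so the throttle can be exercised deterministically; in normal use it is left unset and a random value is drawn.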

Releases

needs fleshing out
To Balrog, a "release" is data about a related set of builds. This does _not_ match up with the concept of a "release" being on the "beta", "release" or "esr" channel elsewhere. In Balrog, each set of nightlies on any branch is considered a release.

While there's no enforced format on release names, there are a few conventions that we use:

  • Nightly-style builds submit to releases named by product and branch. Each nightly generally submits to two different releases, one "dated" (eg: Firefox-mozilla-central-nightly-20150513010203) and one "latest" (eg: Firefox-mozilla-central-nightly-latest).
  • Release-style builds submit to releases named by product, version number, and build number, eg: Firefox-38.0-build1
  • GMP blobs are created by hand and generally named with the version of each plugin they contain in the name, eg: GMP-20150423-CDM-v4-OpenH264-v1.4
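The nightly and release naming conventions above can be expressed as two small Python helpers. The function names are hypothetical and exist only to make the conventions concrete; Balrog itself does not enforce any format on release names.

```python
def nightly_release_names(product, branch, build_id):
    """Return the (dated, latest) release names a nightly build submits to."""
    base = "%s-%s-nightly" % (product, branch)
    return ("%s-%s" % (base, build_id), "%s-latest" % base)


def versioned_release_name(product, version, build_number):
    """Return the release name for a release-style build."""
    return "%s-%s-build%d" % (product, version, build_number)
```

For example, nightly_release_names("Firefox", "mozilla-central", "20150513010203") produces the two names given in the first bullet.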

More details on blobs can be found on the Blobs page.

Permissions

The permissions table is a simple list of usernames and the ACLs that they have. A user could be an "admin", giving them write access to everything, or could have one or more specific permissions. For example, our "ffxbld" system account has access to create or modify "Firefox" or "Fennec" releases. These specific ACLs let us do things such as give Go Faster folks access to Balrog without the risk of them or their tools accidentally messing up Firefox updates (or vice versa).

The table below describes all possible permissions:

Object | Action | Options | Comments
admin | No supported actions | No supported options | An admin user has completely unrestricted access to Balrog.
rule | create, modify, delete | products - If specified, the user only has permission for the object and action if the changes they are making only affect the product specified. |
release | create, modify, delete | products (as for rule) |
release_read_only | set, unset | products (as for rule) |
release_locale | modify | products (as for rule) |
permission | create, modify, delete | No supported options. |
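To make the interaction between objects, actions and the products option concrete, here is a minimal Python sketch of a permission check. The storage layout (a dictionary of username to object to options) and the has_permission name are assumptions for illustration only; the source does not describe how Balrog stores or evaluates permissions internally.

```python
def has_permission(permissions, username, obj, action, product=None):
    """Hypothetical check of a user's ACLs against an object, action, product."""
    user_perms = permissions.get(username, {})
    # An admin has completely unrestricted access.
    if "admin" in user_perms:
        return True
    if obj not in user_perms:
        return False
    options = user_perms[obj] or {}
    # Assumed layout: an optional list of allowed actions for the object.
    if "actions" in options and action not in options["actions"]:
        return False
    # If products are specified, the change must affect one of them.
    if "products" in options and product not in options["products"]:
        return False
    return True
```

With the "ffxbld" example from the text modelled as {"release": {"actions": ["create", "modify"], "products": ["Firefox", "Fennec"]}}, a request to modify a Firefox release is allowed while a request to modify a B2G release is refused.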

History Tables

Change attribution and recording is embedded deeply into Balrog. The rules, releases, and permissions tables all have a corresponding history table that records the time a change was made and who made it. This allows us to look back in time when debugging issues, attribute changes to people (aka blame), and quickly roll back bad changes.

Admin App

API

See the Admin API page.

UI Use Cases

Locking/Unlocking Nightlies

One of the most common uses for the Balrog UI is to lock a nightly update channel to a specific release for a period of time, and then unlock it later (so that users on that channel start receiving the latest available build again). This is often done when a serious bug has been introduced, to minimize the number of users affected by it.

Taking the B2G nightly channel as an example, let's see how we would lock it to the nightlies from 20150505160203:

  1. Log in to https://aus4-admin.mozilla.org
  2. Click on the "Rules" link at the top of the page
  3. Use the filter in the top right to narrow down the rules to "product:B2G channel:nightly"
  4. Locate the rule (or rules) on the "nightly" channel
    • Changing the sort to "Product, Channel" will group things together better.
  5. For each rule on the channel:
    • Click the "Update" button to enter edit mode
    • Find the mapping field and replace the "-latest" part with "-20150505160203" (the UI will autocomplete this for you if you start typing).
    • Scroll down and click "Save Changes"

When you're ready to unlock the updates, follow the same steps as above but replace the "-20150505160203" part of the mapping with "-latest" again.
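The mapping change made when locking and unlocking is a simple suffix swap, sketched below in Python. The helper names are hypothetical, and this only computes the new mapping string; it does not talk to the admin UI or API.

```python
LATEST_SUFFIX = "-latest"


def lock_mapping(mapping, build_id):
    """Replace the "-latest" suffix with a dated one, e.g. "-20150505160203"."""
    if not mapping.endswith(LATEST_SUFFIX):
        raise ValueError("mapping does not end in -latest: %r" % mapping)
    return mapping[:-len(LATEST_SUFFIX)] + "-" + build_id


def unlock_mapping(mapping, build_id):
    """Reverse lock_mapping: replace the dated suffix with "-latest"."""
    dated_suffix = "-" + build_id
    if not mapping.endswith(dated_suffix):
        raise ValueError("mapping is not locked to %s: %r" % (build_id, mapping))
    return mapping[:-len(dated_suffix)] + LATEST_SUFFIX
```

Refusing mappings that do not carry the expected suffix guards against locking a rule that is already locked, or unlocking one that points at a different build.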

Adding a rule for a new update channel

When nightly builds are set up on a new branch, rules need to be added to Balrog for updates to be served. Note that the nightly build automation is responsible for providing metadata about each new set of builds to Balrog.

As an example, here is how B2G updates could be set up on a hypothetical "mozilla-b2g40" branch:

  1. Log in to https://aus4-admin.mozilla.org
  2. Click on the "Rules" link at the top of the page
  3. Click on "Add a new Rule" near the top left of the page
  4. Fill out the form as follows:
    • Product: B2G
    • Channel: nightly-b2g40
    • Mapping: B2G-mozilla-b2g40-nightly-latest
    • Rate: 100
    • Priority: 90
  5. Click "Save Changes"

Modify an existing release

Most modifications to releases are done by automation, but sometimes we need to tweak them by hand, for example when adjusting the What's New page configuration.

To modify the "Firefox-38.0-build3" release, for instance, follow these steps:

  1. Log in to https://aus4-admin.mozilla.org/
  2. Click the "Releases" link at the top of the page
  3. Find the "Firefox-38.0-build3" release and click the "Download" link
  4. Save the file locally and modify it to your liking
  5. Click the "Update" link for "Firefox-38.0-build3"
  6. Click "Browse" and choose your new local version
  7. Click "Save Changes"

Code Overview

Balrog's code is organized into roughly the following parts:

  • The blobs - These contain most of the brains (business logic) behind Balrog. They know how to validate new data coming into the system and translate existing data into useful responses to update requests.
  • The database abstraction layer - This layer sits between the actual database and the applications. It defines the database schema, performs permissions checking, and ensures all changes are written to history tables. Applications should never touch the database directly - they should always go through this layer.
  • The user-facing application - The entry point to requests from applications looking for updates.
  • The admin API - A simple RESTful API that allows the Admin UI and automation to make changes to Balrog's database.
  • The admin UI - A human-friendly interface to manage updates.

Hacking

Balrog's code is split between the backend server and the admin UI. Both GitHub repositories use Travis for continuous integration and accept pull requests. To get both repositories, do the following:

git clone https://github.com/mozilla/balrog
cd balrog
git submodule init
git submodule update

Docker

You can run a complete Balrog environment locally with Docker, configured very similarly to production. You'll need Docker and Docker Compose installed; on Mac or Windows we require Docker for Mac/Docker for Windows v1.12 or higher. Then run the following in the root of the Balrog repo:

docker-compose up

Once it completes, you should be able to access

UI hacking

If you want to hack on the UI, you'll also need to install some npm modules:

cd ui
npm install
npm install -g lineman
cd ..

To make code changes visible to docker you need to run:

cd ui
lineman build

lineman build minifies the JS and CSS, which can make debugging more difficult. To avoid this, you can use the run mode of lineman instead. First make sure docker-compose is up, then run:

cd ui
lineman run

The UI will be available at http://127.0.0.1:8000. lineman will watch for changes to the source so you need only refresh the browser after saving your work.

Unit Tests

Backend

The recommended way to run backend tests is through Docker. Doing so will ensure your tests will run the same way they do in CI. To run them, use this command in the root of your Balrog repository:

./run-tests.sh

If you can't or don't want to run tests within Docker for some reason, you can run them directly with:

tox

Tests run fine on any posix-like environment, but are only run regularly within the Docker image, so it's possible to have failures that aren't related to Balrog code or your changes when running directly with tox.

Frontend

To run the unit tests use:

cd ui
lineman spec-ci

More options can be found in ui/README.md

Getting Involved

If you'd like to get involved in the development of Balrog, there are lots of areas where we could use some help. Below is a list of specific tasks that would be good first choices. Come talk to us in #balrog if you're interested!


Meeting Notes

See the Meetings page.

Running without Docker Compose

Creating a database

Balrog's database is controlled through sqlalchemy-migrate. To initialize a new Balrog database, run the following:

docker run --entrypoint python mozilla/balrog /app/scripts/manage-db.py -d DBURI create

Similarly, to upgrade the schema of an existing Balrog database, run the following:

docker run --entrypoint python mozilla/balrog /app/scripts/manage-db.py -d DBURI upgrade

See the "Environment Variables" section below for the DBURI format. If you're testing out local changes that affect database creation or upgrades, you should replace "mozilla/balrog" with your local image.

Environment Variables

The following environment variables are required by the Balrog WSGI apps:

  • DBURI - The database to use, in the format: driver://user:password@host/database.
  • SECRET_KEY - A pseudorandom string to use when generating CSRF tokens. Only used for the admin app.

These are optional:

  • LOG_LEVEL - Controls the Python logging level the app logs at. Set to INFO by default.
  • LOG_FORMAT - Controls the log format. If unset, mozlog format (json) will be used. Can be overridden with "plain" to log simple plain-text messages. The former is recommended for production, the latter for local development.
  • NOTIFY_TO_ADDR - An address to send e-mail to if a Rule or Permission changes. Unset by default, and only used for the admin app. If set, the following additional variables are required:
    • SMTP_HOST, SMTP_PORT, SMTP_USERNAME, SMTP_PASSWORD - Information about the SMTP relay to send mail through.
    • NOTIFY_FROM_ADDR - The "from" address to use when sending mail.
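The required and conditional variables described above could be validated at startup along these lines. This is a sketch, not the way Balrog's WSGI apps actually read their configuration: load_config and the returned dictionary are assumptions, and for simplicity it requires SECRET_KEY even though only the admin app uses it.

```python
import os

REQUIRED = ("DBURI", "SECRET_KEY")
# Additionally required once NOTIFY_TO_ADDR is set.
NOTIFY_VARS = ("SMTP_HOST", "SMTP_PORT", "SMTP_USERNAME", "SMTP_PASSWORD",
               "NOTIFY_FROM_ADDR")


def load_config(environ=None):
    """Hypothetical loader: collect settings and fail fast on missing ones."""
    if environ is None:
        environ = os.environ
    needed = list(REQUIRED)
    if environ.get("NOTIFY_TO_ADDR"):
        needed.extend(NOTIFY_VARS)
    missing = [name for name in needed if not environ.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    config = {name: environ[name] for name in needed}
    config["LOG_LEVEL"] = environ.get("LOG_LEVEL", "INFO")
    # None means the default mozlog (json) format; "plain" selects plain text.
    config["LOG_FORMAT"] = environ.get("LOG_FORMAT")
    config["NOTIFY_TO_ADDR"] = environ.get("NOTIFY_TO_ADDR")
    return config
```

Passing a plain dictionary instead of os.environ makes the check easy to try out without touching the real environment.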

Infrastructure

Support & Escalation

If the issue may be visible to users, please make sure #moc is also notified. They can also assist with the notifications below.

RelEng is the first point of contact for issues. To contact them, follow the standard RelEng escalation path.

If RelEng is unable to correct the issue, they may escalate to CloudOps.

Monitoring & Metrics

TODO: add more, flesh out.

Backups

Balrog uses the built-in RDS backups. The database is snapshotted nightly, and incremental backups are done throughout the day. If necessary, we have the ability to recover to within a 5-minute window. Database restoration is done by CloudOps, and they should be contacted immediately if needed.

Deploying Changes

Balrog's stage and production infrastructure is managed by the Cloud Operations team. This section describes how to go from a reviewed patch to deploying it in production.

Is now a good time?

Before you deploy, consider whether or not it's an appropriate time to do so. Some factors to consider:

  • Are we in the middle of an important release such as a chemspill? If so, it's probably not a good time to deploy.
  • How risky are your changes? If they're high risk, deploying on a Friday is probably a bad idea.
  • Do you need to migrate any data? If you do, make sure you have time to do so right after deploying.
  • Do you have enough time to safely do a push? Most pushes take at most 60 minutes to complete after the stage push has been done. This time is mostly affected by how long it takes you to verify your changes in stage and production.

Landing

UI Changes

If you've made a change to the ui repository, make sure you run "lineman build" and commit the result before pushing that change. This will rebuild the UI application and put the result into "dist", which is where the deployed instances run out of. You also need to commit the subrepository change to the Balrog repository and push that back. Generally, the workflow for this is something like:

cd /your/balrog/repo
cd ui
lineman build
git commit -a -m "Rebuild UI."
git push origin
cd ..
git commit -a -m "Update to latest UI."
git push origin

Backend Changes

Just push your change to the master branch of the Balrog repository.

Schema Upgrades

If you need to do a schema change you must ensure that either the current production code can run with your schema change applied, or that your new code can run with the old schema. Code and schema changes cannot be done at the same instant, so you must be able to support one of these scenarios. Generally, additive changes (column or table additions) should do the schema change first, while destructive changes (column or table deletions) should do the schema change second. You can simulate the upgrade with your local Docker containers to verify which is right for you.

When you file the deployment bug (see below), include a note about the schema change in it. Something like:

This push requires a schema change that needs to be done _prior_ to the new code going out. This can be performed by running the Docker image with the "upgrade-db" command, with DBURI set.

bug 1295678 is an example of a push with a schema change.

Testing

Before asking for a push, you should do some functional testing on your local machine with the Docker images. You should do this against the master branch of the upstream repository to ensure you're testing the exact code that is to be deployed. At the very least, you should do explicit testing of all the new code that would be included in the push. Eg: if you're changing the format of a blob, make sure that you can add a new blob of that type, and that the XML response looks correct.

Pushing to stage and production

Pushing live is a two step process. First, you must push to the stage environment and ensure things are working there. Then, you can push live.

  1. Bump the in-repo version.
  2. Tag the repository with a "vX.Y" tag. Eg: "git tag -s vX.Y"
  3. Wait for CI jobs to complete. Unit tests must pass and a new Docker Image must be pushed to Dockerhub before you proceed.
  4. File a bug to have the new version pushed to stage. Be sure to include the new version number, and Docker image tag you want deployed.
  5. Once stage has been updated, verify your changes again. Even though you've tested locally, it's important to retest in stage to make sure there are no deployment-specific issues.
  6. When stage looks good, you're ready to comment in the bug to ask for production to be updated.
  7. Reverify in production. When production has been updated, verify your changes again there. If you need to tweak rules or releases to do so, be careful not to touch any live channels (create new rules or releases if necessary). This final verification is more about making sure the right thing got deployed than whether or not your code is correct.

Scripts

These tools all live in tools/scripts/updates. We use them to programmatically adjust the update server.

balrog-submitter.py

Used to submit nightly and release style builds into release blobs. Called on build slaves, once for each combination of platform and locale.

balrog-release-pusher.py

Used by the release automation to add the metadata to a release blob, and to push the new release onto the test channels. Example builder name - release-mozilla-beta-firefox_updates.

balrog-release-shipper.py

Used by the release automation to push a release to the production channel. Example builder name - release-mozilla-beta-update_shipping.

balrog-tweaker.py

We don't have UI support for modifying blobs, so this is a helper to submit a blob fragment to replace content on the server. Doesn't handle removing keys.

Example:

python scripts/updates/balrog-tweaker.py --json json -b 'Firefox-33.0-build1' --api-root 'https://aus4-admin.mozilla.org'  --credentials-file cred -u 'nthomas@mozilla.com' -v

where json is a file containing the fragment of JSON to submit, and cred is a file containing the password for the -u argument, in the format

balrog_credentials = {
     'username': 'password'
}

balrog-nightly-locker.py

Used to 'freeze' nightly updates by pointing to a dated release blob instead of the latest, e.g. for big code landings or merges. Also used to unfreeze them afterwards. More details on usage can be found at Enable/disable updates on Aurora.

Common code

cli.py and api.py in tools/lib/python/balrog/submitter/ provide the shared code.