BugzillaAutoLanding

From MozillaWiki
(Lando is the new autoland)
 
=Project Description=
This project has been superseded by [https://lando.services.mozilla.com Lando], which hooks into [https://phabricator.services.mozilla.com Phabricator].
This project will create and deploy a set of tools that will:
# grab patches from a bug in Bugzilla
# land patches on try (or other specified branch(es))
# poll for the results of automated pushes and depending on the branch specified:
#* report the results back to the bug
#* on success, continue to push to another branch (e.g. try, then mozilla-inbound)
#* be able to automatically back out a push if a failure is detected
 
=Goals=
* A set of smaller tools, each usable as part of a larger script that runs "autolanding" to try and other hg.m.o repos that build on every check-in
* Tools that can be used from the command line via API calls
* Tools to control the process: plenty of toggles for going back to manual sheriffing (global kill switch)
* Using our build/test resources wisely, without increasing load so much that try becomes unusable for developers working on their patches before automated/assisted landings
* Streamlining the try server and trunk landing process as much as possible without removing human interaction with the build/test/perf results
 
=Non-Goals=
* Replacing humans in the landing process
* Handling performance regressions
* Providing the best possible UI
* Going to extreme lengths to support auto-landing patches which do not follow the rules which the tool requires
 
=People=
* Lukas Blakk
* Marc Jessome
 
=Designing the System=
A simple survey on try usage ([http://bit.ly/try_usage survey], [http://bit.ly/oGSVAr results]) was created and advertised to find out how developers currently use try and what they think about the autolanding workflow. We got 52 responses. They helped catch some things we had missed, such as checking LDAP authentication before pushing something from Bugzilla, and provided new observations on the try workflow. Many developers said that landing to try via Bugzilla would be 'more work' than what push-to-try (using try syntax) currently offers. We are taking that into account in our design and adjusting the goal to be primarily about landing to trunk, not just getting try results posted to the bug. Posting try results will instead be handled through try syntax: if a bug is specified, the results can be posted to the bug, and email notification can be turned off if desired.
 
=API=
==Architecture==
[[File:Autoland_API_Outline3.png|800px]]
 
==REST Interface==
List of request methods, urls, and parameters:
 
'''GET /bugs/{bugID}''' -> get bug patchsets
 
'''GET /patchSets/{patchSetID}''' -> get patchset information
 
'''GET /branches/{branchName}''' -> get branch information
 
 
 
'''POST /bugs/{bugID}''' -> create empty patchset
 
'''POST /branches/{branchName}''' -> create a new branch
 
 
 
'''PUT /patchSets/{patchSetID}/{patchID}''' -> add patch to patchset
 
'''PUT /branches/{branchName}''' -> update branch
 
'''PUT /branches/{branchName}/threshold''' -> set branch threshold
 
'''PUT /branches/{branchName}/status''' -> set status enabled/disabled
 
 
 
'''DELETE /patchSets/{patchSetID}''' -> delete a patchset (if not processing)
 
'''DELETE /branches/{branchName}''' -> delete branch from db
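
The endpoint list above can be read as a small route table. The following is a minimal sketch only: <code>ROUTES</code>, the action names, and <code>build_request</code> are hypothetical names chosen for illustration, and the sketch formats method/path pairs rather than issuing any HTTP calls.

```python
# Hypothetical route table mirroring the endpoints listed above.
# The action names (dictionary keys) are illustrative, not part of the API.
ROUTES = {
    "get_bug_patchsets": ("GET", "/bugs/{bugID}"),
    "get_patchset": ("GET", "/patchSets/{patchSetID}"),
    "get_branch": ("GET", "/branches/{branchName}"),
    "create_patchset": ("POST", "/bugs/{bugID}"),
    "create_branch": ("POST", "/branches/{branchName}"),
    "add_patch": ("PUT", "/patchSets/{patchSetID}/{patchID}"),
    "update_branch": ("PUT", "/branches/{branchName}"),
    "set_threshold": ("PUT", "/branches/{branchName}/threshold"),
    "set_status": ("PUT", "/branches/{branchName}/status"),
    "delete_patchset": ("DELETE", "/patchSets/{patchSetID}"),
    "delete_branch": ("DELETE", "/branches/{branchName}"),
}


def build_request(action, **params):
    """Return the (method, path) pair for an action, filling in URL parameters."""
    method, template = ROUTES[action]
    return method, template.format(**params)
```

For example, <code>build_request("set_status", branchName="try")</code> yields the <code>PUT</code> method together with the path <code>/branches/try/status</code>.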
 
==Object Definitions==
PatchSets:
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `bug_id` int(11) DEFAULT NULL,
  `patches` text,
  `revision` text,
  `branch` text,
  `try_run` int(11) DEFAULT NULL,
  `creation_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `push_time` timestamp NULL DEFAULT NULL,
  `completion_time` timestamp NULL DEFAULT NULL,
 
Branches:
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` text,
  `repo_url` text,
  `threshold` int(11) DEFAULT NULL,
  `status` text,
  `approval_required` text,
  `push_to_closed` text,
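
For readers who prefer to see the two tables as in-memory objects, here is a rough sketch using Python dataclasses. The field names follow the column definitions above; the class names, the helper methods, and the use of modern Python (the project itself ran on Python 2.6) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class PatchSet:
    """Mirrors the PatchSets columns; nullable columns default to None."""
    id: int
    bug_id: Optional[int] = None
    patches: Optional[str] = None
    revision: Optional[str] = None
    branch: Optional[str] = None
    try_run: Optional[int] = None
    creation_time: Optional[datetime] = None
    push_time: Optional[datetime] = None
    completion_time: Optional[datetime] = None

    def is_complete(self):
        # Hypothetical helper: a patchset counts as finished once
        # completion_time has been recorded.
        return self.completion_time is not None


@dataclass
class Branch:
    """Mirrors the Branches columns."""
    id: int
    name: Optional[str] = None
    repo_url: Optional[str] = None
    threshold: Optional[int] = None
    status: Optional[str] = None
    approval_required: Optional[str] = None
    push_to_closed: Optional[str] = None

    def is_enabled(self):
        # Assumes the status column holds "enabled" or "disabled",
        # as suggested by the PUT /branches/{branchName}/status endpoint.
        return self.status == "enabled"
```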
 
=Project Timeline=
* Tracking bug: {{bug|657828}}
 
{| class="fullwidth-table sortable"
|-
| style="background: none repeat scroll 0% 0% rgb(239, 239, 239);" | '''Component Name'''
| style="background: none repeat scroll 0% 0% rgb(239, 239, 239);" | '''Bug(s)'''
| style="background: none repeat scroll 0% 0% rgb(239, 239, 239);" | '''Assigned To'''
| style="background: none repeat scroll 0% 0% rgb(239, 239, 239);" | '''Description'''
| style="background: none repeat scroll 0% 0% rgb(239, 239, 239);" | '''Start Date (est.)'''
| style="background: none repeat scroll 0% 0% rgb(239, 239, 239);" | '''Completion Date (est.)'''
| style="background: none repeat scroll 0% 0% rgb(239, 239, 239);" | '''On Track'''
| style="background: none repeat scroll 0% 0% rgb(239, 239, 239);" | '''Updates'''
|-
| HgPusher
| {{bug|657832}}
| Marc
| [[BugzillaAutoLanding#HgPusher|Details]]
| Monday May 23rd
| Friday Dec 30th (actual completion)
| style="background: none repeat scroll 0% 0% rgb(144, 238, 144);" | Yes
| style="background: none repeat scroll 0% 0% rgb(144, 238, 144);" | Complete
|-
| SchedulerDBPoller
| {{bug|430942}}
| Lukas
| [[BugzillaAutoLanding#SchedulerDBPoller|Details]]
| Monday May 23rd
| Friday Dec 30th (actual completion)
| style="background: none repeat scroll 0% 0% rgb(144, 238, 144);" | Yes
| style="background: none repeat scroll 0% 0% rgb(144, 238, 144);" | Complete
|-
| BugCommenter
| {{bug|659167}}
| Marc
| [[BugzillaAutoLanding#BugCommenter|Details]]
| Monday June 20th
| Friday July 8th
| style="background: none repeat scroll 0% 0% rgb(144, 238, 144);" | Yes
| style="background: none repeat scroll 0% 0% rgb(144, 238, 144);" | Complete
|-
| AutolandDB
| {{bug|659166}}
| Marc
| [[BugzillaAutoLanding#AutolandDB|Details]]
| Monday June 20th
| Friday July 22nd
| style="background: none repeat scroll 0% 0% rgb(144, 238, 144);" | Yes
| style="background: none repeat scroll 0% 0% rgb(144, 238, 144);" | Complete
|-
| MessageQueue
| {{bug|659166}}
| Marc
| [[BugzillaAutoLanding#MessageQueue|Details]]
| Monday June 13th
| Friday July 22nd
| style="background: none repeat scroll 0% 0% rgb(144, 238, 144);" | Yes
| style="background: none repeat scroll 0% 0% rgb(144, 238, 144);" | Complete
|-
| LDAP Tool
| {{bug|666860}}
| Marc
| [[BugzillaAutoLanding#LDAPTool|Details]]
| Monday July 11th
| Friday July 29th
| style="background: none repeat scroll 0% 0% rgb(144, 238, 144);" | Yes
| style="background: none repeat scroll 0% 0% rgb(144, 238, 144);" | Complete to fill needs of this project
|}
 
==Testing==
June 20th - July 22nd
* Set up on staging masters as per {{bug|661634}} running the automation against the sandbox Bugzilla with autolanding/logging set to staging repos to test the individual components and watch for issues in the message queue and system.
 
==Testing Round 2==
===Week of Dec 5 - 9===
* Another round of testing the life cycle of a patch attached to [https://landfill.bugzilla.org/bzapi_sandbox/show_bug.cgi?id=10480 sandbox bug] using real push to try
What we need to ensure:
* Not losing every second message to mq
* No extraneous or unexpected postings to the bug from any modules
* Push to try should not require full headers on patches
* Whiteboard syntax either specifying attachment id or not - so [autoland-try] or [autoland-try,1234]
* Retrying oranges
* (feature I would like to have) Report back to bug if builds are canceled through self-serve
===Week of Dec 12 - 16===
Didn't get a full run going on autoland-staging last week, still working out kinks with schedulerdbpoller.
 
TODO:
* attempts to post to non-existent or blocked bugs need to be handled so that the revision is kicked out of the queue and doesn't get re-posted
* working on 400 bad requests right now in test_bz_utils
 
===Week of Dec 19 - 23===
* working on thresholds, queue messages, setting up autoland-staging01 to run autoland_queue.py and hgpusher.py
* filed {{bug|712360}} to address LDAP connection errors
* have both modules working and speaking to each other but new bugs:
** hgpusher.py has to be started before autoland_queue.py so that the first time we find an autoland tag the patchset msg succeeds
** need to not put in a time when a 'bad msg' is returned on attempt to hgpusher
** message from hgpusher to autoland is broken right now
** thresholds and the use of 'mozilla-central' as the branch for try are complicated and require review and updating
* hg clone error doesn't return a message to autoland_queue and doesn't retry
<pre>
remote: Permission denied (publickey,gssapi-with-mic).
abort: no suitable response from remote hg!
[Clone] error cloning 'ssh://hg.mozilla.org/try' into clean repository:
Command '['hg', 'clone', u'ssh://hg.mozilla.org/try', u'/root/autoland-env/tools/scripts/autoland/build/clean/try']' returned non-zero exit status 255
[HgPusher] Clone error...
</pre>
* Why is hgpusher picking up stray autoland.db messages after a blowout error?
<pre>
(autoland-env)[root@autoland-staging01 autoland]# python hgpusher.py
[RabbitMQ] Established connection to localhost.
[HgPusher] Erroneous message: {u'_meta': {u'queue': u'autoland', 'received_time': '2011-12-21 02:51:24.446963', u'sent_time': u'2011-12-21 02:46:26.146165', u'routing_key': [u'autoland.db'], u'exchange': u'autoland'}, u'payload': {u'action': u'patchset.apply', u'patchsetid': 66, u'type': u'error'}}
</pre>
 
===Week of Dec 26 - 30===
* bug fixes
* reworking queues to simplify, use direct topic
* completed autoland_queue <-> hgpusher communication
* refactored to be able to do try as its own branch
* added in a couple more variables to the patchsets (author, retries) and removed to_branch
* thresholds fixed so that try and mozilla-central aren't conflicting with each other
* schedulerdbpoller and autoland_queue both push things out if timed out or can't post to bug
'''Still to do'''
* complete end-to-end test running schedulerdbpoller on autoland-staging01
** first attempt: message of completion from schedulerdbpoller got nabbed by hgpusher
* hgpusher still has to be started before autoland_queue (why???)
* more code clean up and review
* test on actual bugzilla bug (check on attachment name as custom syntax)
== Testing Round 3 ==
===Weeks of Jan 9 - 13 && Jan 16 - 20===
* review tests and try to get better coverage (80% or higher) on all modules
* confirm message queues working (Marc: Done)
 
==Deployment==
===Week of Jan 16th===
Go live to a small subset of devs (ehsan plus one or two others), watch for issues using live data from actual pushes/db
===Week of Jan 23===
Then go live to all with docs & usage info broadcast widely
 
===TO DO===
* Monitor load & machine resources/wait times
* Confirm bugs hit from previous attempts are no longer reproducible
* Write up developer documentation on how to use the current system
* Write blog posts and promote the new functionality as well as explaining where this is heading (automated landing across release branches)
* Extend functionality to additional branches (with specific per-branch criteria)
 
===Possible Bugs===
These two bugs came up in previous staging runs and should be watched for specifically:
* Repeated bug posting
* Queue messages getting lost
 
==API==
Write and enable API to give sheriff access to the functionality of this system. Include Kill Switch, ability to override.
 
==Post-Deployment==
===Outreach===
In order to get a lot of eyes (and users) we have a multi-step outreach process to inform developers of this system.
* Blog posts
* Tweeting
* Yammer posts
* Monday meeting lightning talk
* Brown Bag
* File a tracking bug for all bugs filed about any module issues
* Documentation in wiki.m.o (and possibly in MDN)
 
==Component Descriptions and Implementation Notes==
===AutolandQueue===
A polling script that pulls all bugs from Bugzilla whose whiteboard tags match either of:
 [autoland-$branch]
 [autoland-$branch:$patchID:$patchID]
 
By default, [autoland-$branch] grabs all non-obsolete, non-r- patches and creates a patchset DB object. The patchset is then picked out of the queue and sent to HgPusher for landing on $branch. HgPusher returns the results to the AutolandQueue module, which analyzes them for further actions (such as try->mozilla-{central,inbound}). To override the default or to search for regressions, the API will support inserting a patchset directly to create a custom autoland run.
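
The two whiteboard tag forms can be recognized with a single regular expression. This is a minimal sketch, not the project's actual parser: <code>TAG_RE</code> and <code>parse_autoland_tags</code> are hypothetical names, and the set of characters allowed in a branch name is an assumption.

```python
import re

# Matches [autoland-$branch] and [autoland-$branch:$patchID:$patchID...].
# The characters allowed in a branch name are an assumption.
TAG_RE = re.compile(
    r"\[autoland-(?P<branch>[A-Za-z0-9._-]+)(?P<patches>(?::\d+)*)\]"
)


def parse_autoland_tags(whiteboard):
    """Return a list of (branch, patch_ids) tuples found in a whiteboard string.

    An empty patch_ids list means no patches were named explicitly.
    """
    results = []
    for match in TAG_RE.finditer(whiteboard):
        patch_ids = [int(p) for p in match.group("patches").split(":") if p]
        results.append((match.group("branch"), patch_ids))
    return results
```

An empty ID list corresponds to the default case, where every eligible patch on the bug is picked up.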
 
===HgPusher===
Listens for messages requesting a push to a branch. Given a branch and one or more patch IDs, it clones the branch, applies the patches, and reports the result of the push back to the AutolandQueue via a message (success == revision, or FAIL).
* Input(s): AutolandQueue Messages, command line
* Output(s): hg.mozilla.org, AutolandQueue Messages, stdout
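
The clone, apply, push sequence described above can be sketched as a function that only plans the Mercurial commands, without running them. <code>plan_push</code> and <code>push_result</code> are hypothetical names; the <code>patchsetid</code> and <code>type</code> keys follow the message shown in the testing logs on this page, while the <code>"success"</code> value and the <code>revision</code> key are assumptions.

```python
def plan_push(repo_url, workdir, patch_files):
    """Return the hg command lines for clone -> import -> push.

    Nothing is executed here; a caller could hand each entry to
    subprocess.check_call and stop at the first failure.
    """
    commands = [["hg", "clone", repo_url, workdir]]
    for patch in patch_files:
        # Apply each patch, in order, inside the fresh clone.
        commands.append(["hg", "-R", workdir, "import", patch])
    commands.append(["hg", "-R", workdir, "push", repo_url])
    return commands


def push_result(patchset_id, revision=None):
    """Build the result payload: a revision on success, an error otherwise."""
    if revision:
        # "success" and "revision" are assumed field values/names.
        return {"patchsetid": patchset_id, "type": "success", "revision": revision}
    return {"patchsetid": patchset_id, "type": "error"}
```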
 
===SchedulerDBPoller===
Regularly polls the scheduler DB (cronjob) and checks for completed buildruns for any actively monitored branches.  There should be a list to check against for what branches need to be watched. The incomplete runs from the last N units of time are kept by revision in a local cache file to check for completion.  When an observed branch has a completed buildrun the SchedulerDBPoller can check two things:
* if a --post-to-bugzilla flag is present and a bug number is provided following that flag, then the revision will be kept in a local cache and watched until completion (or timeout). When done, a bug comment will be posted and a message sent out announcing that the revision is complete, in case AutolandQueue is interested in it
 
* Input(s): command line
* Output(s): AutolandQueue, stdout
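
The watch-until-completion-or-timeout behaviour can be illustrated with a small cache object. This is a sketch under stated assumptions: <code>RevisionCache</code> is a hypothetical name, the cache lives in memory rather than in a local file, and completed revisions are passed in by the caller instead of being read from the scheduler DB.

```python
import time


class RevisionCache:
    """Tracks incomplete revisions and drops them on completion or timeout."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self._first_seen = {}  # revision -> time it was first watched

    def watch(self, revision, now=None):
        now = time.time() if now is None else now
        # Keep the original timestamp if the revision is already watched.
        self._first_seen.setdefault(revision, now)

    def poll(self, completed_revisions, now=None):
        """Return (completed, timed_out) lists and stop watching both."""
        now = time.time() if now is None else now
        completed, timed_out = [], []
        for revision, first_seen in list(self._first_seen.items()):
            if revision in completed_revisions:
                completed.append(revision)
            elif now - first_seen > self.timeout_seconds:
                timed_out.append(revision)
            else:
                continue  # still pending, keep watching
            del self._first_seen[revision]
        return completed, timed_out

    def pending(self):
        return sorted(self._first_seen)
```

Each cron run would call <code>poll</code> once; revisions in the first list get a bug comment and a completion message, while those in the second are pushed out of the cache.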
 
===Bugzilla Autolanding Extension===
The plan is for a [https://wiki.mozilla.org/BMO/AutoLand BMO Extension] that will only be available to people with the correct LDAP permissions.
 
===MessageQueue===
This is a util that manages the creation of message queues and routing used by the Autolanding system.  It ensures that the right modules receive the right messages.
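
As an illustration of how routing keys decide which module sees a message, the sketch below implements AMQP-style topic matching, where <code>*</code> stands for exactly one dot-separated word and <code>#</code> for zero or more. Whether the project's queues use these patterns is an assumption; <code>topic_matches</code> is a hypothetical helper, and keys such as <code>autoland.db</code> are taken from the log output in the testing notes.

```python
def topic_matches(pattern, routing_key):
    """Return True if routing_key matches an AMQP-style topic pattern.

    '*' matches exactly one word, '#' matches zero or more words.
    """
    p = pattern.split(".")
    k = routing_key.split(".")

    def match(i, j):
        if i == len(p):
            return j == len(k)
        if p[i] == "#":
            # '#' may swallow any number of remaining words, including none.
            return any(match(i + 1, jj) for jj in range(j, len(k) + 1))
        if j < len(k) and p[i] in ("*", k[j]):
            return match(i + 1, j + 1)
        return False

    return match(0, 0)
```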
 
===AutolandDB===
Keeps track of the state of an autoland-triggered push from start to finish.
 
===LDAPTool===
* Checks for hg permission level of submitter
* Compares bugzilla email to ldap email
 
=Notes=
* Set of tools which automate each step of this work and we can 'on-demand' turn off any of the tools
* Sheriff needs to be able to turn off the auto-landing altogether
* No limit in workflow -- sheriff should be able to override the queue to auto-land a priority patch
** can have oranges, but sheriff could 'force' the landing anyway
* We need to have stages:
** (TBD): how to deal with bugzilla
** '''HgPusher''': deals with the hg stuff (merging patches, grabbing patches from bugs, what happens if any steps fail)
** '''SchedulerDB Poller''': deals with a way of getting results from try/m-c to make sure we have the data on whether to proceed and/or leave comment in the bug regarding outcomes
** (fourth, optional/desired/future) how to monitor perf results and get an idea of when it's ok to push or flag on those results
* Be strict in accepting things from humans
** Rules for automated landings
** If they forget a rule, they get a comment - "Step X failed (reason)"
** For bugs with multiple patches - to land all together, look at the non-obsolete patches which have been reviewed and take them in alphabetical order
** Common for devs to name things "part 1", "part 2"
** Don't need to be lenient toward human mistakes
** Need correct descriptions, author information, header of patch -- if it doesn't include a header, try syntax
** Look at "checkin-needed" box, grab all non-obsolete, and if they have the right message then push all those - if all those steps succeed then you get the try push otherwise fail with comment -- clear "checkin-needed" from the bug
* Is the bug the right place? It's the public record for the issue, so yes - emailing the assignee takes the information away from the record
* Specialized tool for patch queues would be nice (not in the scope of this project!)
* Try syntax presence == try on try_repo but NEVER auto-land on trunk
* Developers will be encouraged to watch perf results on trunk, since try perf is not really useful information
* Bot has to watch m-c after it lands something there, results come back to the bug
** Merge tracking for commits (what was auto-landed and what was not)
** Merging between two auto-landed pushes
** We have to watch these to know when all the results are in, and what was successful/not (scheduler db has this information for us to get)
* Autoland Message Queue - consumes information from BugzillaScraper (pulse?) as event triggerer for HgPusher, BugCommenter
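
The patch-selection rule from the notes above (non-obsolete, reviewed patches taken in alphabetical order) might look like the following. This is a sketch only: <code>select_patches</code> and the attachment keys <code>is_patch</code>, <code>is_obsolete</code>, <code>review</code>, and <code>description</code> are hypothetical and do not necessarily match Bugzilla's field names.

```python
def select_patches(attachments):
    """Pick reviewed, non-obsolete patches and order them alphabetically.

    Each attachment is a dict with the hypothetical keys
    'is_patch', 'is_obsolete', 'review' and 'description'.
    """
    eligible = [
        a for a in attachments
        if a["is_patch"] and not a["is_obsolete"] and a["review"] == "+"
    ]
    # Case-insensitive alphabetical order, so "Part 1" sorts before "part 2".
    return sorted(eligible, key=lambda a: a["description"].lower())
```

Plain alphabetical ordering places "part 10" before "part 2", so a strict naming rule (for example zero-padded part numbers) would be needed for longer series.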
==Security==
* Must ensure that the patch is attached by someone with L1 hg access, so that we are auto-landing patches from authors with the same level of security as current push-to-try
** Why L1?  This is about pushing to mozilla-central, so I think we should check for L3 access.
** L1 to push to try, then reviewer should have L3 for autolanding to trunk
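
One reading of the access rule discussed above (L1 to push to try, plus an L3 reviewer for landing on trunk) is sketched below. The function name <code>may_autoland</code>, the integer encoding of access levels, and treating every non-try branch as trunk are all assumptions made for illustration.

```python
def may_autoland(branch, author_level, reviewer_level=None):
    """Decide whether a patch may be auto-landed on a branch.

    Levels are plain integers (1 for L1, 3 for L3). Every branch other
    than 'try' is treated as trunk, which is an assumption.
    """
    if author_level < 1:
        return False  # the author needs at least L1 to push anywhere
    if branch == "try":
        return True
    # Trunk landings additionally need a reviewer with L3 access.
    return reviewer_level is not None and reviewer_level >= 3
```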
=Deployment Coordination with IT=
We want to have a production-level system here so what will that require?
* Set up an autoland-master01 (perhaps move autoland-staging02 to fill this role?) so we can keep autoland-staging01 as our staging environment
* Nagios alerts on the modules to ensure they are running
* Emails when there are failures that stop any part of the system
 
=Setting up the staging masters=
==First run, August 2011==
On autoland-staging01 I have done the following:
* sudo yum install hg
* checked out my [http://hg.mozilla.org/users/lsblakk_mozilla.com/tools tools repo]
* Installed Python 2.6.7 from [http://www.python.org/ftp/python/2.6.7/Python-2.6.7.tgz source]
** ./configure, make (yum install zlib-devel, readline-devel, bzip2-devel), make install
** added /usr/local/bin to the PATH in .bashrc to use 2.6 as default Python
* Set up a virtualenv using [https://wiki.mozilla.org/ReleaseEngineering/Virtualenv these notes]
* in the virtualenv installed:
** argparse
** simplejson
** sqlalchemy
** mysql-python
* Also required (but not currently installed):
** [http://www.python-ldap.org/ python-ldap] for ldap utils
** [http://pika.github.com/ pika] for rabbitmq communication with python
** mock for tests
 
==Second run Dec 2011==
* Cloned a read-only of git://github.com/lsblakk/tools.git to /root/autoland-env on autoland-staging01
* copied in config.ini.prod values from /root/lsblakk-tools/scripts/autoland and made a backup of the config.ini, then did a symlink from config.ini.prod to config.ini
* pip install urllib3 (to get urllib2) - also had to add import urllib2 to autoland_queue.py
* rebooted autoland-staging01 to kill any stray processes of mq/rabbit
* start rabbit server:
 su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmq-server
* JSON errors on request call for bz poller
* fixed calls to bz_utils __init__ in both autoland_queue and hg_pusher
* now I can start both and have their queues listening but autoland_queue is reading my bug as not having patches, trying to get it to work with :attachment_id in the tag and there's some patch list/splitting errors
* yum package of rabbitmq-server was 2.2 and to get rabbitmq-plugins I needed a newer version so did:
 yum remove rabbitmq-server
 rpm -ivh http://www.rabbitmq.com/releases/rabbitmq-server/v2.7.1/rabbitmq-server-2.7.1-1.noarch.rpm
* thanks to newer rabbitmq-server, newer erlang needed as well so as per [http://www.rabbitmq.com/install-rpm.html rabbit docs] did:
 wget -O /etc/yum.repos.d/epel-erlang.repo http://repos.fedorapeople.org/repos/peter/erlang/epel-erlang.repo
 yum install erlang
 
 
==HG Update January 24 2012==
* get latest release from http://pkgs.repoforge.org/mercurial/ (eg mercurial-1.9-1.el5.rf.x86_64.rpm)
* remove old mercurial version: yum remove mercurial
* install new version, eg: yum install --nogpgcheck mercurial-1.9-1.el5.rf.x86_64.rpm

Latest revision as of 01:00, 25 October 2018
