CIDuty/Other Duties

From MozillaWiki

Revision as of 16:40, 5 May 2014

Tree Maintenance

Repo Errors

If a dev reports a problem pushing to hg (either m-c or try repo) then you need to do the following:

  • File a bug (or have the dev file it), then ping noahm in #ops
    • If he doesn't respond, then escalate the bug to page on-call
  • Follow the steps below for "How do I close the tree"

How do I see problems in TBPL?

All "infrastructure" (that's us!) problems should be purple at http://tbpl.mozilla.org. Some aren't, so keep your eyes open in IRC, but get on any purples quickly.

How do I close the tree?

See ReleaseEngineering/How_To/Close_or_Open_the_Tree

How do I claim a rentable project branch?

See ReleaseEngineering/DisposableProjectBranches#BOOKING_SCHEDULE

Clean up the scheduler DB

Sometimes we get some jobs pending for days: https://secure.pub.build.mozilla.org/buildapi/pending

These are supposed to be cleaned up automatically now. See bug 755012 for details.

Re-run jobs

How to trigger Talos jobs

see ReleaseEngineering/How_To/Trigger_Talos_Jobs

How to re-trigger all Talos runs for a build (by using sendchange)

see ReleaseEngineering/How_To/Trigger_Talos_Jobs

How to re-run a build

Do not go to the page of the build you'd like to re-run and cook up a sendchange to try to re-create the change that caused it. Changes without revlinks trigger releases, which is not what you want.

Find the revision you want, find a builder page for the builder you want (preferably, but not necessarily, on the same master), and plug the revision, your name, and a comment into the "Force Build" form. Note that you MUST specify the branch so that there are no null keys in builds-running.js. Otherwise your build will not show up in self-serve or tbpl.

Nightlies

How do I re-spin mozilla-central nightlies?

To rebuild the same nightly for some platform

Buildbot's Rebuild button works fine. However, if the original build uploaded files or published updates, do not use it: caching on ftp.m.o means the old file would be served for some time. Check the log of the original build; if it failed before uploading, it is OK to rebuild.

To build new nightlies

NB: For b2g, the revision set can only be the mercurial gecko revision.

To build nightlies on all the platforms

Use self-serve for this. Scroll to the bottom of the page and put the requested revision into the box next to Create new nightly builds on mozilla-central revision. Click on the Submit button. This takes care of using the same buildID for all builds.

To rebuild some subset of all the nightly builders (eg only desktop/Android/b2g)

To reach a specific nightly builder on a build master, you can use a finished nightly job on tbpl and this snippet of javascript code (add it to the bookmark toolbar):

javascript:(function(){function%20JSONRequest(url,callback){var%20req=new%20XMLHttpRequest();req.open('GET',url+'?format=json');req.withCredentials=true;req.timeout=5000;req.onreadystatechange=function(){if(req.readyState===4){if(req.status!==200||!req.responseText){if(req.status===0){alert('Self-Serve%20API%20request%20timed%20out.');}else{alert('Self-Serve%20API%20request%20failed%20('+req.status+').%20\nIs%20the%20job%20still%20pending?%20\n\n'+req.responseText);}return;}try{callback(JSON.parse(req.responseText));}catch(e){alert(e);}}};req.send();}function%20openJob(job){var%20bm=job.claimed_by_name.split(':');var%20port=/build[0-9]/.test(bm[1])?"8001":/try[0-9]/.test(bm[1])?"8101":/tests[0-9]/.test(bm[1])?"8201":'unknown';window.open('http://'+bm[0]+':'+port+'/builders/'+job.buildername+'/builds/'+job.buildnumber);}function%20openCompletedJob(jobs){var%20selectedJob=UserInterface._activeResultObject();var%20startTime=Date.parse(selectedJob.startTime)/1000;jobs=jobs.filter(function(j){return(j.buildername===selectedJob.machine.name)&&((j.starttime-startTime)<=1);});if(jobs.length!==1){alert('Should%20have%20found%20exactly%20one%20matching%20job,%20found%20'+jobs.length+'!');return;}openJob(jobs[0]);}var%20id=(typeof%20UserInterface!=="undefined")?UserInterface._activeResult:'';if(!id){alert('Error%20-%20are%20you%20on%20TBPL%20with%20a%20job%20selected?');return;}var%20baseURL='https://secure.pub.build.mozilla.org/buildapi/self-serve/'+UserInterface.treeInfo.buildbotBranch;var%20re=/^(?:running%7Cpending)-/;if(!re.test(id)){var%20rev=UserInterface._activeResultObject().revs[UserInterface.treeInfo.primaryRepo];JSONRequest(baseURL+'/rev/'+rev,openCompletedJob);}else{JSONRequest(baseURL+'/build/'+id.replace(re,''),openJob);}})()

You will need to force the specific builders on a build master, setting several parameters. First open a waterfall page (eg mozilla-central) and search for 'nightly'. Open each builder you want in a new tab.

For each builder you should set

  • 'Your name' to your name
  • 'Reason for build' to 'bug ###' or 'requested by <someone>'
  • 'Branch to build' to 'mozilla-central' (other branches may have a prefix, eg releases/mozilla-aurora)
  • 'Revision to build' to the <revision>
  • 'Property 1': 'Name' to 'buildid', 'Value' to the current Pacific time (eg from running TZ=US/Pacific date +%Y%m%d%H%M%S)

Then click on the 'Force build' button.

The first two are nice-to-have for later debugging. The branch and revisions are required so that TBPL shows the builds while they are pending and running (rather than just after completion). Setting the buildid property the same across all builds helps keep nightlies consistent. An add-on like AutoFill Forms is invaluable in storing these values and filling the form for you, but there is probably a way to do this with curl too (it's a POST).
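Since every forced builder should carry the same buildid, it can help to generate the value once and paste it into each form. The following is a minimal sketch of that idea; `make_buildid` and `is_valid_buildid` are hypothetical helper names, not part of any RelEng tooling.

```shell
# Generate a buildid in Pacific time, using the same date format as the
# 'Property 1' instructions above. Hypothetical helper name.
make_buildid() {
    TZ=US/Pacific date +%Y%m%d%H%M%S
}

# Succeed only if the argument is exactly 14 digits (YYYYMMDDHHMMSS).
is_valid_buildid() {
    case "$1" in
        ""|*[!0-9]*) return 1 ;;
    esac
    [ "${#1}" -eq 14 ]
}

# Generate once, then reuse the printed value for every builder you force.
BUILDID=$(make_buildid)
if is_valid_buildid "$BUILDID"; then
    echo "Use this buildid for all builders: $BUILDID"
fi
```

Generating the value a single time avoids builders ending up a few seconds apart, which would defeat the purpose of a shared buildid.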

Disable updates

See ReleaseEngineering/How_To/Shut_off_all_updates for global shutoff. We use Balrog now for nightly & aurora updates.

Talos

How to update the talos zips

We only need to do this for mobile requests.

This deployment is low-risk: it is NPOTB (not part of the build).

# running this from cruncher is faster than downloading/uploading from your localhost
ssh -A cruncher 
export URL=http://people.mozilla.org/~jmaher/taloszips/zips/talos.07322bbe0f7d.zip
export TALOS_ZIP=`basename $URL`
wget $URL
# relengwebadm has limited access to the internet, which is why we scp from another host
scp ${TALOS_ZIP} relengwebadm.private.scl3.mozilla.com:/mnt/netapp/relengweb/talos-bundles/zips
ssh relengwebadm.private.scl3.mozilla.com "chmod 644 /mnt/netapp/relengweb/talos-bundles/zips/${TALOS_ZIP}"
ssh relengwebadm.private.scl3.mozilla.com "sha1sum /mnt/netapp/relengweb/talos-bundles/zips/${TALOS_ZIP}"
curl -I http://talos-bundles.pvt.build.mozilla.org/zips/${TALOS_ZIP}

For talos.zip changes: Once deployed, notify the a-team and let them know that they can land at their own convenience.

  • Please verify that the shasum matches what is in the [comment]. We have had a few instances where the talos.zip was incorrect.
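The checksum comparison can be scripted rather than eyeballed. This is a sketch only: `verify_sha1` is a hypothetical helper, and the expected digest is whatever value the bug comment quotes.

```shell
# verify_sha1 FILE EXPECTED_SHA1
# Hypothetical helper: compare the SHA-1 of FILE with the digest from the bug.
verify_sha1() {
    actual=$(sha1sum "$1" | cut -d' ' -f1)
    if [ "$actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 has $actual, expected $2" >&2
        return 1
    fi
}

# Intended use (not run here), with the digest copied from the bug:
#   verify_sha1 "$TALOS_ZIP" <sha1-from-bug-comment>
```

The non-zero return on mismatch means the check can gate the `scp` step if the commands are chained with `&&`.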

Update mobile talos webhosts

Keep track of which revision is being run. Copy/paste the output into the bug. Please update our maintenance page.

This could affect mobile talos numbers or break the jobs altogether. Please coordinate with sheriffs

NOTE: There's a great deal of data we cannot check into revision control for legal reasons, so there's an extensive .hgignore file. If you're adding new data to the tree that cannot be checked in, please ask a talos developer/reviewer or [file a bug] to request any [.hgignore] changes.

webapp cluster

ssh relengwebadm.private.scl3.mozilla.com
sudo su -
cd /data/releng/src/talos-remote/www/talos-repo
# NOTICE that we have uncommitted files
hg status
# Take note of the current revision to revert to (just in case)
hg id
hg pull -u
# 488bc187a3ef tip
# ..capture the output here; the remainder will be long and not that useful..
/data/releng/src/talos-remote/update
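To make the "keep track of which revision is being run" step easier, the before and after revisions can be captured and turned into a one-line note for the bug. This is a sketch under assumptions: `format_update_note` is a hypothetical helper, and the usage shown in the comments relies on `hg id -i`, which prints the short revision hash.

```shell
# format_update_note BEFORE AFTER
# Hypothetical helper: print a one-line summary suitable for pasting into the bug.
format_update_note() {
    if [ "$1" = "$2" ]; then
        printf 'talos-repo unchanged at %s\n' "$1"
    else
        printf 'talos-repo updated from %s to %s\n' "$1" "$2"
    fi
}

# Intended use on the webapp host (not run here):
#   before=$(hg id -i)
#   hg pull -u
#   after=$(hg id -i)
#   format_update_note "$before" "$after"
```

Recording the "before" value also gives you the revision to revert to if the update breaks the jobs.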

TBPL

How to deploy changes

RelEng no longer has access to do this. TBPL devs will request a push from Server Ops.

How to hide/unhide builders

  • In the 'Tree Info' menu select 'Open tree admin panel'
  • Filter/select the builders you want to change
  • Save changes
  • Enter the sheriff password and a description (with bug number if available) of your changes
  • CC :edmorley & :philor on the relevant bug so that they know what to expect when sheriffing.

Ganglia

  • If a host is reporting to ganglia incorrectly, restarting gmond may be all it takes to fix it (e.g. bug 674233). Switch to root and run:
service gmond restart

Queue Directories

If you see this in #build:

<nagios-sjc1> [54] buildbot-master12.build.scl1:Command Queue is CRITICAL: 4 dead items

It means that there are items in the "dead" queue for the given master. You need to look at the logs and fix any underlying issue and then retry the command by moving *only* the json file over to the "new" queue. See the Queue directories wiki page for details.
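The "move only the json file" step could be sketched as below. The layout is an assumption: the helper expects `dead` and `new` subdirectories under a queue directory you pass in, and `requeue_dead_items` is a hypothetical name. The Queue directories wiki page remains the authority on the real paths.

```shell
# requeue_dead_items QUEUE_DIR
# Hypothetical helper: move only *.json items from QUEUE_DIR/dead to
# QUEUE_DIR/new, leaving log files and anything else in dead/ untouched.
# Prints the number of items moved.
requeue_dead_items() {
    queue_dir=$1
    moved=0
    for item in "$queue_dir"/dead/*.json; do
        [ -e "$item" ] || continue   # the glob matched nothing
        mv -- "$item" "$queue_dir/new/" && moved=$((moved + 1))
    done
    echo "$moved"
}
```

Only run something like this after reading the logs and fixing the underlying issue; otherwise the items will simply land in the dead queue again.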

Cruncher

If you get an alert about cruncher running out of space it might be a sendmail issue (backed up emails taking up too much space and not getting sent out):

<nagios-sjc1> [07] cruncher.build.sjc1:disk - / is WARNING: DISK WARNING - free space: / 384 MB (5% inode=93%):

As root:

du -s -h /var/spool/*
# confirm that mqueue or clientmqueue is the oversized culprit
# stop sendmail, clean out the queues, restart sendmail
/etc/init.d/sendmail stop
rm -rf /var/spool/clientmqueue/*
rm -rf /var/spool/mqueue/*
/etc/init.d/sendmail start
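Before deleting anything, it is worth confirming which spool directory is actually the large one. A minimal sketch, assuming GNU or BSD `du` and `sort`; `largest_subdir` is a hypothetical helper name.

```shell
# largest_subdir DIR
# Hypothetical helper: print the entry directly under DIR that uses the most disk.
largest_subdir() {
    # du -sk prints "<kilobytes><TAB><path>"; sort numerically, largest first.
    du -sk "$1"/* 2>/dev/null | sort -rn | head -n 1 | cut -f2
}

# Intended use as root on cruncher (not run here):
#   largest_subdir /var/spool
```

If the result is neither `mqueue` nor `clientmqueue`, the sendmail cleanup above is probably not the right fix.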

hg<->git conversion

This is a production system that RelEng built but that has not yet transitioned to full IT operation. As a production system, it is supported 24x7x365; escalate to IT oncall (who can page) as needed.

We'll get problem reports from 2 sources:

  • via email from vcs2vcs user to release+vcs2vcs@m.c - see email handling instructions for those.
  • via a bug report for a customer visible condition - this should only be if there is a new error we aren't detecting ourselves. See the resources below and/or page hwine.

Documentation for this system:

All services run as user vcs2vcs on one of the following hosts (as of 2013-01-07): github-sync1-dev.dmz.scl3.mozilla.com, github-sync1.dmz.scl3.mozilla.com, github-sync2.dmz.scl3.mozilla.com, github-sync3.dmz.scl3.mozilla.com.

Handling alert_major_errors

# SSH as yourself to the hostname in the 'from' address of the alert_major_errors email.
$ ssh yourname@github-sync3.dmz.scl3.mozilla.com
$ sudo su - vcs2vcs
$ cd etc
# find the repo name that vcs2vcs is complaining about. For example:
$ grep releases-mozilla-central-no-cvs *
job02_cmds:#    "hg:$HOME/repos/releases-mozilla-central-no-cvs" "github"
# discover where that job runs
$ grep job02 status
job02_cmds,github-sync3.dmz.scl3.mozilla.com,m-c w/o cvs as used by b2g
# connect to that host the same as we did above (if not already connected)
# then
$ cd logs/job02 # same job as above
$ show_update_errors update.log
# Note: the command exit code precedes the command itself
# eg. ...;255;hg --cwd...

Continue with instructions here.
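The "discover where that job runs" lookup can be wrapped in a small function. This sketch assumes the status file keeps the comma-separated shape shown in the transcript above (`job02_cmds,host,description`); `job_host` is a hypothetical helper name.

```shell
# job_host STATUS_FILE JOB
# Hypothetical helper: print the host that runs JOB, reading lines of the
# form "<job>_cmds,<host>,<description>".
job_host() {
    awk -F, -v job="$2" '$1 == (job "_cmds") { print $2 }' "$1"
}

# Intended use as vcs2vcs in ~/etc (not run here):
#   job_host status job02
```

Matching on the exact first field avoids false hits when the job name also appears in a description.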

disable/re-enable aurora updates

This is taken care of by the person doing the final release, since merge-day activities happen on the Monday before the release.

Upload

Python packages

Warning: Mozharness no longer uses packages from the PuppetAgain repositories! Instead, it uses http://pypi.pub.build.mozilla.org/pub and http://pypi.pvt.build.mozilla.org/pub, both served from the same directory.

From your local machine:

FILE=your_python_package.tar.gz
scp $FILE $LDAP_SHORT_USERNAME@relengwebadm.private.scl3.mozilla.com:

From relengwebadm:

ssh $LDAP_SHORT_USERNAME@relengwebadm.private.scl3.mozilla.com

FILE=your_python_package.tar.gz
sudo mv -vi $FILE /mnt/netapp/relengweb/pypi/pub/
sudo chmod 644 /mnt/netapp/relengweb/pypi/pub/$FILE

From your local machine:

curl -I http://pypi.pub.build.mozilla.org/pub/$FILE
# You should see "HTTP/1.1 200 OK"
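The final check can be automated by testing the status line instead of reading it by eye. A minimal sketch; `is_http_200` is a hypothetical helper that reads response headers on standard input.

```shell
# is_http_200
# Hypothetical helper: read HTTP response headers on stdin and succeed only
# if the first line reports status 200.
is_http_200() {
    head -n 1 | grep -Eq '^HTTP/[0-9.]+ 200'
}

# Intended use from your local machine (not run here):
#   curl -sI "http://pypi.pub.build.mozilla.org/pub/$FILE" | is_http_200 \
#       && echo "package is being served"
```

Here `-s` silences curl's progress output and `-I` requests headers only, as in the manual check above.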

How to upload to Tooltool

If you don't want to upload from your own laptop (because, eg, you have a slow uplink) you can do this from cruncher.

Access cruncher with your credentials:

ME="your_short_ldap_username" # or `whoami`
ssh -A $ME@cruncher.build.mozilla.org

Download the file and then:

scp filename.tar.xz $ME@relengwebadm.private.scl3.mozilla.com:

Log in to relengwebadm:

ssh $ME@relengwebadm.private.scl3.mozilla.com

And deploy the file to tooltool:

FILE=~/emulator.zip # or whatever you're uploading
export SHA512=`openssl sha512 $FILE | cut -d' ' -f2`
sudo mv -i $FILE /mnt/netapp/relengweb/tooltool/pvt/build/sha512/${SHA512}
sudo chmod 644  /mnt/netapp/relengweb/tooltool/pvt/build/sha512/${SHA512}
ls -l  /mnt/netapp/relengweb/tooltool/pvt/build/sha512/${SHA512}
  • Add the filename, filesize, and sha512 digest to the bug you are working on. These can be added to the tooltool manifests later.
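The three values for the bug can be collected in one step. This is a sketch, not the established procedure: `tooltool_entry` is a hypothetical helper, and it uses `sha512sum` rather than the `openssl sha512` call above, which should yield the same digest.

```shell
# tooltool_entry FILE
# Hypothetical helper: print the filename, size in bytes, and sha512 digest
# of FILE in a form that can be pasted into the bug.
tooltool_entry() {
    file=$1
    size=$(wc -c < "$file" | tr -d ' ')
    digest=$(sha512sum "$file" | cut -d' ' -f1)
    printf 'filename: %s\nsize: %s\nsha512: %s\n' \
        "$(basename "$file")" "$size" "$digest"
}

# Intended use on relengwebadm (not run here):
#   tooltool_entry ~/emulator.zip
```

Run it before the `mv`, because afterwards the file lives under its digest and the original filename is no longer on disk.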

How to upload Talos ZIPs

See How to update the talos zips.

How to add NPM packages

See ReleaseEngineering/How To/Mirror NPM Packages