CIDuty:QuarantineInstances: Difference between revisions

From MozillaWiki
(Smoothed out the page)
m (added link to quarantine multiple instances)
 
(8 intermediate revisions by 3 users not shown)
{{Build Duty How To|Quarantine Instances}}
===== When and How to quarantine taskcluster instances =====


= When and How to quarantine instances =
: 1. Choose the worker types you wish to investigate. You can find them [https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/ here.]


: 2. Check which instances show exception (orange) or failed (red) as their task state, and investigate each of them separately.


[[File:Worker List.png|Worker List.png]]




Quarantine the ones whose last 10+ tasks resolved as exception (orange), and accept the default 1000-year quarantine; RelOps can lift the quarantine once the instances are fixed.
===== Log analyses and quarantine the machine  =====


File a bug in Bugzilla under RelOps similar to this [https://bugzilla.mozilla.org/show_bug.cgi?id=1441820 bug example.]
: 1. If the last 4-5+ tests are problematic, inspect a few of them.
 
[[File:Test name.png]]
 
:: Check public/logs/live_backing.log for errors on a few of the latest tests by going to <Test-Name> -> Run Artifacts -> public/logs/live_backing.log as shown below:
 
[[File:Log location.png]]
 
: 2. Judge from the error logs whether the machine is faulty, and quarantine it if so. There is no hard rule for this; the judgment comes from experience. So far, a machine is known to be faulty when the above conditions are met and the error log terminates with error code -1 and a message like:
 
[[File:Error log.png]]
 
 
: 3. Quarantine any machine for which all of the above is true by pressing the Quarantine button and leaving the default expiration of 1000 years, as shown below:
 
[[File:Quarantine pic.png|center]]
 
===== Bugzilla =====
 
: 1. Check whether a bug already exists for the affected machine on [https://bugzilla.mozilla.org/ Bugzilla], under CIDuty and/or RelOps, using the keywords
 
ALL machine_name
: 2. If a bug already exists, update it with a comment saying that you have quarantined the machine, and give the reason.
 
: 3. If no bug exists, file one in Bugzilla under [https://bugzilla.mozilla.org/enter_bug.cgi?product=Infrastructure%20%26%20Operations&component=RelOps%3A%20Hardware RelOps]
 
===== Update Moonshot inventory =====
 
: 1. Update the [https://docs.google.com/spreadsheets/d/1IPTmppvqDw0PQV-O1LgXLJg_7TC-H_IAAnSxcur8c7I/edit?ts=5ad7748a#gid=562893333 Master Moonshot Inventory] spreadsheet with the bug number for the affected machine.
 
See also how to quarantine one or multiple machines using the [[CIDuty/How_To/QuarantineMultipleInstances|taskcluster cli]].

Latest revision as of 13:47, 23 April 2019

When and How to quarantine taskcluster instances
1. Choose the worker types you wish to investigate. You can find them here.
2. Check which instances show exception (orange) or failed (red) as their task state, and investigate each of them separately.

Worker List.png
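
The check in step 2 can be sketched in a few lines of Python. This is only an illustration: the record layout and the field names worker_id and latest_task_state are assumptions made for this example, not the schema of any Taskcluster response, so adapt them to however you collect the worker list.

```python
# Hypothetical worker records: the field names below are assumptions for
# illustration only, not the schema of a Taskcluster API response.
PROBLEM_STATES = {"exception", "failed"}


def workers_to_investigate(workers):
    """Return the IDs of workers whose latest task ended as exception or failed."""
    return [
        worker["worker_id"]
        for worker in workers
        if worker["latest_task_state"] in PROBLEM_STATES
    ]


workers = [
    {"worker_id": "worker-a", "latest_task_state": "completed"},
    {"worker_id": "worker-b", "latest_task_state": "exception"},
    {"worker_id": "worker-c", "latest_task_state": "failed"},
]
print(workers_to_investigate(workers))  # ['worker-b', 'worker-c']
```

Each ID returned here still needs to be investigated separately, as the step says; the filter only narrows down where to look.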


Log analyses and quarantine the machine
1. If the last 4-5+ tests are problematic, inspect a few of them.

Test name.png

Check public/logs/live_backing.log for errors on a few of the latest tests by going to <Test-Name> -> Run Artifacts -> public/logs/live_backing.log as shown below:

Log location.png

2. Judge from the error logs whether the machine is faulty, and quarantine it if so. There is no hard rule for this; the judgment comes from experience. So far, a machine is known to be faulty when the above conditions are met and the error log terminates with error code -1 and a message like:

Error log.png


3. Quarantine any machine for which all of the above is true by pressing the Quarantine button and leaving the default expiration of 1000 years, as shown below:
Quarantine pic.png
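
The rule of thumb in steps 1 to 3 can be written down as a rough sketch. Everything here is an assumption for illustration: the inputs are values you read by hand from the worker page and from live_backing.log, the streak length of 4 is taken from the "4-5+" guidance, and the quarantineUntil key is assumed to match what the Taskcluster queue expects when a worker is quarantined. The sketch does not replace the judgment the steps ask for.

```python
from datetime import datetime, timezone

PROBLEM_STATES = {"exception", "failed"}


def is_quarantine_candidate(recent_states, log_exit_code, min_streak=4):
    """Apply the rule of thumb to values gathered by hand.

    recent_states: task states for one machine, newest first (hypothetical input).
    log_exit_code: exit code read from the end of live_backing.log.
    """
    if len(recent_states) < min_streak:
        return False
    streak = all(state in PROBLEM_STATES for state in recent_states[:min_streak])
    return streak and log_exit_code == -1


def quarantine_until(now=None, years=1000):
    """Return an ISO 8601 timestamp `years` after `now` (expected to be in UTC)."""
    now = now or datetime.now(timezone.utc)
    try:
        target = now.replace(year=now.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap target year.
        target = now.replace(year=now.year + years, day=28)
    return target.strftime("%Y-%m-%dT%H:%M:%SZ")


states = ["exception", "exception", "failed", "exception", "completed"]
if is_quarantine_candidate(states, log_exit_code=-1):
    start = datetime(2019, 4, 23, 13, 47, tzinfo=timezone.utc)
    payload = {"quarantineUntil": quarantine_until(start)}
    print(payload)  # {'quarantineUntil': '3019-04-23T13:47:00Z'}
```

Pressing the Quarantine button with the default value achieves the same result; the timestamp helper only shows what "1000 years" amounts to as an expiration date.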
Bugzilla
1. Check whether a bug already exists for the affected machine on Bugzilla, under CIDuty and/or RelOps, using the keywords
ALL machine_name 
2. If a bug already exists, update it with a comment saying that you have quarantined the machine, and give the reason.
3. If no bug exists, file one in Bugzilla under RelOps
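
The search in step 1 can also be expressed as a URL for the Bugzilla REST interface. The sketch below only builds the URL and sends no request. It assumes that the "ALL machine_name" keywords are passed through the REST quicksearch parameter, and the machine name example-machine-001 is a made-up placeholder.

```python
from urllib.parse import urlencode

BUGZILLA_REST = "https://bugzilla.mozilla.org/rest/bug"


def bug_search_url(machine_name):
    """Build a search URL for bugs that mention the given machine name."""
    query = {
        # "ALL" widens a quick search to closed bugs as well as open ones.
        "quicksearch": f"ALL {machine_name}",
        # Limit the response to the fields needed to spot an existing bug.
        "include_fields": "id,summary,status",
    }
    return f"{BUGZILLA_REST}?{urlencode(query)}"


# "example-machine-001" is a hypothetical name used only for illustration.
url = bug_search_url("example-machine-001")
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return matching bugs as JSON, which makes it easy to see whether step 2 or step 3 applies.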
Update Moonshot inventory
1. Update the Master Moonshot Inventory spreadsheet with the bug number for the affected machine.

See also how to quarantine one or multiple machines using the taskcluster cli.