Connect and Troubleshoot workers in CI: Difference between revisions

If we cannot SSH into OSX nodes, we can try to restart them from Taskcluster. If they are not visible in the Taskcluster worker explorer, you can create them using this version of the [https://github.com/davehouse/relops-infra/blob/quarantine_nonexisting/quarantine_tc.py quarantine script], which adds/defines a worker if it is missing.
* Step 1: [[BuildDuty:TaskClusterCli|Connect to Taskcluster CLI]]
* Step 2: Run the quarantine script, e.g. <pre>python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g mdc2 t-yosemite-r7-449</pre>
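When several machines are missing from the worker explorer, the Step 2 command can be generated per host instead of typed by hand. The following is a minimal sketch, not part of the quarantine script itself: the helper names and the parameter names <code>provisioner</code>, <code>worker_type</code> and <code>group</code> are assumptions based on the <code>-p</code>, <code>-w</code> and <code>-g</code> flags in the example above, and it emits one command per host because we have not checked whether the script accepts several hosts at once.

```python
import shlex


def quarantine_enable_command(host, provisioner="releng-hardware",
                              worker_type="gecko-t-osx-1010", group="mdc2"):
    """Build the argument list of the Step 2 command for a single host.

    The defaults are copied from the example command on this page; the
    parameter names are our own labels for the -p, -w and -g flags.
    """
    return ["python", "quarantine_tc.py", "--enable",
            "-p", provisioner, "-w", worker_type, "-g", group, host]


def quarantine_enable_lines(hosts, **kwargs):
    """Return one shell-quoted command line per host."""
    return [shlex.join(quarantine_enable_command(host, **kwargs))
            for host in hosts]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before running them.
    for line in quarantine_enable_lines(["t-yosemite-r7-449"]):
        print(line)
```

The sketch only prints the commands; review them and run them from the directory that contains quarantine_tc.py after connecting to the Taskcluster CLI as in Step 1.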


After the steps above, the worker explorer will show the machine and you can reboot it from there, using [[ReleaseEngineering/How To/RelOps Hardware Controller (Roller)|roller]].


If the issue is not fixed (the machine does not take jobs and SSH is still not working), create a bug for DCOps to physically reboot and reimage/netboot the machines. The Automatic Bug Generator will create a bug for DCOps if the restart fails.
= Taskcluster Checker =


Using the [https://github.com/Akhliskun/taskcluster-worker-checker client.py] script from the GitHub repository, you can find all TC workers that are missing and need to be debugged.


The [https://github.com/Akhliskun/taskcluster-worker-checker/blob/master/README.md README] file explains how to use the checker.


= Machine Quick Check =
Here are a few methods to check a worker:
* Check the problem tracking bug, e.g. [https://bugzilla.mozilla.org/buglist.cgi?quicksearch=ALL%20T-W1064-MS-072&list_id=14303312 problem tracking bug for T-W1064-MS-072]
* Check the node definition in the puppet repo, e.g. [https://github.com/mozilla-releng/build-puppet/search?q=T-W1064-MS-072&unscoped_q=T-W1064-MS-072 node definition for T-W1064-MS-072]
* Look into Papertrail for logs, e.g. [https://papertrailapp.com/systems/2238171932/events Papertrail logs]
* Check if the host responds to ping.
* Connect to the worker using SSH:
** check if the worker process is running: <pre>ps -ef|grep</pre>
** check CPU usage: <pre>top -u</pre> to see if there is high CPU usage from something other than python or firefox
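The ping and process checks above can be scripted when many workers need a quick look. This is a rough sketch under stated assumptions: the function names are hypothetical, the process name passed to <code>matching_processes</code> is whatever worker process you expect on that machine (this page does not name it), and the process check works on text you have already collected with <code>ps -ef</code> over SSH.

```python
import platform
import subprocess


def ping_command(host, count=1):
    """Build a ping command; Windows uses -n for the count, others use -c."""
    flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", flag, str(count), host]


def host_responds(host, timeout=10, run=subprocess.run):
    """Return True if a single ping to the host succeeds.

    `run` is injectable so the function can be exercised without a network.
    """
    try:
        result = run(ping_command(host), stdout=subprocess.DEVNULL,
                     stderr=subprocess.DEVNULL, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        # Ping took too long, or the ping binary is not available.
        return False
    return result.returncode == 0


def matching_processes(ps_output, name):
    """Return the lines of `ps -ef` output that mention `name`.

    Lines containing "grep" are skipped, like `grep -v grep`, so that the
    search command itself is not reported as a running worker.
    """
    return [line for line in ps_output.splitlines()
            if name in line and "grep" not in line]
```

For example, <code>host_responds("t-yosemite-r7-449")</code> tells you whether the machine answers ping at all, and an empty list from <code>matching_processes</code> suggests the worker process is not running.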


= Rebooting workers =
How to re-image:
* Windows MS: [https://mana.mozilla.org/wiki/display/RelEng/How+To+Image+or+Reimage+a+Windows+Linux+Server How To Image or Reimage a Windows Linux Server]
* Linux MS: [https://mana.mozilla.org/wiki/display/DC/How+To+Reimage+Releng+HP+Moonshot+Linux+Machines How To Reimage Releng HP Moonshot Linux Machines]
* Mac OS X: [https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=34014655 How To Reimage Mac Minis &#91;Remotely&#93;]


= SSH not working =
* Step 1: Check the Papertrail logs.
* Step 2: Reboot it from Taskcluster. It may have old ''auth_keys'' or may not have completed re-imaging.
* Step 3: File a problem tracking bug or update the existing problem tracking bug.
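Before rebooting, it can help to know whether the SSH port is reachable at all. A refused or timed-out connection points at a machine that is down or mid re-image, while an open port with failing logins fits the old ''auth_keys'' case. The sketch below is an illustration only; the function name and the default port 22 are assumptions, not something this page specifies.

```python
import socket


def ssh_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to the SSH port can be opened.

    This only checks that something accepts connections on the port; it
    does not attempt to log in.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False
```

If this returns False and the Papertrail logs show nothing recent, a reboot from Taskcluster (Step 2) is the next thing to try.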