Flume ElasticSearch WOO Maintenance Page

From MozillaWiki
Revision as of 16:15, 7 February 2011

The Flume WOO project provides real-time ingestion of buildbot logs into HDFS/Hive and ElasticSearch via Flume. This page describes the machines involved, the installed software, and the steps to restart services.

ElasticSearch cluster:
elasticsearch1.metrics.sjc1.mozilla.com (master)
elasticsearch2.metrics.sjc1.mozilla.com (slave)
elasticsearch3.metrics.sjc1.mozilla.com (slave)

Symptom: A Nagios ElasticSearch alert indicates that one or more machines are down.
Fix: Log in to the affected machine(s). Find any running elasticsearch processes (ps ax|grep elasticsearch) and kill them.
Restart the services in the following order: elasticsearch1, elasticsearch2, elasticsearch3.
Restart command: /usr/lib/es/bin/elasticsearch
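
The kill-and-restart step above can be sketched as a small shell helper to run on each machine in turn. This is only an illustration under stated assumptions: the restart path is the one given on this page, while the function names, the ES_BIN and DRY_RUN variables, and the dry-run default are hypothetical additions. It uses grep '[e]lasticsearch' rather than a plain grep so that the grep process does not match itself.

```shell
#!/bin/sh
# Sketch only: kill local elasticsearch processes, then start the service.
# ES_BIN is the restart command from this page; DRY_RUN, run, es_pids and
# restart_es are assumed names introduced for illustration.
ES_BIN="${ES_BIN:-/usr/lib/es/bin/elasticsearch}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print what would be done

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

es_pids() {
    # The [e] bracket stops grep from matching its own entry in the ps
    # output; the awk filter skips this script's own PID.
    ps ax | grep '[e]lasticsearch' | awk -v self="$$" '$1 != self {print $1}'
}

restart_es() {
    # Kill every running elasticsearch process, then start the service.
    for pid in $(es_pids); do
        run kill "$pid"
    done
    run "$ES_BIN"
}
```

Source the file and call restart_es to see what would happen; set DRY_RUN=0 to actually execute the commands. Repeat on elasticsearch1, elasticsearch2, and elasticsearch3 in that order.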

If the problem persists, please email aphadke@mozilla.com or deinspanjer@mozilla.com.


Flume cluster:
elasticsearch3.metrics.sjc1.mozilla.com (master)
elasticsearch4.metrics.sjc1.mozilla.com (node-collector)
elasticsearch5.metrics.sjc1.mozilla.com (node-agent)

Symptom: A Nagios Flume alert indicates that a given machine is down.
Hostname: elasticsearch4.metrics.sjc1.mozilla.com
Stop Flume: /usr/lib/flume/bin/flume-daemon.sh stop
Confirm that Flume has stopped (ps ax|grep flume); if it has not, kill -9 the pid.
Start Flume: /usr/lib/flume/bin/flume-daemon.sh start node_nowatch -n elasticsearch4.metrics.sjc1.mozilla.com

Symptom: A Nagios Flume alert indicates that a given machine is down.
Hostname: elasticsearch5.metrics.sjc1.mozilla.com
Stop Flume: /usr/lib/flume/bin/flume-daemon.sh stop
Confirm that Flume has stopped (ps ax|grep flume); if it has not, kill -9 the pid.
Start Flume: /usr/lib/flume/bin/flume-daemon.sh start node -n elasticsearch5.metrics.sjc1.mozilla.com
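
The stop, confirm, and start steps for the two Flume nodes can be sketched as one shell helper. Treat it as an illustration, not as the maintainers' script: the flume-daemon.sh path, the start modes (node_nowatch for elasticsearch4, node for elasticsearch5), and the hostnames come from this page, while the function names, the DRY_RUN switch, and the 5-second pause before checking for leftover processes are assumptions. The master (elasticsearch3) is deliberately left out, since this page asks for it to be investigated before any restart.

```shell
#!/bin/sh
# Sketch only: restart a Flume node (collector or agent) on the local machine.
# FLUME_DAEMON and the start modes are taken from this page; DRY_RUN, the
# function names and the 5-second pause are assumptions for illustration.
FLUME_DAEMON="${FLUME_DAEMON:-/usr/lib/flume/bin/flume-daemon.sh}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print what would be done

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

flume_pids() {
    # [f]lume keeps grep from matching itself; the awk filter skips this
    # script's own PID in case its file name contains "flume".
    ps ax | grep '[f]lume' | awk -v self="$$" '$1 != self {print $1}'
}

flume_start_mode() {
    # Map each node to the start mode listed on this page.
    case "$1" in
        elasticsearch4.metrics.sjc1.mozilla.com) echo "node_nowatch" ;;
        elasticsearch5.metrics.sjc1.mozilla.com) echo "node" ;;
        *) return 1 ;;
    esac
}

restart_flume_node() {
    host="$1"
    mode="$(flume_start_mode "$host")" || {
        echo "not a known Flume node: $host" >&2
        return 1
    }
    run "$FLUME_DAEMON" stop
    run sleep 5                      # give the daemon time to exit
    for pid in $(flume_pids); do     # anything still alive gets kill -9
        run kill -9 "$pid"
    done
    run "$FLUME_DAEMON" start "$mode" -n "$host"
}
```

For example, restart_flume_node elasticsearch4.metrics.sjc1.mozilla.com prints the planned commands; set DRY_RUN=0 to execute them.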


Symptom: A Nagios Flume alert indicates that a given machine is down.
Hostname: elasticsearch3.metrics.sjc1.mozilla.com
Resolution: Please email aphadke@mozilla.com (213-509-0575) or deinspanjer@mozilla.com. While we can restart the Flume master, a master going down might indicate deeper problems. Because Flume is still immature, it is best to investigate further before simply restarting it.
Stop Flume: /usr/lib/flume/bin/flume-daemon.sh stop
Confirm that Flume has stopped (ps ax|grep flume); if it has not, kill -9 the pid.
Start Flume: /usr/lib/flume/bin/flume-daemon.sh start master