Flume ElasticSearch WOO Maintenance Page

From MozillaWiki
Revision as of 16:15, 7 February 2011 by Aphadke (talk | contribs)

The Flume WOO project ingests buildbot logs in real time into HDFS/Hive and ElasticSearch via Flume. This page describes the machines involved, the software installed on them, and the steps to restart services.

ElasticSearch cluster:
elasticsearch1.metrics.sjc1.mozilla.com (master)
elasticsearch2.metrics.sjc1.mozilla.com (slave)
elasticsearch3.metrics.sjc1.mozilla.com (slave)

Symptom: A Nagios ElasticSearch alert indicates that one or more machines are down.
Fix: Log in to the affected machine(s) and kill all running elasticsearch processes (find them with ps ax|grep elasticsearch).
Restart the services in the following order: elasticsearch1, elasticsearch2, elasticsearch3.
Restart command: /usr/lib/es/bin/elasticsearch
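
The kill-and-restart steps above can be sketched as two small shell helpers. This is only an illustration, not a script that exists on these machines: the function names es_restart_order and es_pids are hypothetical, and the PID filter assumes that every process whose ps line contains "elasticsearch" (other than the grep itself) should be killed.

```shell
#!/bin/sh
# Hosts in the restart order given on this page (master first).
ES_HOSTS="elasticsearch1.metrics.sjc1.mozilla.com
elasticsearch2.metrics.sjc1.mozilla.com
elasticsearch3.metrics.sjc1.mozilla.com"

# Hypothetical helper: print the hosts, one per line, in restart order.
es_restart_order() {
    printf '%s\n' "$ES_HOSTS"
}

# Hypothetical helper: read `ps ax` output on stdin and print the PID
# (first column) of every elasticsearch process, skipping the grep
# process that the pipeline itself creates.
es_pids() {
    grep elasticsearch | grep -v grep | awk '{print $1}'
}
```

On each machine, in the order printed by es_restart_order, one might run `for pid in $(ps ax | es_pids); do kill "$pid"; done` and then the restart command /usr/lib/es/bin/elasticsearch. Reviewing the PID list before killing anything is advisable, since the filter matches on the process name only.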

Please email aphadke@mozilla.com or deinspanjer@mozilla.com if the problem persists.


Flume cluster:
elasticsearch3.metrics.sjc1.mozilla.com (master)
elasticsearch4.metrics.sjc1.mozilla.com (node-collector)
elasticsearch5.metrics.sjc1.mozilla.com (node-agent)

Symptom: Nagios Flume alert indicates a given machine is down.
Hostname: elasticsearch4.metrics.sjc1.mozilla.com
Stop Flume: /usr/lib/flume/bin/flume-daemon.sh stop
Confirm Flume has stopped (ps ax|grep flume), else kill -9 the pid.
Start Flume: /usr/lib/flume/bin/flume-daemon.sh start node_nowatch -n elasticsearch4.metrics.sjc1.mozilla.com

Symptom: Nagios Flume alert indicates a given machine is down.
Hostname: elasticsearch5.metrics.sjc1.mozilla.com
Stop Flume: /usr/lib/flume/bin/flume-daemon.sh stop
Confirm Flume has stopped (ps ax|grep flume), else kill -9 the pid.
Start Flume: /usr/lib/flume/bin/flume-daemon.sh start node -n elasticsearch5.metrics.sjc1.mozilla.com
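
The two node procedures differ only in the start argument (node_nowatch on elasticsearch4, node on elasticsearch5), so the stop, confirm, and start sequence can be sketched once. The function names flume_start_cmd and flume_restart are hypothetical, the 5-second grace period is an assumption, and the sketch deliberately refuses any other host, since this page asks for investigation before the master is restarted.

```shell
#!/bin/sh
FLUME_DAEMON="/usr/lib/flume/bin/flume-daemon.sh"

# Hypothetical helper: print the start command this page gives for a node.
# Any other host (including the master) is rejected.
flume_start_cmd() {
    case "$1" in
        elasticsearch4.metrics.sjc1.mozilla.com)
            echo "$FLUME_DAEMON start node_nowatch -n $1" ;;
        elasticsearch5.metrics.sjc1.mozilla.com)
            echo "$FLUME_DAEMON start node -n $1" ;;
        *)
            echo "not a Flume node covered by this sketch: $1" >&2
            return 1 ;;
    esac
}

# Hypothetical helper: stop Flume, force-kill anything left, then start it.
# Assumes it runs on the node itself and that the calling script's own
# name does not contain "flume" (otherwise the ps filter would match it).
flume_restart() {
    start_cmd=$(flume_start_cmd "$1") || return 1
    "$FLUME_DAEMON" stop
    sleep 5   # grace period before checking; the value is an assumption
    # '[f]lume' matches "flume" but not the grep process itself.
    for pid in $(ps ax | grep '[f]lume' | awk '{print $1}'); do
        kill -9 "$pid"
    done
    $start_cmd   # unquoted on purpose so the command and its arguments split
}
```

Running `flume_restart elasticsearch4.metrics.sjc1.mozilla.com` on that machine would then follow the same sequence as the manual steps.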


Symptom: Nagios Flume alert indicates a given machine is down.
Hostname: elasticsearch3.metrics.sjc1.mozilla.com
Resolution: Please email aphadke@mozilla.com (213-509-0575) or deinspanjer@mozilla.com. While we can restart the Flume master, a master going down might indicate deeper problems. Because Flume is still a young project, it is best to investigate further before simply restarting it.
Stop Flume: /usr/lib/flume/bin/flume-daemon.sh stop
Confirm Flume has stopped (ps ax|grep flume), else kill -9 the pid.
Start Flume: /usr/lib/flume/bin/flume-daemon.sh start master