Flume ElasticSearch WOO Maintenance Page

From MozillaWiki
Revision as of 16:12, 7 February 2011 by Aphadke

The Flume WOO project ingests buildbot logs into HDFS/Hive and ElasticSearch in real time via Flume. This page describes the machines involved, the software installed on them, and the steps to restart services.

ElasticSearch cluster:
elasticsearch1.metrics.sjc1.mozilla.com (master)
elasticsearch2.metrics.sjc1.mozilla.com (slave)
elasticsearch3.metrics.sjc1.mozilla.com (slave)

Symptom: Nagios ElasticSearch alert indicates one (or many) machines are down.
Fix:
1. Log in to the relevant machine(s).
2. Kill all running elasticsearch processes (find them with ps ax|grep elasticsearch).
3. Restart the services in the following order: elasticsearch1, elasticsearch2, elasticsearch3.
Restart command: /usr/lib/es/bin/elasticsearch
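The ordering rule in the fix above can be sketched as a small dry-run helper. This is only an illustration: it prints a plan and executes nothing. The host names, the start order, and the restart command come from this page; the es_plan function and its output format are hypothetical, and running all kill steps before any start step is an assumed reading of the instructions.

```shell
#!/bin/sh
# Dry-run sketch: prints a restart plan, executes nothing.
# Host names, start order, and start command are taken from this page.
# es_plan and its output format are hypothetical.

ES_ORDER="elasticsearch1.metrics.sjc1.mozilla.com
elasticsearch2.metrics.sjc1.mozilla.com
elasticsearch3.metrics.sjc1.mozilla.com"
ES_START="/usr/lib/es/bin/elasticsearch"

# es_plan HOST...
# Prints a "kill" line for every affected host, then "start" lines in the
# documented order (1, 2, 3), whatever order the arguments were given in.
es_plan() {
    for phase in kill start; do
        for host in $ES_ORDER; do
            for wanted in "$@"; do
                if [ "$wanted" = "$host" ]; then
                    if [ "$phase" = kill ]; then
                        echo "$host: kill all elasticsearch processes"
                    else
                        echo "$host: $ES_START"
                    fi
                fi
            done
        done
    done
}
```

For example, es_plan elasticsearch3.metrics.sjc1.mozilla.com elasticsearch1.metrics.sjc1.mozilla.com lists the kill steps for both hosts first, then the start step for elasticsearch1 ahead of the one for elasticsearch3.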

Please email aphadke@mozilla.com or deinspanjer@mozilla.com if the problem persists.


Flume cluster:
elasticsearch3.metrics.sjc1.mozilla.com (master)
elasticsearch4.metrics.sjc1.mozilla.com (node-collector)
elasticsearch5.metrics.sjc1.mozilla.com (node-agent)

Symptom: Nagios Flume alert indicates a given machine is down.
Hostname: elasticsearch4.metrics.sjc1.mozilla.com
Stop Flume: /usr/lib/flume/bin/flume-daemon.sh stop
Confirm Flume has stopped (ps ax|grep flume); otherwise kill -9 the pid.
Start Flume: /usr/lib/flume/bin/flume-daemon.sh start node_nowatch -n elasticsearch4.metrics.sjc1.mozilla.com

Symptom: Nagios Flume alert indicates a given machine is down.
Hostname: elasticsearch5.metrics.sjc1.mozilla.com
Stop Flume: /usr/lib/flume/bin/flume-daemon.sh stop
Confirm Flume has stopped (ps ax|grep flume); otherwise kill -9 the pid.
Start Flume: /usr/lib/flume/bin/flume-daemon.sh start node -n elasticsearch5.metrics.sjc1.mozilla.com
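The stop, confirm, start sequence for the two node hosts above could be wrapped roughly as follows. This is a sketch under assumptions, not a tested production script: the function names, the 5-second wait, and the FLUME_DAEMON override are hypothetical, while the daemon path and start arguments are the ones listed on this page. It deliberately does not run kill -9 itself; it reports leftover pids and leaves that step to the operator. The Flume master (elasticsearch3) is excluded on purpose.

```shell
#!/bin/sh
# Sketch: restart a Flume node (not the master) using the commands on this page.
# Function names, the sleep duration, and the FLUME_DAEMON override are assumptions.

FLUME_DAEMON="${FLUME_DAEMON:-/usr/lib/flume/bin/flume-daemon.sh}"

# flume_start_cmd HOST
# Prints the documented start command for a node host, or fails for any
# other host (including the master, which should be investigated first).
flume_start_cmd() {
    case "$1" in
        elasticsearch4.*) echo "$FLUME_DAEMON start node_nowatch -n $1" ;;
        elasticsearch5.*) echo "$FLUME_DAEMON start node -n $1" ;;
        *) echo "no node start command documented for $1" >&2; return 1 ;;
    esac
}

# restart_flume_node HOST
# Run on the affected host itself. If this code lives in a file whose name
# contains "flume", the leftover check below will match the script too and
# the function will stop with status 2 instead of starting Flume.
restart_flume_node() {
    start_cmd=$(flume_start_cmd "$1") || return 1
    "$FLUME_DAEMON" stop
    sleep 5  # assumed grace period before checking
    # Same check as the page; '[f]lume' keeps grep from matching itself.
    leftover=$(ps ax | grep '[f]lume' | awk '{print $1}')
    if [ -n "$leftover" ]; then
        echo "still running, kill -9 these pids and rerun:" $leftover >&2
        return 2
    fi
    $start_cmd
}
```

On elasticsearch4, for instance, the call would be restart_flume_node elasticsearch4.metrics.sjc1.mozilla.com.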


Symptom: Nagios Flume alert indicates a given machine is down.
Hostname: elasticsearch3.metrics.sjc1.mozilla.com
Resolution: Please email aphadke@mozilla.com (213-509-0575) or deinspanjer@mozilla.com. While we can restart the Flume master, a master going down might indicate deeper problems. Because Flume is still an immature project, it's best to investigate further before just restarting it.
Stop Flume: /usr/lib/flume/bin/flume-daemon.sh stop
Confirm Flume has stopped (ps ax|grep flume); otherwise kill -9 the pid.
Start Flume: /usr/lib/flume/bin/flume-daemon.sh start master