CloudServices/Sagrada/Metlog

= Overview =


The '''Metlog''' project is part of [[Services/Sagrada|Project Sagrada]], providing a service for applications to capture and inject arbitrary data into back end storage suitable for out-of-band analytics and processing.


= Project =
== User Requirements ==


The first version of the Metlog system will focus on providing an easy mechanism for the [[Services/Sync| Sync]] and [https://browserid.org/ BrowserID] projects (and any other internal Mozilla services) to efficiently send profiling data and any other arbitrary metrics information into one or more back end storage locations.  Once the data has reached its final destination, those with appropriate access will be able to run analytics queries and generate reports on the accumulated data.


Requirements:
== Proposed Architecture ==


The proposed Services Metlog architecture will consist of three layers:


; generator : The generator portion of the system is the actual service application that is generating the data that is to be sent into the system.  We will provide libraries (described below) that app authors can use to easily plug in.  The libraries will take messages generated by the applications, serialize them, and then send them out (using ZeroMQ as the transport, by default).  The metrics generating apps that need to be supported initially are based on the following platforms:
* Node.js (BrowserID).


; router : The router will listen for the messages sent out by the provided libraries.  It will deserialize these messages and examine the metadata to determine the appropriate back end(s) to which the message should be delivered.  The format and protocol for delivering these messages to the endpoints will vary from back end to back end.  We plan on initially using [http://logstash.net/ logstash] as the message router, because it is already planned to be installed on every services server machine, and it is built specifically for this type of event-based message routing.


; endpoints : Different types of messages lend themselves to different types of presentation, processing, and analytics.  We will start with a small selection of back end destinations, but we will be able to add to this over time as we generate more types of metrics data and we spin up more presentation and query layers.  Proposed back ends are as follows:
* [https://github.com/fetep/ruby-statsd ruby-statsd]: '''(Phase 1)''' ruby-statsd is already in the pipeline to be running on every Services machine.
* [https://hadoop.apache.org/hdfs/ HDFS]: '''(Phase 1)''' Some data will be inserted into the Mozilla Metrics team's HDFS infrastructure, where it will be available for later [https://hive.apache.org/ Hive] and/or map reduce based queries.
* [http://www.arcsight.com/products/products-esm/ ArcSight ESM]: '''(Phase 1)''' A "security correlation engine" already in use throughout the Mozilla organization.
* [https://github.com/dcramer/django-sentry Sentry]: '''(Phase 2)''' Sentry is an exception logging infrastructure that provides useful debugging tools to service app developers.  Sentry is not yet planned to be provided by any Mozilla operations team; using it would require buy-in from and coordination with a Mozilla internal service provider (probably the Services Ops team).
* [http://esper.codehaus.org/ Esper]: '''(Phase 3?)''' A system for "complex event processing", i.e. one which will watch various statistic streams in real time looking for anomalous behavior.
* [http://opentsdb.net/ OpenTSDB]: '''(Phase 3?)''' A "Time Series Database" providing fine grained real time monitoring and graphing.
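
To make the generator-to-router hand-off concrete, the sketch below shows roughly how a generating application might serialize a message and publish it over ZeroMQ for a listening router such as logstash to pick up.  This is only an illustration of the data flow, not code from any shipped library; the socket address, JSON serialization, and field values are all assumptions.

<syntaxhighlight lang="python">
import json
import time

import zmq  # pyzmq; ZeroMQ is the default transport named above

# Hypothetical address that the router (e.g. logstash) would subscribe to.
ROUTER_ENDPOINT = "tcp://127.0.0.1:5565"

context = zmq.Context()
publisher = context.socket(zmq.PUB)
publisher.bind(ROUTER_ENDPOINT)

# A message following the envelope structure described in the API section below.
message = {
    "timestamp": time.time(),
    "logger": "sync-app",        # hypothetical originating application
    "severity": 6,               # syslog-style "informational"
    "type": "timer",
    "payload": "",
    "fields": {"name": "auth", "elapsed": 0.012},
    "env_version": "0.8",
}

# The generator only serializes and hands off; choosing the appropriate
# back end(s) is the router's job.
publisher.send(json.dumps(message).encode("utf-8"))
</syntaxhighlight>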


== API ==


The atomic unit for the Metlog system is the "message". The structure of a message is inspired by that of the well known syslog message standard, with some slight extensions to allow for richer metadata. Each message will consist of the following fields:


* ''timestamp'': Time at which the message is generated.
* ''payload'': Actual message contents.
* ''fields'': Arbitrary set of key/value pairs that includes any additional data that may be useful for back end reporting or analysis.
* ''env_version'': API version number of the "message envelope", i.e. any changes to the message data structure (exclusive of message-type-specific changes that may be embedded within the fields or the payload) must increment the env_version value. The structure described in this document is envelope version 0.8.
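
For concreteness, a single message rendered as a Python dictionary might look like the sketch below.  All of the values are illustrative only; the ''logger'', ''severity'', and ''type'' entries correspond to the like-named arguments of the client API described below.

<syntaxhighlight lang="python">
# A hypothetical envelope-version-0.8 message; every value is an example.
example_message = {
    "timestamp": "2012-03-01T14:12:03Z",  # time the message was generated
    "logger": "browserid-webhead",        # originating application (hypothetical)
    "severity": 6,                        # syslog-style informational level
    "type": "counter",                    # used by the router to pick back end(s)
    "payload": "1",                       # actual message contents
    "fields": {"name": "logins"},         # arbitrary additional key/value data
    "env_version": "0.8",                 # version of the message envelope itself
}
</syntaxhighlight>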


We will provide a "metlog" library that will both ease generation of these messages and that will handle packaging them up and delivering them into the message processing infrastructure.  Implementations of this library will likely be available in both Python and Javascript, but the Python library will be available first and this document will, for now, only describe the Python API. The Javascript API will be similar, modulo syntactic sugar that is available in Python but not in JS (e.g. decorators, context managers), and will be documented in detail in the future.  The proposed Python API is as follows:
We will provide Metlog client libraries that will both ease generation of these messages and that will handle packaging them up and delivering them into the message processing infrastructure.  Implementations of this library are available in both [https://github.com/mozilla-services/metlog-py Python] and [https://github.com/mozilla-services/metlog-node Node.js style Javascript]. Please see the documentation for these client libraries to learn more about the specific APIs available in each environment.
 
; '''MetlogClient(sender, logger="", severity=6)''' : Primary metlog client class which can accept metlog messages, format and serialize them, and pass them along for delivery to the message processor.
 
* ''sender'': A "sender" object, i.e. any object that exposes a ''send_message'' method accepting a single ''msg'' argument.  All actual message transport and delivery is the responsibility of the sender.  We will provide a ZeroMQ Publisher sender implementation.
* ''logger'': Default for all subsequent ''metlog'' calls which do not explicitly pass this value.
* ''severity'': Default for all subsequent ''metlog'' calls which do not explicitly pass this value.
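
To make the sender contract concrete, here is a minimal sketch.  The ''StdoutSender'' class is purely hypothetical (any object exposing a ''send_message'' method will do), and the import path for ''MetlogClient'' is an assumption rather than a documented location.

<syntaxhighlight lang="python">
# Hypothetical import path; consult the metlog-py documentation for the real one.
from metlog.client import MetlogClient


class StdoutSender(object):
    """Toy sender: any object exposing send_message(msg) satisfies the
    sender contract.  A real deployment would use the ZeroMQ Publisher
    sender mentioned above instead."""

    def send_message(self, msg):
        print(msg)


# The logger and severity passed here become the defaults for all
# subsequent calls that don't override them.
client = MetlogClient(StdoutSender(), logger="sync-app", severity=6)
</syntaxhighlight>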
 
; '''MetlogClient.metlog(type, timestamp=None, logger=None, severity=None, message="", fields=None)''' : Sends a single log message along to the metlog listener(s). Most of the arguments correspond to the message fields described above.  Only ''type'' is strictly required; the rest will be populated with reasonable defaults if they aren't provided:
 
* ''timestamp'': Defaults to current system time
* ''logger'': Defaults to the current value of MetlogClient.logger
* ''severity'': Defaults to the current value of MetlogClient.severity
* ''message'': Defaults to an empty string
* ''fields'': Defaults to an empty dictionary
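
A hypothetical call, assuming the ''client'' constructed in the sketch above; the message type and field values are illustrative only.

<syntaxhighlight lang="python">
# "auth_failure" is a made-up message type; only the type argument is required.
client.metlog(
    "auth_failure",
    severity=4,  # syslog-style "warning"
    message="3 failed login attempts",
    fields={"user_id": "abc123", "remote_ip": "10.0.0.1"},
)
</syntaxhighlight>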
 
; '''MetlogClient.timer(name, timestamp=None, logger=None, severity=None, fields=None, rate=1)''' : Can be used as either a context manager or a decorator.  Will calculate the time required to execute the enclosed code, and will generate and send a metlog message (of type "timer") containing the timing information upon completion.
 
* ''name'': A required string label for the timer that will be added to the message fields
* ''timestamp'': Defaults to current system time
* ''logger'': Defaults to the current value of MetlogClient.logger
* ''severity'': Defaults to the current value of MetlogClient.severity
* ''fields'': Defaults to an empty dictionary
* ''rate'': The fraction of these invocations that should actually be timed; a value of 0.3 means the code would be timed and the results sent off approximately 30% of the times it is executed
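
For illustration, here is the timer used both as a decorator and as a context manager, again assuming the ''client'' from the earlier sketch; the timer names and the timed code are hypothetical.

<syntaxhighlight lang="python">
import time

# As a decorator: time calls to this function, sampling roughly 30% of them.
@client.timer("expensive_work", rate=0.3)
def expensive_work():
    time.sleep(0.05)  # stand-in for real work

expensive_work()

# As a context manager: time just the enclosed block.
with client.timer("critical_section"):
    time.sleep(0.01)  # stand-in for real work
</syntaxhighlight>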
 
; '''MetlogClient.incr(name, timestamp=None, logger=None, severity=None, fields=None)''' : Sends an "increment counter" message to metlog.
 
* ''name'': A required string label for the counter that will be added to the message fields
* ''timestamp'': Defaults to current system time
* ''logger'': Defaults to the current value of MetlogClient.logger
* ''severity'': Defaults to the current value of MetlogClient.severity
* ''fields'': Defaults to an empty dictionary
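
A hypothetical counter call, again assuming the ''client'' from the earlier sketch; the counter name and field are illustrative only.

<syntaxhighlight lang="python">
# Increment a made-up counter each time a new user signs up.
client.incr("new_user_signups", fields={"plan": "free"})
</syntaxhighlight>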


== Use Cases ==
=== Python App Framework performance metrics ===


The Python framework that underlies the Services Apps will be annotated with ''timer'' calls to automatically generate performance metrics for such key activities as authentication and execution of the actual view callable.  The sample rate of these calls can be specified in the app configuration, where a value of 0 turns off the timers altogether.  These will ultimately feed into a [https://github.com/fetep/ruby-statsd ruby-statsd] / [http://graphite.wikidot.com/ graphite] back end provided by Services Ops, where app owners will be able to see graphs of the captured data.
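
A rough sketch of how a configured sample rate might flow into a timer on a view callable; the configuration key, the ''client'' object, and the view function are all assumptions, not actual Services framework code.

<syntaxhighlight lang="python">
# Stand-in for the application's configuration; the key name is hypothetical.
config = {"metlog.timer_rate": "0.25"}

# A configured rate of 0 would effectively turn the timer off.
timer_rate = float(config.get("metlog.timer_rate", 1.0))

@client.timer("view_callable", rate=timer_rate)
def view_callable(request):
    return "OK"  # stand-in for the real view logic
</syntaxhighlight>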


=== Python App Framework exception logging ===
=== Ad-Hoc service app metrics gathering ===


Any service app will be able to easily generate arbitrary message data and metadata for delivery into the Metlog system.  Any messages not specifically recognized as being intended for another back end will be delivered to an HDFS cluster provided by the Metrics team, allowing for later analysis via custom map-reduce jobs or [https://hive.apache.org/ Hive] queries.
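
As an illustration, an app might emit an arbitrary, app-defined message type that no other back end claims; assuming the ''client'' from the earlier sketches, the type name and fields here are purely hypothetical.

<syntaxhighlight lang="python">
# A made-up message type that no other back end claims; such messages would
# fall through to HDFS for later Hive or map-reduce analysis.
client.metlog(
    "password_reset_flow",
    fields={"step": "email_sent", "locale": "en-US"},
)
</syntaxhighlight>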


=== CEF security logging ===


Several groups in Mozilla are already using ArcSight ESM to track events and to evaluate them, looking for patterns that may indicate security violations or abuse attempts.  ArcSight expects messages in the [http://www.arcsight.com/solutions/solutions-cef/ "Common Event Format"].  Rather than talking to ArcSight directly, services developers could send messages of type "cef" through Metlog, decoupling service applications from a vendor-specific back end.
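
A sketch of what such a call might look like, assuming the ''client'' from the earlier examples; the CEF header values and extension fields are placeholders, not an agreed-upon convention.

<syntaxhighlight lang="python">
# A "cef" type message that the router would forward to ArcSight ESM.
# The CEF header (vendor, product, signature id, name, severity) and the
# extension fields below are illustrative placeholders.
client.metlog(
    "cef",
    severity=4,
    message="CEF:0|Mozilla|SyncServer|1.0|AuthFail|Authentication failure|5|",
    fields={"suser": "abc123", "src": "10.0.0.1"},
)
</syntaxhighlight>
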
== Setting up Metlog with Logstash ==
Setting up logstash to operate with Metlog involves installing the logstash-metlog package.  You can find the latest version of the code on [https://github.com/mozilla-services/logstash-metlog github] and the latest documentation at [http://logstash-metlog.readthedocs.org/en/latest/ logstash-metlog.rtfd.org].  We keep a working vagrant instance as well; the [https://github.com/mozilla-services/vagrant-metlog-backend/blob/master/files/logstash.conf logstash.conf] configuration file is a useful reference point for setting up your own Metlog server instance.