CloudServices/Sagrada/Metlog
= Overview =


The '''Metlog''' project is part of [[Services/Sagrada|Project Sagrada]], providing a service that lets applications capture arbitrary data and inject it into back end storage suitable for out-of-band analytics and processing.


= Project =

== Engineers ==


* Rob Miller
* Victor Ng


== User Requirements ==


The first version of the Metlog system will focus on providing an easy mechanism for the [[Services/Sync| Sync]] and [https://browserid.org/ BrowserID] projects (and any other internal Mozilla services) to efficiently send profiling data, and any other arbitrary metrics information that may be desired, into one or more back end storage locations.  Once the data has reached its final destination, those with appropriate access should be able to run analytics queries and generate reports on the accumulated data.


Requirements:


* Services apps should be provided an easy-to-use API that allows them to send arbitrary text data into the metrics and reporting infrastructure.
* Processing and I/O load generated by the API calls made by the services apps must be extremely small, so that impact on app performance is minimal even when message volume is very high.
* The API should provide a mechanism for attaching arbitrary metadata to every message payload.
* The overall system should provide a sensible set of message categories, so that commonly generated types of messages can be labeled as such and the processing and reporting functionality can easily distinguish between the various types of message payloads.
* The message taxonomy must be easily extendable to support message types that are not defined up front.
* The message processing system must be able to distinguish between different message types, so the various types can be routed to the appropriate back end(s) for effective analysis and reporting.
* Service app owners must have access to an interface (or interfaces) providing reporting and querying capabilities appropriate to the various types of messages that have been sent into the system.


== Proposed Architecture ==

The proposed Services Metlog architecture will consist of three layers:


; generator : The generator portion of the system is the service application that generates the data to be sent into the system.  We will provide libraries (described below) that app authors can use to plug in easily.  The libraries will take messages generated by the applications, serialize them, and send them out (using ZeroMQ as the transport, by default).  The metrics-generating apps that initially need to be supported are based on the following platforms:
* Mozilla Services team's Python app framework (sync, reg, sreg, message queue, etc.)
* Node.js (BrowserID)


; router : The router listens for the messages sent out by the provided libraries.  It will deserialize these messages and examine the metadata to determine the appropriate back end(s) to which each message should be delivered.  The format and protocol for delivering these messages to the endpoints will vary from back end to back end.  We plan to use [http://logstash.net/ logstash] as the initial message router, because it is already planned to be installed on every services server machine and it is built specifically for this type of event-based message routing.  (A minimal sketch of this routing step appears after the endpoint list below.)


; endpoints : Different types of messages lend themselves to different types of presentation, processing, and analytics.  We will start with a small selection of back end destinations and add to it over time as we generate more types of metrics data and spin up more presentation and query layers.  Proposed back ends are as follows:
* [https://github.com/fetep/ruby-statsd ruby-statsd]: '''(Phase 1)''' ruby-statsd is already in the pipeline to be running on every Services machine.
* [https://hadoop.apache.org/hdfs/ HDFS]: '''(Phase 1)''' Some data will be inserted into the Mozilla Metrics team's HDFS infrastructure, where it will be available for later [https://hive.apache.org/ Hive] and/or MapReduce based queries.
* [http://www.arcsight.com/products/products-esm/ ArcSight ESM]: '''(Phase 1)''' A "security correlation engine" already in use throughout the Mozilla organization.
* [https://github.com/dcramer/django-sentry Sentry]: '''(Phase 2)''' Sentry is an exception logging infrastructure that provides useful debugging tools to service app developers.  Sentry is not yet planned to be provided by any Mozilla operations team; using it would require buy-in from, and coordination with, a Mozilla internal service provider (probably the Services Ops team).
* [http://esper.codehaus.org/ Esper]: '''(Phase 3?)''' A system for "complex event processing", i.e. one which will watch various statistic streams in real time looking for anomalous behavior.
* [http://opentsdb.net/ OpenTSDB]: '''(Phase 3?)''' A "Time Series Database" providing fine-grained real-time monitoring and graphing.
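
To make the router's role concrete, the following is a minimal sketch of the routing step, written against the message format described in the API section below.  It is illustrative only, not logstash's implementation; the listener port, the routing table, and the handler names are all assumptions.

<source lang="python">
import json
import socket

def send_to_statsd(msg):
    # Placeholder: a real router would forward counters/timers to statsd.
    print("-> statsd:", msg["payload"])

def send_to_hdfs(msg):
    # Placeholder: unrecognized types fall through to HDFS, per the
    # "Ad-Hoc service app metrics gathering" use case below.
    print("-> HDFS:", msg["type"])

# Hypothetical type -> back end table; in practice this mapping lives
# in the logstash configuration, not in application code.
ROUTES = {
    "counter": send_to_statsd,
    "timer": send_to_statsd,
}

def route(msg):
    ROUTES.get(msg["type"], send_to_hdfs)(msg)

def listen(host="127.0.0.1", port=5565):  # address and port are assumptions
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    while True:
        data, _addr = sock.recvfrom(65535)
        route(json.loads(data.decode("utf-8")))
</source>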


== API ==


The atomic unit for the Metlog system is the "message". The structure of a message is inspired by that of the well-known syslog message standard, with some slight extensions to allow for richer metadata. Each message will consist of the following fields:


* ''timestamp'': Time at which the message is generated.
* ''logger'': String token identifying the message generator, such as the name of the service application in question.
* ''type'': String token identifying the type of message payload.
* ''severity'': Numerical code from 0-7 indicating the severity of the message, as defined by [https://tools.ietf.org/html/rfc5424 RFC 5424].
* ''payload'': Actual message contents.
* ''fields'': Arbitrary set of key/value pairs containing any additional data that may be useful for back end reporting or analysis.
* ''env_version'': API version number of the "message envelope"; i.e., any change to the message data structure (exclusive of message-type-specific changes that may be embedded within the fields or the payload) must increment the env_version value. The structure described in this document is envelope version 0.8.
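
For illustration, a single message using envelope version 0.8 might look like the following; all of the values here are invented for the example.

<source lang="python">
# A hypothetical Metlog message, shown as a Python dict.
example_message = {
    "timestamp": "2013-11-19T19:55:00Z",
    "logger": "sync-app",              # which application sent this
    "type": "timer",                   # identifies the payload type
    "severity": 6,                     # RFC 5424 "Informational"
    "payload": "103",                  # e.g. elapsed milliseconds
    "fields": {"rate": 1.0, "name": "views.index"},
    "env_version": "0.8",              # version of the envelope structure
}
</source>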


We will provide Metlog client libraries that will both ease generation of these messages and handle packaging them up and delivering them into the message processing infrastructure. Implementations of this library are available in both [https://github.com/mozilla-services/metlog-py Python] and [https://github.com/mozilla-services/metlog-node Node.js-style Javascript]. Please see the documentation for these client libraries to learn more about the specific APIs available in each environment.
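
To give a sense of what these client libraries have to do under the hood, here is a minimal hypothetical sender in Python.  It is '''not''' the metlog-py API (see that library's documentation for the real interface); the class name, listener address, and defaults are all assumptions.

<source lang="python">
import json
import socket
import time

class TinyMetlogSender:
    """Hypothetical minimal client: build a message envelope and
    fire it at a UDP listener, 'fire and forget' style, so the
    impact on the calling application stays small."""

    def __init__(self, logger, host="127.0.0.1", port=5565):
        self.logger = logger
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send(self, type_, payload, severity=6, fields=None):
        msg = {
            "timestamp": time.time(),
            "logger": self.logger,
            "type": type_,
            "severity": severity,
            "payload": payload,
            "fields": fields or {},
            "env_version": "0.8",
        }
        self.sock.sendto(json.dumps(msg).encode("utf-8"), self.addr)

# Usage (message type and fields invented for the example):
sender = TinyMetlogSender("sync-app")
sender.send("oldsyncs", "1", fields={"user_agent": "Firefox/17"})
</source>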


== Use Cases ==

=== Python App Framework performance metrics ===


The Python framework that underlies the Services Apps will be annotated with ''timer'' calls to automatically generate performance metrics for such key activities as authentication and execution of the actual view callable.  The sample rate of these calls can be specified in the app configuration, where a value of 0 can be entered to turn off the timers altogether.  These will ultimately feed into a [https://github.com/fetep/ruby-statsd ruby-statsd] / [http://graphite.wikidot.com/ graphite] back end provided by Services Ops, where app owners will be able to see graphs of the captured data.
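
As a rough sketch of what such an annotation might look like (the decorator is an illustrative assumption, not the framework's actual code, and it reuses the hypothetical sender from the API section):

<source lang="python">
import functools
import random
import time

def timed(name, sender, sample_rate=1.0):
    """Illustrative timer decorator: measures wall-clock time for the
    wrapped callable and reports it as a 'timer' message.  A
    sample_rate of 0 turns the timer off entirely, as described above."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Only time a sample_rate fraction of the calls.
            if sample_rate <= 0 or random.random() > sample_rate:
                return func(*args, **kwargs)
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = int((time.time() - start) * 1000)
                sender.send("timer", str(elapsed_ms),
                            fields={"name": name, "rate": sample_rate})
        return wrapper
    return decorator

# Usage:
# @timed("views.index", sender)
# def index(request): ...
</source>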


=== Python App Framework exception logging ===


In addition to timing information, the Python framework for services apps can automatically capture exceptions, sending a full traceback and some amount of local variable information as part of the message payload.  This can ultimately be delivered to a [https://github.com/dcramer/django-sentry Sentry] installation for developer introspection and debugging.
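
A minimal sketch of that capture step, again using the hypothetical sender from the API section (the helper name and message type are invented):

<source lang="python">
import traceback

def send_exception(sender, exc):
    """Illustrative: package a traceback as an 'exception' message.
    A real implementation would also gather local variables and
    request context; this sketch sends only the formatted traceback."""
    tb = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    sender.send("exception", tb, severity=3,  # RFC 5424 "Error"
                fields={"exc_type": type(exc).__name__})

# Usage inside a request handler:
# try:
#     handle_request(request)
# except Exception as exc:
#     send_exception(sender, exc)
#     raise
</source>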


=== Ad-Hoc service app metrics gathering ===


Any service app will have the ability to easily generate arbitrary message data and metadata for delivery into the Metlog system.  Any messages not specifically recognized as being intended for another back end will be delivered to an HDFS cluster provided by the Metrics team, allowing for later analysis via custom map-reduce jobs or [https://hive.apache.org/ Hive] queries.
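
Concretely, an ad-hoc data point is just a message with an application-chosen type and fields.  With the hypothetical sender from the API section it could be as simple as the following (the type and field names are invented):

<source lang="python">
# "signup_funnel" is not a predefined message type, so the router
# would fall through to the HDFS back end for later Hive queries.
sender.send("signup_funnel", "step2_completed",
            fields={"campaign": "fall-launch", "locale": "en-US"})
</source>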


=== CEF security logging ===


Several groups in Mozilla are already using ArcSight ESM to track events and to evaluate them looking for patterns that may indicate attempted security or abuse violations.  ArcSight expects messages in the [http://www.arcsight.com/solutions/solutions-cef/ "Common Event Format"].  Rather than talking to ArcSight directly, services developers could send messages of type "cef" through Metlog, decoupling service applications from a vendor-specific back end.
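
CEF is a pipe-delimited text format (<code>CEF:Version|Vendor|Product|Version|SignatureID|Name|Severity|extensions</code>).  Here is a sketch of emitting such a message through the hypothetical sender from the API section; the vendor, product, and field values are placeholders.

<source lang="python">
def send_cef(sender, signature_id, name, severity, extensions):
    """Illustrative: build a CEF:0 record and ship it as a 'cef'
    message, so the router (not the app) decides how it reaches
    ArcSight.  Vendor/product/version strings are placeholders."""
    header = "|".join([
        "CEF:0", "Mozilla", "sync-app", "1.0",
        str(signature_id), name, str(severity),
    ])
    ext = " ".join("%s=%s" % (k, v) for k, v in extensions.items())
    sender.send("cef", header + "|" + ext, severity=4)

# Usage (all values invented):
send_cef(sender, 1000, "AuthFailure", 5,
         {"suser": "someuser", "src": "10.0.0.1"})
</source>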




== Setting up Metlog with Logstash ==


Setting up logstash to operate with metlog involves installing the logstash-metlog package.


You can find the latest version of the code on [https://github.com/mozilla-services/logstash-metlog/ github] and the latest documentation at [http://logstash-metlog.readthedocs.org/en/latest/ logstash-metlog.rtfd.org].


We keep a working Vagrant instance as well; the [https://github.com/mozilla-services/vagrant-metlog-backend/blob/master/files/logstash.conf logstash.conf] configuration file is a useful reference point for setting up your own metlog server instance.
