Socorro/ElasticSearch API: Difference between revisions

From MozillaWiki
(Rename page + add details about the implementation)
Revision as of 01:25, 24 August 2011

Socorro Middleware

The Socorro Middleware API aims to separate the front end from the back end by providing an interface for accessing the data. The front end then does not need to know about the storage system: it retrieves data from HBase, PostgreSQL or ElasticSearch in the same simple way, through a single REST API.

The API is divided into several categories (entry points):

  • /query
  • /search
  • /report
  • /crash
  • /stats
  • /util

These categories are explained below.

This API is designed to be built on top of ElasticSearch. However, we want our users to be able to choose whether or not to use ES. That is why we will try to make this API as modular as possible, so that we can have different implementations using different storage or search engines (e.g. ElasticSearch, PostgreSQL). The Socorro UI should be completely independent of the storage engine in use, and should call this API without caring which one it is.

The API

Version

Every URI is prefixed by a version number, so final URIs should look like: http://example.com/(api_version)/(request)/.

Forcing an implementation

When calling the API, it is possible to choose which implementation to use (i.e. ElasticSearch or PostgreSQL) by setting force_search_impl. The value can be 'es' for ElasticSearch or 'pg' for PostgreSQL.
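
As a rough illustration of how a service might resolve this parameter, here is a minimal Python sketch. The helper name resolve_search_impl, the mapping and the default argument are assumptions made for the example; they are not part of the existing code, and this page does not specify how the parameter is actually passed.

```python
# Hypothetical helper: the names here are illustrative, not Socorro's actual code.
SEARCH_IMPLS = {
    "es": "elasticsearch",
    "pg": "postgresql",
}


def resolve_search_impl(params, default="es"):
    """Return the name of the storage implementation to use for a request.

    `params` is a dict of request parameters. `default` stands in for
    whatever the configuration selects when nothing is forced.
    """
    forced = params.get("force_search_impl", default)
    try:
        return SEARCH_IMPLS[forced]
    except KeyError:
        raise ValueError("unknown search implementation: %r" % (forced,))
```

Rejecting unknown values explicitly, rather than silently falling back to the default, makes a mistyped value visible to the caller.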

Query

Note: this should be removed; it is useless and potentially insecure.

Description

Low-level query: sends a JSON query directly to ES and returns the result of that query.

API Spec

HTTP request: POST
Data: JSON query to send to ElasticSearch
URI: /query/[(types)/]

  • types: Types of data we are looking into. If omitted, default value is _all. Several types can be specified, separated by a + symbol.

Return value

This request returns the exact data the storage system returned.

Example

curl -XPOST 'http://example.com/110505/query/crashes/' -d '{
    "query" : {
        "match_all" : {}
    }
}'

Search

Description

Searches for crashes and returns them. This search is highly configurable, but can also be really simple using default values.

API Spec

HTTP request: GET
URI: /search/(type)/(optional_parameters)

  • type: Type of data we are looking for. Can be set to crashes to retrieve a set of crashes, or to signatures to get an aggregate of signatures.

Optional parameters:

Except for the first one (type), every parameter can be omitted. An omitted parameter either takes its default value or is not used when querying ES. You can use any subset of these parameters, and their order doesn't matter, except for type, which must come first.

The complete URI is as follows: /search/(type)/for/(terms)/product/(product)/from/(from_date)/to/(to_date)/in/(fields)/version/(version)/os/(os_name)/branches/(branches)/search_mode/(search_mode)/reason/(crash_reason)/build/(build_id)/build_from/(build_from)/build_to/(build_to)/report_process/(report_process)/report_type/(report_type)/plugin_in/(plugin_in)/plugin_search_mode/(plugin_search_mode)/plugin_term/(plugin_term)

  • terms: Terms we are searching for. Each term must be URL encoded. Several terms can be specified, separated by a + symbol. If not specified, nothing is searched for, and the query returns the results matching the other parameters.
  • product: The product we are interested in. (e.g. Firefox, Fennec, Thunderbird... ) Default value is "firefox".
  • from_date: Search for crashes that happened after this date. Can use the following formats: "yyyy-MM-dd", "yyyy-MM-dd HH:ii:ss" or "yyyy-MM-dd HH:ii:ss.S". Default value is a week ago.
  • to_date: Search for crashes that happened before this date. Can use the following formats: "yyyy-MM-dd", "yyyy-MM-dd HH:ii:ss" or "yyyy-MM-dd HH:ii:ss.S". Default value is now.
  • fields: Fields we are searching in. Several fields can be specified, separated by a + symbol. By default, all fields are searched. This is NOT implemented for PostgreSQL.
  • version: Version of the product. Can be set to _all to search all versions. By default, all versions are searched.
  • os_name: Name of the operating system (e.g. Windows, Mac, Linux). By default, all operating systems are searched.
  • branches: Several branches can be specified, separated by a + symbol. By default, all branches are searched.
  • search_mode: Set how to search. Can be either is_exactly, contains or starts_with. Default value is contains.
  • crash_reason: Restricts search to crashes caused by this reason. Default value is empty.
  • build_id: Restricts search to crashes that happened on a product with this build ID. Default value is empty.
  • build_from: Lower bound of the range of build IDs to which the search is restricted. Default value is empty.
  • build_to: Upper bound of the range of build IDs to which the search is restricted. Default value is empty.
  • report_process: Can be any, browser or plugin. Default value is any.
  • report_type: Can be any, crash or hang. Default value is any.
  • plugin_in: Search for a plugin in this field. report_process has to be set to plugin. Default value is empty.
  • plugin_search_mode: How to search for this plugin. report_process has to be set to plugin. Default value is empty.
  • plugin_term: Terms to search for. Several terms can be specified, separated by a + symbol. report_process has to be set to plugin. Default value is empty.
  • result_number: Number of results to return. Default value is 100.
  • result_offset: Offset of the first result to return. Default value is 0.
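
To make the parameter rules above more concrete, here is a minimal Python 3 sketch of how a search path could be turned into a dictionary of parameters with the documented defaults. The function names, the dictionary keys (taken from the URI segments) and the use of local time for "now" are assumptions made for the example. It is also an assumption that result_number and result_offset follow the same key/value pattern as the other parameters, since they do not appear in the complete URI above.

```python
from datetime import datetime, timedelta
from urllib.parse import unquote

# The three date formats listed above, most specific first.
DATE_FORMATS = ("%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d")
# URI keys whose values may hold several items separated by "+".
MULTI_VALUED = {"for", "in", "branches", "plugin_term"}
INTEGER_VALUED = {"result_number", "result_offset"}


def parse_date(text):
    """Parse a date given in one of the accepted formats."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError("unrecognized date: %r" % (text,))


def parse_search_path(path, now=None):
    """Parse '/search/(type)/key/value/...' into a dict of parameters."""
    now = now or datetime.now()
    parts = [p for p in path.split("/") if p]
    if len(parts) < 2 or parts[0] != "search":
        raise ValueError("expected /search/(type)/...")
    # Documented defaults; parameters without a default are simply absent.
    params = {
        "type": parts[1],
        "product": "firefox",
        "from": now - timedelta(days=7),
        "to": now,
        "search_mode": "contains",
        "report_process": "any",
        "report_type": "any",
        "result_number": 100,
        "result_offset": 0,
    }
    rest = parts[2:]
    if len(rest) % 2:
        raise ValueError("parameters must come in key/value pairs")
    for key, raw in zip(rest[0::2], rest[1::2]):
        if key in MULTI_VALUED:
            # Split on "+" first, then decode each item.
            params[key] = [unquote(item) for item in raw.split("+")]
        elif key in ("from", "to"):
            params[key] = parse_date(unquote(raw))
        elif key in INTEGER_VALUED:
            params[key] = int(raw)
        else:
            params[key] = unquote(raw)
    return params
```

Splitting on + before URL-decoding means a term that itself contains an encoded + (%2B) is not mistaken for a separator.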

Return value

The full JSON documents that match the search parameters. The schema of these JSON documents is to be determined.

Example

http://example.com/110505/search/crashes/for/libflash.so/in/signature/product/firefox/version/4.0.1/from/2011-05-01/to/2011-05-05/os/Windows/
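
On the client side, a URI like the one above could be assembled as in the following Python 3 sketch. The function build_search_uri is hypothetical and only illustrates the scheme: parameters are passed as a dict because for, from and in are reserved words in Python, and list values are joined with a + symbol after each item is URL encoded.

```python
from urllib.parse import quote


def build_search_uri(base, api_version, data_type, params):
    """Build a search URI from a dict mapping URI keys to values.

    Keys are the URI segment names (e.g. "for", "in", "os"). Values may
    be single values or lists; list items are joined with "+".
    """
    segments = [base.rstrip("/"), str(api_version), "search", data_type]
    for key, value in params.items():
        if isinstance(value, (list, tuple)):
            encoded = "+".join(quote(str(item), safe="") for item in value)
        else:
            encoded = quote(str(value), safe="")
        segments += [key, encoded]
    return "/".join(segments) + "/"


uri = build_search_uri(
    "http://example.com",
    "110505",
    "crashes",
    {
        "for": ["libflash.so"],
        "in": ["signature"],
        "product": "firefox",
        "version": "4.0.1",
        "from": "2011-05-01",
        "to": "2011-05-05",
        "os": "Windows",
    },
)
```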

Report

Description

Get a specific report.

API Spec

HTTP request: GET
URI: /report/(report_name)/product/(product)/version/(version)/from/(from_date)/to/(to_date)/

  • report_name: The wanted report. Can be:
    • top_changers_by_signature
    • top_crashers_by_signature
    • top_crashers_by_url
    • top_crashers_by_domain
    • top_crashers_by_topsite
  • product: The product we are interested in. (e.g. Firefox, Fennec, Thunderbird... )
  • version: Version of the product.
  • from_date: Only crashes that happened after this date.
  • to_date: Only crashes that happened before this date.

Example: http://example.com/201105/report/top_crashers_by_url/product/firefox/version/5.0/from/2011-05-01/to/2011-05-05/
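
Since report_name only accepts a fixed set of values, a caller may want to check it before sending the request. The Python sketch below shows one way to do that; build_report_uri and REPORT_NAMES are illustrative names, not existing code.

```python
# The report names listed in the spec above.
REPORT_NAMES = frozenset([
    "top_changers_by_signature",
    "top_crashers_by_signature",
    "top_crashers_by_url",
    "top_crashers_by_domain",
    "top_crashers_by_topsite",
])


def build_report_uri(base, api_version, report_name, product, version,
                     from_date, to_date):
    """Build a report URI, rejecting report names the spec does not list."""
    if report_name not in REPORT_NAMES:
        raise ValueError("unknown report: %r" % (report_name,))
    return "%s/%s/report/%s/product/%s/version/%s/from/%s/to/%s/" % (
        base.rstrip("/"), api_version, report_name,
        product, version, from_date, to_date,
    )
```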

Crash

Description

Searches for a crash by its OOID and returns it. This query is already implemented in the middleware.

API Spec

See http://code.google.com/p/socorro/wiki/APICalls

Stats

Description

This is a proposal.

Get some statistics about the data, e.g. counts by OS, by product, by ADU, by build... The difference from report is that stats only sends back numeric data, counted over the entire data set or within a certain date range.

This may not be useful for the Socorro UI right now, but could be in the future. It may be a good way of extending the information we give to our users. These stats are very likely to be cached, meaning performance should not be an issue.

Examples of ES queries

{
    "size" : 0,
    "query" : {
        "match_all" : {}
    },
    "facets" : {
        "os" : {
            "terms" : { "script_field" : "_source.os_name" }
        }
    }
}

Gives the number of crashes by OS. Example result:

{
  "took" : 50,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 229,
    "max_score" : 1.0,
    "hits" : [ ]
  },
  "facets" : {
    "os" : {
      "_type" : "terms",
      "missing" : 74,
      "terms" : [ {
        "term" : "Windows NT",
        "count" : 134
      }, {
        "term" : "Mac OS X",
        "count" : 13
      }, {
        "term" : "Linux",
        "count" : 8
      } ]
    }
  }
}
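
Because stats only sends back numeric data, a result like the one above would probably be reduced to plain counts before being returned. Here is a minimal Python sketch of that step, using a trimmed copy of the example result; the function name counts_by_term is an assumption made for the example.

```python
def counts_by_term(response, facet_name):
    """Reduce a terms facet to ({term: count}, number_of_missing_docs)."""
    facet = response["facets"][facet_name]
    counts = {entry["term"]: entry["count"] for entry in facet["terms"]}
    return counts, facet.get("missing", 0)


# Trimmed copy of the example result above.
example_response = {
    "hits": {"total": 229, "max_score": 1.0, "hits": []},
    "facets": {
        "os": {
            "_type": "terms",
            "missing": 74,
            "terms": [
                {"term": "Windows NT", "count": 134},
                {"term": "Mac OS X", "count": 13},
                {"term": "Linux", "count": 8},
            ],
        }
    },
}

counts, missing = counts_by_term(example_response, "os")
```

In this example the counts and the missing documents add up to the total number of hits (134 + 13 + 8 + 74 = 229).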

Implementation

Current implementation:

socorro/
        ...
        services/
                search.py
                ...
        search/
                searchapi.py
                elasticsearch.py
                postgresql.py
        ...

Plan for reorganizing the middleware:

socorro/
        external/
                elasticsearch/
                        es_middlewareapi.py
                        search.py
                        report.py
                        crash_storage.py
                        ...
                hbase/
                        hb_middlewareapi.py
                        crash.py
                        hbaseclient.py
                        crash_storage.py
                        ...
                postgresql/
                        pg_middlewareapi.py
                        search.py
                        report.py
                        ...
        middleware/
                search_service.py
                report_service.py
                crash_service.py
                ...

The idea is to have all the URI services (services accessible through the REST API) in a single module called middleware. Those services don't implement anything themselves, but instead call other classes or functions depending on the storage system in use.

The external module contains the code for communicating with external resources like HBase, PostgreSQL or ElasticSearch. It will contain implementations of the middleware API as well as other useful functions related to each resource.

For example, when calling the search service via the REST API, and with a configuration pointing to ElasticSearch, here is what will be called:

  • 1. socorro.middleware.search_service
  • 2. socorro.external.elasticsearch.es_middlewareapi
  • 3. socorro.external.elasticsearch.search

The middlewareapi classes are not supposed to implement anything themselves, but instead direct requests to the right class and function, and send back the result. This makes it easier to add new classes or functions, and also easier to share code between components.
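
The call chain described above could look roughly like the following Python sketch. All class names, the IMPLEMENTATIONS mapping and the search_impl configuration key are placeholders invented for the example; only the three-layer structure (service, middlewareapi, storage-specific class) comes from the plan.

```python
class ElasticSearchSearch:
    """Stands in for socorro.external.elasticsearch.search."""

    def search(self, **params):
        # A real implementation would query ElasticSearch here.
        return {"impl": "elasticsearch", "params": params}


class PostgreSQLSearch:
    """Stands in for socorro.external.postgresql.search."""

    def search(self, **params):
        # A real implementation would query PostgreSQL here.
        return {"impl": "postgresql", "params": params}


class ESMiddlewareAPI:
    """Stands in for es_middlewareapi: routes calls, implements nothing."""

    def search(self, **params):
        return ElasticSearchSearch().search(**params)


class PGMiddlewareAPI:
    """Stands in for pg_middlewareapi: routes calls, implements nothing."""

    def search(self, **params):
        return PostgreSQLSearch().search(**params)


# Hypothetical mapping from a configuration value to a middleware API class.
IMPLEMENTATIONS = {"es": ESMiddlewareAPI, "pg": PGMiddlewareAPI}


class SearchService:
    """Stands in for socorro.middleware.search_service."""

    def __init__(self, config):
        self.api = IMPLEMENTATIONS[config["search_impl"]]()

    def get(self, **params):
        # The service only delegates; the storage choice lives in the config.
        return self.api.search(**params)


result = SearchService({"search_impl": "es"}).get(product="firefox")
```

With this layout, supporting a new storage system means adding one package under external and one entry in the mapping, without touching the services.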