EngineeringProductivity/Projects/ActiveData: Difference between revisions
Klahnakoski (talk | contribs) (rephrase) |
Klahnakoski (talk | contribs) (it has been 3 months, total rewrite) |
||
Line 2: | Line 2: | ||
= Overview = | = Overview = | ||
ActiveData is a publicly accessible data warehouse containing the results of the unit tests run on Mozilla's products. | |||
== Problem == | |||
In order to improve our testing infrastructure we require data on how that infrastructure is performing. That information can be extracted from the raw logs, but that requires downloading samples, parsing data, and inserting it into a database (or worse, writing queries in an imperative language, like Python). When we are done with an analysis we have effectively built an ETL pipeline that does not scale and is too specific to be reused elsewhere. The next project does this work all over again. | |||
== Solution== | |||
= | ActiveData will serve as a reusable ETL pipeline; annotating the test results with as much relevant data as possible. It also provides a query service to explore and aggregate the data, so there is minimal setup required to access this data. | ||
= Design = | |||
ActiveData attempts to provide the benefits of an available database to the public; except larger and faster. | |||
== Goals == | |||
An active data instance distinguishes itself from a static resource, or database, or big data solution, by delivering a particular set of features: | An active data instance distinguishes itself from a static resource, or database, or big data solution, by delivering a particular set of features: | ||
* '''A service, open to third party clients''' - By providing the service, clients don't need to setup their own datastore | * '''A service, open to third party clients''' - By providing the service, clients don't need to setup their own datastore | ||
* '''Fast filtering''' - Sub-second filtering over the contents of the whole datastore, independent of size, saves the application developer from declaring and managing indexes that do the same: There is sufficient information in the queries to determine which indexes should be built to deliver a quick response. | * '''Fast filtering''' - Sub-second filtering over the contents of the whole datastore, independent of size, saves the application developer from declaring and managing indexes that do the same: There is sufficient information in the queries to determine which indexes should be built to deliver a quick response. | ||
Line 17: | Line 19: | ||
* '''Metadata on dimensions and measures''' - ActiveData also provides context to the data it holds. It serves the purpose to allow exploration and discovery by third parties; by describing unit-of-measure, how dimensions relate to others, and provide human descriptions of the columns stored. This metadata is also invaluable in automating the orientation and formatting of dashboard charts: Knowing the domain of an axis allows code to decide the best (default) chart form, and provides logically reasonable aggregate options. | * '''Metadata on dimensions and measures''' - ActiveData also provides context to the data it holds. It serves the purpose to allow exploration and discovery by third parties; by describing unit-of-measure, how dimensions relate to others, and provide human descriptions of the columns stored. This metadata is also invaluable in automating the orientation and formatting of dashboard charts: Knowing the domain of an axis allows code to decide the best (default) chart form, and provides logically reasonable aggregate options. | ||
* '''Has a security model''' - Simpler applications can avoid the complications of a security model if it is baked into the ActiveData solution. If ActiveData is to become mainstream it is important that it can manage sensitive data and PII. | * '''Has a security model''' - Simpler applications can avoid the complications of a security model if it is baked into the ActiveData solution. If ActiveData is to become mainstream it is important that it can manage sensitive data and PII. | ||
== Limitations == | |||
ActiveData makes specific tradeoffs to achieve its goals. It has the following limitations: | |||
* large memory requirements | |||
* low add/update/remove speeds | |||
* strict data model (snowflake schema, hierarchical relations only) | |||
* non-relational | |||
* ETL work required to de-normalize data | |||
* ETL work required to provide dimension metadata | |||
== Non Goals == | |||
ActiveData is not meant to replace an application database. Applications often track significantly more data related to good interface design, process sequences, complex relations, and object life cycles. | |||
ActiveData's simple model makes it difficult to track object life cycles and impossible to model many-to-many relations. | |||
Data is not live, and definitely does not track "pending jobs" like TreeHerder or TaskCluster do. Test results may take a day, or more, to be indexed. | |||
= Dependencies / Who will use this = | |||
== Dependencies == | |||
== Users == | |||
= Let's Use It! = | = Let's Use It! = | ||
The service listens at http://activedata.allizom.org/query and accepts queries in [https://github.com/klahnakoski/Qb/blob/master/docs/Qb_Reference.md Qb format]. | The service listens at http://activedata.allizom.org/query and accepts queries in [https://github.com/klahnakoski/Qb/blob/master/docs/Qb_Reference.md Qb format]. | ||
== The Query Tool == | |||
The ActiveData service is intended for use by automated clients, not humans. The [http://activedata.allizom.org/tools/query.html Query Tool] is a minimal web page that lets humans do some exploration and test the phrasing of queries. | |||
* [http://activedata.allizom.org/tools/query.html ActiveData QueryTool] | |||
== Documentation == | |||
* [https://github.com/klahnakoski/ActiveData/blob/master/docs/Qb_Tutorial.md Simple tutorial] | |||
* [https://github.com/klahnakoski/ActiveData/blob/master/docs/Unittest%20Schema.md Unittest results schema] | |||
= Code = | = Code = | ||
Development is still in the early stages, setting up your own service | |||
* Github: https://github.com/klahnakoski/ActiveData | |||
== Tests == | |||
* [https://github.com/klahnakoski/ActiveData/blob/master/tests/test_unittests.py testing code which sends some unittest-specific queries] | |||
== Bugs == | |||
Bugs are tracked in [https://bugzilla.mozilla.org/buglist.cgi?quicksearch=activedata Bugzilla]. The open issues are shown here: | |||
<bugzilla> | |||
{ | |||
"quicksearch":"activedata" | |||
} | |||
</bugzilla> | |||
= Contact = | |||
* Kyle Lahnakoski | |||
* | ** IRC: ekyle@irc.mozilla.org | ||
** Email: klahnakoski@mozilla.org | |||
** Bugzilla: :ekyle | |||
=Context= | = More Context= | ||
Mostly rambling, optional reading. | |||
== | == Inspiration== | ||
This project is inspired by the data warehouse and data mart technology that is common inside large corporations. These warehouses are useful because they are "active" services: This means the data is not only available, but it can be explored interactively by large audience using a query language. | |||
== General Problem == | |||
A significant portion of any application is the backend database/datastore, which include: | A significant portion of any application is the backend database/datastore, which include: | ||
* Managing resources and machines to support the datastore | * Managing resources and machines to support the datastore | ||
Line 39: | Line 85: | ||
The manual effort put toward these features becomes significant as the amount of data grows in size and complexity. More importantly, this effort is being spent over and over on a multitude of applications, each a trivial variation of the next. | The manual effort put toward these features becomes significant as the amount of data grows in size and complexity. More importantly, this effort is being spent over and over on a multitude of applications, each a trivial variation of the next. | ||
==Solution== | ==General Solution== | ||
Abstractly, we desire to reduce this redundant workload by adding a layer of abstraction called ActiveData: Clients using ActiveData benefit from the features it provides and avoid the datastore management complexities. While the ActiveData implementers can focus on these common issues while being given a simpler data model, and simpler query language, upon which to calculate optimizations. | Abstractly, we desire to reduce this redundant workload by adding a layer of abstraction called ActiveData: Clients using ActiveData benefit from the features it provides and avoid the datastore management complexities. While the ActiveData implementers can focus on these common issues while being given a simpler data model, and simpler query language, upon which to calculate optimizations. | ||
Line 50: | Line 96: | ||
Applications that leverage an active data warehouse can forgo significant server side development, if not all, and put the logic on the client side. | Applications that leverage an active data warehouse can forgo significant server side development, if not all, and put the logic on the client side. | ||
Revision as of 15:27, 18 June 2015
Overview
ActiveData is a publicly accessible data warehouse containing the results of the unit tests run on Mozilla's products.
Problem
In order to improve our testing infrastructure we require data on how that infrastructure is performing. That information can be extracted from the raw logs, but that requires downloading samples, parsing data, and inserting it into a database (or worse, writing queries in an imperative language, like Python). When we are done with an analysis we have effectively built an ETL pipeline that does not scale and is too specific to be reused elsewhere. The next project does this work all over again.
Solution
ActiveData will serve as a reusable ETL pipeline, annotating the test results with as much relevant data as possible. It also provides a query service to explore and aggregate the data, so minimal setup is required to access it.
Design
ActiveData attempts to give the public the benefits of a readily available database, only larger and faster.
Goals
An active data instance distinguishes itself from a static resource, or database, or big data solution, by delivering a particular set of features:
- A service, open to third party clients - By providing the service, clients don't need to set up their own datastore.
- Fast filtering - Sub-second filtering over the contents of the whole datastore, independent of size, saves the application developer from declaring and managing indexes that do the same: there is sufficient information in the queries to determine which indexes should be built to deliver a quick response.
- Fast aggregates - Sub-second calculation of statistics over the whole datastore saves the application developer from building and managing caches of those aggregates.
- API is a query language (SQL?, MDX?) - Building upon the formalisms, and familiarity, of existing query languages reduces the learning curve. It also gives ActiveData implementations more insight into the intent of the client application, so they can optimize for its use cases.
- Uniform, Cartesian space of values - Mozilla has a mandate of data driven decision making. Data analysis tools, like spreadsheets, R, SciPy, NumPy, and pandas, are used to perform data analysis, and they all require uniform data in multi-dimensional arrays, commonly known as "pivot tables" or "data frames". ActiveData's objective is to provide query results in these formats.
- Metadata on dimensions and measures - ActiveData also provides context to the data it holds. This context allows exploration and discovery by third parties: it describes units of measure, explains how dimensions relate to each other, and provides human-readable descriptions of the columns stored. This metadata is also invaluable in automating the orientation and formatting of dashboard charts: knowing the domain of an axis allows code to decide the best (default) chart form, and to offer logically reasonable aggregate options.
- Has a security model - Simpler applications can avoid the complications of a security model if it is baked into the ActiveData solution. If ActiveData is to become mainstream it is important that it can manage sensitive data and PII.
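To make the "uniform, Cartesian space of values" goal concrete, here is a minimal sketch of turning a list of records into a dense two-dimensional array of the kind a pivot table or data frame expects. It is an illustration only, not ActiveData's implementation: the function name `to_cube`, the column names, and the sample values are all hypothetical.

```python
def to_cube(records, row_key, col_key, value_key, fill=None):
    """Arrange records into a dense 2-D grid indexed by two dimensions.

    Every (row, column) pair gets a cell, even when no record exists
    for it; such cells hold `fill`. This is what makes the result
    "uniform": consumers never have to handle a ragged shape.
    """
    rows = sorted({r[row_key] for r in records})
    cols = sorted({r[col_key] for r in records})
    row_index = {value: i for i, value in enumerate(rows)}
    col_index = {value: i for i, value in enumerate(cols)}

    cube = [[fill] * len(cols) for _ in rows]
    for r in records:
        cube[row_index[r[row_key]]][col_index[r[col_key]]] = r[value_key]
    return rows, cols, cube


# Hypothetical sample data; these are not real ActiveData columns or results.
records = [
    {"branch": "a", "platform": "linux", "failures": 3},
    {"branch": "a", "platform": "win", "failures": 1},
    {"branch": "b", "platform": "linux", "failures": 2},
]

rows, cols, cube = to_cube(records, "branch", "platform", "failures", fill=0)
```

Here `rows` and `cols` act as the axes of the result, and `cube` holds one value per combination of the two, with the missing ("b", "win") combination filled in explicitly.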
Limitations
ActiveData makes specific tradeoffs to achieve its goals. It has the following limitations:
- large memory requirements
- low add/update/remove speeds
- strict data model (snowflake schema, hierarchical relations only)
- non-relational
- ETL work required to de-normalize data
- ETL work required to provide dimension metadata
Non Goals
ActiveData is not meant to replace an application database. Applications often track significantly more data related to good interface design, process sequences, complex relations, and object life cycles. ActiveData's simple model makes it difficult to track object life cycles and impossible to model many-to-many relations. Data is not live, and definitely does not track "pending jobs" like TreeHerder or TaskCluster do. Test results may take a day, or more, to be indexed.
Dependencies / Who will use this
Dependencies
Users
Let's Use It!
The service listens at http://activedata.allizom.org/query and accepts queries in Qb format.
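As a rough sketch of how an automated client might talk to the endpoint above, the following uses only the Python standard library to package a query as JSON and POST it. Several details are assumptions rather than documented facts: that the service expects a JSON body sent by POST, that the response is JSON, and the shape of `example_query` (the table name `unittest` and the `limit` property are placeholders; consult the Qb reference for the real format).

```python
import json
import urllib.request

# Endpoint named in the text above.
ACTIVEDATA_URL = "http://activedata.allizom.org/query"


def build_request(query, url=ACTIVEDATA_URL):
    """Package a query (a plain dict) as a JSON POST request.

    Assumption: the service accepts the query as a JSON request body.
    """
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query, url=ACTIVEDATA_URL, timeout=30):
    """Send the query and decode the reply (assumed to be JSON)."""
    with urllib.request.urlopen(build_request(query, url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical query: the table name and properties are placeholders.
example_query = {
    "from": "unittest",
    "limit": 10,
}

request = build_request(example_query)
```

Calling `run_query(example_query)` would perform the network round trip; building the request separately makes it easy to inspect what will be sent before sending it.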
The Query Tool
The ActiveData service is intended for use by automated clients, not humans. The Query Tool (http://activedata.allizom.org/tools/query.html) is a minimal web page that lets humans do some exploration and test the phrasing of queries.
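Documentation
- Simple tutorial: https://github.com/klahnakoski/ActiveData/blob/master/docs/Qb_Tutorial.md
- Unittest results schema: https://github.com/klahnakoski/ActiveData/blob/master/docs/Unittest%20Schema.md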
Code
Development is still in the early stages, setting up your own service
- Github: https://github.com/klahnakoski/ActiveData
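Tests
- Testing code which sends some unittest-specific queries: https://github.com/klahnakoski/ActiveData/blob/master/tests/test_unittests.py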
Bugs
Bugs are tracked in Bugzilla. The open issues are shown here:
2 Total; 2 Open (100%); 0 Resolved (0%); 0 Verified (0%);
Contact
- Kyle Lahnakoski
- IRC: ekyle@irc.mozilla.org
- Email: klahnakoski@mozilla.org
- Bugzilla: :ekyle
More Context
Mostly rambling, optional reading.
Inspiration
This project is inspired by the data warehouse and data mart technology that is common inside large corporations. These warehouses are useful because they are "active" services: the data is not only available, but can also be explored interactively by a large audience using a query language.
General Problem
A significant portion of any application is its backend database/datastore, and the work that comes with it includes:
- Managing resources and machines to support the datastore
- Data migrations on schemas during application lifetime
- Manually defining database indexes for responsive data retrieval
- Coding caching logic to reduce application latency
The manual effort put toward these features becomes significant as the amount of data grows in size and complexity. More importantly, this effort is being spent over and over on a multitude of applications, each a trivial variation of the next.
General Solution
Abstractly, we desire to reduce this redundant workload by adding a layer of abstraction called ActiveData. Clients using ActiveData benefit from the features it provides and avoid the datastore management complexities, while ActiveData implementers can focus on these common issues, working from a simpler data model and a simpler query language upon which to calculate optimizations.
Columnar datastores have solved many (but not all) problems with changing schema. Query-directed indexing has been around for decades in Oracle's query optimization algorithms, and is available for free in ElasticSearch. We now have the technology to build an ActiveData solution.
By defining an ActiveData standard, we can innovate on both sides of the ActiveData abstraction layer independently.
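The idea that queries carry enough information to decide which indexes to build can be illustrated with a toy sketch. It does not describe how Oracle or ElasticSearch work internally. It simply walks the filter of each query, counts which columns are filtered on, and suggests the frequently used ones as index candidates. The filter shape (`and`/`or`/`not` wrapping leaf clauses such as `{"eq": {"column": value}}`) and all function and column names are hypothetical.

```python
from collections import Counter


def filtered_columns(filter_clause):
    """Yield every column name referenced by a nested filter clause."""
    for operator, operand in filter_clause.items():
        if operator in ("and", "or"):
            # Compound clause: operand is a list of sub-filters.
            for sub_filter in operand:
                yield from filtered_columns(sub_filter)
        elif operator == "not":
            # Negation: operand is a single sub-filter.
            yield from filtered_columns(operand)
        else:
            # Leaf clause such as {"eq": {"column": value}}.
            yield from operand.keys()


def suggest_indexes(queries, min_uses=2):
    """Return columns filtered on by at least `min_uses` queries."""
    usage = Counter()
    for query in queries:
        # Count each column once per query, however often it appears.
        usage.update(set(filtered_columns(query.get("where", {}))))
    return [column for column, uses in usage.most_common() if uses >= min_uses]


# Hypothetical query log; column names are placeholders.
query_log = [
    {"where": {"eq": {"branch": "x"}}},
    {"where": {"and": [{"eq": {"branch": "y"}}, {"gt": {"duration": 5}}]}},
    {"from": "some_table"},  # no filter at all
]

candidates = suggest_indexes(query_log)
```

The point of the sketch is only that the decision is driven by observed queries rather than by a developer declaring indexes up front.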
Client Architecture
Applications that leverage an active data warehouse can forgo much, if not all, server-side development and put the logic on the client side.