EngineeringProductivity/Projects/ActiveData: Difference between revisions

From MozillaWiki
(better wording)
(reformatting)
Revision as of 13:29, 18 December 2014

Overview

This project is born from the belief that to make our data useful it must be active. This means the data is not only available, but it can be explored interactively by a larger audience.

This project is inspired by the data warehouse and data mart technology that is common inside large corporations, but largely non-existent in the public space. Since Mozilla's mandate is an open web, and we have a lot of data to share, it is only logical we make our data active.

Features

An active data instance distinguishes itself from a static resource, or database, or big data solution, by delivering a particular set of features:

  • A service, open to third party clients - Because the service is provided, clients are spared the need to stand up their own datastore.
  • Fast filtering - Sub-second filtering over the contents of the whole datastore, independent of size, saves the application developer from declaring and managing indexes that do the same. There is sufficient information in the queries to determine which indexes should be built to deliver a quick response.
  • Fast aggregates - Sub-second calculation of statistics over the whole datastore saves the application developer from building and managing caches of those aggregates.
  • API is a query language (SQL, MDX) - Building upon the formalisms, and familiarity, of existing query languages reduces the learning curve. It also gives ActiveData implementations more insight into the intent of the client application, so they can optimize for its use cases.
  • Uniform, Cartesian space of values - Mozilla has a mandate of data-driven decision making. Data analysis tools like R, SciPy, NumPy, and Pandas are what is used to perform data analysis, and they all require uniform data in multi-dimensional arrays. ActiveData's objective is to provide query results in these formats.
  • Metadata on dimensions and measures - ActiveData also provides context for the data it holds. This allows exploration and discovery by third parties: the metadata describes units of measure and how dimensions relate to one another, and may even provide human-readable descriptions of the columns stored. This metadata is also invaluable in automating the orientation and formatting of dashboard charts: knowing the domain of an axis allows code to decide the best (default) chart form and to offer logically reasonable aggregate options.
  • Has a security model - Simpler applications can avoid the complications of a security model if it is baked into the ActiveData solution. If ActiveData is to become mainstream, it is important that it can manage sensitive data and PII.
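
To make the "uniform, Cartesian space of values" and "metadata on dimensions" features more concrete, here is a minimal sketch. It is not ActiveData's implementation: the names `Dimension`, `to_cube`, and `default_chart`, the `kind` vocabulary, and the sample records are all assumptions made for illustration. It aggregates raw records into a dense grid with one cell per combination of dimension values, and uses the dimension metadata to pick a default chart form.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Dimension:
    """Hypothetical metadata for one dimension (one axis of the result)."""
    name: str
    domain: Sequence[str]   # every value the axis can take, in display order
    kind: str               # assumed vocabulary: "time" or "category"


def to_cube(records: List[Dict], rows: Dimension, cols: Dimension,
            measure: str) -> List[List[float]]:
    """Sum `measure` into a dense rows x cols grid; empty cells stay 0.0."""
    row_index = {value: i for i, value in enumerate(rows.domain)}
    col_index = {value: j for j, value in enumerate(cols.domain)}
    cube = [[0.0] * len(cols.domain) for _ in rows.domain]
    for record in records:
        i = row_index[record[rows.name]]
        j = col_index[record[cols.name]]
        cube[i][j] += record[measure]
    return cube


def default_chart(axis: Dimension) -> str:
    """Pick a default chart form from the axis metadata (an assumed rule)."""
    return "line" if axis.kind == "time" else "bar"


# Hypothetical dimensions and records, invented for this example.
platform = Dimension("platform", ["linux", "mac", "win"], "category")
day = Dimension("day", ["2014-12-16", "2014-12-17"], "time")

records = [
    {"platform": "linux", "day": "2014-12-16", "failures": 3},
    {"platform": "linux", "day": "2014-12-17", "failures": 1},
    {"platform": "win", "day": "2014-12-16", "failures": 2},
    {"platform": "win", "day": "2014-12-16", "failures": 4},
]

cube = to_cube(records, platform, day, "failures")
```

The grid has the same shape no matter which records are missing (the `mac` row is present but all zero), which is the property array-based analysis tools rely on.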

Context

Problem

A significant portion of the work in any application goes into the backend database/datastore, which includes:

  • Managing resources and machines to support the datastore
  • Data migrations on schemas during application lifetime
  • Manually defining database indexes for responsive data retrieval
  • Coding caching logic to reduce application latency

The manual effort put toward these features becomes significant as the amount of data grows in size and complexity. More importantly, this effort is being spent over and over on a multitude of applications, each a trivial variation of the next.

Solution

To reduce this redundant workload, we add a layer of abstraction called ActiveData. Clients using ActiveData benefit from the features it provides and avoid the complexities of datastore management, while ActiveData implementers can focus on these common issues and are given a simpler data model, and a simpler query language, upon which to calculate optimizations.

Columnar datastores have solved many (but not all) of the problems with changing schemas. Query-directed indexing has been around for decades in Oracle's query optimization algorithms, and is available for free in ElasticSearch. We now have the technology to build an ActiveData solution.

By defining an ActiveData standard, we can innovate on both sides of the ActiveData abstraction layer independently.
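
The idea of query-directed indexing can be sketched in a few lines. This is a toy illustration under stated assumptions, not how Oracle, ElasticSearch, or ActiveData do it: the class `SelfIndexingStore`, its methods, and the sample rows are hypothetical. The store builds an index for a column the first time a query filters on that column, so the client never declares indexes.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set


class SelfIndexingStore:
    """Toy datastore that indexes a column the first time it is filtered on."""

    def __init__(self, rows: List[Dict[str, Any]]) -> None:
        self._rows = rows
        # column name -> {value -> set of row positions}
        self._indexes: Dict[str, Dict[Any, Set[int]]] = {}

    def _index_for(self, column: str) -> Dict[Any, Set[int]]:
        # The query itself tells us which column needs an index.
        if column not in self._indexes:
            index: Dict[Any, Set[int]] = defaultdict(set)
            for position, row in enumerate(self._rows):
                index[row.get(column)].add(position)
            self._indexes[column] = index
        return self._indexes[column]

    def filter(self, **equals: Any) -> List[Dict[str, Any]]:
        """Return rows where every named column equals the given value."""
        matches = set(range(len(self._rows)))
        for column, value in equals.items():
            matches &= self._index_for(column).get(value, set())
        return [self._rows[p] for p in sorted(matches)]

    @property
    def indexed_columns(self) -> List[str]:
        """Columns that queries have caused to be indexed so far."""
        return sorted(self._indexes)


# Hypothetical sample data, invented for this example.
store = SelfIndexingStore([
    {"platform": "linux", "status": "pass"},
    {"platform": "win", "status": "fail"},
    {"platform": "win", "status": "pass"},
])

before = store.indexed_columns
failed_on_win = store.filter(platform="win", status="fail")
after = store.indexed_columns
```

The sketch never updates its indexes after construction, so it sidesteps the cost of keeping them current as rows are added or changed.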

Client Architecture

Applications that leverage an active data warehouse can forgo much, if not all, of their server-side development and put the logic on the client side.
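
A rough sketch of what client-side logic could look like follows. Everything specific in it is an assumption: the JSON request and response shapes, the `test_results` table, and the function names are hypothetical and do not describe ActiveData's real API. The client composes a query, sends it through a transport, and shapes the answer itself; a fake transport stands in for the remote service so the example runs offline.

```python
import json
from typing import Callable, Dict

# Hypothetical transport: takes a JSON request body, returns a JSON response body.
Transport = Callable[[str], str]


def failures_by_platform(send: Transport) -> Dict[str, int]:
    """Client-side logic: compose the query, send it, shape the answer."""
    request = json.dumps({
        "sql": "SELECT platform, COUNT(*) AS failures "
               "FROM test_results WHERE status = 'fail' GROUP BY platform"
    })
    response = json.loads(send(request))
    return {row["platform"]: row["failures"] for row in response["rows"]}


def fake_service(request_body: str) -> str:
    """Stand-in for the remote service; returns canned rows (assumed format)."""
    assert "GROUP BY platform" in json.loads(request_body)["sql"]
    return json.dumps({"rows": [
        {"platform": "linux", "failures": 4},
        {"platform": "win", "failures": 6},
    ]})


summary = failures_by_platform(fake_service)
```

Because the transport is passed in, the same client function could be pointed at a real HTTP call without changing the query or the result handling.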

Non Solutions

ActiveData makes specific tradeoffs to achieve its goals, and there are situations where ActiveData will not provide a benefit:

  • large memory requirements
  • low add/update/remove speeds
  • strict data model (snowflake schema, hierarchical relations only)
  • non-relational -
  • ETL work required to denormalize data
  • ETL work required to provide dimension metadata