EngineeringProductivity/Projects/ActiveData: Difference between revisions

From MozillaWiki
(moved from /projects/activedata)
 
(save this draft)
Revision as of 08:07, 18 December 2014

Overview

This project is born from the belief that to make our data useful it must be active. This means the data is not only available, but it can be explored interactively by a larger audience.

This project is inspired by the data warehouse and data mart technology that is common inside large corporations, but largely non-existent in the public space. Since Mozilla's mandate is an open web, and we have a lot of data to share, it is only logical we make our data active.


Architecture

Applications that leverage an active data warehouse can forgo much, if not all, of their server-side development and put the logic on the client side.

Features

An active data instance distinguishes itself from a static resource, or database, or big data solution, by delivering a particular set of features:

A service, open to third party clients

Because ActiveData is provided as a service, clients are spared the need to stand up their own datastore.

Fast filtering

Sub-second filtering over the contents of the whole datastore, independent of size, saves the application developer from declaring and managing indexes that do the same. The queries themselves carry enough information to determine which indexes should be built to deliver a quick response.

Fast aggregates

Sub-second calculation of statistics over the whole datastore saves the application developer from building and managing caches of those aggregates.

API is a query language (SQL, MDX)

By building upon the formalisms, and familiarity, of existing query languages, we reduce the learning curve. A query language also gives ActiveData implementations more insight into the intent of the client application, so they can optimize for its use cases.
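
As a minimal sketch of what a declarative query offers, the example below uses Python's built-in sqlite3 module as a stand-in for an ActiveData service. The stand-in, the test_results table, and its columns are all assumptions made for illustration; none of them come from the ActiveData project itself.

```python
import sqlite3

# In-memory SQLite stands in for an ActiveData service (an assumption for
# illustration; the table and column names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_results (suite TEXT, platform TEXT, duration REAL)")
conn.executemany(
    "INSERT INTO test_results VALUES (?, ?, ?)",
    [
        ("mochitest", "linux", 12.0),
        ("mochitest", "win", 18.0),
        ("reftest", "linux", 7.0),
        ("reftest", "linux", 9.0),
    ],
)

# The query states *what* is wanted (filter on platform, average per suite)
# and says nothing about how to compute it, which leaves the service free
# to pick indexes or caches that answer it quickly.
QUERY = """
    SELECT suite, AVG(duration) AS avg_duration
    FROM test_results
    WHERE platform = ?
    GROUP BY suite
    ORDER BY suite
"""
result = conn.execute(QUERY, ("linux",)).fetchall()
print(result)
```

The client names the filter column and the aggregate explicitly, so a service receiving this text can see both the access pattern and the statistic being requested.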

Uniform, Cartesian space of values

Mozilla has a mandate of data-driven decision making. Data analysis tools like R, Scipy, Numpy, and Pandas are what is used to perform that analysis, and they all require uniform data in multi-dimensional arrays. ActiveData's objective is to provide query results in these formats.
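
To illustrate what "uniform" means here, the following sketch pivots flat records into a dense two-dimensional grid in plain Python. The to_cube helper, the record fields, and the choice of None for empty cells are hypothetical; ActiveData's actual result format is not specified in this document.

```python
def to_cube(records, row_dim, col_dim, measure):
    """Pivot flat records into a dense 2-D grid over two dimensions.

    Returns (row_domain, col_domain, grid). Cells with no data are None,
    so every row has the same length and the grid is rectangular.
    """
    row_domain = sorted({r[row_dim] for r in records})
    col_domain = sorted({r[col_dim] for r in records})
    row_index = {value: i for i, value in enumerate(row_domain)}
    col_index = {value: j for j, value in enumerate(col_domain)}
    grid = [[None] * len(col_domain) for _ in row_domain]
    for r in records:
        grid[row_index[r[row_dim]]][col_index[r[col_dim]]] = r[measure]
    return row_domain, col_domain, grid


# Hypothetical records: note there is no (win, reftest) entry.
records = [
    {"platform": "linux", "suite": "mochitest", "failures": 3},
    {"platform": "linux", "suite": "reftest", "failures": 1},
    {"platform": "win", "suite": "mochitest", "failures": 5},
]
platforms, suites, grid = to_cube(records, "platform", "suite", "failures")
print(platforms, suites, grid)
```

Every combination of the two dimension domains has a cell, even where no record exists, which is the rectangular shape array-based tools expect.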

Metadata on dimensions and measures

ActiveData also provides context for the data it holds. This metadata allows exploration and discovery by third parties by describing units of measure, how dimensions relate to one another, and possibly human-readable descriptions of the stored columns. The metadata is also invaluable in automating the orientation and formatting of dashboard charts: knowing the domain of an axis allows code to choose the best default chart form and to offer logically reasonable aggregate options.
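
A rough sketch of how a dashboard might use such metadata is shown below. The metadata layout (a "domain" with a "type", a "unit" on each measure), the domain type names, and the chart and aggregate choices are all assumptions for illustration, not a schema defined by ActiveData.

```python
def default_chart(dimension):
    """Pick a default chart form from a dimension's domain type."""
    domain_type = dimension["domain"]["type"]
    if domain_type == "time":
        return "line"       # ordered, continuous axis
    if domain_type == "set":
        return "bar"        # unordered categories
    if domain_type == "range":
        return "histogram"  # numeric buckets
    return "table"          # unknown domain: fall back to raw values


def aggregate_options(measure):
    """Offer aggregates that are plausible for a measure's unit."""
    if measure.get("unit") == "count":
        return ["sum", "max"]
    return ["average", "min", "max", "median"]


# Hypothetical metadata entries for one dimension and one measure.
build_date = {"name": "build_date", "domain": {"type": "time"}}
duration = {"name": "duration", "unit": "seconds"}

print(default_chart(build_date), aggregate_options(duration))
```

The point of the sketch is only that the decision is made from metadata alone, without inspecting the data values.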

Has a security model

Simpler applications can avoid the complications of a security model if it is baked into the ActiveData solution. If ActiveData is to become mainstream, it is important that it can manage sensitive data and PII.


Problem

A significant portion of the work in any application goes into the backend database/datastore, which includes:

  • Managing resources and machines to support the datastore
  • Data migrations on schemas during application lifetime
  • Manually defining database indexes for responsive data retrieval
  • Coding caching logic to reduce application latency

The manual effort put toward these tasks becomes significant as the data grows in size and complexity. More importantly, this effort is being spent over and over on a multitude of applications, each a trivial variation of the next.

Solution

Abstractly, we want to reduce this redundant workload by adding a layer of abstraction called ActiveData. Clients using ActiveData benefit from the features it provides and avoid the complexities of datastore management, while ActiveData implementers can focus on these common issues, working with a simpler data model and a simpler query language upon which to calculate optimizations.

Columnar datastores have solved many (but not all) problems with changing schemas. Query-directed indexing has been around for decades in Oracle's query optimization algorithms, and is available for free in ElasticSearch. We now have the technology to build an ActiveData solution.
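
The idea of letting queries direct the indexing can be sketched in a few lines. This is a toy illustration only, not how Oracle or ElasticSearch actually work: the query log format, the suggest_indexes helper, and the min_uses threshold are all hypothetical.

```python
from collections import Counter


def suggest_indexes(query_log, min_uses=2):
    """Return columns filtered on at least `min_uses` times, most used first."""
    uses = Counter(col for query in query_log for col in query["where"])
    return [col for col, count in uses.most_common() if count >= min_uses]


# Hypothetical log of structured queries received by the service.
query_log = [
    {"select": ["duration"], "where": {"platform": "linux"}},
    {"select": ["duration"], "where": {"platform": "win", "suite": "reftest"}},
    {"select": ["failures"], "where": {"suite": "mochitest", "branch": "central"}},
    {"select": ["failures"], "where": {"platform": "linux"}},
]
suggested = suggest_indexes(query_log)
print(suggested)
```

Because the filters arrive as structured queries rather than opaque application code, the service can tally which columns matter and build indexes for them without the client declaring any.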


By defining an ActiveData standard, we can innovate on both sides of the ActiveData abstraction layer independently.

Non Solutions

ActiveData makes specific tradeoffs to achieve its goals, and there are situations where it is not a suitable solution:

  • memory hog
  • transactional speed
  • strict data model (snowflake schema, hierarchical relations)
  • non-relational
  • ETL work required to denormalize data and provide dimension metadata