EngineeringProductivity/Projects/ActiveData: Difference between revisions

reformatting
(better wording)
(reformatting)
Line 6: Line 6:
This project is inspired by the data warehouse and data mart technology that is common inside large corporations, but largely non-existent in the public space.  Since Mozilla's mandate is an open web, and we have a lot of data to share, it is only logical we make our data active.
This project is inspired by the data warehouse and data mart technology that is common inside large corporations, but largely non-existent in the public space.  Since Mozilla's mandate is an open web, and we have a lot of data to share, it is only logical we make our data active.


 
= Features =
= Architecture =
 
Applications that leverage an active data warehouse can forgo much, if not all, server-side development and put the logic on the client side.
 
== Features ==


An active data instance distinguishes itself from a static resource, or database, or big data solution, by delivering a particular set of features:
An active data instance distinguishes itself from a static resource, or database, or big data solution, by delivering a particular set of features:


=== A service, open to third party clients===
* '''A service, open to third party clients''' - Because the service is provided, clients are spared the need to stand up their own datastore.
: Because the service is provided, clients are spared the need to stand up their own datastore.
* '''Fast filtering''' - Sub-second filtering over the contents of the whole datastore, independent of size, saves the application developer from declaring and managing indexes: the queries themselves carry enough information to determine which indexes should be built to deliver a quick response.
=== Fast filtering===
* '''Fast aggregates''' - Sub-second calculation of statistics over the whole datastore saves the application developer from building and managing caches of those aggregates.   
: Sub-second filtering over the contents of the whole datastore, independent of size, saves the application developer from declaring and managing indexes: the queries themselves carry enough information to determine which indexes should be built to deliver a quick response.
* '''API is a query language (SQL, MDX)''' - Building on the formalisms and familiarity of existing query languages reduces the learning curve.  It also gives ActiveData implementations more insight into the intent of the client application, so they can optimize for its use cases.
=== Fast aggregates===
* '''Uniform, Cartesian space of values''' - Mozilla has a mandate of data-driven decision making.  Data analysis tools such as R, SciPy, NumPy, and pandas are what is used to perform that analysis, and they all require uniform data in multi-dimensional arrays.  ActiveData's objective is to provide query results in these formats.
: Sub-second calculation of statistics over the whole datastore saves the application developer from building and managing caches of those aggregates.   
* '''Metadata on dimensions and measures''' - ActiveData also provides context for the data it holds.  This metadata allows exploration and discovery by third parties by describing units of measure, how dimensions relate to one another, and possibly human-readable descriptions of the stored columns.  It is also invaluable in automating the orientation and formatting of dashboard charts: knowing the domain of an axis allows code to decide the best (default) chart form and to offer logically reasonable aggregate options.
=== API is a query language (SQL, MDX)===
* '''Has a security model''' - Simpler applications can avoid the complications of a security model if it is baked into the ActiveData solution.  If ActiveData is to become mainstream it is important that it can manage sensitive data and PII.
: Building on the formalisms and familiarity of existing query languages reduces the learning curve.  It also gives ActiveData implementations more insight into the intent of the client application, so they can optimize for its use cases.
=== Uniform, Cartesian space of values===
: Mozilla has a mandate of data-driven decision making.  Data analysis tools such as R, SciPy, NumPy, and pandas are what is used to perform that analysis, and they all require uniform data in multi-dimensional arrays.  ActiveData's objective is to provide query results in these formats.
=== Metadata on dimensions and measures===
: ActiveData also provides context for the data it holds.  This metadata allows exploration and discovery by third parties by describing units of measure, how dimensions relate to one another, and possibly human-readable descriptions of the stored columns.  It is also invaluable in automating the orientation and formatting of dashboard charts: knowing the domain of an axis allows code to decide the best (default) chart form and to offer logically reasonable aggregate options.
=== Has a security model===
Simpler applications can avoid the complications of a security model if it is baked into the ActiveData solution.  If ActiveData is to become mainstream it is important that it can manage sensitive data and PII.
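
As a rough illustration of the "uniform, Cartesian space of values" feature, the sketch below pivots sparse aggregate rows into a dense nested array over the Cartesian product of the dimension domains.  The row format, the column names, and the <code>to_cube</code> helper are assumptions made for this example; they are not ActiveData's actual response format.

```python
def to_cube(rows, dims, measure):
    """Pivot sparse aggregate rows into a dense nested list.

    Every combination of dimension values gets a cell; combinations
    with no matching row are filled with None.
    """
    # Domain of each dimension: the sorted distinct values seen in the rows.
    domains = [sorted({row[d] for row in rows}) for d in dims]
    # Lookup from a coordinate tuple to the measured value.
    index = {tuple(row[d] for d in dims): row[measure] for row in rows}

    def build(prefix, depth):
        if depth == len(dims):
            return index.get(prefix)  # None when the cell is empty
        return [build(prefix + (value,), depth + 1) for value in domains[depth]]

    return domains, build((), 0)


# Hypothetical aggregate result: failure counts per platform and day.
rows = [
    {"platform": "linux", "day": "mon", "failures": 3},
    {"platform": "linux", "day": "tue", "failures": 1},
    {"platform": "win", "day": "tue", "failures": 4},
]
domains, cube = to_cube(rows, ["platform", "day"], "failures")
```

Here <code>domains</code> holds the axis labels and <code>cube</code> is a rectangular nested list with one cell per (platform, day) pair, which is the shape that array-oriented analysis tools expect as input.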


=Context=


== Problem ==
== Problem ==
Line 47: Line 36:
Columnar datastores have solved many (but not all) problems with changing schema.  Query-directed indexing has been around for decades in Oracle's query optimization algorithms, and is available for free in ElasticSearch.  We now have the technology to build an ActiveData solution.
Columnar datastores have solved many (but not all) problems with changing schema.  Query-directed indexing has been around for decades in Oracle's query optimization algorithms, and is available for free in ElasticSearch.  We now have the technology to build an ActiveData solution.


By defining an ActiveData standard, we can innovate on both sides of the ActiveData abstraction layer independently.


== Client Architecture ==


By defining an ActiveData standard, we can innovate on both sides of the ActiveData abstraction layer independently.
Applications that leverage an active data warehouse can forgo much, if not all, server-side development and put the logic on the client side.
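
As a small, hedged example of logic living on the client: given axis metadata of the kind described under Features, a client could pick a default chart form without any server-side code.  The metadata keys (<code>type</code>, <code>values</code>), the domain type names, and the threshold of 20 categories are all hypothetical choices for this sketch.

```python
def default_chart(axis):
    """Choose a default chart form from (hypothetical) axis metadata."""
    kind = axis.get("type")
    if kind == "time":
        # Ordered, continuous axis: a line shows the trend.
        return "line"
    if kind == "set":
        # A few categories read well as bars; many are better as a table.
        return "bar" if len(axis.get("values", [])) <= 20 else "table"
    if kind == "range":
        # Numeric buckets: show the distribution.
        return "histogram"
    # Unknown domain: fall back to the least presumptuous form.
    return "table"


chart = default_chart({"type": "set", "values": ["linux", "win", "mac"]})
```

With the three-value categorical axis above, <code>chart</code> is <code>"bar"</code>; an axis with no recognised domain type falls back to a plain table.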


==Non Solutions==
==Non Solutions==


ActiveData makes specific tradeoffs to achieve its goals, and there are situations 
ActiveData makes specific tradeoffs to achieve its goals, and there are situations where ActiveData will not provide a benefit:
* large memory requirements
* large memory requirements
* low add/update/remove speeds
* low add/update/remove speeds