AUS:v3

Database Schema Goals

Referential integrity in data source
- Add an RDBMS - MySQL likely - that uses a schema designed to store the data required for business rules and update metadata -- including data points we may not even track currently (these need to be defined).
No duplication of data
- Basic normalization (without OVER normalizing) should quite easily take care of this.

Proposed Solution

Using this less than completely normalized solution is theorized to be faster. Rather than using joins, each table will have it's value looked up with an independent query. Caching the results of these queries will minimize actual hits on the database.

Collapsed

Mapping Files To Tables

In mapping the pathname to database fields, each segment in the path leads to an entry in a database table:

incoming/2/Firefox/trunk/Linux_x86-gcc3/2007061204/en-US/complete.txt -> Product.name

incoming/2/Firefox/trunk/Linux_x86-gcc3/2007061204/en-US/complete.txt -> Version.name

incoming/2/Firefox/trunk/Linux_x86-gcc3/2007061204/en-US/complete.txt -> Build_target.name

incoming/2/Firefox/trunk/Linux_x86-gcc3/2007061204/en-US/complete.txt -> Build.name

incoming/2/Firefox/trunk/Linux_x86-gcc3/2007061204/en-US/complete.txt -> locale.name

However, that leaves four tables with no source of values:

? -> channel.name

? -> os_version.name

? -> Distribution.name

? -> Distibution_version.name

The rest of the values to be mapped into the database come from the complete.txt and partial.txt files. An example mapping follows:

complete  -> patches.type
http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2007-06-12-07-trunk/firefox-3.0a6pre.en-US.linux-i686.complete.mar -> patches.URL
sha1 -> patches.hashFunction
4fe7d4db03ef96eb8f34bce05d740d986fcb0bc0 -> patches.hashValue
10752925 -> ??
2007061207 -> ??
3.0a6pre -> updates.version
3.0a6pre -> updates.extensionVersion

Unfortunately, that leaves three columns in two tables without a data source:

?? -> updates.types

?? -> updates.detailsURL

?? -> patches.size

table 1: updates

column name	type	attribute	comment
id	sequence	primary
type	varchar \| enum \| integer		several options for implementation
version	varchar		likely a foreign key?
extensionVersion	varchar		likely a foreign key?
detailsURL	varchar		is this url completely unique?

id	type	version	extensionVersion	detailsURL
0	minor	1.0.4	1.0	http://www.foo.com/1.0.4/whatsnew.html
1	major	1.1.2	1.1	http://www.foo.com/1.1.2/whatsnew.html

table 2: patches

column name	type	attribute	comment
id	sequence	primary
type	varchar \| enum \| integer		several options for implementation
URL	varchar		is this url completely unique?
hashFunction	varchar
hashValue	varchar
size	varchar		if it's never actually used as an integer
updateId	integer	FK(updates)

id	type	URL	hashFunction	hashValue	size	updateId
0	partial	http://www.foo.com/1.0.4-partial.mar				0
1	complete	http://www.foo.com/1.0.4-complete.mar				0
2	complete	http://www.foo.com/1.1.2-complete.mar				1

table 3: UpdateParameters

column name	type	attribute	comment
id	sequence	primary
productId	integer	FK(product)
versionId	integer	FK(version)
build_idId	integer	FK(build_id)
build_targetId	integer	FK(build_target)
localeId	integer	FK(locale)
channelId	integer	FK(channel)
os_versionId	integer	FK(os_version)
distributionId	integer	FK(distribution)
distribution_versionId	integer	FK(distribution_version)
updateId	integer	FK(updates)

table 4: Product

column name	type	attribute	comment
id	sequence	primary
name	varchar

table 5: Version

column name	type	attribute	comment
id	sequence	primary
name	varchar

table 6: buildId

column name	type	attribute	comment
id	sequence	primary
name	varchar

table 7: buildTarget

column name	type	attribute	comment
id	sequence	primary
name	varchar

table 8: Locale

column name	type	attribute	comment
id	sequence	primary
name	varchar

table 9: Channel

column name	type	attribute	comment
id	sequence	primary
name	varchar

table 10: Os_version

column name	type	attribute	comment
id	sequence	primary
name	varchar

table 11: Distribution

column name	type	attribute	comment
id	sequence	primary
name	varchar

table 11: Distribution_version

column name	type	attribute	comment
id	sequence	primary
name	varchar

Previous Proposals

Fully Normalized Hierarchy

Stats:

number of tables: 17
number of joins to get data: 16

Sample SQL:

select 
  payload.* 
from 
  product join pv on (product.name = %PRODUCT% and product.id = pv.productId)  
            join version on (version.name = %VERSION% and version.id = pv.versionId)
              join pvb on (pv.id = pvb.pvId)
                join build_id on (build_id.name = %BUILD_ID% and build_id.id = pvb.build_idId)
                  join pvbbt on (pvb.id = pvbbt.pvbId)
                    join build_target on (build_target.name = %BUILD_TARGET% and build_target.id = pvbbt.build_targetId)
                      join pvbbtl on (pvbbt.id = pcbbtl.pvbbtId)
                        join locale on (locale.name = %LOCALE% and locale.id = pvbbtl.localId)
                          join pvbbtlc on (pvbbtl.id = pvbbtlc.pvbbtlId)
                            join channel on (channel.name = %CHANNEL% and channel.id = pvbbtlc.channelId)
                              join pvbbtlcos on (pvbbtlc.id = pcvbbtlcos.pvbbtlcId)
                                join os_version on (os_version.name = %OS_VERSION% and os_version.id = pvbbtlcos.os_versionId)
                                  join pvbbtlcosd on (pvbbtlcos.id = pvbbtlcosd.pcbbtlcosId)
                                    join distribution in (distribution.name = %DISTRIBUTION% and distribution.id = pvbbtlcosd.distributionId)
                                      join payload on (pvbbtlcosd.id = payload.pvbbtlcosdId)
                                        join distibution_version on (distibution_version.name = %DISTRIBUTION_VERSION% and distibution_version.id = payload.distibution_versionId)

Advantages:

simulates the original hierarchical structure
highly normalized

Disadvantages:

very verbose SQL
structure, while faithful to the previous system, doesn't necessarily accurately model the data relationships

Mostly Normalized Hierarchy with Rearrangement

Stats

number of tables: 16
number of joins: 15

Sample SQL

select 
  payload.* 
from 
  payload join (product join pv on (product.name = %PRODUCT% and product.id = pv.productId)  
                          join version on (version.name = %VERSION% and version.id = pv.versionId)
                            join pvb on (pv.id = pvb.pvId)
                              join build_id on (build_id.name = %BUILD_ID% and build_id.id = pvb.build_idId))
                on payload.pvbId = pcbId.id
            join (btos join build_target on (build_target.name = %BUILD_TARGET% and build_target.id = btos.build_targetId)
                         join os_version on (os_version.name = %OS_VERSION% and os_version.id = btos.os_versionId)
                           join btosl on (btos.id = btosl.btosId)
                             join locale on (locale.name = %LOCALE% and locale.id = btosl.localeId))
                  on payload.btoslId = btosl.id
               join (cd join channel on (channel.name = %CHANNEL% and channel.id = cd.channelId)
                          join distribution on (distribution.name = %DISTRIBUTION% and distribution.id = cd.distributionId)
                            join cddv on (cd.id = cddv.cdId)
                              join distribution_version on (distribution_version.name = %DISTRIBUTION_VERSION% and distribution_version.id = cddv.distribution_versionId))
                     on payload.cddvId = cddv.id

Advantages

normalized
rearrangement more accurately reflect the data relationships

Disadvantages

complicated SQL

Collapsed

Stats:

number of tables: 10
number of joins to get data: 9

Sample SQL:

select 
  payload.* 
from 
  payload join product on (product.name = %PRODUCT% and product.id = payload.productId)  
            join version on (version.name = %VERSION% and version.id = payload.versionId)
              join build_id on (build_id.name = %BUILD_ID% and build_id.id = payload.build_idId)
                join build_target on (build_target.name = %BUILD_TARGET% and build_target.id = payload.build_targetId)
                  join locale on (locale.name = %LOCALE% and locale.id = payload.localId)
                    join channel on (channel.name = %CHANNEL% and channel.id = payload.channelId)
                      join os_version on (os_version.name = %OS_VERSION% and os_version.id = payload.os_versionId)
                        join distribution in (distribution.name = %DISTRIBUTION% and distribution.id = payload.distributionId)
                          join distibution_version on (distibution_version.name = %DISTRIBUTION_VERSION% and distibution_version.id = payload.distibution_versionId)

Advantages

fewer joins means faster queries

Disadvantages

poor normalization
no direct expression of the relationships between the keys

Degenerate

No Image

The information is stored in a single table keyed by the entire URL.

Stats:

number of tables: 1
number of joins to get data: 0

Sample SQL:

select payload.* from payload where url = %URL%

Advantages

fastest of the options
no complicated SQL to maintain

Disadvantages

no relationships between keys stored in the database

Metadata First Cut

Converting an XML Schema into a relational database schema can be pretty straight forward. An element generally corresponds with a row in a table. The attributes inside the start tags correspond to the data in columns of the table. Nested elements are rows in another table that has a foreign key relationship with the table to which the parent element is associated. The tree structure of XML lends itself to orderly relational database schema.

While more complicated relationships can exist between the elements, the schema of this project seems very simple.

<?xml version="1.0"?>

<updates>
  <update type="minor" version="1.0.4" extensionVersion="1.0" 
          detailsURL="http://www.foo.com/1.0.4/whatsnew.html">
    <patch type="partial" URL="http://www.foo.com/1.0.4-partial.mar"
           hashFunction="" hashValue="" size=""/>
    <patch type="complete" URL="http://www.foo.com/1.0.4-complete.mar"
           hashFunction="" hashValue="" size=""/>
  </update>
  ..
  <update type="major" version="1.1.2" extensionVersion="1.1"
          detailsURL="http://www.foo.com/1.1.2/whatsnew.html">
    <patch type="complete" URL="http://www.foo.com/1.1.2-complete.mar"
           hashFunction="" hashValue="" size=""/>
  </update>
</updates>

AUS Metadata Database Schema Naive Version 1

This proposal introduces just two tables: updates and patches. They correspond to the major elements of the aforementioned XML file.

table 1: updates

column name	type	attribute	comment
id	sequence	primary
type	varchar \| enum \| integer		several options for implementation
version	varchar		likely a foreign key?
extensionVersion	varchar		likely a foreign key?
detailsURL	varchar		is this url completely unique?

id	type	version	extensionVersion	detailsURL
0	minor	1.0.4	1.0	http://www.foo.com/1.0.4/whatsnew.html
1	major	1.1.2	1.1	http://www.foo.com/1.1.2/whatsnew.html

table 2: patches

column name	type	attribute	comment
id	sequence	primary
type	varchar \| enum \| integer		several options for implementation
URL	varchar		is this url completely unique?
hashFunction	varchar
hashValue	varchar
size	varchar		if it's never actually used as an integer
updateId	integer	FK(updates)

id	type	URL	hashFunction	hashValue	size	updateId
0	partial	http://www.foo.com/1.0.4-partial.mar				0
1	complete	http://www.foo.com/1.0.4-complete.mar				0
2	complete	http://www.foo.com/1.1.2-complete.mar				1

Discussion

This schema is missing any indication of the product. Does this system need to take into account more than just one product? If is does, imagine another enclosing

How does this schema meet the goals at the top of the page?

...data required for business rules and update metadata...: What are the business rules? Can't test to see if this is sufficient for the business rules until I have business rules. If the "update metadata" is captured in its entirety by the XML example, the it is safe to judge that this database schema is equivalent.
Basic normalization (without OVER normalizing)...: This simple schema is normalized if the values in the columns are unique. Each table has a type attribute that could be foreign keys to tables listing the allowed types. Since there are so few options, splitting them off into their own tables would be over normalization. The URL columns, if not unique within their tables, could be consolidated into their own tables leaving behind a foreign key.

AUS:v3

Contents

Database Schema Goals

Proposed Solution

Collapsed

Mapping Files To Tables

table 1: updates

table 2: patches

table 3: UpdateParameters

table 4: Product

table 5: Version

table 6: buildId

table 7: buildTarget

table 8: Locale

table 9: Channel

table 10: Os_version

table 11: Distribution

table 11: Distribution_version

Previous Proposals

Fully Normalized Hierarchy

Mostly Normalized Hierarchy with Rearrangement

Collapsed

Degenerate

Metadata First Cut

AUS Metadata Database Schema Naive Version 1

table 1: updates

table 2: patches

Discussion

Navigation menu

AUS:v3

Database Schema Goals

Proposed Solution

Collapsed

Mapping Files To Tables

table 1: updates

table 2: patches

table 3: UpdateParameters

table 4: Product

table 5: Version

table 6: buildId

table 7: buildTarget

table 8: Locale

table 9: Channel

table 10: Os_version

table 11: Distribution

table 11: Distribution_version

Previous Proposals

Fully Normalized Hierarchy

Mostly Normalized Hierarchy with Rearrangement

Collapsed

Degenerate

Metadata First Cut

AUS Metadata Database Schema Naive Version 1

table 1: updates

table 2: patches

Discussion

Navigation menu

Search