Wednesday, January 08, 2014

Baseline Conceptual Models: Data Catalog and Schema Model

Data Catalog & Schema Model:   In a data driven pipeline for a data warehouse the data definitions stored in a meta data repository and is used to speed up pipeline development as well as enforce design conformity.  Entities are organized into a category hierarchy for simple exploratory navigation.   Throughout the live cycle of an entity can be revised several times.  Each revision of the entity will have one or more attributes.   Attributes are first predefined independently from any entity.  Then each entity will bind to the appropriate predefined attributes.  This enforces a solid naming convention and definitions throughout the pipelines in the data warehouse.

Entity Types:  In a data warehouse data can be classified into different types to signify the value it brings to the warehouse.
  • 3rd Party Datasets are upstream data that are provided by external companies.
  • Raw Logs are upstream data that are provided by other systems within the company.  Examples are Website Logs, Sales or Purchase Orders, and Inventory Movement.
  • Master Data Domains are data that came from a Master Data Service or collected across multiple company systems.  This kind of data is used to build dimensions.
  • Dimensions are data elements that enable filtering, grouping and labeling facts (measures).
  • Staging Datasets are data that has been cleansed, enhanced, or conformed and ready for further processing.
  • Atomic Facts are the lowest level of grain for a set of measures.   Measures can be filtered, grouped, and labeled by a predefined set of dimensions
  • Aggregated Facts are an summarized set of measures that have been summed up from an atomic fact.  Used for business snapshot purposes like end month or end of year reporting or can be used for performance tuning.   
  • Reports are usually simple facts (sometimes non-pivotable) that the end user uses as a statement of measure on key performance (KPI) for the business. 
Risk Assessment:  In an enterprise data set there can be information the could cause damage to the business or customer if released to the outside world.   Each attribute should be assessed in order to evaluate the risks for the company.

Data Variance Check Definitions:  An attribute to an entity may have a

Attribute Types:  Each attribute in a entity has a specific meaning in a data warehouse.  
  • Dimension Keys are surrogate key that is used to make to a dimension. 
  • Business Keys are values used to lookup the dimension key using in the key mapping phase of an ETL.  These are found in the upstream source streams.
  • Measures Attributes are numeric values representing a count or output of an equation.
  • Regular Attributes are general values used for pass-through processes or user consumption.
Data Lineage:  Tracking data lineage from one data source to another within the warehouse is a critical to track provenance of the data and enables faster diagnostics when data goes bad.  This a fine grain data lineage feature.

Please see (Baseline Conceptual Models Commentary) for further details on what conceptual models are to be used for.

No comments: