Thursday, November 23, 2017

Baseline Conceptual ERP Data Model

Baseline ERP Conceptual Model by Scott Thornton.pdf

This model is a PDF file that goes into more detail (down to attributes) than this website already lays out.   It defines the information and informational relationships a business needs in order to function as a business.   Anyone working with data will find this model useful, as it helps show how business data ties together into a relational data model.

Thursday, January 23, 2014

Technology Decision Factors

When making a decision about which technology to use for a project, product, or service architecture, there are many factors to consider that will impact the total cost of ownership for a company.

Skilled Workforce

Risk: Availability, Salary Cost, Training Cost, Churn Cost

Commentary: The skilled workforce factor sometimes gets forgotten in the hunt for a technology.  If a technology is highly specialized, it decreases the chances of hiring someone with the appropriate skill level to deal with the technology.  It also increases the cost of training and/or salary.  There can be a high churn rate for the skilled resource, as they can get head-hunted by other companies or become concerned that the skill set being exercised is too narrow for their long-term career goals.  Skill sets must span Development, Test, and Operations, as well as training for end users of the technology if necessary.

Technology Support

Risk: Code Stability, Documentation Quality, Tech Support, Market Survival, Backwards Compatibility

Commentary: If a technology is bleeding edge or produced by a one-man band, it can suffer from poor documentation, poor technical support, and instabilities, and it may not survive its incubation period.   It is also important to have assurance that the company producing the technology is committed to backward compatibility so you don't lose your coding investment.

Operational Cost 

Risk: Deployment and Provisioning Costs, Maintenance Costs, Stability, Tech Support, Hot Patch Support

Commentary: The day-to-day operational costs of a technology are very important to take into consideration.  On-site hardware costs, installation costs, backup costs, failover costs, global data and processing distribution costs, tuning costs, hardware maintenance costs, software upgrade costs, hot patch costs, production troubleshooting costs, and operations staff costs are a lot to calculate.  On-site hosting may have a much higher cost of ownership than subscribing to cloud services.   Cloud services provide a much simpler and more predictable cost model and usually provide elastic provisioning to grow or shrink with your needs.   This is something to really contemplate when selecting your technology platform.

Security Cost 

Risk: Customer Safety, Business Safety

Commentary: Data and processes in the hands of the wrong people can cause irreversible damage to a customer or the business.   Personally Identifiable Information (PII) and business-critical information must be secured and not taken lightly.  Is the technology managing such information?  Can the technology tightly secure the information?   Can it keep up with the latest security features and patches to protect against security breaches?  Can it detect security breaches?  Ask the same questions for any sensitive processes.

Business Functional Requirements

Risk: Functionality,  Elastic Scalability, Reliability, Availability 

Commentary: I placed this last because these are the most obvious factors to consider.  Does the technology provide the proper level of functionality to support the business?   Can it scale with the company as its needs grow and shrink over time?  Is it stable and reliable?  Can it meet 99.9, 99.99, or 99.999 percent uptime availability requirements?   What is the Mean Time To Recovery (MTTR)?
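For a rough sense of what those availability targets imply, here is a small back-of-the-envelope calculation in Python (plain arithmetic, no particular system assumed) that converts an uptime percentage into an allowed downtime budget per year:

# Downtime budget implied by an availability target, per 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60

for availability in (99.9, 99.99, 99.999):
    downtime_minutes = MINUTES_PER_YEAR * (1 - availability / 100)
    print(f"{availability}% uptime allows about {downtime_minutes:.1f} minutes "
          f"({downtime_minutes / 60:.2f} hours) of downtime per year")

At 99.999% the budget is roughly five minutes a year, which is why MTTR matters as much as the headline availability number.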


Bottom Line: Remember that you get what you pay for, but don't pay for more than what is warranted.  And in a competitive world, getting the total cost of ownership of your enterprise technology right, balanced against the need to enable your company to compete, is critical.


Please see the following docs for the cost benefits of cloud technology solutions.
Cloud Economics
Cloud Platforms for Business Owners
Cloud Optimization - A Framework for making business decisions about cloud computing

Sunday, January 19, 2014

On-Site + Azure Cloud = Hybrid

On-Site + Azure Cloud = Hybrid:  There are many reasons why a company will not place its enterprise in the cloud quite yet.
  • The data warehouse may contain business-critical information that would be detrimental to the company if exposed to the public, and the company is not yet comfortable with atomic data being hosted in the cloud until security has improved.
  • The ETL process may end up being too brittle due to an unreliable internet connection in the area, and latency matters.
  • Or the ETL process may require more bandwidth than is available in the area.
So this hybrid topology may prove to be a useful compromise.  It ETLs a small, aggregated set of data into a data mart up in the Azure cloud that an analytics server and a reporting tool, also hosted in Azure, can access.  This leaves the enterprise data warehouse on-site, close to the physical business.
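Here is a minimal sketch of that hybrid flow in Python, assuming an on-site SQL Server warehouse and an Azure SQL data mart reachable through pyodbc; the server names, table names, credentials, and aggregate query are all placeholders, not a prescribed design:

import pyodbc  # assumes the pyodbc package and a SQL Server ODBC driver are installed

# Placeholder connection strings -- the servers, databases, and credentials are hypothetical.
ONSITE_CONN = ("DRIVER={ODBC Driver 17 for SQL Server};"
               "SERVER=onsite-dw;DATABASE=EDW;Trusted_Connection=yes;")
AZURE_CONN = ("DRIVER={ODBC Driver 17 for SQL Server};"
              "SERVER=example.database.windows.net;DATABASE=DataMart;UID=etl_user;PWD=...")

# Aggregate on-site so only a small, non-atomic summary crosses the wire.
AGGREGATE_SQL = """
SELECT CAST(OrderDate AS DATE) AS SalesDate, RegionKey, SUM(SalesAmount) AS TotalSales
FROM fact.Sales
WHERE OrderDate >= DATEADD(DAY, -1, CAST(GETDATE() AS DATE))
GROUP BY CAST(OrderDate AS DATE), RegionKey
"""

def push_daily_summary():
    src = pyodbc.connect(ONSITE_CONN)
    dst = pyodbc.connect(AZURE_CONN)
    try:
        rows = src.cursor().execute(AGGREGATE_SQL).fetchall()
        cur = dst.cursor()
        cur.executemany(
            "INSERT INTO mart.DailySales (SalesDate, RegionKey, TotalSales) VALUES (?, ?, ?)",
            [tuple(r) for r in rows],
        )
        dst.commit()
    finally:
        src.close()
        dst.close()

The point of the shape is the asymmetry: the heavy atomic data never leaves the on-site warehouse; only the small daily aggregate lands in the Azure data mart for the analytics and reporting layer.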

Thursday, January 16, 2014

Traditional Data Warehouse Design

Traditional Data Warehouse Design:  Before Massive Parallel Processing (MPP) was available to the masses, a data warehouse was typically housed in a single server.  It may be clustered with another server for failover purposes.  This design is good for your first steps in data warehousing, but it is limited in the volume of data it can service.   SQL Server 2012 has a database size limit of about 500 TB, but even so you are still limited by the number of CPUs and the amount of memory your server has, which limits your performance.    It is not a scalable solution in a world in which the volume of data is accelerating exponentially and the demand for low-level ad hoc discovery analysis is on the rise.

Features It Should Have:
  • Conformed Dimension Definitions
  • Conformed Measure Definitions
  • Conformed Atomic Level Fact Definitions
  • Managed Dimensional Data
  • Documented Data SLAs going In and Out of the Warehouse
  • Documented Data Lineage
  • Uniform ETL Pipeline Process Management
  • Uniform handling of late-arriving data (see the sketch below)
  • Uniform Error Handling
  • Uniform Variance Checks
  • Uniform Security Policies and Management
  • Shared Codebase and Framework
  • Shared Environment
  • Uniform backup and replication of data
  • Uniform Data Life Cycle Management:  Data Archival and Deletion as it ages 
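As one illustration of what uniform handling of late-arriving data can look like, here is a small, self-contained Python sketch (the record layout is hypothetical) that files each incoming record under its event date rather than its arrival date, and marks earlier partitions for reprocessing when late rows show up:

from collections import defaultdict
from datetime import date

# Hypothetical incoming records: (event_date, business_key, amount).
incoming = [
    (date(2014, 1, 15), "ORD-1001", 250.0),
    (date(2014, 1, 16), "ORD-1002", 125.0),
    (date(2014, 1, 14), "ORD-0999", 75.0),   # arrives late, belongs to an earlier partition
]

load_date = date(2014, 1, 16)
partitions = defaultdict(list)        # partition date -> rows landed in that partition
partitions_to_reprocess = set()       # partitions whose downstream facts must be rebuilt

for event_date, key, amount in incoming:
    partitions[event_date].append((key, amount))
    if event_date < load_date:
        partitions_to_reprocess.add(event_date)   # late-arriving data detected

print("partitions needing reprocess:", sorted(partitions_to_reprocess))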

  • Keep in mind that most businesses are engaged in business information warfare.  More than likely you are competing in a shrinking market as more and more businesses compete for the same customers, so any leverage you get is critical to your success.  If you don't leverage big data, your competition will.

    Commentary: The Case of the Missing Enterprise Data Warehouse

    Symptom: You have 100 reports that are unreliably generated for you and the reported numbers conflict from report to report.

    Root Cause:  Your company has no Enterprise Wide Data Warehouse.

    How Did that Happen?   Here is the Typical Evolving Scenario: 
    You or your management commissions someone to provide a specific report.  That report is quickly put together and placed on a server under someone's office desk.  Rinse and repeat...   Eventually you realize you have a series of reports that give conflicting numbers.  So you have someone investigate why, and you get the following answer:  you have 100 reports coming from 10 reporting servers under 10 different office desks across 3 office sites.   There are 20 different Extraction, Transformation, and Loading pipelines, each with a different codebase and different logic.  Looking even deeper, you may find that the ETL pipelines' error handling, late-arriving data processes, cleansing processes, data enrichment processes, data conformation processes, and data variance checks all vary in design or are just plain absent.   Then the real bad news hits you:  what you thought were the same measures in different reports are not the same.  They technically have different meanings even though they have the same name.

    No Data Warehouse Diagram:

    Hidden Costs of Not Having an Enterprise-Wide Data Warehouse:
    • Increased development costs for maintaining multiple, fragmented codebases doing the same thing.
    • Increased operations and development costs for troubleshooting late or missing reports and erroneous numbers.
    • Increased business risk of data landing in the hands of the wrong people because the report servers are not properly secured, leaving them open to being hacked or just plain walking out the door.
    • Decreased business opportunities and optimization caused by relying on inaccurate and inconsistent reports.
    Solution:  Build an Enterprise-Wide Data Warehouse by providing the following.
    • Conformed Dimension Definitions 
    • Conformed Measure Definitions
    • Conformed Atomic Level Fact Definitions
    • Managed Dimensional Data
    • Documented Data SLAs going In and Out of the Warehouse
    • Documented Data Lineage
    • Uniform ETL Pipeline Process Management
    • Uniform handling of late-arriving data
    • Uniform Error Handling
    • Uniform Variance Checks
    • Uniform Security Policies and Management
    • Shared Codebase and Framework
    • Shared Environment
    • Uniform backup and replication of data
    • Uniform Data Life Cycle Management:  Data Archival and Deletion as it ages
    Traditional Data Warehouse Diagram:

    Thursday, January 09, 2014

    Pipeline Framework Fundamentals

    An ETL framework provides and enforces pipeline standards across the data warehouse for the purposes of increasing stability and reliability.   The following features should be part of the framework.
    • Pipeline Coordination and Scheduling: Automatically manage parallel processes against data located at specific compute nodes.   Also deal with late-arriving data.
    • Reentry: Ability to restart a process at the point of failure (a minimal sketch follows this list).
    • Data Life Cycle Management: Automatically archive aging data and remove expired data from the warehouse.
    • Standard Import and Export: Standard handling of exporting and importing data from and to external landing locations and related error handling and protocols. 
    • Data Lineage and Traceability: Tracking the flow of data from its initial entry into the warehouse through the pipelines to its final destinations.  This also needs to enable the ability to determine downstream impacts when data is missing or processes fail. 
    • Data Service Level Agreement Management: Track the scheduled window of time a dataset should be available for downstream usage and what internal and external parties depend on the data.
    • Data Replication Management: Automatic replication of specific data for backup purposes. 
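As a minimal sketch of the reentry feature (no particular scheduler or framework assumed), each step can record a checkpoint when it finishes, so a restarted run resumes at the point of failure instead of starting over:

import json
import os

CHECKPOINT_FILE = "pipeline_checkpoint.json"   # hypothetical checkpoint location

def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return set(json.load(f))
    return set()

def save_checkpoint(done):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump(sorted(done), f)

def run_pipeline(steps):
    """steps is an ordered list of (name, callable); a rerun skips completed steps."""
    done = load_checkpoint()
    for name, step in steps:
        if name in done:
            continue                 # reentry: this step finished in a previous run
        step()
        done.add(name)
        save_checkpoint(done)        # persist progress after every successful step

# Example usage with placeholder step functions:
# run_pipeline([("extract", extract), ("transform", transform), ("load", load)])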
    Check out http://hortonworks.com/hadoop/falcon/ for Hadoop-based data warehouse pipelines.  It sounds awfully promising and looks like it covers most everything.  Due out Q1 2014; a preview is available.
    Check out http://pragmaticworks.com/Products/BI-xPress.aspx for SSIS-based data warehouse pipelines.
    Check out http://ssisetlframework.codeplex.com/ for SSIS-based data warehouse pipelines.  It doesn't cover everything, but it's a great place to start and it's free.

    Baseline Conceptual Models: Warehouse Shard Configuration Model

    Warehouse Shard Configuration Model:  Data is partitioned or distributed across compute nodes.  This information is critical for bringing processes to the data located on each compute node and executing them there.  This model is dependent on the Pipeline Configuration Model.
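As a minimal illustration in Python (the node names and shard count are made up), the shard configuration can be a mapping from shard number to compute node, with a stable hash deciding which shard, and therefore which node, a given key's processing is sent to:

import hashlib

# Hypothetical shard configuration: shard id -> compute node holding that slice of data.
SHARD_CONFIG = {0: "node-a", 1: "node-b", 2: "node-c", 3: "node-d"}

def shard_for_key(key: str, shard_count: int = len(SHARD_CONFIG)) -> int:
    # Stable hash so the same key always routes to the same shard.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count

def node_for_key(key: str) -> str:
    return SHARD_CONFIG[shard_for_key(key)]

print(node_for_key("customer-42"))   # the process for this key runs on the node that holds its data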

    Please see (Baseline Conceptual Models Commentary) for further details on what conceptual models are to be used for.

    Wednesday, January 08, 2014

    Baseline Conceptual Models: Pipeline Configuration Model

     
    Pipeline Configuration Model:  Pipelines in a data warehouse will have many scheduled jobs that process data from raw logs to atomic facts to final reports.  Some processes are dependent on one or more datasets and can output one or more datasets.  Where these processes are executed depends on where the data is located.   This model is dependent on the Data Catalog & Schema Model and the Warehouse Shard Configuration Model.

    Types of Pipelines:  There are different pipelines for different stages of the data.
    • 3rd Party Dataset Pipelines are extraction, minor transformation, and loading processes that pull data from external companies.
    • Raw Log Pipelines are extraction, minor transformation, and loading processes that pull data provided by other systems within the company.  Examples are website logs, sales or purchase orders, and inventory movement.
    • Master Data Domain Pipelines are extraction, minor transformation, and loading processes that pull from a Master Data Service or collect data from across multiple company systems.
    • Dimension Pipelines are transformation processes that transform Master Data Domains into dimensions.
    • Staging Pipelines are transformation processes that cleanse, enhance, or conform data.
    • Atomic Fact Pipelines are transformation processes that transform logs into atomic facts.
    • Aggregated Fact Pipelines are transformation processes that transform atomic facts into aggregated facts.
    • Report Pipelines are transformation processes that transform atomic or aggregated facts into reports.
    Note:  The Pipeline Data entity provides information about the input and output data going into or out of a process and is important for tracking coarse-grained data lineage.
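To make the dependency idea concrete, here is a small Python sketch (the dataset and process names are hypothetical) that derives an execution order from each process's declared input and output datasets, which is exactly the kind of information the Pipeline Data entity would capture:

from graphlib import TopologicalSorter   # Python 3.9+

# Hypothetical processes with the datasets they read and write.
processes = {
    "load_raw_sales":    {"inputs": [],                    "outputs": ["raw_sales"]},
    "build_atomic_fact": {"inputs": ["raw_sales"],         "outputs": ["fact_sales_atomic"]},
    "build_daily_agg":   {"inputs": ["fact_sales_atomic"], "outputs": ["fact_sales_daily"]},
    "build_report":      {"inputs": ["fact_sales_daily"],  "outputs": ["sales_report"]},
}

# A process depends on whichever processes produce its input datasets.
producers = {ds: name for name, p in processes.items() for ds in p["outputs"]}
graph = {name: {producers[ds] for ds in p["inputs"] if ds in producers}
         for name, p in processes.items()}

print(list(TopologicalSorter(graph).static_order()))
# ['load_raw_sales', 'build_atomic_fact', 'build_daily_agg', 'build_report']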

    Please see (Baseline Conceptual Models Commentary) for further details on what conceptual models are to be used for.

    Baseline Conceptual Models: Data Service Level Agreement (SLA) Model

    Data Service Level Agreement (SLA) Model:  When sourcing data, it is a good idea for the data provider and the downstream customer to document and agree upon the specifics of the data to be provided and the scheduled window of time in which the data is expected to be available for download.  This provides two things: first, it sets expectations properly between the data provider and the downstream customer; second, it enables the data provider to see who is impacted by delays in the pipeline and communicate to the downstream customers accordingly.  This model is dependent on the Data Catalog & Schema Model and the Party Model.

    Data SLA:  SLAs should be predetermined by the data provider based on the average time it takes for the pipeline to process the necessary data and make it available at a drop point, plus some buffer time for the outliers.  A customer then agrees to the SLA or makes a request for a modified SLA.  Modifying the SLA may require new development or tuning of the pipeline.

    Data SLA Landing Locations:  An SLA also includes one or more drop locations on a specific recurring schedule.
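A minimal sketch of how such an SLA record might be checked, with made-up dataset names, delivery windows, and downstream consumers:

from datetime import datetime, time

# Hypothetical SLA register: dataset -> daily deadline plus the parties to notify on a miss.
SLAS = {
    "fact_sales_daily": {"available_by": time(6, 0), "consumers": ["Finance BI", "Ops Dashboard"]},
    "dim_customer":     {"available_by": time(5, 0), "consumers": ["CRM Reporting"]},
}

def check_sla(dataset: str, landed_at: datetime):
    sla = SLAS[dataset]
    if landed_at.time() > sla["available_by"]:
        # Breach: the provider knows exactly which downstream customers are impacted.
        return {"breached": True, "notify": sla["consumers"]}
    return {"breached": False, "notify": []}

print(check_sla("fact_sales_daily", datetime(2014, 1, 20, 6, 45)))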

    Please see (Baseline Conceptual Models Commentary) for further details on what conceptual models are to be used for.

    Baseline Conceptual Models: Data Catalog and Schema Model

    Data Catalog & Schema Model:   In a data-driven pipeline for a data warehouse, the data definitions are stored in a metadata repository and are used to speed up pipeline development as well as enforce design conformity.  Entities are organized into a category hierarchy for simple exploratory navigation.   Throughout its life cycle, an entity can be revised several times.  Each revision of the entity will have one or more attributes.   Attributes are first predefined independently from any entity, and each entity then binds to the appropriate predefined attributes.  This enforces solid naming conventions and definitions throughout the pipelines in the data warehouse.
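A minimal Python sketch of that structure, with hypothetical names: attributes are predefined once, and each entity revision binds to the predefined attributes rather than redefining them:

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Attribute:
    name: str
    data_type: str
    description: str = ""

@dataclass
class EntityRevision:
    revision: int
    attributes: list              # bindings to predefined Attribute objects

@dataclass
class Entity:
    name: str
    category: str                 # node in the category hierarchy
    revisions: list = field(default_factory=list)

# Attributes are defined once in the catalog and reused by every entity that needs them.
CATALOG_ATTRIBUTES = {
    "customer_key": Attribute("customer_key", "bigint", "surrogate key for the customer dimension"),
    "sales_amount": Attribute("sales_amount", "decimal(18,2)", "net sales in local currency"),
}

fact_sales = Entity(name="fact_sales_atomic", category="Facts/Sales")
fact_sales.revisions.append(
    EntityRevision(revision=1,
                   attributes=[CATALOG_ATTRIBUTES["customer_key"],
                               CATALOG_ATTRIBUTES["sales_amount"]])
)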

    Entity Types:  In a data warehouse, data can be classified into different types to signify the value it brings to the warehouse.
    • 3rd Party Datasets are upstream data provided by external companies.
    • Raw Logs are upstream data provided by other systems within the company.  Examples are website logs, sales or purchase orders, and inventory movement.
    • Master Data Domains are data that come from a Master Data Service or are collected across multiple company systems.  This kind of data is used to build dimensions.
    • Dimensions are data elements that enable filtering, grouping, and labeling of facts (measures).
    • Staging Datasets are data that have been cleansed, enhanced, or conformed and are ready for further processing.
    • Atomic Facts are the lowest level of grain for a set of measures.   Measures can be filtered, grouped, and labeled by a predefined set of dimensions.
    • Aggregated Facts are a summarized set of measures that have been summed up from an atomic fact.  They are used for business snapshot purposes, like end-of-month or end-of-year reporting, or for performance tuning.
    • Reports are usually simple facts (sometimes non-pivotable) that the end user uses as a statement of key performance indicators (KPIs) for the business.
    Risk Assessment:  In an enterprise dataset there can be information that could cause damage to the business or customer if released to the outside world.   Each attribute should be assessed in order to evaluate the risks for the company.

    Data Variance Check Definitions:  An attribute of an entity may have a variance check definition describing the expected range or tolerance for its values, so that out-of-range data can be flagged as it moves through the pipelines.
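A minimal sketch of one form such a check could take, assuming a simple allowed-range definition per attribute (the attribute names and thresholds are made up):

# Hypothetical variance check definitions: attribute -> allowed value range.
VARIANCE_CHECKS = {
    "sales_amount": {"min": 0.0, "max": 1_000_000.0},
    "quantity":     {"min": 0,   "max": 10_000},
}

def check_variance(attribute: str, value) -> bool:
    rule = VARIANCE_CHECKS.get(attribute)
    if rule is None:
        return True                        # no variance check defined for this attribute
    return rule["min"] <= value <= rule["max"]

assert check_variance("sales_amount", 125.0)
assert not check_variance("quantity", 50_000)   # out of range: flag for investigation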

    Attribute Types:  Each attribute in an entity has a specific meaning in a data warehouse.
    • Dimension Keys are surrogate keys used to map to a dimension.
    • Business Keys are values used to look up the dimension key during the key-mapping phase of an ETL (see the sketch after this list).  These are found in the upstream source streams.
    • Measure Attributes are numeric values representing a count or the output of an equation.
    • Regular Attributes are general values used for pass-through processes or user consumption.
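To illustrate the key-mapping step the bullets above describe (the lookup table and keys are hypothetical), an ETL swaps the business key found in the source stream for the dimension's surrogate key, routing unmatched rows to an unknown-member bucket:

# Hypothetical key map built from the customer dimension: business key -> surrogate key.
CUSTOMER_KEY_MAP = {"CUST-001": 1, "CUST-002": 2, "CUST-003": 3}
UNKNOWN_MEMBER_KEY = -1   # a common convention for business keys that fail the lookup

def map_customer_key(business_key: str) -> int:
    return CUSTOMER_KEY_MAP.get(business_key, UNKNOWN_MEMBER_KEY)

source_rows = [("CUST-002", 99.0), ("CUST-999", 10.0)]       # (business key, measure)
fact_rows = [(map_customer_key(bk), amount) for bk, amount in source_rows]
print(fact_rows)   # [(2, 99.0), (-1, 10.0)] -> the second row needs attention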
    Data Lineage:  Tracking data lineage from one data source to another within the warehouse is critical for tracking the provenance of the data and enables faster diagnostics when data goes bad.  This is a fine-grained data lineage feature.
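One way to picture fine-grained lineage is as a set of column-level edges; walking the edges downstream answers the question of what is impacted when a source column goes bad. The dataset and column names in this Python sketch are hypothetical:

from collections import defaultdict

# Column-level lineage edges: source dataset.column -> target dataset.column.
LINEAGE_EDGES = [
    ("raw_sales.amount",               "fact_sales_atomic.sales_amount"),
    ("fact_sales_atomic.sales_amount", "fact_sales_daily.total_sales"),
    ("fact_sales_daily.total_sales",   "sales_report.revenue"),
]

downstream = defaultdict(set)
for src, dst in LINEAGE_EDGES:
    downstream[src].add(dst)

def impacted_by(column, seen=None):
    """Return every downstream column ultimately derived from the given column."""
    seen = set() if seen is None else seen
    for dst in downstream.get(column, ()):
        if dst not in seen:
            seen.add(dst)
            impacted_by(dst, seen)
    return seen

print(sorted(impacted_by("raw_sales.amount")))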

    Please see (Baseline Conceptual Models Commentary) for further details on what conceptual models are to be used for.