Thursday, January 09, 2014

Pipeline Framework Fundamentals

An ETL framework provides and enforces pipeline standards across the data warehouse in order to increase stability and reliability. The following features should be part of the framework.
  • Pipeline Coordination and Scheduling: Automatically manage parallel processes against data located on specific compute nodes, and handle late-arriving data.
  • Reentry: Ability to restart a process at the point of failure (see the checkpoint sketch after this list).
  • Data Life Cycle Management: Automatically archive aging data and remove expired data from the warehouse.
  • Standard Import and Export: Uniform handling of data exported to and imported from external landing locations, along with the related error handling and protocols.
  • Data Lineage and Traceability: Tracking the flow of data from its initial entry into the warehouse through the pipelines to its final destinations. This also needs to make it possible to determine downstream impacts when data is missing or a process fails.
  • Data Service Level Agreement Management: Track the scheduled window of time in which a dataset should be available for downstream usage, and which internal and external parties depend on the data (a small SLA check is sketched after this list as well).
  • Data Replication Management: Automatic replication of specific data for backup purposes. 
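
Here's a minimal sketch of what reentry could look like. Each step records its completion in a checkpoint file, so a rerun skips finished work and resumes at the step that failed. The step names and checkpoint file name are just illustrative, not from any particular framework.

    import json, os

    CHECKPOINT = "pipeline_checkpoint.json"   # hypothetical checkpoint location

    def load_done():
        # Read the set of steps that finished on a previous run, if any.
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return set(json.load(f))
        return set()

    def mark_done(done, step):
        done.add(step)
        with open(CHECKPOINT, "w") as f:
            json.dump(sorted(done), f)

    def run_pipeline(steps):
        done = load_done()
        for name, func in steps:
            if name in done:
                continue              # already completed on an earlier run
            func()                    # may raise; the next run resumes here
            mark_done(done, name)
        os.remove(CHECKPOINT)         # clean slate after a fully successful run

    run_pipeline([
        ("extract",   lambda: print("extracting")),
        ("transform", lambda: print("transforming")),
        ("load",      lambda: print("loading")),
    ])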
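And a rough sketch of the SLA side in the same vein: record when each dataset is due and who depends on it, then flag a miss so the dependents can be notified. The dataset name, deadline, and dependents below are made-up examples.

    from datetime import datetime, time

    SLAS = {
        # dataset: (deadline for availability, downstream parties to notify)
        "daily_sales": (time(6, 0), ["reporting", "finance_feed"]),
    }

    def check_sla(dataset, available_at):
        deadline, dependents = SLAS[dataset]
        if available_at.time() > deadline:
            print(f"SLA MISS: {dataset} landed {available_at:%H:%M}, "
                  f"due {deadline:%H:%M}; notify {', '.join(dependents)}")
        else:
            print(f"{dataset} met its {deadline:%H:%M} SLA")

    check_sla("daily_sales", datetime(2014, 1, 9, 7, 15))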
Check out http://hortonworks.com/hadoop/falcon/ for Hadoop-based data warehouse pipelines. It sounds awfully promising and looks like it covers almost everything. Due out Q1 2014; a preview is available.
Check out http://pragmaticworks.com/Products/BI-xPress.aspx for SSIS-based data warehouse pipelines.
Check out http://ssisetlframework.codeplex.com/ for SSIS-based data warehouse pipelines. It doesn't cover everything, but it's a great place to start and it's free.
