Friday, December 27, 2013

Azure HDInsight and Azure VM:SQL Server Analysis Services

This high level diagram lays out a data warehouse design using SQL Server MDS, HDInsight, Pig, Hive, HBase, SQOOP, FLUME, SQL Server Analysis Services (SSAS), SQL Server Reporting Services (SSRS), and Azure Virtual Machines.  Its theoretical at this time as I'm doing research on all the Azure possibilities.
  • Extract data from company sources to file and copy to HDInsight.  If SQL Server use BCP.
  • It is easiest to use the delimiter of ^A (0x01).  This is the default delimiter for Hive.
  • Process and enrich your data into atomic facts within HDInsight.  
  • Use Hive ODBC Driver or Simba Apache Hive ODBC Driver to enable SSAS to access HDInsight.
  • Use Standard fact and dimension design methods.
  • Use outrigger tables when using extremely large dimensions to cut down on size.
  • Having Operational Data Stores (ODS) within HDInsight enables business to do deep adhoc analysis without disturbing OTLP systems.  I picked HBase as the "NoSQL" database-engine just because its all under the same environment, it is a main project for HortonWorks, its well integrated with Pig and Hive, its well known and supported, its growing in popularity, is favored by Microsoft, and will be in HDInsight sometime in the future.  But this might not be the best fit.  Other Choices Non-HDInsight Environment Choices:
  • Use FLUME to load logs files into Hadoop.
  • Use SQOOP to extract and load data from Relational Data Management Systems (RDMS) into Hadoop.
  • User Defined Functions (UDF) are where the action is at.   Sessionization, Stats, Sampling, and PageRank are a basic needs and can be found in the DataFu set of UDFs provided by LinkedIn
Have fun.

Wednesday, December 25, 2013

SQL Server PDW - Big Data Warehouse

This diagram is a breakdown of how to logically setup a Big Data Warehouse using SQL Server MDS, HDInsight, and SQL Server PDW Appliance.  Its takes advantage of the PDW power and flips the traditional ETL to a ELT process.   This provides less breakpoints and makes the Operational Data Store (ODS) available within the PDW for business to query against when the Atomic Facts are not sufficient.  Business can even query raw logs stored in HDInsight through PolyBase for adhoc analysis.  Cool stuff.  I wish I could have my own personal SQL Server PDW Appliance.  Got a $1,500,000 to donate, anyone?

SQL Server PDW Appliance is apparently the cheapest per TB in the industry. See Comparitive PDW Pricing with Competition.

Here is a great link for understanding Implementing SQL PDW using Kimball Approach .
Here is a great link for how to migrate SQL Server to SQL Server PDW.
Here is a interesting link for a guestimate technical breakdown on hardware and estimated value.
Here is a SSIS vs T-SQL in Proof of Concept PDW Experiences