Friday, December 27, 2013

Azure HDInsight and Azure VM:SQL Server Analysis Services

This high level diagram lays out a data warehouse design using SQL Server MDS, HDInsight, Pig, Hive, HBase, SQOOP, FLUME, SQL Server Analysis Services (SSAS), SQL Server Reporting Services (SSRS), and Azure Virtual Machines.  Its theoretical at this time as I'm doing research on all the Azure possibilities.
  • Extract data from company sources to file and copy to HDInsight.  If SQL Server use BCP.
  • It is easiest to use the delimiter of ^A (0x01).  This is the default delimiter for Hive.
  • Process and enrich your data into atomic facts within HDInsight.  
  • Use Hive ODBC Driver or Simba Apache Hive ODBC Driver to enable SSAS to access HDInsight.
  • Use Standard fact and dimension design methods.
  • Use outrigger tables when using extremely large dimensions to cut down on size.
  • Having Operational Data Stores (ODS) within HDInsight enables business to do deep adhoc analysis without disturbing OTLP systems.  I picked HBase as the "NoSQL" database-engine just because its all under the same environment, it is a main project for HortonWorks, its well integrated with Pig and Hive, its well known and supported, its growing in popularity, is favored by Microsoft, and will be in HDInsight sometime in the future.  But this might not be the best fit.  Other Choices Non-HDInsight Environment Choices:
  • Use FLUME to load logs files into Hadoop.
  • Use SQOOP to extract and load data from Relational Data Management Systems (RDMS) into Hadoop.
  • User Defined Functions (UDF) are where the action is at.   Sessionization, Stats, Sampling, and PageRank are a basic needs and can be found in the DataFu set of UDFs provided by LinkedIn
Have fun.

No comments: