- Extract data from company sources to flat files and copy them to HDInsight. If the source is SQL Server, use BCP.
- It is easiest to use ^A (0x01) as the field delimiter, since it is Hive's default.
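A minimal sketch of the extract step above, with hypothetical server, database, and table names. Typing a raw 0x01 byte into bcp's -t switch is awkward, so this sketch exports with a printable pipe and translates it to Ctrl-A with tr afterwards (assuming no field contains a literal pipe). The command is only written to a script here, since bcp runs where the SQL Server tools are installed:

```shell
# Hypothetical names throughout (DBSERVER, Sales.dbo.Orders).
cat > export_orders.sh <<'EOF'
#!/bin/sh
# -c = character mode, -t = field terminator, -T = trusted connection
bcp Sales.dbo.Orders out Orders.raw -c -t "|" -S DBSERVER -T
# Translate the pipe delimiter to Ctrl-A (0x01), Hive's default,
# assuming no field contains a literal "|".
tr '|' '\001' < Orders.raw > Orders.dat
EOF
chmod +x export_orders.sh
```

Once Orders.dat is copied into HDInsight storage, the matching Hive staging table needs no ROW FORMAT clause at all, because \001 is already Hive's default field terminator.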
- Process and enrich your data into atomic facts within HDInsight.
- Use the Hive ODBC Driver or the Simba Apache Hive ODBC Driver to let SSAS access HDInsight.
- Use standard fact and dimension design methods.
- Use outrigger tables with extremely large dimensions to cut down on their size.
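As a sketch of the outrigger idea, assuming a hypothetical customer dimension: the low-cardinality demographic attributes move into a small outrigger table, so the multi-billion-row dimension stays narrow. The Hive DDL is only written to a file here, not executed:

```shell
# Hypothetical schema: dim_customer keeps one row per customer plus a
# small foreign key; the repeated demographic combinations live once
# in the outrigger table.
cat > customer_dims.hql <<'EOF'
CREATE TABLE dim_demographics (
  demographics_key INT,
  age_band         STRING,
  income_band      STRING,
  region           STRING
);

CREATE TABLE dim_customer (
  customer_key     BIGINT,
  customer_name    STRING,
  demographics_key INT   -- outrigger reference replaces the wide columns
);
EOF
```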
- Having Operational Data Stores (ODS) within HDInsight lets the business do deep ad hoc analysis without disturbing OLTP systems. I picked HBase as the "NoSQL" database engine because it all lives in the same environment, it is a main project for Hortonworks, it is well integrated with Pig and Hive, it is well known and supported, it is growing in popularity, it is favored by Microsoft, and it will be in HDInsight sometime in the future. But it might not be the best fit. Other, non-HDInsight environment choices:
- Azure Table Storage (Connecting Hive to Azure Table Storage)
- Cassandra (Connecting Hive to Cassandra)
- MongoDB (Connecting Hive to MongoDB)
- Use Flume to load log files into Hadoop.
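A minimal Flume agent sketch for the bullet above, with hypothetical names and paths (agent1, a tailed Apache access log, an HDFS sink); the configuration is only written to a file here:

```shell
cat > weblog-agent.conf <<'EOF'
# Tail a web server log and land events in date-partitioned HDFS dirs.
agent1.sources  = weblog
agent1.channels = mem
agent1.sinks    = tohdfs

agent1.sources.weblog.type = exec
agent1.sources.weblog.command = tail -F /var/log/httpd/access_log
agent1.sources.weblog.channels = mem

agent1.channels.mem.type = memory

agent1.sinks.tohdfs.type = hdfs
agent1.sinks.tohdfs.hdfs.path = /staging/weblogs/%Y-%m-%d
agent1.sinks.tohdfs.hdfs.useLocalTimeStamp = true
agent1.sinks.tohdfs.channel = mem
EOF
```

The agent would then be started with something like `flume-ng agent -n agent1 -f weblog-agent.conf`.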
- Use Sqoop to extract and load data from Relational Database Management Systems (RDBMS) into Hadoop.
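A sketch of a Sqoop pull from SQL Server, with hypothetical connection details, again using the Ctrl-A delimiter so the files drop straight into Hive. The command is only written to a script, since it needs a live cluster to run:

```shell
cat > import_orders.sh <<'EOF'
#!/bin/sh
# Hypothetical JDBC URL, credentials file, and table name.
sqoop import \
  --connect "jdbc:sqlserver://dbserver:1433;databaseName=Sales" \
  --username etl_user \
  --password-file /user/etl/.sqoop.pw \
  --table Orders \
  --fields-terminated-by '\001' \
  --target-dir /staging/orders \
  --num-mappers 4
EOF
chmod +x import_orders.sh
```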
- User-Defined Functions (UDFs) are where the action is. Sessionization, statistics, sampling, and PageRank are basic needs and can be found in DataFu, the set of Pig UDFs open-sourced by LinkedIn.
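As a sketch of how a DataFu UDF plugs in, here is sessionization in Pig, with hypothetical field names and paths; the script is only written to a file. Sessionize tags each page view with a session id, closing a session after a 10-minute inactivity gap:

```shell
cat > sessionize.pig <<'EOF'
REGISTER datafu.jar;
-- A 10-minute inactivity gap closes a session.
DEFINE Sessionize datafu.pig.sessions.Sessionize('10m');

views = LOAD '/staging/pageviews'
        AS (visit_time:chararray, user_id:chararray, url:chararray);
by_user = GROUP views BY user_id;
sessions = FOREACH by_user {
  ordered = ORDER views BY visit_time;
  GENERATE FLATTEN(Sessionize(ordered))
           AS (visit_time, user_id, url, session_id);
};
EOF
```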
Friday, December 27, 2013
Wednesday, December 25, 2013
The SQL Server PDW appliance is apparently the cheapest per TB in the industry. See Comparative PDW Pricing with Competition.
Here is a great link for understanding Implementing SQL PDW using Kimball Approach.
Here is a great link on how to migrate SQL Server to SQL Server PDW.
Here is an interesting link with a guesstimated technical breakdown of the hardware and its estimated value.
Here is a link on SSIS vs. T-SQL in Proof of Concept PDW Experiences.