Friday, December 27, 2013

Azure HDInsight and Azure VM: SQL Server Analysis Services

This high-level diagram lays out a data warehouse design using SQL Server MDS, HDInsight, Pig, Hive, HBase, SQOOP, FLUME, SQL Server Analysis Services (SSAS), SQL Server Reporting Services (SSRS), and Azure Virtual Machines.  It's theoretical at this time, as I'm still researching all the Azure possibilities.
  • Extract data from company sources to files and copy them to HDInsight.  If the source is SQL Server, use BCP.
  • It is easiest to use the ^A (0x01) delimiter, since it is the default delimiter for Hive.
  • Process and enrich your data into atomic facts within HDInsight (see the Hive sketch after this list).
  • Use the Hive ODBC Driver or the Simba Apache Hive ODBC Driver to enable SSAS to access HDInsight (see the linked server sketch after this list).
  • Use standard fact and dimension design methods.
  • Use outrigger tables with extremely large dimensions to cut down on size (see the outrigger sketch after this list).
  • Having Operational Data Stores (ODS) within HDInsight enables the business to do deep ad hoc analysis without disturbing OLTP systems.  I picked HBase as the "NoSQL" database engine because it's all under the same environment: it is a main project for Hortonworks, it's well integrated with Pig and Hive, it's well known and supported, it's growing in popularity, it is favored by Microsoft, and it will be in HDInsight sometime in the future.  But it might not be the best fit; there are other, non-HDInsight choices as well.
  • Use FLUME to load log files into Hadoop.
  • Use SQOOP to extract and load data from Relational Database Management Systems (RDBMS) into Hadoop.
  • User Defined Functions (UDF) are where the action is at.  Sessionization, stats, sampling, and PageRank are basic needs and can be found in the DataFu set of UDFs provided by LinkedIn.
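
To make the first few bullets concrete, here is a minimal HiveQL sketch, assuming hypothetical table and column names: a raw external table over files copied into HDInsight using Hive's default ^A (0x01) field delimiter, followed by a simple enrichment step into an atomic fact table.

-- Raw staging table over files landed in HDInsight (hypothetical names)
CREATE EXTERNAL TABLE raw_sales (
   sale_id     STRING
 , store_id    STRING
 , sale_amount STRING
 , sale_date   STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'  -- ^A (0x01), Hive's default
STORED AS TEXTFILE
LOCATION '/data/raw/sales';

-- Enrich and cast the raw rows into an atomic fact table
CREATE TABLE fact_sales AS
SELECT sale_id
     , store_id
     , CAST(sale_amount AS DOUBLE) AS sale_amount
     , TO_DATE(sale_date)          AS sale_date
  FROM raw_sales
 WHERE sale_id IS NOT NULL;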
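
Once the Hive ODBC driver is installed, one common way to surface Hive data on the SQL Server side is a linked server over a system DSN.  The DSN and object names below are hypothetical, and SSAS itself would connect through its own data source definition; this just shows the plumbing:

-- Linked server over a Hive ODBC system DSN (hypothetical DSN name 'HiveDSN')
EXEC master.dbo.sp_addlinkedserver
   @server     = N'HIVE'
 , @srvproduct = N'Hive'
 , @provider   = N'MSDASQL'
 , @datasrc    = N'HiveDSN';

-- Pass-through query against Hive
SELECT *
FROM OPENQUERY(HIVE, 'SELECT store_id, sale_amount FROM fact_sales LIMIT 10');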
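
And for the outrigger bullet, the idea is simply to break repetitive attribute combinations out of a huge dimension into their own keyed table.  A minimal T-SQL sketch, assuming hypothetical names:

-- Large customer dimension keeps only a key to the demographics outrigger
CREATE TABLE DimCustomer (
   CustomerKey     INT PRIMARY KEY
 , CustomerName    NVARCHAR(100)
 , DemographicsKey INT  -- points at the outrigger below
);

-- Outrigger: one row per distinct demographics combination
CREATE TABLE DimDemographics (
   DemographicsKey INT PRIMARY KEY
 , AgeBand         NVARCHAR(20)
 , IncomeBand      NVARCHAR(20)
);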
Have fun.

Wednesday, December 25, 2013

SQL Server PDW - Big Data Warehouse

This diagram is a breakdown of how to logically set up a Big Data Warehouse using SQL Server MDS, HDInsight, and the SQL Server PDW Appliance.  It takes advantage of PDW's power and flips the traditional ETL into an ELT process.  This leaves fewer breakpoints and makes the Operational Data Store (ODS) available within the PDW for the business to query against when the Atomic Facts are not sufficient.  The business can even query raw logs stored in HDInsight through PolyBase for ad hoc analysis (see the sketch below).  Cool stuff.  I wish I could have my own personal SQL Server PDW Appliance.  Got $1,500,000 to donate, anyone?
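A rough sketch of what querying raw HDInsight logs through PolyBase might look like.  All names here are hypothetical, and the exact external table DDL varies by PDW/PolyBase version, so treat this as the general shape rather than exact syntax:

-- External table over raw log files sitting in HDInsight
-- (assumes an external data source and file format have been defined)
CREATE EXTERNAL TABLE ext_RawLogs (
   LogDate VARCHAR(30)
 , PageUrl VARCHAR(500)
 , UserId  VARCHAR(50)
)
WITH (
   LOCATION    = '/data/logs/'
 , DATA_SOURCE = HadoopStorage
 , FILE_FORMAT = TextDelimited
);

-- Ad hoc analysis straight against the raw logs
SELECT TOP 100 *
FROM ext_RawLogs
WHERE PageUrl LIKE '%checkout%';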

The SQL Server PDW Appliance is apparently the cheapest per TB in the industry.  See Comparative PDW Pricing with Competition.

Here is a great link for understanding Implementing SQL PDW using Kimball Approach.
Here is a great link on how to migrate SQL Server to SQL Server PDW.
Here is an interesting link with a guesstimated technical breakdown of hardware and estimated value.
Here is a comparison of SSIS vs. T-SQL in Proof of Concept PDW Experiences.

Thursday, April 04, 2013

How To: Sequence SQL Tables based on Dependency

There are many times when you may need to know the sequence of tables, based on their dependencies, in order to process data without foreign key constraint issues and without turning those constraints off.  You'll need the tables listed from the most dependent to the least dependent.  The function below provides one way to produce such a list.


CREATE FUNCTION dbo.GetTableSequence ()
RETURNS @Temp2 Table
(  Sequence int Identity Primary Key
 , ObjectID int
 , SchemaName sysname
 , TableName sysname)
AS
BEGIN
DECLARE @Temp1 Table
(  ForeignObjectID int
 , ForeignSchemaName sysname
 , ForeignTableName sysname
 , ObjectID int
 , SchemaName sysname
 , TableName sysname)

-----------------------------------------------------
-- Identify tables with a dependency on another table
-----------------------------------------------------
Insert into @Temp1 (ForeignObjectID, ForeignSchemaName, ForeignTableName, ObjectID, SchemaName, TableName)
Select Distinct O.Object_ID as ForeignObjectID, Cast(schema_Name (o.Schema_ID) as sysname) as ForeignSchemaName, cast(o.name as sysname) as ForeignTableName,
                o2.Object_ID as ObjectID, Cast(schema_Name (o2.Schema_ID) as sysname) as SchemaName, cast(o2.Name as sysname) as TableName
From sys.foreign_keys f
Join sys.objects o
 on f.referenced_object_id = o.object_id
Join sys.objects o2
    on f.parent_object_id = o2.object_id
Where O.Name <> O2.Name -- Exclude self-referencing relationships (these cause infinite loops)
---------------------------------------------------------------------------------------
-- Recursively Identify Tables from Most Dependent to least Dependent
---------------------------------------------------------------------------------------
;
with RecursionCTE (ObjectId, SchemaName, TableName)
as
(-- Anchor: all tables that no other table depends on (they may reference other tables, but nothing references them)
SELECT T.ObjectID, T.SchemaName, T.TableName
FROM (Select Distinct o.Object_ID as ObjectID
    , Cast(schema_Name (o.Schema_ID) as sysname) as SchemaName
    , Cast(o.name as sysname) as TableName
   From sys.objects o
  Where o.type = 'u') as T
LEFT OUTER JOIN sys.foreign_key_columns sfk
   ON sfk.referenced_object_id = T.ObjectID
Where sfk.referenced_object_id is null
union all
--  Now recursively work through all dependencies
select R1.ForeignObjectID, R1.ForeignSchemaName, R1.ForeignTableName
FROM @Temp1 as R1
join RecursionCTE as R2 on R1.ObjectID = R2.ObjectId -- join on object id so same-named tables in different schemas do not collide
)
Insert Into @Temp2(ObjectId, SchemaName, TableName)
 select ObjectID, SchemaName, TableName
  from RecursionCTE
;
----------------------------------------------------
-- Delete Duplicates but keep the very last entry
----------------------------------------------------
WITH Dubs(ROWID, RecordID, SchemaName, TableName) AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY SchemaName, TableName ORDER BY Sequence DESC) as ROWID, Sequence, SchemaName, TableName
FROM @Temp2
)
DELETE FROM Dubs
WHERE ROWID > 1;
RETURN
END
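
A quick usage sketch: pull the ordered list, then process (or delete from) the tables in sequence so the foreign key constraints are never violated.

-- Tables listed from most dependent to least dependent
SELECT Sequence, SchemaName, TableName
FROM dbo.GetTableSequence()
ORDER BY Sequence;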

Monday, March 18, 2013

Commentary: Creating a Data Mastering and Publishing Tool (Structured Data)

Here is another experiential post:
 
I've spent the last 2 months building a prototype for a Data Mastering and Publishing Tool.  I modeled the data based on the business needs, built the database, ported sample data from some data sources, and auto-generated the UI using http://CodeOnTime.com.  I then proceeded to create the stored procedures to support all the necessary processes.  The 2 months included data modeling, the prototype UI/database, a PowerPoint presentation, and a functional specification.  Only one resource was used to do all this work: me.
 
CodeOnTime (COT) really sped up my time to complete the prototype.  In fact, I would not have even attempted the prototype without the tool, as I'm not as strong in UI development as I once was.  I only rat-holed in the tool when trying to get the UI to understand some of the audit and publishing business rules.  I found that COT is very powerful, but lots of areas, including the event sequencing, are undocumented, and I had to rely on the comments in the COT support community, which was very useful and responsive.  Several times I had to use Visual Studio and step through the code to understand the UI event model.  I ended up going down dead-end roads that seemed promising at first but failed to work within the reasonable amount of time I was granted for the project.  But once I simplified the UI design, using the tool just for basic data entry and database sprocs to perform the business rules, it worked great.  I wouldn't say I would trust COT for a full professional UI release quite yet, as I believe it still needs to mature.  But I would say it is totally awesome for producing throwaway prototype UIs and really cheap internal IT UI solutions.  I would also say that it is more modular and code-efficient than IronSpeed, judging from the code generated by both.
 
The key features of my project:
Data Mastering and Publishing Website
  Data Management
      38 Entities to Manage
      All Entities are Highly Searchable
      All Entities have C.R.U.D.
      Import & Export
      Full Change History
  Release Management
      Release C.R.U.D.
      Release Workflow
      Release Status History
      Release Change Navigation
  Localization Management
      Export localization needs
      Import localized text
  Reports
      General Entity Reports
      Release Change Summary
      Specialized Reports
  Security Management
      User Access and Role assignment
 
This prototype helped define and prove out the functional specification.  The spec refers to the prototype in many sections in order to keep the spec at a reasonable and readable size.  The prototype was crucial in communicating a very complex concept to a group of PMs, Developers, and Testers.  And now I leave the project in their hands as I move on to other things.