Thursday, December 27, 2007

Free Code: Bulk Loading and Partitioning Tables



ZIP File: TablePartitioningTool.zip

This experimental toolset is the result of my desire to have a simple, repeatable solution for a complex problem: Table Loading and Partitioning Management for time-based event data.

1. Loading can be a sore spot in any ETL pipeline when it is too slow, resulting in an ever-increasing backlog of import data.
2. Partitioning within SQL Server 2005 is an extremely powerful and essential feature for dealing with large tables, but the partitioning logic can be very complex to write.
3. Each time you create a partitioning and loading process for a given table, you may implement it differently. This creates unpredictable behavior in your overall service, application, or warehouse, making it more costly to maintain and improve.

By using a code-generating approach for your loading and partitioning needs, you can quickly create solutions throughout your company. This solution offers the added bonus of creating a general standard to handle this complex problem, thus lowering your company’s overall cost of ownership.

FEATURES:

1. Creates complex stored procedures which handle all the loading, partitioning, switching, and merging logic to the specifications of a given table template.

2. Provides table loading management
  • Ensures minimum transaction logging
  • Manages micro-loading*
  • Keeps the table online while it is being loaded
3. Provides partitioning management
  • Manages explicit time-based partitioning at grains of Hour, Day, Week, Month, or Year for any given table
  • Manages rolling off partitions older than a configurable maximum
* NOTE: Micro-loading in this context means loading a smaller range of data than the partition grain of the target table. Example: every five minutes you load files containing five minutes of data into a table that is explicitly partitioned by day.
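
The generated procedures build on SQL Server 2005 partition switching. As a rough illustration only (the object names below are hypothetical and are not what the tool emits), a daily-grain setup and a load-by-switch look roughly like this:

-- Hypothetical daily-grain partition function and scheme
CREATE PARTITION FUNCTION pfEventDay (DATETIME)
AS RANGE RIGHT FOR VALUES ('20071225', '20071226', '20071227')

CREATE PARTITION SCHEME psEventDay
AS PARTITION pfEventDay ALL TO ([PRIMARY])

CREATE TABLE EventData
(
EventTime DATETIME NOT NULL,
Payload VARCHAR(255) NULL
) ON psEventDay (EventTime)

-- Stage table: identical schema, same filegroup, with a CHECK constraint
-- restricting EventTime to the range of the target partition
CREATE TABLE EventDataStage
(
EventTime DATETIME NOT NULL,
Payload VARCHAR(255) NULL,
CONSTRAINT CK_EventDataStage_Day CHECK (EventTime >= '20071227' AND EventTime < '20071228')
) ON [PRIMARY]

-- After bulk loading the stage table with minimal logging, switch it in
-- (with RANGE RIGHT and the boundaries above, partition 4 holds values >= '20071227')
ALTER TABLE EventDataStage SWITCH TO EventData PARTITION 4

-- Rolling off the oldest partition switches it out and merges the empty range:
-- ALTER TABLE EventData SWITCH PARTITION 1 TO EventDataArchive
-- ALTER PARTITION FUNCTION pfEventDay() MERGE RANGE ('20071225')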

Monday, December 10, 2007

How To: Delete Duplicate Data (DeDup)

ETL processes typically run across the need to delete duplicate records. The fastest and simplest way I know of is to create a primary key (or unique index) on the table with IGNORE_DUP_KEY = ON. Then when you bulk copy or insert data into the table, SQL quickly ignores the duplicates.

Example:

CREATE TABLE #Temp (ID INT NOT NULL)


ALTER TABLE #Temp
ADD PRIMARY KEY (ID ASC)
WITH (IGNORE_DUP_KEY = ON)
GO

INSERT INTO #Temp (ID)
SELECT 1
UNION ALL
SELECT 2
UNION ALL
SELECT 1

SELECT * FROM #Temp

DROP TABLE #Temp


But sometimes you need to delete duplicate data in a more controlled and auditable way. Here is an approach that does it in a single set-based statement.

Example:

CREATE TABLE #Temp(ID INT, Name VARCHAR(255))

INSERT INTO #Temp (ID, Name)
SELECT 1, 'Name 1'
UNION ALL
SELECT 2, 'Name 2'
UNION ALL
SELECT 3, 'Name 3'
UNION ALL
SELECT 4, 'Name 4'
UNION ALL
SELECT 5, 'Name 2'
UNION ALL
SELECT 6, 'Name 1'
UNION ALL
SELECT 7, 'Name 3'
UNION ALL
SELECT 8, 'Name 5'
UNION ALL
SELECT 9, 'Name 4'
UNION ALL
SELECT 10, 'Name 4'

SELECT 'Deleted Data Set'

DELETE T
OUTPUT DELETED.* -- Display the deleted rows
FROM #TEMP T
JOIN (SELECT ROW_NUMBER() OVER (PARTITION BY #TEMP.Name ORDER BY #TEMP.ID) as ROWID, #TEMP.*
FROM #TEMP
) AS DupRows -- Find duplicates; ROWID restarts at 1 for each new Name
ON DupRows.ID = T.ID
WHERE DupRows.ROWID > 1



-- OR here is an even more elegant solution --


;
WITH Dups(ROWID, ID, Name) AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY #TEMP.Name ORDER BY #TEMP.ID) as ROWID, #TEMP.*
FROM #TEMP
)
DELETE FROM Dups
OUTPUT DELETED.* -- Display the deleted rows
WHERE ROWID > 1 ;
SELECT 'Clean Data Set'
SELECT * FROM #TEMP

DROP TABLE #TEMP
GO


Wednesday, November 21, 2007

How To: Advanced Queryable Compression (Continued from previous post)

Reducing I/O in a processing pipeline is critical when processing high volumes of data. By reducing your I/O you reduce the load on the network, hard drives, and memory.

The following is a research paper by Vijayshankar Raman:
“How to Wring a Table Dry: Entropy of Relations and Querying of Compressed Relations”. http://portal.acm.org/citation.cfm?id=1164201

Here is Vijayshankar Raman's website: http://domino.research.ibm.com/comm/research_people.nsf/pages/rshankar.index.html

Summary Coverage:
• Numeric column compression
• Skew compression
• Correlating columns within a tuple
• Incidental tuple ordering
• Tuple correlations
• Delta coding
• Huffman code scan operations
• Segregated coding
• Query short-circuiting
• Results of 8x-40x compression that can still be queried without being decompressed

This is a continuation from a previous post:
http://dataglass.blogspot.com/2007/08/query-able-compression-no-need-to.html

Tuesday, November 20, 2007

Meta Data VS. Meta Model VS. Data Repository

To clarify some thoughts and clear up some confusion, here are some definitions...

Definition: Meta Data
Information about information. Basically, it's documentation about the information used or contained by various systems. Examples: data dictionaries and data model diagrams.

This is an excellent thing to do at the beginning of a project, no matter how small the project is. See my thoughts on data modeling in previous postings.

Pipeline (ETL) processes can also use Meta Data to learn what is coming into the pipeline and how to process it.

Definition: Meta Data Model (Meta Model)
A flexible data model schema that can store the definition of the data it contains and may include relationship rules and constraints. This can range from a simple name-value pair storage design to an elaborate rule-based model such as the system tables (example: sysobjects) that SQL Server uses to store the definitions of user-defined tables and their constraints.

I have made several comments about Meta Data Models in previous postings, and it is this particular definition that I am referring to.
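
As a rough sketch of the simple end of that spectrum, here is a hypothetical name-value pair Meta Model (the table and column names are my own, purely for illustration):

-- A minimal, hypothetical name-value pair Meta Model
CREATE TABLE MetaEntity
(
EntityID INT IDENTITY(1,1) PRIMARY KEY,
EntityName VARCHAR(255) NOT NULL
)

CREATE TABLE MetaAttribute
(
AttributeID INT IDENTITY(1,1) PRIMARY KEY,
EntityID INT NOT NULL REFERENCES MetaEntity (EntityID),
AttributeName VARCHAR(255) NOT NULL,
DataType VARCHAR(50) NOT NULL -- e.g. 'INT', 'VARCHAR(255)'
)

CREATE TABLE MetaAttributeValue
(
AttributeID INT NOT NULL REFERENCES MetaAttribute (AttributeID),
RowKey INT NOT NULL, -- identifies one logical "row" of the entity
Value VARCHAR(4000) NULL -- every value stored as a string
)

An elaborate rule-based model layers relationship and constraint definitions on top of a design like this, which is essentially what SQL Server's own system tables do.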

Definition: Data Repository
A storage facility that stores data about data, used by major corporations to track their many data sources. It stores the data dictionaries, data models, and data flows. Repositories usually use a Meta Data Model schema design to organize their storage.

Large corporate data warehouses may find this useful for keeping all those data dictionaries, diagrams, and so on organized. It is usually not necessary for smaller businesses.

Monday, November 05, 2007

Free Code: Stored Procedure Code Generator



The above link provides the following:
1. DBCodeGeneration.zip
Description:
You will find that 80% of all stored procedures in a transaction database are very predictable and can be auto-generated. Using a code generator will save you time and create symmetry within your database, which increases predictability and stability for the overall system. This allows you to focus on the other 20% of the stored procedures, which are more complex and critical to the system.

I have found that creating stored procedures that create stored procedures is an easy way to do code generation for your database. A script that depends on Visual Studio is just too bulky and slow to load when you want quick results. This solution I can take with me anywhere without the need for a complex environment.


Feature List:
1. Creates Get Stored Procedures
2. Creates Set Stored Procedures (Performs Insert or Update process depending on conditions)
3. Creates Delete Stored Procedures
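
To give a feel for the generator approach, here is a simplified, hypothetical sketch (not the code in the ZIP): a generator stored procedure reads a table's definition from the system catalog and prints the text of a "Get" procedure.

-- Hypothetical, simplified generator for a "Get" stored procedure
CREATE PROCEDURE GenerateGetProc @TableName SYSNAME
AS
BEGIN
DECLARE @Cols NVARCHAR(MAX), @Sql NVARCHAR(MAX)

-- Build the column list from the system catalog
SELECT @Cols = COALESCE(@Cols + ', ', '') + QUOTENAME(c.name)
FROM sys.columns c
WHERE c.object_id = OBJECT_ID(@TableName)
ORDER BY c.column_id

-- Assumes a single INT key column named ID, purely for illustration
SET @Sql = 'CREATE PROCEDURE Get' + @TableName + ' @ID INT' + CHAR(13)
+ 'AS' + CHAR(13)
+ 'SELECT ' + @Cols + CHAR(13)
+ 'FROM ' + QUOTENAME(@TableName) + CHAR(13)
+ 'WHERE ID = @ID'

PRINT @Sql -- review the generated text, or EXEC (@Sql) to create the procedure
END
GO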

Tuesday, August 14, 2007

Free Code: Audit Tables and Change Control



The above link provides the following:
1. DBAuditTrackingTool.zip
Description:
This tool audits your database tables without changing the tables' original schema. It tracks what, when, and who changed a row within a table. Each audited table has a corresponding history table and trigger.

Feature List:
1. Tracks who, what, and when for all changes made to any audited table.
2. Auto-generates the "Audit Trigger" for tables being audited (a simplified sketch follows below).
3. Auto-generates the "Audit Table" for tables being audited.
4. Has a function called the "TableTardis" to reconstruct what the data looked like at any point in time, past or future.

Current Requirements:
1. A primary key must exist.
2. Audit tables and sprocs are all in the same database as the tables being audited.
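
For illustration only, the kind of objects such a tool generates for a hypothetical table Customer (CustomerID INT primary key, Name VARCHAR(255)) might look roughly like this simplified sketch:

-- Hypothetical audit table and trigger for a table named Customer
CREATE TABLE Customer_History
(
CustomerID INT NOT NULL,
Name VARCHAR(255) NULL,
AuditAction CHAR(1) NOT NULL, -- 'I', 'U', or 'D'
AuditDate DATETIME NOT NULL DEFAULT GETDATE(),
AuditUser SYSNAME NOT NULL DEFAULT SUSER_SNAME()
)
GO

CREATE TRIGGER trg_Customer_Audit ON Customer
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
-- Rows in INSERTED are new rows or the new image of updated rows
INSERT INTO Customer_History (CustomerID, Name, AuditAction)
SELECT i.CustomerID, i.Name,
CASE WHEN EXISTS (SELECT 1 FROM DELETED) THEN 'U' ELSE 'I' END
FROM INSERTED i

-- Rows only in DELETED were deleted
INSERT INTO Customer_History (CustomerID, Name, AuditAction)
SELECT d.CustomerID, d.Name, 'D'
FROM DELETED d
WHERE NOT EXISTS (SELECT 1 FROM INSERTED i WHERE i.CustomerID = d.CustomerID)
END
GO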

2. DBChangeControlAndTrackingTool.zip
Description:
This tool creates a staging environment for change-tracking a database. It tracks what, when, and who changed a row within a table and groups the changes into a batch.

Technique:
Uses a minimal-logging technique to save disk space: only changes are recorded, so if 10% of 1 billion rows change over the course of a year, you will have 100 million rows of audit log. Batches are auto-managed, and a batch window is one full day starting at midnight.

Feature List:
1. Tracks who, what, and when for all changes made to any audited table.
2. Auto-generates the "Audit Trigger" for tables being audited.
3. Auto-generates the "Audit Views" for each table to view current data or history.
4. Has a function called the "TableTardis" to reconstruct what the data looked like at any point in time, past or future (see the sketch below).
5. Automatically tracks groups of changes as a batch. Batches can be controlled manually or automatically; automatic batch tracking groups all changes for a day. This is useful for releasing changes to a production database.

Current Requirements:
1. A primary key must exist.
2. Audit tables, views, and sprocs are all in the same database as the tables being audited.
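
The point-in-time ("TableTardis") idea can be sketched as a query over a history table like the hypothetical Customer_History above: for each key, take the most recent audited image at or before the requested time, and drop keys whose latest change was a delete.

-- Hypothetical "as of" reconstruction from Customer_History
DECLARE @AsOf DATETIME
SET @AsOf = '20071201'

SELECT h.CustomerID, h.Name
FROM Customer_History h
JOIN (SELECT CustomerID, MAX(AuditDate) AS AuditDate
FROM Customer_History
WHERE AuditDate <= @AsOf
GROUP BY CustomerID
) latest
ON latest.CustomerID = h.CustomerID
AND latest.AuditDate = h.AuditDate
WHERE h.AuditAction <> 'D' -- keys whose latest change was a delete did not exist at @AsOf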

3. ServiceControlCenter.zip
Description:
A set of free code that demonstrates how to create a Service Control Center for any application, service, or utility that needs a central database to control and monitor it. This is essential for managing complex systems that contain numerous services; pipelines found in data warehouses are one example. The code is complete with a data model and stored procedures.

Wednesday, August 01, 2007

How To: Simple Query-able Compression (No need to decompress to read file)

Do you wish to compress your data without having to decompress it to read it? Most people think of RAR or ZIP compression when they need to save space on the hard drive. This may save storage and I/O load, but the side effect is that you must decompress the file every time you need to access it. The following are ways to compress your data without requiring decompression to read it.

Normalization: Compression is a natural byproduct of normalizing your data (please see the articles below on normalization and modeling). By normalizing your data you remove redundant data. It’s an effective, non-destructive means of compressing your data into a query-enabled format.
Binary Conversion: Converting from a string-formatted file into a binary-formatted file is another natural means of compressing your data. Reducing a string value of “1002000032” into 4 bytes saves 6 bytes. The strongly typed binary file can be trusted and read by other business processes without the need for string conversion.
Hashing Long String Values: Hashing long string values into a binary hash value and placing the string and its corresponding hash value into a lookup table is another natural means of compressing your data. URL links are common storage hogs. Reducing a 255-byte URL string to a 64-bit hash can save a lot of space if that URL string occurs multiple times within the file. (NOTE: Make sure you select the most appropriate hashing algorithm and the right hash bit length to reduce your odds of collisions.)
Roll Ups (Aggregation): By recording only one unique row with an aggregation count for each time it was recorded within a unit of time, you can reduce the amount of data being recorded (Example: John Doe hit your website home page 3 times in 1 hour; in the log there would be one record with an aggregate count value of 3). This is destructive to your data set, because you lose the retracing of a user’s event path. But this may be a minimal and acceptable loss of data depending on your business.
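
As a simple illustration of the hashing approach (the table and column names here are made up), SQL Server's built-in HASHBYTES function can produce the hash:

-- Illustrative URL lookup table keyed by a hash of the URL string
CREATE TABLE UrlLookup
(
UrlHash BINARY(8) NOT NULL PRIMARY KEY, -- first 8 bytes of the hash
Url VARCHAR(255) NOT NULL
)

CREATE TABLE PageHit
(
HitTime DATETIME NOT NULL,
UrlHash BINARY(8) NOT NULL -- 8 bytes instead of up to 255
)

-- Computing the hash (remember: truncating a hash raises the odds of collisions)
DECLARE @Url VARCHAR(255)
SET @Url = 'http://dataglass.blogspot.com/2007/08/query-able-compression-no-need-to.html'
SELECT CONVERT(BINARY(8), HASHBYTES('MD5', @Url)) AS UrlHash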

Or Get Up To 36x Compression
Using one or more of the above suggestions typically results in a 2x-6x compression ratio, and they are extremely valuable even if you don’t care about having query-enabled compression. If you add RAR on top of it all, you can gain roughly another 6x, which gives you between 12x and 36x compression overall. Not bad for saving space, eh!