Thursday, December 27, 2007

Free Code: Bulk Loading and Partitioning Tables



ZIP File: TablePartitioningTool.zip

This experimental toolset grew out of my desire for a simple, repeatable solution to a complex problem: table loading and partitioning management for time-based event data.

1. Loading can be a sore spot in any ETL pipeline: when it is too slow, the backlog of import data grows without bound.
2. Partitioning in SQL Server 2005 is an extremely powerful and essential feature for dealing with large tables, but the partitioning logic can be very complex to write.
3. Each time you create a loading and partitioning process for a table, you may implement it differently. This makes the behavior of your overall service, application, or warehouse unpredictable, and therefore more costly to maintain and improve.

By taking a code-generating approach to your loading and partitioning needs, you can quickly roll out solutions throughout your company. This approach offers the added bonus of establishing a general standard for handling this complex problem, lowering your company's overall cost of ownership.

FEATURES:

1. Creates stored procedures that handle all the loading, partitioning, switching, and merging logic to the specifications of a given table template.

2. Provides table loading management
  • Ensures minimal transaction logging
  • Manages micro-loading*
  • Keeps the table online while it is being loaded
3. Provides partitioning management
  • Manages explicit time-based partitioning at grains of Hour, Day, Week, Month, or Year for any given table
  • Manages rolling off partitions older than a configurable maximum
* NOTE: Micro-loading in this context means loading a smaller time range of data than the explicit partition grain of the target table. Example: every five minutes you load a file covering a five-minute time range into a table that is explicitly partitioned by day.
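To illustrate the kind of logic the generated procedures encapsulate, here is a minimal, hand-written sketch of day-grained partitioning and rolling off the oldest partition. All object names (pfEventDay, psEventDay, EventData, and so on) are hypothetical, not part of the toolset:

```sql
-- Hypothetical example: a table partitioned by day, with the oldest
-- partition rolled off via SWITCH and the empty boundary merged away.
CREATE PARTITION FUNCTION pfEventDay (DATETIME)
AS RANGE RIGHT FOR VALUES ('2007-12-25', '2007-12-26', '2007-12-27')

CREATE PARTITION SCHEME psEventDay
AS PARTITION pfEventDay ALL TO ([PRIMARY])

CREATE TABLE EventData (EventTime DATETIME NOT NULL, Payload VARCHAR(255))
ON psEventDay (EventTime)
GO

-- Roll off the oldest day: switch its partition out into an empty,
-- identically structured table on the same filegroup (a metadata-only
-- operation), then remove the now-empty boundary point.
CREATE TABLE EventData_RollOff (EventTime DATETIME NOT NULL, Payload VARCHAR(255))

ALTER TABLE EventData SWITCH PARTITION 1 TO EventData_RollOff
ALTER PARTITION FUNCTION pfEventDay() MERGE RANGE ('2007-12-25')

DROP TABLE EventData_RollOff -- or archive it first
GO
```

Because SWITCH and MERGE only touch metadata, the roll-off completes in moments regardless of how much data the partition holds, which is why the table can stay online throughout.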

Monday, December 10, 2007

How To: Delete Duplicate Data (DeDup)

ETL processes typically run across the need to delete duplicate records. The fastest and simplest way I know of is to create a primary key (or unique index) on the table with IGNORE_DUP_KEY = ON. Then, when you bulk copy or insert data into the table, SQL Server quickly ignores the duplicates.

Example:

CREATE TABLE #Temp (ID INT NOT NULL)


ALTER TABLE #Temp
ADD PRIMARY KEY (ID ASC)
WITH (IGNORE_DUP_KEY = ON)
GO

INSERT INTO #Temp (ID)
SELECT 1
UNION ALL
SELECT 2
UNION ALL
SELECT 1

SELECT * FROM #Temp

DROP TABLE #Temp


But sometimes you need to delete duplicate data in a more controlled and auditable way. Here is an approach that does it in a single set-based statement.

Example:

CREATE TABLE #Temp(ID INT, Name VARCHAR(255))

INSERT INTO #Temp (ID, Name)
SELECT 1, 'Name 1'
UNION ALL
SELECT 2, 'Name 2'
UNION ALL
SELECT 3, 'Name 3'
UNION ALL
SELECT 4, 'Name 4'
UNION ALL
SELECT 5, 'Name 2'
UNION ALL
SELECT 6, 'Name 1'
UNION ALL
SELECT 7, 'Name 3'
UNION ALL
SELECT 8, 'Name 5'
UNION ALL
SELECT 9, 'Name 4'
UNION ALL
SELECT 10, 'Name 4'

SELECT 'Deleted Data Set'

DELETE T
OUTPUT DELETED.* -- Display the deleted rows
FROM #TEMP T
JOIN (SELECT ROW_NUMBER() OVER (PARTITION BY #TEMP.Name ORDER BY #TEMP.ID) AS ROWID, #TEMP.*
      FROM #TEMP
     ) AS DupRows -- Find dups: ROWID restarts at 1 for each new name
ON DupRows.ID = T.ID
WHERE DupRows.ROWID > 1



-- Or, an even more elegant solution --


;
WITH Dups (ROWID, ID, Name) AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY #TEMP.Name ORDER BY #TEMP.ID) AS ROWID, #TEMP.*
FROM #TEMP
)
DELETE FROM Dups
OUTPUT DELETED.* -- Display the deleted rows
WHERE ROWID > 1 ;
SELECT 'Clean Data Set'
SELECT * FROM #TEMP

DROP TABLE #TEMP
GO
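When "auditable" means you first want to see how much duplication exists before deleting anything, a simple aggregate against the same #Temp table (run before the DELETE statements above) does the job:

```sql
-- Count how many extra copies each duplicated Name carries.
SELECT Name,
       COUNT(*) AS TotalRows,
       COUNT(*) - 1 AS DuplicateRows -- rows the DELETE would remove
FROM #Temp
GROUP BY Name
HAVING COUNT(*) > 1
```

Against the sample data above, this reports Name 1 through Name 4 with one or two duplicates each, matching the row counts the OUTPUT DELETED.* clause displays.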