Monday, December 10, 2007

How To: Delete Duplicate Data (DeDup)

ETL processes typically run into the need to delete duplicate records. The fastest and simplest way I know of is to create a primary key (or unique index) on the table with IGNORE_DUP_KEY = ON. Then, when you bulk copy or insert data into the table, SQL Server quickly ignores the duplicates.

Example:

CREATE TABLE #Temp (ID INT NOT NULL)


ALTER TABLE #Temp
ADD PRIMARY KEY (ID ASC)
WITH (IGNORE_DUP_KEY = ON) -- duplicate keys are discarded with a warning instead of raising an error
GO

INSERT INTO #Temp (ID)
SELECT 1
UNION ALL
SELECT 2
UNION ALL
SELECT 1 -- duplicate of the first row; ignored thanks to IGNORE_DUP_KEY

SELECT * FROM #Temp -- returns only IDs 1 and 2

DROP TABLE #Temp


But sometimes you need to delete duplicate data in a more controlled and auditable way. Here is an approach that does it in a single set-based statement.

Example:

CREATE TABLE #Temp(ID INT, Name VARCHAR(255))

INSERT INTO #Temp (ID, Name)
SELECT 1, 'Name 1'
UNION ALL
SELECT 2, 'Name 2'
UNION ALL
SELECT 3, 'Name 3'
UNION ALL
SELECT 4, 'Name 4'
UNION ALL
SELECT 5, 'Name 2'
UNION ALL
SELECT 6, 'Name 1'
UNION ALL
SELECT 7, 'Name 3'
UNION ALL
SELECT 8, 'Name 5'
UNION ALL
SELECT 9, 'Name 4'
UNION ALL
SELECT 10, 'Name 4'

SELECT 'Deleted Data Set'

DELETE FROM #TEMP
OUTPUT DELETED.* -- Display the deleted rows
FROM #TEMP T
JOIN (SELECT ROW_NUMBER() OVER (PARTITION BY #TEMP.Name ORDER BY #TEMP.ID) AS ROWID, #TEMP.*
      FROM #TEMP
     ) AS DupRows -- Find dups and assign a ROWID that resets for each new Name
  ON DupRows.ID = T.ID
WHERE DupRows.ROWID > 1



-- Or, here is an even more elegant solution --


;WITH Dups (ROWID, ID, Name) AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY #TEMP.Name ORDER BY #TEMP.ID) AS ROWID, #TEMP.*
    FROM #TEMP
)
DELETE FROM Dups
OUTPUT DELETED.* -- Display the deleted rows
WHERE ROWID > 1 ;
SELECT 'Clean Data Set'
SELECT * FROM #TEMP

DROP TABLE #TEMP
GO


Monday, November 05, 2007

Free Code: Stored Procedure Code Generator



The above link provides the following:
1. DBCodeGeneration.zip
Description:
You will find that 80% of the stored procedures in a transactional database are very predictable and can be auto-generated. Using a code generator will save you time and create symmetry within your database, which increases predictability and stability for the overall system. This allows you to focus on the other 20% of the stored procedures, which are more complex and critical to the system.

I have found that creating stored procedures that create stored procedures is an easy way to do code generation for your database. A script that depends on Visual Studio is just too bulky and slow to load when you want quick results. This solution I can take with me anywhere without the need for a complex environment.


Feature List:
1. Creates Get Stored Procedures
2. Creates Set Stored Procedures (performs an insert or update process depending on conditions; see the sketch below)
3. Creates Delete Stored Procedures
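
To give a feel for the generated output, here is a rough sketch of the kind of "Set" (insert-or-update) procedure such a generator might emit. The dbo.Customer table, its columns, and the procedure name are hypothetical placeholders for illustration, not the actual code in the download:

-- Hypothetical example of a generated "Set" procedure (insert or update).
-- dbo.Customer and its columns are placeholders for illustration only.
CREATE TABLE dbo.Customer
(
    CustomerID INT          NOT NULL PRIMARY KEY,
    Name       VARCHAR(255) NOT NULL,
    Email      VARCHAR(255) NULL
)
GO

CREATE PROCEDURE dbo.Customer_Set
    @CustomerID INT,
    @Name       VARCHAR(255),
    @Email      VARCHAR(255)
AS
BEGIN
    SET NOCOUNT ON

    IF EXISTS (SELECT 1 FROM dbo.Customer WHERE CustomerID = @CustomerID)
        UPDATE dbo.Customer
        SET Name  = @Name,
            Email = @Email
        WHERE CustomerID = @CustomerID
    ELSE
        INSERT INTO dbo.Customer (CustomerID, Name, Email)
        VALUES (@CustomerID, @Name, @Email)
END
GO

-- Usage: the same call works for a new or an existing CustomerID.
EXEC dbo.Customer_Set @CustomerID = 1, @Name = 'Name 1', @Email = 'name1@example.com'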

Wednesday, August 01, 2007

How To: Simple Query-able Compression (No need to decompress to read file)

Do you wish to compress your data without having to decompress it to read it? Most people think of RAR or ZIP compression when they need to save space on the hard drive. That may save storage and I/O load, but the side effect of this approach is that you must decompress the file every time you need to access it. The following are some means of compressing your data without requiring decompression to read it.

Normalization: Compression is a natural byproduct of normalizing your data (please see the articles below on normalization and modeling). By normalizing your data you remove redundant data. It’s an effective, non-destructive means of compressing your data into a query-enabled format.
Binary Conversion: Converting from a string-formatted file into a binary-formatted file is another natural means of compressing your data. Reducing a string value of “1002000032” into 4 bytes saves 6 bytes (see the small demonstration below). A strongly data-typed binary file can be trusted and read by other business processes without the need to do string conversion.
Hashing Long String Values: Hashing long string values into a binary hash value and placing the string and its corresponding hash value into a lookup table is another natural means of compressing your data. URL links are common storage hogs. Reducing a 255-byte URL string to a 64-bit hash can save a lot of space if that URL string occurs multiple times within the file (sketched below). (NOTE: Make sure you select the most appropriate hashing algorithm and the right hash bit length to reduce your odds of collisions.)
Roll Ups (Aggregation): By recording only one unique row and placing an aggregate count for each time it was recorded within a unit of time, you can reduce the amount of data being recorded (Example: John Doe hit your website home page 3 times in 1 hr; in the log there would be one record with an aggregate count value of 3; sketched below). This is destructive to your data set, because you lose the ability to retrace a user’s event path. But this may be a minimal and acceptable loss of data depending on your business.
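
To make the byte math in the binary conversion point concrete, here is a tiny T-SQL check (the variable names are just for illustration):

-- Quick check of the binary-conversion savings: 10 characters vs. a 4-byte INT.
DECLARE @AsString CHAR(10), @AsInt INT
SET @AsString = '1002000032'
SET @AsInt = CAST(@AsString AS INT)

SELECT DATALENGTH(@AsString) AS StringBytes, -- 10
       DATALENGTH(@AsInt)    AS IntBytes     -- 4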
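
Here is a minimal sketch of the hash lookup table idea. The table names are hypothetical, and taking the first 8 bytes of HASHBYTES('SHA1', ...) as a 64-bit hash is just one possible choice, not a recommendation of a specific algorithm:

-- Minimal sketch of a URL lookup table keyed by a 64-bit hash.
-- Table names and the hash choice (first 8 bytes of SHA-1) are assumptions.
CREATE TABLE #UrlLookup
(
    UrlHash BINARY(8)    NOT NULL PRIMARY KEY,
    Url     VARCHAR(255) NOT NULL
)

CREATE TABLE #PageHit
(
    HitDate DATETIME  NOT NULL,
    UrlHash BINARY(8) NOT NULL -- 8 bytes instead of up to 255
)

DECLARE @Url VARCHAR(255), @UrlHash BINARY(8)
SET @Url = 'http://www.example.com/some/long/path?with=parameters'
SET @UrlHash = CONVERT(BINARY(8), HASHBYTES('SHA1', @Url)) -- keep the first 8 bytes

-- Store the full string once...
IF NOT EXISTS (SELECT 1 FROM #UrlLookup WHERE UrlHash = @UrlHash)
    INSERT INTO #UrlLookup (UrlHash, Url) VALUES (@UrlHash, @Url)

-- ...and only the 8-byte hash on every hit.
INSERT INTO #PageHit (HitDate, UrlHash) VALUES (GETDATE(), @UrlHash)

-- Join back to the lookup table whenever the readable URL is needed.
SELECT H.HitDate, L.Url
FROM #PageHit H
JOIN #UrlLookup L ON L.UrlHash = H.UrlHash

DROP TABLE #PageHit
DROP TABLE #UrlLookup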
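
And here is a roll-up sketch using the John Doe example; the raw log table and its columns are hypothetical placeholders:

-- Roll-up sketch: collapse raw hits into one row per user, page, and hour.
CREATE TABLE #RawLog
(
    UserName VARCHAR(50)  NOT NULL,
    PageUrl  VARCHAR(255) NOT NULL,
    HitTime  DATETIME     NOT NULL
)

INSERT INTO #RawLog (UserName, PageUrl, HitTime)
SELECT 'John Doe', '/home', '2007-08-01 09:05' UNION ALL
SELECT 'John Doe', '/home', '2007-08-01 09:20' UNION ALL
SELECT 'John Doe', '/home', '2007-08-01 09:55'

-- One row per user/page/hour with an aggregate hit count (3 in this case).
SELECT UserName,
       PageUrl,
       DATEADD(hh, DATEDIFF(hh, 0, HitTime), 0) AS HitHour, -- truncate to the hour
       COUNT(*) AS HitCount
FROM #RawLog
GROUP BY UserName, PageUrl, DATEADD(hh, DATEDIFF(hh, 0, HitTime), 0)

DROP TABLE #RawLog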

Or Get Up To 36x Compression
Using one or more of the above suggestions typically results in a 2x-6x compression ratio. They are extremely valuable even if you don’t care about having query-enabled compression. If you add RAR on top of it all, you can gain roughly another 6x, which puts you between 12x and 36x overall. Not bad for saving space, eh!