Monday, December 10, 2007

How To: Delete Duplicate Data (DeDup)

ETL processes typically will run across the need to delete duplicate records. The fastest and simplist way I know of is to create a Primary Key (or unique index) on the table with the instruction to IGNORE_DUP_KEY=ON . Then when you bulk copy or insert data into the table, SQL quickly ingores the duplicates.

Example:

CREATE TABLE #Temp (ID INT NOT NULL)


ALTER TABLE #Temp
ADD PRIMARY KEY (ID ASC)
WITH (IGNORE_DUP_KEY = ON)
GO

INSERT INTO #Temp (ID)
SELECT 1
UNION ALL
SELECT 2
UNION ALL
SELECT 1

SELECT * FROM #Temp

DROP TABLE #Temp


But sometimes you need to delete duplicate data in a more controlled and auditable way. Here is an approach that does it in a single set statement.

Example:

CREATE TABLE #Temp(ID INT, Name VARCHAR(255))

INSERT INTO #Temp (ID, Name)
SELECT 1, 'Name 1'
UNION ALL
SELECT 2, 'Name 2'
UNION ALL
SELECT 3, 'Name 3'
UNION ALL
SELECT 4, 'Name 4'
UNION ALL
SELECT 5, 'Name 2'
UNION ALL
SELECT 6, 'Name 1'
UNION ALL
SELECT 7, 'Name 3'
UNION ALL
SELECT 8, 'Name 5'
UNION ALL
SELECT 9, 'Name 4'
UNION ALL
SELECT 10, 'Name 4'

SELECT 'Deleted Data Set'

DELETE FROM #TEMP
OUTPUT DELETED.* -- Display the Deleted Rows
FROM #TEMP T
JOIN (SELECT ROW_NUMBER() OVER (PARTITION BY #TEMP.Name ORDER BY #TEMP.ID) as ROWID, #TEMP.*
FROM #TEMP
) AS DupRows -- Find Dubs and assign rowID which resets for each new name
ON DupRows.ID = T.ID
WHERE DupRows.ROWID > 1



-- OR Here is an even more eligant solution --


;
WITH Dubs(ROWID, ID, Name) AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY #TEMP.Name ORDER BY #TEMP.ID) as ROWID, #TEMP.*
FROM #TEMP
)
DELETE FROM Dubs
OUTPUT DELETED.* -- Display the Deleted Rows
WHERE ROWID > 1 ;
SELECT 'Clean Data Set'
SELECT * FROM #TEMP

DROP TABLE #TEMP
GO