ETL processes regularly run into the need to delete duplicate records. The fastest and simplest way I know of is to create a primary key (or unique index) on the table with IGNORE_DUP_KEY = ON. Then, when you bulk copy or insert data into the table, SQL Server quickly ignores the duplicates, raising a "Duplicate key was ignored." warning instead of an error.
Example:
CREATE TABLE #Temp (ID INT NOT NULL)
ALTER TABLE #Temp
ADD PRIMARY KEY (ID ASC)
WITH (IGNORE_DUP_KEY = ON)
GO
INSERT INTO #Temp (ID)
SELECT 1
UNION ALL
SELECT 2
UNION ALL
SELECT 1 -- duplicate: silently ignored thanks to IGNORE_DUP_KEY
SELECT * FROM #Temp -- returns only 1 and 2
DROP TABLE #Temp
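The same trick works with a unique index instead of a primary key, which is handy when the table already has a different key. Here is a quick sketch along the same lines (the table and index names are just illustrations):
CREATE TABLE #Temp2 (ID INT NOT NULL)
CREATE UNIQUE INDEX IX_Temp2_ID ON #Temp2 (ID) WITH (IGNORE_DUP_KEY = ON)
GO
INSERT INTO #Temp2 (ID)
SELECT 1
UNION ALL
SELECT 2
UNION ALL
SELECT 1 -- duplicate: ignored, not an error
SELECT * FROM #Temp2 -- returns only 1 and 2
DROP TABLE #Temp2
GO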
But sometimes you need to delete duplicate data in a more controlled and auditable way. Here is an approach that does it in a single set-based statement.
Example:
CREATE TABLE #Temp(ID INT, Name VARCHAR(255))
INSERT INTO #Temp (ID, Name)
SELECT 1, 'Name 1'
UNION ALL
SELECT 2, 'Name 2'
UNION ALL
SELECT 3, 'Name 3'
UNION ALL
SELECT 4, 'Name 4'
UNION ALL
SELECT 5, 'Name 2'
UNION ALL
SELECT 6, 'Name 1'
UNION ALL
SELECT 7, 'Name 3'
UNION ALL
SELECT 8, 'Name 5'
UNION ALL
SELECT 9, 'Name 4'
UNION ALL
SELECT 10, 'Name 4'
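-- (Optional) Preview what the DELETE below will remove: the same ROW_NUMBER() logic,
-- filtered to ROWID > 1, lists every extra copy of a Name before anything is touched
SELECT DupRows.*
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY #TEMP.Name ORDER BY #TEMP.ID) AS ROWID, #TEMP.*
FROM #TEMP
) AS DupRows
WHERE DupRows.ROWID > 1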
SELECT 'Deleted Data Set'
DELETE FROM #TEMP
OUTPUT DELETED.* -- Display the Deleted Rows
FROM #TEMP T
JOIN (SELECT ROW_NUMBER() OVER (PARTITION BY #TEMP.Name ORDER BY #TEMP.ID) as ROWID, #TEMP.*
FROM #TEMP
) AS DupRows -- Find dups and assign a ROWID that resets for each new Name
ON DupRows.ID = T.ID
WHERE DupRows.ROWID > 1
-- OR here is an even more elegant solution (if you run both, the first DELETE has already removed the duplicates, so this one finds nothing left to delete) --
;
WITH Dups(ROWID, ID, Name) AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY #TEMP.Name ORDER BY #TEMP.ID) as ROWID, #TEMP.*
FROM #TEMP
)
DELETE FROM Dups
OUTPUT DELETED.* -- Display the Deleted Rows
WHERE ROWID > 1 ;
SELECT 'Clean Data Set'
SELECT * FROM #TEMP
DROP TABLE #TEMP
GO
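If you need a lasting audit trail rather than just displaying the deleted rows, the OUTPUT clause can also write them INTO a table. Here is a sketch of the CTE version using a hypothetical #DeletedRows audit table:
CREATE TABLE #Temp (ID INT, Name VARCHAR(255))
INSERT INTO #Temp (ID, Name)
SELECT 1, 'Name 1'
UNION ALL
SELECT 2, 'Name 2'
UNION ALL
SELECT 3, 'Name 1'
CREATE TABLE #DeletedRows (ID INT, Name VARCHAR(255), DeletedAt DATETIME DEFAULT GETDATE())
;
WITH Dups(ROWID, ID, Name) AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY #Temp.Name ORDER BY #Temp.ID) AS ROWID, #Temp.*
FROM #Temp
)
DELETE FROM Dups
OUTPUT DELETED.ID, DELETED.Name INTO #DeletedRows (ID, Name) -- Capture the deleted rows for auditing
WHERE ROWID > 1 ;
SELECT * FROM #DeletedRows -- Audit table now holds ID 3, 'Name 1'
SELECT * FROM #Temp -- Clean data set: IDs 1 and 2
DROP TABLE #Temp
DROP TABLE #DeletedRows
GO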