Sometimes when you are performing queries, you have data that does not have a primary key and contains duplicate records. If it had had a primary key, you could group on the columns that constitute a duplicate and just keep the MIN or MAX of the primary key, deleting the others. However, lacking that primary key, you have no way of uniquely identifying a row. Additionally, UNIQUEIDENTIFIER types don’t work with MAX and MIN, so you’d have a hard time picking a row in that case as well.
One option is of course to add an identity column to the table and do the operation that I described. However, sometimes you don’t have the appropriate permissions or just aren’t allowed to modify the table to do that. You can use the ROW_NUMBER() function to assign a unique row number to each row by partition. I’ll talk a little more about partition later, but when not explicitly specified in the query, the partition refers to the entire table.
Examine the script below. (All Sql Code contained here will only work on Sql Server 2005 and later)
USE tempdb GO CREATE TABLE HasDupes (ProductName VARCHAR(50), UpcCode VARCHAR(10)) GO INSERT HasDupes (ProductName, UpcCode) SELECT 'Diapers', '0123456780' UNION ALL SELECT 'Diapers', '0123456780' UNION ALL SELECT 'Diapers', '0123456780' UNION ALL SELECT 'Diapers', '0123456780' UNION ALL SELECT 'Formula', '0987654320' UNION ALL SELECT 'Formula', '0987654320' UNION ALL SELECT 'Formula', '0987654320' UNION ALL SELECT 'Diet Coke', '0564534230' UNION ALL SELECT 'Chair', '0872372530' GO SELECT * FROM HasDupes GO DROP TABLE HasDupes GO
You get the results shown here:
If we want to add a rownumber to the table, we would alter the part of the query that is
SELECT * FROM HasDupes GO
to say this
SELECT ProductName, UpcCode, ROW_NUMBER() OVER (ORDER BY ProductName ASC) AS RowNum FROM HasDupes GO
and get these results
That helps some, but wouldn’t it be better if we could get a row number assigned within groups? That is what the PARTITION BY clause does. Change the SELECT sql again so that it now looks like this (note: you can PARTITION BY and ORDER BY any columns you want, and they can be the same or different from each other)
SELECT ProductName, UpcCode, ROW_NUMBER() OVER (PARTITION BY UpcCode ORDER BY ProductName ASC) AS RowNum FROM HasDupes GO
Now, we get this
Okay, now all we have to do is keep the records that get assigned a one, forsaking all others. Change that same section of code again to read like this and then run the script. Note, I am using the CTE syntax here.
WITH ProductDupes AS ( SELECT ProductName, UpcCode, ROW_NUMBER() OVER (PARTITION BY UpcCode ORDER BY ProductName ASC) AS RowNum FROM HasDupes ) DELETE FROM ProductDupes WHERE RowNum > 1 GO SELECT * FROM HasDupes GO
Now we have a table with no duplicates.
That’s it, that is all there is to it. In my next blog, I will cover how we can use this trick in joins to ensure that we only get a 1 to 1 join, even when the data is 1 to Many.