Month: January 2010

Code Tips

Row_Number() Part One

Sometimes when you are performing queries, you have data that does not have a primary key and contains duplicate records. If it had had a primary key, you could group on the columns that constitute a duplicate and just keep the MIN or MAX of the primary key, deleting the others. However, lacking that primary key, you have no way of uniquely identifying a row. Additionally, UNIQUEIDENTIFIER types don’t work with MAX and MIN, so you’d have a hard time picking a row in that case as well.

One option is of course to add an identity column to the table and do the operation that I described. However, sometimes you don’t have the appropriate permissions or just aren’t allowed to modify the table to do that. You can use the ROW_NUMBER() function to assign a unique row number to each row by partition. I’ll talk a little more about partition later, but when not explicitly specified in the query, the partition refers to the entire table.

Examine the script below. (All Sql Code contained here will only work on Sql Server 2005 and later)

USE tempdb
GO

CREATE TABLE HasDupes (ProductName VARCHAR(50), UpcCode VARCHAR(10))
GO

INSERT HasDupes (ProductName, UpcCode)
SELECT 'Diapers', '0123456780'
UNION ALL
SELECT 'Diapers', '0123456780'
UNION ALL
SELECT 'Diapers', '0123456780'
UNION ALL
SELECT 'Diapers', '0123456780'
UNION ALL
SELECT 'Formula', '0987654320'
UNION ALL
SELECT 'Formula', '0987654320'
UNION ALL
SELECT 'Formula', '0987654320'
UNION ALL
SELECT 'Diet Coke', '0564534230'
UNION ALL
SELECT 'Chair', '0872372530'
GO

SELECT *
FROM HasDupes
GO

DROP TABLE HasDupes
GO

You get the results shown here:
Initial duplicates.

If we want to add a rownumber to the table, we would alter the part of the query that is

SELECT *
FROM HasDupes
GO

to say this

SELECT ProductName, UpcCode, ROW_NUMBER() OVER (ORDER BY ProductName ASC) AS RowNum
FROM HasDupes
GO

and get these results
Initial row number.

That helps some, but wouldn’t it be better if we could get a row number assigned within groups? That is what the PARTITION BY clause does. Change the SELECT sql again so that it now looks like this (note: you can PARTITION BY and ORDER BY any columns you want, and they can be the same or different from each other)

SELECT ProductName, UpcCode, ROW_NUMBER() OVER (PARTITION BY UpcCode ORDER BY ProductName ASC) AS RowNum
FROM HasDupes
GO

Now, we get this
Row Number with Partition

Okay, now all we have to do is keep the records that get assigned a one, forsaking all others. Change that same section of code again to read like this and then run the script. Note, I am using the CTE syntax here.

WITH ProductDupes
AS
(
	SELECT ProductName, UpcCode, ROW_NUMBER() OVER (PARTITION BY UpcCode ORDER BY ProductName ASC) AS RowNum
	FROM HasDupes
)

DELETE FROM ProductDupes WHERE RowNum > 1
GO

SELECT *
FROM HasDupes
GO

Now we have a table with no duplicates.
No more duplicates

That’s it, that is all there is to it. In my next blog, I will cover how we can use this trick in joins to ensure that we only get a 1 to 1 join, even when the data is 1 to Many.

Code Tips

Common Table Expressions

CTE (Common Table Expression) syntax can be a little tricky at first, but once you have a little primer, you will love using them as much as I do. Examine the following code below that uses a derived table from which to select. (Please realize that this is a contrived example and you could of course just join the tables in the query directly.)

USE tempdb
GO

SELECT t.TableName, c.ColumnName
FROM (SELECT OBJECT_ID AS Id, [Name] AS TableName FROM sys.tables) t
INNER JOIN (SELECT OBJECT_ID AS Id, [Name] AS ColumnName FROM sys.columns) c ON t.Id = c.Id

The query between the parentheses creates a table structure (that I aliased as “c” and “t” for my two) that only lasts for the life of the query. I can then reference those tables in the query in any way that I would a regular table. However, that syntax can make it very difficult to read, especially in long queries with several complicated derived tables.

Sql Server 2005 introduced a new syntax to be used instead called Common Table Expressions (CTEs). Using a CTE, the above query would be written like this

USE tempdb
GO

WITH Tbls AS
(
	SELECT OBJECT_ID AS Id, [Name] AS TableName
	FROM sys.tables
),
Clmns AS
(
	SELECT OBJECT_ID AS Id, [Name] AS ColumnName
	FROM sys.columns
)

SELECT t.TableName, c.ColumnName
FROM Tbls t
INNER JOIN Clmns c ON t.Id = c.Id

Alternatively, you can also explictly declare you column names in the CTE definition and not have to alias the columns in the query itself.

USE tempdb
GO

WITH Tbls (Id, TableName) AS
(
	SELECT OBJECT_ID, [Name]
	FROM sys.tables
),
Clmns (Id, ColumnName) AS
(
	SELECT OBJECT_ID, [Name]
	FROM sys.columns
)

SELECT t.TableName, c.ColumnName
FROM Tbls t
INNER JOIN Clmns c ON t.Id = c.Id

One “gotcha” is that the CTE – beginning with the WITH keyword – must be the first statement that is encountered. All previous statements above it must be terminated with a GO or a semicolon. If you ran the following

USE tempdb
GO

SELECT 'Here Comes the Error'

WITH Tbls (Id, TableName) AS
(
	SELECT OBJECT_ID, [Name]
	FROM sys.tables
),
Clmns (Id, ColumnName) AS
(
	SELECT OBJECT_ID, [Name]
	FROM sys.columns
)

SELECT t.TableName, c.ColumnName
FROM Tbls t
INNER JOIN Clmns c ON t.Id = c.Id

You’d get the errors shown below (or something very close)

Msg 102, Level 15, State 1, Line 4
Incorrect syntax near 'Tbls'.
Msg 102, Level 15, State 1, Line 8
Incorrect syntax near ','.

Some people even just make a habit of actually prefixing the WITH with a semicolon every time to make sure that everything goes off without a hitch. If I do run this script instead there are no issues.

USE tempdb
GO

SELECT 'No Problems Here'

;WITH Tbls (Id, TableName) AS
(
	SELECT OBJECT_ID, [Name]
	FROM sys.tables
),
Clmns (Id, ColumnName) AS
(
	SELECT OBJECT_ID, [Name]
	FROM sys.columns
)

SELECT t.TableName, c.ColumnName
FROM Tbls t
INNER JOIN Clmns c ON t.Id = c.Id

This is just a quick primer on CTEs, but I hope that you can find it useful. I really find that it cleans my code up quite a bit.

Pete on Software