Common scenario, you want a subset of data which is duplicate across some attributes but does contain additional attributes being not duplicate at all. For a quick sample, I used the new function of table row constructors in SQL Server 2008 to generate a small result set. (The inner part of the query with the table constructors shows the data pulled from a table in a real world scenario, but this is easier for you to repro)
The sample is about staffing a project with SQL Server people from partners. As you do want to spread your team across the companies and only want one (the best) from each of your partners team you create a query to determine the people.
From the data mentioned above you probably want to have Don,Pete and Candy on your team, right ? But how to get those people from a query ?
The solution I often saw on the forums and discussion groups was the following: You try to group by the CompanyName but get an error that if you want to get the columns in addition to the one you already grouped by, that you will have to include them either in the Group clause or in an aggregate function. So you decide to use the MAX function to get the highest skill level and some employee from the set:
But wait a minute, this will bring back the following results:
The second and the third row seem to be ok, but why is Steven Executor selected ? Did we specify the query for the maximum skill level ? Yes sure, but also the Max EmpName, which is Steven as Steven is upfront in the alphabet. See we get a mess of compound data which isn´t valid at all. Steven is not even the wrong person, he also got a Skill Level of 1000 although he only has proven his skills to spell the word S-Q-L. What a mess ! You can try example by mixing and matching the different aggregate functions and might be able to get the wanted result but imagine you have more than three companies and mix all the data together, you sure won´t be able to get the right results for every kind of data.
Ok, so what is the best solution to get the data in a consistent state ? You will have to determine the whole row and not just fragments of it. E.g. you can use one of the new RANK functions to get the data in a consistent state:
WITH TheEmpList( CompanyName,Skill,EmpName,SkillLevel,RankNumber)
DENSE_RANK() OVER(PARTITION BY CompanyName ORDER BY SkillLevel DESC) AS RankNumber FROM
(‘(A) (c)ompany (m)aking (e)verything’,’SQL Server’, ‘Don DenseRank’, 1000),
(‘(A) (c)ompany (m)aking (e)verything’,’SQL Server’, ‘Bob the Builder’, 50),
(‘(A) (c)ompany (m)aking (e)verything’,’SQL Server’, ‘Steven Executor’, 5),
(‘SQL Server Cowboys’,’SQL Server’, ‘Pete McQuery’,50),
(‘Fabulous Cartesian Guys’,’SQL Server’,’Candy Builder’,60)
SELECT CompanyName, Skill, EmpName, SkillLevel
WHERE RankNumber = 1
By ordering the result with DENSE_RANK and by the attribute Skilllevel descending, you will get one of the best employees from the partners, voilá !
Next time you´d better check your data twice 😉 !