In this post we will explore a couple of other advances in the window functions and how they might be applied to other scenarios.

The first scenario uses the AdventureWorksDWDenali database located at http://msftdbprodsamples.codeplex.com/releases/view/55330

In this example, let’s look at the dbo.FactInternetSales table. Some common reporting requests might be:

1) What is the year to date sales by month for a given sales territory?

2) What were the average sales over the last 3 and 6 months shown as moving averages?

3) What is a given month’s sales as a percentage of the year’s total sales?

4) What is the year to date sales by month as a percentage of the year’s total sales?

5) What is a given sales territory’s sales as a percentage for all sales in a given month?

Now, prior to the advancements in the window functions, some of these queries would be extremely difficult to run efficiently against a reporting database. Chances are you would need to resort to temporary tables, work tables, multiple scans of the base table, or pulling huge amounts of data into a client app, where you could process the data more efficiently but would incur large transfer times moving the data to your application tier.

So, what does this look like in SQL Server “Denali”?

WITH Sales_CTE AS (
    SELECT SalesTerritoryKey
         , OrderYear = YEAR(OrderDate)
         , OrderMonth = MONTH(OrderDate)
         , SalesAmount = SUM(SalesAmount)
    FROM dbo.FactInternetSales
    WHERE CurrencyKey = 100
    GROUP BY SalesTerritoryKey, YEAR(OrderDate), MONTH(OrderDate)
)
SELECT SalesTerritoryKey
     , OrderYear
     , OrderMonth
     , SalesAmount
     , YTDSales = SUM(SalesAmount) OVER (
           PARTITION BY SalesTerritoryKey, OrderYear
           ORDER BY OrderYear, OrderMonth
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       )
     , Last3MonthAverageSales = AVG(SalesAmount) OVER (
           PARTITION BY SalesTerritoryKey
           ORDER BY OrderYear, OrderMonth
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       )
     , Last6MonthAverageSales = AVG(SalesAmount) OVER (
           PARTITION BY SalesTerritoryKey
           ORDER BY OrderYear, OrderMonth
           ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
       )
     , CurrentMonthPctOfYear = SUM(SalesAmount) OVER (
           PARTITION BY SalesTerritoryKey, OrderYear
           ORDER BY OrderYear, OrderMonth
           ROWS CURRENT ROW
       )
       / SUM(SalesAmount) OVER (
           PARTITION BY SalesTerritoryKey, OrderYear
       )
     , YTDPCTOfYear = SUM(SalesAmount) OVER (
           PARTITION BY SalesTerritoryKey, OrderYear
           ORDER BY OrderYear, OrderMonth
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       )
       / SUM(SalesAmount) OVER (
           PARTITION BY SalesTerritoryKey, OrderYear
       )
     , SalesTerritoryAsPctOfMonthlyTotal = SUM(SalesAmount) OVER (
           PARTITION BY SalesTerritoryKey
           ORDER BY OrderYear, OrderMonth
           ROWS CURRENT ROW
       )
       / SUM(SalesAmount) OVER (
           PARTITION BY OrderYear, OrderMonth
       )
FROM Sales_CTE
ORDER BY OrderYear, OrderMonth, SalesTerritoryKey

Let’s break this statement apart.

In the first section we create a common table expression that gives us a view of the orders grouped by year, month and sales territory:

WITH Sales_CTE AS (
    SELECT SalesTerritoryKey
         , OrderYear = YEAR(OrderDate)
         , OrderMonth = MONTH(OrderDate)
         , SalesAmount = SUM(SalesAmount)
    FROM dbo.FactInternetSales
    WHERE CurrencyKey = 100
    GROUP BY SalesTerritoryKey, YEAR(OrderDate), MONTH(OrderDate)
)

This could have been accomplished with a derived table or a view… I have gotten in the habit of using common table expressions as this is the way that I think about the problem - first, group the data; then, query the data.

SELECT SalesTerritoryKey
     , OrderYear
     , OrderMonth
     , SalesAmount
     …
     …
     …
FROM Sales_CTE
ORDER BY SalesTerritoryKey, OrderYear, OrderMonth

Here I am just writing the bulk of the query… leaving out the fun bits so that we can talk about them 1 by 1.

1) Year to Date Sales by month and territory.

     , YTDSales = SUM(SalesAmount) OVER (
           PARTITION BY SalesTerritoryKey, OrderYear
           ORDER BY OrderMonth
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       )

In this example we are creating a set of windows on the data defined by SalesTerritoryKey and OrderYear. Within each window we order the data by month. What we present to the aggregate function is the set of rows in that window from the first row through the row that is being evaluated.
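To make the frame concrete, here is a minimal Python sketch of what that window does. It is only an emulation for illustration, not how SQL Server evaluates the query, and the sample rows and the function name `ytd_sales` are made up.

```python
from itertools import groupby

# Hypothetical sample rows: (territory, year, month, sales_amount)
rows = [
    (1, 2007, 1, 100.0),
    (1, 2007, 2, 50.0),
    (1, 2007, 3, 25.0),
    (1, 2008, 1, 10.0),
    (2, 2007, 1, 40.0),
]

def ytd_sales(rows):
    """Emulate SUM(amount) OVER (PARTITION BY territory, year ORDER BY month
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
    # Each (territory, year) group plays the role of one window partition.
    for _, window in groupby(ordered, key=lambda r: (r[0], r[1])):
        running = 0.0
        for territory, year, month, amount in window:
            running += amount  # frame = first row of the partition .. current row
            out.append((territory, year, month, amount, running))
    return out

result = ytd_sales(rows)
for row in result:
    print(row)
```

The running total resets whenever the territory or the year changes, which is exactly what the PARTITION BY clause asks for.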

2) 3 and 6 Month Moving Average

     , Last3MonthAverageSales = AVG(SalesAmount) OVER (
           PARTITION BY SalesTerritoryKey
           ORDER BY OrderYear, OrderMonth
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       )
     , Last6MonthAverageSales = AVG(SalesAmount) OVER (
           PARTITION BY SalesTerritoryKey
           ORDER BY OrderYear, OrderMonth
           ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
       )

In this case we would like to present a different window to the aggregate. In the year to date calculation we wanted a window bounded by the order year. Here we want a larger window that encompasses all of the orders for the sales territory. We order that data by year and month and then return the current row plus the 2 or 5 rows before it, depending on which average we are calculating.
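A small Python sketch of the sliding frame may help here. It assumes one territory's monthly totals are already in year and month order; the function name `moving_average` and the numbers are hypothetical.

```python
def moving_average(amounts, preceding):
    """Emulate AVG(amount) OVER (ORDER BY year, month
    ROWS BETWEEN <preceding> PRECEDING AND CURRENT ROW) for one partition."""
    averages = []
    for i in range(len(amounts)):
        # The frame is clipped at the start of the partition, so the first
        # rows are averaged over fewer than preceding + 1 values.
        frame = amounts[max(0, i - preceding): i + 1]
        averages.append(sum(frame) / len(frame))
    return averages

monthly = [30.0, 60.0, 90.0, 120.0]   # hypothetical monthly totals
last3 = moving_average(monthly, 2)    # 3-row window
last6 = moving_average(monthly, 5)    # 6-row window
print(last3)
print(last6)
```

Note that ROWS counts rows, not calendar months. If a territory has no sales in some month, the CTE produces no row for it and the frame reaches further back in time than the name of the column suggests.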

3 and 4) Current Month and YTD Sales as a Percentage of Total Year’s Sales

     , CurrentMonthPctOfYear = SUM(SalesAmount) OVER (
           PARTITION BY SalesTerritoryKey, OrderYear
           ORDER BY OrderYear, OrderMonth
           ROWS CURRENT ROW
       )
       / SUM(SalesAmount) OVER (
           PARTITION BY SalesTerritoryKey, OrderYear
       )
     , YTDPCTOfYear = SUM(SalesAmount) OVER (
           PARTITION BY SalesTerritoryKey, OrderYear
           ORDER BY OrderYear, OrderMonth
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       )
       / SUM(SalesAmount) OVER (
           PARTITION BY SalesTerritoryKey, OrderYear
       )

In this case you can see that the numerator and the denominator of each calculation are both sums of sales amounts, just over different windows.
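The two ratios can be sketched in Python as follows. This is an illustration with invented data and a hypothetical function name, `pct_of_year`. Because the CTE already holds one row per territory and month, a frame of just the current row sums to that row's own SalesAmount, which the sketch uses directly.

```python
from collections import defaultdict

def pct_of_year(rows):
    """rows: (territory, year, month, amount), one row per territory/month.
    Returns (territory, year, month, month_pct_of_year, ytd_pct_of_year)."""
    # Denominator: SUM(amount) OVER (PARTITION BY territory, year)
    year_total = defaultdict(float)
    for territory, year, month, amount in rows:
        year_total[(territory, year)] += amount

    out = []
    running = defaultdict(float)
    for territory, year, month, amount in sorted(rows):
        key = (territory, year)
        running[key] += amount  # numerator of the YTD ratio
        out.append((territory, year, month,
                    amount / year_total[key],
                    running[key] / year_total[key]))
    return out

sample = [(1, 2007, 1, 100.0), (1, 2007, 2, 300.0), (1, 2007, 3, 600.0)]
result = pct_of_year(sample)
for row in result:
    print(row)
```

The last month of each year always ends with a YTD percentage of 1.0, which is a handy sanity check on the real query as well.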

5) Sales Territory Sales as a Percentage of the Current Month’s sales

     , SalesTerritoryAsPctOfMonthlyTotal = SUM(SalesAmount) OVER (
           PARTITION BY SalesTerritoryKey
           ORDER BY OrderYear, OrderMonth
           ROWS CURRENT ROW
       )
       / SUM(SalesAmount) OVER (
           PARTITION BY OrderYear, OrderMonth
       )

This calculation acts orthogonally to the other calculations in the list! All the other calculations looked at values within a single sales territory; this one looks across all of the sales territories. Notice that the denominator of this calculation partitions by OrderYear and OrderMonth, while all the other functions partition by SalesTerritoryKey.
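As a rough illustration of that change of partition, the following Python sketch totals each month across territories and divides each territory's amount by it. The data and the name `territory_share_of_month` are hypothetical.

```python
from collections import defaultdict

def territory_share_of_month(rows):
    """rows: (territory, year, month, amount), one row per territory/month."""
    # Denominator: SUM(amount) OVER (PARTITION BY year, month),
    # i.e. the total across every territory for that month.
    month_total = defaultdict(float)
    for territory, year, month, amount in rows:
        month_total[(year, month)] += amount
    return {
        (territory, year, month): amount / month_total[(year, month)]
        for territory, year, month, amount in rows
    }

sample = [(1, 2007, 1, 75.0), (2, 2007, 1, 25.0), (1, 2007, 2, 50.0)]
shares = territory_share_of_month(sample)
print(shares)
```

Within any one month the shares of all territories add up to 1.0.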

This is great for running totals which work for sales – but what about account balances? How do I find the opening or closing balance of a period? What if I would like to see year to date growth based on current balance compared to the year’s opening balance? For this, we will shift to the dbo.FactFinance table that tracks account balances for AdventureWorks.

In this query we will look at the AdventureWorks Northeast Organization’s Cash Account Balances.

SELECT Date
     , Amount
     , YTDGrowth = (Amount
           / FIRST_VALUE(Amount) OVER (
                 PARTITION BY YEAR(Date)
                 ORDER BY MONTH(Date)
             )
       ) - 1
     , Last3MonthGrowth = (Amount
           / FIRST_VALUE(Amount) OVER (
                 PARTITION BY AccountKey
                 ORDER BY YEAR(Date), MONTH(Date)
                 ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
             )
       ) - 1
     , Last6MonthGrowth = (Amount
           / FIRST_VALUE(Amount) OVER (
                 PARTITION BY AccountKey
                 ORDER BY YEAR(Date), MONTH(Date)
                 ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
             )
       ) - 1
FROM dbo.FactFinance
WHERE AccountKey = 4
  AND OrganizationKey = 3

Here we use the FIRST_VALUE function to obtain the first cash balance for the year (in year to date growth) or the first balance in a window of 3 or 6 months for the other calculations.
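The growth arithmetic can be sketched in Python like this. It assumes the balances for one account are already in date order within a single partition; the function name `growth_vs_first` and the balances are invented for illustration.

```python
def growth_vs_first(balances, preceding=None):
    """Emulate (amount / FIRST_VALUE(amount) OVER (...)) - 1 for one partition.

    preceding=None: the frame starts at the first row of the partition
                    (the year to date case).
    preceding=n:    the frame is ROWS BETWEEN n PRECEDING AND CURRENT ROW.
    """
    growth = []
    for i, current in enumerate(balances):
        start = 0 if preceding is None else max(0, i - preceding)
        first_value = balances[start]  # first row of the frame
        growth.append(current / first_value - 1)
    return growth

balances = [100.0, 110.0, 121.0, 133.1]   # hypothetical monthly balances
ytd = growth_vs_first(balances)
last3 = growth_vs_first(balances, preceding=2)
print(ytd)
print(last3)
```

Both the sketch and the query divide by the opening balance, so an opening balance of zero fails with a division error. In T-SQL, wrapping the denominator in NULLIF(..., 0) is one common way to return NULL instead.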

There are other new window functions, LAG, LEAD and LAST_VALUE, that we did not cover today, but I hope that you are starting to see the general pattern for using these functions and how to apply them to the problems you face today.

Last month I did a write up on a common scenario that I have seen with my customers where data needs to be re-summarized and the customer would like to avoid writing data into intermediate tables. For example, they may receive a flat file from an external source that has sales amount by month – but instead of each row containing a month’s worth of data, each row contains a year’s worth of data with one month in each column.

In the first portion of that post we talked about unpivoting the data. This can be done in many ways, most commonly with either SSIS or T-SQL’s UNPIVOT operator. This would be similar in SQL Server Denali.

What has changed is our approach to re-summarizing the data. What if we wanted to look at YTD sales or 3 month rolling totals? SQL Server has expanded the capabilities of the window functions beyond the partition and order clauses that we had in SQL Server 2008 R2.

To start with, create a table and populate that table with data as we did in the previous post.

In SQL Server 2008 R2 we created a CTE that unpivoted the data, and then we joined that CTE to itself multiple times to create the offsets that we needed. This works OK for small to medium sized sets of data, but the server has to do a lot of work to produce these results.

This is a bit of an eye-chart. The intent here is not to go over every step of the execution plan but to show that there is a LOT of work that needs to take place to generate the results we need. Specifically, the base table is accessed, filtered and aggregated 5 times.

So, what improvements have been made in the next release of SQL Server and how is it going to help us here?

The next version of SQL Server adds a ROWS or RANGE clause to the OVER clause of a SELECT statement.

The new statement looks like this:

WITH unpivoted_cte AS (
    SELECT VendorId = VendorId
         , MonthAbbr = MonthAbbr
         , Amount = Amount
         , MonthId = ROW_NUMBER() OVER (PARTITION BY VendorId ORDER BY MonthAbbr)
    FROM (
        SELECT VendorId
             , _01_Jan = Jan, _02_Feb = Feb, _03_Mar = Mar, _04_Apr = Apr
             , _05_May = May, _06_Jun = Jun, _07_Jul = Jul, _08_Aug = Aug
             , _09_Sep = Sep, _10_Oct = Oct, _11_Nov = Nov, _12_Dec = Dec
        FROM dbo.MonthlyData
    ) AS p
    UNPIVOT (
        Amount FOR MonthAbbr IN
            (_01_Jan, _02_Feb, _03_Mar, _04_Apr, _05_May, _06_Jun
           , _07_Jul, _08_Aug, _09_Sep, _10_Oct, _11_Nov, _12_Dec)
    ) AS unpvt
)
SELECT VendorId
     , MonthId
     , Amount
     , Total_Ytd = SUM(Amount) OVER (
           PARTITION BY VendorId
           ORDER BY MonthId
           ROWS UNBOUNDED PRECEDING
       )
     , Total_Rolling3 = SUM(Amount) OVER (
           PARTITION BY VendorId
           ORDER BY MonthId
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       )
FROM unpivoted_cte
WHERE VendorId = 999

The execution plan of this statement?

OK, still a fair number of steps, but it is a much more linear plan and the base table is only accessed one time.

If I remove the where clause (where vendorId = 999) this query returns 120,000 rows based on the population script from last month. The old plan executes 6x slower than the new plan.

In future posts we will dive into some of the details of how this query works and look at some other improvements to the window functions that we can take advantage of in SQL Server Denali!

This was a “one time” request so we wanted to get results quickly and then refine the process once the customer decided how/if they wanted to report on the data more frequently. In my quest to eliminate cursors, loops and temp tables from my set-based life I wanted to present a solution that was contained in one SQL query. As we will see below, there are some drawbacks to my approach, but I think this is a good illustration of using common table expressions, ranking functions and the UNPIVOT operator to solve a fairly common problem.

The structure of the source table was similar to:

CREATE TABLE dbo.MonthlyData (
      VendorId int
    , Jan int
    , Feb int
    , Mar int
    , Apr int
    , May int
    , Jun int
    , Jul int
    , Aug int
    , Sep int
    , Oct int
    , Nov int
    , Dec int
    , CONSTRAINT pkc_MonthlyData PRIMARY KEY CLUSTERED (VendorId)
);

Let’s populate some sample data:

SET NOCOUNT ON
DECLARE @i INT = 0
WHILE @i < 10000
BEGIN
    INSERT INTO dbo.MonthlyData VALUES (@i, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
    SELECT @i = @i + 1
END
GO

The first step is to break out the data by month and vendor. We can use the UNPIVOT operator to flip the month’s columns to rows.

SELECT *
FROM (
    SELECT VendorId
         , _01_Jan = Jan, _02_Feb = Feb, _03_Mar = Mar, _04_Apr = Apr
         , _05_May = May, _06_Jun = Jun, _07_Jul = Jul, _08_Aug = Aug
         , _09_Sep = Sep, _10_Oct = Oct, _11_Nov = Nov, _12_Dec = Dec
    FROM dbo.MonthlyData
) AS p
UNPIVOT (
    Amount FOR MonthAbbr IN
        (_01_Jan, _02_Feb, _03_Mar, _04_Apr, _05_May, _06_Jun
       , _07_Jul, _08_Aug, _09_Sep, _10_Oct, _11_Nov, _12_Dec)
) a

We have prefixed each month name with a sortable number so that we can order the results when we calculate our totals. The result of the above looks like:

VendorId | Amount | MonthAbbr |
1 | 1 | _01_Jan |
2 | 1 | _02_Feb |
3 | 1 | _03_Mar |
4 | 1 | _04_Apr |
… | … | … |

We can now use this query as the input to a common table expression:

WITH unpivoted_cte AS (
    SELECT VendorId
         , MonthAbbr
         , Amount
         , MonthId = ROW_NUMBER() OVER (PARTITION BY VendorId ORDER BY MonthAbbr)
    FROM (
        SELECT VendorId
             , _01_Jan = Jan, _02_Feb = Feb, _03_Mar = Mar, _04_Apr = Apr
             , _05_May = May, _06_Jun = Jun, _07_Jul = Jul, _08_Aug = Aug
             , _09_Sep = Sep, _10_Oct = Oct, _11_Nov = Nov, _12_Dec = Dec
        FROM dbo.MonthlyData
    ) AS p
    UNPIVOT (
        Amount FOR MonthAbbr IN
            (_01_Jan, _02_Feb, _03_Mar, _04_Apr, _05_May, _06_Jun
           , _07_Jul, _08_Aug, _09_Sep, _10_Oct, _11_Nov, _12_Dec)
    ) AS unpvt
)
SELECT base.MonthId
     , base.VendorId
     , base.Amount
FROM unpivoted_cte base

The MonthAbbr is converted to an ID of 1-12.
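For readers who find the column-to-row flip easier to follow outside of SQL, here is a minimal Python sketch of the same reshaping. The dictionary layout and the function name `unpivot` are assumptions made for the example; the month ID here comes from the column position rather than from ROW_NUMBER.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def unpivot(wide_row):
    """wide_row: dict with 'VendorId' plus one key per month abbreviation.
    Returns one row per month, tagged with a sortable name and a 1-12 ID."""
    long_rows = []
    for month_id, abbr in enumerate(MONTHS, start=1):
        long_rows.append({
            "VendorId": wide_row["VendorId"],
            "MonthId": month_id,
            "MonthAbbr": "_%02d_%s" % (month_id, abbr),  # e.g. _01_Jan
            "Amount": wide_row[abbr],
        })
    return long_rows

wide = {"VendorId": 999, **{abbr: 1 for abbr in MONTHS}}
long_rows = unpivot(wide)
print(long_rows[0])
print(len(long_rows))
```

One difference worth knowing: T-SQL's UNPIVOT does not return rows for NULL values, so if a month column were NULL the ROW_NUMBER based MonthId in the query would no longer line up with the calendar month. The sample data here has no NULLs, so the issue does not arise.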

Now we can calculate a year to date and a rolling three month calculation for a given vendor…

WITH unpivoted_cte AS (
    SELECT VendorId
         , MonthAbbr
         , Amount
         , MonthId = ROW_NUMBER() OVER (PARTITION BY VendorId ORDER BY MonthAbbr)
    FROM (
        SELECT VendorId
             , _01_Jan = Jan, _02_Feb = Feb, _03_Mar = Mar, _04_Apr = Apr
             , _05_May = May, _06_Jun = Jun, _07_Jul = Jul, _08_Aug = Aug
             , _09_Sep = Sep, _10_Oct = Oct, _11_Nov = Nov, _12_Dec = Dec
        FROM dbo.MonthlyData
    ) AS p
    UNPIVOT (
        Amount FOR MonthAbbr IN
            (_01_Jan, _02_Feb, _03_Mar, _04_Apr, _05_May, _06_Jun
           , _07_Jul, _08_Aug, _09_Sep, _10_Oct, _11_Nov, _12_Dec)
    ) AS unpvt
)
SELECT base.MonthId
     , base.VendorId
     , base.Amount
     , YTD = ytd.Amount
     , Rolling3 = rolling3.Amount
FROM unpivoted_cte base
LEFT JOIN (
    SELECT base.VendorId
         , base.MonthId
         , Amount = SUM(compare.Amount)
    FROM unpivoted_cte base
    JOIN unpivoted_cte compare
      ON base.VendorId = compare.VendorId
     AND base.MonthId >= compare.MonthId
    GROUP BY base.VendorId, base.MonthId
) ytd
  ON base.VendorId = ytd.VendorId
 AND base.MonthId = ytd.MonthId
LEFT JOIN (
    SELECT base.VendorId
         , base.MonthId
         , Amount = SUM(compare.Amount)
    FROM unpivoted_cte base
    JOIN unpivoted_cte compare
      ON base.VendorId = compare.VendorId
     AND compare.MonthId BETWEEN base.MonthId - 2 AND base.MonthId
    GROUP BY base.VendorId, base.MonthId
) rolling3
  ON base.VendorId = rolling3.VendorId
 AND base.MonthId = rolling3.MonthId
WHERE base.VendorId = 999
ORDER BY base.VendorId, base.MonthId
GO

Here we add two left joins that each access the CTE. In the case of YTD we have:

LEFT JOIN (
    SELECT base.VendorId
         , base.MonthId
         , Amount = SUM(compare.Amount)
    FROM unpivoted_cte base
    JOIN unpivoted_cte compare
      ON base.VendorId = compare.VendorId
     AND base.MonthId >= compare.MonthId
    GROUP BY base.VendorId, base.MonthId
) ytd
  ON base.VendorId = ytd.VendorId
 AND base.MonthId = ytd.MonthId

We use the CTE as a base and join the CTE to itself where the “compare” month is less than or equal to the base month. So, for MonthId 4 in the “base” we will return months 1, 2, 3 and 4. We then sum the amounts grouping by vendor and base month.

The result of this query is:

MonthId | VendorId | Amount | YTD | Rolling3 |
1 | 999 | 1 | 1 | 1 |
2 | 999 | 1 | 2 | 2 |
3 | 999 | 1 | 3 | 3 |
4 | 999 | 1 | 4 | 3 |
… | … | … | … | … |
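The two self-joins can be mimicked in a few lines of Python, which makes the join predicates easy to check against the result table. This is a sketch for a single vendor; the function name `ytd_and_rolling3` is made up.

```python
def ytd_and_rolling3(amounts):
    """amounts[i] is the Amount for MonthId i + 1 of a single vendor.
    Returns (MonthId, Amount, YTD, Rolling3) tuples."""
    months = list(enumerate(amounts, start=1))
    result = []
    for base_id, base_amount in months:
        # ytd join predicate: base.MonthId >= compare.MonthId
        ytd = sum(a for cmp_id, a in months if base_id >= cmp_id)
        # rolling3 predicate: compare.MonthId BETWEEN base.MonthId - 2 AND base.MonthId
        rolling3 = sum(a for cmp_id, a in months
                       if base_id - 2 <= cmp_id <= base_id)
        result.append((base_id, base_amount, ytd, rolling3))
    return result

rows = ytd_and_rolling3([1] * 12)   # the sample data stores 1 in every month
for row in rows[:4]:
    print(row)
```

Every base month rescans the whole list of months, which mirrors why the self-join approach has to touch the unpivoted data repeatedly.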

There are a couple of open issues with this approach:

- We are assuming that there is only one year of data per vendor. If we have more than one year we would need to modify the unpivot and use the year in the rolling 3 month and the YTD calculations.

- This is great for a one time re-summarization; however, there are other approaches, including using SSIS to unpivot if you are loading data from a flat file, or breaking the query into multiple steps and populating a work table that holds the unpivoted data.

In my opinion, if you find yourself performing these types of queries in the context of an application or reporting system and not just for one time data analysis then you should look closely at your data structures and update your schema to give you access to the data in a more usable form. That said, this may be a good stop gap as you are re-architecting an existing solution or you have encountered a solution where you cannot change these data structures.

Each time a print job needs to run they may need to find a “hole” in the list of reserved POs. I have seen a number of methods to accomplish this, mostly using temp tables, table variables or cursors. In SQL Server Code Name “Denali” there is a solution using the new Sequence construct that will help out here.

I thought this might be a good topic to show a use of SQL Server ranking functions and common table expressions. Ranking functions were introduced in SQL Server 2005 and include a function ROW_NUMBER that can be used to assign a row number to the result of a query. Common Table Expressions were also introduced in SQL Server 2005 and may be thought of as a temporary result set that is scoped to a single query.

If we simplify this problem, we have a table called ListOfNumber with some number that we will call a TrackingNumber

--
-- Create a table with a list of numbers.
--
CREATE TABLE ListOfNumber (TrackingNumber int PRIMARY KEY CLUSTERED)

--
-- Insert numbers into the list … add some holes
--
SET NOCOUNT ON
DECLARE @i int = 1
WHILE @i < 1000
BEGIN
    IF (rand() < 0.25)
    BEGIN
        INSERT INTO ListOfNumber VALUES (@i)
    END
    SELECT @i = @i + 1
END

So, at this point we have a sparsely populated table with values between 1 and 1000.

Maybe something that looks like:

TrackingNumber |
4 |
5 |
25 |
50 |
… |

If we can join the table to itself, and offset the rows by one we would have:

TrackingNumber | TrackingNumber |

4 | |

4 | 5 |

5 | 25 |

25 | 50 |

50 | … |

… |

And we see that we have holes from 0-4, 5-25. The numbers 4-5 are adjacent so we do not want to report this as a hole.

The following query gets us part way there. Here we get an ordered list of numbers and assign a row number to each row returned. We join the list to itself offsetting the rownum by one to shift the join by one row.

;WITH List_CTE AS (
    SELECT *, RowNum = ROW_NUMBER() OVER (ORDER BY TrackingNumber)
    FROM ListOfNumber
)
SELECT s.RowNum, s.TrackingNumber, e.RowNum, e.TrackingNumber
FROM List_CTE s
JOIN List_CTE e
  ON s.RowNum = e.RowNum - 1

RowNum | TrackingNumber | RowNum | TrackingNumber |
1 | 4 | 2 | 5 |
2 | 5 | 3 | 25 |
3 | 25 | 4 | 50 |

The problem here is that we are missing two holes: 1-4 and the largest number in the table (let’s say it is 50) to 1000. We can modify our CTE to add a start and end that are one number outside the bounds of our sequence. We will also filter our results to remove the adjacent numbers (such as 4-5).

;WITH List_CTE AS (
    SELECT *, RowNum = ROW_NUMBER() OVER (ORDER BY TrackingNumber)
    FROM (
        SELECT 0 AS TrackingNumber
        UNION
        SELECT TrackingNumber FROM ListOfNumber
        UNION
        SELECT 1001 AS TrackingNumber
    ) ListOfNumber
)
SELECT s.RowNum, s.TrackingNumber, e.RowNum, e.TrackingNumber
FROM List_CTE s
JOIN List_CTE e
  ON s.RowNum = e.RowNum - 1
WHERE s.TrackingNumber <> e.TrackingNumber - 1

RowNum | TrackingNumber | RowNum | TrackingNumber |
1 | 0 | 2 | 4 |
3 | 5 | 4 | 25 |
4 | 25 | 5 | 50 |
5 | 50 | 6 | 1001 |

We now have accounted for the gaps at the start and end of our sequence and removed the non-holes caused by adjacent numbers. We can put it all together with:

;WITH List_CTE AS (
    SELECT *, RowNum = ROW_NUMBER() OVER (ORDER BY TrackingNumber)
    FROM (
        SELECT 0 AS TrackingNumber
        UNION
        SELECT TrackingNumber FROM ListOfNumber
        UNION
        SELECT 1001 AS TrackingNumber
    ) ListOfNumber
)
SELECT StartOfHole = s.TrackingNumber
     , EndOfHole = e.TrackingNumber
     , SizeOfHole = (e.TrackingNumber - s.TrackingNumber) - 1
FROM List_CTE s
JOIN List_CTE e
  ON s.RowNum = e.RowNum - 1
WHERE s.TrackingNumber <> e.TrackingNumber - 1

StartOfHole | EndOfHole | SizeOfHole |
0 | 4 | 3 |
5 | 25 | 19 |
25 | 50 | 24 |
50 | 1001 | 950 |
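The same idea (add a sentinel on each side, pair every number with its successor, keep the pairs that are not adjacent) can be sketched in Python. The function name `find_holes` and its `low` and `high` parameters are hypothetical; they stand in for the hard-coded 0 and 1001 in the query.

```python
def find_holes(numbers, low=1, high=1000):
    """Return (StartOfHole, EndOfHole, SizeOfHole) for every gap in numbers
    within the valid range low..high."""
    # Sentinels one step outside the range, deduplicated like the UNION.
    bounded = sorted({low - 1, *numbers, high + 1})
    holes = []
    # zip pairs each value with the next one, like the join on rownum - 1.
    for start, end in zip(bounded, bounded[1:]):
        if start != end - 1:  # skip adjacent numbers such as 4 and 5
            holes.append((start, end, end - start - 1))
    return holes

holes = find_holes([4, 5, 25, 50])
for hole in holes:
    print(hole)
```

As in the query, StartOfHole and EndOfHole are the used numbers on either side of the gap, so the free numbers are the ones strictly between them.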

What if the values at the start and end of the valid range are consumed? In this case, because we pegged our query at 1 less than the min and 1 greater than the max of the range, we would have adjacencies at the start and end of the range that would drop out. The result would simply be:

StartOfHole | EndOfHole | SizeOfHole |
1 | 4 | 2 |
5 | 25 | 19 |
25 | 50 | 24 |
50 | 1000 | 949 |

You can make adjustments if you would like to see only gaps that are larger than a certain size, if your query for the “ListOfNumber” looks different from what is shown, or if you need to parameterize these values. There may be alternatives to showing the gaps that appear at the start and end of the sequence, but I have found that this approach works for many of the cases where I have needed it.

I hope that this helps both give a good solution for finding gaps in number sequences and demonstrates a use of ranking functions and common table expressions.
