You've Got Framing
The framing clause brings the "window" to window functions. Through it you define a window, or frame of reference if you prefer, that slides along a result set generating values such as running totals and moving averages. The framing clause is ideal when you can arrange a business question such that an answer comes from applying an aggregate function to a range of rows that slides or stretches smoothly as focus moves from one row to the next.
Example Data: This is the fifth in a series of articles exploring window functions and their enhanced implementation in SQL Server 2012. If you would like to follow along, execute this script to create the example table and data. All articles in this series use the same data set.
If you haven't read the preceding four articles, I strongly encourage you to take the time and read them. Especially read Interlude: the Mechanics of Window Functions and How They Work, which explains window function operation in painful detail. This article doesn't rehash the previous four; it assumes familiarity with the partition by and order by clauses as applied inside the over clause.
Running sums (or running totals) are common in business. They also happen to be a good place to begin when exploring the framing clause and what it can do. Imagine that you wish to produce a list of quarterly sales showing a running total by department. Following is a query to do that, and some results from executing the query. Glance over the query now; then read the detailed explanation following the output.
/*------------------------ select department, year, quarter, sales, sum(sales) over ( partition by department order by year, quarter) running from quarterly_sales order by year, quarter, department; ------------------------*/ department year quarter sales running ---------------- ------ ------- ----------- ----------- Appliances 2000 1 235956.23 235956.23 Electronics 2000 1 345873.14 345873.14 Jewelry 2000 1 240939.33 240939.33 Mattresses 2000 134873.3334873.33 Vacuums 2000 165873.3265873.32 Appliances 2000 2 129871.95 365828.18 Electronics 2000 2 401144.56 747017.70 Jewelry 2000 2 304984.38 545923.71 Mattresses 2000 247383.9382257.26 Vacuums 2000 256983.43 122856.75 Appliances 2000 3 398213.39 764041.57 Electronics 2000 3 454987.331202005.03 ...
Look at the three rows in these results for the Appliances Department. Notice how the running sum progresses from 235,956.23 to 365,828.18, to 764,041.57. Each subsequent Appliances row adds that row's sales amount to the running sum. What's the magic? How does the query operate? Let's walk through it:
sum(sales) over ( partition by department
sum(sales) over ( partition by department order by year, quarter
sum(sales) over ( partition by department order by year, quarter range between unbounded preceding
sum(sales) over ( partition by department order by year, quarter range between unbounded preceding and current row) running
sum(sales) over ( partition by department order by year, quarter) running
We want a running sum, so we begin by invoking the aggregate function to sum the sales amounts:
The running sum should be computed separately for each department:
And of course, the running sum should be computed in order by year and quarter:
Now for a tricky bit. The frame of reference for each row should begin at the beginning:
...and end at the current row:
But! It turns out that range between unbounded preceding and current row is default behavior triggered by the order by clause. It is common to omit the phrasing and simplify the function call to just:
You can execute the preceding example query using the invocation of sum from either of Step 5 or Step 6. You'll get the same result both ways. A simple call to sum(sales) over (partition by department) results insum being invoked for all years and quarters for a each department. But the moment you add order by year, quarter, you trigger the default framing clause.
Tip! The default framing clause sometimes trips people up and leads to confusion. There is nothing inherent in the words "order by" to cause one to expect that adding an order by clause to a window function invocation would change the rows over which the function results are computed. Keep in mind when writing window functions that adding an order by clause does more than just sort the values feeding into the function's execution. It also triggers the default framing clause, thereby changing the function's results.
The moving average is another widely-used computation in business. Imagine looking at quarterly sales data for, say, the Appliances Department. You might see some pretty wild swings from quarter to quarter, perhaps due to seasonal or other reasons. For example:
/*------------------------ select year, quarter, sales from quarterly_sales where department = 'Appliances' order by year, quarter ------------------------*/ year quarter sales ------ ------- ---------- 2000 1235956.23 2000 2129871.95 2000 3398213.39 2000 4 54111.05 2001 1 87230.33 2001 2105231.94 ...
The moving average is a way to smooth out the numbers and get a better sense of the trend over time. You might decide to compute a moving average involving five quarters: the current quarter of focus plus two quarters in either direction. Thus, the following five rows would determine the moving average reported for year 2000, quarter 3:
2000 1235956.23 2000 2129871.95 2000 3398213.39 2000 4 54111.05 2001 1 87230.33 (235956.23 + 129871.95 + 398213.39 + 54111.05 + 87230.33) / 5 = 181076.59 = Moving Average for year 2000, quarter 3, centered over five quarters
Following is the window function invocation to compute such an average:
avg(sales) over ( partition by department order by year, quarter rows between 2 preceding and 2 following)
The framing clause here is: rows between 2 preceding and 2 following. For each current row, the average sales amount is computed using the sales values from the two prior and two subsequent rows. The result is a five-value moving average centered on the row for which the function is being evaluated.
Note: The order by clause gives meaning to the framing clause. Centering the average over five values means nothing when the values are in random order. You must sort. You must also ensure no gaps. In this case, you must ensure a continuous run of year and quarter results, with no missing rows.
Following is a query making use of the above function to generate a five-quarter moving average of sales by department. There are a few embellishments that I'll explain after the results:
/*------------------------ select * from ( select department, year, quarter, sales, round(avg(sales) over ( partition by department order by year, quarter rows between 2 preceding and 2 following),2) moving, count(sales) over ( partition by department order by year, quarter rows between 2 preceding and 2 following) counting from quarterly_sales ) quarterly_sales_moving where counting = 5 order by department, year, quarter; ------------------------*/ department year quarter sales movingcounting ---------- ------ ------- ---------- -------------- -------- Appliances 2000 3398213.39181076.590000 5 Appliances 2000 4 54111.05154931.730000 5 Appliances 2001 1 87230.33162421.990000 5 ... Appliances 2010 3300174.32297853.530000 5 Appliances 2010 4289756.12312093.540000 5 Appliances 2011 1330174.34324555.370000 5 ...
You should be able to follow the logic of this query if you keep the following in mind:
The round function rounds off the moving average values to two decimal places.
A call to count(sales), having the same framing clause as used for avg(sales), returns the number of discrete values that went into each of the moving average values.
The outer query filters on the count to eliminate any row for which five values centered over that row are not available.
My example data consists of a complete run of quarterly sales data from quarter 1 of 2000 through quarter 3 of 2011. It is not possible to compute a five-row moving average centered on the first two quarters of 2000, nor on quarters 2 and 3 of 2011. That's because the requisite number of preceding and following rows do not exist for those cases. I use count(sales) in conjunction with an enclosing query to eliminate those edge-case rows from the results.
Rows versus Range
You've seen that a framing clause can be written with range between, or rows between. What's the difference? Here it is in a nutshell:
Use to specify an exact number of rows preceding or rows following. rows between(You cannot specify an ordinal number of rows without writing rows between). Take care to sort input values deterministically! Write an order by clause into your window function that leaves no room for doubt, that will generate the same ordering each time it is executed. Endpoints such as unbounded preceding, unbounded following, and current row refer to specific, single rows. For example, the expression current row refers to only the one row.
Use when values are sorted ambiguously range between--for example, when your sort key is such that multiple rows map to the same position, and thus you cannot guarantee the ordering of those rows from one sort to the next. Endpoints such as unbounded preceding, unbounded following, andcurrent row refer to groups of rows that all sort to the same position because they share the same sort key values. For example, the expression current row will include all rows having the same order byvalues as the current row.
Work through the example and explanation to follow. Then circle back and reread the preceding bullets. Do that, and the distinction between range and rows should begin to make sense.
Imagine that you're reporting on sales by department. You want to show a running total of sales by department, by quarter. You already know you can do that easily by writing:
sum(sales) over ( partition by department order by year, quarter)
You've seen this invocation before. You know how it works.
Computing a running total of sales across all departments is as simple as removing the partitioning by department. Do that, and you'll get:
sum(sales) over ( order by year, quarter)
Pause here to recall the default framing clause, which I'll now add explicitly:
sum(sales) over ( order by year, quarter range between unbounded preceding and current row)
Now change range to row, giving a fourth function to consider:
sum(sales) over ( order by year, quarter rows between unbounded preceding and current row)
Put all four functions together and execute them as part of a query restricted to the Appliances and Electronics departments, and you'll get results similar to the following. I say "similar to", because one of the function invocations is non-deterministic. Your results in the far right-hand column might differ from my own.
/*------------------------ select department, year, quarter, sales, sum(sales) over ( partition by department order by year, quarter) running1, sum(sales) over ( order by year, quarter) running2, sum(sales) over ( order by year, quarter range between unbounded preceding and current row) running3, sum(sales) over ( order by year, quarter rows between unbounded preceding and current row) running4 from quarterly_sales where department in ('Appliances', 'Electronics') order by department, year, quarter; ------------------------*/ departmentyear quarter sales running1 running2 running3 running4 ----------- ---- ------- --------- ---------- ---------- ---------- ---------- Appliances2000 1 235956.23 235956.23581829.37581829.37235956.23 Appliances2000 2 129871.95365828.18 1112845.88 1112845.88 1112845.88 Appliances2000 3 398213.39764041.57 1966046.60 1966046.60 1511059.27 ... Electronics 2000 1 345873.14 345873.14581829.37581829.37581829.37 Electronics 2000 2 401144.56747017.70 1112845.88 1112845.88982973.93 Electronics 2000 3 454987.33 1202005.03 1966046.60 1966046.60 1966046.60 ...
The bold and underlined values in these results pretty much tell the tale. First though, here's a summary of the key columns in this query:
sales gives the quarterly sales by department.
running1 gives a running sum of sales by department.
running2 gives a running sum of sales by year and quarter.
running3 is the same as running2, but with the framing clause explicitly written out.
running4 gives an incorrect result due to the use of rows between rather than range between.
The following diagram illustrates the effect of invoking the rows between logic for the running4 column. The sort order shown is what must have occurred on my own system when I executed the query. You can readily see how the scope of rows between unbounded preceding and current row changes as focus moves from Appliances to Electronics in year 2000, quarter 1.
Sales values sorted by year and quarter => 235956.23 => current row, Appliances, year 2000, quarter 1 345873.14 => current row, Electronics, year 2000, quarter 1 129871.95 401144.56 398213.39 454987.33
When the window function behind running4 is evaluated for the Appliances Department, year 2000, quarter 1, the only value in scope is 235956.23. Thus, that value becomes the first of the running sums that you see in the column's output. Later, when the same function is evaluated for the Electronics Department in the same year and quarter, the scope is expanded to also include 345873.14, and hence the running sum for that row of output is given as 235956.23 + 345873.14 = 581829.37.
By contrast, use of range between changes the semantics such that "current row" refers not to a single row, but to all rows that can sort to the same position as the current row:
Sales values sorted by year and quarter => 235956.23 345873.14 => These two rows for year 2000, quarter 1 129871.95 401144.56 => These two rows for year 2000, quarter 2 398213.39 454987.33 => These two rows for year 2000, quarter 3
When the window function behind running3 is evaluated for Appliances, the meaning of "current row" will encompass both values in the same year and quarter. Ditto for Electronics. Both evaluations work out to: 235956.23 + 345873.14 = 581829.37. And the same is true of the running2 column, because the default semantics is the range between behavior.
Caution! There is no sense in applying rows between when the sort key is not determinate. The two values per quarter in the preceding example could just as easily be flipped around in the opposite order. You can't depend upon any particular ordering within a quarter. Thus you cannot depend upon any particular result from using rows between. The range between behavior is the only behavior that can make sense in such a case.
Don't allow your thinking to stop at running sums and moving averages. Explore what the other functions have to offer. You can get creative with max and min, for example, to identify new highs and lows. Imagine being asked the following question:
In what quarters do the sales represent a new high for the given department?
To identify a new high, one must compare the current quarter's sales amount against the highest amount from all past quarters. As the current quarter advances, the list of past quarters stretches and grows longer. This shifting frame of reference is your clue to think about the framing clause.
Begin with the max function to identify the highest sales amount:
Partition by department, because the goal is to find new highs within a given department's sales history:
max(sales) over ( partition by department
Sort by year and quarter so the history can be examined in chronological order:
max(sales) over ( partition by department order by year, quarter
Then apply a framing clause so that each evaluation of the function considers all the history up to, but not including, the current row:
max(sales) over ( partition by department order by year, quarter rows between unbounded preceding and 1 preceding
Execute this function as part of a query, and review the results:
/*------------------------ select department, year, quarter, sales, max(sales) over ( partition by department order by year, quarter rows between unbounded preceding and 1 preceding ) prev_max from quarterly_sales order by department, year, quarter; ------------------------*/ departmentyear quarter sales prev_max ----------- ------ ------- ---------- ---------- Appliances2000 1235956.23 NULL Appliances2000 2129871.95235956.23 Appliances2000 3398213.39235956.23 Appliances2000 4 54111.05398213.39 Appliances2001 1 87230.33398213.39 ... Electronics 2000 1345873.14 NULL Electronics 2000 2401144.56345873.14 Electronics 2000 3454987.33401144.56 Electronics 2000 4476539.23454987.33 Electronics 2001 1467043.14476539.23 ...
You can see that it's now a simple matter to wrap this query in an outer query that filters on the condition(sales > prev_max) or (prev_max is null). If you want to look at new highs over, say, a five year period rather than over the entire history, you can rewrite the framing clause as: rows between 60 preceding and 1 preceding. Replace max with min, and you can search out new lows. The framing clause is a great tool to have in your kit. Its expressive power makes many previously difficult problems trivial and even fun to solve.
Finis! You've reached the end of my five-part series on window functions and their implementation in SQL Server 2012. Thank you for reading this far. I hope the articles are helpful. Fire an email my way if they have been. I'd love to hear from you. And if you haven't already done so, be sure to read the other articles in the series.