You've Got Framing!

The framing clause brings the window to window functions. Through it you define a window, or frame of reference if you prefer, that slides along a result set generating values such as running totals and moving averages. The framing clause is ideal when you can arrange a business question such that an answer comes from applying an aggregate function to a range of rows that slides or stretches smoothly as focus moves from one row to the next.

Example Data: This is the fifth in a series of articles exploring window functions and their enhanced implementation in SQL Server 2012. If you would like to follow along, execute this script to create the example table and data. All articles in this series use the same data set.

If you haven't read the preceding four articles, I strongly encourage you to take the time and read them. Especially read Interlude: the Mechanics of Window Functions and How They Work, which explains window function operation in painful detail. This article doesn't rehash the previous four; it assumes familiarity with the partition by and order by clauses as applied inside the over clause.

Running Sums

Running sums (or running totals) are common in business. They also happen to be a good place to begin when exploring the framing clause and what it can do. Imagine that you wish to produce a list of quarterly sales showing a running total by department. Following is a query to do that, and some results from executing the query. Glance over the query now; then read the detailed explanation following the output.

/*------------------------
select department, year, quarter, sales,
   sum(sales) over (
	  partition by department
	  order by year, quarter) running
from quarterly_sales
order by year, quarter, department;
------------------------*/

department         year quarter       sales     running
---------------- ------ ------- ----------- -----------
Appliances         2000       1   235956.23   235956.23
Electronics        2000       1   345873.14   345873.14
Jewelry            2000       1   240939.33   240939.33
Mattresses         2000       1    34873.33    34873.33
Vacuums            2000       1    65873.32    65873.32
Appliances         2000       2   129871.95   365828.18
Electronics        2000       2   401144.56   747017.70
Jewelry            2000       2   304984.38   545923.71
Mattresses         2000       2    47383.93    82257.26
Vacuums            2000       2    56983.43   122856.75
Appliances         2000       3   398213.39   764041.57
Electronics        2000       3   454987.33  1202005.03
...

Look at the three rows in these results for the Appliances Department. Notice how the running sum progresses from 235,956.23 to 365,828.18, to 764,041.57. Each subsequent Appliances row adds that row's sales amount to the running sum. What's the magic? How does the query operate? Let's walk through it:

We want a running sum, so we begin by invoking the aggregate function to sum the sales amounts:

sum(sales)

The running sum should be computed separately for each department:

sum(sales) over (
   partition by department

And of course, the running sum should be computed in order by year and quarter:

sum(sales) over (
   partition by department
   order by year, quarter

Now for a tricky bit. The frame of reference for each row should begin at the beginning:

sum(sales) over (
   partition by department
   order by year, quarter
   range between
      unbounded preceding

...and end at the current row:

sum(sales) over (
   partition by department
   order by year, quarter
   range between
      unbounded preceding and current row) running

But! It turns out that range between unbounded preceding and current row is default behavior triggered by the order by clause. It is common to omit the phrasing and simplify the function call to just:

sum(sales) over (
   partition by department
   order by year, quarter) running

You can execute the preceding example query using the invocation of sum from either of Step 5 or Step 6. You'll get the same result both ways. A simple call to sum(sales) over (partition by department) results in sum being invoked for all years and quarters for each department. But the moment you add order by year, quarter, you trigger the default framing clause.

Tip! The default framing clause sometimes trips people up and leads to confusion. There is nothing inherent in the words “order by” to cause one to expect that adding an order by clause to a window function invocation would change the rows over which the function results are computed. Keep in mind when writing window functions that adding an order by clause does more than just sort the values feeding into the function's execution. It also triggers the default framing clause, thereby changing the function's results.

Moving Average

The moving average is another widely-used computation in business. Imagine looking at quarterly sales data for, say, the Appliances Department. You might see some pretty wild swings from quarter to quarter, perhaps due to seasonal or other reasons. For example:

/*------------------------
select year, quarter, sales
from quarterly_sales
where department = 'Appliances'
order by year, quarter
------------------------*/

  year quarter  sales
------ ------- ----------
  2000       1  235956.23
  2000       2  129871.95
  2000       3  398213.39
  2000       4   54111.05
  2001       1   87230.33
  2001       2  105231.94
...

The moving average is a way to smooth out the numbers and get a better sense of the trend over time. You might decide to compute a moving average involving five quarters: the current quarter of focus plus two quarters in either direction. Thus, the following five rows would determine the moving average reported for year 2000, quarter 3:

2000   1  235956.23
2000   2  129871.95
2000   3  398213.39
2000   4   54111.05
2001   1   87230.33

   (235956.23 + 129871.95 + 398213.39 + 54111.05 + 87230.33) / 5 
  = 181076.59 = Moving Average for year 2000, quarter 3,
              centered over five quarters

Following is the window function invocation to compute such an average:

avg(sales) over (
      partition by department
      order by year, quarter
      rows between 2 preceding and 2 following)

The framing clause here is: rows between 2 preceding and 2 following. For each current row, the average sales amount is computed using the sales values from the two prior and two subsequent rows. The result is a five-value moving average centered on the row for which the function is being evaluated.

Note: The order by clause gives meaning to the framing clause. Centering the average over five values means nothing when the values are in random order. You must sort. You must also ensure no gaps. In this case, you must ensure a continuous run of year and quarter results, with no missing rows.

Following is a query making use of the above function to generate a five-quarter moving average of sales by department. There are a few embellishments that I'll explain after the results:

/*------------------------
select * from
(
select department, year, quarter, sales,
   round(avg(sales) over (
      partition by department
      order by year, quarter
      rows between 2 preceding and 2 following),2) moving,
   count(sales) over (
      partition by department
      order by year, quarter
      rows between 2 preceding and 2 following) counting
from quarterly_sales
) quarterly_sales_moving
where counting = 5
order by department, year, quarter;
------------------------*/

department   year quarter      sales         moving  counting
---------- ------ ------- ---------- --------------  --------
Appliances   2000       3  398213.39  181076.590000         5
Appliances   2000       4   54111.05  154931.730000         5
Appliances   2001       1   87230.33  162421.990000         5
...
Appliances   2010       3  300174.32  297853.530000         5
Appliances   2010       4  289756.12  312093.540000         5
Appliances   2011       1  330174.34  324555.370000         5
...

You should be able to follow the logic of this query if you keep the following in mind:

The round function rounds off the moving average values to two decimal places.
A call to count(sales), having the same framing clause as used for avg(sales), returns the number of discrete values that went into each of the moving average values.
The outer query filters on the count to eliminate any row for which five values centered over that row are not available.

My example data consists of a complete run of quarterly sales data from quarter 1 of 2000 through quarter 3 of 2011. It is not possible to compute a five-row moving average centered on the first two quarters of 2000, nor on quarters 2 and 3 of 2011. That's because the requisite number of preceding and following rows do not exist for those cases. I use count(sales) in conjunction with an enclosing query to eliminate those edge-case rows from the results.

Rows versus Range

You've seen that a framing clause can be written with range between, or rows between. What's the difference? Here it is in a nutshell:

Use rows between to specify an exact number of rows preceding or rows following. (You cannot specify an ordinal number of rows without writing rows between). Take care to sort input values deterministically! Write an order by clause into your window function that leaves no room for doubt, that will generate the same ordering each time it is executed. Endpoints such as unbounded preceding, unbounded following, and current row refer to specific, single rows. For example, the expression current row refers to only the one row.
Use range between when values are sorted ambiguously–for example, when your sort key is such that multiple rows map to the same position, and thus you cannot guarantee the ordering of those rows from one sort to the next. Endpoints such as unbounded preceding, unbounded following, and current row refer to groups of rows that all sort to the same position because they share the same sort key values. For example, the expression current row will include all rows having the same order by values as the current row.

Work through the example and explanation to follow. Then circle back and reread the preceding bullets. Do that, and the distinction between range and rows should begin to make sense.

Imagine that you're reporting on sales by department. You want to show a running total of sales by department, by quarter. You already know you can do that easily by writing:

sum(sales) over (
   partition by department
   order by year, quarter)

You've seen this invocation before. You know how it works. Computing a running total of sales across all departments is as simple as removing the partitioning by department. Do that, and you'll get:

sum(sales) over (
   order by year, quarter)

Pause here to recall the default framing clause, which I'll now add explicitly:

sum(sales) over (
   order by year, quarter
   range between unbounded preceding and current row)

Now change range between to rows between, giving a fourth function to consider:

sum(sales) over (
   order by year, quarter
   rows between unbounded preceding and current row)

Put all four functions together and execute them as part of a query restricted to the Appliances and Electronics departments, and you'll get results similar to the following. I say “similar to”, because one of the function invocations is non-deterministic. Your results in the far right-hand column might differ from my own.

/*------------------------
select department, year, quarter, sales,
   sum(sales) over (
	  partition by department
	  order by year, quarter) running1,
   sum(sales) over (
	  order by year, quarter) running2,
   sum(sales) over (
	  order by year, quarter
	  range between unbounded preceding and current row) running3,
   sum(sales) over (
	  order by year, quarter
	  rows between unbounded preceding and current row) running4
from quarterly_sales
where department in ('Appliances', 'Electronics')
order by department, year, quarter;
------------------------*/

department  year quarter     sales   running1   running2   running3   running4
----------- ---- ------- --------- ---------- ---------- ---------- ----------
Appliances  2000       1 235956.23  235956.23  581829.37  581829.37  235956.23
Appliances  2000       2 129871.95  365828.18 1112845.88 1112845.88 1112845.88
Appliances  2000       3 398213.39  764041.57 1966046.60 1966046.60 1511059.27
...
Electronics 2000       1 345873.14  345873.14  581829.37 581829.37   581829.37
Electronics 2000       2 401144.56  747017.70 1112845.88 1112845.88  982973.93
Electronics 2000       3 454987.33 1202005.03 1966046.60 1966046.60 1966046.60
...

The bold values in these results pretty much tell the tale. First though, here's a summary of the key columns in the output:

sales gives the quarterly sales by department.
running1 gives a running sum of sales by department.
running2 gives a running sum of sales by year and quarter.
running3 is the same as running2, but with the framing clause explicitly written out.
running4 gives an incorrect result due to the use of rows between rather than range between.

The following diagram illustrates the effect of invoking the rows between logic for the running4 column. The sort order shown is what must have occurred on my own system when I executed the query. You can readily see how the scope of rows between unbounded preceding and current row changes as focus moves from Appliances to Electronics in year 2000, quarter 1.

Sales values sorted
by year and quarter => 235956.23 => current row, Appliances, year 2000, quarter 1
                       345873.14 => current row, Electronics, year 2000, quarter 1
                       129871.95
                       401144.56
                       398213.39
                       454987.33

When the window function behind running4 is evaluated for the Appliances Department, year 2000, quarter 1, the only value in scope is 235956.23. Thus, that value becomes the first of the running sums that you see in the column's output. Later, when the same function is evaluated for the Electronics Department in the same year and quarter, the scope is expanded to also include 345873.14, and hence the running sum for that row of output is given as 235956.23 + 345873.14 = 581829.37.

By contrast, use of range between changes the semantics such that current row refers not to a single row, but to all rows that can sort to the same position as the current row:

Sales values sorted
by year and quarter => 235956.23
                       345873.14 => These two rows for year 2000, quarter 1
                       129871.95
                       401144.56 => These two rows for year 2000, quarter 2
                       398213.39
                       454987.33 => These two rows for year 2000, quarter 3

When the window function behind running3 is evaluated for Appliances, the meaning of current row will encompass both values in the same year and quarter. Ditto for Electronics. Both evaluations work out to: 235956.23 + 345873.14 = 581829.37. And the same is true of the running2 column, because the default semantics is the range between behavior.

Caution! There is no sense in applying rows between when the sort key is not determinate. The two values per quarter in the preceding example could just as easily be flipped around in the opposite order. You can't depend upon any particular ordering within a quarter. Thus you cannot depend upon any particular result from using rows between. The range between behavior is the only behavior that can make sense in such a case.

New Maximums

Don't allow your thinking to stop at running sums and moving averages. Explore what the other functions have to offer. You can get creative with max and min, for example, to identify new highs and lows. Imagine being asked the following question:

In what quarters do the sales represent a new high for the given department?

To identify a new high, one must compare the current quarter's sales amount against the highest amount from all past quarters. As the current quarter advances, the list of past quarters stretches and grows longer. This shifting frame of reference is your clue to think about the framing clause.

Begin with the max function to identify the highest sales amount:

max(sales)

Partition by department, because the goal is to find new highs within a given department's sales history:

max(sales) over (
   partition by department

Sort by year and quarter so the history can be examined in chronological order:

max(sales) over (
   partition by department
   order by year, quarter

Then apply a framing clause so that each evaluation of the function considers all the history up to, but not including, the current row:

max(sales) over (
   partition by department
   order by year, quarter
   rows between unbounded preceding and 1 preceding

Execute this function as part of a query, and review the results:

/*------------------------
select department, year, quarter, sales,
   max(sales) over (
      partition by department
      order by year, quarter
      rows between unbounded preceding and 1 preceding
   ) prev_max
from quarterly_sales
order by department, year, quarter;
------------------------*/

department    year quarter      sales   prev_max
----------- ------ ------- ---------- ----------
Appliances    2000       1  235956.23       NULL
Appliances    2000       2  129871.95  235956.23
Appliances    2000       3  398213.39  235956.23
Appliances    2000       4   54111.05  398213.39
Appliances    2001       1   87230.33  398213.39
...
Electronics   2000      1   345873.14       NULL
Electronics   2000      2   401144.56  345873.14
Electronics   2000      3   454987.33  401144.56
Electronics   2000      4   476539.23  454987.33
Electronics   2001      1   467043.14  476539.23
...

You can see that it's now a simple matter to wrap this query in an outer query that filters on the condition (sales > prev_max) or (prev_max is null). If you want to look at new highs over, say, a five year period rather than over the entire history, you can rewrite the framing clause as: rows between 60 preceding and 1 preceding. Replace max with min, and you can search out new lows. The framing clause is a great tool to have in your kit. Its expressive power makes many previously difficult problems trivial and even fun to solve.

Finis!

You've reached the end of my five-part series on window functions and their implementation in SQL Server 2012. Thank you for reading this far. I hope the articles are helpful. Fire an email my way if they have been. I'd love to hear from you. And if you haven't already done so, be sure to read the other articles in the series.

Published on Jan 11, 2012