MEDIAN: For When You Don't Really Mean It
Median is a measure of central tendency useful in describing the typical experience, or the typical case. It's a type of average along with the mean, but less susceptible to skew from outlying values. Mean is useful when you need the math to work out, such as when you want your average daily sales to sum to your total monthly sales. Median is useful in describing, say, what the typical sale looks like, or what the typical customer is likely to spend.
SQL Statistic Series: You are reading part of an ongoing series covering the statistical functions available in the Oracle Database implementation of SQL.
Your Bike Shop Dream
Fatigued from the drudgery of DBA life you decide to quit databases. Raiding your 401k plan for startup funding, you fulfill your lifelong dream of opening a bike shop. You love bikes and are handy with the wrenches, so you hire your daughter as salesperson while you keep to the back room and spend your first week rehabbing big-box, Huffy and Next-brand mountain-bikes. It doesn't get any better than that.
Wait! It does get better. Yes, it does. Your daughter turns out to be a master salesperson. She has moved a lotof bikes and the showroom floor is looking a bit empty. Time to order up some fresh stock.
Example Data: Download and execute medianExamples.sql to create the example table and data used in this article.
As you sit down with your coffee and think about what new bikes you'd like to stock, the thought occurs to consider the average sale price from the preceding week. You didn't totally forget those database skills. Every bicycle sale is carefully recorded in a database of your own design. Firing up good old SQL*Plus, you hammer out a query:
SELECT COUNT(bprice), AVG(bprice) FROM bsales WHERE sdate >= DATE '2013-04-01' AND sdate < DATE '2013-04-06'; COUNT(BPRICE) AVG(BPRICE) ------------- ----------- 25 2004.80
These are great results! Twenty-five bikes sold, at an average price of just over $2000 each. You pause to make a note about paying your daughter a bonus. Then you turn toward your dealer book and begin looking over the many well-equipped bikes having prices near the $2000 mark. You decide to focus on the $1500-2500 range, figuring that gives some options for those wanting to spend slightly higher or lower than average.
Just as you're about to click the submit button to commit your new stocking order on the dealer website for the brand you sell, your daughter walks by and comments that she's sold only one $2000 bike all week. Spewing coffee and choking at the same time, you frantically click at the cancel button instead.
Close call. Order canceled. But what happened? What did you really sell last week?
Median and How to Compute It
You applied the AVG function in your query earlier to generate the mean bicycle price, but the mean is highly affected by outliers. A small handful of expensive purchases can skew the mean bicycle price to a value painting an erroneous picture of the more numerous and less-expensive transactions that typify your daily sales. Figure 1 takes a subset of the prices and contrasts the effect of a single large purchase on the mean versus the median.
Most of the purchases in Figure 1 fall into the $500 to $600 range. That's the picture the median paints too, without being affected by the one outlier from an atypical customer. (It is more reasonable to tell someone that the typical sale price is $580 than to claim it is $2240). The mean on the other hand does not represent the typical purchase, nor does it even manage to represent the one outlier. Mean is meaningless in this context.
Offsite Reading: John Myles White provides excellent perspective on the three most common "averages" in his blog entry Modes, Medians and Means: A Unifying Perspective. His blog entry is a must-read if you enjoy the math behind the statistics.
Figure 2 illustrates how the median is computed. Simply line up the values in order from least to greatest. Choose whatever value lies in the middle. (If you have an even number of values, then take the mean of the two middle ones). Simple, and done! The result is termed the median, and will divide your data such that you have the same number of values below as above.
Median is referred to by statisticians as being more robust with respect to outliers than is the mean. And that is true. Yet no matter the statistic, there are always corner cases to contend with. The final example in Figure 2 represents one such case: An even number of values having a large gap between the two mid-point values can result in a median that is essentially a mean painting a poor picture of the typical case represented by the data.
Tip: In any large data set, you can test the robustness of the median by seeing whether and by how much it changes after discarding a single value.
To avoid being caught out by corner cases, good statisticians will pay attention to distribution, to sample size, and to trend over time. See the section on "Distribution" in my earlier article STDDEV: Standing Sentinal on Your Data for a discussion of distribution and why it matters. Keep sample size in mind, because anomalies such as the bottom result in Figure 2 often are the result of tiny sample sizes. Pay attention to trend too. A sudden result of $4760 after many weeks of median sales in the $500 to $600 range should not be accepted easily, and should instead be treated as a suspicious result motivating you to dig deeper and take a closer look at the underlying data to see why that result has occurred.
Historical tidbit: It is not as easy to compute the median from code for an arbitrary number of input values as to visualize the concept in a figure. Early implementations of SQL omitted support for a MEDIAN function, and early books and articles on SQL are replete with often inscrutable approaches for computing the median without direct support from the language. Partly for that reason, and because it's less resource-intensive to compute, the mean is sometimes found applied in cases where median might be more appropriate.
You realize now that median provides a better indication of your typical customer and amount of purchase. Changing your query to execute the MEDIAN function, you run it once more:
SELECT COUNT(bprice), MEDIAN(bprice) FROM bsales WHERE sdate >= DATE '2013-04-01' AND sdate < DATE '2013-04-06'; COUNT(BPRICE) MEDIAN(BPRICE) ------------- -------------- 25 580
These are much different results. Deciding to test the robustness of your median, you compute a second median that arbitrarily excludes one value:
SELECT COUNT(bprice), MEDIAN(bprice), MEDIAN(CASE WHEN ROWNUM > 1 THEN bprice END) AS robust_test FROM bsales WHERE sdate >= DATE '2013-04-01' AND sdate < DATE '2013-04-06'; COUNT(BPRICE) MEDIAN(BPRICE) ROBUST_TEST ------------- -------------- ----------- 25 580 580
You end up with one median computed from an odd quantity of values, and another from an even quantity of values. The two medians in this case are the same, which is as good as it gets. A small difference would be no cause for alarm. However, a large difference such as the change in magnitude from 580 to 4760 shown earlier in Figure 2 certainly should make you sit up and take notice. Applying this test of arbitrarily discarding one value to generate a second median helps protect you from being mislead by a corner case.
Note: My use of ROWNUM in the most recent query essentially discards a random value. More precisely, the value discarded is simply the first value that happens to be considered by the query. The discarded value will not necessarily be the least or the greatest in the column being queried. Nor does that matter. The goal is to flip the number of values from even to odd, or vice-versa. You can discard any value arbitrarily to achieve that goal.
Curious to know a bit more, you query for sales of bikes priced over the $1000 mark:
SELECT bdesc, bprice FROM bsales WHERE sdate >= DATE '2013-04-01' AND sdate < DATE '2013-04-06' AND bprice >= 1000; BDESC BPRICE ----------------------------------- ---------- S-Works Fate Carbon 29 7500 S-Works Enduro Carbon 9000 Rockhopper Comp 29 1100 S-Works Epic Carbon 29 XTR 10000 S-Works Demo 8 Carbon Team Replica 10000 Camber 29 1900
Only six out of the twenty-five bikes sold last week were at prices above $1000, and it's really just four very expensive bikes that have skewed the mean to such a high value as to mislead you into almost restocking at the wrong price point. The median sales price of $580 is a much better representation of your typical customer. Relaxing a bit, you sit down again to look at your dealer book, this time at models falling into the $400 to $1000 range.
Note: Indeed, most bike shop owners will tell you that their bread-and-butter comes from selling large quantities of low- to moderately-priced bikes. In fact, the bread-and-butter often comes more from repair work than from sales of new bikes. To sell four bikes in excess of $7500 in a five-day period would be highly unusual. The daughter most certainly deserves her bonus.
Acknowledgments: My thanks to Stéphane Faroult, Dr. Jon Westfall, and John Myles White for reviewing a pre-publication draft and helping me to clarify some of the passages and figures in this article. Special thanks to John Myles White for the tip on discarding a single value to test the robustness of the median on a given data set.