STDDEV: Standing Sentinel on Your Data
Oracle Database implements a family of STDDEV functions for computing the standard deviation from the mean. If you think of the mean as beginning to paint a picture of the underlying data, then standard deviation is another brush-stroke toward a fuller picture that will help you draw meaning from the data you're studying.
SQL Statistic Series: You are reading part of an ongoing series covering the statistical functions available in the Oracle Database implementation of SQL.
A Measure of Dispersion
Standard deviation is termed a measure of dispersion, and Figure 1 illustrates what that means. The arithmetic mean in each case is 4. If you had only the mean to look at, you wouldn't "see" how the distribution of values was shifting underneath you. However, the standard deviation in Figure 1 becomes ever larger as values become more widely dispersed on either side of the mean. Mean gives you a center point. Standard deviation gives an indication of how tightly or broadly values are dispersed around that center.
Figure 1 shows a population standard deviation. There is also a sample standard deviation. I'll talk about the difference later. Right now, it's time to dive into an example. It'll be fun. I promise.
Example Data: Download and execute stddevExamples.sql to create the example tables and data used in this article.
Standard Deviation as Sentinel
One use for standard deviation is as a kind of sentinel, watching data and alerting you to unusual events. The mean can also serve as sentinel, but not all events in the data will be caught by the mean. Standard deviation will pick up on events that the mean lets slip by. And the reverse is also true! Two sentinels are better than one, and that's why standard deviation and mean are often deployed together.
Picture yourself a limnologist. You're looking at lake level data for Lake Superior, and you execute the following query to review the trend during the month of July 1995:
COLUMN "Date" FORMAT A10 COLUMN "Mean Meters" FORMAT 999.999 SELECT observation_date "Date", AVG(verified_meters) "Mean Meters" FROM lake_level WHERE station_id = 9099018 AND observation_date BETWEEN DATE '1995-07-01' AND DATE '1995-07-31' GROUP BY observation_date ORDER BY observation_date;
Let's walk through some aspects of this query. You're looking at lake level data from Meteorological Station 9099018. Equipment for that station is located on site at the Coast Guard Station in Marquette, Michigan. (There are numerous meterological stations around the lake; you've chosen the one in Marquette). The query looks at data between July 1 and July 31, and summarizes (i.e., groups) the results by day of the month.
Level measurements were made once per hour in 1995. The hourly intervals are recorded in a column namedobservation_time that is not referenced in the query. You apply the AVG function to return a single, mean water level per day. (It is accepted practice to report Great Lakes water levels as a daily mean).
Here are the results from the query:
... 09-JUL-95 183.360 10-JUL-95 183.360 11-JUL-95 183.370 Date Mean Meters ---------- ----------- 12-JUL-95 183.357 13-JUL-95 183.379 14-JUL-95 183.372 15-JUL-95 183.390 ...
These results are largely unremarkable. You do see a slight upward trend in the mean water level shown in theMean Meters column. This is expected behavior and is consistent throughout the month. The Great Lakes generally hit their peak seasonal water levels during the months of June through August. (See Lake-Level Variability and Water Availability in the Great Lakes, Page 5, and especially Paragraph 3 in the section on "Great Lakes Water Levels"). Lake Superior tends to peak in August.
Fortunately, you've read this article! Otherwise, you would walk away and miss a huge and noteworthy event. Knowing that some events escape the mean, but figuring on still finding nothing – because after all, it's just a lake – you yawn and modify the query to include standard deviation:
COLUMN "Std Dev" FORMAT 0.999 SELECT observation_date "Date", AVG(verified_meters) "Mean Meters", STDDEV_SAMP(verified_meters) "Std Dev" FROM lake_level WHERE station_id = 9099018 AND observation_date BETWEEN DATE '1995-07-01' AND DATE '1995-07-31' GROUP BY observation_date ORDER BY observation_date;
You're asking for the sample standard deviation (STDDEV_SAMP). That's appropriate in this case, because those hourly readings are merely samples taken at intervals along a continuum of change.
Here are the new results:
... 09-JUL-95 183.360 0.019 10-JUL-95 183.360 0.026 11-JUL-95 183.370 0.025 Date Mean Meters Std Dev ---------- ----------- ------- 12-JUL-95 183.357 0.041 13-JUL-95 183.379 0.096 14-JUL-95 183.372 0.130 15-JUL-95 183.390 0.076 ...
Wow! Now these numbers should make you sit up and take notice. Ok, perhaps not you, but a limnologist would find their pulse racing by now. Look at the standard deviation on the 14th. It has jumped by an order of magnitude. It looks like something began to happen possibly on the 12th, and peaked on the 14th.
Visions of being published in the Journal Limnology dance in your head as you pour a cup of coffee and decide to skip your lunch break while digging deeper into the data. The jump in deviation suggests some wild swings in lake level. You decide to also look at the minimum and maximum levels per day:
COLUMN "Daily Min" FORMAT 999.999 COLUMN "Daily Max" FORMAT 999.999 COLUMN "Difference" FORMAT 9.999 SELECT observation_date "Date", AVG(verified_meters) "Mean Meters", STDDEV_SAMP(verified_meters) "Std Dev", MIN(verified_meters) "Daily Min", MAX(verified_meters) "Daily Max", MAX(verified_meters) - MIN(verified_meters) "Difference" FROM lake_level WHERE station_id = 9099018 AND observation_date BETWEEN DATE '1995-07-01' AND DATE '1995-07-31' GROUP BY observation_date ORDER BY observation_date;
Clearly something unusual has occurred. Just look at these results, especially at the right-hand column showing the difference between each day's minimum and maximum water level:
... 09-JUL-95 183.360 0.019 183.327 183.400 .073 10-JUL-95 183.360 0.026 183.315 183.409 .094 11-JUL-95 183.370 0.025 183.309 183.416 .107 Date Mean Meters Std Dev Daily Min Daily Max Difference ---------- ----------- ------- --------- --------- ---------- 12-JUL-95 183.357 0.041 183.278 183.440 .162 13-JUL-95 183.379 0.096 183.248 183.681 .433 14-JUL-95 183.372 0.130 183.160 183.620 .460 15-JUL-95 183.390 0.076 183.269 183.571 .302 ...
July 13 and 14 show close to a half-meter difference between high and low water levels on each day. Have you just discovered tides on the Great Lakes? That'd mean a fellowship for sure. Maybe you'd even get your own chair.
Forget lunch. Let's get on with it. What about the other lakes? There are five altogether. Let's check them all. Following is a longish, but not really as complex as it looks query to return the sample standard deviation from the daily mean water level for each of the five Great Lakes:
COLUMN "Date" FORMAT A10 COLUMN "Superior" FORMAT 0.999 COLUMN "Huron" FORMAT 0.999 COLUMN "Michigan" FORMAT 0.999 COLUMN "Erie" FORMAT 0.999 COLUMN "Ontario" FORMAT 0.999 SELECT observation_date "Date", STDDEV_SAMP( CASE WHEN station_id = 9099018 THEN verified_meters END) "Superior", STDDEV_SAMP( CASE WHEN station_id = 9075035 THEN verified_meters END) "Huron", STDDEV_SAMP( CASE WHEN station_id = 9087031 THEN verified_meters END) "Michigan", STDDEV_SAMP( CASE WHEN station_id = 9063090 THEN verified_meters END) "Erie", STDDEV_SAMP( CASE WHEN station_id = 9052000 THEN verified_meters END) "Ontario" FROM lake_level WHERE observation_date BETWEEN DATE '1995-07-01' AND DATE '1995-07-31' GROUP BY observation_date ORDER BY observation_date;
This query returns a column for each lake. Each column's CASE expression selects for a specific reporting station on the lake for that column. Here are the results:
... 09-JUL-95 0.019 0.035 0.019 0.068 0.010 10-JUL-95 0.026 0.026 0.022 0.045 0.009 11-JUL-95 0.025 0.024 0.015 0.041 0.009 Date Superior Huron Michigan Erie Ontario ---------- -------- ------ -------- ------ ------- 12-JUL-95 0.041 0.028 0.028 0.069 0.015 13-JUL-95 0.096 0.117 0.030 0.118 0.020 14-JUL-95 0.130 0.096 0.025 0.101 0.022 15-JUL-95 0.076 0.119 0.033 0.050 0.035 ...
Whatever happened seems to have affected lakes Superior, Huron, and Erie. Those lakes show notable swings in standard deviation from their mean lake level on the dates July 13 and 14. As you contemplate this data, your stomach growling over the missed lunch, your colleague walks by and spots the July 1995 dates on your screen. Pausing with half-eaten sandwich, he recounts the tale of how he was caught out in his boat that month during a storm and almost died, and then back at the harbor almost lost his boat to a significant seiche event.
You missed lunch for a seiche.
But at least it was a good one! The seiche that is, not your lunch. July 1995 is notable for a series of damaging windstorms known as derechos. A series of four occurred that month in the short span from July 11 through 15. These are notable because the typical rate of occurrence in the area these events took place is one derecho every two to four years, whereas in July 1995 there were four in the span of four days. The July 12-13 derecho is known as "The Right Turn Derecho" due to its path crossing over Lake Superior heading eastward and then veering to the right and flowing down a line including eastern Michigan and the western part of Lake Huron, crossing Lake Erie, and finally ending more or less at the Ohio River.
The Right Turn Derecho produced significant storm surge and a seiche on lakes Superior, Huron, and Erie. That seiche is reflected in the example data, which I've downloaded for this article from the National Oceanic and Atmospheric Administration's (NOAA's) archive of Historic Great Lakes Water Level Data.
The seiche event generates a strong signal in the standard deviation from the mean water level. However, the seiche produces at best a very weak signal in the mean, if indeed there's any signal in the mean at all.
Note: Many seiche events are small in impact and go by unnoticed. Other times the water movement can be dramatic, even dangerous. The Lake Erie Seiche Disaster of 1844 saw a seiche that raised water level in Buffalo, NY some 22 feet, leaving wide swaths of the city under between two to eight feet of water. More recently in May 2012, a seiche near Cleveland overtook a breakwall and swept three children into the lake. Thankfully, the three were rescued by a nearby jet-skier.
Seiches represent a "sloshing" of water back and forth in an enclosed body of water such as a lake or a swimming pool. One would expect the highs and the lows from the back-and-forth water movement to cancel each other out, and one would expect those same highs and lows to disperse the hourly-readings further away from the mean in both directions. The example data bears out both these expectations.
The daily mean is a sentinel capturing rises and falls in lake level largely due to increased or decreased water volume. The standard deviation captures rises and falls due to water movement back and forth in the basin. Mean and standard deviation work together in this instance to signal significant events relating to Great Lakes water level.
Note: Researchers monitoring the Great Lakes probably do not use standard deviation to detect when seiches are occurring. They know ahead of time that seiches occur, and surely have more specific monitoring methods in place than to review standard deviation on past data. My point however, is that if you did not know ahead of time, if you truly had never heard of a seiche, then applying the standard deviation to your data would lead to your discovering something new that previously you had no knowledge about. Discovery sometimes involves noting that a particular statistical signal has changed, and then drilling down into the data to investigate further and learn about the event driving the change.
Population Versus Sample
Statisticians complicate our lives by computing standard deviation differently depending upon whether the data represents a sample, or whether it represents all possible values. Hence the terms sample standard deviation and population standard deviation. Figure 2 illustrates the concept of a sample versus the whole population. The six members of a Bible study group represent a complete population, whereas three selected members represent a sample from that larger group.
Oracle Database provides a function appropriate for each case – sample and population. These functions are:
Choose this function when your input data represents a sampling of all possible values from the set of data you are studying. For the mathematically inclined, this function applies what is known as Bessel's Correction. The sample standard deviation in Figure 2 is based upon the ages 46, 48, and 51, and is given as 2.5 years.
Choose this function when you are studying a data set and have all possible values at hand. Since you have all possible values, you can know the standard deviation with certainty; there is no need for any correction. The population standard deviation in Figure 2 encompasses all the group members, and is given as 2.4 years.
Avoid this function in Oracle Database unless you need: a) sample standard deviation, and b) the specific behavior of this function returning the mathematically incorrect value zero when your data set contains just one value. (The other two functions correctly return NULL in that instance; a deviation implies a difference, and a difference requires at least two values).
Note: Most other database brands support the same functions or their equivalent. Some, like MySQL, make STDDEVsynonymous with STDDEV_SAMP. Microsoft SQL Server spells the names using only one D, implementing STDEV andSTDEVP for sample and population respectively.
Making the decision between having an entire population or a sample hinges upon what question you are posing to the data. The same set of values can be a population sometimes, and a sample at other times. In Figure 3 you see again the Bible study group from Figure 2. Before, the group was the entire population. Now the group is merely a sample. The difference is the context. If you are asking questions about the Bible study group, then you have a population. If you are asking questions about the full congregation, then in that context the study group is but a sample.
Deciding whether you have a sample or a population isn't easy! Experienced statisticians having years of graduate-level study sometimes struggle to make the correct call.
The Great Lakes Seiche scenario provides an example of the subtleties involved in deciding between sample and population. I have all the hourly-readings. Why then did I choose the sample standard deviation? I chose it because I was studying lake level, and the lakes fluctuate continously. Other water levels surely existed between those hourly readings, and from that point of view the hourly readings represent a sampling from a continuum of all possible values. However! Had I been studying the equipment used in making those readings, then I might have had a case for considering those hourly readings as a population.
A further complication is that I really do not have all the hourly readings. As I worked with the data, I discovered a several-hour period during which the gauging station located at the Fermi Power Plant on Lake Erie failed to record lake level data. So if I did choose to treat my hourly data as a population, I would need to make an exception and treat the Lake Erie data as a sample.
Note: Figure 3 hints at the possibility that a given sample may give a skewed picture of the larger population. You don't know the ages of the teens listed in Figure 3, but presumably they are on the order of three decades younger than the adults, and thus the population standard deviation, if you could know it, woud be quite different from the reported sample standard deviation. Taking a random sampling from the congregation would be a more statistically sound approach, and then the Law of Large Numbers comes into play: as the sample size increases, the sample standard deviation will tend to converge toward the population standard deviation.
Statisticians talk of distribution. The term refers to the shape of the curve created by your data values as they are laid out along a number line in the manner shown in Figure 4. The so-called normal distribution is the common bell curve that many of us are familiar with from our school days, when teachers would manage student scores so as to fit the normal distribution. The practice was known as grading on the curve.
Figure 4 shows mean homework scores from the first week of an Algebra II class having four students. For example, the student Sean has the following five scores:
SELECT * FROM homework WHERE student_name = 'Sean' ORDER BY homework_date; STUDENT HOMEWORK_DATE HOMEWORK_SCORE ------- ------------------ -------------- Sean 07-JAN-13 50 Sean 08-JAN-13 40 Sean 09-JAN-13 67 Sean 10-JAN-13 33 Sean 11-JAN-13 60
Scores are averaged for each student across the entire week to generate the distribution in Figure 4:
SELECT student_name AS "Student", AVG(homework_score) AS "Mean Score" FROM homework GROUP BY student_name ORDER BY AVG(homework_score); Student Mean Score ------- ---------- Forest 37 Justin 50 Sean 50 Fayth 63
You may have picked up already from the previous sections that standard deviations are in the same units as their underlying data. The lake level data is in meters, and so is the standard deviation. The church member ages are in years, and so is their standard deviation. The beauty of this characteristic is that you can mark the number line at standard deviation intervals and slot your data into buckets corresponding to those intervals. You can write a query to bucketize the data via an algorithm such as the following:
Compute the difference between each student's weekly mean and the mean of all weekly means.
Take the absolute value of that difference. The result now is a distance to be spanned by some number of standard deviations.
Divide by the standard deviation rounded to two decimals.
Truncate the result to an integer, and add 1 to get a proper bucket number.
Following is a query implementing the above logic:
SELECT student_name AS "Student", AVG(homework_score) AS "Mean Score", STDDEV_POP(AVG(homework_score)) OVER () AS "Pop Dev", TRUNC( ABS( AVG(homework_score) - AVG(AVG(homework_score)) OVER ()) / ROUND(STDDEV_POP(AVG(homework_score)) OVER (), 2) ) + 1 AS "Bucket" FROM homework GROUP BY student_name ORDER BY AVG(homework_score);
You can see that each student is assigned to the proper standard deviation bucket depending upon how far the student's weekly mean score is from the mean of all the weekly means. Here are the results:
Student Mean Score Pop Dev Bucket ------- ---------- ---------- ---------- Forest 37 9.19238816 2 Justin 50 9.19238816 1 Sean 50 9.19238816 1 Fayth 63 9.19238816 2
The query does look a bit complicated due to the window function syntax, but it really isn't so bad once you get the gist of how it operates. If you like, you can express the query as a parent query against a subquery, and possibly find that version easier to read:
SELECT student_name AS "Student", weekly_mean AS "Mean Score", pop_dev AS "Pop Dev", TRUNC(ABS(weekly_mean - mean_all_means) / pop_dev) + 1 AS "Bucket" FROM ( SELECT student_name, AVG(homework_score) AS weekly_mean, ROUND( STDDEV_POP(AVG(homework_score)) OVER (), 2) AS pop_dev, AVG(AVG(homework_score)) OVER () AS mean_all_means FROM homework GROUP BY student_name ) ORDER BY weekly_mean;
The mean of the weekly mean scores is 50%. The population standard deviation is appropriate because we know all the scores for all the students, and it is 9.19%. The colored boxes on the numberline in Figure 4 denote increasing standard deviations from the mean. Two students fall into the orange bucket representing one standard deviation from the mean. The other two students fall into the green bucket representing two standard deviations from the mean.
The normal distribution is well-defined mathematically, and it's possible to specify precisely the percentage of values that will fall within a given number of standard deviations from the mean. For example, the second column in Table 1 shows that a range of three standard deviations on either side of the mean will encompass 99.73% the data. (Values in that column are from the so-called Empirical Rule). If your data is not normal, or you do not know for sure one way or the other, you can rely upon the percentage ranges given by Chebyshev's Inequality. Table 1 shows these also.
|Table 1. Percentage of data points within N standard deviations from the mean|
|Within...||Normal Distribution||Any Distribution, per Chebyshev|
|...one std. deviation||68.27%||At least zero percent|
|...two std. deviations||95.45%||At least 75%|
|...three std. deviations||99.73%||At least 88.89%|
It's common to presume that one's data is distributed normally. Don't fall into that trap. David M. Erceg-Hurn and Vikki M. Mirosevich, in their paper Modern Robust Statiscal Methods, talk about examining 440 data sets from the fields of psychology and education. None were normally distributed. Not one set. Do not rely upon normal distribution without verification that you have it. Do not, for example, apply the Empirical Rule unless you have verified normal data. Plotting your data points and visually inspecting to see whether they fall under the normal curve is one way to verify what you have.
Note: Thomas Pyzdek recounts the story of a time he was caught out by presuming normal distrubition in the December 1999 issue of Quality Digest. See his article in that issue entitled Non-Normal Distributions in the Real World, in which he makes the comment: "Normal distributions are not the norm." So be careful!
Skewed Distributions and Outliers
Distributions aren't always nicely bell-shaped. Sometimes they have outliers. Sometimes those outliers are best thought of as belonging in some other distribution than the one you are looking at.
Let's add a new student to our class. His name is Liam, and he's flat-out brilliant. Walking in on Monday morning, he swiftly catches up during class on all five of the previous week's homework assignments:
INSERT INTO homework VALUES ('Liam', TO_DATE('20130107', 'yyyymmdd'), 98); INSERT INTO homework VALUES ('Liam', TO_DATE('20130108', 'yyyymmdd'), 99); INSERT INTO homework VALUES ('Liam', TO_DATE('20130109', 'yyyymmdd'), 100); INSERT INTO homework VALUES ('Liam', TO_DATE('20130110', 'yyyymmdd'), 97); INSERT INTO homework VALUES ('Liam', TO_DATE('20130111', 'yyyymmdd'), 96);
Liam's mean score for the week is 98%. Figure 5 shows the affect on the distribution of scores in the class.
Figure 5 also shows Liam's effect on standard deviation. The upper row of colored boxes shows the previous standard deviation extended by three more buckets to meet Liam; he falls six standard deviations above the mean using the previous value of 9.19%. The bottom row of buckets shows the standard deviation recomputed for the new distribution. The new mean is 59.6%; the new deviation rounds to 21.89%.
Here's a version of the previous section's query that's been modified to return to the new standard deviation and mean:
SELECT student_name AS "Student", AVG(homework_score) AS "Mean Score", STDDEV_POP(AVG(homework_score)) OVER () AS "Pop Dev", AVG(AVG(homework_score)) OVER () AS "Mean of Means", TRUNC( ABS( AVG(homework_score) - AVG(AVG(homework_score)) OVER ()) / ROUND(STDDEV_POP(AVG(homework_score)) OVER (), 2) ) + 1 AS "Bucket" FROM homework GROUP BY student_name ORDER BY AVG(homework_score); Student Mean Score Pop Dev Mean of Means Bucket ------- ---------- ---------- ------------- ---------- Forest 37 20.8863592 59.6 2 Justin 50 20.8863592 59.6 1 Sean 50 20.8863592 59.6 1 Fayth 63 20.8863592 59.6 1 Liam 98 20.8863592 59.6 2
Liam wrecks the curve. You can spark a bit of lateral thinking by asking the question: "Does the outlier belong in a different distribution?" The answer in Liam's case is yes. Liam is better suited towards Advanced Placement Math (AP Math). So let's put him there, as illustrated in Figure 6.
The interesting question is to ask is what sort of query can be written to identify Liam as an outlier. There's no easy answer. If you need to identify outliers by executing a query, the best advice I can give is to seek out someone qualified in statistics to help you choose a method appropriate to your data and purpose. That may seem like a lame answer, but it's the correct answer.
Note: It is perfectly reasonable to plot your data as I did in Figure 5, and then to make a subjective decision as to which data points to treat as outliers. The various math-based methods for outlier detection – and there are many of them – are really attempts to codify heuristics that are essentially subjective.
Presumably there will be a few other students in AP Math with Liam. Their distribution will ideally be a more or less normal bell-curve, but it will be a bell-curve skewed higher than that of the Algebra II course. All students benefit from Liam's move to a more challenging course. Liam is saved from boredom through the opportunity to race ahead and learn to his full potential. The other students avoid being swamped by advanced material they can't handle. Everyone wins.
The Straight Story
Standard deviation has a variety of uses beyond the one I've highlighted in this article. It's widely used in quality control for example, and the article Non-Normal Distributions in the Real World that I referenced earlier steps you through some of the math involved in a real-world, quality-control effort.
A teacher might use standard deviation as a sort of quality control, as a check on student comprehension. Query to compute the standard deviation of the first week's homework scores for each student for example, and you'll find Justin's results to be all over the map. He scores very well, or very poorly, with wide swings back and forth, doing well one day and poorly the next. An astute teacher might see the resulting high deviation and be prompted to wonder what might be done to improve Justin's consistency.
Be wary of applying the empirical rule to your data without first checking to be sure your data follows the normal curve. The empirical rule is the one stating that 68% of values fall into the first standard deviation, and so forth. The rule applies only when the distribution is normal. Yet I frequently see the percentages mentioned without that caveat, which is a very important caveat. If you aren't certain of normal distribution, then rely instead on the Chebyshev Inequality, because Chebyshev is immune to distribution.
Respect the statistic. Using standard deviation as a sentinel as I did for lake level doesn't require a strong base of statistical knowledge. Other uses do require some knowledge. Do your homework. Read up on your specific application. Seek qualified help when you need it.
Acknowledgments: My own base of knowledge is weak, which is why I'm so grateful to Dr. Myron Hlynka, Stéphane Faroult, Gwen Shapira, and Dr. Jon Westfall for their kindness in reading this article and providing feedback to ensure its correctness and accuracy.