A Standard Deviation question.

Flukey · 11-10-2007 06:14PM #1

As I understand it, the standard deviation is basically the average distance of a range of numbers from its mean. So, to get it, you first get the average of those numbers. For each value in the range, you get its difference from the mean. Some will be positive and some will be negative. To remove the negatives, we square all of these figures. We then total these squares, divide by one less than the number of figures, and finally take the square root of the result, to reverse the fact that we squared to get rid of the negative differences.

That's all fine. Now, one query I have is this. Another way to get rid of the negatives would be to take the absolute difference, instead of the actual difference. So if the difference is -3, we note it as 3. This would remove the need to square all the figures and subsequently take a square root. If we then total these figures and divide by one less than the number of figures, we should have a standard deviation too. However, when you try this method, you get a slightly different figure.

Standard approach:
For example, we have the figures 2, 3, 4, 5, 6 as our set. The Mean is 4. The respective differences from the mean are -2, -1, 0, 1, 2. Square them to remove the negatives and we get 4, 1, 0, 1, 4. Total those and we get 10. Divide by one less than the amount of figures and we get 2.5. The square root of 2.5 is 1.581139.

Absolute approach:
Again we have the figures 2, 3, 4, 5, 6 as our set. The Mean is 4. The respective absolute differences from the mean are 2, 1, 0, 1, 2. No squaring needed, because they are all positive. We total those and we get 6. Divide by one less than the amount of figures and we get 1.5. There is no need for a square root as we did not square, so our result is 1.5 now.

So from the Standard approach, we get 1.581139 as our standard deviation and from the Absolute approach we get 1.5 as our standard deviation. Which is the more accurate figure? In both we are getting our original definition as the average distance of a set of values from their mean. The Absolute approach would seem to make more sense and be more accurate. So why is the other method used?
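The two calculations above can be reproduced with a short sketch. The function names `sample_sd` and `abs_dev_n_minus_1` are made up for illustration. The second one deliberately divides by n-1, as in the post, which is not how an absolute-deviation statistic is usually defined.

```python
import math

def sample_sd(values):
    """Square the deviations, divide by n-1, then take the square root."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def abs_dev_n_minus_1(values):
    """Total the absolute deviations and divide by n-1 (the post's 'Absolute approach')."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / (n - 1)

data = [2, 3, 4, 5, 6]
print(round(sample_sd(data), 6))  # 1.581139
print(abs_dev_n_minus_1(data))    # 1.5
```

The two functions differ only in how each deviation is made non-negative, so the gap between 1.581139 and 1.5 comes from the squaring and square root, not from the divisor.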

hermann · 11-10-2007 08:17PM

Very strange: I have never been on this mathematics forum before, and I just came here now intending to ask EXACTLY this question, and it's the top thread. Madness. Anyway, I too would be interested if somebody could explain this.

Yakuza · 11-10-2007 11:28PM

Strictly speaking, the variance (the s.d. squared) of a whole population is computed using n as the divisor, not n-1. The n-1 is used when you want an unbiased estimate of the population variance, i.e. when the data is only a sample from a population, not the entire population. In your case, your numbers were the entire population, so you would use n, not n-1. (As n gets big, it makes almost no difference to the value in any case.)

As to your original question, why is it squared... that I'm not so sure about. It's been a while since I studied stats formally (in 1999 as a CS undergraduate and even earlier as an actuarial student), so I'm grasping at straws here, but perhaps the values are squared to emphasize the influence of outliers in the data set.

In the original data set (2,3,4,5,6) the variance is 2 and the s.d. is 1.414 (using 5 as the denominator).

If the data set were 2,3,4,5,8 instead, the variance goes up to 4.24 and the s.d. to 2.059.

Using the absolute difference method (dividing by 5, not 4), the figures are 1.2 for the original set and 1.68 for the new one, not quite as noticeable a change.

Anyhoo, that's not any formal mathematical treatment, it's just what occurred to me as I read your post.
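The outlier comparison in this post can be checked with a small sketch. The helper names `population_sd` and `mean_abs_dev` are hypothetical, and both divide by n (5 here), as the post does.

```python
import math

def population_sd(values):
    """Root of the mean squared deviation, with n as the divisor."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

def mean_abs_dev(values):
    """Mean of the absolute deviations, with n as the divisor."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

original = [2, 3, 4, 5, 6]
with_outlier = [2, 3, 4, 5, 8]

for name, data in [("original", original), ("with outlier", with_outlier)]:
    # Rounded to 3 places to match the figures quoted in the post.
    print(name, round(population_sd(data), 3), round(mean_abs_dev(data), 3))
```

Rounded to three places, this should give 1.414 and 1.2 for the original set and 2.059 and 1.68 for the set with the 8 in it. The s.d. grows by a factor of about 1.46 and the absolute figure by 1.4, which fits the suggestion that squaring gives outliers more weight.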

Flukey · 12-10-2007 12:41AM

The squaring is done to make sure we have positive figures to work with on the next step of the procedure. So if it is -2 or +2, squaring either will give us 4. We get the square root later, to reverse that effect.

2Scoops · 12-10-2007 12:47AM

I would imagine it's because variance and the standard deviation relate to the normal distribution and hence are more informative and useful in inferential statistics.

The absolute approach, while maybe being a more intuitively understandable figure for variation around the mean, does not approximate the normal curve and so is otherwise useless.

Edit:
The signs of the deviations are important, so if you just ignore them, as in the absolute approach, you lose information about the data. Squaring and then taking the square root is a more algebraically acceptable method to eliminate the negative signs.

MathsManiac · 12-10-2007 02:06AM

The second statistic you describe is referred to as the "mean absolute deviation". It's not used very often because, despite its intuitive appeal, its mathematical properties are not as nice as those of the standard deviation, making it much more difficult to work with in the long run. This is more or less why the standard deviation is the most commonly used measure of dispersion.

It should be noted that when you said that the standard deviation is "basically the average distance of a range of numbers from its mean", this is an OK description of what the standard deviation is about (provided you're loosely interpreting the word "average"), but it's not a definition. The standard deviation is defined to be the number you get by doing the requisite calculation, so you don't have the discretion to do a different calculation and call it the standard deviation (at least not if you want to communicate with the rest of the world about it!). So, your alternative can't be described as a different approach to finding the standard deviation, but rather as a different approach to measuring the spread of the data set. It's a perfectly legitimate one, but not very common. Google "mean absolute deviation" for more.
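For reference, the two statistics can be written side by side. The symbols here are an assumed notation, not something from the thread: x_i for the values, \bar{x} for their mean, n for how many there are. Note that the mean absolute deviation is normally defined with n as the divisor, whereas the calculation in the first post divided by n-1.

```latex
% Sample standard deviation (divisor n-1, as in the first post)
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}

% Mean absolute deviation about the mean (usual definition, divisor n)
\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert
```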

Hope this helps.

Michael Collins · 12-10-2007 08:03PM

2Scoops wrote: »

I would imagine it's because variance and the standard deviation relate to the normal distribution and hence are more informative and useful in inferential statistics.

I think that's exactly it. As MathsManiac said above, there are plenty of methods to describe the "average" deviation from the mean, but the standard deviation that we all know and love, which is defined using the squares of the differences, is by far the most commonly occurring in statistics - it pops up all the time as an important characteristic figure, along with its close cousin variance - that is, the square of the standard deviation.

LeixlipRed · 14-10-2007 05:00PM

The reason you square is because if you don't, the sum of the differences will always be zero. For example, the mean of 1,2,3 is 2. Apply the SD formula without squaring and you'll end up with sqrt{[(1-2) + (2-2) + (3-2)]/2}, which is zero.
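A quick sketch of this point, with an illustrative helper name `signed_deviation_sum`: the raw (signed) differences from the mean cancel out, so something has to be done to them before averaging.

```python
def signed_deviation_sum(values):
    """Sum of (value - mean) with the signs left in place."""
    mean = sum(values) / len(values)
    return sum(x - mean for x in values)

print(signed_deviation_sum([1, 2, 3]))        # 0.0
print(signed_deviation_sum([2, 3, 4, 5, 6]))  # 0.0
```

For data whose mean is not exactly representable in floating point, the result may be a tiny non-zero number rather than exactly 0.0, so a comparison with `math.isclose(..., abs_tol=1e-9)` is safer than `== 0`.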

Thomas_S_Hunterson · 14-10-2007 05:01PM

LeixlipRed wrote: »

The reason you square is because if you don't, the sum of the differences will always be zero. For example, the mean of 1,2,3 is 2. Apply the SD formula without squaring and you'll end up with sqrt{[(1-2) + (2-2) + (3-2)]/2}, which is zero.

If you read the original post, you'll find that that wasn't the query at all.

LeixlipRed · 14-10-2007 10:18PM

Yes, I know that. But the assumption made by a few people was that squaring was to "correct" the negatives, which isn't the case.
