Advertisement
If you have a new account but are having problems posting or verifying your account, please email us on hello@boards.ie for help. Thanks :)
Hello all! Please ensure that you are posting a new thread or question in the appropriate forum. The Feedback forum is overwhelmed with questions that are having to be moved elsewhere. If you need help to verify your account contact hello@boards.ie

Variance - sample v population

  • 02-03-2012 3:22pm
    #1
    Closed Accounts Posts: 910 ✭✭✭


    I'm trying to get fully comfortable with sample v population in both variance & standard deviation.

    I'm pulling data from a database, showing a list of nightly processes (around 200-300) and their execution times in minutes, going back the last 30 days. Some of the processes have a fairly consistent running time (i.e. to chart them results in a relatively level line), others vary wildly (where plotting them looks like a SIN curve)

    I want to get the VARIANCE & STD_DEV to track which jobs are not consistent, so I can find out why and take action where necessary.

    In pulling the data from the database, I am able to run a STD_DEV function, but there are two separate functions STDDEV_SAMP and STDDEV_POP.

    Which should I use?

    The confusing thing for me, is that the last 30 days of data - for the purpose of generating a "current" average is all I am interested in, i.e. to me it's the population, whereas in fact it is only a sample.....

    To add further, some jobs have running times between 1-2 minutes and would results in a small VARIANCE & STD_DEV, others run between 60-80 minutes - so I need to level these out by making the VARIANCE as a % of the AVERAGE - right?


Comments

  • Closed Accounts Posts: 910 ✭✭✭Jagera


    Just bumping this one.. If there's anything I need to clarify let me know. I have probably not worded the question too well.


  • Moderators, Science, Health & Environment Moderators Posts: 1,846 Mod ✭✭✭✭Michael Collins


    It probably isn't going to make much of a difference in practise if your sample size is anyway large, and particularly not for your application where you only want to see relative changes.

    If you are not trying to estimate the variance (or standard deviation) of some larger population from a smaller sample, then you may as well use the population formula. This seems to be the case from what you have said.

    The sample standard deviation is always a somewhat biased esitimator for the population standard deviation anyway.

    Incidentally, the only difference between the two (assuming the functions you mention follow the standard convention) is a reduction of one in the denominator.

    The sample variance formula used to estimate the population variance is

    [latex] \displaystyle \hat{\sigma^2} = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\mu)[/latex]

    whereas the actual population variance is

    [latex] \displaystyle \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i-\mu).[/latex]


  • Registered Users Posts: 1,163 ✭✭✭hivizman


    Arithmetically, there is very little difference between the sample standard deviation and the population standard deviation. If you have sets of 30 observations, the "sample standard deviation" (that is, treating the 30 observations as a sample from a larger population) will be 1.017 times the population standard deviation (that is, treating the 30 observations as a population). In general, the sample standard deviation is [N/(N-1)]^0.5 times the population standard deviation. I've put inverted commas around "sample standard deviation" because this is different from the "standard deviation of the sample" (the formula for which is the same as for the "population standard deviation"). The "sample standard deviation" s is a less biased estimate of the population standard deviation σ than the "standard deviation of the sample" [strictly speaking, the square of the sample standard deviation is an unbiased estimate of the population variance].

    So with your data, I don't think it really matters which measure you use, but my own inclination would be to use the population standard deviation.

    In Excel, the sample standard deviation is given by the function STDEV and the population standard deviation is given by the function STDEVP.

    Given your data, it may be useful to caclulate the coefficient of variation. This is defined as σ/μ - the population standard deviation divided by the population mean. If you calculate this for each set of observations, you can compare how dispersed the observations are relative to the mean for that set of observations. Treat the run times for each process as a separate set of observations - I don't think it is useful to aggregate observations for different processes given the wide range of variation you suggest.

    By the way, I see that Michael Collins has also answered this question, including the formulae for the two forms of standard deviation.


  • Closed Accounts Posts: 910 ✭✭✭Jagera


    Thanks for the replies.

    I will add some further info on what I am doing.

    I'm trying to find real outliers in the processing times. As suggested above, I am doing the analysis separately for each "set" as you called it - which to me is a "job". I am actually further segregating it down to day of the week.

    The reason being, the jobs process data, and often the day of the week is important because Monday may involve a bulk load, whereas the other days are quiet.

    The drawback is, going back 30 days, only gives a maximum of 5 occurences of each day of the week. Not good numbers for generating averages. Yet - going back further is not an option because the quantities of data change over time, so I want to use the last 30 days as my basis.

    My main aim is to find outliers - and eliminate them from calculating the AVG and STD DEV. Here is an example:

    Job |Day of Week |Run Time |Mean|Variation|Variation %|Std Dev|CV
    Job A|Sunday|21|22.25|-1.25|-6%|1.26|6%
    Job A|Sunday|22|22.25|-0.25|-1%|1.26|6%
    Job A|Sunday|22|22.25|-0.25|-%|1.26|6%
    Job A|Sunday|24|22.25|1.75|7%|1.26|6%
    Job A|Monday|3|6.5|-3.5|-117%|6.35|98%
    Job A|Monday|4|6.5|-2.5|-63%|6.35|98%
    Job A|Monday|3|6.5|-3.5|-116%|6.35|98%
    Job A|Monday|16|6.5|9.5|59%|6.35|98%


    Based on the above, I want to be able to identify the 16 minute job on Monday, and eliminate it from calculating the Avg. The reason for that, the running time is, under normal circumstances, around 3-4 minutes. And if I'm comparing to the average, they are all coming in UNDER the average, because the 16 minute one pushed up the average.

    So the three other jobs running on Monday came in as 60-115% UNDER average, which is not what I want to see.

    One other method I have been using to identify is to calculate the PERCENTILE. However, with only 4 rows, is doesn't work too well..

    I'm wondering if this can be done at all. Using any combination of AVG, STD_DEV, CV or anything else...?


  • Registered Users Posts: 1,163 ✭✭✭hivizman


    What you want to do is quite difficult given that you have only four or five observations for each run/day. If you had bigger sets of observations, you could identify outliers by determining confidence intervals around the mean, but with so few observations you may be better using the coefficient of variance to identify those cases where there are likely to be outliers. From the two examples you give, the coefficient of variation for Job A on Sunday is much smaller than the coefficient of variation for Job A on Monday, so it is more likely that you will have an outlier when the coefficient of variation is relatively high.

    So a possible approach would be to sort your runs in order of coefficient of variation, with the highest coefficient at the top of the list, and then look down the list. Decide how far down you need to go to eliminate what you consider to be outliers, and use the lowest coefficient of variation where you are identifying outliers (or perhaps the next round number below that - so if your lowest run with outliers has a coefficient of variation of 53%, you might choose 50%) as your criterion for future runs. So any future run whose coefficient of variation is below your criterion would not be inspected for outliers while any run whose coefficient of variation is above your criterion would be checked visually.

    However, it depends what you mean by "outlier" - it's possible to get a high coefficient of variation without just one outlier. If your run times were 1, 1, 12 and 12, you get a mean of 6.5 and a standard deviation (I note that you used the sample standard deviation) of 6.35, so a coefficient of variation of 98%, just as you did in your example, but there isn't a single outlier here (would you consider all four observations as outliers?).

    This seems to be a problem where techniques from the area of statistics known as statistical process control may be applicable, but this takes me out of my comfort zone, so perhaps someone else can advise you further.


  • Advertisement
  • Closed Accounts Posts: 910 ✭✭✭Jagera


    You're right, with such a small set of observations it is going to be difficult to use any sort of statistics. And again right, how can I determine an outlier, with 4 choices - who's to say the 3 aren't the wrong ones.


  • Registered Users Posts: 1,311 ✭✭✭Procasinator


    Could you record or determine the "quantities of data" or input size of the jobs?


  • Closed Accounts Posts: 910 ✭✭✭Jagera


    Could you record or determine the "quantities of data" or input size of the jobs?

    I've been pushing to get that information included from the start. It certainly would give another very helpful measure. For now, I don't have it.


Advertisement