
Normality of distribution of a small set of numbers

  • 26-06-2009 10:50pm
    #1
    Closed Accounts Posts: 2,736 ✭✭✭


    Hi.
    I want to find out whether a small set of numbers is distributed normally.

    Say you have a set of numbers, I dunno, something like 8, 12, 25, 41.

    Now I've used a few stats programs (gbstat, minitab) to calculate the normality of distribution
    for various small sets of numbers, and every time (using the Shapiro-Wilk test of normality) the numbers fall almost on a straight line on the normal probability plot (the percentile graph whose proper name I couldn't think of), which AFAIK denotes normality.

    But intuitively you'd think a small set of numbers is bound to NOT have a normal distribution.

    Or is it the case, like in the example above (8, 12, 25, 41), that as long as each number occurs only once there will be no significant skewing (so it actually does have a normal distribution, as the stats programs suggest),
    whereas if you had something like 8, 8, 8, 25 instead there would be skewing and the spread wouldn't be normal (because 8 occurs three times)?

    I'm confused; I'd appreciate any help and an explanation of this.
    Thanks.


Comments

  • Registered Users, Registered Users 2 Posts: 2,481 ✭✭✭Fremen


    I don't know all that much about stats (though I work in probability theory), but from reading the Wikipedia article, I think it probably makes sense that you can't reject the hypothesis that the data are normally distributed.

    From what I can tell, the SW test tells you whether the data could have been produced by sampling from a normal distribution with mean m and variance v, say. If you only have a few samples, you have enough freedom in your choice of m and v to fit the data. For one data point, you can always find a normal distribution which could plausibly have produced it (set m equal to the data point). Same with two data points.

    You're on the right track with what you said about skewness. I would give good odds that if you put, say, 2, 3.1, 2.4, 1.1 and 1,000,000,000 into the test, it would fail.
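
    For illustration, a rough sketch in Python (using SciPy's shapiro rather than the stats packages mentioned above; the outlier set is just my made-up example):

        # Shapiro-Wilk on a tiny sample vs. one with a huge outlier.
        from scipy import stats

        small = [8, 12, 25, 41]                      # the OP's numbers
        outlier = [2, 3.1, 2.4, 1.1, 1_000_000_000]  # my suggested set

        for data in (small, outlier):
            w, p = stats.shapiro(data)
            # A large p-value means you can't reject normality; it does NOT
            # prove the data are normal - with n=4 the test has little power.
            print(data, "W =", round(w, 3), "p =", round(p, 4))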


  • Registered Users, Registered Users 2 Posts: 195 ✭✭caffrey


    Let me ask a question:

    Is it even possible to test the normality of such a small set of numbers? The properties one might use to assess this are the arithmetic mean and the standard deviation (SD); in the case of the mean, you add all of the numbers and divide by the number of values, N.

    Now let's say you do a z-score test on each number in the set, where:
    z = (number - mean) / SD

    Then, reading from a Z-chart, found in log tables or here:
    http://www.pinkmonkey.com/studyguides/subjects/stats/appendix/s58.gif

    you can find the percentile that each value falls into.

    So the question I ask is this: can you decide that a small set of numbers is normally distributed if dividing by N (the number of values) rather than N-1 makes a significant difference?

    In your example:

    N = 4, giving mean = (8 + 12 + 25 + 41)/4

    Let's say you change N to N-1 or N+1; this should not change your estimate of the percentile each value falls into by a significant amount.

    In summary, I would say that it is not possible to conclude that a set of numbers that size is normally distributed. Can you take more samples?
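
    As a rough sketch of the above in Python (the normal CDF standing in for the printed Z-chart):

        # z-scores and percentiles for the example set.
        import statistics
        from scipy.stats import norm

        data = [8, 12, 25, 41]
        mean = statistics.mean(data)    # (8 + 12 + 25 + 41) / 4 = 21.5
        sd_n1 = statistics.stdev(data)  # sample SD, divides by N - 1
        sd_n = statistics.pstdev(data)  # population SD, divides by N
        # With N = 4 these two SDs differ noticeably - the N vs N-1 point.

        for x in data:
            z = (x - mean) / sd_n1
            pct = norm.cdf(z) * 100     # percentile under a normal model
            print(f"x={x:>2}  z={z:+.2f}  percentile={pct:.1f}")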


  • Registered Users, Registered Users 2 Posts: 195 ✭✭caffrey


    By the way, I am not an expert, and I'm not saying you should try to prove or disprove normality in the manner stated above. I'm really asking whether you think it is possible to draw any conclusions about the normality of such "small" sets, given the N/N-1 argument above.


  • Closed Accounts Posts: 2,736 ✭✭✭tech77


    caffrey wrote: »
    In summary, I would say that it is not possible to conclude that a set of numbers that size is normally distributed. Can you take more samples?

    No, those are the only numbers I have.
    Basically I need to compare four groups with 3 or 4 numbers in each group.
    I'm just wondering if I can use ANOVA (parametric stats) to do this.
    Obviously, to use parametric stats I first need to find out whether the numbers within each group are normally distributed; hence my query.

    If their normality is untestable, does that mean I should use non-parametric stats?

    Dunno why the stats programs tell me they are normally distributed.

    I'll consider your answer a bit more, thanks; I think I saw something similar online.
    Any other ideas would be great. :)
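
    For what it's worth, both routes are easy to try in Python; the group values below are placeholders, not my real counts:

        # Parametric one-way ANOVA vs. the rank-based Kruskal-Wallis test.
        from scipy import stats

        # Hypothetical stand-ins for the four treatment groups (~3 counts each).
        g1, g2, g3, g4 = [8, 12, 25], [10, 14, 9], [30, 41, 35], [7, 11, 13]

        f, p_anova = stats.f_oneway(g1, g2, g3, g4)  # assumes normality, equal variances
        h, p_kw = stats.kruskal(g1, g2, g3, g4)      # no normality assumption

        print("ANOVA p =", round(p_anova, 4), " Kruskal-Wallis p =", round(p_kw, 4))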


  • Registered Users, Registered Users 2 Posts: 1,595 ✭✭✭MathsManiac


    Is the experiment design such that it is reasonable to assume that the underlying distributions from which the samples are drawn are normal, and that they have a common variance?

    The samples are so small that I think it's unlikely that there will be sufficient evidence within the data alone to reject a normality hypothesis. But the design of the experiment may tell you that the data aren't normal: for example, if the data are ordinal, or discrete, or if they arise from a Poisson process, such as the occurrence of accidents.

    If there are no obvious reasons why the data shouldn't be normal, and if the diagnostic plots (normal probability plot and residuals vs fitted values) don't indicate any problem, then you should be ok.
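
    For instance, a minimal sketch of the first of those plots in Python (assuming the data sit in a plain list):

        # Normal probability plot: points near the straight line suggest
        # the normal model is reasonable.
        import matplotlib.pyplot as plt
        from scipy import stats

        data = [8, 12, 25, 41]  # the example numbers from the first post
        stats.probplot(data, dist="norm", plot=plt)
        plt.title("Normal probability plot")
        plt.show()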


  • Closed Accounts Posts: 2,736 ✭✭✭tech77


    Thanks.
    The numbers are actually cell counts in pieces of tissue following different treatments applied to those pieces of tissue.

    They aren't taken from a larger population. What I have is 4 different treatment groups with a few counts (~3 pieces of tissue) in each treatment group.

    But I guess a putative larger population from which such counts could be taken would be normal, wouldn't it?

    What kind of variable is a cell count? It's not continuous; is it discrete, so?

    BTW, I also need to compare the behaviour of animals in each treatment group, i.e. percentage success at a given task, distance travelled and time spent engaged in a certain behaviour.
    Any idea whether those variables would be normally distributed?
    The time and distance are continuous variables anyway, so would they be normally distributed?

    Again, I'm not sampling from a larger population; I'm using values from all animals in my population (does that make sense, sorry??).
    Any ideas? Thanks for your help.


  • Registered Users, Registered Users 2 Posts: 1,595 ✭✭✭MathsManiac


    A cell-count is a discrete variable, but unless the numbers are very small, it can still be close enough to being normal. (For example, the number of heads you get if you toss a fair coin 50 times is a discrete variable but can still be closely approximated by a normal distribution.)
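
    (A quick check of that coin-toss claim in Python, comparing exact binomial probabilities with the normal approximation:)

        # Binomial(50, 0.5) vs. the matching normal: mean np = 25,
        # sd sqrt(np(1-p)) ~ 3.54. Continuity correction of 0.5 applied.
        from scipy import stats

        n, p = 50, 0.5
        binom = stats.binom(n, p)
        normal = stats.norm(n * p, (n * p * (1 - p)) ** 0.5)

        for k in (20, 25, 30):
            # P(X <= k): exact binomial vs. normal approximation
            print(k, round(binom.cdf(k), 4), round(normal.cdf(k + 0.5), 4))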

    For ANOVA, the distributions don't need to be completely normal - they just have to be reasonably close to being normal. If you imagine taking lots of samples that had been given the same treatment and measuring the cell count for each one, do you think it likely that you would get a reasonably mound-shaped distribution that is not unduly skewed? (My guess from what you describe is that that probably would be the case; at least, I don't see any obvious reason why it should be a different shape.)

    You should also note that ANOVA assumes that the different treatment groups are from distributions with equal variances. That is, there is an underlying assumption that any treatment effect serves to change the mean of the distribution without unduly affecting its spread.
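
    That equal-variance assumption can also be checked directly; a sketch in Python, again with placeholder group values:

        # Levene's test for equal variances across groups; it is fairly
        # robust to non-normality. A small p-value is evidence the group
        # variances differ, which would undermine the standard ANOVA model.
        from scipy import stats

        g1, g2, g3, g4 = [8, 12, 25], [10, 14, 9], [30, 41, 35], [7, 11, 13]  # hypothetical
        stat, p = stats.levene(g1, g2, g3, g4)
        print("Levene p =", round(p, 4))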


  • Closed Accounts Posts: 2,736 ✭✭✭tech77


    MathsManiac wrote: »
    For ANOVA, the distributions don't need to be completely normal - they just have to be reasonably close to being normal. [...] You should also note that ANOVA assumes that the different treatment groups are from distributions with equal variances.


    Thanks.
    TBH, MathsManiac, when I saw you'd replied to the thread, I thought: yes, problem solved :)

    Still, it's obviously not as simple as I thought.

    Yeah, like yourself, I can't see an obvious reason why cell counts (after a given treatment) wouldn't have a normal distribution.

    I'm curious, though, as to why a small number of counts would change something whose data spread is inherently normal into something non-normal.
    OK, the smaller data set will probably be more skewed in itself than a larger set of data, but does its real nature then become non-normal?
    Does the number of counts make all the difference?

    Or is it that normality just can't be tested with those low counts, and you can't do ANOVA on data of undetermined normality?

    And is there a threshold number of counts where you can safely say, yeah, they are normally distributed: 10, 15, 30?
    How would you find this threshold?
    Graph it?

    Generally, though, what is it in a set of numbers that determines whether its distribution will be normal or skewed?
    Stuff like height and weight, when you look at it, is normally distributed; why is that?
    You talked about discrete vs continuous variables.
    Are continuous variables more likely to be normal?
    Any other features of a data set's nature that determine normality vs skewness?

    Can you give me a definite example of a skewed/non-parametric distribution?

    Also, any idea what kind of distribution "percentage success at a task" and time engaged in a given behaviour might have?
    If not, any idea how I'd go about finding this out?
    Just graph everything, maybe? :)

    Sorry for all the questions.
    But I'm asking because I get the impression you've a genuinely good understanding of maths concepts, so hopefully you can help me out.


  • Registered Users, Registered Users 2 Posts: 1,595 ✭✭✭MathsManiac


    Flattery, eh? I'm not a professional statistician by any means, so if anyone else feels I'm giving a bum steer here, then please jump in!

    When I said "unless the numbers are very small", I didn't mean that there are few data - I meant the actual counts are small numbers. e.g. suppose I'm counting something that's mostly 0, occasionally 1, and very rarely 2. This is not going to be an approximately normal distribution. But if I'm counting something that's varying between, say, 40 and 60, and mostly bunched around 50, then that could be approximately norma, even though it's discrete.

    Small data sets don't change a thing from being normal to non-normal. But statistics in practice is almost always concerned with drawing inferences about a population from a sample. The reality is that you really have a sample here, even if it appears you have a whole population. Take, for example, the data for the three animals from the first treatment group. Even though there are only three animals, you can conceive of this as a sample from the theoretical population of all conceivable animals of this breed, under these conditions, given this treatment.

    Why do you need to imagine such a population? Well, think about what your experiment is hoping to achieve. Your point is not to determine whether these particular three animals are any healthier than the other sets of three animals. You're actually hoping to make an inference about whether, in general, subjecting animals like these to this treatment has a different effect from subjecting them to other treatments.

    The purpose of ANOVA is to allow you to make inferences about whether the sample data provide evidence that there is a real "treatment effect". That is, the question is: if we observe certain differences between the different groups, are we just seeing something that can be explained as random fluctuations, or are we seeing something that's unlikely to have been produced by random fluctuations, and is therefore due to a real effect of the treatments?

    In order to decide what sorts of data might reasonably be seen as a result of random fluctuations, there has to be an underlying assumed mathematical model of the distribution(s) from which the data are drawn. The validity of the ANOVA techniques is built on such a mathematical model, and it involves assuming the data are drawn from underlying distributions that are approximately normal and have approximately equal variances.

    If the experiment is such that the data are genuinely drawn from populations that satisfy the underlying assumptions, then the conclusions from the analysis should be sound. But if it turns out that the underlying distributions don't satisfy the assumptions after all, one could be led to draw incorrect conclusions.

    I'm not familiar enough with the details to know whether there are sample-size thresholds of the kind you refer to. But I've seen the techniques applied in books to data sets that are not much bigger than the one you're describing. In such cases, there seems to be more of an emphasis on careful consideration of the experiment design and the plausibility that the type of data being collected are likely to satisfy the assumptions (while still analysing the plots too).

    Examples of skewed distributions? Well the Poisson distribution is skewed. Distribution of incomes is skewed. The binomial distribution is skewed if p is very small. You can look those up if you're not familiar with them.
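
    (If you want to see the skewness numerically, a sketch in Python, using a lognormal as a standard stand-in for incomes:)

        # Sample skewness of the examples above: 0 would be symmetric,
        # positive means a long right tail.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(1)
        samples = {
            "Poisson(1)": rng.poisson(1, 10_000),
            "lognormal (income-like)": rng.lognormal(0.0, 1.0, 10_000),
            "Binomial(100, p=0.01)": rng.binomial(100, 0.01, 10_000),
        }
        for name, x in samples.items():
            print(name, "skewness =", round(stats.skew(x), 2))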

    You don't describe distributions as being "non-parametric". That term is used to describe certain statistical methods that are sometimes also referred to as "distribution free", because they are based on mathematical models with fewer assumptions about the nature of the underlying distribution. See http://en.wikipedia.org/wiki/Non-parametric_statistics. Because these methods make fewer assumptions, they are more widely applicable (more "robust"). The downside is that the non-parametric tests frequently have less "power" than the corresponding parametric ones.

    Hope this helps.

