Sampling with replacement problem - how many distinct items sampled?

  • 31-05-2013 7:32pm
    #1
    Registered Users, Registered Users 2 Posts: 962 ✭✭✭


    If I have 10 different numbered balls in an urn, and I carry out n samplings of 1 ball each, replacing the ball each time, how many distinct balls will I have sampled on average?

    For a situation with m=2 balls, I have a ready formula: 2 - (0.5 ^ (n-1))

    For m>2, I can't easily see how to use the multinomial formula to get at this.

    It's actually a real-life situation, though the balls are chromosomes, and the trials are next-gen sequencing reads. To calculate a particular statistic, I need to weight by the No. of distinct chromosomes surveyed.

    I've ended up simulating this, using a million repeats of 10 samplings and calculating the mean No. of distinct chromosomes. However, I'd like to know if there's a reasonably simple way to calculate it directly.
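
    A minimal sketch of the kind of simulation described above, in Python. The function name, parameters and default repeat count are illustrative assumptions, not the code actually used:

```python
import random

def simulate_mean_distinct(m, n, repeats=100_000, seed=0):
    """Estimate the mean number of distinct balls seen in n draws
    with replacement from an urn of m equally likely balls."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0
    for _ in range(repeats):
        # A set keeps only the distinct ball labels drawn in this repeat.
        seen = {rng.randrange(m) for _ in range(n)}
        total += len(seen)
    return total / repeats

if __name__ == "__main__":
    # 10 balls (chromosomes), 10 draws (reads)
    print(simulate_mean_distinct(10, 10))
```

    For m=2 the estimate can be checked against the formula 2 - (0.5 ^ (n-1)).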

    Thanks!


Comments

  • Registered Users, Registered Users 2 Posts: 338 ✭✭ray giraffe


    On average after n samplings,

    you have [latex] \displaystyle 10 \left( \frac{9}{10} \right)^n[/latex] unsampled, so [latex] \displaystyle 10- 10 \left( \frac{9}{10} \right)^n[/latex] sampled.
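
    As a quick numerical check, here is a small Python sketch of that formula, generalised from 10 balls to m balls. The function name is made up for illustration, and it assumes every ball is equally likely on each draw:

```python
def expected_distinct(m, n):
    """Expected number of distinct balls seen in n draws with
    replacement from m equally likely balls: m - m * ((m-1)/m)**n."""
    unsampled = m * ((m - 1) / m) ** n  # expected number never drawn
    return m - unsampled

if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(n, round(expected_distinct(10, n), 3))
```

    With m=2 this reduces to 2 - (0.5 ^ (n-1)), matching the formula in the first post.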


  • Registered Users, Registered Users 2 Posts: 962 ✭✭✭darjeeling


    Thanks - that's very helpful. With hindsight, it looks pretty self-evident, but that's hindsight for you. :)

    To break it down, we can say that when we sample just one ball out of m in the urn, the probability that a given ball is not the one chosen is: [latex] \textstyle \left( \frac{m-1}{m} \right)[/latex]

    Repeating n times (with replacement), the probability that a given ball is never sampled is: [latex] \textstyle \left( \frac{m-1}{m} \right)^n[/latex]

    and when we account for all of the balls, the expectation for the number unsampled is: [latex] \textstyle m \left( \frac{m-1}{m} \right)^n[/latex]
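
    Spelling out that last step (a sketch, with the indicator variables [latex] U_{i}[/latex] introduced here purely for illustration): let [latex] U_{i} = 1[/latex] if ball i is never sampled and 0 otherwise. Linearity of expectation applies even though the [latex] U_{i}[/latex] are not independent of one another, so

    [latex] \displaystyle E\left[ \sum\limits_{i=1}^m U_{i} \right] = \sum\limits_{i=1}^m P\left(U_{i} = 1\right) = m \left( \frac{m-1}{m} \right)^n[/latex]

    and the expected number of distinct balls sampled is [latex] \textstyle m - m \left( \frac{m-1}{m} \right)^n[/latex].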

    In the application I mentioned, we have pooled DNA from 5 individuals. The proportions are near enough equal, hence I can assume the probabilities of drawing (or sequencing) any one of the 10 chromosomes to be equal. If this weren't the case, then the expected number of distinct chromosomes sampled would, I take it, be:

    [latex] \displaystyle m - \sum\limits_{i=1}^m \left(1-p_{i}\right) ^n[/latex]

    where [latex] p_{i}[/latex] is the proportion of the i-th chromosome in the pool, m = 10, and n is the number of sequence reads.
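
    A short Python sketch of the unequal-proportions formula, for anyone wanting to try it out. The function name and the example proportions are hypothetical, and it assumes reads are drawn independently with fixed probabilities that sum to 1:

```python
import math

def expected_distinct_unequal(proportions, n):
    """Expected number of distinct items seen in n independent draws,
    where item i is drawn with probability proportions[i].
    Equivalent to m - sum((1 - p_i)**n)."""
    if not math.isclose(sum(proportions), 1.0):
        raise ValueError("proportions must sum to 1")
    # 1 - (1 - p)**n is the chance that item i is seen at least once.
    return sum(1 - (1 - p) ** n for p in proportions)

if __name__ == "__main__":
    equal = [0.1] * 10                    # the equal-proportions case
    skewed = [0.19] * 5 + [0.01] * 5      # made-up uneven pool
    print(expected_distinct_unequal(equal, 10))
    print(expected_distinct_unequal(skewed, 10))
```

    With equal proportions this gives the same value as m - m((m-1)/m)^n; an uneven pool gives a smaller value for the same number of reads.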
