Sampling with replacement problem - how many distinct items sampled?

  • 31-05-2013 08:32PM
    #1
    Registered Users, Registered Users 2 Posts: 962 ✭✭✭


    If I have 10 differently numbered balls in an urn, and I carry out n samplings of 1 ball each, replacing the ball each time, how many distinct balls will I have sampled on average?

    For a situation with m=2 balls, I have a ready formula: 2 - (0.5 ^ (n-1))

    For m>2, I can't easily see how to use the multinomial formula to get at this.

    It's actually a real-life situation, though the balls are chromosomes, and the trials are next-gen sequencing reads. To calculate a particular statistic, I need to weight by the No. of distinct chromosomes surveyed.

    I've ended up simulating this, using a million repeats of 10 samplings and calculating the mean No. of distinct chromosomes. However, I'd like to know if there's a reasonably simple way to calculate directly.
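    A minimal Python sketch of the kind of simulation described above. This is not the poster's actual code: the function name, the seed and the default repeat count are illustrative assumptions, and each ball is assumed equally likely on every draw.

```python
import random


def simulate_mean_distinct(m=10, n=10, repeats=100_000, seed=0):
    """Estimate the mean number of distinct balls seen in n draws
    with replacement from an urn of m equally likely balls."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    total = 0
    for _ in range(repeats):
        # A set keeps only the distinct ball labels drawn in this repeat.
        seen = {rng.randrange(m) for _ in range(n)}
        total += len(seen)
    return total / repeats
```

    Calling simulate_mean_distinct(10, 10, repeats=1_000_000) would mirror the million repeats of 10 samplings mentioned above, at the cost of a longer run time.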

    Thanks!


Comments

  • Registered Users, Registered Users 2 Posts: 338 ✭✭ ray giraffe


    On average after n samplings,

    you have [latex] \displaystyle 10 \left( \frac{9}{10} \right)^n[/latex] unsampled, so [latex] \displaystyle 10- 10 \left( \frac{9}{10} \right)^n[/latex] sampled.


  • Registered Users, Registered Users 2 Posts: 962 ✭✭✭ darjeeling


    Thanks - that's very helpful. With hindsight, it looks pretty self-evident, but that's hindsight for you. :)

    To break it down, we can say that when we sample just one ball out of m in the urn, the probability that a given ball is not the one chosen is: [latex] \textstyle \left( \frac{m-1}{m} \right)[/latex]

    Repeating n times (with replacement), the probability that a given ball is never sampled is: [latex] \textstyle \left( \frac{m-1}{m} \right)^n[/latex]

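    and when we account for all of the balls (expectations add, even though the balls are not independent of one another), the expectation for the number unsampled is: [latex] \textstyle m \left( \frac{m-1}{m} \right)^n[/latex], which leaves [latex] \textstyle m - m \left( \frac{m-1}{m} \right)^n[/latex] as the expected number of distinct balls sampled.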
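    The breakdown above can be turned into a short calculation. The following is a sketch only, assuming every ball is equally likely on each draw; the function name expected_distinct is made up for illustration. It also compares the m=2 case against the formula 2 - (0.5 ^ (n-1)) from the first post.

```python
import math


def expected_distinct(m, n):
    """Expected number of distinct balls seen in n draws with
    replacement from m equally likely balls."""
    p_never = ((m - 1) / m) ** n  # chance a given ball is never drawn
    expected_unsampled = m * p_never
    return m - expected_unsampled


# For m = 2 this should agree with the formula from the first post.
for n in range(1, 11):
    assert math.isclose(expected_distinct(2, n), 2 - 0.5 ** (n - 1))

# The case from the thread: 10 balls, 10 samplings.
print(round(expected_distinct(10, 10), 3))
```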

    In the application I mentioned, we have pooled DNA from 5 individuals. The proportions are near enough equal, hence I can assume the probabilities of drawing (or sequencing) any one of the 10 chromosomes to be equal. If this weren't the case, then the expected number of distinct chromosomes sampled would, I take it, be:

    [latex] \displaystyle m - \sum\limits_{i=1}^m \left(1-p_{i}\right) ^n[/latex]

    where [latex] p_{i}[/latex] is the proportion of the i'th chromosome in the pool, m = 10, and n is the number of sequence reads.
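    A sketch of the unequal-proportions version, under the assumption that reads are drawn independently with fixed proportions that sum to 1. The function name and the skewed proportions below are hypothetical, chosen only to illustrate the formula, and are not data from the application.

```python
import math


def expected_distinct_unequal(proportions, n):
    """Expected number of distinct items seen in n independent draws,
    where item i is drawn with probability proportions[i]."""
    if not math.isclose(sum(proportions), 1.0):
        raise ValueError("proportions must sum to 1")
    m = len(proportions)
    # (1 - p) ** n is the chance that an item with proportion p is never drawn.
    return m - sum((1 - p) ** n for p in proportions)


equal_pool = [0.1] * 10
# Hypothetical skewed pool, for illustration only.
skewed_pool = [0.2, 0.2, 0.1, 0.1, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05]

print(expected_distinct_unequal(equal_pool, 10))
print(expected_distinct_unequal(skewed_pool, 10))
```

    With equal proportions this reduces to the m - m((m-1)/m)^n expression, and a skewed pool gives a smaller expected count for the same number of reads.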


