
The Binomial Test

  • 03-03-2014 7:01pm
    #1
    Registered Users Posts: 3,915 ✭✭✭


    Hello,

    I am not too sure if this forum or the Researcher one is the best place for this; perhaps a Mod can make that decision. I created a similar topic there a while back, but it was much more "frazzled" than this one.

    I am doing a paired comparison test, but I need some help analysing the results to determine which results are significant.

    Would you guys agree with this link?
    http://www.elderlab.yorku.ca/~aaron/Stats2022/BinomialTest.htm

    This seems perfect for me as it tells me whether a result of A versus B is significant or not. But I have a couple of questions.

    1) In steps 3 and 4 the author seems to make a mistake about whether the z-score is in the critical region. Am I reading that correctly? If so, is the critical region or the z-score calculation wrong?

    2) I have recreated the calculations in Excel and I am getting the same numbers as the writer. Given that the null hypothesis is that there will be equal preference between the babies' rattles, if A versus B is 20 versus 20, then the resulting z-score is 0. That looks to me like it is in the critical region, but being in the critical region means the null hypothesis is rejected, so I am confused. (See the sketch after question 3.)

    EDIT: 3) It is likely that some of my tests will have counts of less than 10, which violates a condition of the binomial test linked above. The writer says to use a different set of calculations in that case, though I couldn't get my head around them as easily as I could the first one. They deal with probability more than significance.
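
    For reference, here is a minimal Python sketch of the z-score calculation as I understand it from the linked page (the function name is mine, my Excel sheet follows the same steps, and the page may also apply a continuity correction, which I've left out):

        import math

        def binomial_z(successes: int, n: int, p: float = 0.5) -> float:
            """Normal-approximation z-score for a binomial count.

            Under H0 (equal preference) the number of votes for A is
            approximately Normal with mean n*p and variance n*p*(1-p).
            """
            mean = n * p
            sd = math.sqrt(n * p * (1 - p))
            return (successes - mean) / sd

        # 20 votes for A out of 40 total sits exactly at the null mean,
        # so z = 0 - the dead centre of the distribution, not the tails.
        print(binomial_z(20, 40))   # 0.0
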
    Alternative Methods
    How suitable is the above method for determining whether preference results are significant?

    I have been talking with some university staff to help me with this determination and they recommended Thurstone's Law, but I cannot get my head around the mathematics behind it well enough to recreate it in Excel and work the results out myself. To be honest, my course takes people from practical, non-maths backgrounds, and I am having trouble getting the maths side of the project up to scratch with the practical side. Sadly, the staff either overwhelm me with complex concepts or pass me back to the aforementioned staff again. :p

    Thanks for any help you can offer!


Comments

  • Registered Users Posts: 1,595 ✭✭✭MathsManiac


    I think you may be misunderstanding the term "critical region".

    In a two-tailed test, the critical region is the area in the two tails.

    If z is between -2.3263 and +2.3263, then you're NOT in the critical region, so you don't reject H0. If z is less than -2.3263 or greater than +2.3263, then you ARE in the critical region, so you reject H0.
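
    In code terms the decision rule is just this (a Python sketch, using the same ±2.3263 cutoff as the linked page):

        def in_critical_region(z: float, z_crit: float = 2.3263) -> bool:
            """Two-tailed test: reject H0 only if z lands in one of the tails."""
            return abs(z) > z_crit

        print(in_critical_region(0.0))    # False -> z = 0 is NOT in the critical region
        print(in_critical_region(2.85))   # True  -> reject H0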

    Does that clear it up?

    By the way, the binomial test (or the normal approximation to it) is not a paired comparison test.

    P.S., is it a psychometrics course you're doing?


  • Registered Users Posts: 3,915 ✭✭✭GTE


    MathsManiac wrote: »
    ...

    Ahh, okay. That clears up the z-score thing. Should the - and + cutoff values be calculated with respect to the sample size and/or things like that?

    Regarding the binomial test, the paired comparison tests are nearly done. The reason I mention the binomial test is that from what I can see in this calculator, it allows me to input the number of votes for A and B and then calculate if the difference is significant or not. I want to be able to build a calculator in Excel which will allow me to calculate the significance.

    Thanks!

    It is not psychometrics; I am doing a sound engineering course and comparing recording techniques.


  • Registered Users Posts: 1,595 ✭✭✭MathsManiac


    The example in that link is not a paired comparison test either.

    A paired comparison is when your data are paired - that is, there are two readings or measurements for each unit in the data set.

    For example, people's test scores before and after a course of study; people's blood pressure before and after some medical intervention; people's heart rate when they are awake and asleep. In each of these examples, there are two measurements per person.

    Is that the kind of data you have?

    Perhaps you could describe exactly what experiment you did and what kind of data you have, and then we could help advise you on a suitable statistical test.


  • Registered Users Posts: 3,915 ✭✭✭GTE


    MathsManiac wrote: »
    ...

    Ahh, perhaps I have my lingo mixed up.

    I am presenting a test subject with a pair of audio samples and asking which sample they prefer. Additionally, I ask separate questions on each of 4 attributes applicable to the audio samples.

    The law of comparative judgement and pairwise comparisons are other terms which I've read to be associated with what I'm doing, though Thurstone's law is difficult for me to understand in terms of the maths. I got a couple of books from the library today which I'll have time to examine soon.

    Thanks!


  • Registered Users Posts: 1,595 ✭✭✭MathsManiac


    Ok. I understand what you're at now.

    I appreciate that there are similar phrasings here being used to refer to quite different things. A paired comparison test is the kind of statistical test I referred to. It's different from the business of Thurstone's Law and the use of pairwise comparisons to place perceptions related to a stimulus on a scale.

    I assume you're doing this with lots of test subjects. After you have all the data, I assume you'll want to determine whether there's evidence to suggest that one audio track is generally perceived by people to be preferable to the other.

    If this is so, then there are a few options, and Thurstone scaling might be a bit of an overkill. Binomial test might indeed do the trick for you, and would be much simpler.

    If there was no objective or judge-independent qualitative difference between the audio tracks, you'd expect each to be judged preferable equally often - like a coin toss. The binomial test you linked to in the first message would be a suitable test. Note that it involves a "normal approximation" to the binomial distribution; with computers, the normal approximation can actually be avoided by doing the exact binomial calculations, and this might be preferable if your sample isn't very large.
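
    For example, in Python the exact test is a one-liner with SciPy (a sketch; the 15-of-20 split is invented for illustration, and older SciPy versions call the function binom_test):

        from scipy.stats import binomtest

        # Exact two-sided binomial test: suppose 15 of 20 listeners preferred A.
        # No normal approximation involved, so small samples are fine.
        result = binomtest(k=15, n=20, p=0.5, alternative='two-sided')
        print(result.pvalue)   # ~0.041 -> significant at the 5% level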

    Bear in mind, though, that if you're going to also carry out separate tests on four other dimensions, there's a statistical correction you ought to make that arises in cases where you're carrying out multiple statistical tests as part of a single investigation. I can fill you in on that more if you want.


  • Registered Users Posts: 3,915 ✭✭✭GTE


    MathsManiac wrote: »
    I assume you're doing this with lots of test subjects. After you have all the data, I assume you'll want to determine whether there's evidence to suggest that one audio track is generally perceived by people to be preferable to the other.

    I will have a total of 20 subjects, as advised by the project supervisor, which is a realistic number given the department's size and type and the fact that it is a master's course.

    Yes, I am looking to see if there is parity or a significant difference in preference between the two audio extracts.
    MathsManiac wrote: »
    If this is so, then there are a few options, and Thurstone scaling might be a bit of an overkill. Binomial test might indeed do the trick for you, and would be much simpler.

    I am glad to hear that, as the lecturer I sought the advice from does tend to go overboard.

    MathsManiac wrote: »
    If there was no objective or judge-independent qualitative difference between the audio tracks, you'd expect each to be judged preferable equally often - like a coin toss. ... the normal approximation can actually be avoided by doing the exact binomial calculations, and this might be preferable if your sample isn't very large.

    If I can work out how to implement this test in Excel then I am onto a winner. Though, before I read your reply, I had read pages 78 and 79 of "Applied Nonparametric Statistical Methods" by P. Sprent and parts of the pages from p.157 of "Practical Nonparametric Statistics" by W.J. Conover.

    Between the two of those, I have discovered that the Sign Test may be applicable to me also, though I can only work it out so far. Please see the attached image.

    The first equation is what I was presented with in one of the books. By following the supplied example, I was able to build the calculations in Excel. The results are given in the middle of the image and match up with the worked example in the book. There are two issues:

    1) I do not understand why or when the normal approximation is applied, and I am a bit confused as to whether you add it or subtract it, as I think I am getting conflicting info from the two books.

    2) After getting the answer, one book simply says whether the hypothesis is rejected or not, and the other says to refer to a table of figures, called the Binomial Distribution table, which I don't understand yet.

    I then applied my provisional results to this equation, leaving the normal approximation the same as in the previous example, and got a result, but I am not sure how to interpret it since I do not know what is going on with the binomial distribution table.
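
    For reference, here is a Python sketch of the calculation as I currently have it in the spreadsheet (function name mine; if I've understood the books right, the 0.5 correction is applied towards the mean n/2, which would explain why one book appears to add it and the other to subtract it):

        import math

        def sign_test_z(s: int, n: int) -> float:
            """Sign test z-score via the normal approximation with a
            continuity correction: s = votes for A, n = non-tied votes."""
            mean = n / 2
            sd = math.sqrt(n) / 2
            if s > mean:
                s -= 0.5          # the correction moves the count towards n/2
            elif s < mean:
                s += 0.5
            return (s - mean) / sd

        print(sign_test_z(15, 20))   # ~2.01 (about 2.24 without the correction)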

    MathsManiac wrote: »
    Bear in mind, though, that if you're going to also carry out separate tests on four other dimensions, there's a statistical correction you ought to make that arises in cases where you're carrying out multiple statistical tests as part of a single investigation. I can fill you in on that more if you want.

    For the sake of clarity, in my test, participants are asked a total of 5 questions:
    1) Preference
    2 to 5) Which extract has more of attribute 1 to 4

    All answers are given as A or B. I was thinking that significant results in each of the 4 attribute questions could shed light on the preference answers and maybe there is a way to detect correlation.

    The Binomial Test
    I was not able to find anything obvious on this type of test in the books. If you feel that the Sign Test is better than it, or the other way around, or maybe that applying both could be beneficial, then I would like to do that, as long as I can figure out how to calculate it in Excel, since using online calculators is not good enough.

    Extra info, probably important only when I get the statistical analysis sorted out

    It should be noted that each participant will be asked the A versus B questions in the following way.

    Song 1, Extract 1: A versus B
    Song 1, Extract 2: A versus B
    Song 2, Extract 1: A versus B
    Song 2, Extract 2: A versus B

    The above makes up four "sections" of the test, and is repeated for A versus C and B versus C, totalling 12 sections.

    The different songs are used as they are different genres, which may show differences in preference with respect to genre. The two extracts per song are meant to be used as a post-screening to exclude results from the same participant that contradict each other. For instance, a participant cannot say A > B and then A < B for the same song. But I feel that even the extracts are different enough to be used in a similar way to the genres.

    The reason I mention this is that where I could have 9 participants, I will have a maximum of 36 votes for A or B. I am wondering: should I calculate with this bigger figure, or reduce it down so each participant has a single vote?

    Thanks for the help. Getting somewhere, thankfully.


  • Registered Users Posts: 1,595 ✭✭✭MathsManiac


    Ah,

    Your experiment is more complicated than you first said. You're not just trying to establish whether one thing can be said to be preferable to another, you're trying to rank order three things relative to each other. I see now why people were mentioning Thurstone methods to you.

    It might seem strange, but you can't treat your A versus B experiment and your B versus C experiment and your A versus C experiment by just using an A versus B technique three times.

    If you're also seeking to find "interaction effects" between your main dimension of concern and genre, then life gets pretty complicated (probably more complicated than I could help you with anyway!)

    Is there someone in the university you're attending who understands statistics well and who'd be prepared to sit down with you with the full details of your experiment and advise you?


  • Registered Users Posts: 1,595 ✭✭✭MathsManiac


    If you want to do some further reading around it, try looking for stuff about "Thurstone Case V model".

    Here's one document that might be relevant, but it looks a bit too technical for your current level of understanding: http://www.dtic.mil/dtic/tr/fulltext/u2/a543806.pdf
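
    Very roughly, and with the caveat that I'm not fully au fait with it myself: Case V amounts to converting each pairwise preference proportion to a z-score and averaging. A Python sketch with invented proportions (note that proportions of exactly 0 or 1 give infinite z-scores and need adjusting in practice):

        import numpy as np
        from scipy.stats import norm

        # Invented matrix: P[i, j] = proportion of judges preferring
        # method j over method i (rows/columns: A, B, C).
        P = np.array([[0.5, 0.7, 0.8],
                      [0.3, 0.5, 0.6],
                      [0.2, 0.4, 0.5]])

        Z = norm.ppf(P)              # proportions -> unit-normal z-scores
        scale = Z.mean(axis=0)       # Case V scale value = column mean
        print(scale - scale.min())   # anchor the lowest-rated method at zero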


  • Registered Users Posts: 3,915 ✭✭✭GTE


    MathsManiac wrote: »
    ...
    It might seem strange, but you can't treat your A versus B experiment and your B versus C experiment and your A versus C experiment by just using an A versus B technique three times.

    Yeah, I am rank ordering three recording methods relative to each other in terms of five separate things: preference and attributes 1 to 4. At the end, I want to see if there is significance for the preference votes; then, if an attribute or two shows significance, it could be tied in with the preference. From what I was shown of Thurstone, it results in a graph which applies significance values to all attributes.

    Why can't the A v B, A v C and B v C work for each participant if all the stimuli are the same for each song? There would be four blocks of those tests all with their respective stimuli for that block.

    Audio Engineering Society journal papers often do something similar when dealing with a project like mine, though they do not go into massive detail on how results are analysed.

    MathsManiac wrote: »
    Is there someone in the university you're attending who understands statistics well and who'd be prepared to sit down with you with the full details of your experiment and advise you?

    Yes, when it comes to it, I can go back to the Thurstone guy, but as I alluded to earlier, he tends to be more of an information floodgate than a clear deliverer of concepts.


  • Registered Users Posts: 1,595 ✭✭✭MathsManiac


    bbk wrote: »
    ...
    Why can't the A v B, A v C and B v C work for each participant if all the stimuli are the same for each song? There would be four blocks of those tests all with their respective stimuli for that block.
    ...

    Well, I guess you could start by saying that your main question is whether there is any noticeable difference in quality between the three different recording methods, and that all other analysis is subsidiary to that.* You could then do the three comparisons above as three separate binomial tests, but you would need to apply the Bonferroni correction for three tests, or some similar correction: http://en.wikipedia.org/wiki/Bonferroni_correction
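
    The correction itself is nothing more than dividing the significance level by the number of tests; a Python sketch, with invented p-values, purely to show the decision rule:

        alpha = 0.05                    # family-wise significance level
        m = 3                           # three pairwise tests: A v B, A v C, B v C
        alpha_per_test = alpha / m      # each test is judged against ~0.0167

        # Invented p-values, for illustration only:
        p_values = {'A v B': 0.041, 'A v C': 0.009, 'B v C': 0.62}
        for pair, p in p_values.items():
            verdict = 'significant' if p < alpha_per_test else 'not significant'
            print(pair, '->', verdict)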

    If you DO establish any statistical significance, then you could use the ratings of the attributes as soft information to postulate what attributes are associated with quality.

    You could probably get a more substantive analysis using the Thurstone methods, but I think you'd need someone who could help you work through it. (I certainly wouldn't be able to do so over the web, and I'm not sufficiently au fait with the methods to be confident enough anyway.) Maybe another poster here can offer you more (or better) advice.

    Good luck with it anyway!

    [*Statistical testing of this type is only kosher if you specify exactly what tests you're going to do in advance of doing them, including how many of them, and then apply the necessary multiplicity corrections (such as Bonferroni). It's NOT legitimate to gather a whole lot of data, eyeball it, and say "oh, that looks interesting - I'll hypothesis-test that bit". If you DID spot something interesting in that way in a data set and decide it's worth testing for, then you must go and generate a fresh, independent set of data for the hypothesis test - but this is drifting off the point a bit. In your case, if you decide in advance that you're going to do a whole plethora of different hypothesis tests, then your multiplicity corrections will be so extensive that you could end up masking some genuine differences on your big-ticket item - which is, I think, whether there's any noticeable difference in the overall quality.]


  • Registered Users Posts: 3,915 ✭✭✭GTE


    MathsManiac wrote: »
    ...

    Thanks.

    Since I had the guts of it already in Excel before starting the thread, I finished off the testing I had begun to build earlier in the weekend. It was based on the Sign Test with the normal approximation to the binomial, if I have my terms right.

    Just like you talked about in that post, I tested the overall results for preference and attributes. This was based on the combined results of both musical pieces for each pair of recording techniques, which is only so useful. Then I did some more tests for each piece and compared those. Results showed parity where I expected it and 95% confidence where I expected it too, with some 99% confidence levels to boot, so I am happy in that regard. It also showed a flip in certain attributes between the different genres.

    You hit the nail on the head with the subsidiary analysis point. There is only so deep I can go into this subject before I lose the focus of the project question, which is preference. I will see what the supervisor says. I am still looking into other tests as well as Thurstone, but at least I have a set of results I know were worked out correctly, instead of getting confused while picking the correct option.

    Thanks for the help!


  • Registered Users Posts: 1,595 ✭✭✭MathsManiac


    You indicated earlier that you had 20 subjects. This basically gives you 20 "votes" in each battle of the A v B kind. Let's assume you still have 20 votes after you clean your data.

    If you were only doing one hypothesis test, then a vote of 15 to 5 or stronger would be significant at the 5% level, and a vote of 17 to 3 or stronger would be significant at the 1% level.

    If you are doing three tests, then in any one of them you need to see a vote of 16 to 4 or stronger to be significant at the 5% level, while 17 to 3 or stronger still indicates significance at 1%.

    If you are doing 15 tests (for example, testing overall preference and also four attributes for each of three pairings) then in any one of these tests you need a 17 to 3 vote or stronger to be significant at 5% and 18 to 2 or stronger to be significant at 1%.

    The attached Excel file shows the relevant binomial probability calculations to explain the above assertions.
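
    If you'd rather reproduce it than rely on the attachment, here is a sketch of the same cutoff search in Python using exact binomial tail probabilities (the function name is mine); the thresholds quoted above drop out of it:

        from scipy.stats import binom

        def smallest_significant_vote(n: int = 20, alpha: float = 0.05):
            """Smallest winning vote k out of n whose exact two-sided
            binomial p-value beats alpha, under H0: p = 0.5."""
            for k in range(n // 2 + 1, n + 1):
                p_two_sided = 2 * binom.sf(k - 1, n, 0.5)   # 2 * P(X >= k)
                if p_two_sided < alpha:
                    return k
            return None

        print(smallest_significant_vote(20, 0.05))        # 15 -> "15 to 5"
        print(smallest_significant_vote(20, 0.01))        # 17 -> "17 to 3"
        print(smallest_significant_vote(20, 0.05 / 3))    # 16 -> "16 to 4"
        print(smallest_significant_vote(20, 0.05 / 15))   # 17 -> "17 to 3"
        print(smallest_significant_vote(20, 0.01 / 15))   # 18 -> "18 to 2"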

    This is all premised on an assumption that you intend to follow this kind of hypothesis-testing approach to the analysis rather than the model-fitting approach that would arise from Thurstone scaling.


  • Registered Users Posts: 3,915 ✭✭✭GTE


    Hokie dokie. I can see how the Bonferroni is applied to the tests, as in how to hack it into my Excel spreadsheet, but I am not sure to what degree it should be implemented.

    1 - Taking one question, the main one of preference:
    You are correct in that there are 20 votes from 20 participants for each A v B battle. As it takes 3 battles to compare the 3 recording methods, there are 60 votes being cast by 20 participants. This means there are 3 pairwise comparison sections required per musical piece.

    2 - The test layout for preference for one question
    4 musical pieces from 2 songs were treated with the 3 recording methods. This means that the 3 sections are repeated for each musical piece, resulting in 12 sections to fully answer the preference questions across all musical pieces.

    3 - Additional Questions
    In each section, 4 attribute questions are asked in addition to preference.

    4 - The Test layout for all questions
    The 12 sections described above are used to ask the five questions. In total, each participant is asked 60 questions (12 x 5) and 60 votes result.

    Just so it makes sense in my head:
    1 participant gives 60 votes.
    20 participants cast 1,200 votes.
    1,200 votes divided by the 5 questions is 240.
    240 divided by the 4 musical pieces is 60.

    5 - How to apply the Bonferroni Correction
    I am doing 15 tests per musical piece, and 60 tests for all 4 musical pieces combined.

    Applying the correction for 15 tests to the preference question cuts 5 significant results (at 95%) down to 2 significant results at 95%.

    Does what I have explained mean that I have to apply a correction for 60 tests instead of 15?

    I would have assumed that the 15 is applied for each question with a particular stimulus.
    Once the test stimulus changes then another 15 tests are being done.

    Would a correction of 3 be more appropriate since each attribute, preference for example, will take 3 test sections to be satisfied?

    If I have to stick with a correction with respect to 15 tests, would I be better off excluding the test questions which, as it stands, only show parity, leaving just the three remaining questions? That said, some of them showing parity makes sense and is expected, so in essence I am treating each question with a different hypothesis, even if one or two of them are the same in terms of probability value.

    In my own head I can understand why a correction of 3 would be applied, but I can't see why a correction of 15 would be applied when each question is separate and only the stimulus is the same.

    In other news
    With Thurstone being discounted at this stage due to difficulty, I have managed to implement other tests; once they are confirmed as useful, I can then investigate Thurstone knowing I have a fallback.

    One-Proportion
    http://www.youtube.com/watch?v=xWwsfjZuaRg
    http://www.youtube.com/watch?v=SHLSn3EHVuQ

    Chi-Square
    http://www.wikihow.com/Calculate-P-Value

    Sign Test
    From books.

    Minitab, the statistical software I am using on trial, shows that the p-value it calculates in One-Proportion tests is the same as that of my Excel Chi-Square test, which I am happy with.

    However, the Sign Test throws up a different Z score when the normal approximation is applied. When I take that calculation out, the Z score matches the One-Proportion test, which we know matches the Chi-Square test. Seemingly happy days if I just leave the normal approximation out, but the book tells me to use it for sample sizes larger than 20.
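
    To convince myself the agreement wasn't a fluke, I sketched the comparison in Python (sample figures invented; for a two-cell Chi-Square the statistic works out as the square of the uncorrected z, which would explain the match):

        import math

        s, n = 29, 80                      # invented sample: 29 of 80 votes for A

        # One-proportion z without continuity correction:
        z = (s - n * 0.5) / math.sqrt(n * 0.25)

        # Two-cell chi-square goodness of fit against a 50/50 split:
        expected = n / 2
        chi2 = (s - expected) ** 2 / expected + (n - s - expected) ** 2 / expected

        print(z, math.sqrt(chi2))          # |z| = sqrt(chi2) ~ 2.46, same p-value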

    I have attached the stat calc sheet I am using, with a set of sample results for one question. I thought it may be useful and aid clarity.

    The stats lecturer I know is out for a few days, but I will be sending him a similar sheet to see what he says, as Thurstone hasn't worked out, at least not yet.

    Thanks!


  • Registered Users Posts: 1,595 ✭✭✭MathsManiac


    Here are a few options you could consider:

    Analysis of the big picture: just seeking evidence for whether any of these methods can be said to be consistently ratable as "better" than any other, irrespective of the music concerned.

    To answer such a question, you could treat your data as follows: in the battle between A and B, you only count votes from people who consistently rated A above B in all extracts from all songs. So, if I rated A above B in some cases, and B above A in others, I'm a "spoiled vote" and none of my A vs B opinions get counted in this battle, (but I might conceivably still have an unspoiled vote in the A vs C or B vs C battle). If I consistently rated A above B every time I heard them up against each other, then that's one vote for A; if the reverse, one vote for B; if neither, no valid vote.

    Another option is that, for each person, you look at whether they rated A above B more often than B above A. If so, that's one vote for A; if the reverse, one vote for B; if a tie, no vote. This method probably leaves you with fewer spoiled votes than the previous option.
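
    To make that concrete, here's a rough Python sketch of this "majority vote per person" tallying (the data layout and names are invented):

        # Invented raw data: each participant's A-vs-B choices across the
        # four extracts (True means they preferred A on that extract).
        ballots = {
            'p01': [True, True, True, True],
            'p02': [True, False, True, True],
            'p03': [False, True, True, False],   # 2-2 tie -> no valid vote
        }

        votes_a = votes_b = 0
        for person, choices in ballots.items():
            a = sum(choices)
            b = len(choices) - a
            if a > b:
                votes_a += 1        # majority for A -> one vote for A
            elif b > a:
                votes_b += 1        # majority for B -> one vote for B
            # exact ties are discarded, as described above

        print(votes_a, votes_b)     # these counts then feed the binomial test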

    Another option would be to count all the votes. So, even if I preferred A to B on one track, and B to A the next time, both of these count as votes. This obviously yields far more total votes than the previous option, as everyone generates several votes, and none get discounted. I have doubts about how kosher this is - all the hypothesis tests we've been discussing are based on an assumption that the votes are all independent of each other, and this is not true if you do things this way - multiple votes that I cast in this ballot could not be considered independent of each other, even if the extracts I'm listening to are different. Perhaps somebody else here can see a way of arguing that this would still be a valid analysis - I just don't see it clearly as being so.

    As soon as you start conducting further hypothesis tests on different attributes of the music, and/or tests on how the vote goes for different genres, you're rapidly multiplying the number of tests you are doing. You must then count up the total number of hypothesis tests in your entire analysis, and that's the number that applies in the Bonferroni correction that must be applied to all of them. (You could make a case, I think, for applying multiplicity 3 to the main overall results for AvB, BvC and AvC and only applying the bigger number to all the other tests. Some people would take issue with that, but I think it's arguably reasonable in the context.)


  • Registered Users Posts: 3,915 ✭✭✭GTE


    Great stuff, thanks for all your help. I have a much better handle on things since I started the thread.

    My final question is a simple one, I hope :p

    When writing up the results in APA style as below, should the p < 0.05 actually read the Bonferroni-corrected value?
    Example wrote:
    SFI versus SFA
    29 out of 80 votes preferred SFI. A binomial test revealed that there is a significant preference for SFA at a 95% confidence level, z = -2.46, p < 0.05

    Thanks again


  • Registered Users Posts: 1,595 ✭✭✭MathsManiac


    This might help:
    http://web.psych.washington.edu/writingcenter/writingguides/pdf/stats.pdf

    There's an example with Bonferroni correction under the second heading on page 2.


  • Registered Users Posts: 3,915 ✭✭✭GTE


    Thanks again for your help; I am writing up the acknowledgements. "MathsManiac" may not be appropriate, but it could be fun to include it to annoy the more rigid readers :p. But seriously, if you want to PM me, I can put your name in.

    Thanks again.


  • Registered Users Posts: 1,595 ✭✭✭MathsManiac


    Thanks for the thought, but no need.

    (It's just what we do around here!)

