Stats (Correlation) question from new LCHL Project Maths Sample Paper

  • 17-12-2011 4:30pm
    Closed Accounts Posts: 57 ✭✭


    Hi guys, not sure if I should have posted this in the Teaching & Lecturing thread, but I opted for the expertise here. Mods may move if necessary.

    So I'm teaching Leaving Cert Higher Level maths and for the first time, I have to teach about scatter plots, correlations and lines of best fit. In preparation for teaching this after xmas, I had a look at some of the sample papers released by the SEC and I came across this one from 2011:

    [Image: pmaths.jpg]
    The solution given for this is that the Cycle v Swim plot has a weak correlation, the Swim v Run plot has a medium correlation and the Cycle v Run plot has a strong correlation. All correlations are positive. That is fine, but it is the next part where I disagree with the solution provided at the back of the book.

    (c) The best-fit line for run-time based on swim-time is y = 0.53x + 15.2. The best-fit line for run-time based on cycle-time is y = 0.58x + 0.71. Brian did the swim in 17·6 minutes and the cycle in 35·7 minutes. Give your best estimate of Brian’s time for the run, and justify your answer.

    My solution was to take the plot with the strongest correlation (the cycle-run plot), sub 35.7 for x in the second equation and solve for y. I took that as the best estimate because that plot has the stronger correlation.

    The back of the book solution is to do the same as me, but also to sub 17.6 for x in the first equation and take the average of the two solutions. I have asked colleagues of mine but have gotten mixed views on this.
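For reference, the two approaches can be put side by side in a few lines of Python. This is only a sketch: the function names are made up for illustration, and the coefficients are the ones quoted in part (c).

```python
# Best-fit lines quoted in part (c) of the question.
# The function names are hypothetical, chosen for this illustration only.

def run_from_swim(swim_minutes):
    """Estimate run time from swim time using y = 0.53x + 15.2."""
    return 0.53 * swim_minutes + 15.2


def run_from_cycle(cycle_minutes):
    """Estimate run time from cycle time using y = 0.58x + 0.71."""
    return 0.58 * cycle_minutes + 0.71


# Brian's times from the question.
swim_estimate = run_from_swim(17.6)
cycle_estimate = run_from_cycle(35.7)

# The back-of-the-book approach: a plain average of the two estimates.
book_answer = (swim_estimate + cycle_estimate) / 2

print(round(swim_estimate, 1), round(cycle_estimate, 1), round(book_answer, 1))
```

The cycle-only estimate comes out at about 21.4 minutes, the swim-only estimate at about 24.5 minutes, and the plain average at about 23.0 minutes.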

    My argument is that if the swim-run plot had an extremely weak correlation, would you not just discard that result, since including it in the average would skew your answer?

    Thanks in advance for any replies.


Comments

  • Registered Users Posts: 1,595 ✭✭✭MathsManiac


    In relation to part (b), I would say that the difference between correlations in the first two diagrams is hardly perceptible - I'd be inclined to describe both as moderate. The third one is certainly stronger.

    In relation to part (c), the fact that there is some correlation between the run and the swim means that the swim time can contribute some useful information to the estimate. However, the stronger correlation between the run and the cycle does indicate that the cycle is a better predictor of run-time than the swim. So there is certainly a case to be made for basing an estimate on the cycle-time only. However, I would say that the best estimate on the basis of the available information would involve a weighted mean of the two estimates, leaning more towards the one based on the cycle.


  • Closed Accounts Posts: 57 ✭✭sullanefc


    Thanks for the response, but without actually knowing the correlation coefficients, a weighted mean would not be possible. And finding the mean by adding both run times and dividing by 2 (as is done in the solution section) makes no sense to me. If the swim-run scatter plot showed an extremely weak correlation, it would make even less sense, wouldn't it?


  • Registered Users Posts: 1,595 ✭✭✭MathsManiac


    If the swim-run scatterplot showed no discernible correlation, then I would not consider the swim as a useful predictor, but that's not the case here.

    You can do a weighted mean by estimating the relative values of the two predictor variables. (Maybe "weighted mean" was too grand a term to use in this context!)

    I'd say 22.5 minutes would be a pretty good point estimate.
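To make the idea of leaning towards the cycle-based estimate concrete, here is a minimal sketch of a weighted mean. The 1:2 weighting is purely an assumption picked for illustration; nothing in the question or in this thread fixes the weights, and the function name is made up.

```python
def weighted_mean(estimates, weights):
    """Weighted mean of several estimates; weights need not sum to 1."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(e * w for e, w in zip(estimates, weights)) / total_weight


# The two single-variable estimates from the best-fit lines in part (c).
swim_estimate = 0.53 * 17.6 + 15.2
cycle_estimate = 0.58 * 35.7 + 0.71

# ASSUMPTION: weight the cycle-based estimate twice as heavily as the
# swim-based one. This 1:2 ratio is illustrative, not derived from data.
estimate = weighted_mean([swim_estimate, cycle_estimate], [1, 2])

print(round(estimate, 2))
```

With that particular weighting the estimate is about 22.45 minutes. A different, equally defensible weighting would move it somewhere else between the two single-variable estimates.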

    It seems clear from the way the question is phrased that they are at least as interested in your rationale for choosing a particular approach as in what actual answer you get. If the person's answer and justification showed a good understanding of the issues we are discussing here, they'd be on the pig's back, I'd say.

    Of course, if you wanted the real answer, you could always download and mess around with the original data set! Assuming they really did use the correct data, then it's available here:
    http://www.corktri.com/events/kinsale-king-of-the-hill-triathlon/
    (Just below the second photo)
    If you ignore rows with missing data, then there really are 224 competitors, so it seems likely the data are real.

    The whole question is pretty interesting, I think. Beats the hell out of the kind of stats that were on the old course.


  • Closed Accounts Posts: 57 ✭✭sullanefc


    Many thanks for that response. I never even considered that there was real data behind the plots. Thought it was made up.

    The part of your post that I've highlighted is on the money I'd say. If your reasoning is sound, then I'd say you'd get the marks.

    Thanks again.


  • Registered Users Posts: 1,163 ✭✭✭hivizman


    Thanks to MathsManiac for identifying the source of the data. Using the data given in the spreadsheet, and applying the Data Analysis module in Excel, the correlation coefficients are:

    Cycle v Swim 0.55
    Run v Swim 0.49
    Run v Cycle 0.77

    So the first two correlations are quite close, and it would be reasonable to describe the Cycle v Swim correlation as "moderate" (interestingly, the Cycle v Swim correlation is actually slightly greater than the Run v Swim correlation).
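For anyone who wants to check figures like these without Excel, here is a rough sketch of the Pearson correlation coefficient in plain Python. The toy times are invented purely to exercise the function; they are not taken from the triathlon spreadsheet, and the function name is made up.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# ASSUMPTION: made-up toy times in minutes, NOT the real race data.
toy_cycle = [30.0, 35.0, 40.0, 45.0]
toy_run = [18.0, 22.0, 21.0, 27.0]

r = pearson_r(toy_cycle, toy_run)
print(round(r, 3))
```

Running the same function over the real swim, cycle and run columns (after dropping rows with missing values) should give coefficients comparable to the ones listed above.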

    Given the data, it's possible to apply multivariate regression analysis. The line of best fit in this case is:

    r = 0.48 + 0.10s + 0.54c

    If Brian did the swim in 17.6 minutes and the cycle in 35.7 minutes, then the best estimate of the run time from the multivariate regression is:

    r = 0.48 + 0.10*17.6 + 0.54*35.7 = 21.5 minutes.

    This compares with the estimates from the univariate regressions:

    r = 15.2 + 0.53s, yielding an estimate of 15.2 + 0.53*17.6 = 24.5 minutes

    r = 0.71 + 0.58c, yielding an estimate of 0.71 + 0.58*35.7 = 21.4 minutes
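The comparison between the three estimates can be checked with a short Python sketch. The coefficients are the rounded ones quoted in this post, and the function name is made up for illustration.

```python
def run_from_swim_and_cycle(swim_minutes, cycle_minutes):
    """Two-predictor line of best fit: r = 0.48 + 0.10s + 0.54c."""
    return 0.48 + 0.10 * swim_minutes + 0.54 * cycle_minutes


# Brian's times from the question.
swim, cycle = 17.6, 35.7

both = run_from_swim_and_cycle(swim, cycle)
swim_only = 15.2 + 0.53 * swim    # single-predictor line based on swim time
cycle_only = 0.71 + 0.58 * cycle  # single-predictor line based on cycle time

# How far the two-predictor estimate sits from each single-predictor one.
gap_to_cycle_only = abs(both - cycle_only)
gap_to_swim_only = abs(both - swim_only)

print(round(both, 1), round(gap_to_cycle_only, 1), round(gap_to_swim_only, 1))
```

With these rounded coefficients the two-predictor estimate lands within about 0.1 minutes of the cycle-only estimate and about 3 minutes below the swim-only one.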

    In the multivariate regression, the swim time variable contributes very little. One measure of the explanatory power of a regression model is the Adjusted R-squared value. For the univariate model based on cycle time, this is 0.590, implying that the variation in the cycle time explains about 59% of the variation in the run time. However, for the multivariate model, the Adjusted R-squared is 0.595, implying that including swim time as well as cycle time increases the explanatory power of the regression model only marginally.
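As a rough cross-check on the adjusted R-squared for the cycle-only model, the standard adjustment formula can be sketched as below. Two assumptions are baked in: the sample size of 224 is the competitor count mentioned earlier in the thread (assuming every complete row was used), and the correlation is the rounded 0.77, so the result will only approximately match the Excel figure.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_observations - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r_squared) * (n_observations - 1) / (
        n_observations - n_predictors - 1
    )


# ASSUMPTION: n = 224 competitors, all used in the regression.
# For a single predictor, R^2 is the square of the correlation coefficient.
r_squared_cycle = 0.77 ** 2
adj = adjusted_r_squared(r_squared_cycle, 224, 1)

print(round(adj, 3))
```

This gives roughly 0.591, in line with the 0.590 reported above once the rounding of the correlation coefficient is taken into account.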

    As it turns out, therefore, the OP's original instinct to ignore the swim time was reasonable.

    I suspect that much of this is beyond the scope of Leaving Cert Higher Level Maths, though!


  • Closed Accounts Posts: 243 ✭✭vallo


    I'm so glad to see this here, because I had the exact same concerns about the question.
    If you have two ways of estimating the run-time, would it not be more accurate to use the better one (the one with the stronger correlation) rather than "diluting" this estimate by bringing in the weaker correlation and taking the mean?
    I gave this to my 5th years in their Christmas exam and I am just correcting them now. Interestingly, many of them took the mean approach. The book of exam papers gives the result as the mean of the two best-fit formulae.
    I haven't seen an "official" dept of ed solution to it though.


  • Closed Accounts Posts: 57 ✭✭sullanefc


    Sorry guys to bump an old thread, but I posted another concern I had about the syllabus for LCHL on the "Teaching and Lecturing" forum here. I didn't want to post a new thread on it here, but if anyone here is a maths teacher in the know, could you help please? Thanks.

