Correlation question

Darkest Horse · 15-05-2016 10:21PM #1

Hi all. Just a quick Q for anyone with a stats head on them. Am I breaking any statistical rules by correlating variables in the following way? Say the letters all represent scores on different aptitude tests. We have tests A, B, C, X, Y, Z, so:

The correlation of X with the sum of A, B, C, X, Y and Z.

The bits that concern me are correlating X with itself, and correlating X with the sum of all the other variables.

Hope I've explained that clearly enough.

Cheers,
DH

MathsManiac · 16-05-2016 10:11PM

You're correct to be concerned about this. It's generally not a good idea to correlate a quantity with another one that is partly derived from it.

In testing contexts, it would be more usual to correlate X with the sum of the remaining items.

What you have is a variation on the idea of "item-total correlations", which are used for item analysis in test development. The usual practice is to use the "corrected item-total correlation", which is the correlation between the scores on the item under consideration and the measure formed by summing the remaining items on the test. This is a measure of "item discrimination", which, broadly speaking, is about how good this item is at measuring whatever it is that the overall test measures.
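
As a rough illustration of the difference, here is a minimal Python sketch that computes both versions for X. The scores are made up, and the helper names (`pearson`, `item_total`, `corrected_item_total`) are hypothetical, chosen only for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def item_total(scores, item):
    """Uncorrected: the item against the sum of ALL items (itself included)."""
    total = [sum(row) for row in zip(*scores.values())]
    return pearson(scores[item], total)

def corrected_item_total(scores, item):
    """Corrected: the item against the sum of the REMAINING items only."""
    others = [vals for name, vals in scores.items() if name != item]
    rest = [sum(row) for row in zip(*others)]
    return pearson(scores[item], rest)

# Made-up scores for eight test-takers on the six tests (illustration only)
scores = {
    "A": [12, 15, 9, 18, 11, 14, 16, 10],
    "B": [22, 25, 19, 27, 20, 24, 23, 21],
    "C": [7, 9, 6, 10, 8, 7, 9, 5],
    "X": [30, 34, 28, 37, 29, 33, 35, 27],
    "Y": [14, 13, 11, 17, 12, 15, 16, 10],
    "Z": [5, 8, 4, 9, 6, 6, 7, 5],
}

r_whole = item_total(scores, "X")           # X with A+B+C+X+Y+Z
r_rest = corrected_item_total(scores, "X")  # X with A+B+C+Y+Z
print(f"part with whole: {r_whole:.3f}")
print(f"part with rest:  {r_rest:.3f}")
```

The corrected figure can never come out higher than the uncorrected one, because adding X to the total can only pull the total towards X.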

Darkest Horse · 16-05-2016 10:20PM

That's great, thanks for that. In short, I think you are saying to take X out of the grouped variables and correlate like so:

X with the sum of A, B, C, Y, Z?

MathsManiac · 19-05-2016 08:06AM

Yes.

You could always calculate both.

If you're writing for an informed audience, they will understand the issues associated with correlating a part with the whole - the important thing is that you communicate clearly what you have done.

If you are reporting on correlations of the type X with A+B+C+Y+Z, you could refer to these as "part with rest" correlations or "adjusted part with whole correlations". If you're reporting on ones of the type X with A+B+C+X+Y+Z, you could refer to them as "part with whole" or "non-adjusted part with whole" correlations. In that case, it would be no harm to note that such correlations can be somewhat artificially inflated by the presence of X in both quantities (this also applies to, for example, correlations of the form X+Y with X+Z).
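
To see where the inflation comes from, here is a short derivation. The notation is shorthand introduced only for this sketch: R is the sum of the remaining variables, s and t are the standard deviations of X and R, and ρ is the part-with-rest correlation between X and R.

```latex
\begin{aligned}
T &= X + R, \qquad R = A + B + C + Y + Z \\
\operatorname{Cov}(X, T) &= s^2 + \rho s t \\
\operatorname{Var}(T) &= s^2 + t^2 + 2\rho s t \\
\operatorname{corr}(X, T) &= \frac{s + \rho t}{\sqrt{s^2 + t^2 + 2\rho s t}}
\end{aligned}
```

With ρ = 0 this reduces to s/√(s² + t²), which is still positive. For example, if the six tests were mutually uncorrelated with equal variances, then t² = 5s² and every part-with-whole correlation would be 1/√6 ≈ 0.41, even though no test is related to any other. By the same reasoning, for uncorrelated X, Y and Z with equal variances, the correlation of X+Y with X+Z is 1/2.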

Even if there are reasons why you want to leave the X in the mix for the second quantity, the different correlations you calculate are still meaningful relative to each other. That is, for example, if you let T=A+B+C+X+Y+Z, and then calculate the correlation between T and each of the six individual variables, these six correlations can be meaningfully compared to each other, even if all of their absolute values are inflated by the issue you identified. If these are scores on sub-tests of a test, you can still use these correlations to see which sub-test is most or least efficient at measuring whatever construct is being measured by the test overall.
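
The comparison described above could be sketched like this in Python. Again the scores are invented, and `pearson` is just a hypothetical helper written for the example; treat it as one possible way to lay the calculation out, not a prescribed method.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up sub-test scores for eight test-takers (illustration only)
scores = {
    "A": [12, 15, 9, 18, 11, 14, 16, 10],
    "B": [22, 25, 19, 27, 20, 24, 23, 21],
    "C": [7, 9, 6, 10, 8, 7, 9, 5],
    "X": [30, 34, 28, 37, 29, 33, 35, 27],
    "Y": [14, 13, 11, 17, 12, 15, 16, 10],
    "Z": [5, 8, 4, 9, 6, 6, 7, 5],
}

# T = A + B + C + X + Y + Z for each test-taker
T = [sum(row) for row in zip(*scores.values())]

# Part-with-whole correlation of each sub-test with T
part_whole = {name: pearson(vals, T) for name, vals in scores.items()}

# Order the sub-tests from strongest to weakest correlation with T
ranking = sorted(part_whole.items(), key=lambda kv: kv[1], reverse=True)
for name, r in ranking:
    print(f"{name}: r = {r:.3f}")
```

Every value in the list is inflated by the overlap, so it is the ordering that carries the information here, not the absolute sizes.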
