
How to correlate one line against another?

  • 29-09-2013 12:58pm
    #1
    Registered Users, Registered Users 2 Posts: 762 ✭✭✭


    Hi folks

    In very simple terms, I am trying to replicate the results of Joe Bloggs' model by incorporating as many of the known input assumptions, modelling methodology, etc. that Joe Bloggs would have used.

    As a result, I am left with a line graph coming out of my model, and I want to mathematically determine how well it correlates with Joe Bloggs' line graph.

    Can someone please point me in the right direction as to how I can determine how well my line correlates with Joe Bloggs' line?

    Many thanks!


Comments

  • Registered Users, Registered Users 2 Posts: 13,077 ✭✭✭✭bnt


    Correlation (R) between two data sets is calculated with a standard formula. On Wikipedia, it's the last formula in the "Pearson's product-moment coefficient" section - that's the coefficient we usually mean when we talk about "correlation". Excel has the =CORREL() function for this, MATLAB has corrcoef(x,y), and so on.
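    As a quick sketch in Python (the data values here are made up purely for illustration):

```python
import numpy as np

# Two made-up data series of equal length
y_yours = [3.0, 2.0, 4.5, 5.1]
y_bloggs = [5.0, 4.0, 6.2, 7.0]

# Pearson product-moment correlation coefficient -
# the same quantity Excel's =CORREL() and MATLAB's corrcoef() report
r = np.corrcoef(y_yours, y_bloggs)[0, 1]
print(r)
```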

    You are the type of what the age is searching for, and what it is afraid it has found. I am so glad that you have never done anything, never carved a statue, or painted a picture, or produced anything outside of yourself! Life has been your art. You have set yourself to music. Your days are your sonnets.

    ―Oscar Wilde predicting Social Media, in The Picture of Dorian Gray



  • Registered Users, Registered Users 2 Posts: 762 ✭✭✭PGL


    Hi bnt.

    Thanks for your response. I'm wondering if "correlation" is not entirely what I need to achieve.

    It appears that, in simple terms, correlation measures how closely I follow the same trend as the target data I'm aiming to validate against.

    Let's take a basic example: the target data has two points, the first with a Y-value of 5 and the second with a Y-value of 4, i.e. a line sloping down to the right. Following my 1st validation attempt, I come up with Y-values of 3 and 2, which gives a correlation of 1 with the target data. Following a 2nd attempt, let's say I'm getting closer, with Y-values of 4 and 3. However, the correlation is again 1 with the target data. Hence the same correlation for both the 1st and 2nd attempts, even though the 2nd attempt is much closer to the target data.
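    (To make the effect above concrete, here is a small Python check, using NumPy's corrcoef for the Pearson coefficient. With only two points, any downward-sloping attempt correlates perfectly with the target.)

```python
import numpy as np

target = [5, 4]
attempt1 = [3, 2]   # 1st attempt: further from the target values
attempt2 = [4, 3]   # 2nd attempt: closer to the target values

r1 = np.corrcoef(target, attempt1)[0, 1]
r2 = np.corrcoef(target, attempt2)[0, 1]
print(r1, r2)  # both are 1.0 - correlation ignores the offset
```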

    Can you point me in the right direction?

    Many thanks!


  • Registered Users, Registered Users 2 Posts: 762 ✭✭✭PGL


    Any takers folks?


  • Registered Users, Registered Users 2 Posts: 151 ✭✭Anonymo


    PGL wrote: »
    Any takers folks?

    If you don't want to do this in the manner bnt suggests, then just note that for lines, the correlation is the cosine of the angle between them. So get the slopes, respectively tan(theta1) and tan(theta2), and solve for cos(theta1 - theta2).

    The example you gave is for lines with exactly the same slope, so it's to be expected that the correlation is 1 (correlation describes a comparison of behaviour, here slope, between two datasets). It sounds like you want a measure of how far one set of data points is from another, so a distance measure would be better: for each target data point, measure its distance from the line; the overall measure is sqrt(sum of squared distances). Whichever line has the smaller measure is your best fit. There are stats packages that cover linear regression and analysis of variance, but what I described is the gist of it.
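    (A sketch of that distance measure in Python, using vertical distances from the target points to each candidate line; the data and the two candidate lines are invented for illustration.)

```python
import math

def fit_distance(xs, ys_target, slope, intercept):
    """Square root of the summed squared vertical distances
    from the target points to the line y = slope*x + intercept."""
    return math.sqrt(sum((y - (slope * x + intercept)) ** 2
                         for x, y in zip(xs, ys_target)))

# Invented target data, roughly following y = -x + 5
xs = [0, 1, 2, 3]
ys = [5.0, 4.1, 2.9, 2.0]

d1 = fit_distance(xs, ys, -1.0, 5.0)  # candidate line y = -x + 5
d2 = fit_distance(xs, ys, -1.0, 3.0)  # same slope, shifted down by 2
print(d1, d2)  # the smaller measure (d1) is the better fit
```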


  • Registered Users, Registered Users 2 Posts: 1,922 ✭✭✭fergalr


    Anonymo wrote: »
    if you dont want to do this in the manner bnt suggests, then just understand that for lines, the correlation is just the cosine of the angle between them. so get the slopes, respectively tan theta1 and tan theta2 and solve for cos(theta1-theta2)

    I'm not sure exactly what the OP is trying to do. I think there's many ways of interpreting the question, and some of them get complicated.

    But I don't think it makes sense to just compare the slopes of the lines, if you want to compare the two linear models.

    Even if we assume that both the OP's model and Joe Bloggs' model have the same dimensions/variables, scaled the same way, two regression lines that have a similar slope, but different intercepts, might be very different models.


    If we want to be more general about it, I think the OP is asking for a statistical test that will tell how similar their model is to Bloggs', given a 'similar' regression line fit.
    That strikes me as quite a complicated question, and one I'd guess is probably not the best question to ask - I'm not sure it's a great idea to compare the output of a replication of an experiment based purely on the similarity of the two linear regressions, without having first constrained all the other variables.

    On the other hand, I guess it's something that people want to do, in effect, frequently. 'We replicated X's experiment and got a similar correlation coefficient' would probably be accepted as evidence that the replication of the experiment was sound. But I don't know a principled way to provide confidence bounds on that certainty of replication without either A) being sure that many of the modelling assumptions were the same in both experiments, or B) just assuming that they were.

    The question just seems like it's getting into difficult territory.
    This isn't an area of statistics I'm very familiar with, though. Could anyone with more experience either say 'yeah, I think that's the right way to think about it' or 'actually, there's a principled way of doing this, which makes some assumptions, but which we generally treat as OK'?


    It seems like the situation would be a lot easier to solve if you could be sure of the process Bloggs went through, the representation and scaling of data used, and just be looking to check to see if the two datasets were similar.


  • Registered Users, Registered Users 2 Posts: 151 ✭✭Anonymo


    fergalr wrote: »
    I'm not sure exactly what the OP is trying to do. I think there's many ways of interpreting the question, and some of them get complicated. [...]

    You should probably read the whole post rather than commenting on one line of it! I agree that it's not clear what the OP's original question was, but I did, I think, give a plausible method for comparing how 'close' one model is to the data set, relative to another. I'd be interested in why you think the answer I gave may not work.


  • Registered Users, Registered Users 2 Posts: 1,922 ✭✭✭fergalr


    Hmm - we risk talking in circles, unless we get more clarity from the OP what the OP actually wants.

    I suspect the OP doesn't quite know what exactly they need, or has asked for the wrong thing, and that's what is causing confusion.

    OP?
    Anonymo wrote: »
    I agree that it's not clear what the OPs original question was but I did, I think, give a plausible method with which to compare how 'close' one model is to the data set compared to another. I'd be interested in why you think the answer I gave may not work.

    I think what you described is basically ordinary least squares regression, right?
    Which would make sense if the OP were trying to compare two fitted lines to a single set of data.

    If, instead of two lines, the OP has the two datasets, of the same scaling and dimension, the OP could probably calculate the average pairwise distance between the points, assuming the points are matched.

    I'm not sure how you'd turn that into a test statistic, but maybe reporting the average distance would be enough.
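    (That average pairwise distance might be sketched like this in Python, assuming the two datasets are matched point-for-point at the same X values; the numbers are invented.)

```python
def mean_pairwise_distance(ys_a, ys_b):
    """Average absolute difference between matched Y values
    from two datasets measured at the same X positions."""
    assert len(ys_a) == len(ys_b), "datasets must be matched"
    return sum(abs(a - b) for a, b in zip(ys_a, ys_b)) / len(ys_a)

# Invented matched Y values from the two models
print(mean_pairwise_distance([5.0, 4.0, 3.0], [4.5, 3.8, 2.6]))
```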


  • Registered Users, Registered Users 2 Posts: 762 ✭✭✭PGL


    Hi guys.

    Many thanks for the replies and apologies for not getting back to you sooner.

    I can't really explain the situation any clearer than I did in the original post, i.e. in very simple terms, I am trying to replicate the results of Joe Bloggs' model by incorporating as many of the known input assumptions, modelling methodology, etc. that Joe Bloggs would have used.

    As a result, I am left with a line graph coming out of my model, and I want to mathematically determine how well it correlates with Joe Bloggs' line graph.

    My aim is to validate my model against a different model by Joe Bloggs - as I said, I've tried to replicate as many input assumptions as possible, but because the models are different, obviously the results will not be exactly the same.

    I need to find a mathematical method of quantifying how well my model results correlate with Joe Bloggs'.

    Anonymo is heading in the right direction by suggesting the correlation should relate to a comparison of the slopes, while also considering distances between the lines or points.

    Obviously, the more similar the lines are in terms of slope and distance, the better the correlation. I want to find a method of quantifying what a good correlation is...

    Hope I've explained myself better this time!


  • Registered Users, Registered Users 2 Posts: 1,922 ✭✭✭fergalr


    PGL wrote: »
    Hi guys.

    Many thanks for the replies and apologies for not getting back to you sooner.

    I can't really explain the situation any clearer than I did in the original post i.e. in very simple terms I am trying to replicate the results of Joe Blogg's model, by trying to incorporate as many known input assumptions, modelling methodology etc that Joe Bloggs would have used.

    As I result I am left with a line graph coming out of my model, and I want to mathematically determine how well it correlates with Joe Bloggs line graph.

    I guess from this that you are assuming, or that you know, that both your model and Joe's model are 2-dimensional linear models.

    I.e. 2D lines, so you can represent them by: y = mx + c

    The lines are fitted to pairs of numbers, (x,y)

    E.g. (2,4),(3,6)

    Correct?


    The next question is, do you know whether or not your axes have the same scale as Joe's?

    Let's say the dimensions are human height and human weight.
    Is there any chance that you are using metres and kilograms, whereas Joe is using inches and pounds?
    Is it possible that Joe has one of his axes on a log scale, whereas both of your axes are linear?

    Or are you willing to assume that your axes are the same?

    (If you get a line that looks very similar to Joe's line, but one of your axes is logged while Joe's isn't, then that doesn't necessarily mean you have a model similar to Joe's.)

    If you can assume that the axes are the same, it makes your life simpler, and you can measure the difference in a way that makes you more sure of your conclusions.
    PGL wrote: »
    My aim is to validate my model against a different model by Joe Bloggs - as I said I've tried to replicate as many input assumptions as possible, but due to the fact that the models are different, obviously the results will not be exactly the same.

    I need to find a mathematical method of quantifying how well my model results correlate with Joe Bloggs.

    Anonymo is heading in the right direction by suggesting the correlation should relate to a comparison of the slopes, while also considering distances between the lines or points.


    Another issue with just trying to compare two linear models is: what is the range of the data that you care about the models predicting?

    If you are trying to predict human weight, given human height, then what matters is how similar your model is to Joe's in the range of human heights you are likely to see. I.e. what matters is how similar the models are when X, measured in meters, is between 0 and 3.

    It doesn't matter much how similar the models are when X is between 300 and 303.

    This is why we asked about comparing the data points, rather than just comparing the lines. There's a bit more information in the data points.

    If you just want to compare the lines, and you don't know or don't want to assume anything about the range the model is used for or the scaling of the axes, then you are back to doing something very simple, like just comparing the slopes of the lines.

    If you can assume or know more, then you need to say what your assumptions are to figure out what sort of measurement to use. Most statistical measurements come with a set of assumptions that have to be fulfilled for the output of the test to be useful.

    PGL wrote: »
    Obviously the more that lines are similar in terms of slope and distance, the better the correlation is. I want to find a method of quantifying what a good correlation is....

    Another way of setting up the problem might be if you told us what the models were used for.

    If you had a set of representative data from a third party that you wanted the models to predict (data which neither you nor Joe had used to fit your models), then you could use your model and Joe's model to try to predict the Y value for each of the third party's data points, and quantify the difference in predictions.
    How well each of the models predicted this third-party data could be used to compare the models.

    If you can set up a good test like this, it's often the best way of comparing models, and the least prone to relying on an assumption that isn't valid. (Admittedly, it can be hard to set up such a test in many domains.)
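    (A sketch of that held-out comparison, with made-up coefficients for the two fitted lines and made-up third-party X values.)

```python
# Made-up fitted coefficients (y = m*x + c) for the two models
m_yours, c_yours = 1.9, 0.4
m_bloggs, c_bloggs = 2.0, 0.1

# Representative third-party X values that neither model was fitted on
xs_test = [1.0, 2.5, 4.0, 5.5]

# Average absolute difference between the two models' predictions
diffs = [abs((m_yours * x + c_yours) - (m_bloggs * x + c_bloggs))
         for x in xs_test]
mean_diff = sum(diffs) / len(diffs)
print(mean_diff)
```

    (Whether that average difference counts as 'close' still depends on the domain, as discussed below.)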


  • Registered Users, Registered Users 2 Posts: 762 ✭✭✭PGL


    fergalr wrote: »
    I guess from this that you are assuming, or that you know, that both your model and Joes model are 2 dimensional linear models.

    I.e. 2D lines. So you can represent them by: y = mx +c

    The lines are fitted to pairs of numbers, (x,y)

    E.g. (2,4),(3,6)

    Correct?

    Yes, both models produce 2D linear models.

    fergalr wrote: »
    The next question is, do you know whether or not your axes have the same scale as Joe's?

    Yes, both models' axes have the same scale. The X axis is years and the Y axis is %. Hence the X axis figures will be the same for both models; it is the Y axis % figures that I am trying to correlate as closely as possible with Joe Bloggs'.

    To clarify, Joe Bloggs' model results are tried and trusted. I am trying to get my model to correlate with Joe Bloggs' as closely as possible. Hence there is no requirement for third-party representative data.

    Obviously, if my model produces a line which has a very similar slope and is positioned close to Joe Bloggs' line, then my model will generally be correlating closely with his. However, my question is: how does one quantify what a good correlation is? I.e. is there some calculation which measures how well my line correlates with the target (Joe Bloggs) line? I would hope this calculation stipulates a factor or number which classifies an acceptable correlation, e.g. a value of less than or equal to 1 is an acceptable correlation, while a value greater than 1 is not...

    Hopefully this provides more clarity to my query.

    Thanks for your help


  • Registered Users, Registered Users 2 Posts: 1,922 ✭✭✭fergalr


    PGL wrote: »
    Yes, both models produce 2d linear models

    Yes both model's axes have the same scale. The X axis is Years and the Y axis is %. Hence the X axis figures will be the same for both models; it is the Y axis % figures that I am trying correlate as closely as possible with Joe Bloggs.

    That simplifies things.
    PGL wrote: »
    To clarify, Joe Blogg's model results are tried and trusted. I am trying to get my model to correlate with Joe Bloggs as closely as possible. Hence there is no requirement for 3rd party representative data.

    Obviously if the results from my model produces a line which has a very similar slope and is positioned close to Joe Bloggs' line, then my model will be generally correlating closely with his. However my question is: How does one quantify what is a good correlation? i.e. is there some calculation which measures how well my line correlates with the target (Joe Bloggs) line?

    There are different ways you could quantify the 'closeness'.

    One of the best ways would be this: if you have a representative set of third-party X data, then for each of those Xs, calculate the Y that your model and Bloggs' model predict, and quantify the difference between your model's and Bloggs' Y values. If the difference is small in your domain, then you've got a good fit.

    But note that what counts as 'small' or 'acceptable' depends on the domain.

    There'd be some domains where, if the two models agree on the test data within an average of 20%, it's good; and others where even a 5% difference would be bad.


    If you don't have representative data, but you have an idea of the range of X values that it's reasonable to use the models for, you could calculate the (absolute) area between the two lines across that X interval, divided by the width of the interval.

    That'd be another way of quantifying the difference. Again, how you interpret it depends on the domain.
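    (A numeric sketch of that area measure, using a simple midpoint approximation; the slopes, intercepts and range are made up.)

```python
def mean_abs_gap(m1, c1, m2, c2, x_lo, x_hi, steps=1000):
    """Absolute area between two lines over [x_lo, x_hi],
    divided by the interval width (midpoint-rule approximation)."""
    width = x_hi - x_lo
    h = width / steps
    total = 0.0
    for i in range(steps):
        x = x_lo + (i + 0.5) * h  # midpoint of the i-th sub-interval
        total += abs((m1 * x + c1) - (m2 * x + c2)) * h
    return total / width

# Two made-up lines with the same slope, offset by a constant 0.5,
# so the average gap over any interval is 0.5
print(mean_abs_gap(2.0, 1.0, 2.0, 0.5, 0.0, 10.0))
```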


    After that, you are into doing something like just looking at the slopes.

    PGL wrote: »
    I would hope this calculation stipulates a factor or number which classifies an acceptable correlation e.g. a value of less than or equal to 1 is an acceptable correlation, while a value greater than 1 is not an acceptable correlation.....

    There's no general way to say that one value of a test statistic is acceptable while another is not, in an absolute sense. It depends on what you are trying to do, which is subjective.

    Even the level of p-value people consider significant (e.g. 0.05) is just convention. There are situations where that wouldn't be acceptable.

    Equally, if you look at a correlation coefficient, there'd be some experiments where a small but significant correlation would be a huge success.


    Can you talk about the domain that you are working in? Probably if you just talk to someone else in the domain, they'll give you a good idea of what counts as 'reasonably similar'. It's difficult to talk about these things in absolutes.


  • Registered Users, Registered Users 2 Posts: 762 ✭✭✭PGL


    Hi fergalr

    I've had another think about this and I'm starting to come to the conclusion that "correlate" is the wrong word, and instead think that "accuracy" is the right term.

    Accuracy is the degree of conformity of a measured or calculated quantity to its actual (true) value. The accuracy of an experiment/object/value is a measure of how closely the experimental results agree with a true or accepted value.

    A bit of googling suggests that calculating percentage error is one method of quantifying accuracy, with some sources saying that up to 10% error is acceptable, while others say up to 5%.

    I wonder, am I heading in the right direction? If so, do you know how best to quantify acceptable accuracy? If it helps, my analysis relates to a term in electrical engineering called curtailment, where the output of generators is reduced for system security reasons.

    Thanks again for your help on this.


  • Registered Users, Registered Users 2 Posts: 1,922 ✭✭✭fergalr


    PGL wrote: »
    Hi fergalr

    I've had another think about this and I'm starting to come to the conclusion that "correlate" is the wrong word, and instead think that "accuracy" is the right term.

    Accuracy is the degree of conformity of a measured or calculated quantity to its actual (true) value. The accuracy of an experiment/object/value is a measure of how closely the experimental results agree with a true or accepted value.

    Well, those are just names - they can have specific meanings, but I wouldn't get too hung up on a particular definition of accuracy.
    PGL wrote: »
    A bit of googling suggests that calculating percentage error is one method of quantifying accuracy, with some sources saying that up to 10% percentage error acceptable, while others state up to 5%.

    I wonder am I heading in the right direction? If so do you know how best to quantify acceptable accuracy? If it helps, my analysis relates to a term in electrical engineering called curtailment, which is where the output of generators is reduced for system security reasons.

    Thanks again for your help on this.

    Here is what I would suggest.

    This is my best guess; if possible, find out the standard in your field, or run whatever you decide past someone in your field, to see if it's something that makes sense to them.


    Get your two lines.

    Pick a valid range of Xs to look at. E.g. X between 10 and 20.
    This should be your best estimate of the range that the model would be used for.

    I.e. the range of Xs that someone would like to use the model to predict the Ys for.


    Then calculate the percentage difference between your line and Bloggs' line across that range.

    One simple way of doing it approximately:

    (Where Y_your(X_i) denotes the Y value that your line gives for the point X_i, and Y_bloggs(X_i) denotes the Y value that Bloggs' line gives for the point X_i.)

    1) Break the X range up into 100 or 1000 equally spaced intervals.
    2) For each X_i value that starts an interval:
    . 3) Get Y_your(X_i) and Y_bloggs(X_i).
    . 4) Calculate the absolute value of (Y_bloggs(X_i) - Y_your(X_i)). Let's call this the absolute error at X_i.
    . 5) Divide the absolute error at X_i by Y_bloggs(X_i), and multiply by 100. This gives you the error as a percentage of Y_bloggs(X_i).
    6) Take the average of these percentage errors.

    Report that average as the average difference between Bloggs' regression line and yours, across the range considered, as a percentage of Y_bloggs(X_i).
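    (The steps above could be sketched in Python as follows; the slopes, intercepts and X range are made up, and Y_bloggs is assumed to be non-zero over the range so the percentage is well defined.)

```python
def mean_pct_error(m_your, c_your, m_bloggs, c_bloggs, x_lo, x_hi, steps=1000):
    """Average of |Y_bloggs(X_i) - Y_your(X_i)| / Y_bloggs(X_i) * 100
    over equally spaced X_i values across [x_lo, x_hi]."""
    h = (x_hi - x_lo) / steps          # step 1: equally spaced intervals
    errors = []
    for i in range(steps):             # step 2: each X_i starting an interval
        x = x_lo + i * h
        y_your = m_your * x + c_your             # step 3
        y_bloggs = m_bloggs * x + c_bloggs
        abs_err = abs(y_bloggs - y_your)         # step 4
        errors.append(abs_err / y_bloggs * 100)  # step 5
    return sum(errors) / len(errors)   # step 6: average percentage error

# Made-up lines compared over X in [10, 20]
print(mean_pct_error(1.0, 0.5, 1.0, 0.0, 10.0, 20.0))
```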

    I think this would be a reasonable way of reporting the difference between the two, if you are determined to compare the two regression lines. Especially if you are clear about what you are calculating, I'd imagine people would be happy enough with it.


    If that's within your 5 or 10%, then you are probably ok.

    But show the final product to someone who is experienced at this stuff in your field, just to be sure.

    That's all I can say; hope this helps. It's basically similar to the approaches Anonymo mentioned.


  • Registered Users, Registered Users 2 Posts: 762 ✭✭✭PGL


    Hi fergalr

    Fair play to you for your time and advice, which make sense to me.

    As you suggested, I will run this approach by some people in the industry.

    Thanks!

