
Correlation Doesn't Imply Causation


Comments

  • Moderators, Category Moderators, Science, Health & Environment Moderators, Society & Culture Moderators Posts: 47,226 CMod ✭✭✭✭Black Swan


    Correlation Doesn't Imply Causation: One of the fundamental rules to be aware of when analysing a dataset
    I would exercise caution before labeling this a "fundamental rule," a phrase that can be misleading to the point of becoming an overused cliché in non-scientific circles. Would it not be better stated that correlation is a necessary but insufficient condition for establishing causation; i.e., that other conditions (in addition to correlation) must be present and in agreement before we can estimate causation?

    Certainly we can have correlations that are spurious and sometimes humorous, but such correlations would not be in agreement with the other conditions needed to be sufficient to estimate causation. In addition to these correlation cautions, I would use the term causation itself with care, given that establishing causation can be problematic; i.e., more often than not we can only suggest causation with the scientific method, not prove it.
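
    To make the spurious correlation caution concrete, here is a minimal Python sketch (the data are invented purely for illustration): two series that merely share a time trend, with no causal connection whatsoever, still produce a large Pearson correlation.

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.arange(100)

    # Two unrelated quantities that each happen to drift upward over time.
    a = 0.5 * t + rng.normal(0, 5, size=t.size)
    b = 0.3 * t + rng.normal(0, 5, size=t.size)

    # The shared trend alone is enough to produce a strong correlation.
    r = np.corrcoef(a, b)[0, 1]
    print(f"Pearson r between two causally unrelated series: {r:.2f}")

    Neither series influences the other; the correlation is real, but it tells us nothing about causation.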


  • Moderators, Society & Culture Moderators Posts: 12,521 Mod ✭✭✭✭Amirani


    Yep, can't disagree with any of that. I rather lazily phrased my thread title. I should instead have said that it doesn't automatically imply causation.


  • Closed Accounts Posts: 8,061 ✭✭✭keith16


    Good link, Diego.

    I guess when trying to look at causation, the hypothesis usually is somewhat reasonable. In my experience with limited data, most people know the question and the answer the data will likely give them.

    I guess it's the role of data mining to find answers to the questions no-one asked, such as whether drownings increase when Nicolas Cage releases a film :D


  • Moderators, Category Moderators, Science, Health & Environment Moderators, Society & Culture Moderators Posts: 47,226 CMod ✭✭✭✭Black Swan


    keith16 wrote: »
    I guess when trying to look at causation, the hypothesis usually is somewhat reasonable. In my experience with limited data, most people know the question and the answer the data will likely give them.
    Caution should be exercised when attempting to establish the necessary and sufficient conditions for causation, especially when hypothesizing relationships. By scientific method convention we do not test the research hypothesis (or the notion that "most people know the question and the answer the data will likely give them"); rather, we test the null hypothesis of no effect. If the null is rejected, then support is suggested for the research hypothesis. It's just too easy to fall into the trap of attempting to find support for what we believe to be true, potentially ignoring contrary evidence.
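
    As a minimal sketch of what testing that null looks like in practice (invented data, and a simple permutation test rather than any one textbook procedure):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=200)
    y = 0.3 * x + rng.normal(size=200)  # a weak real relationship

    observed = abs(np.corrcoef(x, y)[0, 1])

    # Under the null of no association, shuffling y should do no harm:
    # how often does chance alone produce a correlation this large?
    null_rs = [abs(np.corrcoef(x, rng.permutation(y))[0, 1])
               for _ in range(10_000)]
    p_value = np.mean([r >= observed for r in null_rs])

    print(f"observed |r| = {observed:.3f}, permutation p = {p_value:.4f}")

    A small p-value rejects the null and suggests support for the research hypothesis; it does not, on its own, establish causation.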

    The "hypothesis" approach is deductive, and not the only way we approach data mining. Sometimes we are inductive, beginning with an exploration of the data without the test of an hypothesis. Occasionally we may discover patterns in the data that can be used to form meaningful empirical generalisations, while other times we find spurious and sometimes humourous relationships.


  • Registered Users Posts: 11 riasc


    The trick with analytics is to follow the scientific method.

    Data on its own will only get you so far; you need to have context. The context will come in several guises, all of which are necessary. Typically you need a model of the system, and each operational state needs its own model, because structurally the model can change drastically as states change: for instance, from normal operations to emergency operations, from normal risk to catastrophic risk, etc.

    Many get a limited or wrong answer because they don't understand the system they are analysing and do not consult those who do. Data analysis will only get you so far before you hit the buffer of finding relationships that you do not understand and therefore cannot progress further.

    To do analytics properly you need:

    Good data
    A system model
    Competence so you can join the two


  • Registered Users Posts: 1,922 ✭✭✭fergalr


    This is a hideously complex topic, depending on the level you take it.

    I'm not sure we've gotten to the bottom of it yet, really; but I don't know for certain, as my knowledge wouldn't be at the cutting edge.


    Like, clearly you have to be careful with spurious correlations; if that's what we mean by the phrase, then fine.

    But there's another level at which, as humans, we don't ever really have anything other than correlation to work from; any knowledge of 'causation' we see is always just inferred from previous correlations; if you can't use correlations to show causation, then what does the word even mean?


  • Closed Accounts Posts: 116 ✭✭metrosity


    Eh ..

    That actually is a rule of statistics, according to reputable textbooks.

    The rule is very fundamental to Statistics, and thus the data analysis and machine learning that builds atop of it:

    Basically, in real-world applied statistics, there are "observational experiments" whereby you measure the value of the response variable without attempting to influence the value of either the response or explanatory variables. In these experiments there are too many lurking variables to account for, so you just get a correlation, nothing more. Otherwise 'confounding' happens: where the effects of two or more explanatory variables are not separated.

    Only in "designed experiments", are you professionally able to assert causation. In a designed experiment you have control over the explanatory variables and their affect on the response variable so you get a clearer picture of their influence i.e. you build the model more than observing it. "Lurking variables" should be accounted for in other words. This is just the tip of the iceberg.

    Only graduates with a solid grounding in Statistics "should" work in the Data Analysis sector, or it's just another bubble waiting to burst or be found out, if that growing market is to be pumped with people who BS their credentials.


  • Registered Users Posts: 1,922 ✭✭✭fergalr


    metrosity wrote: »
    Eh ..

    'Eh' yourself :-)
    metrosity wrote: »
    That actually is a rule of statistics, according to reputable textbooks.

    The rule is very fundamental to Statistics, and thus the data analysis and machine learning that builds atop of it:

    Basically, in real-world applied statistics, there are "observational experiments" whereby you measure the value of the response variable without attempting to influence the value of either the response or explanatory variables. In these experiments there are too many lurking variables to account for, so you just get a correlation, nothing more. Otherwise 'confounding' happens: where the effects of two or more explanatory variables are not separated.

    Only in "designed experiments", are you professionally able to assert causation.

    So, you don't believe smoking causes cancer then?

    After all, I don't know of any randomised controlled trials where people are randomly assigned ahead of time to become smokers or not, and examined 30 years later on for cancer.


    Rather, the studies on smoking are fundamentally observational in nature.

    But contrary to your claims, I think most statisticians would agree that there's sufficient evidence to professionally assert causation on that question.

    It's not simple, certainly, and it was controversial for a long time (e.g. Fisher); but it's not correct of you to say that professional standards don't allow people to assert causation using observational methods.


    More prosaically, do you believe that having one's head chopped off causes death? Ever heard of a randomised controlled trial on it? No? How do you square that?
    metrosity wrote: »
    In a designed experiment you have control over the explanatory variables and their effect on the response variable, so you get a clearer picture of their influence; i.e. you build the model rather than merely observing it. "Lurking variables" should be accounted for, in other words. This is just the tip of the iceberg.


    Yes, it's a complicated topic, as I said.

    But you are being too absolute in the statements you are making.


    Finally, even if you do your RCT, and you see that, for each sample where you increased X, as X increases by M, Y increases reliably by 2M.

    What do you do then? Do you conclude that increasing X causes Y to increase? (subject to statistical significance etc) That's generally what people do. (If they are good scientists, after a lot of soul searching for other explanations, additional data, additional factors etc.) But fundamentally, if the trend remains consistent, they eventually conclude increasing X causes Y to increase.

    But if you look at what is ACTUALLY happening there, it's using repeated correlations to establish causation.

    So, while simple correlation is in many cases insufficient to establish causation, if you look at cases where causation eventually is established, it's often as a result of mounting evidence due to repeated correlations.


    Finally, you might find this interesting reading (or mind blowing), coming at the topic from yet another angle:
    http://lesswrong.com/lw/ev3/causal_diagrams_and_causal_models/

    metrosity wrote: »
    Only graduates with a solid grounding in Statistics "should" work in the Data Analysis sector, or it's just another bubble waiting to burst or be found out, if that growing market is to be pumped with people who BS their credentials.

    Oh, come on.

    Yes, there's a lot of hype in 'data analysis' and a lot of BS and people who don't know what they are talking about.

    No, that doesn't mean only 'graduates' can work 'in the sector', or that said 'graduates' need to have a solid grounding in statistics (i.e. be statisticians; I honestly don't believe I know of any other class of graduate that reliably finishes a primary degree with a solid grounding in statistics).
    But I'd definitely agree that more statisticians, or people with advanced training in statistics (i.e. who know some of the basics), would be good.


  • Closed Accounts Posts: 116 ✭✭metrosity


    fergalr wrote: »
    'Eh' yourself :-)



    So, you don't believe smoking causes cancer then?

    After all, I don't know of any randomised controlled trials where people are randomly assigned ahead of time to become smokers or not, and examined 30 years later on for cancer.

    That's one example, but the easiest one to pick - and cancer rates are steadily increasing anyway. (Be more specific, btw; you kind of have to be in this area: it's lung cancer, for example, that you'd want to correlate with.) We're not saying 'correlation' is useless as a tool. It's just not definitive.
    fergalr wrote: »
    Rather, the studies on smoking are fundamentally observational in nature.

    But contrary to your claims, I think most statisticians would agree that there's sufficient evidence to professionally assert causation on that question.

    No; most (I would pray all) statisticians behave like statisticians. They take the correlation as circumstantial evidence, if you will. Grounds for further studies.
    fergalr wrote: »
    It's not simple, certainly, and it was controversial for a long time (e.g. Fisher); but it's not correct of you to say that professional standards don't allow people to assert causation using observational methods.
    Yes, it is. Please read some of the countless texts on the matter, if you please. You have to be very careful with observational experiments. You just cherry-picked smoking and cancer, but if we apply that logic to everything (setting a precedent) then you end up with a complete mess. More often than not, there appears to be causation, but when you look deeper there are lurking variables. It's just not the done thing. Sure, if it's something non-critical like whether to gamble on a stock or not, then you can use a more liberal statistical model for your data, but in healthcare and other more serious areas, there is a tendency to be more conservative, and gather more data. Are there unethical studies where they get monkeys to smoke in a designed experiment? I'm not sure. You see, if someone is a smoker, what else are they likely to be? Have a guess? Then suddenly your causation is torn to shreds.
    fergalr wrote: »
    More prosaically, do you believe that having ones head chopped off causes death? Ever heard of a randomised controlled trial on it? No? How do you square that?

    I'm not sure I should even justify that with an answer, but Statistics is not always concerned with brute verifiable facts. It's a science of approximation. There is always room for error. That's why we have things like confidence intervals. We would be 100% confident that the guy died of having his head chopped off.

    fergalr wrote: »
    Yes, it's a complicated topic, as I said.

    But you are being too absolute in the statements you are making.
    It's called 'Statistics'. Don't blame me. I wasn't claiming to be the Father of Statistics, just a student.
    fergalr wrote: »
    Finally, even if you do your RCT, and you see that, for each sample where you increased X, as X increases by M, Y increases reliably by 2M.

    What do you do then? Do you conclude that increasing X causes Y to increase? (subject to statistical significance etc) That's generally what people do. (If they are good scientists, after a lot of soul searching for other explanations, additional data, additional factors etc.) But fundamentally, if the trend remains consistent, they eventually conclude increasing X causes Y to increase.

    But if you look at what is ACTUALLY happening there, it's using repeated correlations to establish causation.
    That's something for you to write a paper on, then, not something to try to change overnight. These methods weren't dreamt up overnight, and until we have something better we use what we have. Place some trust in the people who went before you, and show some respect for them, I say. There are some nuances to the probability side of things I find a bit puzzling, but only at very small values. If I studied further it would probably make sense, but I trust it.
    fergalr wrote: »
    So, while simple correlation is in many cases insufficient to establish causation, if you look at cases where causation eventually is established, it's often as a result of mounting evidence due to repeated correlations.
    We have a name for that too. It's called empirical analysis. If I flip a coin 10 times, the data might look skewed; I might think heads is more likely. If I flip it a million times, it will be more accurate. You want as big a sample as you can get, whatever methods you use.
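
    A quick Python sketch of that coin example (a fair coin assumed):

    import numpy as np

    rng = np.random.default_rng(3)
    for n in (10, 1_000, 1_000_000):
        heads = rng.integers(0, 2, size=n).sum()  # 0 = tails, 1 = heads
        print(f"{n:>9,} flips: proportion heads = {heads / n:.4f}")

    The small sample can easily look skewed; the million-flip proportion sits very close to 0.5.
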
    fergalr wrote: »
    Finally, you might find this interesting reading (or mind blowing), coming at the topic from yet another angle: [can't include URLs]
    That looks pretty interesting, fergalr, but I do think you're more interested in the theory than the practice - though I wouldn't bet on it, just speculating. You might be interested in researching the area. If someone working in data analysis comes across a situation where it's difficult to distinguish between the cause and the effect... well, it's rare. Quantum mechanics throws a spanner into the works, but I don't know many people doing statistical analysis of any kind in that area. It would be a rare example. It's a more esoteric topic which goes far beyond the statistics of mere mortals.

    fergalr wrote: »

    Oh, come on.

    Yes, there's a lot of hype in 'data analysis' and a lot of BS and people who don't know what they are talking about.

    No, that doesn't mean only 'graduates' can work 'in the sector', or that said 'graduates' need to have a solid grounding in statistics (i.e. be statisticians; I honestly don't believe I know of any other class of graduate that reliably finishes a primary degree with a solid grounding in statistics).
    But I'd definitely agree that more statisticians, or people with advanced training in statistics (i.e. who know some of the basics), would be good.


    I studied CS in college, and stats and probability was my best subject. I'm studying it again now, only this time more applied and work-oriented. There's lots I didn't know and still don't know. I want to know more, and I think if there's one area where you really should have a solid grounding in Statistics, it's data analysis and machine learning. Indeed, no self-respecting postgrad programme in those areas would let you in without refreshing your statistics.

    I don't agree that you need a PhD in it to get the top jobs, even though a lot of the top jobs I see in America look for that; but you should have studied it in a degree at least, preferably recently and preferably practice-oriented, which means you might at the very least want to do a refresher course in it.

    __My only real concern on a personal level is that it could be another addition to the "jobs for the boys" culture__ - where well-connected people from rich families get their son or daughter into a Data Analysis job without the skills, but with the connections. I think we have enough of that in this country...

    To work as a Data Analyst, you really should have some qualifications relevant to the field, as well as the practical programming skills - in SAS, R, SSRS, Python etc. If you don't have any idea about Statistics, the latter may as well be fancy names for snakes or something. They go together.


  • Registered Users Posts: 1,922 ✭✭✭fergalr


    metrosity wrote: »
    That's one example, but the easiest one to pick - and cancer rates are steadily increasing anyway. (Be more specific, btw; you kind of have to be in this area: it's lung cancer, for example, that you'd want to correlate with.) We're not saying 'correlation' is useless as a tool. It's just not definitive.

    No. You made a general statement:
    metrosity wrote: »
    Only in "designed experiments", are you professionally able to assert causation.

    I gave a counter-example. And not a trivial or merely pedantic one, either.


    Your general statement is now proven wrong.

    I'm glad you think the counter-example I chose was the 'easiest one to pick'. That means I chose an accessible counter-example. Which doesn't change the fact your point has been proved wrong.
    metrosity wrote: »
    (Be more specific, btw; you kind of have to be in this area: it's lung cancer, for example, that you'd want to correlate with.)
    Yawn :-)
    metrosity wrote: »
    We're not saying 'correlation' is useless as a tool. It's just not definitive.
    No; you said statisticians couldn't assert causation without designed experiments; but clearly they do in the case of, e.g. smoking causing cancer.

    We eventually accept enough correlation evidence, when we see it in different places and after a lot of checks, as becoming definitive.

    metrosity wrote: »
    No; most (I would pray all) statisticians behave like statisticians.
    meaningless tautology.
    metrosity wrote: »
    They take the correlation as circumstantial evidence, if you will. Grounds for further studies.
    Sure, initially, yeah - and as I said, this is fraught stuff. But, eventually, with enough correlations, again and again, and after removing other competing explanations, we begin to accept it as evidence for causation. With enough such evidence we tend to accept it.


    metrosity wrote: »
    Yes, it is. Please read some of the countless texts on the matter, if you please.
    What texts? Have you read them?

    All I know is that several statistics books I've read have mentioned Fisher (the famous statistician) opposing the 'smoking causes cancer' hypothesis, and have claimed he was in error. (Example paper: http://www.epidemiology.ch/history/PDF%20bg/Stolley%20PD%201991%20when%20genius%20errs%20-%20RA%20fisher%20and%20the%20lung%20cancer.pdf)

    It's my understanding that most statisticians would accept there's enough evidence to believe in a causal link between smoking and cancer, even without controlled studies.
    metrosity wrote: »
    You have to be very careful with observational experiments.

    I said that in all of my posts here.
    metrosity wrote: »
    You just cherry-picked smoking and cancer, but if we apply that logic to everything (setting a precedent) then you end up with a complete mess.

    That quote there is an example of what is known as the 'slippery slope fallacy'.
    It is an error in reasoning.

    I gave a counter-example, which was logically sound.

    If we applied the logic of the counter-example to everything, including places it wasn't suitable, we would end up in a complete mess, yes. However, there is no suggestion we are going to do that. As such the logic of the counter-example remains sound.

    metrosity wrote: »
    More often than not, there appears to be causation, but when you look deeper there are lurking variables.
    Agreed.
    metrosity wrote: »
    It's just not the done thing.
    ??
    Not 'the done thing'? You make it sound like it's impolite! Who cares?
    There are areas of scientific analysis where the only data you have is observational.
    That doesn't mean you just give up. You use the tools you have, you're careful, and you make progress towards the truth. It might be slower or harder than if you had controlled trials, but that doesn't mean you don't do it.
    metrosity wrote: »
    Sure, if it's something non-critical like whether to gamble on a stock or not, then you can use a more liberal statistical model for your data, but in healthcare and other more serious areas, there is a tendency to be more conservative, and gather more data.
    Which is why, in the absence of controlled trials, healthcare professionals conservatively say they have no opinion on whether people should smoke or not.

    Wait a minute! No, they don't! Rather, they conclude from plentiful observational evidence that it's so likely smoking causes cancer (effect sizes, correlations, natural experiments etc.) that they actively tell people not to smoke.
    metrosity wrote: »
    Are there unethical studies where they get monkeys to smoke in a designed experiment? I'm not sure. You see, if someone is a smoker, what else are they likely to be? Have a guess? Then suddenly your causation is torn to shreds.
    That's the kind of thing Fisher used to say: that perhaps smoking acted to relieve the symptoms of lung cancer, which is why so many smokers were observed to have lung cancer.

    But, actually, that doesn't tear the causation to shreds at all, because there's enough observational evidence to overcome those things. It's hard to be completely 100% certain, but it's still overwhelming.

    metrosity wrote: »
    I'm not sure I should even justify that with an answer, but Statistics is not always concerned with brute verifiable facts. It's a science of approximation. There is always room for error.
    OK, if you need uncertainty, then do you think playing Russian roulette is bad for your health?

    Arguments that work in the presence of uncertainty should also work if you replace the uncertainty with certainty. If they don't, chances are something is flawed in the argument (and this is a good check to do.)
    metrosity wrote: »
    That's why we have things like confidence intervals. We would be 100% confident that the guy died of having his head chopped off.



    It's called 'Statistics'. Don't blame me. I wasn't claiming to be the Father of Statistics, just a student.


    That's something for you to write a paper on, then, not something to try to change overnight. These methods weren't dreamt up overnight, and until we have something better we use what we have. Place some trust in the people who went before you, and show some respect for them, I say.

    I'm not saying anything unorthodox here; I think you are just misunderstanding some things.

    I disagree with your general attitude that we should 'place trust' and 'show respect' etc. Obviously, we stand on the shoulders of those who came before, but we don't give anyone a free pass.

    You seem pretty keen on respecting our forebears, our polite betters, or whatever you want to call them. I thus refer you to the motto of the most learned and established and posh royal society:

    http://en.wikipedia.org/wiki/Nullius_in_verba
    metrosity wrote: »
    There are some nuances to the probability side of things I find a bit puzzling, but only at very small values. If I studied further it would probably make sense, but I trust it.


    We have a name for that too. It's called empirical analysis. If I flip a coin 10 times, the data might look skewed; I might think heads is more likely. If I flip it a million times, it will be more accurate. You want as big a sample as you can get, whatever methods you use.


    That looks pretty interesting, fergalr, but I do think you're more interested in the theory than the practice - though I wouldn't bet on it, just speculating. You might be interested in researching the area.
    I'm on the quite applied side of things: machine learning, putting Bayesian stuff into practice, etc.
    metrosity wrote: »
    If someone working in data analysis comes across a situation where it's difficult to distinguish between the cause and the effect... well, it's rare.
    No, it's common in my experience.
    metrosity wrote: »
    Quantum mechanics throws a spanner into the works, but I don't know many people doing statistical analysis of any kind in that area. It would be a rare example. It's a more esoteric topic which goes far beyond the statistics of mere mortals.

    I studied CS in college, and stats and probability was my best subject.
    Unless your CS course was a lot more stats-oriented than mine, you'll have learned almost no stats during it.
    You've heard of 'knowing enough to be dangerous'? Typically CS grads don't even know that much.
    I reckon stats is one of the most underestimated areas of knowledge - it takes years to get a grip on it, and most non-statisticians who have had a couple of courses (even grad students) have a few 'cookbook'-style methods they use, but understand nothing.

    metrosity wrote: »
    I'm studying it again now, only this time more applied and work-oriented. There's lots I didn't know and still don't know. I want to know more, and I think if there's one area where you really should have a solid grounding in Statistics, it's data analysis and machine learning.
    Agreed.
    metrosity wrote: »
    Indeed, no self-respecting postgrad programme in those areas would let you in without refreshing your statistics.
    Not agreed! No way, like.

    metrosity wrote: »
    I don't agree that you need a PhD in it to get the top jobs, even though a lot of the top jobs I see in America look for that; but you should have studied it in a degree at least, preferably recently and preferably practice-oriented, which means you might at the very least want to do a refresher course in it.

    __My only real concern on a personal level is that it could be another addition to the "jobs for the boys" culture__ - where well-connected people from rich families get their son or daughter into a Data Analysis job without the skills, but with the connections. I think we have enough of that in this country...

    Of all the sectors where there'll be 'jobs for the boys' or connected rich families getting their children jobs - well, I wouldn't be too worried that Data Analysis will be particularly vulnerable to that.

    That said, I've seen some pretty funny presentations from some of the more management consulting types in this area :-)
    metrosity wrote: »
    To work as a Data Analyst, you really should have some qualifications relevant to the field, as well as the practical programming skills - in SAS, R, SSRS, Python etc. If you don't have any idea about Statistics, the latter may as well be fancy names for snakes or something. They go together.

    Not sure if you are referring to me, but I've a lot of practical skills; I'm a pretty fluent Python programmer, to start with.


    If I can give you a suggestion, some books I'd recommend:
    Comparative Statistical Inference by Barnett
    Doing Bayesian Data Analysis by Kruschke

    You can spend a long time trying to learn stats; with a few good books it's much easier. Most of the books I've come across are not good.


  • Registered Users Posts: 5,949 ✭✭✭A Primal Nut


    It's interesting that in the article linked in the OP, the two things that are correlated are not connected at all. But even where there would seem to be a connection, correlation doesn't mean there is one.

    For example, imagine it becomes commonly accepted, without evidence, that a certain food makes you a lot healthier, but it is not particularly tasty or cheap. The only people who take it are going to be those who care about their health, whereas people who don't take it are a mix of people who care about their health and those who don't. Twenty years later, studies are done which show people who take this food regularly are healthier or less likely to get cancer than those who don't. But that could be because the kind of people who take something thought to be healthy are the kind of people who live healthier lifestyles all round; it's not necessarily because of this one food. I see this all the time with "xxx causes cancer" claims that don't take other factors into account.
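
    A small simulation of that scenario (all the probabilities are made up) shows the gap appearing even when the food's true effect is zero:

    import numpy as np

    rng = np.random.default_rng(4)
    n = 100_000

    health_conscious = rng.random(n) < 0.5
    # Only the health-conscious tend to eat the food; the food does nothing.
    eats_food = health_conscious & (rng.random(n) < 0.8)
    # Health depends on lifestyle, not on the food itself.
    healthy = rng.random(n) < np.where(health_conscious, 0.7, 0.4)

    print(f"healthy among eaters:     {healthy[eats_food].mean():.2f}")
    print(f"healthy among non-eaters: {healthy[~eats_food].mean():.2f}")

    The eaters look markedly healthier, entirely because of who chooses to eat the food.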

    Or, for example, rich people can afford more expensive types of wine, but they can also afford better healthcare. So a study that showed a link between expensive wine and being healthy would be fallacious.

    Another example is private schools, which consistently show higher success rates than public schools; but how much of that is down to the fact that students who attend private schools, and their families, clearly put a high emphasis on education, whereas public schools have a mixture of all different kinds of emphasis on education: high, low, average? This is why the statistics should reflect the improvements individual students make when going from a public school to a private school, rather than the straightforward comparison between the two that is often made.


  • Closed Accounts Posts: 116 ✭✭metrosity


    fergalr wrote: »
    No. You made a general statement:


    I gave a counter-example. And not a trivial or merely pedantic one, either.


    Your general statement is now proven wrong.
    Not sure I'll be in a rush to read any more of your 'proofs'.
    fergalr wrote: »
    I'm glad you think the counter-example I chose was the 'easiest one to pick'. That means I chose an accessible counter-example. Which doesn't change the fact your point has been proved wrong.


    Yawn :-)
    See above.
    fergalr wrote: »
    No; you said statisticians couldn't assert causation without designed experiments; but clearly they do in the case of, e.g. smoking causing cancer.

    We eventually accept enough correlation evidence, when we see it in different places and after a lot of checks, as becoming definitive.
    ..Yes^2?

    fergalr wrote: »
    meaningless tautology.


    Sure, initially, yeah - and as I said, this is fraught stuff. But, eventually, with enough correlations, again and again, and after removing other competing explanations, we begin to accept it as evidence for causation. With enough such evidence we tend to accept it.
    In Ireland and other countries there's a bit more fear-mongering about the effects of smoking, imo.
    fergalr wrote: »

    What texts? Have you read them?

    All I know is that several statistics books I've read have mentioned Fisher (the famous statistician) opposing the 'smoking causes cancer' hypothesis, and have claimed he was in error. (Example paper:
    Where did I ever say I didn't support Fisher?
    fergalr wrote: »
    It's my understanding that most statisticians would accept there's enough evidence to believe in a causal link between smoking and cancer, even without controlled studies.
    They can believe what they want. I take a lot of statistical "studies" with a grain of salt. I'm really not a fan of inferring causation in observational experiments.
    fergalr wrote: »
    I said that in all of my posts here.



    That quote there is an example of what is known as the 'slippery slope fallacy'.
    It is an error in reasoning.

    I gave a counter-example, which was logically sound.

    If we applied the logic of the counter-example to everything, including places it wasn't suitable, we would end up in a complete mess, yes. However, there is no suggestion we are going to do that. As such the logic of the counter-example remains sound.
    You might enjoy this: en.wikipedia.org/wiki/Intuitionistic_logic
    I'm a big fan of intuitionistic logic.

    fergalr wrote: »
    Agreed.


    ??
    Not 'the done thing'? You make it sound like it's impolite! Who cares?
    There are areas of scientific analysis where the only data you have is observational.
    That doesn't mean you just give up. You use the tools you have, you're careful, and you make progress towards the truth. It might be slower or harder than if you had controlled trials, but that doesn't mean you don't do it.
    I was merely trying to lighten things up a bit.
    fergalr wrote: »
    Which is why, in the absence of controlled trials, healthcare professionals conservatively say they have no opinion on whether people should smoke or not.

    Wait a minute! No, they don't! Rather, they conclude from plentiful observational evidence that it's so likely smoking causes cancer (effect sizes, correlations, natural experiments etc.) that they actively tell people not to smoke.
    Well, I just don't think they should. Too much water can kill a man stone dead.
    fergalr wrote: »
    That's the kind of thing Fisher used to say: that perhaps smoking acted to relieve the symptoms of lung cancer, which is why so many smokers were observed to have lung cancer.

    But, actually, that doesn't tear the causation to shreds at all, because there's enough observational evidence to overcome those things. It's hard to be completely 100% certain, but it's still overwhelming.

    I don't share your optimism in that regard.
    fergalr wrote: »

    OK, if you need uncertainty, then do you think playing Russian roulette is bad for your health?

    Arguments that work in the presence of uncertainty should also work if you replace the uncertainty with certainty. If they don't, chances are something is flawed in the argument (and this is a good check to do.)

    Not if the uncertainty overlaps the explanatory variable you are using in your model. It's not possible to know whether your response variable will be accurate if there is unexplained overlap like that. So you get more data, and now the uncertain part is certain: instead of x, you have a number.

    That, to me, sounds like the very process of refining a designed experiment: altering the values of the explanatory variables to get a better fit - a more reliable, solid explanation that you can clearly attribute to the response variable without worry about lurking variables from overlapping uncertainty like that.
    fergalr wrote: »
    I'm not saying anything unorthodox here; I think you are just misunderstanding some things.

    I disagree with your general attitude that we should 'place trust' and 'show respect' etc. Obviously, we stand on the shoulders of those who came before, but we don't give anyone a free pass.

    You seem pretty keen on respecting our forebears, our polite betters, or whatever you want to call them. I thus refer you to the motto of the most learned and established and posh royal society:
    Nullius_in_verba
    Kind of shot yourself in the foot there. Nullius in verba "is an expression of the determination of Fellows to withstand the domination of authority and to verify all statements by an appeal to facts determined by experiment."

    The keyword here is experiment. At present in the field of statistics, an observational experiment cannot, or at least should not, be used scientifically to infer causation. Too many lurking variables. Seems reasonable to me.
    fergalr wrote: »
    I'm on the quite applied side of things: machine learning, putting Bayesian stuff into practice, etc.
    Good.
    fergalr wrote: »
    No, it's common in my experience.


    Unless your CS course was a lot more stats-oriented than mine, you'll have learned almost no stats during it.
    You've heard of 'knowing enough to be dangerous'? Typically CS grads don't even know that much.
    I reckon stats is one of the most underestimated areas of knowledge - it takes years to get a grip on it, and most non-statisticians who have had a couple of courses (even grad students) have a few 'cookbook'-style methods they use, but understand nothing.



    Agreed.


    Not agreed! No way, like.




    Of all the sectors where there'll be 'jobs for the boys' or connected rich families getting their children jobs - well, I wouldn't be too worried that Data Analysis will be particularly vulnerable to that.

    That said, I've seen some pretty funny presentations from some of the more management consulting types in this area :-)



    Not sure if you are referring to me, but I've a lot of practical skills; I'm a pretty fluent Python programmer, to start with.
    Great.
    fergalr wrote: »
    If I can give you a suggestion, some books I'd recommend:
    Comparative Statistical Inference by Barnett
    Doing Bayesian Data Analysis by Kruschke

    Statistics: Informed Decisions Using Data, 4th Edition - Michael Sullivan III - brilliant book!

    Intuitionistic Type Theory - Per Martin-Löf - a bit heavy for me, but fascinating

    Anything on dependent types; more here - en.wikipedia.org/wiki/Dependent_types

    I'm not sure you'll be into dependent types though. You might hate them.
    I like them. Maybe I could convert you but I doubt it.

    Really, a lot of our arguments are related to a fundamental split in logics - between those who believe in the law of excluded middle, and those who don't really accept its full use.

    You could say that "Enough people have bad health from smoking and there is enough variation in the samples to say that there are no lurking variables that can explain all of these bad health conditions" but it's still subjective to me - it's excluded middle - all true or all false.

    I would lean more towards intuitionistic logic. It's not the case that you can say it's all true OR all false. It's undefined. It's too observational. Collect more data, correlate what you will, and let people decide.

    www.youtube.com/watch?v=s20ki6Dtjlo


  • Registered Users Posts: 1,922 ✭✭✭fergalr


    metrosity wrote: »
    In Ireland and other countries there's a bit more fear-mongering about the effects of smoking, imo.
    Where did I ever say I didn't support Fisher?

    If you support Fisher, and if you believe that there isn't sufficient data and stats to make us believe, with a high degree of confidence, that smoking causes cancer, then I think you'd be seen as pretty 'fringe' by most practising statisticians.

    metrosity wrote: »
    fergalr wrote:
    Rather, they conclude from plentiful observational evidence that it's so likely smoking causes cancer (effect sizes, correlations, natural experiments etc.) that they actively tell people not to smoke.
    Well, I just don't think they should. Too much water can kill a man stone dead.
    Again, your beliefs are pretty fringe there; probably wrong.


    metrosity wrote: »
    Really, a lot of our arguments are related to a fundamental split in logics - between those who believe in the law of excluded middle, and those who don't really accept its full use.

    I disagree. We could cast everything in terms of probability rather than two-valued logic (and honestly, that's how I think about these things; the two-valued logic stuff is just for communication) and we'd still have the same disagreement.
    metrosity wrote: »
    You could say that "Enough people have bad health from smoking and there is enough variation in the samples to say that there are no lurking variables that can explain all of these bad health conditions" but it's still subjective to me - it's excluded middle - all true or all false.
    There's a correct degree of belief in the proposition that 'smoking causes cancer'. I think it should be high, and I think it's OK to reach such a high belief from observational studies (once you have lots of them, and attempt to control, predict, check for confounding etc.).
    You seem to disagree, and to think the correct degree of belief should be low, or close to some prior.

    One of us is wrong, and I think it's you.

    And I don't see how anything about 'excluded middles' changes that.

    There's either a strong causal link between smoking and cancer or there is not.

    Most people believe there is; if you believe there isn't, I don't see a middle ground.
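
    To sketch what I mean by a degree of belief rising with repeated evidence (the likelihood numbers here are invented), a simple Bayesian update over ten hypothetical studies that each find the association:

    # P(a study finds the association | link real / not real) - both invented.
    p_find_if_causal = 0.9
    p_find_if_not = 0.3

    belief = 0.5  # start undecided about "smoking causes cancer"
    for study in range(1, 11):
        # Bayes' rule, updating on one more positive study.
        numerator = p_find_if_causal * belief
        belief = numerator / (numerator + p_find_if_not * (1 - belief))
        print(f"after study {study:2d}: P(causal) = {belief:.4f}")

    No single study settles it, but the posterior climbs towards certainty; that's the sense in which mounting observational evidence can rationally produce a high degree of belief.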


  • Closed Accounts Posts: 116 ✭✭metrosity


    fergalr wrote: »
    If you support Fisher, and if you believe that there isn't sufficient data and stats to make us believe, with a high degree of confidence, that smoking causes cancer, then I think you'd be seen as pretty 'fringe' by most practising statisticians.



    Again, your beliefs are pretty fringe there; probably wrong.





    I disagree. We could cast everything in terms of probability rather than two-valued logic (and honestly, that's how I think about these things; the two-valued logic stuff is just for communication) and we'd still have the same disagreement.


    There's a correct degree of belief in the proposition that 'smoking causes cancer'. I think it should be high, and I think it's OK to reach such a high belief from observational studies (once you have lots of them, and attempt to control, predict, check for confounding etc.).
    You seem to disagree, and to think the correct degree of belief should be low, or close to some prior.

    One of us is wrong, and I think it's you.

    And I don't see how anything about 'excluded middles' changes that.

    There's either a strong causal link between smoking and cancer or there is not.

    Most people believe there is; if you believe there isn't, I don't see a middle ground.

    Now you just seem to be confirming what I suspected of you - you're desperate.
    You used language like 'you're wrong', 'no', 'I proved you're wrong' and ..
    .. I called you on it.

    You conveniently only responded to the parts of my post that suited you. You were selective. Maybe you're interested in selective sampling for statistical models too; let me refresh your memory on that, if you have one - it's not done. Any sample collected selectively is a joke.

    You, yet again, didn't apply much care in reading my above post, and I'm not really interested in 'proving' (as you say) who's right here, especially to you, or to anyone who would support your erroneous viewpoint.

    In your final act of desperation, you're even now trying to imply that I have the 'fringe' viewpoint, even though you're clearly the person who thinks causation should be inferred from observational experiments. You might want to check the logic of what I actually said on Fisher. I tried to bite my tongue before and leave you the chance to behave like an adult, which you declined.

    IMO, you don't really know what you're talking about, so have a good day!


  • Registered Users Posts: 1,922 ✭✭✭fergalr


    metrosity wrote: »
    Now you just seem to be confirming what I suspected of you - you're desperate.
    You used language like 'you're wrong', 'no', 'I proved you're wrong' and ..
    .. I called you on it.

    You conveniently only responded to the parts of my post that suited you. You were selective. Maybe you're interested in selective sampling for statistical models too; let me refresh your memory on that, if you have one - it's not done. Any sample collected selectively is a joke.

    - Selectively responding to parts of an argument isn't the same as selective sampling. For example, if an argument depended on 3 premises, you could conclude it was flawed by showing just 1 premise was false.

    - All real-world samples have selection bias to some degree. We should try to minimise it, but we can rarely eliminate it - again, your 'it's not done' point doesn't make sense.
    metrosity wrote: »
    You, yet again, didn't apply much care in reading my above post, and I'm not really interested in 'proving' (as you say) who's right here, especially to you, or to anyone who would support your erroneous viewpoint.

    In your final act of desperation, you're even now trying to imply that I have the 'fringe' viewpoint, even though you're clearly the person who thinks causation should be inferred from observational experiments. You might want to check the logic of what I actually said on Fisher. I tried to bite my tongue before and leave you the chance to behave like an adult, which you declined.
    I'm now not sure whether you are taking Fisher's perspective, or not.
    metrosity wrote: »
    IMO, you don't really know what you're talking about

    I actually do. I'm actually trying to help here, providing shortcuts. You don't seem to see it that way, which is your prerogative.

    metrosity wrote: »
    so have a good day!
    On that we can agree at least - I genuinely wish you well as you do more stats.


  • Closed Accounts Posts: 116 ✭✭metrosity


    fergalr wrote: »
    - Selectively responding to parts of an argument isn't the same as selective sampling. For example, if an argument depended on 3 premises, you could conclude it was flawed by showing just 1 premise was false.

    - All real-world samples have selection bias to some degree. We should try to minimise it, but we can rarely eliminate it - again, your 'it's not done' point doesn't make sense.


    I'm now not sure whether you are taking Fisher's perspective, or not.



    I actually do. I'm actually trying to help here, providing shortcuts. You don't seem to see it that way, which is your prerogative.



    On that we can agree at least - I genuinely wish you well as you do more stats.

    That's more positive. I was dreading reading your post but, as you say, let's agree on that at least.


  • Registered Users Posts: 306 ✭✭jefferson73


    fergalr wrote: »
    Unless your CS course was a lot more stats-oriented than mine, you'll have learned almost no stats during it.
    You've heard of 'knowing enough to be dangerous'? Typically CS grads don't even know that much.
    I reckon stats is one of the most underestimated areas of knowledge - it takes years to get a grip on it, and most non-statisticians who have had a couple of courses (even grad students) have a few 'cookbook'-style methods they use, but understand nothing.

    Just wow!

    Anyway, my contribution - or rather, an article I liked on the topic - was this one.
    Taken from this great site.


  • Registered Users Posts: 1,922 ✭✭✭fergalr


    Just wow!

    Why 'wow'? Do you disagree? That's my honest assessment.

    This sort of "oh, its an important topic, so we'll give them one course in it and hope something sticks" is quite common in college courses.

    In my CS undergrad we also had a course on the quantum physics of semiconductors. It went into some pretty detailed quantum physics.
    You think the students in the course actually understood quantum physics at any useful level? Not a bit of it. We could give you some memorised names and memorised formulae, but there was pretty much no understanding - and if someone set out to do some practical quantum physics design work on the strength of that one course, well, the results wouldn't be great.

    I'm sure it's the same with people in business or economics courses studying some of the maths subjects, people in sociology courses studying the economics subjects, and so on. I'm not saying students shouldn't have the odd course that is complementary to the main thrust of their undergrad - just that the reality is that often the students who take these courses understand none of it.

    Sure, every year in Ireland a large % of the leaving cert class learns off the key words or rote methods to get grades in particular exams (maths and physics are probably good examples) but understands next to none of what is actually going on. Very few would dispute this - it's practically institutionalised.

    So it's no surprise this happens in college statistics courses too - statistics being a particularly subtle and hard subject, to my mind.

