Correlation can be confounder determination

Friday November 27, 2020

In a situation of pure confounding (xzy) with equivariant noise, the correlation between x and y is the coefficient of determination between z and either x or y: \( r = R^2 \).

I got to thinking about this because of The Book of Why:

"[Burks] computed the direct effect of parental IQ on children's IQ and found that only 35 percent, or about one-third, of IQ variation is inherited. In other words, parents with an IQ fifteen points above average would typically have children five points above average." (page 306)

That works out if we have a simple data-generating model like this in R:

z <- rnorm(1000000, sd=1)            # genetics,            variance=1
x <- z + rnorm(1000000, sd=sqrt(2))  # IQ of parents,  adds variance=2 noise
y <- z + rnorm(1000000, sd=sqrt(2))  # IQ of children, adds variance=2 noise
mean(y[0.9 < x & x < 1.1])
## [1] 0.334556

One third of x and y variance is due to z, and when x values are around 1, average corresponding y value is around 1/3, so that checks out: regression to the mean.

The genetic contribution z isn't really observable, though. All we see are x and y. So how do we get the contribution of z?

Both x and y have variance 3. If we had a twin study in this world, we could recover that variance is 2 when genetics is fixed, and that would let us find that genetics explains 1/3 of variance. But we don't generally have a twin study.

The common thing to do would be a regression.

summary(lm(y ~ x))  # (Pulling out one line of output...)
## Multiple R-squared:  0.1114,     Adjusted R-squared:  0.1114

This is usually interpreted as saying that x explains 11% of the variation of y. That's true enough, but it isn't how much is explained by z.

The correlation between x and y, however, is how much variance is explained by z.

cor(x, y)
## [1] 0.3338363

More generally (for non-equivariant noise) the regression coefficient gives the fraction of variance explained by z. (See also the earlier estimate illustrating regression to the mean.)

So this is a way of estimating the effect of an unobserved confounder, which seems to work in this very simple setting. Is this like what Burks did? Probably her models were not so simple. Is this a special case of some more general approach?

I don't think I've seen this kind of "method" before; I'd love to see more elaboration of this kind of idea for estimating impacts of unobserved variables.


The data-generating model above (and surrounding discussion) requires suspension of disbelief; it represents some sort of asexual reproduction, has additional influences on IQ that are entirely uncorrelated between parents and children, and of course IQ is fraught as a topic to begin with. Hopefully the statistical considerations are still visible.