What should be in your regression?

Saturday September 12, 2020

If you're fitting a model in the hopes of better understanding the effect of x on y, then including another predictor variable z could be necessary, or could be harmful. Deciding whether to control for z depends on identifying how the variables are causally related.

These examples vary the data generating process (something you can't do in practice) to illustrate what happens in each case when you do or don't control for z.

The R code uses random normal data, additive relationships, and linear regression. With real data and processes, other modeling methods may be more appropriate.


Simple direct causation

x -> y

Say the reality is that x influences y (not the reverse) and no other variables are involved.

x = rnorm(100)
y = x + rnorm(100)
summary(lm(y ~ x))
##              Estimate Std. Error t value Pr(>|t|)
##  x            1.07747    0.11931   9.031 1.53e-14 ***

The estimate of 1.077 is close to the true value of 1. The very low p-value (flagged with three stars) indicates strong evidence that the relationship between x and y is not due to chance.
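
Since we know the true coefficient is 1, a quick sanity check is a confidence interval. A minimal sketch using base R's confint (output omitted since the data are random):

# The 95% interval for x should usually cover the true value of 1.
confint(lm(y ~ x))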


Mediated causation: don't control for z

x -> z -> y

Here, the reality is that x influences z, which then influences y.

In this case, x really does influence y, so you're right to look for the relationship. And it might be tempting to control for z, perhaps because of a desire to “isolate” the effect of x.

If you include z in your regression, the desired estimate of the effect of x will disappear.

x = rnorm(100)
z = x + rnorm(100)
y = z + rnorm(100)
summary(lm(y ~ x + z))
##               Estimate Std. Error t value Pr(>|t|)
##  x           -0.146874   0.135695  -1.082    0.282
##  z            1.127586   0.087819  12.840   <2e-16 ***

Including z makes it appear that x has no significant influence on y, which is incorrect.

summary(lm(y ~ x))
##              Estimate Std. Error t value Pr(>|t|)
##  x            1.11140    0.15343   7.244    1e-10 ***

Leaving z out gives a good estimate of the effect of x.
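
In a linear setup like this, the total effect of x on y is the product of the two path coefficients (here 1 × 1 = 1). A minimal sketch of that decomposition, reusing the simulated data above:

# x -> z path, then z -> y path holding x fixed; their product
# should be close to the x estimate from lm(y ~ x).
a_hat = coef(lm(z ~ x))["x"]
b_hat = coef(lm(y ~ x + z))["z"]
a_hat * b_hat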


Common cause (confounding): control for z

x <- z -> y

In this case, the reality is that z influences both x and y, which are otherwise unrelated.

This is an extreme example, in the sense that you'd be entirely wrong to look for an effect of x on y at all. The case where there's also a direct connection between x and y behaves similarly, but for simplicity this example omits it.

If you don't include z in your regression, the estimated effect of x on y will be misleading.

z = rnorm(100)
x = z + rnorm(100)
y = z + rnorm(100)
summary(lm(y ~ x))
##              Estimate Std. Error t value Pr(>|t|)
##  x            0.56257    0.10365   5.428 4.14e-07 ***

There is a relationship between x and y, but it's because of z, not because of x. The regression result ignoring z could easily be misinterpreted.

summary(lm(y ~ x + z))
##              Estimate Std. Error t value Pr(>|t|)
##  x            0.03608    0.10590   0.341    0.734
##  z            1.05814    0.13556   7.806 6.93e-12 ***

With z in the model, the result correctly indicates that z is the influencing factor.
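
In a simulation you can also verify this by intervening on x directly: redraw x so it no longer depends on z, keep the same process for y, and the estimated effect of x should vanish. A sketch reusing z from above (x_do and y_do are names introduced here for illustration):

# Simulated intervention do(x): x is now set independently of z,
# while y is still generated from z alone.
x_do = rnorm(100)
y_do = z + rnorm(100)
summary(lm(y_do ~ x_do))   # the x_do estimate should be near zero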


Common effect (collider): don't control for z

x -> z <- y

Here, the reality is that x and y are completely unrelated, but they both influence z.

As with common cause, it's more interesting when there's also a direct connection between x and y, but the behavior is similar and this example keeps things simpler.

Somewhat surprisingly, if you include z in your regression, you'll see evidence that both z and x influence y.

x = rnorm(100)
y = rnorm(100)
z = x + y + rnorm(100)
summary(lm(y ~ x + z))
##              Estimate Std. Error t value Pr(>|t|)
##  x           -0.46951    0.08831  -5.317 6.76e-07 ***
##  z            0.44439    0.04971   8.939 2.62e-14 ***

This is a case of “explaining away”: for a fixed value of z, a larger x must be offset by a smaller y, so conditioning on z induces a spurious negative relationship between x and y.

summary(lm(y ~ x))
##              Estimate Std. Error t value Pr(>|t|)
##  x           -0.11308    0.10586  -1.068    0.288

Without z in the regression, the result correctly indicates no relationship between x and y.
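
The same bias appears if you condition on z by selection instead of regression, for example by restricting to observations with large z. A sketch of that stratification, reusing the data above:

# Among observations with high z, x and y become negatively
# related, even though they're independent overall.
high_z = z > median(z)
cor(x[high_z], y[high_z])   # typically clearly negative
cor(x, y)                   # typically near zero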


More complicated cases

With a more complex causal graph, it may not be obvious what you need to control for, or it may be that conventional methods of control aren't sufficient. Methods like Judea Pearl's do-calculus might be helpful.
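
For example, the dagitty R package (not used elsewhere in this post, so treat its availability as an assumption) can list valid adjustment sets for a drawn graph. A sketch for a confounding graph that also includes a direct x -> y edge:

# install.packages("dagitty") if needed
library(dagitty)
g = dagitty("dag { z -> x ; z -> y ; x -> y }")
adjustmentSets(g, exposure = "x", outcome = "y")   # should report { z }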


Thanks to Erica Blom, Ben Klemens, and Jonathan Robinson for providing helpful feedback.