Simple Front Door Regression

Saturday May 1, 2021

If you have a mediator, you can estimate effects despite confounding, which is neat. The idea is like a flip of instrumental variables: there you keep the part of the treatment that the instrument explains, while here the constructed non-confounded variable is the part of the mediator that the confounded treatment does not explain. A two-stage regression illustrates the idea.

We seek to estimate the impact of x on y, where they’re both influenced by u.

[Figure: front door scenario, with u → x, u → y, x → z, z → y]

Unfortunately, u is unobserved, so we can’t control for it. But lo, there is z, a perfect mediator between x and y. Let’s simulate data where all the true coefficients are one.

u = rnorm(100)
x = u + rnorm(100)
z = x + rnorm(100)
y = u + z + rnorm(100)

Regressing naively, we get an incorrect estimate for the effect of x on y.

summary(lm(y ~ x))
##              Estimate Std. Error t value Pr(>|t|)
##  x            1.54648    0.09848  15.704   <2e-16 ***

Luckily, z is in there, and u doesn’t affect it directly: u reaches z only through x.

But how do we use it? Throwing it in the regression doesn’t fix the coefficient on x.

summary(lm(y ~ x + z))
##              Estimate Std. Error t value Pr(>|t|)
##  x            0.46175    0.15517   2.976  0.00369 **
##  z            1.02436    0.12740   8.041  2.2e-12 ***
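
These two results aren't just small-sample noise, and a rough large-sample check shows where they settle. The seed, the sample size n, and the `_big` variable names are my own choices for illustration, not part of the simulation above. With unit variances, Cov(x, y) = 3 and Var(x) = 2, so the naive slope should land near 1.5. With z included, the coefficient on x should land near 0.5, because x still stands in for u (E[u | x] = x / 2), while the coefficient on z should stay near 1.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 1e5     # far more draws than the 100 used above

# Same data-generating process as above, under new names
u_big <- rnorm(n)
x_big <- u_big + rnorm(n)
z_big <- x_big + rnorm(n)
y_big <- u_big + z_big + rnorm(n)

# Naive regression: slope is Cov(x, y) / Var(x) = 3 / 2 in the limit
naive_slope <- unname(coef(lm(y_big ~ x_big))["x_big"])

# Regression with the mediator thrown in
both_coefs <- coef(lm(y_big ~ x_big + z_big))

naive_slope  # should be close to 1.5
both_coefs   # x_big should be close to 0.5, z_big close to 1
```

So neither regression gives a coefficient on x near the true total effect of one, though the coefficient on z does look right.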

A fancier model-fitting approach is needed. Here’s an illustrative two-stage regression, ignoring standard error concerns.

Stage one: Do a regression using x to predict z. Note the coefficient on x. Then get the residuals for z from that model, which I’ll call z_not_from_x. This is the variation in z not due to x. Because u affects z only through x, this component of z is also unrelated to u, so it gives an unconfounded “version of” (really a component of) z.

Stage two: Do a regression using z_not_from_x to predict y.

Multiplying the coefficients from the two models gives the effect of x on y.

summary(lm(z ~ x))
##              Estimate Std. Error t value Pr(>|t|)
##  x           1.058940   0.060797  17.418   <2e-16 ***
z_not_from_x = residuals(lm(z ~ x))
summary(lm(y ~ z_not_from_x))
##               Estimate Std. Error t value Pr(>|t|)
##  z_not_from_x   1.0244     0.2889   3.546 0.000601 ***
1.058940 * 1.0244  # Estimate for x on y
##  1.084778

We’ve recovered a fair estimate of the true parameter for x, despite not using the unobserved confound u!
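
The two stages can be wrapped up in a small function, sketched here. `front_door_two_stage` is a hypothetical helper of my own, not from any package, and the seed, sample size, and `_sim` names are likewise just for illustration. The sketch also checks something visible in the output above: the stage-two coefficient (1.0244) matches the coefficient on z in lm(y ~ x + z) (1.02436). That is what you'd expect, since regressing on the residualized z is the same as partialling x out of z.

```r
# Hypothetical helper: two-stage front door point estimate (no standard errors)
front_door_two_stage <- function(x, z, y) {
  stage_one <- lm(z ~ x)                 # effect of x on z
  z_not_from_x <- residuals(stage_one)   # part of z not explained by x
  stage_two <- lm(y ~ z_not_from_x)      # effect of z on y
  x_on_z <- unname(coef(stage_one)["x"])
  z_on_y <- unname(coef(stage_two)["z_not_from_x"])
  c(x_on_z = x_on_z, z_on_y = z_on_y, x_on_y = x_on_z * z_on_y)
}

set.seed(2)  # arbitrary seed
n <- 1e5     # large sample so the estimate sits near the truth
u_sim <- rnorm(n)
x_sim <- u_sim + rnorm(n)
z_sim <- x_sim + rnorm(n)
y_sim <- u_sim + z_sim + rnorm(n)

estimate <- front_door_two_stage(x_sim, z_sim, y_sim)
estimate  # x_on_y should be close to 1

# The stage-two coefficient equals the z coefficient from the joint regression
z_joint <- unname(coef(lm(y_sim ~ x_sim + z_sim))["z_sim"])
all.equal(unname(estimate["z_on_y"]), z_joint)
```

So the regression with both x and z wasn't useless after all: its coefficient on z is the second half of the product.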

Here’s the same thing as above but with the lavaan package for SEM, following Thoemmes.

library(lavaan)
model = "z ~ x_on_z * x
         y ~ z_on_y * z
         x ~~ y  # Allow x and y to still covary
         x_on_y := x_on_z * z_on_y"
summary(sem(model, data.frame(x=x, z=z, y=y)))
##                   Estimate  Std.Err  z-value  P(>|z|)
##    x_on_z            1.059    0.060   17.594    0.000
##    z_on_y            1.024    0.125    8.164    0.000
##    x_on_y            1.085    0.146    7.406    0.000

So there’s standard errors for you!
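
If you'd rather stay in base R, one alternative is a nonparametric bootstrap of the two-stage product. To be clear, this is a swapped-in technique of mine and not what lavaan computes. `front_door_product`, the seed, and the number of resamples are all assumptions for illustration.

```r
set.seed(3)  # arbitrary seed
n <- 100     # same sample size as the example above
u_b <- rnorm(n)
x_b <- u_b + rnorm(n)
z_b <- x_b + rnorm(n)
y_b <- u_b + z_b + rnorm(n)
sim <- data.frame(x = x_b, z = z_b, y = y_b)  # u is left out: it is unobserved

# Hypothetical helper: two-stage front door estimate from a data frame
front_door_product <- function(d) {
  stage_one <- lm(z ~ x, data = d)
  d$z_not_from_x <- residuals(stage_one)
  stage_two <- lm(y ~ z_not_from_x, data = d)
  unname(coef(stage_one)["x"] * coef(stage_two)["z_not_from_x"])
}

# Resample rows with replacement and re-estimate each time
boot_estimates <- replicate(1000, {
  rows <- sample(nrow(sim), replace = TRUE)
  front_door_product(sim[rows, ])
})

front_door_product(sim)  # point estimate on the full sample
sd(boot_estimates)       # bootstrap standard error of the product
```

The exact numbers depend on the seed, so don't expect them to match the lavaan output above digit for digit.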

Other cases

This example goes well with a collection of four simpler ("back door") regression situations, "What should be in your regression?" It goes especially well with its cousin, "A simple Instrumental Variable."

More complicated cases

The dagitty tools seem to be a great way to analyze a given situation (expressed as a DAG) and figure out what you can do with it.
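
As a minimal sketch, here is how the front door scenario from this post might be written down for the dagitty R package. It assumes dagitty is installed, and the object name `front_door` is mine. With u marked as latent, I'd expect no valid adjustment set for x on y directly, but the two pieces used above (x on z, and z on y adjusting for x) should each be identifiable.

```r
library(dagitty)  # assumes the dagitty package is installed

# The DAG from this post, with u declared unobserved
front_door <- dagitty("dag {
  u [latent]
  u -> x
  u -> y
  x -> z
  z -> y
}")

# Total effect of x on y: no adjustment set exists, since u can't be used
sets_x_y <- adjustmentSets(front_door, exposure = "x", outcome = "y")

# Effect of x on z: the path through u is blocked at the collider y
sets_x_z <- adjustmentSets(front_door, exposure = "x", outcome = "z")

# Effect of z on y: adjusting for x blocks the path z <- x <- u -> y
sets_z_y <- adjustmentSets(front_door, exposure = "z", outcome = "y")

sets_x_y
sets_x_z
sets_z_y
```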

Maybe regression discontinuity is another kind of case?