Simple Front Door Regression
Saturday May 1, 2021
If you have a mediator, you can estimate effects despite unobserved confounding, which is neat. The idea is like a flip of instrumental variables: here the constructed non-confounded variable is the part of the mediator that's not explained by the confounded one. A two-stage regression illustrates the idea.
We seek to estimate the impact of x on y, where they’re both influenced by u.
Unfortunately, u is unobserved, so we can’t control for it. But lo, there is z, a perfect mediator between x and y: x affects y only through z, and u doesn’t touch z directly. Let’s simulate data where all the true coefficients are one.
u = rnorm(100)
x = u + rnorm(100)
z = x + rnorm(100)
y = u + z + rnorm(100)
Regressing naively, we get a biased estimate of the effect of x on y.
summary(lm(y ~ x))
## Estimate Std. Error t value Pr(>|t|)
## x 1.54648 0.09848 15.704 <2e-16 ***
Luckily, z is in there, and it isn’t confounded with u.
But how do we use it? Throwing it in the regression doesn’t help: conditioning on the mediator blocks the very path that carries x’s effect, and the leftover coefficient on x is just confounding from u.
summary(lm(y ~ x + z))
## Estimate Std. Error t value Pr(>|t|)
## x 0.46175 0.15517 2.976 0.00369 **
## z 1.02436 0.12740 8.041 2.2e-12 ***
A fancier model-fitting approach is needed. Here’s an illustrative two-stage regression, ignoring standard error concerns.
Stage one: Do a regression using x to predict z. Note the coefficient on x. Then get the residuals for z from that model, which I’ll call z_not_from_x. This uses the fact that z is unconfounded with u to make a “version of” (maybe a component of) z that is independent of u (because it’s independent of x): the variation in z not due to x (or u).
Stage two: Do a regression using z_not_from_x to predict y.
Multiplying the coefficients from the two models gives the effect of x on y.
summary(lm(z ~ x))
## Estimate Std. Error t value Pr(>|t|)
## x 1.058940 0.060797 17.418 <2e-16 ***
z_not_from_x = residuals(lm(z ~ x))
summary(lm(y ~ z_not_from_x))
## Estimate Std. Error t value Pr(>|t|)
## z_not_from_x 1.0244 0.2889 3.546 0.000601 ***
1.058940 * 1.0244 # Estimate for x on y
## 1.084778
We’ve recovered a fair estimate of the true parameter for x, despite not using the unobserved confound u!
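The two-stage estimate above came without a standard error (we ignored that on purpose). One quick way to get one without any extra machinery is a nonparametric bootstrap — a sketch, not part of the original analysis, with the `front_door` helper and the seed being my own additions:

```r
set.seed(1)  # assumed seed, for reproducibility of this sketch
u = rnorm(100); x = u + rnorm(100); z = x + rnorm(100); y = u + z + rnorm(100)
d = data.frame(x, z, y)

# Two-stage front-door estimate, wrapped as a function so we can resample it
front_door = function(d) {
  stage1  = lm(z ~ x, data = d)
  z_on_x  = coef(stage1)[["x"]]            # effect of x on z
  z_resid = residuals(stage1)              # z_not_from_x
  y_on_z  = coef(lm(d$y ~ z_resid))[[2]]   # effect of z on y
  z_on_x * y_on_z
}

# Resample rows with replacement and recompute the product estimate
boots = replicate(1000, front_door(d[sample(nrow(d), replace = TRUE), ]))
c(estimate = front_door(d), boot_se = sd(boots))
```

The spread of the bootstrap estimates stands in for the standard error of the product of coefficients.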
Here’s the same thing as above but with the lavaan package for SEM, following Thoemmes.
library(lavaan)
model = "z ~ x_on_z * x
y ~ z_on_y * z
x ~~ y # Allow x and y to still covary
x_on_y := x_on_z * z_on_y"
summary(sem(model, data.frame(x=x, z=z, y=y)))
## Estimate Std.Err z-value P(>|z|)
## x_on_z 1.059 0.060 17.594 0.000
## z_on_y 1.024 0.125 8.164 0.000
## x_on_y 1.085 0.146 7.406 0.000
So there are standard errors for you!
Other cases
This example goes well with a collection of four simpler (“back door”) regression situations in What should be in your regression? It goes especially well with its cousin, A Simple Instrumental Variable.
More complicated cases
The dagitty tools seem to be a great way to analyze a given situation (expressed as a DAG) and figure out what you can do with it.
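For instance, here’s a sketch (assuming the dagitty R package is installed) encoding this post’s DAG, with u marked latent, and asking for covariate adjustment sets:

```r
library(dagitty)

# Encode the front-door DAG from this post; u is unobserved
g = dagitty("dag {
  u [latent]
  u -> x
  u -> y
  x -> z
  z -> y
}")

# No valid backdoor adjustment set exists, since u can't be conditioned on;
# this returns an empty result, pointing you toward mediator-based strategies
adjustmentSets(g, exposure = "x", outcome = "y")
```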
Maybe regression discontinuity is another kind of case?