# Regression standard errors scale by all residuals

Thursday July 6, 2023

Even in cases where it seems like it shouldn't matter whether you run individual regressions or a combined regression, it does matter to the standard errors, because the standard errors all scale by a factor that includes all the residuals for the model.

Here's a simple data set in R:

```
set.seed(42)
x = rnorm(10)
df_a = data.frame(group='a', x=x, y=2*x + 5 + rnorm(10))
x = rnorm(10)
df_b = data.frame(group='b', x=x, y=3*x + 7 + rnorm(10))
df = rbind(df_a, df_b)
df
## group x y
## 1 a 1.37095845 9.0467865
## 2 a -0.56469817 6.1572490
## 3 a 0.36312841 4.3373961
## 4 a 0.63286260 5.9869364
## 5 a 0.40426832 5.6752153
## 6 a -0.10612452 5.4237014
## 7 a 1.51152200 7.7387911
## 8 a -0.09465904 2.1542265
## 9 a 2.01842371 6.5963805
## 10 a -0.06271410 6.1946851
## 11 b -0.30663859 6.5355343
## 12 b -1.78130843 2.3609120
## 13 b -0.17191736 7.5193515
## 14 b 1.21467470 10.0350977
## 15 b 1.89519346 13.1905355
## 16 b -0.43046913 3.9915839
## 17 b -0.25726938 5.4437328
## 18 b -1.76316309 0.8596032
## 19 b 0.46009735 5.9660844
## 20 b -0.63999488 5.1161380
```

If we're trying to recover the slopes and intercepts, we can run
separate regressions for group *a* and group *b*, or we can run a
combined regression for both at the same time.

```
summary(lm(y ~ x, data=df_a))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.2370 0.6162 8.500 2.82e-05 ***
## x 1.2682 0.6397 1.983 0.0827 .
summary(lm(y ~ x, data=df_b))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.6258 0.3783 17.515 1.15e-07 ***
## x 2.9421 0.3405 8.641 2.50e-05 ***
summary(lm(y ~ group:x + group - 1, data=df))
## Estimate Std. Error t value Pr(>|t|)
## groupa 5.2370 0.5411 9.679 4.32e-08 ***
## groupb 6.6258 0.4511 14.689 1.05e-10 ***
## groupa:x 1.2682 0.5618 2.258 0.0383 *
## groupb:x 2.9421 0.4060 7.247 1.95e-06 ***
```

The estimated coefficients are the same between the two methods
(good!) but the standard errors are different. Why? We know that the
data for group *a* doesn't really interact with the data for group *b*
at all, so it shouldn't affect the standard errors, right?

Indeed, adding extra independent data shouldn't affect the standard errors, but regression doesn't know about the independence. (Run the independent regressions, not the combined one, in cases like this if you want correct standard errors.)

With \( X \) the design matrix and \( \sigma^2 \) the sum of squared residuals divided by degrees of freedom, the standard errors are \( \sqrt{ \sigma^2 ( X^TX )^{-1} } \). (Thanks to ocram.)

```
model = lm(y ~ group:x + group - 1, data=df)
matrix = model.matrix(y ~ group:x + group - 1, data=df)
sqrt(diag((sum(model$residuals^2) / 16) * solve(t(matrix) %*% matrix)))
## groupa groupb groupa:x groupb:x
## 0.5410892 0.4510810 0.5617751 0.4059782
```

In the example here, the group *a* coefficient on *x* isn't
significant (at 0.05) with the individual regression, but it appears
to be when the combined regression is done. And this is a modest
example, with the same scale normal noise in both data sets; data
won't be so homoscedastic in general so results could be even worse
for combined regressions.