This page is a reference that I keep for myself, to refer to when fitting linear models.

Suppose that our data generating process is:

\[\begin{array}{lcl} \mathbf{y} &=& \mathbf{X}\beta + \epsilon \\ \epsilon &{\sim}& \text{N}\Big(0 \space, \hspace{3mm} \sigma^2\mathbf{I}\Big) \end{array}\]

\(\mathbf{y}\) is an \(n\times 1\) vector.

\(\mathbf{X}\) is an \(n\times p\) matrix.

\(\beta\) is a \(p\times 1\) vector of coefficients.

\(\epsilon\) is an \(n\times 1\) vector.

\(\epsilon\) is a random vector, and thus \(\mathbf{y}\) is a random vector too (since its value is dependent on \(\epsilon\)).

\(\mathbf{X}\) and \(\beta\) are fixed (i.e. \(\mathbf{X}\) and \(\beta\) are the same in all datasets generated by this process).

This implies that random vector \(\mathbf{y}\) has a multivariate normal distribution \(\mathbf{y}\sim\text{N}\Big(\mathbf{X}\beta \space, \hspace{3mm} \sigma^2\mathbf{I}\Big)\).

For a single given dataset \(\{\mathbf{y},\mathbf{X}\}\), where outcome vector \(\mathbf{y}\) and features/predictors \(\mathbf{X}\) are known, we would want to estimate parameter/coefficient vector \(\beta\).

The estimator \(\hat \beta\) minimising the sum of squared errors \(\hat\epsilon'\hat\epsilon=\sum_{i=1}^n\Big(y_i-\underset{\hat{y}_i}{\underbrace{\mathbf{x}_i'\hat\beta}}\Big)^2\) for a given sample is \(\hat \beta = (X'X)^{-1}X'\mathbf{y}\). This is called the Ordinary Least Squares (OLS) estimator.
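As a quick check of this formula, here is a minimal sketch in R (the simulated design matrix, coefficients and noise level are arbitrary choices for illustration):

```r
set.seed(1)

n    <- 100
X    <- cbind(1, rnorm(n), rnorm(n))            # n x p design matrix (intercept + 2 predictors)
beta <- c(2, -1, 0.5)                           # true coefficients
y    <- drop(X %*% beta) + rnorm(n, sd = 1.5)   # y = X beta + epsilon

# OLS estimate via the closed-form solution (X'X)^{-1} X'y
beta_hat <- drop(solve(t(X) %*% X, t(X) %*% y))

# The same estimate from lm() (X already contains an intercept column, hence the -1)
fit <- lm(y ~ X - 1)

cbind(closed_form = beta_hat, lm = coef(fit))
```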

The distribution of this estimator \(\hat\beta\) across samples, under this data-generating process, is multivariate normal:

\[\hat\beta \quad\sim\quad \text{N}_p\Big(\beta \space, \hspace{3mm} \sigma^2 (X'X)^{-1}\Big)\]
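A small simulation sketch of this sampling distribution (the design matrix is held fixed across simulated datasets; \(n\), \(\sigma\) and the number of replications are arbitrary):

```r
set.seed(2)

n     <- 50
sigma <- 2
X     <- cbind(1, rnorm(n))     # fixed n x 2 design matrix
beta  <- c(1, 3)                # true coefficients

# Draw many datasets from y = X beta + epsilon and record beta_hat for each
beta_hats <- replicate(5000, {
  y <- drop(X %*% beta) + rnorm(n, sd = sigma)
  drop(solve(t(X) %*% X, t(X) %*% y))
})

# Empirical covariance of beta_hat vs the theoretical sigma^2 (X'X)^{-1}
cov(t(beta_hats))
sigma^2 * solve(t(X) %*% X)
```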

If \(\sigma^2\) were known, then this information could be used to generate confidence intervals for the coefficients \(\beta\).

A confidence interval for a single coefficient \(\beta_j\) is

\[\hat\beta_j \pm z(\displaystyle\frac{\alpha}{2})\cdot \sqrt{\Big[\sigma^2(X'X)^{-1}\Big]_{jj}}\]

..where \(z(\displaystyle\frac{\alpha}{2})\) is the \(1-\displaystyle\frac{\alpha}{2}\) quantile of the standard normal distribution.

In practice, since the true value of \(\sigma^2\) is unknown, an estimate \(\quad s^2=\hat{\sigma^2} \quad\) is used in its place, and the confidence interval is then:

\[\hat\beta_j \pm t_{n-p}(\displaystyle\frac{\alpha}{2})\cdot \sqrt{\Big[s^2(X'X)^{-1}\Big]_{jj}}\]
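Here is a sketch of this interval computed by hand and checked against confint() (simulated data; \(\alpha=0.05\), i.e. a 95% interval):

```r
set.seed(3)

n <- 60
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 0.5 + 2 * d$x1 - 1 * d$x2 + rnorm(n)

X <- model.matrix(~ x1 + x2, data = d)
p <- ncol(X)

beta_hat <- drop(solve(t(X) %*% X, t(X) %*% d$y))
resid    <- d$y - drop(X %*% beta_hat)
s2       <- sum(resid^2) / (n - p)               # estimate of sigma^2
se       <- sqrt(diag(s2 * solve(t(X) %*% X)))   # standard errors of beta_hat

alpha <- 0.05
tcrit <- qt(1 - alpha / 2, df = n - p)           # t_{n-p}(alpha/2) quantile

cbind(lower = beta_hat - tcrit * se,
      upper = beta_hat + tcrit * se)

# Same intervals from confint() on the equivalent lm() fit
confint(lm(y ~ x1 + x2, data = d))
```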

Rather than considering coefficients one at a time (which leads to more false positives/type 1 errors), it is possible to create simultaneous confidence regions for the parameters \(\beta\). See [1] for details.

The residuals (estimated errors) for a particular dataset are \(\hat \epsilon = \mathbf{y} - \underset{\hat{\mathbf{y}}}{\underbrace{X\hat\beta}}\).

The expected value of \(\hat\epsilon'\hat\epsilon\) across datasets is \(E\Big[\hat\epsilon'\hat\epsilon\Big]=\sigma^2\Big(n-p\Big)\), which gives us an estimator

\[\hat{\sigma^2}=\displaystyle\frac{\hat\epsilon'\hat\epsilon}{n-p}\]

of \(\sigma^2\), since

\[E\Big[\displaystyle\frac{\hat\epsilon'\hat\epsilon}{n-p}\Big]=\displaystyle\frac{1}{n-p}E\Big[\hat\epsilon'\hat\epsilon\Big]=\sigma^2\]

Here is an example of the distribution of \(\hat{\sigma^2}\) across multiple datasets by simulation:
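A minimal sketch of such a simulation (arbitrary fixed design, \(\sigma=2\), 5000 simulated datasets); the mean of the simulated \(\hat{\sigma^2}\) values should be close to \(\sigma^2=4\):

```r
set.seed(4)

n <- 40
X <- cbind(1, rnorm(n), rnorm(n))   # fixed n x p design matrix
p <- ncol(X)
beta  <- c(1, -2, 0.5)
sigma <- 2

sigma2_hats <- replicate(5000, {
  y <- drop(X %*% beta) + rnorm(n, sd = sigma)
  e <- y - drop(X %*% solve(t(X) %*% X, t(X) %*% y))   # residuals
  sum(e^2) / (n - p)                                   # estimate of sigma^2
})

mean(sigma2_hats)                   # should be close to sigma^2 = 4
hist(sigma2_hats, breaks = 50)      # distribution of the estimate across datasets
```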

In order to test the hypothesis \(\beta_j=0\) (i.e. whether a particular coefficient \(\beta_j\) is 0), we could use the result

\[\displaystyle\frac{\hat\beta_j}{\text{SE}\Big[\hat\beta_j\Big]} \quad=\quad \displaystyle\frac{\hat\beta_j}{\sqrt{\Big[\sigma^2(X'X)^{-1}\Big]_{jj}}} \quad\sim\quad \text{N}\Big(0,1\Big)\]

However, not knowing \(\sigma^2\), we can use the result

\[\begin{array}{lcl} t_j &=& \displaystyle\frac{\hat\beta_j}{\hat{\text{SE}}\Big[\hat\beta_j\Big]} \quad=\quad \displaystyle\frac{\hat\beta_j}{\sqrt{\Big[s^2(X'X)^{-1}\Big]_{jj}}} \\ t_j &\sim& \text{t}_{n-p} \quad \text{ if the null hypothesis } (H_0:\beta_j=0) \text{ is true} \end{array}\]

i.e. \(t_j\) has a \(t\)-distribution with \(n-p\) degrees of freedom under the null hypothesis.

Note that the first result (the \(z\)-statistic) requires the unknown constant \(\sigma^2\). In practice, the estimate \(\quad s^2=\hat{\sigma^2}=\displaystyle\frac{(\mathbf{y}-\hat{\mathbf{y}})'(\mathbf{y}-\hat{\mathbf{y}})}{n-p}\quad\) is used in its place, which is why the \(t\)-distribution (rather than the standard normal) is the reference distribution for \(t_j\).
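A sketch of this \(t\)-test computed by hand and compared against the coefficient table from summary(lm()) (simulated data, in which x2 has no true effect, so \(H_0\) holds for that coefficient):

```r
set.seed(5)

n <- 80
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 + 0 * d$x2 + rnorm(n)    # true coefficient of x2 is 0

X <- model.matrix(~ x1 + x2, data = d)
p <- ncol(X)

beta_hat <- drop(solve(t(X) %*% X, t(X) %*% d$y))
e        <- d$y - drop(X %*% beta_hat)
s2       <- sum(e^2) / (n - p)
se       <- sqrt(diag(s2 * solve(t(X) %*% X)))

t_stat <- beta_hat / se
p_val  <- 2 * pt(abs(t_stat), df = n - p, lower.tail = FALSE)

cbind(estimate = beta_hat, t = t_stat, p = p_val)

# The same t values and p-values appear in summary(lm())
summary(lm(y ~ x1 + x2, data = d))$coefficients
```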

Note that, for any fixed non-zero effect, \(p\)-values shrink as the sample size grows, so with a large enough sample even a negligible effect becomes "statistically significant". Confidence intervals for effect sizes are therefore preferred.

When to consider estimators of \(\beta\) other than OLS

From reference [1]:

  1. If the errors are correlated or have unequal variances (i.e. the error variance is not constant across observations \(1,..,n\)), then Generalized Least Squares (GLS) should be used to estimate \(\beta\).

  2. If the error distribution is long-tailed, then robust estimators of \(\beta\) should be considered.

  3. When the features/predictors are highly correlated (collinear), then penalised (biased) estimators should be considered (e.g. Ridge regression, Lasso). A brief sketch of options 2 and 3 is given after this list.
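A brief sketch of options 2 and 3 using the MASS package (the data here are simulated, with nearly collinear predictors and long-tailed errors; lm.ridge is only one of several penalised estimators, and the penalty \(\lambda=1\) is an arbitrary choice):

```r
library(MASS)   # provides rlm() and lm.ridge()

set.seed(6)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)                 # x2 highly correlated with x1
y  <- 1 + 2 * x1 - 1 * x2 + rt(n, df = 2)     # long-tailed (t-distributed) errors
d  <- data.frame(y = y, x1 = x1, x2 = x2)

coef(lm(y ~ x1 + x2, data = d))                     # ordinary least squares
coef(rlm(y ~ x1 + x2, data = d))                    # robust (M-estimation)
coef(lm.ridge(y ~ x1 + x2, data = d, lambda = 1))   # ridge (penalised) regression
```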

Generalized Least Squares (GLS)

Suppose that our data generating process is:

\[\begin{array}{lcl} \mathbf{y} &=& \mathbf{X}\beta + \epsilon \\ \epsilon &{\sim}& \text{N}\Big(0 \space, \hspace{3mm} \sigma^2\Sigma\Big) \end{array}\]

\(\mathbf{y}\) is an \(n\times 1\) vector.

\(\mathbf{X}\) is an \(n\times p\) matrix.

\(\epsilon\) is an \(n\times 1\) vector.

\(\epsilon\) is a random vector, and thus \(\mathbf{y}\) is a random vector too (since its value is dependent on \(\epsilon\)).

\(\mathbf{X}\) and \(\beta\) are fixed (i.e. \(\mathbf{X}\) and \(\beta\) are the same in all datasets generated by this process).

The constant \(\sigma^2\) (absolute scale of the errors) is unknown, but the covariance structure (matrix \(\Sigma\)) of the errors is known [1].

This implies that random vector \(\mathbf{y}\) has a multivariate normal distribution \(\mathbf{y}\sim\text{N}\Big(\mathbf{X}\beta \space, \hspace{3mm} \sigma^2\Sigma\Big)\).

For a single given dataset \(\{\mathbf{y},\mathbf{X}\}\), where outcome vector \(\mathbf{y}\) and features/predictors \(\mathbf{X}\) are known, we would want to estimate parameter/coefficient vector \(\beta\).

The estimator \(\hat \beta\) minimising \((\mathbf{y}-\mathbf{X}\beta)'\Sigma^{-1}(\mathbf{y}-\mathbf{X}\beta)\) for a given sample is:

\[\hat \beta = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}\mathbf{y}\]

\((\mathbf{y}-\mathbf{X}\beta)'\Sigma^{-1}(\mathbf{y}-\mathbf{X}\beta)\) is the squared Mahalanobis distance between the vector \(\mathbf{y}\) and the vector \(\mathbf{X}\beta\), with respect to the covariance structure \(\Sigma\).

The covariance matrix of this estimator \(\hat\beta\) across samples, under this data-generating process, is \(\sigma^2 (X'\Sigma^{-1} X)^{-1}\).
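A sketch of the GLS estimate computed directly from this formula, assuming a known \(\Sigma\) (an AR(1)-style correlation structure, chosen arbitrarily for illustration); as a check, it matches OLS applied to data that have been "whitened" by the Cholesky factor of \(\Sigma\):

```r
set.seed(7)

n     <- 50
rho   <- 0.6
Sigma <- rho^abs(outer(1:n, 1:n, "-"))   # assumed known AR(1) correlation structure
X     <- cbind(1, rnorm(n))
beta  <- c(1, 2)

# Generate correlated errors: epsilon ~ N(0, sigma^2 * Sigma), with sigma = 1
L <- t(chol(Sigma))                      # Sigma = L L'
y <- drop(X %*% beta) + drop(L %*% rnorm(n))

# GLS estimate: (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y
Sinv     <- solve(Sigma)
beta_gls <- drop(solve(t(X) %*% Sinv %*% X, t(X) %*% Sinv %*% y))

# Equivalent: OLS after pre-multiplying y and X by L^{-1} (whitening the errors)
beta_whitened <- coef(lm(solve(L, y) ~ solve(L, X) - 1))

cbind(gls = beta_gls, whitened_ols = beta_whitened)
```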

Of course, matrix \(\Sigma\) will not be known in practice and must be estimated.

Weighted Least Squares (WLS)

If some observations should carry more weight than others (for example, because they are measured more precisely), then Weighted Least Squares (WLS) can be used to estimate the coefficients \(\beta\) in the model.

Parameter estimates \(\hat{\beta}\) are now found by choosing the vector \(\beta\) which minimises:

\[\mathcal{L}(\beta)=\displaystyle\sum_{i=1}^n w_i \cdot \Big(y_i - \mathbf{x}_i'\beta\Big)^2\]

..where \(w_i\) is the weight on observation \(i\).

This gives the following estimator \(\hat\beta\):

\[\hat\beta \quad=\quad \Big(X' W X\Big)^{-1} X' W y\]

where

\[W=\begin{bmatrix} w_1 & 0 & 0 & ... & 0 \\ 0 & w_2& 0 & ... & 0 \\ . & . & . & ... & . \\ . & . & . & ... & . \\ 0 & 0 & 0 & ... & w_n \\ \end{bmatrix}\]

WLS is usually used in the situation where the errors \(\epsilon\) in the model are uncorrelated but heteroskedastic (error variance for each observation is different):

\[\begin{array}{lcl} \mathbf{y} &=& \mathbf{X}\beta + \epsilon \\ \epsilon &{\sim}& \text{N}\Big(0 \space, \hspace{3mm} \Sigma=\begin{bmatrix}\sigma_1^2 &0 & 0 & ...& 0 \\ 0 & \sigma_2^2 & 0 & ... & 0 \\ . & .& . & ... & . \\ . & .& . & ... & . \\ 0 & 0& 0 & ... & \sigma_n^2 \\ \end{bmatrix}\Big) \end{array}\]

In this case, the weight chosen for each observation is \(w_i=\displaystyle\frac{1}{\sigma_i^2}\).

Weighted regression calculated in this way produces the same coefficient estimates as standard OLS regression on a dataset in which each sample/observation \(i=1,..,n\) has been duplicated a number of times proportional to its weight \(w_i\) (though the two fits report different standard errors).
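Here is a sketch of this equivalence with simulated stand-in data (the data frame mydata, its weight column, and augmented_dataset mirror the names in the output that follows, but the numbers shown there come from the original dataset, so this sketch will not reproduce them exactly); all three approaches return the same coefficients:

```r
set.seed(8)

n <- 200
mydata <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
mydata$weight <- sample(1:5, n, replace = TRUE)     # integer weights
mydata$y <- with(mydata, 1 + 2 * x1 - 3 * x2 + 0.5 * x3 +
                   rnorm(n, sd = 1 / sqrt(weight))) # heteroskedastic errors

# (1) OLS on an augmented dataset: row i duplicated weight_i times
augmented_dataset <- mydata[rep(seq_len(n), times = mydata$weight), ]
coef(lm(y ~ x1 + x2 + x3, data = augmented_dataset))

# (2) Weighted least squares via lm(..., weights = )
coef(lm(y ~ x1 + x2 + x3, data = mydata, weights = mydata$weight))

# (3) The closed-form WLS estimator (X'WX)^{-1} X'Wy
X <- model.matrix(~ x1 + x2 + x3, data = mydata)
W <- diag(mydata$weight)
solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% mydata$y
```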

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = augmented_dataset)
## 
## Coefficients:
## (Intercept)           x1           x2           x3  
##      0.3346     -10.0615      -5.6777      -1.1734
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = mydata, weights = mydata$weight)
## 
## Coefficients:
## (Intercept)           x1           x2           x3  
##      0.3346     -10.0615      -5.6777      -1.1734
##           [,1]
## 1    0.3345705
## x1 -10.0615104
## x2  -5.6777383
## x3  -1.1734282