This page is a reference that I keep for myself, for use when fitting linear models.
Suppose that our data generating process is:
\[\begin{array}{lcl} \mathbf{y} &=& \mathbf{X}\beta + \epsilon \\ \epsilon &{\sim}& \text{N}\Big(0 \space, \hspace{3mm} \sigma^2\mathbf{I}\Big) \end{array}\]
\(\mathbf{y}\) is an \(n\times 1\) vector.
\(\mathbf{X}\) is an \(n\times p\) matrix.
\(\epsilon\) is an \(n\times 1\) vector.
\(\epsilon\) is a random vector, and thus \(\mathbf{y}\) is a random vector too (since its value is dependent on \(\epsilon\)).
\(\mathbf{X}\) and \(\beta\) are fixed (i.e. \(\mathbf{X}\) and \(\beta\) are the same in all datasets generated by this process).
This implies that random vector \(\mathbf{y}\) has a multivariate normal distribution \(\mathbf{y}\sim\text{N}\Big(\mathbf{X}\beta \space, \hspace{3mm} \sigma^2\mathbf{I}\Big)\)
For a single given dataset \(\{\mathbf{y},\mathbf{X}\}\), where outcome vector \(\mathbf{y}\) and features/predictors \(\mathbf{X}\) are known, we would want to estimate parameter/coefficient vector \(\beta\).
The estimator \(\hat \beta\) minimising the sum of squared errors \(\hat\epsilon'\hat\epsilon=\sum_{i=1}^n\Big(y_i-\underset{\hat{y}_i}{\underbrace{\mathbf{x}_i'\hat\beta}}\Big)^2\) for a given sample is \(\hat \beta = (X'X)^{-1}X'\mathbf{y}\). This is called the Ordinary Least Squares (OLS) estimator.
The distribution of this estimator \(\hat\beta\) across samples, under this data-generating process, is multivariate normal:
\[\hat\beta \quad\sim\quad \text{N}_p\Big(\beta \space, \hspace{3mm} \sigma^2 (X'X)^{-1}\Big)\]
set.seed(2)
# Define fixed parameters (the parameters that are the same for every dataset):
n <- 100
p <- 4
X <- cbind( intercept=1, x1=runif(n,-100,100), x2=rnorm(n,12,5), x3=rexp(n,rate=0.05) )
beta <- sample( -10:10, size=p ) * c(0,1,0,1) # draw random coefficients, then set beta0 and beta2 to zero
sigma <- 100
store_estimates <- matrix( rep(NA,4*10000), ncol=4 ) # create a matrix to store the ols_beta_hat estimates
colnames(store_estimates) <- c("b0", "b1", "b2", "b3") # name columns of the matrix created above
for( i in 1:10000 ){ # run 10,000 simulations
e <- rnorm(n, mean=0, sd=sigma) # generate the errors
y <- X %*% beta + e # generate y
ols_beta_hat <- solve( t(X)%*%X ) %*% t(X) %*% y # calculate the OLS estimator of beta
store_estimates[i,] <- ols_beta_hat # store the estimator
}
# plot the distributions of the beta OLS estimates generated in the simulation above:
par( mfrow=c(2,2) )
for( i in 1:4 ){
hist( x = store_estimates[,i],
breaks = 50,
main = bquote("Distribution of"~hat(beta[.(i-1)]) ),
xlab = "Estimate",
freq = FALSE
)
abline( v = beta[i], col=2, lwd=2 )
curve( dnorm( x,
mean = beta[i],
sd = sqrt( (sigma^2) * solve( t(X)%*%X )[i,i] )
),
add = TRUE,
col = 2
)
}
If \(\sigma^2\) were known, then this information could be used to generate confidence intervals for the coefficients \(\beta\).
A confidence interval for a single coefficient \(\beta_j\) is
\[\hat\beta_j \pm z(\displaystyle\frac{\alpha}{2})\cdot \sqrt{\Big[\sigma^2(X'X)^{-1}\Big]_{jj}}\]
..where \(z(\displaystyle\frac{\alpha}{2})\) is the \(1-\displaystyle\frac{\alpha}{2}\) quantile of the standard normal distribution.
In practice, since the true value of \(\sigma^2\) is unknown, confidence intervals are generated using an estimate \(\quad s^2=\hat{\sigma^2} \quad\) of \(\sigma^2\), and the confidence interval is then:
\[\hat\beta_j \pm t_{n-p}(\displaystyle\frac{\alpha}{2})\cdot \sqrt{\Big[s^2(X'X)^{-1}\Big]_{jj}}\]
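A quick sanity check of this interval formula is to compute it by hand for one simulated dataset and compare against R's built-in confint(). The block below is a minimal sketch (not part of the derivation above): it reuses X, beta, sigma, n and p from the first simulation above, and the particular seed is arbitrary.
set.seed(3)
# illustrative sketch: one simulated dataset, manual CI vs confint()
y     <- c( X %*% beta + rnorm(n, mean=0, sd=sigma) )   # one simulated dataset
fit   <- lm( y ~ X - 1 )                                # X already contains an intercept column
s2    <- sum( resid(fit)^2 ) / (n - p)                  # estimate of sigma^2
se    <- sqrt( s2 * diag( solve( t(X)%*%X ) ) )         # estimated standard errors of beta_hat
alpha <- 0.05
cbind( lower = coef(fit) - qt(1-alpha/2, df=n-p) * se,
       upper = coef(fit) + qt(1-alpha/2, df=n-p) * se
     )
confint(fit)   # matches the manual calculation above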
Rather than considering coefficients one at a time (which leads to more false positives/type 1 errors), it is possible to create simultaneous confidence regions for the parameters \(\beta\). See [1] for details.
The estimated errors for a particular dataset are \(\hat \epsilon = \underset{\hat{\mathbf{y}}}{\underbrace{X\hat\beta}} - \mathbf{y}\).
The expected value of \(\hat\epsilon'\hat\epsilon\) across datasets is \(E\Big[\hat\epsilon'\hat\epsilon\Big]=\sigma^2\Big(n-p\Big)\), which gives us an estimator
\[\hat{\sigma^2}=\displaystyle\frac{\hat\epsilon'\hat\epsilon}{n-p}\]
of \(\sigma^2\), since
\[E\Big[\displaystyle\frac{\hat\epsilon'\hat\epsilon}{n-p}\Big]=\displaystyle\frac{1}{n-p}E\Big[\hat\epsilon'\hat\epsilon\Big]=\sigma^2\]
Here is the distribution of \(\hat{\sigma^2}\) across multiple datasets, shown by simulation:
set.seed(2)
# use the same values of [X] and [sigma] as the simulation above
store_sigma_estimators <- rep(NA,10000) # create a vector to store sigma estimate from each sample
for( i in 1:10000 ){ # run 10,000 simulations
e <- rnorm(n, mean=0, sd=sigma) # generate the errors
y <- X %*% beta + e # generate y
ols_beta_hat <- solve( t(X)%*%X ) %*% t(X) %*% y # calculate the OLS estimator of beta
y_hat <- X %*% ols_beta_hat # calculate the predicted values of y
sample_errors <- y_hat - y # calculate the predicted errors for this dataset
store_sigma_estimators[i] <- ( t(sample_errors)%*%sample_errors ) / (n-p)
}
hist( x = store_sigma_estimators,
breaks = 50,
main = bquote("Distribution of"~hat(sigma^2) )
)
# draw in true sigma^2 value:
abline( v = sigma^2, lwd=3 )
abline( v = mean(store_sigma_estimators), lwd=3, col=2, lty=2 )
# draw a legend on the plot:
legend( "topright",
lty = c(1,2),
col = c(1,2),
legend = c( as.expression( bquote("true"~sigma^2) ) ,
as.expression( bquote("mean of "~hat(sigma^2)~"values") )
)
)
In order to test the hypothesis \(\beta_j=0\) (i.e. whether a particular coefficient \(\beta_j\) is 0), we could use the result
\[\displaystyle\frac{\hat\beta_j}{\text{SE}\Big[\hat\beta_j\Big]} \quad=\quad \displaystyle\frac{\hat\beta_j}{\sqrt{\Big[\sigma^2(X^TX)^{-1}\Big]_{jj}}} \quad\sim\quad \text{N}\Big(0,1\Big)\]
However, not knowing \(\sigma^2\), we can use the result
\[\begin{array}{lcl} t_j &=& \displaystyle\frac{\hat\beta_j}{\hat{\text{SE}}\Big[\hat\beta_j\Big]} \quad=\quad \displaystyle\frac{\hat\beta_j}{\sqrt{\Big[s^2(X^TX)^{-1}\Big]_{jj}}} \\ t_j &\sim& \text{t}_{n-p} \quad \text{ if the null hypothesis } (H_0:\beta_j=0) \text{ is true} \end{array}\]
i.e. \(t_j\) has a \(t\)-distribution with \(n-p\) degrees of freedom under the null hypothesis.
set.seed(2)
store_estimates <- matrix( rep(NA,4*10000), ncol=4 ) # create a matrix to store the t-statistics
colnames(store_estimates) <- c("t_b0", "t_b1", "t_b2", "t_b3") # name columns of the matrix created above
for( i in 1:10000 ){ # run 10,000 simulations
e <- rnorm(n, mean=0, sd=sigma) # generate the errors
y <- c( X %*% beta + e ) # generate y
cov_beta_hat <- sigma^2 * solve( t(X)%*%X ) # calculate the variance/covariance matrix of beta_hat estimators
ols_beta_hat <- c( solve( t(X)%*%X ) %*% t(X) %*% y ) # calculate the OLS estimator of beta
y_hat <- c( X %*% ols_beta_hat ) # calculate the fitted values of y
sigma2_est <- c( t(y_hat-y)%*%(y_hat-y) ) / (n-p) # generate estimate of sigma^2
store_estimates[i,] <- c(ols_beta_hat) / sqrt( sigma2_est * diag(solve( t(X)%*%X )) ) # store the t-statistics
}
# plot the distributions of the t statistics generated in the simulation above:
par( mfrow=c(2,2) )
for( i in 1:4 ){
hist( x = store_estimates[,i],
breaks = 50,
main = bquote("Distribution of"~t[hat(beta[.(i-1)])] ),
xlab = "Estimate",
freq = FALSE
)
curve( dt( x,
df = n-p
),
add = TRUE,
col = 2
)
}
Note that the overlaid \(t\)-density only matches the histograms of the coefficients whose true value is zero; for coefficients whose true value is non-zero, the null hypothesis is false and the statistic is not \(t_{n-p}\)-distributed. The first test statistic above (compared to the standard normal) assumes that \(\sigma^2\) is known; in practice, \(\quad s^2=\hat{\sigma^2}=\displaystyle\frac{(\hat{\mathbf{y}}-\mathbf{y})'(\hat{\mathbf{y}}-\mathbf{y})}{n-p}\quad\) is used in place of the unknown constant \(\sigma^2\), which is what gives \(t_j\) its \(t_{n-p}\) distribution.
Note that when the true effect is non-zero (however small), \(p\)-values shrink as the sample size increases, so with a large enough sample a hypothesis test largely reflects sample size rather than practical importance. Confidence intervals for effect sizes are generally more informative.
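A small illustration of this point (a sketch with made-up numbers, not part of the original derivation): a tiny non-zero slope is eventually declared "significant" once the sample is large enough, while the interval keeps showing that the effect is negligible.
set.seed(1)
# illustrative sketch: fixed tiny slope, growing sample size
for( n_obs in c(100, 1000, 10000, 100000) ){
  x_tmp <- rnorm(n_obs)
  y_tmp <- 0.02*x_tmp + rnorm(n_obs)                 # tiny true slope of 0.02
  coefs <- summary( lm(y_tmp ~ x_tmp) )$coefficients
  cat( "n =", format(n_obs, width=6),
       "   p-value =", signif( coefs["x_tmp","Pr(>|t|)"], 3 ),
       "   95% CI half-width =", signif( qt(0.975, df=n_obs-2) * coefs["x_tmp","Std. Error"], 3 ),
       "\n" )
}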
From reference [1]:
If the errors are correlated or have unequal variance (i.e. the error variance is not constant across observations \(1,..,n\)), then Generalized Least Squares (GLS) should be used to estimate \(\beta\).
If the error distribution is long-tailed, then robust estimators of \(\beta\) should be considered.
When the features/predictors are highly correlated (collinear), then penalised (biased) estimators should be considered (e.g. Ridge regression, Lasso). A brief sketch of the robust and penalised options is below; GLS is covered in the next section.
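The sketch below is illustrative only (made-up data, using the MASS and glmnet packages; it is not taken from reference [1]) and just shows one way to fit each:
set.seed(1)
# illustrative sketch: nearly-collinear predictors and long-tailed errors (data made up)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd=0.05)                              # x2 is almost collinear with x1
y  <- 1 + 2*x1 + rt(200, df=3)                              # long-tailed (t-distributed) errors
coef( MASS::rlm( y ~ x1 + x2 ) )                            # robust (Huber M-estimator) fit
coef( glmnet::cv.glmnet( x=cbind(x1,x2), y=y, alpha=0 ) )   # ridge (coefficients at CV-chosen lambda)
coef( glmnet::cv.glmnet( x=cbind(x1,x2), y=y, alpha=1 ) )   # lasso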
Suppose that our data generating process is:
\[\begin{array}{lcl} \mathbf{y} &=& \mathbf{X}\beta + \epsilon \\ \epsilon &{\sim}& \text{N}\Big(0 \space, \hspace{3mm} \sigma^2\Sigma\Big) \end{array}\]
\(\mathbf{y}\) is an \(n\times 1\) vector.
\(\mathbf{X}\) is an \(n\times p\) matrix.
\(\epsilon\) is an \(n\times 1\) vector.
\(\epsilon\) is a random vector, and thus \(\mathbf{y}\) is a random vector too (since its value is dependent on \(\epsilon\)).
\(\mathbf{X}\) and \(\beta\) are fixed (i.e. \(\mathbf{X}\) and \(\beta\) are the same in all datasets generated by this process).
The constant \(\sigma^2\) (absolute scale of the errors) is unknown, but the covariance structure (matrix \(\Sigma\)) of the errors is known [1].
This implies that random vector \(\mathbf{y}\) has a multivariate normal distribution \(\mathbf{y}\sim\text{N}\Big(\mathbf{X}\beta \space, \hspace{3mm} \sigma^2\Sigma\Big)\)
For a single given dataset \(\{\mathbf{y},\mathbf{X}\}\), where outcome vector \(\mathbf{y}\) and features/predictors \(\mathbf{X}\) are known, we would want to estimate parameter/coefficient vector \(\beta\).
The estimator \(\hat \beta\) minimising \((\mathbf{y}-\mathbf{X}\beta)'\Sigma^{-1}(\mathbf{y}-\mathbf{X}\beta)\) for a given sample is:
\[\hat \beta = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}\mathbf{y}\]
\((\mathbf{y}-\mathbf{X}\beta)'\Sigma^{-1}(\mathbf{y}-\mathbf{X}\beta)\) is the squared Mahalanobis distance (with covariance \(\Sigma\)) between the vector \(\mathbf{y}\) and the vector \(\mathbf{X}\beta\).
The variance of this estimator \(\hat\beta\) across samples, under this data-generating process, is \(\sigma^2 (X'\Sigma^{-1} X)^{-1}\):
set.seed(2)
# Define fixed parameters (the parameters that are the same for every dataset):
n <- 50
p <- 4
X <- cbind( intercept=1, x1=runif(n,-100,100), x2=rnorm(n,12,5), x3=rexp(n,rate=0.05) )
beta <- sample( -10:10, size=p ) * c(0,1,0,1) # draw random coefficients, then set beta0 and beta2 to zero
sigma <- 10
# generate random variance/covariance structure for errors:
tempmat <- matrix( runif(n*n,-10,10), n, n )
sigma_covmat <- round( t(tempmat) %*% tempmat )
store_estimates <- matrix( rep(NA,4*10000), ncol=4 ) # create a matrix to store the beta_hat estimates
colnames(store_estimates) <- c("b0", "b1", "b2", "b3") # name columns of the matrix created above
for( i in 1:10000 ){ # run 10,000 simulations
e <- t( mvtnorm::rmvnorm( n     = 1,                       # generate the (correlated) errors
                          mean  = rep(0, n),
                          sigma = sigma^2 * sigma_covmat
                        ) )                                # transpose to an n x 1 column vector
y <- X %*% beta + e # generate y
# calculate the GLS estimator of beta:
estimate_beta_hat <- solve( t(X)%*%solve(sigma_covmat)%*%X ) %*% t(X) %*% solve(sigma_covmat) %*% y
store_estimates[i,] <- estimate_beta_hat # store the GLS estimator
}
# plot the distributions of the beta estimates generated in the simulation above:
par( mfrow=c(2,2) )
for( i in 1:4 ){
hist( x = store_estimates[,i],
breaks = 50,
main = bquote("Distribution of G.L.S."~hat(beta[.(i-1)]) ),
xlab = "Estimate",
freq = FALSE
)
abline( v = beta[i], col=2, lwd=2 )
curve( dnorm( x,
mean = beta[i],
sd = sqrt( (sigma^2) * solve( t(X)%*%solve(sigma_covmat)%*%X )[i,i] )
),
add = TRUE,
col = 2
)
}
Of course, matrix \(\Sigma\) will not be known in practice and must be estimated.
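One common route in practice is to estimate the error structure alongside \(\beta\), for example with nlme::gls(). The block below is only an illustrative sketch with made-up data, assuming errors that follow an AR(1) process in row order:
library(nlme)
set.seed(1)
# illustrative sketch (data made up): errors follow an AR(1) process in row order
simdata   <- data.frame( x = rnorm(200) )
simdata$y <- 2 + 3*simdata$x + as.numeric( arima.sim( model=list(ar=0.7), n=200 ) )
gls_fit   <- gls( y ~ x, data=simdata, correlation=corAR1(form = ~ 1) )  # estimates the AR(1) parameter too
summary(gls_fit)$tTable                            # coefficient estimates and standard errors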
If some observations are more important than others, then Weighted Least Squares (WLS) can be used to estimate the coefficients \(\beta\) in the model.
The parameter estimate \(\hat{\beta}\) is now the vector \(\beta\) which minimises:
\[\mathcal{L}(\beta)=\displaystyle\sum_{i=1}^n w_i \cdot \Big(y_i - \mathbf{x}_i'\beta\Big)^2\]
..where \(w_i\) is the weight on observation \(i\).
This gives the Weighted Least Squares (WLS) estimator:
\[\hat\beta \quad=\quad \Big(X' W X\Big)^{-1} X' W y\]
where
\[W=\begin{bmatrix} w_1 & 0 & 0 & ... & 0 \\ 0 & w_2& 0 & ... & 0 \\ . & . & . & ... & . \\ . & . & . & ... & . \\ 0 & 0 & 0 & ... & w_n \\ \end{bmatrix}\]
WLS is usually used in the situation where the errors \(\epsilon\) in the model are uncorrelated but heteroskedastic (error variance for each observation is different):
\[\begin{array}{lcl} \mathbf{y} &=& \mathbf{X}\beta + \epsilon \\ \epsilon &{\sim}& \text{N}\Big(0 \space, \hspace{3mm} \Sigma=\begin{bmatrix}\sigma_1^2 &0 & 0 & ...& 0 \\ 0 & \sigma_2^2 & 0 & ... & 0 \\ . & .& . & ... & . \\ . & .& . & ... & . \\ 0 & 0& 0 & ... & \sigma_n^2 \\ \end{bmatrix}\Big) \end{array}\]
In this case, the weight chosen for each observation is \(w_i=\displaystyle\frac{1}{\sigma_i^2}\).
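A minimal sketch of this case (made-up data; the per-observation error standard deviations sd_i are treated as known): fitting with lm() and weights \(1/\sigma_i^2\) downweights the noisier observations relative to plain OLS.
set.seed(1)
# illustrative sketch (data made up): heteroskedastic errors with known per-observation SDs
n_obs <- 200
x_het <- runif(n_obs, 0, 10)
sd_i  <- 0.5 + x_het                                        # error SD grows with x
y_het <- 1 + 2*x_het + rnorm(n_obs, sd=sd_i)
rbind( OLS = coef( lm(y_het ~ x_het) ),                     # ignores the unequal variances
       WLS = coef( lm(y_het ~ x_het, weights=1/sd_i^2) )    # weight w_i = 1/sigma_i^2
     )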
Weighted regression with weights \(w_i\) is equivalent to performing standard OLS regression on a dataset in which each sample/observation \(i=1,..,n\) has been duplicated a number of times proportional to its weight \(w_i\) (exactly so when the weights are positive integers, as in the example below).
# simulate some random data (the tibble and dplyr packages are used below):
library(tibble)
library(dplyr)
mydata <- tibble( x1 = rnorm(100), x2=runif(100), x3=sample(0:1,size=100,replace=TRUE) )
true_beta <- sample(-10:10, size=4)
mydata <- mydata %>%
mutate( error = rnorm(n()) ) %>%
mutate( y = true_beta[1] + true_beta[2]*x1 + true_beta[3]*x2 + true_beta[4]*x3 + error ) %>%
mutate( weight = sample(1:5, size=n(), replace=TRUE) )
mydata
X <- cbind(1,mydata[,c("x1","x2","x3")]) %>% as.matrix()
y <- mydata$y
W <- diag(mydata$weight)
# create an augmented dataset by duplicating each row/sample/observation according to its weight:
mydata_5col <- mydata %>% select(y,x1,x2,x3) %>% mutate(rownum = row_number())
augmented_dataset <- mydata_5col %>% slice(0)
for( row_i in 1:nrow(mydata) ){
for( j in 1:mydata$weight[row_i] ){
augmented_dataset <- bind_rows( augmented_dataset,
mydata_5col[row_i,]
)
}
}
# three equivalent methods, all leading to the same result:
lm( y ~ x1 + x2 + x3, data=augmented_dataset )
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = augmented_dataset)
##
## Coefficients:
## (Intercept)           x1           x2           x3  
##      0.3346     -10.0615      -5.6777      -1.1734
lm( y ~ x1 + x2 + x3, data=mydata, weights=mydata$weight )
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = mydata, weights = mydata$weight)
## 
## Coefficients:
## (Intercept)           x1           x2           x3  
##      0.3346     -10.0615      -5.6777      -1.1734
# the same estimate computed directly from the matrix formula above:
solve( t(X) %*% W %*% X ) %*% t(X) %*% W %*% y
##            [,1]
## 1     0.3345705
## x1  -10.0615104
## x2   -5.6777383
## x3   -1.1734282