This post is an intuitive, visual, non-theoretical introduction to regression splines, assuming a basic knowledge of linear models.

....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
..........................................................................7.........................
.......................................................................717..........................
.................................................................7.....771777717.777................
...............................................................77..77777427722217.77................
..............................................................7..721723651453342427742721...........
...........................................................7777.143436854566600339566477777.........
............................................................71.72385692777777774212356921177777.....
..........................................................7775225425517.77777..7127771906617717.77..
.........................................................711160177.77..177777.777717.77123831777....
......................................................71542664537................77..17..135522227..
..................................................77775955317.77...........................755564347
............................................7..77.772425555277741.......................7.7734195042
...............................................77..2269144777...........................7..77..74585
77.........................................777777726947..7...................................77.7725
717.......................................7777..7982447...........................................77
277......................................77777456377.7.........................................7...7
3147..............................77.....7717295337...7...........................................71
2657774117.............................7777264211777..7.............................................
7195532777.........................72772175523177...................................................
1723993377777.....7777...........7177145903727......................................................
7771774696547717.777......77.7..14777296777177......................................................
...12711245517777717.7777777.777553555771717........................................................
....77777725564333212477217722236332441717..........................................................
.....777712174509366354909986566343127777...........................................................
.........777131147731353466333317.727....7..........................................................
.............77.7777177177772317.7..................................................................
............777777777........7......................................................................
.................7..77......7..7....................................................................
....................................................................................................
......................7.............................................................................
....................................................................................................
....................................................................................................
....................................................................................................

All code in this post is R code.

Load required packages:

library(tidyverse)        # for basic data manipulation 
library(splines)          # for fitting regression splines

Suppose that we observe the following data, consisting of \((x_i,y_i)\) pairs:

set.seed(5431)
random_data <- tibble( y = sample(0:100, size=20),
                       x = sample(-50:50, size=20)
                     )

par(bg="black", col="white", col.axis="white", col.lab="white", col.main="white")
plot( x=random_data$x, y=random_data$y, xlab="x", ylab="y", pch=16 )

We could model the relationship between \(x\) and \(y\) using a simple linear model \(\quad y_i=\beta_0+\beta_1x_i + \epsilon_i \quad\) (a pretty useless model in this case):

If we calculate \(x^2\) from \(x\) and include \(x^2\) as an additional variable in the linear model (i.e. we fit the model \(\quad y_i=\beta_0+\beta_1x_i+\beta_2x_i^2 + \epsilon_i \quad\)), then the linear model can fit a quadratic curve (parabola) shape to the data:

Adding a further term \(x^3\) allows the linear model to fit a polynomial curve of degree 3:

(this model is \(\quad y_i=\beta_0+\beta_1x_i+\beta_2x_i^2 + \beta_3x_i^3 + \epsilon_i \quad\))

Adding further terms \(x^4,x^5,x^6...etc.\) allows even more complex polynomial curves to be fitted. Here is a polynomial curve of degree 10 fit to the data:

Notice the radical behaviour of the fitted curve at the extreme values of \(x\) (\(x<-40\) and \(x>50\)).

Predictions of the model outside of these extreme values can definitely not be trusted.

Regression Splines

Regression splines are a more constrained form of the polynomial regression discussed above.

To fit a model with splines, we must first define knot points. Knot points (or knots) are selected values of \(x\).

For example, we could place equidistant (uniform) knots:

Alternatively, we could place the knots at the quantiles of \(x\) (i.e. the same number of observations lie between each knot):

See plots below: a degree-1 spline fits a degree-1 linear model (line) between each knot, where the lines are constrained to connect (hinge) at every knot point.

The positions of the lines are calculated using Ordinary Least Squares (OLS) just like a standard linear regression model, meaning that the fitted lines are those lines obeying the continuity (connection) constraints that minimise the sum of squared errors (squared vertical distances) between those lines and the points.

A degree-2 spline fits a quadratic (degree 2) curve between each knot, giving the linear model access to a wider range of shapes. The curves are constrained to connect at every knot point, and also to be continuous in the first derivative at each knot point (i.e. the slope of the curve is forced to change smoothly across each knot point):

In general, a spline of degree \(D\) fits polynomial curves of degree \(D\) between each knot point, where these curves are constrained to be continuous in their \(1^{\text{st}},2^{\text{nd}},...,(D-1)^{\text{th}}\) derivatives at the knot points. Effectively, these continuity constraints force different degrees of smoothness on the fitted curve where it crosses knot points.

Practically, splines are incorporated into a standard linear model by adding specific additional variables to the model, which are functions of the existing variables (a process similar to adding polynomial terms to a model to fit polynomial curves, as discussed at the beginning of this post). In contrast to polynomial regression, regression splines require the specification of knots, which are chosen values of \(x\) that constrain the fit of the curve.

The parameters to be chosen when fitting regression splines in R are the degrees of freedom (df) of the splines and the degree of the polynomial curve. The degrees of freedom (df) determines the number of knot points, and the degree determines the flexibility of the curve (it is the degree of the polynomial curve shapes fitted). The default behaviour of the R splines() function is to place knot points at the quantiles of \(x\), with the number of knots being \(\quad (df \space-\space degree) \quad\). Alternatively, instead of choosing \(df\), the user can specify the desired locations of the knots manually.

What follow below are plots of splines with different parameters fit to the same dataset: (with the fitted model coefficients reported under each plot)

##                (Intercept) bs(x, df = 3, degree = 1)1 
##                   57.18815                  -22.94823 
## bs(x, df = 3, degree = 1)2 bs(x, df = 3, degree = 1)3 
##                   19.70990                  -16.41487

##                (Intercept) bs(x, df = 4, degree = 1)1 
##                   58.38727                  -22.06934 
## bs(x, df = 4, degree = 1)2 bs(x, df = 4, degree = 1)3 
##                  -16.32475                   28.90788 
## bs(x, df = 4, degree = 1)4 
##                  -22.83509

##                (Intercept) bs(x, df = 5, degree = 1)1 
##                   58.78301                  -17.46091 
## bs(x, df = 5, degree = 1)2 bs(x, df = 5, degree = 1)3 
##                  -23.16210                   13.77229 
## bs(x, df = 5, degree = 1)4 bs(x, df = 5, degree = 1)5 
##                   15.55141                  -29.19284

##                (Intercept) bs(x, df = 6, degree = 1)1 
##                   58.97190                  -18.59368 
## bs(x, df = 6, degree = 1)2 bs(x, df = 6, degree = 1)3 
##                  -21.83824                  -14.03221 
## bs(x, df = 6, degree = 1)4 bs(x, df = 6, degree = 1)5 
##                   13.63540                   12.22451 
## bs(x, df = 6, degree = 1)6 
##                  -29.96143

##                (Intercept) bs(x, df = 3, degree = 2)1 
##                   63.91964                  -54.82157 
## bs(x, df = 3, degree = 2)2 bs(x, df = 3, degree = 2)3 
##                   33.44914                  -27.74324

##                (Intercept) bs(x, df = 4, degree = 2)1 
##                   59.28189                  -24.98595 
## bs(x, df = 4, degree = 2)2 bs(x, df = 4, degree = 2)3 
##                  -15.12867                   45.02574 
## bs(x, df = 4, degree = 2)4 
##                  -30.28332

##                (Intercept) bs(x, df = 5, degree = 2)1 
##                  56.905024                  -3.989234 
## bs(x, df = 5, degree = 2)2 bs(x, df = 5, degree = 2)3 
##                 -28.473248                  15.558043 
## bs(x, df = 5, degree = 2)4 bs(x, df = 5, degree = 2)5 
##                  25.711681                 -27.156239

##                (Intercept) bs(x, df = 3, degree = 3)1 
##                   64.92996                  -99.80511 
## bs(x, df = 3, degree = 3)2 bs(x, df = 3, degree = 3)3 
##                   78.00862                  -30.53351

##                (Intercept) bs(x, df = 3, degree = 3)1 
##                   64.92996                  -99.80511 
## bs(x, df = 3, degree = 3)2 bs(x, df = 3, degree = 3)3 
##                   78.00862                  -30.53351

##                (Intercept) bs(x, df = 5, degree = 3)1 
##                   55.20297                   11.20199 
## bs(x, df = 5, degree = 3)2 bs(x, df = 5, degree = 3)3 
##                  -49.80223                   36.50782 
## bs(x, df = 5, degree = 3)4 bs(x, df = 5, degree = 3)5 
##                   22.05224                  -25.25417

##                (Intercept) bs(x, df = 6, degree = 3)1 
##                  52.676158                  29.945396 
## bs(x, df = 6, degree = 3)2 bs(x, df = 6, degree = 3)3 
##                 -38.914762                  -1.684552 
## bs(x, df = 6, degree = 3)4 bs(x, df = 6, degree = 3)5 
##                  35.995121                  15.914212 
## bs(x, df = 6, degree = 3)6 
##                 -23.410597

##                (Intercept) bs(x, df = 8, degree = 4)1 
##                   42.96218                   95.24252 
## bs(x, df = 8, degree = 4)2 bs(x, df = 8, degree = 4)3 
##                  -52.65989                   18.02773 
## bs(x, df = 8, degree = 4)4 bs(x, df = 8, degree = 4)5 
##                  -17.27315                   60.19318 
## bs(x, df = 8, degree = 4)6 bs(x, df = 8, degree = 4)7 
##                   42.97850                  -16.51947 
## bs(x, df = 8, degree = 4)8 
##                   -8.33110

Notice that the extreme/undesirable behaviour at the extreme values of \(x\), reminiscent of the polynomial regression at the beginning of this post, has not been mitigated in this case by using splines.

Natural splines

Natural splines are cubic splines (degree-3 splines) that are constrained to be linear on the boundary (to the left of the left-most knot and to the right of the right-most knot). The default number of knots used for natural splines is \((df-1+2)\). \(\quad (df-1)\) knots are placed at the quantiles of \(x\), and an additional knot is placed at each of the highest (maximum) and lowest (minimum) value of \(x\) (i.e. boundary knots are placed on the range of the data).

##   (Intercept) ns(x, df = 1) 
##     48.431818      6.373517

##    (Intercept) ns(x, df = 2)1 ns(x, df = 2)2 
##      46.765482      10.131176       2.723467

##    (Intercept) ns(x, df = 3)1 ns(x, df = 3)2 ns(x, df = 3)3 
##      62.238630      48.529712     -45.974897       1.122686

##    (Intercept) ns(x, df = 4)1 ns(x, df = 4)2 ns(x, df = 4)3 ns(x, df = 4)4 
##       58.52996      -22.59724       52.45135      -19.94042      -14.01139

##    (Intercept) ns(x, df = 5)1 ns(x, df = 5)2 ns(x, df = 5)3 ns(x, df = 5)4 
##       58.12404      -27.94305       17.61753       32.42301      -24.99641 
## ns(x, df = 5)5 
##      -18.47705

##    (Intercept) ns(x, df = 6)1 ns(x, df = 6)2 ns(x, df = 6)3 ns(x, df = 6)4 
##      57.454743     -30.665337      -2.468535      16.979747      30.564380 
## ns(x, df = 6)5 ns(x, df = 6)6 
##     -21.378124     -21.872811

##      (Intercept)  ns(x, df = 10)1  ns(x, df = 10)2  ns(x, df = 10)3 
##        53.298076       -57.793285        43.353853       -47.201761 
##  ns(x, df = 10)4  ns(x, df = 10)5  ns(x, df = 10)6  ns(x, df = 10)7 
##        39.065086        42.367917       -13.185491        79.028449 
##  ns(x, df = 10)8  ns(x, df = 10)9 ns(x, df = 10)10 
##       -37.905169         3.023802       -33.968799

Adding splines to many variables

When it is desired that splines of the same type are to be added to many variables, typing out all of the variable names becomes tedious in R. What follows is an example of code to fix this problem.

random_dat <- tibble( y = runif(100),
                      x1 = rnorm(100),
                      x2 = sample(1:5, size=100, replace=TRUE) %>% factor(),
                      x3 = rnorm(100),
                      x4 = sample(1:9, size=100, replace=TRUE) %>% factor(),
                      x5 = rnorm(100),
                      x6 = rnorm(100)
                    )
random_dat

numeric_colnames <-
  random_dat %>%
    select(-y) %>%
      select_if( is.numeric ) %>%
      colnames(.)

factor_colnames <-
   random_dat %>%
    select(-y) %>%
      select_if( is.factor ) %>%
      colnames(.)

numeric_colnames

## [1] "x1" "x3" "x5" "x6"

formula_for_model <-
  paste0( "y ~ ",
          paste( "ns(", numeric_colnames, ",df=3 )", collapse=" + " ),   # numeric variables
          " + ",
          paste( factor_colnames, collapse=" + " )            # factor variables
       ) %>%
    as.formula()

this gives the following formula, which can be given to the lm() model:

formula_for_model

## y ~ ns(x1, df = 3) + ns(x3, df = 3) + ns(x5, df = 3) + ns(x6, 
##     df = 3) + x2 + x4
## <environment: 0x000000000c5f59a0>

lm( formula = formula_for_model,
       data = random_dat
  )

## 
## Call:
## lm(formula = formula_for_model, data = random_dat)
## 
## Coefficients:
##     (Intercept)  ns(x1, df = 3)1  ns(x1, df = 3)2  ns(x1, df = 3)3  
##        0.416395         0.073965         0.678727         0.118909  
## ns(x3, df = 3)1  ns(x3, df = 3)2  ns(x3, df = 3)3  ns(x5, df = 3)1  
##       -0.241284        -0.870487        -0.047459         0.269169  
## ns(x5, df = 3)2  ns(x5, df = 3)3  ns(x6, df = 3)1  ns(x6, df = 3)2  
##        0.913253         0.399557        -0.042307        -0.116973  
## ns(x6, df = 3)3              x22              x23              x24  
##       -0.176722         0.022490        -0.057537        -0.239666  
##             x25              x42              x43              x44  
##       -0.192413        -0.075847        -0.109074        -0.038354  
##             x45              x46              x47              x48  
##       -0.192988        -0.060129         0.001852        -0.101924  
##             x49  
##       -0.047031

A Visual Introduction to Regression Splines

Joseph Bolton

24 April, 2020

Regression Splines

Natural splines

Adding splines to many variables