This post is an intuitive, visual, non-theoretical introduction to regression splines, assuming a basic knowledge of linear models.
All code in this post is R code.
Load required packages:
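The code in this post relies on the tidyverse (for tibble and the pipe operator %>%) and on the splines package (for the bs() and ns() spline functions):

library(tidyverse)
library(splines)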
Suppose that we observe the following data, consisting of \((x_i,y_i)\) pairs:
set.seed(5431)   # for reproducibility

# simulate 20 (x, y) pairs, with no true relationship between x and y
random_data <- tibble(
  y = sample(0:100, size = 20),
  x = sample(-50:50, size = 20)
)

par(bg = "black", col = "white", col.axis = "white",
    col.lab = "white", col.main = "white")   # dark plot theme
plot(x = random_data$x, y = random_data$y, xlab = "x", ylab = "y", pch = 16)
We could model the relationship between \(x\) and \(y\) using a simple linear model \(\quad y_i=\beta_0+\beta_1x_i + \epsilon_i \quad\) (a pretty useless model in this case):
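As a minimal sketch, this model can be fitted and drawn like so:

linear_fit <- lm(y ~ x, data = random_data)
plot(x = random_data$x, y = random_data$y, xlab = "x", ylab = "y", pch = 16)
abline(linear_fit)   # add the fitted regression line to the scatterplot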
If we calculate \(x^2\) from \(x\) and include \(x^2\) as an additional variable in the linear model (i.e. we fit the model \(\quad y_i=\beta_0+\beta_1x_i+\beta_2x_i^2 + \epsilon_i \quad\)), then the linear model can fit a quadratic curve (parabola) shape to the data:
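One way to fit this model is to wrap the squared term in I(), so that lm() treats \(x^2\) as just another column (a sketch; the degree-3 model below is fitted analogously by adding I(x^3)):

quadratic_fit <- lm(y ~ x + I(x^2), data = random_data)
# draw the fitted parabola over a fine grid of x values
# (assumes the scatterplot from above is the active plot)
x_grid <- seq(min(random_data$x), max(random_data$x), length.out = 200)
lines(x_grid, predict(quadratic_fit, newdata = data.frame(x = x_grid)))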
Adding a further term \(x^3\) allows the linear model to fit a polynomial curve of degree 3:
(this model is \(\quad y_i=\beta_0+\beta_1x_i+\beta_2x_i^2 + \beta_3x_i^3 + \epsilon_i \quad\))
Adding further terms \(x^4, x^5, x^6, ...\) allows even more complex polynomial curves to be fitted. Here is a polynomial curve of degree 10 fitted to the data:
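Rather than typing out every power, the poly() helper generates all of the terms at once; the degree-10 fit might be produced like this (a sketch):

poly10_fit <- lm(y ~ poly(x, degree = 10, raw = TRUE), data = random_data)
x_grid <- seq(min(random_data$x), max(random_data$x), length.out = 200)
plot(x = random_data$x, y = random_data$y, xlab = "x", ylab = "y", pch = 16)
lines(x_grid, predict(poly10_fit, newdata = data.frame(x = x_grid)))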
Notice the erratic behaviour of the fitted curve at the extreme values of \(x\) (\(x < -40\) and \(x > 50\)).
Predictions from this model at or beyond these extremes certainly cannot be trusted.
Regression splines are a more constrained form of the polynomial regression discussed above.
To fit a model with splines, we must first define knot points. Knot points (or knots) are selected values of \(x\).
For example, we could place equidistant (uniform) knots:
Alternatively, we could place the knots at the quantiles of \(x\) (i.e. the same number of observations lie between each knot):
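As a sketch, with three interior knots the two placement strategies could be computed like this:

n_knots <- 3
# equidistant (uniform) knots: evenly spaced across the range of x
uniform_knots <- seq(min(random_data$x), max(random_data$x),
                     length.out = n_knots + 2)[-c(1, n_knots + 2)]
# quantile knots: the same number of observations lie between each knot
quantile_knots <- quantile(random_data$x, probs = (1:n_knots) / (n_knots + 1))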
See the plots below: a degree-1 spline fits a straight line (a degree-1 polynomial) between each pair of adjacent knots, where the lines are constrained to connect (hinge) at every knot point.
The positions of the lines are estimated using Ordinary Least Squares (OLS), just as in a standard linear regression model: among all sets of lines obeying the continuity (connection) constraints, the fitted lines are those that minimise the sum of squared errors (the squared vertical distances between the lines and the points).
A degree-2 spline fits a quadratic (degree-2) curve between each pair of adjacent knots, giving the linear model access to a wider range of shapes. The curves are constrained to connect at every knot point, and also to be continuous in their first derivative at each knot (i.e. the slope cannot jump as the curve crosses a knot point):
In general, a spline of degree \(D\) fits polynomial curves of degree \(D\) between each knot point, where these curves are constrained to be continuous in their \(1^{\text{st}},2^{\text{nd}},...,(D-1)^{\text{th}}\) derivatives at the knot points. Effectively, these continuity constraints force different degrees of smoothness on the fitted curve where it crosses knot points.
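One basis that satisfies exactly these constraints is the truncated power basis: a spline of degree \(D\) with knots \(k_1,...,k_K\) can be written as \(\quad y_i=\beta_0+\beta_1x_i+...+\beta_Dx_i^D+\theta_1(x_i-k_1)_+^D+...+\theta_K(x_i-k_K)_+^D+\epsilon_i \quad\), where \((x-k)_+^D\) equals \((x-k)^D\) for \(x>k\) and \(0\) otherwise. (In practice, R's bs() function uses the equivalent, but numerically more stable, B-spline basis.)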
Practically, splines are incorporated into a standard linear model by adding specific additional variables to the model, which are functions of the existing variables (a process similar to adding polynomial terms to a model to fit polynomial curves, as discussed at the beginning of this post). In contrast to polynomial regression, regression splines require the specification of knots, which are chosen values of \(x\) that constrain the fit of the curve.
The parameters to choose when fitting regression splines in R are the degrees of freedom (df) of the spline and the degree of the polynomial. The degrees of freedom determine the number of knot points, and the degree determines the flexibility of the curve (it is the degree of the polynomial pieces fitted between the knots). The default behaviour of the bs() function in the R splines package is to place knots at the quantiles of \(x\), with the number of interior knots being \((df - degree)\). Alternatively, instead of choosing \(df\), the user can specify the desired locations of the knots manually.
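As a sketch, a degree-2 spline with \(df=5\) (and hence \(5-2=3\) interior knots placed at the quantiles of \(x\)) is fitted like this:

spline_fit <- lm(y ~ bs(x, df = 5, degree = 2), data = random_data)

# alternatively, specify the knot locations manually
# (the knot values here are hypothetical)
spline_fit_manual <- lm(y ~ bs(x, knots = c(-25, 0, 25), degree = 2),
                        data = random_data)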
Below are plots of splines with different parameters fitted to the same dataset, with the fitted model coefficients reported under each plot:
## (Intercept) bs(x, df = 3, degree = 1)1
## 57.18815 -22.94823
## bs(x, df = 3, degree = 1)2 bs(x, df = 3, degree = 1)3
## 19.70990 -16.41487
## (Intercept) bs(x, df = 4, degree = 1)1
## 58.38727 -22.06934
## bs(x, df = 4, degree = 1)2 bs(x, df = 4, degree = 1)3
## -16.32475 28.90788
## bs(x, df = 4, degree = 1)4
## -22.83509
## (Intercept) bs(x, df = 5, degree = 1)1
## 58.78301 -17.46091
## bs(x, df = 5, degree = 1)2 bs(x, df = 5, degree = 1)3
## -23.16210 13.77229
## bs(x, df = 5, degree = 1)4 bs(x, df = 5, degree = 1)5
## 15.55141 -29.19284
## (Intercept) bs(x, df = 6, degree = 1)1
## 58.97190 -18.59368
## bs(x, df = 6, degree = 1)2 bs(x, df = 6, degree = 1)3
## -21.83824 -14.03221
## bs(x, df = 6, degree = 1)4 bs(x, df = 6, degree = 1)5
## 13.63540 12.22451
## bs(x, df = 6, degree = 1)6
## -29.96143
## (Intercept) bs(x, df = 3, degree = 2)1
## 63.91964 -54.82157
## bs(x, df = 3, degree = 2)2 bs(x, df = 3, degree = 2)3
## 33.44914 -27.74324
## (Intercept) bs(x, df = 4, degree = 2)1
## 59.28189 -24.98595
## bs(x, df = 4, degree = 2)2 bs(x, df = 4, degree = 2)3
## -15.12867 45.02574
## bs(x, df = 4, degree = 2)4
## -30.28332
## (Intercept) bs(x, df = 5, degree = 2)1
## 56.905024 -3.989234
## bs(x, df = 5, degree = 2)2 bs(x, df = 5, degree = 2)3
## -28.473248 15.558043
## bs(x, df = 5, degree = 2)4 bs(x, df = 5, degree = 2)5
## 25.711681 -27.156239
## (Intercept) bs(x, df = 3, degree = 3)1
## 64.92996 -99.80511
## bs(x, df = 3, degree = 3)2 bs(x, df = 3, degree = 3)3
## 78.00862 -30.53351
## (Intercept) bs(x, df = 5, degree = 3)1
## 55.20297 11.20199
## bs(x, df = 5, degree = 3)2 bs(x, df = 5, degree = 3)3
## -49.80223 36.50782
## bs(x, df = 5, degree = 3)4 bs(x, df = 5, degree = 3)5
## 22.05224 -25.25417
## (Intercept) bs(x, df = 6, degree = 3)1
## 52.676158 29.945396
## bs(x, df = 6, degree = 3)2 bs(x, df = 6, degree = 3)3
## -38.914762 -1.684552
## bs(x, df = 6, degree = 3)4 bs(x, df = 6, degree = 3)5
## 35.995121 15.914212
## bs(x, df = 6, degree = 3)6
## -23.410597
## (Intercept) bs(x, df = 8, degree = 4)1
## 42.96218 95.24252
## bs(x, df = 8, degree = 4)2 bs(x, df = 8, degree = 4)3
## -52.65989 18.02773
## bs(x, df = 8, degree = 4)4 bs(x, df = 8, degree = 4)5
## -17.27315 60.19318
## bs(x, df = 8, degree = 4)6 bs(x, df = 8, degree = 4)7
## 42.97850 -16.51947
## bs(x, df = 8, degree = 4)8
## -8.33110
Notice that the erratic behaviour at the extreme values of \(x\), reminiscent of the polynomial regression at the beginning of this post, has not been mitigated here by the use of splines.
Natural splines are cubic splines (degree-3 splines) that are additionally constrained to be linear beyond the boundary knots (to the left of the left-most knot and to the right of the right-most knot). The default number of knots used for natural splines is \((df-1)+2\): \((df-1)\) interior knots are placed at the quantiles of \(x\), and one boundary knot is placed at each of the minimum and maximum values of \(x\) (i.e. the boundary knots span the range of the data).
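Natural splines are fitted in the same way, swapping ns() in for bs(); as a sketch (using a hypothetical \(df=4\)):

# df = 4: 3 interior knots at the quantiles of x, plus boundary knots
# at min(x) and max(x)
natural_fit <- lm(y ~ ns(x, df = 4), data = random_data)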
## (Intercept) ns(x, df = 1)
## 48.431818 6.373517
## (Intercept) ns(x, df = 2)1 ns(x, df = 2)2
## 46.765482 10.131176 2.723467
## (Intercept) ns(x, df = 3)1 ns(x, df = 3)2 ns(x, df = 3)3
## 62.238630 48.529712 -45.974897 1.122686
## (Intercept) ns(x, df = 4)1 ns(x, df = 4)2 ns(x, df = 4)3 ns(x, df = 4)4
## 58.52996 -22.59724 52.45135 -19.94042 -14.01139
## (Intercept) ns(x, df = 5)1 ns(x, df = 5)2 ns(x, df = 5)3 ns(x, df = 5)4
## 58.12404 -27.94305 17.61753 32.42301 -24.99641
## ns(x, df = 5)5
## -18.47705
## (Intercept) ns(x, df = 6)1 ns(x, df = 6)2 ns(x, df = 6)3 ns(x, df = 6)4
## 57.454743 -30.665337 -2.468535 16.979747 30.564380
## ns(x, df = 6)5 ns(x, df = 6)6
## -21.378124 -21.872811
## (Intercept) ns(x, df = 10)1 ns(x, df = 10)2 ns(x, df = 10)3
## 53.298076 -57.793285 43.353853 -47.201761
## ns(x, df = 10)4 ns(x, df = 10)5 ns(x, df = 10)6 ns(x, df = 10)7
## 39.065086 42.367917 -13.185491 79.028449
## ns(x, df = 10)8 ns(x, df = 10)9 ns(x, df = 10)10
## -37.905169 3.023802 -33.968799
When splines of the same type are to be applied to many variables, typing out all of the variable names in the model formula becomes tedious. The code below is one way to automate this.
# simulate a dataset with a numeric response, four numeric predictors
# and two factor predictors
random_dat <- tibble(
  y  = runif(100),
  x1 = rnorm(100),
  x2 = sample(1:5, size = 100, replace = TRUE) %>% factor(),
  x3 = rnorm(100),
  x4 = sample(1:9, size = 100, replace = TRUE) %>% factor(),
  x5 = rnorm(100),
  x6 = rnorm(100)
)
random_dat
# names of the numeric predictor columns
numeric_colnames <-
  random_dat %>%
  select(-y) %>%
  select_if(is.numeric) %>%
  colnames()

# names of the factor predictor columns
factor_colnames <-
  random_dat %>%
  select(-y) %>%
  select_if(is.factor) %>%
  colnames()
numeric_colnames
## [1] "x1" "x3" "x5" "x6"
# build the model formula as a string, then convert it to a formula object
formula_for_model <-
  paste0(
    "y ~ ",
    paste("ns(", numeric_colnames, ", df=3 )", collapse = " + "),  # spline terms for numeric variables
    " + ",
    paste(factor_colnames, collapse = " + ")                       # factor variables entered as-is
  ) %>%
  as.formula()
This gives the following formula, which can be supplied to lm():
## y ~ ns(x1, df = 3) + ns(x3, df = 3) + ns(x5, df = 3) + ns(x6,
## df = 3) + x2 + x4
## <environment: 0x000000000c5f59a0>
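Fitting the model with this formula:

lm(formula_for_model, data = random_dat)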
##
## Call:
## lm(formula = formula_for_model, data = random_dat)
##
## Coefficients:
## (Intercept) ns(x1, df = 3)1 ns(x1, df = 3)2 ns(x1, df = 3)3
## 0.416395 0.073965 0.678727 0.118909
## ns(x3, df = 3)1 ns(x3, df = 3)2 ns(x3, df = 3)3 ns(x5, df = 3)1
## -0.241284 -0.870487 -0.047459 0.269169
## ns(x5, df = 3)2 ns(x5, df = 3)3 ns(x6, df = 3)1 ns(x6, df = 3)2
## 0.913253 0.399557 -0.042307 -0.116973
## ns(x6, df = 3)3 x22 x23 x24
## -0.176722 0.022490 -0.057537 -0.239666
## x25 x42 x43 x44
## -0.192413 -0.075847 -0.109074 -0.038354
## x45 x46 x47 x48
## -0.192988 -0.060129 0.001852 -0.101924
## x49
## -0.047031