Understanding Intercepts in Linear Regression Models

Author

Affiliation

Jack R. Leary

University of Florida

Published

April 16, 2024

1 Libraries

Code

library(dplyr)    # data manipulation
library(ggplot2)  # plots

2 Introduction

One concept I struggled with a lot in early statistics courses was what the intercept meant in linear regression models. I tended to just ignore it unless questions specifically pertained to it, and the vast majority of homework questions focused on interpreting the effects of covariates instead. I also saw many of my master’s-level students struggle with the topic in the SAS computing course I taught during Fall 2022 as well, with confusion about the effect of centering, the difference between centering and standardizing, and intercept interpretation in the different types of (generalized) linear (mixed) models being common pain points on homeworks. As such, I thought it might be useful - for myself and others - to jot down some notes on how the intercept is estimated and what it means under a variety of regression modelling frameworks.

3 Matrix algebra review

We’re going to start from first principles here with a quick review on matrix algebra. Linear regression is, after, just multiplying matrices in a clever way.

3.1 Multiplication

3.1.1 Theory

First define two matrices \(\mathbf{A}\) and \(\mathbf{B}\), each with 2 rows and 2 columns:

\[ \begin{aligned} \mathbf{A} &= \begin{bmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \\ \end{bmatrix} \\ \mathbf{B} &= \begin{bmatrix} b_{11} & b_{21} \\ b_{12} & b_{22} \\ \end{bmatrix} \\ \end{aligned} \tag{1}\]

Their product, another matrix \(C\), also has 2 rows and 2 columns, and its elements are defined like so, with \(i\) specifying the row and \(j\) the column of each element. What we’re doing is finding the dot product of the \(i^{\text{th}}\) row of \(\mathbf{A}\) and the \(j^{\text{th}}\) column of \(\mathbf{B}\), the expanded definition of which is below.

\[ \begin{aligned} c_{ij} &= \mathbf{A}_{i*} \cdot \mathbf{B}_{*j} \\ c_{ij} &= \sum_{k=1}^n a_{ik}b_{kj} \\ c_{ij} &= a_{i1}b_{1j} + \dots + a_{n1}b_{nj} \\ \end{aligned} \tag{2}\]

As such, we can define the product of \(\mathbf{A}\) and \(\mathbf{B}\) like so:

\[ \begin{aligned} \mathbf{C} &= \mathbf{A} \mathbf{B} \\ \mathbf{C} &= \begin{bmatrix} \mathbf{A}_{1*} \cdot \mathbf{B}_{*1} & \mathbf{A}_{2*} \cdot \mathbf{B}_{*1} \\ \mathbf{A}_{2*} \cdot \mathbf{B}_{*1} & \mathbf{A}_{2*} \cdot \mathbf{B}_{*2} \\ \end{bmatrix} \\ \mathbf{C} &= \begin{bmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \\ \end{bmatrix} \\ \end{aligned} \tag{3}\]

Important Note: To multiply two matrices \(\mathbf{A}\) and \(\mathbf{B}\) together, the number of rows of \(\mathbf{B}\) must be equal to the number of columns in \(\mathbf{A}\). To generalize:

\[ \mathbf{A}_{m \times n} \cdot \mathbf{B}_{n \times p} = \mathbf{C}_{m \times p} \tag{4}\]

3.1.2 Example

Let’s define two matrices:

\[ \begin{aligned} \mathbf{A} &= \begin{bmatrix} 3 & 2 \\ 0 & 7 \\ \end{bmatrix} \\ \mathbf{B} &= \begin{bmatrix} 1 & 4 \\ 1 & 2 \\ \end{bmatrix} \\ \end{aligned} \tag{5}\]

Their product \(\mathbf{C}\) is defined as:

\[ \begin{aligned} \mathbf{C} &= \begin{bmatrix} 3 \times 1 + 2 \times 1 & 3 \times 4 + 2 \times 2 \\ 0 \times 1 + 7 \times 1 & 0 \times 4 + 7 \times 2 \\ \end{bmatrix} \\ \mathbf{C} &= \begin{bmatrix} 5 & 16 \\ 7 & 14 \\ \end{bmatrix} \\ \end{aligned} \tag{6}\]

We can check this using R:

Code

A_mat <- matrix(c(3, 2, 0, 7), 
                nrow = 2, 
                ncol = 2, 
                byrow = TRUE)
B_mat <- matrix(c(1, 4, 1, 2), 
                nrow = 2, 
                ncol = 2, 
                byrow = TRUE)
C_mat <- A_mat %*% B_mat
C_mat

     [,1] [,2]
[1,]    5   16
[2,]    7   14

3.2 Transposition

3.2.1 Theory

Very simply, the transpose of a matrix can be thought of as simply flipping the rows & columns. The transpose of an \(m \times n\) matrix is thus \(n \times m\). A matrix that is its own transpose is symmetric. Notation-wise, some write the matrix transpose as \(A^T\), others as \(A^\mathsf{T}\), and still others denote it by \(A^\intercal\), but I personally prefer \(A^\prime\). The matrix transpose is more formally defined as:

\[ \begin{aligned} & \mathbf{A}_{m \times n} \\ & \mathbf{B}_{n \times n} = A^\prime \\ & \mathbf{B}_{i, j} = \mathbf{A}_{j, i} \\ \end{aligned} \tag{7}\]

3.2.2 Example

If we define the following matrix \(\mathbf{A}_{2 \times 3}\), we would expect its transpose to be the \(3 \times 2\) matrix \(\mathbf{A}^\prime\):

\[ \begin{aligned} \mathbf{A} &= \begin{bmatrix} 0 & -2 & 2 \\ 3 & -10 & 0 \\ \end{bmatrix} \\ \mathbf{A}^\prime &= \begin{bmatrix} 0 & 3 \\ -2 & -10 \\ 2 & 0 \\ \end{bmatrix} \\ \end{aligned} \tag{8}\]

We can confirm this in R using the t() function:

Code

A <- matrix(c(0, -2, 2, 3, -10, 0), 
            nrow = 2, 
            ncol = 3, 
            byrow = TRUE)
t(A)

     [,1] [,2]
[1,]    0    3
[2,]   -2  -10
[3,]    2    0

3.3 Inversion

3.3.1 Theory

The last matrix operation we’ll go over is inversion - this one is very important & is applied all throughout statistics & computing in general. The inverse of a matrix is simply another matrix that, when multiple by the first matrix, returns the identity matrix (a \(n \times n\) matrix with 1s on the diagonal and 0s everywhere else). Matrices must be square to be invertible, but not even all square matrices are invertible; the ones that aren’t are called singular. We’ll gloss over that fact though, & simply assume / fervently hope that the matrices we encounter will have an easily-computable (approximate) inverse. See this article for more information on inverse computation. We define the inverse of \(\mathbf{A}_{n \times n}\) as:

\[ \mathbf{A} \mathbf{A}^{-1} = \mathbf{I}_n \tag{9}\]

3.3.2 Example

In R, the solve() function is used most frequently to compute the matrix inverse; since the result is an approximation, we round the results to show that the result is the identity matrix:

Code

A <- matrix(rnorm(9), nrow = 3, ncol = 3)
A_inv <- solve(A)
round(A %*% A_inv)

     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

There are many other ways to compute a matrix inverse, as well as several different types of pseudoinverses. We can compute the Moore-Penrose pseudoinverse using the MASS package:

Code

A_mp_inv <- MASS::ginv(A)
round(A %*% A_mp_inv)

     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

3.4 The identity matrix

3.4.1 Theory

As mentioned previously, the identity matrix \(\mathbf{I}_{n}\) is a square matrix composed entirely of zeroes except along the diagonal, which is composed of ones. This matrix carries some unique properties (which are listed below) that will be helpful to us later on.

\[ \begin{aligned} \mathbf{I}_{n} &= \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & 0 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix} \\ \mathbf{I}_{n}^\prime &= \mathbf{I}_{n} \\ \mathbf{I}_{n}^{-1} &= \mathbf{I}_{n} \\ \end{aligned} \tag{10}\]

3.4.2 Example

We can set up a \(3 \times 3\) identity matrix \(\mathbf{I}_{3}\) in R using the diag() function:

Code

ident_mat <- diag(nrow = 3)
ident_mat

     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

The transpose is also equal to \(\mathbf{I}_{3}\):

Code

t(ident_mat)

     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

As is the inverse:

Code

solve(ident_mat)

     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

4 Linear model

4.1 Setup

For now, we’ll take it for granted that the solution to a linear regression problem is defined as follows:

\[ \widehat{\boldsymbol{\beta}} = \left(\mathbf{X}^\prime \mathbf{X} \right)^{-1} \mathbf{X}^\prime \mathbf{y} \tag{11}\]

4.2 The intercept-only model

The intercept-only model (also sometimes called the null model) is defined as linear regression when \(\mathbf{X}\) is simply a column vector of ones:

\[ \mathbf{X} = \begin{bmatrix} 1 \\ \vdots \\ 1 \\ \end{bmatrix} \tag{12}\]

We know the intercept-only model produces the mean as the one predicted value, as the mean minimizes the sum of squared errors in the absence of any other covariates. We can check this using R - we’ll first generate a vector \(\mathbf{y}\) consisting of 5 realizations of a random variable, such that \(Y \sim \mathcal{N}(0, 3)\).

Code

y <- rnorm(5, mean = 0, sd = 3)
y <- matrix(y, ncol = 1)
y

          [,1]
[1,] -5.415350
[2,]  4.356184
[3,] -2.261118
[4,]  3.439089
[5,] -2.163296

The mean of \(\mathbf{y}\) is:

Code

mean(y)

[1] -0.4088981

We can use R to fit an intercept-only model. We can see that the intercept coefficient \(\beta_0\) is equal to the mean of \(\mathbf{y}\).

Code

null_mod <- lm(y ~ 1)
coef(null_mod)

(Intercept) 
 -0.4088981

Let’s use linear algebra to figure out why this is true. Once again, we know that the linear regression closed-form solution is given by the following:

\[ \widehat{\boldsymbol{\beta}} = \left(\mathbf{X}^\prime \mathbf{X} \right)^{-1} \mathbf{X}^\prime \mathbf{y} \tag{13}\]

Let’s first define \(\mathbf{X}\) - just a column vector of 1s with \(n = 5\) rows:

Code

X <- c(1, 1, 1, 1, 1)
X <- matrix(X, ncol = 1)
X

     [,1]
[1,]    1
[2,]    1
[3,]    1
[4,]    1
[5,]    1

The value of \(\mathbf{X}^\prime \mathbf{X}\) is given by the following - note that this is equal to our sample size \(n = 5\). We knew that this quantity would be a scalar (a \(1 \times 1\) matrix) since \(\mathbf{X}^\prime\) has 1 row and 5 columns, and \(\mathbf{X}\) has 5 rows and 1 column, thus by the rule we defined above their product has 1 row and 1 column.

Code

t(X) %*% X

     [,1]
[1,]    5

The inverse of which, \(\left(\mathbf{X}^\prime \mathbf{X} \right)^{-1}\), is of course \(n^{-1}\):

Code

solve(t(X) %*% X)

     [,1]
[1,]  0.2

We multiply the above by \(\mathbf{X}^\prime\) again to obtain \(\left(\mathbf{X}^\prime \mathbf{X} \right)^{-1} \mathbf{X}^\prime\), which gives us a constant vector of length \(n\) with all values being equal to \(n^{-1}\):

Code

solve(t(X) %*% X) %*% t(X)

     [,1] [,2] [,3] [,4] [,5]
[1,]  0.2  0.2  0.2  0.2  0.2

Lastly, we multiply the above by \(\mathbf{y}\). Remember how multiplying vectors works - in this case we are multiplying each element of the above vector \(\left(\mathbf{X}^\prime \mathbf{X} \right)^{-1} \mathbf{X}^\prime\) with each element of \(\mathbf{y}\) and adding them together. We’ll define \(\mathbf{Z} = \left(\mathbf{X}^\prime \mathbf{X} \right)^{-1} \mathbf{X}^\prime\) for convenience of notation:

\[ \mathbf{Z} \mathbf{y} = \sum_{i=1}^n \mathbf{Z}_i \mathbf{y}_i \tag{14}\]

Since each element of \(\mathbf{Z}\) is the same, \(n^{-1}\), by the transitive property the above quantity is equivalent to:

\[ \begin{aligned} \mathbf{Z} \mathbf{y} &= \left(\mathbf{X}^\prime \mathbf{X} \right)^{-1} \mathbf{X}^\prime \mathbf{y} \\ \mathbf{Z} \mathbf{y} &= \sum_{i=1}^n \mathbf{Z}_i \mathbf{y}_i \\ \mathbf{Z} \mathbf{y} &= n^{-1} \sum_{i=1}^n \mathbf{y}_i \\ \end{aligned} \tag{15}\]

This is simply the sum of all the elements of \(\mathbf{y}\) divided by \(n\) - the mean! We can verify this with R by using linear algebra to compute the OLS solution:

Code

solve(t(X) %*% X) %*% t(X) %*% y

           [,1]
[1,] -0.4088981

This is equal to simply taking the mean of \(\mathbf{y}\):

Code

mean(y)

[1] -0.4088981

4.3 Models with categorical predictors

In practice of course we usually build models with predictors of interest outside of the intercept. Categorical variables are composed of discrete values, each with a different meaning e.g., we could have a variable containing the type of treatment a patient has received. In order to fit models with categorical variables, it’s necessary to expand a categorical variable into multiple indicator variables - variables composed of 1s and 0s depending on whether a certain observation belongs to a certain category. This is a little confusing, so let’s show an example. We’ll start by creating a categorical variable .

Code

X <- sample(c("A", "B"), size = 10, replace = TRUE)
X <- matrix(X, ncol = 1)
X

      [,1]
 [1,] "B" 
 [2,] "B" 
 [3,] "B" 
 [4,] "A" 
 [5,] "B" 
 [6,] "B" 
 [7,] "A" 
 [8,] "B" 
 [9,] "B" 
[10,] "A"

To convert \(\mathbf{X}\) into a numeric variable that we can use in a model, we use the model.matrix() function. To use this function though, we need to define the model we’re interested in using R’s formula syntax. The output we see shows an intercept column, which we understand, and another column composed of 1s and 0s called XB. This column is an indicator variable that tells us whether each observation belongs to category B or not - thus when XB is equal to 0, we know that the observation belongs to category A. This process of converting non-numeric categorical data to indicator variables has many names (one-hot encoding, dummifying, etc.), and there’s many ways of doing it. You can read more about the various ways of doing so here, but for now we’ll take it for granted that this is how it works under the hood in the lm() function that we use to fit linear models.

Code

X_2 <- model.matrix(~X)
X_2

   (Intercept) XB
1            1  1
2            1  1
3            1  1
4            1  0
5            1  1
6            1  1
7            1  0
8            1  1
9            1  1
10           1  0
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$X
[1] "contr.treatment"

From what we already know about matrix multiplication, we can see that the B group is going to have a different predicted average than the A group. We’ll also need to generate a new response variable \(\mathbf{y}\), since we’ve increased our sample size to \(n = 10\).

Code

y <- rnorm(10, mean = 0, sd = 3)
y <- matrix(y, ncol = 1)
y

            [,1]
 [1,] -5.8375020
 [2,] -1.8924219
 [3,]  2.9715616
 [4,] -1.5104106
 [5,] -1.1778186
 [6,]  1.3690104
 [7,] -0.5117578
 [8,]  1.8279983
 [9,]  1.9796838
[10,] -4.7857809

The mean of \(\mathbf{y}\) for each treatment group can be computed as follows. We’re going to use a little dplyr code to perform the summarization, as I find it a little more readable & replicable than base R. This will necessitate creating a data.frame to hold our \(\mathbf{y}\) and \(\mathbf{X}\) variables. Note that we’ve switched back to the categorical representation of \(\mathbf{X}\), as it’s more interpretable than the indicator variable version for summaries such as this.

Code

data.frame(X = X, y = y)

X	y
B	-5.8375020
B	-1.8924219
B	2.9715616
A	-1.5104106
B	-1.1778186
B	1.3690104
A	-0.5117578
B	1.8279983
B	1.9796838
A	-4.7857809

Here’s the mean for each group:

Code

data.frame(X = X, y = y) %>% 
  with_groups(X, 
              summarise, 
              mu = mean(y))

X	mu
A	-2.2693164
B	-0.1084983

Let’s use the OLS formula to solve for \(\boldsymbol{\beta} = [\beta_0, \beta_1]\). Note that we’re once again using the design matrix version of \(\mathbf{X}\) with the intercept column and indicator variable for group. We see that the intercept \(\beta_0\) is equal to the mean of the A group - but the coefficient for the B group \(\beta_1\) isn’t! This is because \(\beta_1\) doesn’t have the same interpretation as \(\beta_0\). While \(\beta_0\) is equal to the mean of the reference group (i.e., the first level of the categorical variable, in our case group A), \(\beta_1\) represents the difference between the mean for group A and the mean for group B.

Code

cat_mod_beta <- solve(t(X_2) %*% X_2) %*% t(X_2) %*% y
cat_mod_beta

                 [,1]
(Intercept) -2.269316
XB           2.160818

This becomes easier to understand when we sum the coefficients and get the average for group B, which is -0.1084983.

Code

sum(cat_mod_beta)

[1] -0.1084983

This is validated by fitting a linear model with lm() and checking the output, which matches our manual solution:

Code

cat_mod <- lm(y ~ X)
coef(cat_mod)

(Intercept)          XB 
  -2.269316    2.160818

To summarize: when categorical variables are used in an ordinary linear regression, the intercept represents the mean of the response variable when each of the categorical variables is at its reference level. When running regressions like this, it’s important to make sure that 1) you know what the reference levels are for each of your categorical variables and 2) that those reference levels make sense. Sometimes it doesn’t matter what order the categorical variables are in, but it often does - so check! A final note is that this interpretation holds when only categorical variables are used in your model. When continuous variables are included too, the interpretation changes. More on that in a bit.

Note

Working with categorical variables (or factors, as R labels them) can be confusing. If you want to gain a deeper understanding of how factors work, check out the chapter on them in the R for Data Science book. For factor-related tools, try the forcats R package, which is part of the tidyverse ecosystem and makes dealing with factors a lot simpler.

4.4 Models with continuous predictors

Continuous predictors differ from categorical ones in that they do not have a discrete set of possible values. The interpretation of the intercept thus differs. For any regression, the intercept interpretation is the value of the response when all predictors are equal to zero. What “all predictors equal to zero” means depends on the types of predictors you’re using; when predictors are categorical this implies that all predictors are at their reference levels (thanks to the indicator variable encoding that’s done in the background). With continuous variables, being equal to zero might have a reasonable interpretation or it might not, depending on what the variable is. In this case, think of the intercept like you would in middle school when learning \(y = mx + b\). The intercept, \(b\), is the value of \(y\) where \(x = 0\), like the plot below.

Code

data.frame(x = rnorm(500, sd = 2)) %>% 
  mutate(y = x + rnorm(500, sd = 0.5)) %>% 
  ggplot(aes(x = x, y = y)) + 
  geom_point(alpha = 0.8) + 
  geom_vline(xintercept = 0, color = "forestgreen", size = 1) +
  labs(x = "X", y = "Y") + 
  theme_classic(base_size = 14)

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

In some cases, this value might be meaningful - for example, if your covariate of interest was the expression of a certain gene, then zero expression has a biological meaning. In other situations it makes little sense, such as when the covariate of interest is the age of each patient in your dataset. It’s unlikely that age being equal to zero would really mean much, as newborns aren’t often part of clinical trials. There’s ways to remedy this difficulty in interpretation, but we’ll focus first on why the interpretation is the way it is.

We’ll start by generating some data. The predictor variable we’re interested in will be negative-binomially distributed, with \(X \sim \text{NB}(10, 0.7)\). Note that we’re using the parameterization of the negative binomial used in the rnbinom() function defaults, with the size and probability parameters. Our response variable \(\mathbf{y}\) will be a function of \(\mathbf{X}\) with added normally-distributed noise. Since we’ve increased our sample size to \(n = 500\), we’ll only look at the first few rows of each variable.

Code

X <- rnbinom(500, size = 10, prob = 0.7)
y <- 2 * X + rnorm(500, mean = 0, sd = 2)
X <- matrix(X, ncol = 1)
y <- matrix(y, ncol = 1)
data.frame(X = X[, 1], y = y[, 1]) %>% 
  slice_head(n = 5)

X	y
6	12.495948
7	15.185192
4	8.957864
6	10.231077
3	4.752311

We can plot the data using ggplot2 to get an idea of what the relationship between the two variables is.

Code

data.frame(X = X[, 1], y = y[, 1]) %>% 
  ggplot(aes(x = X, y = y)) + 
  geom_point() + 
  labs(x = "X", y = "y") + 
  theme_classic(base_size = 14)

Using dplyr, we can find the mean of \(\mathbf{y}\) when \(\mathbf{X} = 0\).

Code

data.frame(X = X[, 1], y = y) %>% 
  filter(X == 0) %>% 
  summarise(mu = mean(y))

mu
0.6471092

Let’s fit a linear model manually and see what the coefficients are. We’ll first need to create the design matrix again using model.matrix(). This gives us a two column matrix, with the first column being the intercept (all 1s), and the second column being the negative binomial random variable \(\mathbf{X}\) we just simulated. Unlike models with categorical predictors, the intercept is not simply equal to the expected value when \(\mathbf{X} = 0\). Instead, the intercept is the expected value of the response variable conditional on the line of best fit that has been obtained i.e., conditional on the rest of the data in \(\mathbf{X}\). See this StackOverflow post for another example.

Code

X_3 <- model.matrix(~X)
cont_mod_beta <- solve(t(X_3) %*% X_3) %*% t(X_3) %*% y
cont_mod_beta

                  [,1]
(Intercept) -0.1207059
X            2.0299096

We can validate the above by fitting a linear model with lm() and checking the coefficients \(\boldsymbol{\beta} = [\beta_0, \beta_1]\), which are equal to our manually-computed coefficients.

Code

cont_mod <- lm(y ~ X)
coef(cont_mod)

(Intercept)           X 
 -0.1207059   2.0299096

4.4.1 Centering continuous predictors

One way to make models like this more interpretable is to center continuous variables around their means. Doing so ensures that the centered version of continuous variable is equal to zero when the original version of the variable is at its mean. This can give a better interpretation to some models e.g., if the continuous variable of interest was patient age, then the intercept would be the expected value of the response for a patient of mean age. Since centering doesn’t change the units of the continuous variable, only the intercept \(\beta_0\) will change i.e., the coefficient for our predictor of interest will stay the same. We can validate this by creating a centered version of \(\mathbf{X}\) and re-running the regression. The scale() function centers (subtracts the mean) and standardizes (divides by the standard deviation) by default, so we need to set scale = FALSE in order to only center the data.

Code

X_cent <- scale(X, scale = FALSE)
cont_mod_centered <- lm(y ~ X_cent)
coef(cont_mod_centered)

(Intercept)      X_cent 
    8.74188     2.02991

4.4.2 Standardizing continuous predictors

Standardizing (or scaling, as R calls it) can also occasionally be useful. Dividing by the standard deviation in addition to centering results in our continuous random variable having mean 0 and standard deviation 1. This does change the units of the variable though, which is important to remember. It does not, however, change the interpretation of the intercept - which remains unchanged from the model we fit above with only centering. The coefficient \(\beta_1\) differs though, and now represents the change in \(\mathbf{y}\) given a one standard deviation increase in \(\mathbf{X}\). For more on standardization see this StatLect post and this blog post from Columbia’s Dr. Andrew Gelman.

Code

X_scaled <- scale(X)
cont_mod_scaled <- lm(y ~ X_scaled)
coef(cont_mod_scaled)

(Intercept)    X_scaled 
   8.741880    5.010312

Lastly, with respect to standardization at least, it’s important to note that if we standardize the response variable \(\mathbf{y}\) in addition to standardizing the predictor matrix \(\mathbf{X}\), the intercept will disappear i.e., it will become zero. This is because after standardization, the means of both the predictor and response variables are equal to zero. Since the intercept is the mean of the response when the predictor is zero, the intercept is also zero. Note that because of how integers work in computer memory the value of the intercept shown below isn’t quite zero, but it is very close. For another example of this, see this Stackoverflow post.

Code

y_scaled <- scale(y)
cont_mod_resp_scaled <- lm(y_scaled ~ X_scaled)
coef(cont_mod_resp_scaled)

  (Intercept)      X_scaled 
-1.233443e-16  9.276516e-01

4.5 Models with categorical & continuous predictors

Finally, let’s put it all together. In most real-life modeling scenarios you’ll have a mix of categorical and continuous predictors, and thus care must be taken when preparing the data. You generally will want your intercept to be meaningful - whatever that means for the data you’re analyzing. In this case, we’ll simulate data under the following scenario: our response variable \(\mathbf{y}\) is the expression of some gene of interest in each of \(n = 500\) patients, and our predictor matrix \(\mathbf{X}\) is composed of one categorical variable representing a treatment, a continuous variable representing patient age in years, and another categorical variable representing the facility at which each patient was treated.

Code

X_df <- data.frame(age = rpois(500, lambda = 30), 
                   treatment = sample(c("A", "B"), 500, replace = TRUE), 
                   facility = sample(c("Hosp1", "Hosp2"), 500, replace = TRUE)) %>% 
        mutate(y = 2 * age + rpois(500, lambda = 10), 
               y = case_when(treatment == "A" ~ 0.7 * y - rpois(1, lambda = 20), 
                             TRUE ~ y), 
               y = case_when(facility == "Hosp2" ~ y + rpois(1, lambda = 10), 
                             TRUE ~ y))

The above code to might be a bit confusing - we simulate age as a Poisson random variable with a mean of 30 years, and randomly assign one of two treatments and one of two facilities to each patient. Our response variable \(\mathbf{y}\) is a function of all three predictors. We start by multiplying age by two and then adding Poisson-distributed random noise. We change the slope and subtract Poisson noise for treatment group A, and add Poisson-distributed noise for facility group Hosp2. Visualizing the data should help make sense of this process:

Code

ggplot(X_df, aes(x = age, y = y, color = treatment)) + 
  facet_wrap(~facility) + 
  geom_point(alpha = 0.8) + 
  geom_smooth(mapping = aes(group = treatment), 
              color = "black",
              method = "lm", 
              show.legend = FALSE) + 
  labs(x = "Age (Years)", 
       y = "Gene Expression", 
       color = "Treatment") + 
  theme_classic()

We can see that the lowest value of age in our dataset is 15, and thus it doesn’t really make sense to have our intercept correspond to age being equal to 0. Instead, we should center age, which will ensure that our intercept represents the expected value of \(\mathbf{y}\) for a patient of mean age that was given treatment A at facility Hosp1. The reference groups for each categorical variable are known since R sorts categorical variables alphabetically by default (though this can be overridden through functions like relevel()).

Code

X_df <- mutate(X_df, 
               age_centered = scale(age, scale = FALSE))

Now that we have our centered variable, we can set up our design matrix. We use the formula syntax to specify which predictors we’re interested in.

Code

X_design_mat <- model.matrix(~age_centered + treatment + facility, data = X_df)
head(X_design_mat)

  (Intercept) age_centered treatmentB facilityHosp2
1           1        -7.25          0             0
2           1         1.75          1             0
3           1        -7.25          1             1
4           1        -0.25          0             1
5           1         3.75          1             0
6           1         0.75          1             1

We can now solve the ordinary linear regression problem by hand to obtain the coefficient vector \(\boldsymbol{\beta} = [\beta_0, \beta_1, \beta_2, \beta_3]\).

Code

full_mod_beta <- solve(t(X_design_mat) %*% X_design_mat) %*% t(X_design_mat) %*% X_df$y
full_mod_beta

                   [,1]
(Intercept)   33.434628
age_centered   1.706063
treatmentB    37.137926
facilityHosp2 11.861502

We can verify the result once again using lm(). The interpretation for \(\beta_0\) is the expected response for a patient of mean age who is taking treatment A at facility Hosp1.

Code

full_mod <- lm(y ~ age_centered + treatment + facility, X_df)
coef(full_mod)

  (Intercept)  age_centered    treatmentB facilityHosp2 
    33.434628      1.706063     37.137926     11.861502

5 Generalized linear model

5.1 Setup

We’ll next move to the more complicated case of the generalized linear model (GLM). We’ll start by defining the basic form of a GLM; the main difference from an ordinary linear model is that the model’s response variable is a transformation of the actual response variable. This transformation is taken via what we call a link function. There are some detail here I’m skipping over, but in practice the link function is usually the natural log, and can be others such as the logit (for logistic regression). The link function is usually denoted \(g(\cdot)\), which gives us the following general form of a GLM with \(p\) covariates:

\[ g(\mathbb{E}[\mathbf{y}]) = \beta_0 + \beta_1 \mathbf{X}_1 + \dots + \beta_p\mathbf{X}_p \tag{16}\]

Like ordinary linear models, GLMs are linear in their covariates (as shown above), which is what gives them their relatively easy interpretations. However, unlike with ordinary linear regression, there is no simple closed-form solution to the above equation. Instead, a solution is estimated using something like iteratively reweighted least squares or Newton’s method. As such, we won’t be able to provide exact calculations of the GLM solutions like we did above with the ordinary linear models, so we’ll stick to first providing the interpretation and then checking to make sure the fitted model matches that interpretation.

5.2 The intercept-only model

In the intercept-only model, the GLM formula becomes:

\[ \begin{aligned} g(\mathbb{E}[\mathbf{y}]) &= \beta_0 \\ \mathbb{E}[\mathbf{y}] &= g^{-1}(\beta_0) \\ \end{aligned} \tag{17}\]

If we’re using \(g(\cdot) = \text{log}(\cdot)\) (the log-link), then the inverse of \(g(\cdot)\) is \(g^{-1}(\cdot) = \text{exp}(\cdot)\). In this case, it’s easy to see that the intercept \(\beta_0\) is actually the natural log of the mean of the response variable \(\mathbf{y}\).

We can verify this in R. In this example we’ll use a Poisson GLM with a log link function, as it’s probably the simplest to understand. We’ll start by simulating a Poisson-distributed response \(\mathbf{y} \sim \text{Poisson}(5)\) with \(n = 10\).

Code

y <- rpois(10, lambda = 5)
y <- matrix(y, ncol = 1)
y

      [,1]
 [1,]    4
 [2,]   10
 [3,]    4
 [4,]    6
 [5,]    8
 [6,]    4
 [7,]    3
 [8,]    3
 [9,]    5
[10,]    6

The mean of \(\mathbf{y}\) is 5.3, and the log of that quantity is:

Code

log(mean(y))

[1] 1.667707

We can fit a Poisson GLM like so. We can clearly see that \(\beta_0 = \text{log}(\bar{\mathbf{y}})\).

Code

null_mod_pois <- glm(y ~ 1, family = poisson(link = "log"))
coef(null_mod_pois)

(Intercept) 
   1.667707

Thus, when we use the inverse of the link function (exponentiation) on the estimated value of \(\beta_0\), we get the value of \(\mathbb{E}[\mathbf{y}] = 5.3\).

Code

exp(coef(null_mod_pois))

(Intercept) 
        5.3

5.3 Models with categorical predictors

The extension to categorical predictors is much the same as we saw before with ordinary linear models, with the primary change being that we’re now working on the log scale. Let’s add a categorical predictor to our model. First we simulate a categorical \(\mathbf{X}\):

Code

X <- sample(c("A", "B"), size = 10, replace = TRUE)
X <- matrix(X, ncol = 1)
X

      [,1]
 [1,] "A" 
 [2,] "B" 
 [3,] "A" 
 [4,] "A" 
 [5,] "B" 
 [6,] "A" 
 [7,] "A" 
 [8,] "B" 
 [9,] "B" 
[10,] "B"

Again using dplyr, we can find \(\bar{\mathbf{y}}\) and \(\text{log}(\bar{\mathbf{y}})\) for each group in our categorical variable:

Code

data.frame(y = y, 
           X = X[, 1]) %>% 
  with_groups(X, 
              summarise, 
              mu = mean(y), 
              log_mu = log(mean(y)))

X	mu	log_mu
A	4.2	1.435085
B	6.4	1.856298

We fit a Poisson GLM using the glm() function, again using the natural log as our link function. We see that the intercept is equal to the log of the mean of our response variable for group A.

Code

cat_mod_pois <- glm(y ~ X, family = poisson(link = "log"))
coef(cat_mod_pois)

(Intercept)          XB 
  1.4350845   0.4212135

Next, if we sum the coefficients, we get the log of the response variable for group B:

Code

sum(coef(cat_mod_pois))

[1] 1.856298

If we exponentiate the coefficients the intercept becomes simply the mean of \(\mathbf{y}\) for group A.

Code

exp(coef(cat_mod_pois))

(Intercept)          XB 
    4.20000     1.52381

Lastly, if we exponentiate the sum of the coefficients (the order of these operations is important), we get the mean of \(\mathbf{y}\) for group B:

Code

exp(sum(coef(cat_mod_pois)))

[1] 6.4

5.4 Models with continuous predictors

As with ordinary linear models, in the presence of a continuous predictor the intercept becomes the expected value of the response variable conditional on the line of best fit that we generate. First we generate another \(\mathbf{y}\) with a larger sample size and a slightly lower value of \(\lambda\). Then we generate Gamma random noise, and define the predictor \(\mathbf{X} = 3.1y + \epsilon\).

Code

y <- rpois(300, lambda = 3)
y <- matrix(y, ncol = 1)
X <- 3.1 * y + rgamma(300, shape = 20, rate = 4)
X <- matrix(X, ncol = 1)

We can plot \(\mathbf{X}\) and \(\mathbf{y}\) to get an idea of what their relationship is. Since \(\mathbf{X}\) is a Gamma random variable, it doesn’t actually have any zero values; the minimum value we’ve generated is actually 3.8055922. As such, the intercept will be an extrapolation of the data that we do have.

Code

data.frame(y = y, 
           X = X[, 1]) %>% 
  ggplot(aes(x = X, y = y)) + 
  geom_point(alpha = 0.8) + 
  labs(x = "X", y = "Y") + 
  theme_classic(base_size = 14)

We fit another Poisson GLM and check the coefficients. The intercept in this case is interpreted as the log of the expected value of the response when our predictor variable is equal to zero. However - since \(\mathbf{X}\) is essentially a Gamma random variable, it can actually never take a value of zero, as its support is \((0, \infty)\). As such, the interpretation of \(\beta_0\) is somewhat useless here, and thus it makes sense to center \(\mathbf{X}\) as we did earlier with the ordinary linear model.

Code

cont_mod_pois <- glm(y ~ X, family = poisson(link = "log"))
coef(cont_mod_pois)

(Intercept)           X 
-0.37029176  0.09311951

After centering and refitting the Poisson GLM, the intercept takes on a new interpretation - the log of \(\mathbf{y}\) when \(\mathbf{X}\) is at its mean. Note that the coefficient for \(\mathbf{X}\) does not change, as centering does not change the units of \(\mathbf{X}\).

Code

X_cent <- scale(X, scale = FALSE)
cont_mod_pois_centered <- glm(y ~ X_cent, family = poisson(link = "log"))
coef(cont_mod_pois_centered)

(Intercept)      X_cent 
 0.95596361  0.09311951

Let’s plot both the raw data and the fitted values obtained from our model. The blue horizontal line shows \(\text{exp}(\beta_0)\) i.e., the expected value of \(\mathbf{y}\) when \(\mathbf{X}\) is at its mean value (again, conditional on the line of best fit that we obtained). The vertical yellow line shows where the mean of \(\mathbf{X}\) is; since we centered \(\mathbf{X}\) this is equal to zero. Lastly, the green line shows the predicted values from our model, and this line intersects nicely with the value of the intercept as we would expect.

Code

intercept_exp <- exp(coef(cont_mod_pois_centered)[1])
data.frame(y = y, 
           y_pred = fitted(cont_mod_pois_centered), 
           X = X_cent[, 1]) %>% 
  ggplot(aes(x = X, y = y)) + 
  geom_point(alpha = 0.8) + 
  geom_hline(yintercept = intercept_exp, color = "dodgerblue", size = 1) + 
  geom_vline(xintercept = 0, color = "goldenrod", size = 1) + 
  geom_line(aes(y = y_pred), color = "forestgreen", size = 1) + 
  labs(x = "X (centered)", y = "Y") + 
  theme_classic()

5.5 Models with categorical & continuous predictors

Lastly, we’ll again examine the most common real-life situation - models with both categorical and continuous predictors. Consider a dataset where the response is a Poisson-distributed random variable, say the number of copies of viral RNA in a sample taken from a patient. First, let’s say that our sample size is set to a generous \(n = 1000\). Next, imagine that we have two continuous predictors; the first being patient age in years, and the second being the number of months that the patient has been in treatment. We’ll simulate age as a Poisson random variable with a mean of 25 years, and months in treatment as a negative binomial random variable with a mean of 3 months and overdispersion parameter (denoted size in the rnbinom() function) of 4. Lastly, let’s define a categorical predictor with 3 possible treatment categories. We store all 3 predictors in a data.frame.

Code

X_age <- rpois(1000, 25)
X_months <- rnbinom(1000, size = 4, mu = 3)
X_treat <- sample(c("Drug_A", "Drug_B", "Drug_C"), 
                  size = 1000, 
                  replace = TRUE)
mod_df <- data.frame(age = X_age, 
                     months = X_months, 
                     treat = X_treat)
slice_head(mod_df, n = 5)

age	months	treat
17	2	Drug_B
24	0	Drug_C
24	2	Drug_A
18	1	Drug_C
36	4	Drug_C

Finally, we’ll define \(\mathbf{y}\) to be a function of all 3 predictor variables along with a large amount of Poisson-distributed random noise following the distribution \(\epsilon \sim \text{Poisson}(50)\). We use the handy dplyr::case_when() function to create \(\mathbf{y}\) as a piecewise function whose relationship to the predictor variable changes based on the treatment each patient was given. After generating \(\mathbf{y}\), we round it to ensure that it is integer-valued, since we’re focused here on using Poisson GLMs.

Code

epsilon <- rpois(1000, 50)
mod_df <- mod_df %>% 
          mutate(y = case_when(treat == "Drug_A" ~ 2.25 * age - 1.2 * months, 
                               treat == "Drug_B" ~ 2 * age - 3 * months, 
                               treat == "Drug_C" ~ 1.75 * age - 5 * months), 
                 y = round(y + epsilon))
slice_head(mod_df, n = 5)

age	months	treat	y
17	2	Drug_B	79
24	0	Drug_C	86
24	2	Drug_A	104
18	1	Drug_C	84
36	4	Drug_C	86

We know that for our two continuous predictors, the intercept will indicate the predicted value of \(\mathbf{y}\) for patients with values of zero for those two variables. Since we also have a categorical variable, the intercept will refer to the reference group of that variable - in this case patients who were assigned Drug_A. Lastly, since we’re using a Poisson GLM, we know that the intercept will be the natural log of that quantity. Our model will be of the following form:

\[ \text{log}(\mathbb{E}[\text{Viral RNA}]) = \beta_0 + \beta_{\text{age}} X_{\text{age}} + \beta_{\text{months}} X_{\text{months}} + \beta_{\text{treat}} X_{\text{treat}} + \epsilon \tag{18}\]

Let’s visualize the data so that we can get some idea of what the relationships between our variables are, and what the intercept of the model we’re going to fit will tell us. First let’s look at the relationship between age and the response, splitting by treatment group. We add a purple vertical line showing where the overall mean of age is.

Code

ggplot(mod_df, aes(x = age, y = y, color = treat)) + 
  geom_vline(aes(xintercept = mean(age)), color = "purple", size = 1) + 
  geom_point(alpha = 0.8) + 
  labs(x = "Age", y = "Viral RNA", color = "Treatment") + 
  theme_classic(base_size = 14)

We repeat the visualization for months.

Code

ggplot(mod_df, aes(x = months, y = y, color = treat)) + 
  geom_vline(aes(xintercept = mean(months)), color = "purple", size = 1) + 
  geom_point(alpha = 0.8) + 
  labs(x = "Months Treated", y = "Viral RNA", color = "Treatment") + 
  theme_classic(base_size = 14)

Lastly, let’s simply compare the distribution of the response between the three treatment groups using a beeswarm plot, which is a variation on the classic violin plot that I’ve preferred recently. Check out the ggbeeswarm package docs for more. We also add a horizontal line showing \(\bar{\mathbf{y}}\) for each group.

Code

ggplot(mod_df, aes(x = treat, y = y, color = treat)) + 
  ggbeeswarm::geom_quasirandom() + 
  stat_summary(fun = "mean",
               geom = "crossbar", 
               width = 0.5,
               size = 0.75, 
               color = "black") + 
  labs(x = "Treatment", y = "Viral RNA", color = "Treatment") + 
  theme_classic(base_size = 14)

Let’s fit a model. As previously noted, we use the log link function. Since our treatment variable has three categories, we now have two coefficients for each of the non-reference group levels. The intercept \(\beta_0\) gives us the log of the expected value of the response for a patient that is 0 years old, has spent 0 months in treatment, and is assigned to be treated with Drug_A. We can tell right away that this isn’t the most useful interpretation, as there are no patients in our study who are zero years old (and even if there were, it probably wouldn’t be that useful to know that quantity).

Code

full_mod_pois <- glm(y ~ age + months + treat, 
                     data = mod_df, 
                     family = poisson(link = "log"))
coef(full_mod_pois)

(Intercept)         age      months treatDrug_B treatDrug_C 
 4.18370839  0.02170154 -0.03522072 -0.12013892 -0.25924502

Since we’re using the log-link, exponentiating the intercept gives us the expected value of \(\mathbf{y}\) under the conditions we described just above.

Code

exp(coef(full_mod_pois))[1]

(Intercept) 
   65.60871

Clearly, the age variable is a prime target for centering. While it could be useful to center months (depending on the goals of the study), there are a total of 113 patients in the study who have spent zero months in treatment, so it’s at least somewhat anchored in reality. We center age, then refit the model. We see that the estimate of \(\beta_0\) changes, while none of the coefficients for our predictors do (as expected).

Code

mod_df <- mutate(mod_df, 
                 age_cent = scale(age, scale = FALSE))
full_mod_pois_centered <- glm(y ~ age_cent + months + treat, 
                              data = mod_df, 
                              family = poisson(link = "log"))
coef(full_mod_pois_centered)

(Intercept)    age_cent      months treatDrug_B treatDrug_C 
 4.72589977  0.02170154 -0.03522072 -0.12013892 -0.25924502

Exponentiating \(\beta_0\) now gives us the expected value of \(\mathbf{y}\) for a patient of average age (which is 24.984 years) who was assigned treatment Drug_A and has spent zero months being treated. This quantity is useful because it tells us the baseline pre-treatment value for the average patient who takes Drug_A.

Code

exp(coef(full_mod_pois_centered))[1]

(Intercept) 
    112.832

6 Conclusions

While regression intercepts can seem tricky at first and don’t always carry a useful interpretation by default, I think they’re more useful than introductory & intermediate statistics courses give them credit for. I understand those course’s emphasis on interpreting estimates for covariates of interest of course, as those are what generally determine the success or failure of our scientific hypotheses, but the intercept can often help establish a useful baseline value against which you can compare your other estimates.

There’s a couple important takeaways I’d like to re-emphasize. First - always make sure you know the reference group of your categorical variables, and set it to something more useful if you need to! The alphabetical order used by R is often not quite what you want. Second - keep centering (or even pseudo-centering around another useful value) in mind & use it often. There’s many situations outside of just patient age where a value of 0 makes little sense in the context of your study. Lastly, when using GLMs, make sure you know your link function, its inverse, and how to interpret your coefficients on each scale.

Finally, I’ve included some references below that have been helpful to me, and which I would recommend as further reading if you’re interested in the technical details of regression & other modeling methods.

7 References

Faraway, J. Extending the Linear Model with R, 2nd Edition. Chapman and Hall. 2016.
Hastie, T. et al. The Elements of Statistical Learning, 2nd Edition. Springer. 2009.
Wakefield, J. Bayesian and Frequentist Regression Methods. Springer. 2016.
Klein, P. N. Coding the Matrix: Linear Algebra Through Computer Science Applications, 1st Edition. Newtonian Press. 2013.
Strang, G. Introduction to Linear Algebra, 5th Edition. Wellesley-Cambridge Press. 2016.

8 Session info

Code

sessioninfo::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31)
 os       macOS Sonoma 14.3
 system   x86_64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2024-01-27
 pandoc   3.1.9 @ /usr/local/bin/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 beeswarm      0.4.0   2021-06-01 [1] CRAN (R 4.3.0)
 cli           3.6.2   2023-12-11 [1] CRAN (R 4.3.0)
 codetools     0.2-19  2023-02-01 [1] CRAN (R 4.3.2)
 colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
 digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.0)
 dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.3.0)
 evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.0)
 fansi         1.0.6   2023-12-08 [1] CRAN (R 4.3.0)
 farver        2.1.1   2022-07-06 [1] CRAN (R 4.3.0)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
 ggbeeswarm    0.7.2   2023-04-29 [1] CRAN (R 4.3.0)
 ggplot2     * 3.4.4   2023-10-12 [1] CRAN (R 4.3.0)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
 gtable        0.3.4   2023-08-21 [1] CRAN (R 4.3.0)
 htmltools     0.5.7   2023-11-03 [1] CRAN (R 4.3.0)
 htmlwidgets   1.6.4   2023-12-06 [1] CRAN (R 4.3.0)
 jsonlite      1.8.8   2023-12-04 [1] CRAN (R 4.3.0)
 knitr         1.45    2023-10-30 [1] CRAN (R 4.3.0)
 labeling      0.4.3   2023-08-29 [1] CRAN (R 4.3.0)
 lattice       0.22-5  2023-10-24 [1] CRAN (R 4.3.0)
 lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.3.0)
 magrittr    * 2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
 MASS          7.3-60  2023-05-04 [1] CRAN (R 4.3.0)
 Matrix        1.6-4   2023-11-30 [1] CRAN (R 4.3.0)
 mgcv          1.9-1   2023-12-21 [1] CRAN (R 4.3.0)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 4.3.0)
 nlme          3.1-164 2023-11-27 [1] CRAN (R 4.3.0)
 pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
 rlang         1.1.2   2023-11-04 [1] CRAN (R 4.3.0)
 rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.3.0)
 scales        1.3.0   2023-11-28 [1] CRAN (R 4.3.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 tibble        3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
 tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
 utf8          1.2.4   2023-10-22 [1] CRAN (R 4.3.0)
 vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.3.0)
 vipor         0.4.7   2023-12-18 [1] CRAN (R 4.3.0)
 withr         2.5.2   2023-10-30 [1] CRAN (R 4.3.0)
 xfun          0.41    2023-11-01 [1] CRAN (R 4.3.0)
 yaml          2.3.8   2023-12-11 [1] CRAN (R 4.3.0)

 [1] /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library

──────────────────────────────────────────────────────────────────────────────