April 16, 2024

```r
library(dplyr)   # data manipulation
library(ggplot2) # plots
```
One concept I struggled with a lot in early statistics courses was what the intercept meant in linear regression models. I tended to just ignore it unless questions specifically pertained to it, and the vast majority of homework questions focused on interpreting the effects of covariates instead. I also saw many of my master’s-level students struggle with the topic in the SAS computing course I taught during Fall 2022, with confusion about the effect of centering, the difference between centering and standardizing, and intercept interpretation in the different types of (generalized) linear (mixed) models being common pain points on homework. As such, I thought it might be useful - for myself and others - to jot down some notes on how the intercept is estimated and what it means under a variety of regression modeling frameworks.
We’re going to start from first principles here with a quick review of matrix algebra. Linear regression is, after all, just multiplying matrices in a clever way.
First define two matrices \(\mathbf{A}\) and \(\mathbf{B}\), each with 2 rows and 2 columns:
\[ \begin{aligned} \mathbf{A} &= \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ \end{bmatrix} \\ \mathbf{B} &= \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ \end{bmatrix} \\ \end{aligned} \tag{1}\]
Their product, another matrix \(C\), also has 2 rows and 2 columns, and its elements are defined like so, with \(i\) specifying the row and \(j\) the column of each element. What we’re doing is finding the dot product of the \(i^{\text{th}}\) row of \(\mathbf{A}\) and the \(j^{\text{th}}\) column of \(\mathbf{B}\), the expanded definition of which is below.
\[ \begin{aligned} c_{ij} &= \mathbf{A}_{i*} \cdot \mathbf{B}_{*j} \\ c_{ij} &= \sum_{k=1}^n a_{ik}b_{kj} \\ c_{ij} &= a_{i1}b_{1j} + \dots + a_{in}b_{nj} \\ \end{aligned} \tag{2}\]
As such, we can define the product of \(\mathbf{A}\) and \(\mathbf{B}\) like so:
\[ \begin{aligned} \mathbf{C} &= \mathbf{A} \mathbf{B} \\ \mathbf{C} &= \begin{bmatrix} \mathbf{A}_{1*} \cdot \mathbf{B}_{*1} & \mathbf{A}_{1*} \cdot \mathbf{B}_{*2} \\ \mathbf{A}_{2*} \cdot \mathbf{B}_{*1} & \mathbf{A}_{2*} \cdot \mathbf{B}_{*2} \\ \end{bmatrix} \\ \mathbf{C} &= \begin{bmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \\ \end{bmatrix} \\ \end{aligned} \tag{3}\]
Important Note: To multiply two matrices \(\mathbf{A}\) and \(\mathbf{B}\) together, the number of rows of \(\mathbf{B}\) must be equal to the number of columns in \(\mathbf{A}\). To generalize:
\[ \mathbf{A}_{m \times n} \cdot \mathbf{B}_{n \times p} = \mathbf{C}_{m \times p} \tag{4}\]
Let’s define two matrices:
\[ \begin{aligned} \mathbf{A} &= \begin{bmatrix} 3 & 2 \\ 0 & 7 \\ \end{bmatrix} \\ \mathbf{B} &= \begin{bmatrix} 1 & 4 \\ 1 & 2 \\ \end{bmatrix} \\ \end{aligned} \tag{5}\]
Their product \(\mathbf{C}\) is defined as:
\[ \begin{aligned} \mathbf{C} &= \begin{bmatrix} 3 \times 1 + 2 \times 1 & 3 \times 4 + 2 \times 2 \\ 0 \times 1 + 7 \times 1 & 0 \times 4 + 7 \times 2 \\ \end{bmatrix} \\ \mathbf{C} &= \begin{bmatrix} 5 & 16 \\ 7 & 14 \\ \end{bmatrix} \\ \end{aligned} \tag{6}\]
We can check this using R:
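```r
A_mat <- matrix(c(3, 2, 0, 7),
                nrow = 2,
                ncol = 2,
                byrow = TRUE)
B_mat <- matrix(c(1, 4, 1, 2),
                nrow = 2,
                ncol = 2,
                byrow = TRUE)
C_mat <- A_mat %*% B_mat  # %*% performs matrix multiplication
C_mat
```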
Very simply, the transpose of a matrix can be thought of as flipping its rows & columns. The transpose of an \(m \times n\) matrix is thus \(n \times m\). A matrix that is its own transpose is symmetric. Notation-wise, some write the matrix transpose as \(A^T\), others as \(A^\mathsf{T}\), and still others denote it by \(A^\intercal\), but I personally prefer \(A^\prime\). The matrix transpose is more formally defined as:
\[ \begin{aligned} & \mathbf{A}_{m \times n} \\ & \mathbf{B}_{n \times m} = \mathbf{A}^\prime \\ & \mathbf{B}_{i, j} = \mathbf{A}_{j, i} \\ \end{aligned} \tag{7}\]
If we define the following matrix \(\mathbf{A}_{2 \times 3}\), we would expect its transpose to be the \(3 \times 2\) matrix \(\mathbf{A}^\prime\):
\[ \begin{aligned} \mathbf{A} &= \begin{bmatrix} 0 & -2 & 2 \\ 3 & -10 & 0 \\ \end{bmatrix} \\ \mathbf{A}^\prime &= \begin{bmatrix} 0 & 3 \\ -2 & -10 \\ 2 & 0 \\ \end{bmatrix} \\ \end{aligned} \tag{8}\]
We can confirm this in R using the `t()` function:
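```r
A <- matrix(c(0, -2, 2, 3, -10, 0),
            nrow = 2,
            ncol = 3,
            byrow = TRUE)
t(A)
```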
The last matrix operation we’ll go over is inversion - this one is very important & is applied all throughout statistics & computing in general. The inverse of a matrix is simply another matrix that, when multiplied by the first matrix, returns the identity matrix (an \(n \times n\) matrix with 1s on the diagonal and 0s everywhere else). Matrices must be square to be invertible, but not all square matrices are invertible; the ones that aren’t are called singular. We’ll gloss over that fact though, & simply assume / fervently hope that the matrices we encounter will have an easily-computable (approximate) inverse. See this article for more information on inverse computation. We define the inverse of \(\mathbf{A}_{n \times n}\) as:
\[ \mathbf{A} \mathbf{A}^{-1} = \mathbf{I}_n \tag{9}\]
In R, the `solve()` function is used most frequently to compute the matrix inverse; since the result is an approximation, we round it to show that it is the identity matrix:
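```r
A <- matrix(rnorm(9), nrow = 3, ncol = 3)
A_inv <- solve(A)
round(A %*% A_inv)
```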
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
There are many other ways to compute a matrix inverse, as well as several different types of pseudoinverses. We can compute the Moore-Penrose pseudoinverse using the `MASS` package:
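```r
A_mp_inv <- MASS::ginv(A)
round(A %*% A_mp_inv)
```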
As mentioned previously, the identity matrix \(\mathbf{I}_{n}\) is a square matrix composed entirely of zeroes except along the diagonal, which is composed of ones. This matrix carries some unique properties (which are listed below) that will be helpful to us later on.
\[ \begin{aligned} \mathbf{I}_{n} &= \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \\ \end{bmatrix} \\ \mathbf{I}_{n}^\prime &= \mathbf{I}_{n} \\ \mathbf{I}_{n}^{-1} &= \mathbf{I}_{n} \\ \end{aligned} \tag{10}\]
We can set up a \(3 \times 3\) identity matrix \(\mathbf{I}_{3}\) in R using the `diag()` function:
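```r
ident_mat <- diag(nrow = 3)
ident_mat
```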
The transpose is also equal to \(\mathbf{I}_{3}\):
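```r
t(ident_mat)
```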
As is the inverse:
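```r
solve(ident_mat)
```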
For now, we’ll take it for granted that the solution to a linear regression problem is defined as follows:
\[ \widehat{\boldsymbol{\beta}} = \left(\mathbf{X}^\prime \mathbf{X} \right)^{-1} \mathbf{X}^\prime \mathbf{y} \tag{11}\]
The intercept-only model (also sometimes called the null model) is defined as linear regression when \(\mathbf{X}\) is simply a column vector of ones:
\[ \mathbf{X} = \begin{bmatrix} 1 \\ \vdots \\ 1 \\ \end{bmatrix} \tag{12}\]
We know the intercept-only model produces the mean as the one predicted value, as the mean minimizes the sum of squared errors in the absence of any other covariates. We can check this using R - we’ll first generate a vector \(\mathbf{y}\) consisting of 5 realizations of a random variable, such that \(Y \sim \mathcal{N}(0, 3)\).
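```r
y <- rnorm(5, mean = 0, sd = 3)
y <- matrix(y, ncol = 1)
y
```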
[,1]
[1,] -5.415350
[2,] 4.356184
[3,] -2.261118
[4,] 3.439089
[5,] -2.163296
The mean of \(\mathbf{y}\) is:
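```r
mean(y)
```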
We can use R to fit an intercept-only model. We can see that the intercept coefficient \(\beta_0\) is equal to the mean of \(\mathbf{y}\).
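```r
null_mod <- lm(y ~ 1)
coef(null_mod)
```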
Let’s use linear algebra to figure out why this is true. Once again, we know that the linear regression closed-form solution is given by the following:
\[ \widehat{\boldsymbol{\beta}} = \left(\mathbf{X}^\prime \mathbf{X} \right)^{-1} \mathbf{X}^\prime \mathbf{y} \tag{13}\]
Let’s first define \(\mathbf{X}\) - just a column vector of 1s with \(n = 5\) rows:
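```r
X <- c(1, 1, 1, 1, 1)
X <- matrix(X, ncol = 1)
X
```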
The value of \(\mathbf{X}^\prime \mathbf{X}\) is given by the following - note that this is equal to our sample size \(n = 5\). We knew that this quantity would be a scalar (a \(1 \times 1\) matrix) since \(\mathbf{X}^\prime\) has 1 row and 5 columns, and \(\mathbf{X}\) has 5 rows and 1 column, thus by the rule we defined above their product has 1 row and 1 column.
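```r
t(X) %*% X
```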
The inverse of which, \(\left(\mathbf{X}^\prime \mathbf{X} \right)^{-1}\), is of course \(n^{-1}\):
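```r
solve(t(X) %*% X)
```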
We multiply the above by \(\mathbf{X}^\prime\) again to obtain \(\left(\mathbf{X}^\prime \mathbf{X} \right)^{-1} \mathbf{X}^\prime\), which gives us a constant vector of length \(n\) with all values being equal to \(n^{-1}\):
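```r
solve(t(X) %*% X) %*% t(X)
```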
Lastly, we multiply the above by \(\mathbf{y}\). Remember how multiplying vectors works - in this case we are multiplying each element of the above vector \(\left(\mathbf{X}^\prime \mathbf{X} \right)^{-1} \mathbf{X}^\prime\) with the corresponding element of \(\mathbf{y}\) and adding the products together. We’ll define \(\mathbf{Z} = \left(\mathbf{X}^\prime \mathbf{X} \right)^{-1} \mathbf{X}^\prime\) for convenience of notation:
\[ \mathbf{Z} \mathbf{y} = \sum_{i=1}^n \mathbf{Z}_i \mathbf{y}_i \tag{14}\]
Since each element of \(\mathbf{Z}\) is the same constant, \(n^{-1}\), we can factor it out of the sum, and the above quantity is equivalent to:
\[ \begin{aligned} \mathbf{Z} \mathbf{y} &= \left(\mathbf{X}^\prime \mathbf{X} \right)^{-1} \mathbf{X}^\prime \mathbf{y} \\ \mathbf{Z} \mathbf{y} &= \sum_{i=1}^n \mathbf{Z}_i \mathbf{y}_i \\ \mathbf{Z} \mathbf{y} &= n^{-1} \sum_{i=1}^n \mathbf{y}_i \\ \end{aligned} \tag{15}\]
This is simply the sum of all the elements of \(\mathbf{y}\) divided by \(n\) - the mean! We can verify this with R by using linear algebra to compute the OLS solution:
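```r
solve(t(X) %*% X) %*% t(X) %*% y
```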
This is equal to simply taking the mean of \(\mathbf{y}\):
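```r
mean(y)
```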
In practice, of course, we usually build models with predictors of interest beyond the intercept. Categorical variables are composed of discrete values, each with a different meaning, e.g., we could have a variable containing the type of treatment a patient has received. In order to fit models with categorical variables, it’s necessary to expand a categorical variable into multiple indicator variables - variables composed of 1s and 0s depending on whether a certain observation belongs to a certain category. This is a little confusing, so let’s show an example. We’ll start by creating a categorical variable \(\mathbf{X}\).
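```r
X <- sample(c("A", "B"), size = 10, replace = TRUE)
X <- matrix(X, ncol = 1)
X
```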
[,1]
[1,] "B"
[2,] "B"
[3,] "B"
[4,] "A"
[5,] "B"
[6,] "B"
[7,] "A"
[8,] "B"
[9,] "B"
[10,] "A"
To convert \(\mathbf{X}\) into a numeric variable that we can use in a model, we use the `model.matrix()` function. To use this function though, we need to define the model we’re interested in using R’s formula syntax. The output we see shows an intercept column, which we understand, and another column composed of 1s and 0s called XB. This column is an indicator variable that tells us whether each observation belongs to category B or not - thus when XB is equal to 0, we know that the observation belongs to category A. This process of converting non-numeric categorical data to indicator variables has many names (one-hot encoding, dummifying, etc.), and there are many ways of doing it. You can read more about the various ways of doing so here, but for now we’ll take it for granted that this is how it works under the hood in the `lm()` function that we use to fit linear models.
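```r
X_2 <- model.matrix(~X)
X_2
```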
(Intercept) XB
1 1 1
2 1 1
3 1 1
4 1 0
5 1 1
6 1 1
7 1 0
8 1 1
9 1 1
10 1 0
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$X
[1] "contr.treatment"
From what we already know about matrix multiplication, we can see that the B group is going to have a different predicted average than the A group. We’ll also need to generate a new response variable \(\mathbf{y}\), since we’ve increased our sample size to \(n = 10\).
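```r
y <- rnorm(10, mean = 0, sd = 3)
y <- matrix(y, ncol = 1)
y
```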
[,1]
[1,] -5.8375020
[2,] -1.8924219
[3,] 2.9715616
[4,] -1.5104106
[5,] -1.1778186
[6,] 1.3690104
[7,] -0.5117578
[8,] 1.8279983
[9,] 1.9796838
[10,] -4.7857809
The mean of \(\mathbf{y}\) for each treatment group can be computed as follows. We’re going to use a little `dplyr` code to perform the summarization, as I find it a little more readable & replicable than base R. This will necessitate creating a `data.frame` to hold our \(\mathbf{y}\) and \(\mathbf{X}\) variables. Note that we’ve switched back to the categorical representation of \(\mathbf{X}\), as it’s more interpretable than the indicator variable version for summaries such as this.
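```r
data.frame(X = X, y = y)
```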
| X | y          |
|---|------------|
| B | -5.8375020 |
| B | -1.8924219 |
| B |  2.9715616 |
| A | -1.5104106 |
| B | -1.1778186 |
| B |  1.3690104 |
| A | -0.5117578 |
| B |  1.8279983 |
| B |  1.9796838 |
| A | -4.7857809 |
Here’s the mean for each group:
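```r
data.frame(X = X, y = y) %>%
  with_groups(X,
              summarise,
              mu = mean(y))
```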
| X | mu         |
|---|------------|
| A | -2.2693164 |
| B | -0.1084983 |
Let’s use the OLS formula to solve for \(\boldsymbol{\beta} = [\beta_0, \beta_1]\). Note that we’re once again using the design matrix version of \(\mathbf{X}\) with the intercept column and indicator variable for group. We see that the intercept \(\beta_0\) is equal to the mean of the A group - but the coefficient for the B group \(\beta_1\) isn’t! This is because \(\beta_1\) doesn’t have the same interpretation as \(\beta_0\). While \(\beta_0\) is equal to the mean of the reference group (i.e., the first level of the categorical variable, in our case group A), \(\beta_1\) represents the difference between the mean for group B and the mean for group A.
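```r
cat_mod_beta <- solve(t(X_2) %*% X_2) %*% t(X_2) %*% y
cat_mod_beta
```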
[,1]
(Intercept) -2.269316
XB 2.160818
This becomes easier to understand when we sum the coefficients and get the average for group B, which is -0.1084983.
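```r
sum(cat_mod_beta)
```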
This is validated by fitting a linear model with `lm()` and checking the output, which matches our manual solution:
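```r
cat_mod <- lm(y ~ X)
coef(cat_mod)
```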
To summarize: when categorical variables are used in an ordinary linear regression, the intercept represents the mean of the response variable when each of the categorical variables is at its reference level. When running regressions like this, it’s important to make sure that 1) you know what the reference levels are for each of your categorical variables and 2) that those reference levels make sense. Sometimes it doesn’t matter what order the categorical variables are in, but it often does - so check! A final note is that this interpretation holds when only categorical variables are used in your model. When continuous variables are included too, the interpretation changes. More on that in a bit.
Working with categorical variables (or factors, as R labels them) can be confusing. If you want to gain a deeper understanding of how factors work, check out the chapter on them in the R for Data Science book. For factor-related tools, try the `forcats` R package, which is part of the `tidyverse` ecosystem and makes dealing with factors a lot simpler.
Continuous predictors differ from categorical ones in that they do not have a discrete set of possible values. The interpretation of the intercept thus differs. For any regression, the intercept interpretation is the value of the response when all predictors are equal to zero. What “all predictors equal to zero” means depends on the types of predictors you’re using; when predictors are categorical this implies that all predictors are at their reference levels (thanks to the indicator variable encoding that’s done in the background). With continuous variables, being equal to zero might have a reasonable interpretation or it might not, depending on what the variable is. In this case, think of the intercept like you would in middle school when learning \(y = mx + b\). The intercept, \(b\), is the value of \(y\) where \(x = 0\), like the plot below.
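```r
# scatter of simulated data with a vertical line marking x = 0
data.frame(x = rnorm(500, sd = 2)) %>%
  mutate(y = x + rnorm(500, sd = 0.5)) %>%
  ggplot(aes(x = x, y = y)) +
  geom_point(alpha = 0.8) +
  geom_vline(xintercept = 0, color = "forestgreen", linewidth = 1) +
  labs(x = "X", y = "Y") +
  theme_classic(base_size = 14)
```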
In some cases, this value might be meaningful - for example, if your covariate of interest was the expression of a certain gene, then zero expression has a biological meaning. In other situations it makes little sense, such as when the covariate of interest is the age of each patient in your dataset. It’s unlikely that age being equal to zero would really mean much, as newborns aren’t often part of clinical trials. There are ways to remedy this difficulty in interpretation, but we’ll focus first on why the interpretation is the way it is.
We’ll start by generating some data. The predictor variable we’re interested in will be negative-binomially distributed, with \(X \sim \text{NB}(10, 0.7)\). Note that we’re using the parameterization of the negative binomial used by the `rnbinom()` function defaults, with the `size` and `prob` parameters. Our response variable \(\mathbf{y}\) will be a function of \(\mathbf{X}\) with added normally-distributed noise. Since we’ve increased our sample size to \(n = 500\), we’ll only look at the first few rows of each variable.
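```r
X <- rnbinom(500, size = 10, prob = 0.7)
y <- 2 * X + rnorm(500, mean = 0, sd = 2)
X <- matrix(X, ncol = 1)
y <- matrix(y, ncol = 1)
data.frame(X = X[, 1], y = y[, 1]) %>%
  slice_head(n = 5)
```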
| X | y         |
|---|-----------|
| 6 | 12.495948 |
| 7 | 15.185192 |
| 4 |  8.957864 |
| 6 | 10.231077 |
| 3 |  4.752311 |
We can plot the data using `ggplot2` to get an idea of what the relationship between the two variables is.
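```r
data.frame(X = X[, 1], y = y[, 1]) %>%
  ggplot(aes(x = X, y = y)) +
  geom_point() +
  labs(x = "X", y = "y") +
  theme_classic(base_size = 14)
```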
Using `dplyr`, we can find the mean of \(\mathbf{y}\) when \(\mathbf{X} = 0\).
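```r
data.frame(X = X[, 1], y = y) %>%
  filter(X == 0) %>%
  summarise(mu = mean(y))
```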
Let’s fit a linear model manually and see what the coefficients are. We’ll first need to create the design matrix again using `model.matrix()`. This gives us a two-column matrix, with the first column being the intercept (all 1s), and the second column being the negative binomial random variable \(\mathbf{X}\) we just simulated. Unlike models with categorical predictors, the intercept is not simply equal to the expected value when \(\mathbf{X} = 0\). Instead, the intercept is the expected value of the response variable conditional on the line of best fit that has been obtained, i.e., conditional on the rest of the data in \(\mathbf{X}\). See this StackOverflow post for another example.
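```r
X_3 <- model.matrix(~X)
cont_mod_beta <- solve(t(X_3) %*% X_3) %*% t(X_3) %*% y
cont_mod_beta
```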
[,1]
(Intercept) -0.1207059
X 2.0299096
We can validate the above by fitting a linear model with `lm()` and checking the coefficients \(\boldsymbol{\beta} = [\beta_0, \beta_1]\), which are equal to our manually-computed coefficients.
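```r
cont_mod <- lm(y ~ X)
coef(cont_mod)
```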
One way to make models like this more interpretable is to center continuous variables around their means. Doing so ensures that the centered version of the continuous variable is equal to zero when the original version of the variable is at its mean. This can give a better interpretation to some models, e.g., if the continuous variable of interest was patient age, then the intercept would be the expected value of the response for a patient of mean age. Since centering doesn’t change the units of the continuous variable, only the intercept \(\beta_0\) will change, i.e., the coefficient for our predictor of interest will stay the same. We can validate this by creating a centered version of \(\mathbf{X}\) and re-running the regression. The `scale()` function centers (subtracts the mean) and standardizes (divides by the standard deviation) by default, so we need to set `scale = FALSE` in order to only center the data.
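```r
X_cent <- scale(X, scale = FALSE)
cont_mod_centered <- lm(y ~ X_cent)
coef(cont_mod_centered)
```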
Standardizing (or scaling, as R calls it) can also occasionally be useful. Dividing by the standard deviation in addition to centering results in our continuous random variable having mean 0 and standard deviation 1. This does change the units of the variable though, which is important to remember. It does not, however, change the interpretation of the intercept - which remains unchanged from the model we fit above with only centering. The coefficient \(\beta_1\) differs though, and now represents the change in \(\mathbf{y}\) given a one standard deviation increase in \(\mathbf{X}\). For more on standardization see this StatLect post and this blog post from Columbia’s Dr. Andrew Gelman.
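One way to fit that model looks like this (the model object name is arbitrary; `X_scaled` matches the output below):

```r
X_scaled <- scale(X)                  # center and divide by the standard deviation
cont_mod_scaled <- lm(y ~ X_scaled)   # object name is illustrative
coef(cont_mod_scaled)
```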
(Intercept) X_scaled
8.741880 5.010312
Lastly, with respect to standardization at least, it’s important to note that if we standardize the response variable \(\mathbf{y}\) in addition to standardizing the predictor matrix \(\mathbf{X}\), the intercept will disappear, i.e., it will become zero. This is because after standardization, the means of both the predictor and response variables are equal to zero. Since the intercept is the mean of the response when the predictor is zero, the intercept is also zero. Note that because of how floating-point numbers are represented in computer memory the value of the intercept shown below isn’t quite zero, but it is very close. For another example of this, see this StackOverflow post.
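A minimal sketch of standardizing both sides (object names are arbitrary):

```r
y_scaled <- scale(y)                   # standardize the response as well
scaled_mod <- lm(y_scaled ~ X_scaled)  # object name is illustrative
coef(scaled_mod)                       # the intercept is numerically ~0 (on the order of 1e-16)
```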
Finally, let’s put it all together. In most real-life modeling scenarios you’ll have a mix of categorical and continuous predictors, and thus care must be taken when preparing the data. You generally will want your intercept to be meaningful - whatever that means for the data you’re analyzing. In this case, we’ll simulate data under the following scenario: our response variable \(\mathbf{y}\) is the expression of some gene of interest in each of \(n = 500\) patients, and our predictor matrix \(\mathbf{X}\) is composed of one categorical variable representing a treatment, a continuous variable representing patient age in years, and another categorical variable representing the facility at which each patient was treated.
```r
X_df <- data.frame(age = rpois(500, lambda = 30),
                   treatment = sample(c("A", "B"), 500, replace = TRUE),
                   facility = sample(c("Hosp1", "Hosp2"), 500, replace = TRUE)) %>%
        mutate(y = 2 * age + rpois(500, lambda = 10),
               y = case_when(treatment == "A" ~ 0.7 * y - rpois(1, lambda = 20),
                             TRUE ~ y),
               y = case_when(facility == "Hosp2" ~ y + rpois(1, lambda = 10),
                             TRUE ~ y))
```
The above code might be a bit confusing - we simulate age as a Poisson random variable with a mean of 30 years, and randomly assign one of two treatments and one of two facilities to each patient. Our response variable \(\mathbf{y}\) is a function of all three predictors. We start by multiplying age by two and then adding Poisson-distributed random noise. We change the slope and subtract Poisson noise for treatment group A, and add Poisson-distributed noise for facility group Hosp2. Visualizing the data should help make sense of this process:
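One way to visualize it (the original post’s exact plot may differ):

```r
ggplot(X_df, aes(x = age, y = y, color = treatment)) +
  geom_point(alpha = 0.7) +
  facet_wrap(~facility) +
  labs(x = "Age", y = "Y") +
  theme_classic(base_size = 14)
```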
We can see that the lowest value of age in our dataset is 15, and thus it doesn’t really make sense to have our intercept correspond to age being equal to 0. Instead, we should center age, which will ensure that our intercept represents the expected value of \(\mathbf{y}\) for a patient of mean age that was given treatment A at facility Hosp1. The reference groups for each categorical variable are known since R sorts categorical variables alphabetically by default (though this can be overridden through functions like `relevel()`).
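A sketch of the centering step (the column name matches the design matrix below):

```r
X_df <- mutate(X_df, age_centered = age - mean(age))
```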
Now that we have our centered variable, we can set up our design matrix. We use the formula syntax to specify which predictors we’re interested in.
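Something like the following (the object name is arbitrary):

```r
X_mat <- model.matrix(~age_centered + treatment + facility, data = X_df)
head(X_mat)
```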
(Intercept) age_centered treatmentB facilityHosp2
1 1 -7.25 0 0
2 1 1.75 1 0
3 1 -7.25 1 1
4 1 -0.25 0 1
5 1 3.75 1 0
6 1 0.75 1 1
We can now solve the ordinary linear regression problem by hand to obtain the coefficient vector \(\boldsymbol{\beta} = [\beta_0, \beta_1, \beta_2, \beta_3]\).
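A sketch of the computation, reusing the design matrix from above (the object name is arbitrary):

```r
beta_hat <- solve(t(X_mat) %*% X_mat) %*% t(X_mat) %*% X_df$y
beta_hat
```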
[,1]
(Intercept) 33.434628
age_centered 1.706063
treatmentB 37.137926
facilityHosp2 11.861502
We can verify the result once again using `lm()`. The interpretation for \(\beta_0\) is the expected response for a patient of mean age who is taking treatment A at facility Hosp1.
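Something along these lines (the model object name is arbitrary):

```r
full_lm <- lm(y ~ age_centered + treatment + facility, data = X_df)
coef(full_lm)
```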
We’ll next move to the more complicated case of the generalized linear model (GLM). We’ll start by defining its basic form; the main difference from an ordinary linear model is that we model a transformation of the response variable’s expected value rather than the expected value itself. This transformation is performed via what we call a link function. There are some details here I’m skipping over, but in practice the link function is often the natural log, though it can be others such as the logit (for logistic regression). The link function is usually denoted \(g(\cdot)\), which gives us the following general form of a GLM with \(p\) covariates:
\[ g(\mathbb{E}[\mathbf{y}]) = \beta_0 + \beta_1 \mathbf{X}_1 + \dots + \beta_p\mathbf{X}_p \tag{16}\]
Like ordinary linear models, GLMs are linear in their covariates (as shown above), which is what gives them their relatively easy interpretations. However, unlike with ordinary linear regression, there is no simple closed-form solution to the above equation. Instead, a solution is estimated using something like iteratively reweighted least squares or Newton’s method. As such, we won’t be able to provide exact calculations of the GLM solutions like we did above with the ordinary linear models, so we’ll stick to first providing the interpretation and then checking to make sure the fitted model matches that interpretation.
In the intercept-only model, the GLM formula becomes:
\[ \begin{aligned} g(\mathbb{E}[\mathbf{y}]) &= \beta_0 \\ \mathbb{E}[\mathbf{y}] &= g^{-1}(\beta_0) \\ \end{aligned} \tag{17}\]
If we’re using \(g(\cdot) = \text{log}(\cdot)\) (the log-link), then the inverse of \(g(\cdot)\) is \(g^{-1}(\cdot) = \text{exp}(\cdot)\). In this case, it’s easy to see that the intercept \(\beta_0\) is actually the natural log of the mean of the response variable \(\mathbf{y}\).
We can verify this in R. In this example we’ll use a Poisson GLM with a log link function, as it’s probably the simplest to understand. We’ll start by simulating a Poisson-distributed response \(\mathbf{y} \sim \text{Poisson}(5)\) with \(n = 10\).
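```r
y <- rpois(10, lambda = 5)
y <- matrix(y, ncol = 1)
y
```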
[,1]
[1,] 4
[2,] 10
[3,] 4
[4,] 6
[5,] 8
[6,] 4
[7,] 3
[8,] 3
[9,] 5
[10,] 6
The mean of \(\mathbf{y}\) is 5.3, and the log of that quantity is:
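```r
log(mean(y))
```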
We can fit a Poisson GLM like so. We can clearly see that \(\beta_0 = \text{log}(\bar{\mathbf{y}})\).
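A sketch of the fit (the model object name is arbitrary):

```r
null_mod_pois <- glm(y ~ 1, family = poisson(link = "log"))
coef(null_mod_pois)
```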
(Intercept)
1.667707
Thus, when we use the inverse of the link function (exponentiation) on the estimated value of \(\beta_0\), we get the value of \(\mathbb{E}[\mathbf{y}] = 5.3\).
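```r
exp(coef(null_mod_pois)[1])
```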
The extension to categorical predictors is much the same as we saw before with ordinary linear models, with the primary change being that we’re now working on the log scale. Let’s add a categorical predictor to our model. First we simulate a categorical \(\mathbf{X}\):
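```r
X <- sample(c("A", "B"), size = 10, replace = TRUE)
X <- matrix(X, ncol = 1)
X
```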
[,1]
[1,] "A"
[2,] "B"
[3,] "A"
[4,] "A"
[5,] "B"
[6,] "A"
[7,] "A"
[8,] "B"
[9,] "B"
[10,] "B"
Again using `dplyr`, we can find \(\bar{\mathbf{y}}\) and \(\text{log}(\bar{\mathbf{y}})\) for each group in our categorical variable:
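A sketch of the summary, reusing the `with_groups()` pattern from earlier:

```r
data.frame(X = X, y = y) %>%
  with_groups(X,
              summarise,
              mu = mean(y),
              log_mu = log(mu))
```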
| X | mu  | log_mu   |
|---|-----|----------|
| A | 4.2 | 1.435085 |
| B | 6.4 | 1.856298 |
We fit a Poisson GLM using the `glm()` function, again using the natural log as our link function. We see that the intercept is equal to the log of the mean of our response variable for group A.
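A sketch of the fit (the model object name is arbitrary):

```r
cat_mod_pois <- glm(y ~ X, family = poisson(link = "log"))
coef(cat_mod_pois)
```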
(Intercept) XB
1.4350845 0.4212135
Next, if we sum the coefficients, we get the log of the mean of the response variable for group B:
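```r
sum(coef(cat_mod_pois))
```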
If we exponentiate the intercept, it becomes simply the mean of \(\mathbf{y}\) for group A.
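```r
exp(coef(cat_mod_pois)[1])
```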
Lastly, if we exponentiate the sum of the coefficients (the order of these operations is important), we get the mean of \(\mathbf{y}\) for group B:
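```r
exp(sum(coef(cat_mod_pois)))
```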
As with ordinary linear models, in the presence of a continuous predictor the intercept becomes the expected value of the response variable conditional on the line of best fit that we generate. First we generate another \(\mathbf{y}\) with a larger sample size and a slightly lower value of \(\lambda\). Then we generate Gamma random noise, and define the predictor \(\mathbf{X} = 3.1y + \epsilon\).
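A sketch of that simulation - the sample size and the distribution parameters below are assumptions, not the original values:

```r
n <- 100                                       # assumed sample size
y <- matrix(rpois(n, lambda = 4), ncol = 1)    # slightly lower lambda (assumed value)
eps <- rgamma(n, shape = 3, rate = 1)          # Gamma noise (assumed parameters)
X <- matrix(3.1 * y[, 1] + eps, ncol = 1)
```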
We can plot \(\mathbf{X}\) and \(\mathbf{y}\) to get an idea of what their relationship is. Since \(\mathbf{X}\) is essentially a Gamma random variable, it doesn’t actually have any zero values; the minimum value we’ve generated is actually 3.8055922. As such, the intercept will be an extrapolation of the data that we do have.
We fit another Poisson GLM and check the coefficients. The intercept in this case is interpreted as the log of the expected value of the response when our predictor variable is equal to zero. However - since \(\mathbf{X}\) is essentially a Gamma random variable, it can actually never take a value of zero, as its support is \((0, \infty)\). As such, the interpretation of \(\beta_0\) is somewhat useless here, and thus it makes sense to center \(\mathbf{X}\) as we did earlier with the ordinary linear model.
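Something like this (the model object name is arbitrary):

```r
cont_mod_pois <- glm(y ~ X, family = poisson(link = "log"))
coef(cont_mod_pois)
```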
(Intercept) X
-0.37029176 0.09311951
After centering and refitting the Poisson GLM, the intercept takes on a new interpretation - the log of \(\mathbf{y}\) when \(\mathbf{X}\) is at its mean. Note that the coefficient for \(\mathbf{X}\) does not change, as centering does not change the units of \(\mathbf{X}\).
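A sketch of the centered refit (the object names match the plotting code below):

```r
X_cent <- scale(X, scale = FALSE)
cont_mod_pois_centered <- glm(y ~ X_cent, family = poisson(link = "log"))
coef(cont_mod_pois_centered)
```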
(Intercept) X_cent
0.95596361 0.09311951
Let’s plot both the raw data and the fitted values obtained from our model. The blue horizontal line shows \(\text{exp}(\beta_0)\) i.e., the expected value of \(\mathbf{y}\) when \(\mathbf{X}\) is at its mean value (again, conditional on the line of best fit that we obtained). The vertical yellow line shows where the mean of \(\mathbf{X}\) is; since we centered \(\mathbf{X}\) this is equal to zero. Lastly, the green line shows the predicted values from our model, and this line intersects nicely with the value of the intercept as we would expect.
```r
intercept_exp <- exp(coef(cont_mod_pois_centered)[1])
data.frame(y = y,
           y_pred = fitted(cont_mod_pois_centered),
           X = X_cent[, 1]) %>%
  ggplot(aes(x = X, y = y)) +
  geom_point(alpha = 0.8) +
  geom_hline(yintercept = intercept_exp, color = "dodgerblue", linewidth = 1) +
  geom_vline(xintercept = 0, color = "goldenrod", linewidth = 1) +
  geom_line(aes(y = y_pred), color = "forestgreen", linewidth = 1) +
  labs(x = "X (centered)", y = "Y") +
  theme_classic()
```
Lastly, we’ll again examine the most common real-life situation - models with both categorical and continuous predictors. Consider a dataset where the response is a Poisson-distributed random variable, say the number of copies of viral RNA in a sample taken from a patient. First, let’s say that our sample size is set to a generous \(n = 1000\). Next, imagine that we have two continuous predictors; the first being patient age in years, and the second being the number of months that the patient has been in treatment. We’ll simulate age as a Poisson random variable with a mean of 25 years, and months in treatment as a negative binomial random variable with a mean of 3 months and overdispersion parameter (denoted `size` in the `rnbinom()` function) of 4. Lastly, let’s define a categorical predictor with 3 possible treatment categories. We store all 3 predictors in a `data.frame`.
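A sketch of the simulation (the `data.frame` name is arbitrary; the column names match the printout below):

```r
sim_df <- data.frame(age = rpois(1000, lambda = 25),
                     months = rnbinom(1000, size = 4, mu = 3),
                     treat = sample(c("Drug_A", "Drug_B", "Drug_C"), 1000, replace = TRUE))
head(sim_df, n = 5)
```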
| age | months | treat  |
|-----|--------|--------|
| 17  | 2      | Drug_B |
| 24  | 0      | Drug_C |
| 24  | 2      | Drug_A |
| 18  | 1      | Drug_C |
| 36  | 4      | Drug_C |
Finally, we’ll define \(\mathbf{y}\) to be a function of all 3 predictor variables along with a large amount of Poisson-distributed random noise following the distribution \(\epsilon \sim \text{Poisson}(50)\). We use the handy `dplyr::case_when()` function to create \(\mathbf{y}\) as a piecewise function whose relationship to the predictor variables changes based on the treatment each patient was given. After generating \(\mathbf{y}\), we round it to ensure that it is integer-valued, since we’re focused here on using Poisson GLMs.
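A sketch of that step - the piecewise slopes below are assumptions, not the original values:

```r
sim_df <- mutate(sim_df,
                 y = case_when(treat == "Drug_A" ~ 2.0 * age - 3.0 * months,   # assumed slopes
                               treat == "Drug_B" ~ 1.8 * age - 2.5 * months,
                               TRUE              ~ 1.5 * age - 2.0 * months),
                 y = round(y + rpois(1000, lambda = 50)))                      # add noise, then round
head(sim_df, n = 5)
```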
| age | months | treat  | y   |
|-----|--------|--------|-----|
| 17  | 2      | Drug_B | 79  |
| 24  | 0      | Drug_C | 86  |
| 24  | 2      | Drug_A | 104 |
| 18  | 1      | Drug_C | 84  |
| 36  | 4      | Drug_C | 86  |
We know that for our two continuous predictors, the intercept will indicate the predicted value of \(\mathbf{y}\) for patients with values of zero for those two variables. Since we also have a categorical variable, the intercept will refer to the reference group of that variable - in this case patients who were assigned Drug_A. Lastly, since we’re using a Poisson GLM, we know that the intercept will be the natural log of that quantity. Our model will be of the following form:
\[ \text{log}(\mathbb{E}[\text{Viral RNA}]) = \beta_0 + \beta_{\text{age}} X_{\text{age}} + \beta_{\text{months}} X_{\text{months}} + \beta_{\text{treat}} X_{\text{treat}} + \epsilon \tag{18}\]
Let’s visualize the data so that we can get some idea of what the relationships between our variables are, and what the intercept of the model we’re going to fit will tell us. First let’s look at the relationship between age and the response, splitting by treatment group. We add a purple vertical line showing where the overall mean of age is.
We repeat the visualization for months.
Lastly, let’s simply compare the distribution of the response between the three treatment groups using a beeswarm plot, which is a variation on the classic violin plot that I’ve preferred recently. Check out the `ggbeeswarm` package docs for more. We also add a horizontal line showing \(\bar{\mathbf{y}}\) for each group.
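A sketch of such a plot (styling details are illustrative):

```r
ggplot(sim_df, aes(x = treat, y = y, color = treat)) +
  ggbeeswarm::geom_beeswarm(alpha = 0.6) +
  stat_summary(fun = mean, fun.min = mean, fun.max = mean,
               geom = "crossbar", width = 0.5, color = "black") +
  labs(x = "Treatment", y = "Y") +
  theme_classic(base_size = 14)
```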
Let’s fit a model. As previously noted, we use the log link function. Since our treatment variable has three categories, we now have a coefficient for each of the two non-reference levels. The intercept \(\beta_0\) gives us the log of the expected value of the response for a patient that is 0 years old, has spent 0 months in treatment, and is assigned to be treated with Drug_A. We can tell right away that this isn’t the most useful interpretation, as there are no patients in our study who are zero years old (and even if there were, it probably wouldn’t be that useful to know that quantity).
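A sketch of the fit (the model object name is arbitrary):

```r
full_mod_pois <- glm(y ~ age + months + treat, data = sim_df, family = poisson(link = "log"))
coef(full_mod_pois)
```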
(Intercept) age months treatDrug_B treatDrug_C
4.18370839 0.02170154 -0.03522072 -0.12013892 -0.25924502
Since we’re using the log-link, exponentiating the intercept gives us the expected value of \(\mathbf{y}\) under the conditions we described just above.
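```r
exp(coef(full_mod_pois)[1])
```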
Clearly, the age variable is a prime target for centering. While it could be useful to center months (depending on the goals of the study), there are a total of 113 patients in the study who have spent zero months in treatment, so it’s at least somewhat anchored in reality. We center age, then refit the model. We see that the estimate of \(\beta_0\) changes, while none of the coefficients for our predictors do (as expected).
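A sketch of the centering and refit (the centered column name matches the output below):

```r
sim_df <- mutate(sim_df, age_cent = age - mean(age))
full_mod_pois_cent <- glm(y ~ age_cent + months + treat, data = sim_df, family = poisson(link = "log"))
coef(full_mod_pois_cent)
```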
(Intercept) age_cent months treatDrug_B treatDrug_C
4.72589977 0.02170154 -0.03522072 -0.12013892 -0.25924502
Exponentiating \(\beta_0\) now gives us the expected value of \(\mathbf{y}\) for a patient of average age (which is 24.984 years) who was assigned treatment Drug_A and has spent zero months being treated. This quantity is useful because it tells us the baseline pre-treatment value for the average patient who takes Drug_A.
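```r
exp(coef(full_mod_pois_cent)[1])
```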
While regression intercepts can seem tricky at first and don’t always carry a useful interpretation by default, I think they’re more useful than introductory & intermediate statistics courses give them credit for. I understand those courses’ emphasis on interpreting estimates for covariates of interest, as those are what generally determine the success or failure of our scientific hypotheses, but the intercept can often help establish a useful baseline value against which you can compare your other estimates.
There are a couple of important takeaways I’d like to re-emphasize. First - always make sure you know the reference group of your categorical variables, and set it to something more useful if you need to! The alphabetical order used by R is often not quite what you want. Second - keep centering (or even pseudo-centering around another useful value) in mind & use it often. There are many situations outside of just patient age where a value of 0 makes little sense in the context of your study. Lastly, when using GLMs, make sure you know your link function, its inverse, and how to interpret your coefficients on each scale.
Finally, I’ve included some references below that have been helpful to me, and which I would recommend as further reading if you’re interested in the technical details of regression & other modeling methods.
Faraway, J. Extending the Linear Model with R, 2nd Edition. Chapman and Hall. 2016.
Hastie, T. et al. The Elements of Statistical Learning, 2nd Edition. Springer. 2009.
Wakefield, J. Bayesian and Frequentist Regression Methods. Springer. 2016.
Klein, P. N. Coding the Matrix: Linear Algebra Through Computer Science Applications, 1st Edition. Newtonian Press. 2013.
Strang, G. Introduction to Linear Algebra, 5th Edition. Wellesley-Cambridge Press. 2016.
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.2 (2023-10-31)
os macOS Sonoma 14.3
system x86_64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2024-01-27
pandoc 3.1.9 @ /usr/local/bin/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
beeswarm 0.4.0 2021-06-01 [1] CRAN (R 4.3.0)
cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.0)
codetools 0.2-19 2023-02-01 [1] CRAN (R 4.3.2)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)
digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)
dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.0)
evaluate 0.23 2023-11-01 [1] CRAN (R 4.3.0)
fansi 1.0.6 2023-12-08 [1] CRAN (R 4.3.0)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)
fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
ggbeeswarm 0.7.2 2023-04-29 [1] CRAN (R 4.3.0)
ggplot2 * 3.4.4 2023-10-12 [1] CRAN (R 4.3.0)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)
gtable 0.3.4 2023-08-21 [1] CRAN (R 4.3.0)
htmltools 0.5.7 2023-11-03 [1] CRAN (R 4.3.0)
htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.3.0)
jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.3.0)
knitr 1.45 2023-10-30 [1] CRAN (R 4.3.0)
labeling 0.4.3 2023-08-29 [1] CRAN (R 4.3.0)
lattice 0.22-5 2023-10-24 [1] CRAN (R 4.3.0)
lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.0)
magrittr * 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
MASS 7.3-60 2023-05-04 [1] CRAN (R 4.3.0)
Matrix 1.6-4 2023-11-30 [1] CRAN (R 4.3.0)
mgcv 1.9-1 2023-12-21 [1] CRAN (R 4.3.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)
nlme 3.1-164 2023-11-27 [1] CRAN (R 4.3.0)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
rlang 1.1.2 2023-11-04 [1] CRAN (R 4.3.0)
rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.0)
scales 1.3.0 2023-11-28 [1] CRAN (R 4.3.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.0)
vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.0)
vipor 0.4.7 2023-12-18 [1] CRAN (R 4.3.0)
withr 2.5.2 2023-10-30 [1] CRAN (R 4.3.0)
xfun 0.41 2023-11-01 [1] CRAN (R 4.3.0)
yaml 2.3.8 2023-12-11 [1] CRAN (R 4.3.0)
[1] /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library
──────────────────────────────────────────────────────────────────────────────
---
title: "Understanding Intercepts in Linear Regression Models"
author:
name: Jack R. Leary
email: j.leary@ufl.edu
orcid: 0009-0004-8821-3269
affiliations:
- name: University of Florida
department: Biostatistics
city: Gainesville
state: FL
date: today
date-format: long
format:
html:
code-fold: show
code-copy: true
code-tools: true
toc: true
toc-depth: 2
embed-resources: true
fig-format: retina
fig-width: 9
fig-height: 6
df-print: kable
link-external-newwindow: true
tbl-cap-location: bottom
fig-cap-location: bottom
number-sections: true
execute:
cache: true
freeze: auto
---
```{r setup, echo=FALSE}
set.seed(312) # lucky seed
```
# Libraries {#sec-libs}
```{r, message=FALSE, warning=FALSE, results='hide'}
library(dplyr) # data manipulation
library(ggplot2) # plots
```
# Introduction {#sec-intro}
One concept I struggled with a lot in early statistics courses was what the intercept meant in linear regression models. I tended to just ignore it unless questions specifically pertained to it, and the vast majority of homework questions focused on interpreting the effects of covariates instead. I also saw many of my master's-level students struggle with the topic in the SAS computing course I taught during Fall 2022 as well, with confusion about the effect of centering, the difference between centering and standardizing, and intercept interpretation in the different types of (generalized) linear (mixed) models being common pain points on homeworks. As such, I thought it might be useful - for myself and others - to jot down some notes on how the intercept is estimated and what it means under a variety of regression modelling frameworks.
# Matrix algebra review {#sec-linalg}
We're going to start from first principles here with a quick review on matrix algebra. Linear regression is, after, just multiplying matrices in a clever way.
## Multiplication
### Theory
First define two matrices $\mathbf{A}$ and $\mathbf{B}$, each with 2 rows and 2 columns:
$$
\begin{aligned}
\mathbf{A} &=
\begin{bmatrix}
a_{11} & a_{21} \\
a_{12} & a_{22} \\
\end{bmatrix} \\
\mathbf{B} &=
\begin{bmatrix}
b_{11} & b_{21} \\
b_{12} & b_{22} \\
\end{bmatrix} \\
\end{aligned}
$$ {#eq-1}
Their product, another matrix $C$, also has 2 rows and 2 columns, and its elements are defined like so, with $i$ specifying the row and $j$ the column of each element. What we're doing is finding the dot product of the $i^{\text{th}}$ row of $\mathbf{A}$ and the $j^{\text{th}}$ column of $\mathbf{B}$, the expanded definition of which is below.
$$
\begin{aligned}
c_{ij} &= \mathbf{A}_{i*} \cdot \mathbf{B}_{*j} \\
c_{ij} &= \sum_{k=1}^n a_{ik}b_{kj} \\
c_{ij} &= a_{i1}b_{1j} + \dots + a_{n1}b_{nj} \\
\end{aligned}
$$ {#eq-2}
As such, we can define the product of $\mathbf{A}$ and $\mathbf{B}$ like so:
$$
\begin{aligned}
\mathbf{C} &= \mathbf{A} \mathbf{B} \\
\mathbf{C} &=
\begin{bmatrix}
\mathbf{A}_{1*} \cdot \mathbf{B}_{*1} & \mathbf{A}_{2*} \cdot \mathbf{B}_{*1} \\
\mathbf{A}_{2*} \cdot \mathbf{B}_{*1} & \mathbf{A}_{2*} \cdot \mathbf{B}_{*2} \\
\end{bmatrix} \\
\mathbf{C} &=
\begin{bmatrix}
a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\
a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \\
\end{bmatrix} \\
\end{aligned}
$$ {#eq-3}
**Important Note**: To multiply two matrices $\mathbf{A}$ and $\mathbf{B}$ together, the number of rows of $\mathbf{B}$ must be equal to the number of columns in $\mathbf{A}$. To generalize:
$$
\mathbf{A}_{m \times n} \cdot \mathbf{B}_{n \times p} = \mathbf{C}_{m \times p}
$$ {#eq-4}
### Example
Let's define two matrices:
$$
\begin{aligned}
\mathbf{A} &=
\begin{bmatrix}
3 & 2 \\
0 & 7 \\
\end{bmatrix} \\
\mathbf{B} &=
\begin{bmatrix}
1 & 4 \\
1 & 2 \\
\end{bmatrix} \\
\end{aligned}
$$ {#eq-5}
Their product $\mathbf{C}$ is defined as:
$$
\begin{aligned}
\mathbf{C} &=
\begin{bmatrix}
3 \times 1 + 2 \times 1 & 3 \times 4 + 2 \times 2 \\
0 \times 1 + 7 \times 1 & 0 \times 4 + 7 \times 2 \\
\end{bmatrix} \\
\mathbf{C} &=
\begin{bmatrix}
5 & 16 \\
7 & 14 \\
\end{bmatrix} \\
\end{aligned}
$$ {#eq-6}
We can check this using R:
```{r}
A_mat <- matrix(c(3, 2, 0, 7),
nrow = 2,
ncol = 2,
byrow = TRUE)
B_mat <- matrix(c(1, 4, 1, 2),
nrow = 2,
ncol = 2,
byrow = TRUE)
C_mat <- A_mat %*% B_mat
C_mat
```
## Transposition
### Theory
Very simply, the transpose of a matrix can be thought of as simply flipping the rows & columns. The transpose of an $m \times n$ matrix is thus $n \times m$. A matrix that is its own transpose is *symmetric*. Notation-wise, some write the matrix transpose as $A^T$, others as $A^\mathsf{T}$, and still others denote it by $A^\intercal$, but I personally prefer $A^\prime$. The matrix transpose is more formally defined as:
$$
\begin{aligned}
& \mathbf{A}_{m \times n} \\
& \mathbf{B}_{n \times n} = A^\prime \\
& \mathbf{B}_{i, j} = \mathbf{A}_{j, i} \\
\end{aligned}
$$ {#eq-7}
### Example
If we define the following matrix $\mathbf{A}_{2 \times 3}$, we would expect its transpose to be the $3 \times 2$ matrix $\mathbf{A}^\prime$:
$$
\begin{aligned}
\mathbf{A} &=
\begin{bmatrix}
0 & -2 & 2 \\
3 & -10 & 0 \\
\end{bmatrix} \\
\mathbf{A}^\prime &=
\begin{bmatrix}
0 & 3 \\
-2 & -10 \\
2 & 0 \\
\end{bmatrix} \\
\end{aligned}
$$ {#eq-8}
We can confirm this in R using the `t()` function:
```{r}
A <- matrix(c(0, -2, 2, 3, -10, 0),
nrow = 2,
ncol = 3,
byrow = TRUE)
t(A)
```
## Inversion
### Theory
The last matrix operation we'll go over is inversion - this one is very important & is applied all throughout statistics & computing in general. The inverse of a matrix is simply another matrix that, when multiple by the first matrix, returns the identity matrix (a $n \times n$ matrix with 1s on the diagonal and 0s everywhere else). Matrices must be square to be invertible, but not even all square matrices are invertible; the ones that aren't are called *singular*. We'll gloss over that fact though, & simply assume / fervently hope that the matrices we encounter will have an easily-computable (approximate) inverse. See [this article](https://doi.org/10.2307/2307034) for more information on inverse computation. We define the inverse of $\mathbf{A}_{n \times n}$ as:
$$
\mathbf{A} \mathbf{A}^{-1} = \mathbf{I}_n
$$ {#eq-9}
### Example
In R, the `solve()` function is used most frequently to compute the matrix inverse; since the result is an approximation, we round the results to show that the result is the identity matrix:
```{r}
A <- matrix(rnorm(9), nrow = 3, ncol = 3)
A_inv <- solve(A)
round(A %*% A_inv)
```
There are many other ways to compute a matrix inverse, as well as several different types of [pseudoinverses](https://en.wikipedia.org/wiki/Generalized_inverse). We can compute [the Moore-Penrose pseudoinverse](https://en.wikipedia.org/wiki/Moore–Penrose_inverse) using the `MASS` package:
```{r}
A_mp_inv <- MASS::ginv(A)
round(A %*% A_mp_inv)
```
## The identity matrix
### Theory
As mentioned previously, the identity matrix $\mathbf{I}_{n}$ is a square matrix composed entirely of zeroes *except* along the diagonal, which is composed of ones. This matrix carries some unique properties (which are listed below) that will be helpful to us later on.
$$
\begin{aligned}
\mathbf{I}_{n} &=
\begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & 0 \\
0 & 0 & 0 & 1 \\
\end{bmatrix} \\
\mathbf{I}_{n}^\prime &= \mathbf{I}_{n} \\
\mathbf{I}_{n}^{-1} &= \mathbf{I}_{n} \\
\end{aligned}
$$ {#eq-10}
### Example
We can set up a $3 \times 3$ identity matrix $\mathbf{I}_{3}$ in R using the `diag()` function:
```{r}
ident_mat <- diag(nrow = 3)
ident_mat
```
The transpose is also equal to $\mathbf{I}_{3}$:
```{r}
t(ident_mat)
```
As is the inverse:
```{r}
solve(ident_mat)
```
# Linear model {#sec-LMs}
## Setup
For now, we'll take it for granted that the solution to a linear regression problem is defined as follows:
$$
\widehat{\boldsymbol{\beta}} = \left(\mathbf{X}^\prime \mathbf{X} \right)^{-1} \mathbf{X}^\prime \mathbf{y}
$$ {#eq-11}
## The intercept-only model
The intercept-only model (also sometimes called the null model) is defined as linear regression when $\mathbf{X}$ is simply a column vector of ones:
$$
\mathbf{X} =
\begin{bmatrix}
1 \\
\vdots \\
1 \\
\end{bmatrix}
$$ {#eq-12}
We know the intercept-only model produces the mean as the one predicted value, as the mean minimizes the sum of squared errors in the absence of any other covariates. We can check this using R - we'll first generate a vector $\mathbf{y}$ consisting of 5 realizations of a random variable, such that $Y \sim \mathcal{N}(0, 3)$.
```{r}
y <- rnorm(5, mean = 0, sd = 3)
y <- matrix(y, ncol = 1)
y
```
The mean of $\mathbf{y}$ is:
```{r}
mean(y)
```
We can use R to fit an intercept-only model. We can see that the intercept coefficient $\beta_0$ is equal to the mean of $\mathbf{y}$.
```{r}
null_mod <- lm(y ~ 1)
coef(null_mod)
```
Let's use linear algebra to figure out why this is true. Once again, we know that the linear regression closed-form solution is given by the following:
$$
\widehat{\boldsymbol{\beta}} = \left(\mathbf{X}^\prime \mathbf{X} \right)^{-1} \mathbf{X}^\prime \mathbf{y}
$$ {#eq-13}
Let's first define $\mathbf{X}$ - just a column vector of 1s with $n = 5$ rows:
```{r}
X <- c(1, 1, 1, 1, 1)
X <- matrix(X, ncol = 1)
X
```
The value of $\mathbf{X}^\prime \mathbf{X}$ is given by the following - note that this is equal to our sample size $n = 5$. We knew that this quantity would be a scalar (a $1 \times 1$ matrix) since $\mathbf{X}^\prime$ has 1 row and 5 columns, and $\mathbf{X}$ has 5 rows and 1 column, thus by the rule we defined above their product has 1 row and 1 column.
```{r}
t(X) %*% X
```
The inverse of which, $\left(\mathbf{X}^\prime \mathbf{X} \right)^{-1}$, is of course $n^{-1}$:
```{r}
solve(t(X) %*% X)
```
We multiply the above by $\mathbf{X}^\prime$ again to obtain $\left(\mathbf{X}^\prime \mathbf{X} \right)^{-1} \mathbf{X}^\prime$, which gives us a constant vector of length $n$ with all values being equal to $n^{-1}$:
```{r}
solve(t(X) %*% X) %*% t(X)
```
Lastly, we multiply the above by $\mathbf{y}$. Remember how multiplying vectors works - in this case we are multiplying each element of the above vector $\left(\mathbf{X}^\prime \mathbf{X} \right)^{-1} \mathbf{X}^\prime$ with each element of $\mathbf{y}$ and adding them together. We'll define $\mathbf{Z} = \left(\mathbf{X}^\prime \mathbf{X} \right)^{-1} \mathbf{X}^\prime$ for convenience of notation:
$$
\mathbf{Z} \mathbf{y} = \sum_{i=1}^n \mathbf{Z}_i \mathbf{y}_i
$$ {#eq-14}
Since each element of $\mathbf{Z}$ is the same, $n^{-1}$, by the transitive property the above quantity is equivalent to:
$$
\begin{aligned}
\mathbf{Z} \mathbf{y} &= \left(\mathbf{X}^\prime \mathbf{X} \right)^{-1} \mathbf{X}^\prime \mathbf{y} \\
\mathbf{Z} \mathbf{y} &= \sum_{i=1}^n \mathbf{Z}_i \mathbf{y}_i \\
\mathbf{Z} \mathbf{y} &= n^{-1} \sum_{i=1}^n \mathbf{y}_i \\
\end{aligned}
$$ {#eq-15}
This is simply the sum of all the elements of $\mathbf{y}$ divided by $n$ - the mean! We can verify this with R by using linear algebra to compute the OLS solution:
```{r}
solve(t(X) %*% X) %*% t(X) %*% y
```
This is equal to simply taking the mean of $\mathbf{y}$:
```{r}
mean(y)
```
## Models with categorical predictors
In practice of course we usually build models with predictors of interest outside of the intercept. Categorical variables are composed of discrete values, each with a different meaning e.g., we could have a variable containing the type of treatment a patient has received. In order to fit models with categorical variables, it's necessary to expand a categorical variable into multiple indicator variables - variables composed of 1s and 0s depending on whether a certain observation belongs to a certain category. This is a little confusing, so let's show an example. We'll start by creating a categorical variable .
```{r}
X <- sample(c("A", "B"), size = 10, replace = TRUE)
X <- matrix(X, ncol = 1)
X
```
To convert $\mathbf{X}$ into a numeric variable that we can use in a model, we use the `model.matrix()` function. To use this function though, we need to define the model we're interested in using R's [formula syntax](https://r4ds.github.io/bookclub-tmwr/r-formula-syntax.html). The output we see shows an intercept column, which we understand, and another column composed of 1s and 0s called **XB**. This column is an indicator variable that tells us whether each observation belongs to category **B** or not - thus when **XB** is equal to 0, we know that the observation belongs to category **A**. This process of converting non-numeric categorical data to indicator variables has many names (one-hot encoding, dummifying, etc.), and there's many ways of doing it. You can read more about the various ways of doing so [here](https://stats.oarc.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis/), but for now we'll take it for granted that this is how it works under the hood in the `lm()` function that we use to fit linear models.
```{r}
X_2 <- model.matrix(~X)
X_2
```
From what we already know about matrix multiplication, we can see that the **B** group is going to have a different predicted average than the **A** group. We'll also need to generate a new response variable $\mathbf{y}$, since we've increased our sample size to $n = 10$.
```{r}
y <- rnorm(10, mean = 0, sd = 3)
y <- matrix(y, ncol = 1)
y
```
The mean of $\mathbf{y}$ for each treatment group can be computed as follows. We're going to use a little `dplyr` code to perform the summarization, as I find it a little more readable & replicable than base R. This will necessitate creating a `data.frame` to hold our $\mathbf{y}$ and $\mathbf{X}$ variables. Note that we've switched back to the categorical representation of $\mathbf{X}$, as it's more interpretable than the indicator variable version for summaries such as this.
```{r}
data.frame(X = X, y = y)
```
Here's the mean for each group:
```{r}
data.frame(X = X, y = y) %>%
with_groups(X,
summarise,
mu = mean(y))
```
Let's use the OLS formula to solve for $\boldsymbol{\beta} = [\beta_0, \beta_1]$. Note that we're once again using the design matrix version of $\mathbf{X}$ with the intercept column and indicator variable for group. We see that the intercept $\beta_0$ is equal to the mean of the **A** group - but the coefficient for the **B** group $\beta_1$ isn't! This is because $\beta_1$ doesn't have the same interpretation as $\beta_0$. While $\beta_0$ is equal to the mean of the reference group (i.e., the first level of the categorical variable, in our case group **A**), $\beta_1$ represents the difference between the mean for group **A** and the mean for group **B**.
```{r}
cat_mod_beta <- solve(t(X_2) %*% X_2) %*% t(X_2) %*% y
cat_mod_beta
```
This becomes easier to understand when we sum the coefficients and get the average for group **B**, which is `r library(dplyr); data.frame(X = X, y = y) %>% with_groups(X, summarise, mu = mean(y)) %>% filter(X == "B") %>% pull(mu)`.
```{r}
sum(cat_mod_beta)
```
This is validated by fitting a linear model with `lm()` and checking the output, which matches our manual solution:
```{r}
cat_mod <- lm(y ~ X)
coef(cat_mod)
```
To summarize: when categorical variables are used in an ordinary linear regression, the intercept represents the mean of the response variable when each of the categorical variables is at its reference level. When running regressions like this, it's important to make sure that 1) you know what the reference levels are for each of your categorical variables and 2) that those reference levels make sense. Sometimes it doesn't matter what order the categorical variables are in, but it often does - so check! A final note is that this interpretation holds when *only* categorical variables are used in your model. When continuous variables are included too, the interpretation changes. More on that in a bit.
::: {.callout-note}
Working with categorical variables (or factors, as R labels them) can be confusing. If you want to gain a deeper understanding of how factors work, check out [the chapter on them in the R for Data Science book](https://r4ds.had.co.nz/factors.html). For factor-related tools, try the [`forcats` R package](https://forcats.tidyverse.org/articles/forcats.html), which is part of the [`tidyverse`](https://www.tidyverse.org) ecosystem and makes dealing with factors a lot simpler.
:::
## Models with continuous predictors
Continuous predictors differ from categorical ones in that they do not have a discrete set of possible values. The interpretation of the intercept thus differs. For any regression, the intercept interpretation is the value of the response when all predictors are equal to zero. What "all predictors equal to zero" means depends on the types of predictors you're using; when predictors are categorical this implies that all predictors are at their reference levels (thanks to the indicator variable encoding that's done in the background). With continuous variables, being equal to zero might have a reasonable interpretation or it might not, depending on what the variable is. In this case, think of the intercept like you would in middle school when learning $y = mx + b$. The intercept, $b$, is the value of $y$ where $x = 0$, like the plot below.
```{r}
data.frame(x = rnorm(500, sd = 2)) %>%
mutate(y = x + rnorm(500, sd = 0.5)) %>%
ggplot(aes(x = x, y = y)) +
geom_point(alpha = 0.8) +
geom_vline(xintercept = 0, color = "forestgreen", size = 1) +
labs(x = "X", y = "Y") +
theme_classic(base_size = 14)
```
In some cases, this value might be meaningful - for example, if your covariate of interest was the expression of a certain gene, then zero expression has a biological meaning. In other situations it makes little sense, such as when the covariate of interest is the age of each patient in your dataset. It's unlikely that age being equal to zero would really mean much, as newborns aren't often part of clinical trials. There's ways to remedy this difficulty in interpretation, but we'll focus first on why the interpretation is the way it is.
We'll start by generating some data. The predictor variable we're interested in will be negative-binomially distributed, with $X \sim \text{NB}(10, 0.7)$. Note that we're using the parameterization of the negative binomial used in the `rnbinom()` function defaults, with the `size` and `probability` parameters. Our response variable $\mathbf{y}$ will be a function of $\mathbf{X}$ with added normally-distributed noise. Since we've increased our sample size to $n = 500$, we'll only look at the first few rows of each variable.
```{r}
X <- rnbinom(500, size = 10, prob = 0.7)
y <- 2 * X + rnorm(500, mean = 0, sd = 2)
X <- matrix(X, ncol = 1)
y <- matrix(y, ncol = 1)
data.frame(X = X[, 1], y = y[, 1]) %>%
  slice_head(n = 5)
```
We can plot the data using `ggplot2` to get an idea of what the relationship between the two variables is.
```{r}
data.frame(X = X[, 1], y = y[, 1]) %>%
  ggplot(aes(x = X, y = y)) +
  geom_point() +
  labs(x = "X", y = "y") +
  theme_classic(base_size = 14)
```
Using `dplyr`, we can find the mean of $\mathbf{y}$ when $\mathbf{X} = 0$.
```{r}
data.frame(X = X[, 1], y = y) %>%
  filter(X == 0) %>%
  summarise(mu = mean(y))
```
Let's fit a linear model manually and see what the coefficients are. We'll first need to create the design matrix again using `model.matrix()`. This gives us a two-column matrix, with the first column being the intercept (all 1s) and the second column being the negative binomial random variable $\mathbf{X}$ we just simulated. Unlike models with only categorical predictors, the intercept is *not* simply the mean of the observed responses at $\mathbf{X} = 0$. Instead, it's the expected value of the response at $\mathbf{X} = 0$ according to the fitted line i.e., it's estimated using *all* of the data in $\mathbf{X}$, not just the observations where $\mathbf{X} = 0$. See [this StackOverflow post](https://stackoverflow.com/questions/49085049/why-is-the-intercept-not-the-mean-at-the-continuous-predictor-being-zero) for another example.
```{r}
X_3 <- model.matrix(~X)
cont_mod_beta <- solve(t(X_3) %*% X_3) %*% t(X_3) %*% y
cont_mod_beta
```
We can validate the above by fitting a linear model with `lm()` and checking the coefficients $\boldsymbol{\beta} = [\beta_0, \beta_1]$, which are equal to our manually-computed coefficients.
```{r}
cont_mod <- lm(y ~ X)
coef(cont_mod)
```
### Centering continuous predictors
One way to make models like this more interpretable is to center continuous variables around their means. Doing so ensures that the centered version of the continuous variable is equal to zero when the original version of the variable is at its mean. This can give a better interpretation to some models e.g., if the continuous variable of interest was patient age, then the intercept would be the expected value of the response for a patient of mean age. Since centering doesn't change the units of the continuous variable, only the intercept $\beta_0$ will change i.e., the coefficient for our predictor of interest will stay the same. We can validate this by creating a centered version of $\mathbf{X}$ and re-running the regression. The `scale()` function centers (subtracts the mean) *and* standardizes (divides by the standard deviation) by default, so we need to set `scale = FALSE` in order to only center the data.
```{r}
X_cent <- scale(X, scale = FALSE)
cont_mod_centered <- lm(y ~ X_cent)
coef(cont_mod_centered)
```
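As an aside, there's a simple identity behind this behavior. For a simple linear regression, the OLS estimates always satisfy
$$
\hat{\beta}_0 = \bar{\mathbf{y}} - \hat{\beta}_1 \bar{\mathbf{X}}
$$
so the fitted line always passes through the point of means $(\bar{\mathbf{X}}, \bar{\mathbf{y}})$. Centering forces $\bar{\mathbf{X}} = 0$, which makes $\hat{\beta}_0 = \bar{\mathbf{y}}$ while leaving the slope alone. Keep this identity in mind - it will come up again when we standardize both variables below.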
### Standardizing continuous predictors
Standardizing (or scaling, as R calls it) can also occasionally be useful. Dividing by the standard deviation in addition to centering results in our continuous random variable having mean 0 and standard deviation 1. This does change the units of the variable, which is important to remember. It does not, however, change the intercept, whose value and interpretation remain the same as in the centered-only model we fit above. The coefficient $\beta_1$ does change, and now represents the expected change in $\mathbf{y}$ given a one-standard-deviation increase in $\mathbf{X}$. For more on standardization see [this StatLect post](https://statlect.com/fundamentals-of-statistics/linear-regression-with-standardized-variables) and [this blog post from Columbia's Dr. Andrew Gelman](https://statmodeling.stat.columbia.edu/2009/07/11/when_to_standar/).
```{r}
X_scaled <- scale(X)
cont_mod_scaled <- lm(y ~ X_scaled)
coef(cont_mod_scaled)
```
Lastly, with respect to standardization at least, it's important to note that if we standardize the response variable $\mathbf{y}$ in addition to standardizing the predictor matrix $\mathbf{X}$, the intercept will disappear i.e., it will become zero. This is because after standardization, the means of both the predictor and response variables are equal to zero. Since the fitted line always passes through the point of means (the identity from earlier), and both of those means are now zero, the intercept is zero as well. Note that because of floating-point arithmetic the value of the intercept shown below isn't exactly zero, but it is very close. For another example of this, see [this Cross Validated post](https://stats.stackexchange.com/questions/43036/why-does-the-y-intercept-of-a-linear-model-disappear-when-i-standardize-variable).
```{r}
y_scaled <- scale(y)
cont_mod_resp_scaled <- lm(y_scaled ~ X_scaled)
coef(cont_mod_resp_scaled)
```
## Models with categorical & continuous predictors
Finally, let's put it all together. In most real-life modeling scenarios you'll have a mix of categorical and continuous predictors, and thus care must be taken when preparing the data. You generally will want your intercept to be meaningful - whatever that means for the data you're analyzing. In this case, we'll simulate data under the following scenario: our response variable $\mathbf{y}$ is the expression of some gene of interest in each of $n = 500$ patients, and our predictor matrix $\mathbf{X}$ is composed of one categorical variable representing a treatment, a continuous variable representing patient age in years, and another categorical variable representing the facility at which each patient was treated.
```{r}
X_df <- data.frame(age = rpois(500, lambda = 30),
                   treatment = sample(c("A", "B"), 500, replace = TRUE),
                   facility = sample(c("Hosp1", "Hosp2"), 500, replace = TRUE)) %>%
  mutate(y = 2 * age + rpois(500, lambda = 10),
         y = case_when(treatment == "A" ~ 0.7 * y - rpois(1, lambda = 20),
                       TRUE ~ y),
         y = case_when(facility == "Hosp2" ~ y + rpois(1, lambda = 10),
                       TRUE ~ y))
```
The above code might be a bit confusing - we simulate **age** as a Poisson random variable with a mean of 30 years, and randomly assign one of two treatments and one of two facilities to each patient. Our response variable $\mathbf{y}$ is a function of all three predictors. We start by multiplying **age** by two and adding Poisson-distributed random noise. We then change the slope and subtract a Poisson-distributed offset for treatment group **A**, and add a Poisson-distributed offset for facility group **Hosp2**.
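Written out step by step, the simulation looks like this (the superscripts below just mark intermediate stages of the calculation and aren't notation used elsewhere in this post):
$$
\begin{aligned}
y_i^{(0)} &= 2 \cdot \text{age}_i + \epsilon_i, \quad \epsilon_i \sim \text{Poisson}(10) \\
y_i^{(1)} &= \begin{cases} 0.7 \, y_i^{(0)} - \delta_A & \text{if treatment is A} \\ y_i^{(0)} & \text{if treatment is B} \end{cases} \\
y_i &= \begin{cases} y_i^{(1)} + \delta_H & \text{if facility is Hosp2} \\ y_i^{(1)} & \text{if facility is Hosp1} \end{cases} \\
\end{aligned}
$$
where $\delta_A \sim \text{Poisson}(20)$ and $\delta_H \sim \text{Poisson}(10)$ are each a *single* random draw (because we used `rpois(1, ...)`), applied to every patient in the relevant group. Visualizing the data should help make sense of this process: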
```{r, message=FALSE, warning=FALSE}
ggplot(X_df, aes(x = age, y = y, color = treatment)) +
  facet_wrap(~facility) +
  geom_point(alpha = 0.8) +
  geom_smooth(mapping = aes(group = treatment),
              color = "black",
              method = "lm",
              show.legend = FALSE) +
  labs(x = "Age (Years)",
       y = "Gene Expression",
       color = "Treatment") +
  theme_classic()
```
We can see that the lowest value of **age** in our dataset is `r min(X_df$age)`, so it doesn't really make sense to have our intercept correspond to **age** being equal to 0. Instead, we should center **age**, which will ensure that our intercept represents the expected value of $\mathbf{y}$ for a patient of mean age that was given treatment **A** at facility **Hosp1**. We know the reference groups for each categorical variable because R sorts factor levels alphabetically by default (though this can be overridden with functions like `relevel()`, as shown earlier).
```{r}
X_df <- mutate(X_df,
               age_centered = scale(age, scale = FALSE))
```
Now that we have our centered variable, we can set up our design matrix. We use the formula syntax to specify which predictors we're interested in.
```{r}
X_design_mat <- model.matrix(~age_centered + treatment + facility, data = X_df)
head(X_design_mat)
```
We can now solve the ordinary linear regression problem by hand to obtain the coefficient vector $\boldsymbol{\beta} = [\beta_0, \beta_1, \beta_2, \beta_3]$.
```{r}
full_mod_beta <- solve(t(X_design_mat) %*% X_design_mat) %*% t(X_design_mat) %*% X_df$y
full_mod_beta
```
We can verify the result once again using `lm()`. The interpretation for $\beta_0$ is the expected response for a patient of mean **age** who is taking treatment **A** at facility **Hosp1**.
```{r}
full_mod <- lm(y ~ age_centered + treatment + facility, X_df)
coef(full_mod)
```
# Generalized linear model {#sec-GLMs}
## Setup
We'll next move to the more complicated case of the generalized linear model (GLM). We'll start by defining the basic form of a GLM; the main difference from an ordinary linear model is that the linear predictor models a transformation of the *expected value* of the response rather than the response itself. This transformation is taken via what we call a [link function](https://en.wikipedia.org/wiki/Generalized_linear_model#Link_function). There are some details here I'm skipping over, but in practice the link function is often the natural log, though others such as the logit (for logistic regression) are also common. The link function is usually denoted $g(\cdot)$, which gives us the following general form of a GLM with $p$ covariates:
$$
g(\mathbb{E}[\mathbf{y}]) = \beta_0 + \beta_1 \mathbf{X}_1 + \dots + \beta_p\mathbf{X}_p
$$ {#eq-16}
Like ordinary linear models, GLMs are linear in their covariates (as shown above), which is what gives them their relatively easy interpretations. However, unlike with ordinary linear regression, there is no simple closed-form solution to the above equation. Instead, a solution is estimated using something like iteratively reweighted least squares or Newton's method. As such, we won't be able to provide exact calculations of the GLM solutions like we did above with the ordinary linear models, so we'll stick to first providing the interpretation and then checking to make sure the fitted model matches that interpretation.
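To make that a bit more concrete, here's a rough sketch of IRLS for a Poisson GLM with a log link. Everything below (the function `irls_poisson()` and the simulated `x_demo`/`y_demo` data) is purely illustrative and not part of the models fit elsewhere in this post; in practice `glm()` does the same job with more careful starting values and convergence checks.
```{r}
# a rough, illustrative IRLS implementation for a Poisson GLM with a log link
irls_poisson <- function(X, y, n_iter = 50, tol = 1e-10) {
  mu <- y + 0.5                          # simple starting values (avoids log(0))
  eta <- log(mu)                         # linear predictor on the link scale
  beta <- rep(0, ncol(X))
  for (i in seq_len(n_iter)) {
    z <- eta + (y - mu) / mu             # working response
    W <- diag(as.vector(mu))             # working weights (for Poisson these are mu)
    beta_new <- solve(t(X) %*% W %*% X, t(X) %*% W %*% z)  # weighted least squares
    eta <- X %*% beta_new
    mu <- exp(eta)                       # inverse link
    if (max(abs(beta_new - beta)) < tol) {
      beta <- beta_new
      break
    }
    beta <- beta_new
  }
  as.vector(beta)
}
# quick check against glm() on some throwaway simulated data
x_demo <- rpois(200, lambda = 4)
y_demo <- rpois(200, lambda = exp(0.5 + 0.2 * x_demo))
irls_poisson(cbind(1, x_demo), y_demo)
coef(glm(y_demo ~ x_demo, family = poisson(link = "log")))
```
The two sets of estimates should agree closely, since IRLS is essentially what `glm()` runs under the hood.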
## The intercept-only model
In the intercept-only model, the GLM formula becomes:
$$
\begin{aligned}
g(\mathbb{E}[\mathbf{y}]) &= \beta_0 \\
\mathbb{E}[\mathbf{y}] &= g^{-1}(\beta_0) \\
\end{aligned}
$$ {#eq-17}
If we're using $g(\cdot) = \text{log}(\cdot)$ (the log-link), then the inverse of $g(\cdot)$ is $g^{-1}(\cdot) = \text{exp}(\cdot)$. In this case, it's easy to see that the intercept $\beta_0$ is actually the natural log of the mean of the response variable $\mathbf{y}$.
We can verify this in R. In this example we'll use a Poisson GLM with a log link function, as it's probably the simplest to understand. We'll start by simulating a Poisson-distributed response $\mathbf{y} \sim \text{Poisson}(5)$ with $n = 10$.
```{r}
y <- rpois(10, lambda = 5)
y <- matrix(y, ncol = 1)
y
```
The mean of $\mathbf{y}$ is `r mean(y)`, and the log of that quantity is:
```{r}
log(mean(y))
```
We can fit a Poisson GLM like so. We can clearly see that $\beta_0 = \text{log}(\bar{\mathbf{y}})$.
```{r}
null_mod_pois <- glm(y ~ 1, family = poisson(link = "log"))
coef(null_mod_pois)
```
Thus, when we use the inverse of the link function (exponentiation) on the estimated value of $\beta_0$, we get the value of $\mathbb{E}[\mathbf{y}] = `r mean(y)`$.
```{r}
exp(coef(null_mod_pois))
```
## Models with categorical predictors
The extension to categorical predictors is much the same as we saw before with ordinary linear models, with the primary change being that we're now working on the log scale. Let's add a categorical predictor to our model. First we simulate a categorical $\mathbf{X}$:
```{r}
X <- sample(c("A", "B"), size = 10, replace = TRUE)
X <- matrix(X, ncol = 1)
X
```
Again using `dplyr`, we can find $\bar{\mathbf{y}}$ and $\text{log}(\bar{\mathbf{y}})$ for each group in our categorical variable:
```{r}
data.frame(y = y,
           X = X[, 1]) %>%
  with_groups(X,
              summarise,
              mu = mean(y),
              log_mu = log(mean(y)))
```
We fit a Poisson GLM using the `glm()` function, again using the natural log as our link function. We see that the intercept is equal to the log of the mean of our response variable for group **A**.
```{r}
cat_mod_pois <- glm(y ~ X, family = poisson(link = "log"))
coef(cat_mod_pois)
```
Next, if we sum the coefficients, we get the log of the mean of the response variable for group **B**:
```{r}
sum(coef(cat_mod_pois))
```
If we exponentiate the coefficients the intercept becomes simply the mean of $\mathbf{y}$ for group **A**.
```{r}
exp(coef(cat_mod_pois))
```
Lastly, if we exponentiate the sum of the coefficients (the order of these operations is important), we get the mean of $\mathbf{y}$ for group **B**:
```{r}
exp(sum(coef(cat_mod_pois)))
```
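One more way to see why the order matters: since $\text{exp}(\beta_0 + \beta_1) = \text{exp}(\beta_0)\,\text{exp}(\beta_1)$, the exponentiated coefficient for group **B** acts as a *multiplier* on group **A**'s mean - the usual "multiplicative effect" reading of Poisson GLM coefficients. A quick check, reusing the `cat_mod_pois` model we just fit:
```{r}
# multiplying the exponentiated coefficients reproduces the group B mean from above
exp(coef(cat_mod_pois)[1]) * exp(coef(cat_mod_pois)[2])
```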
## Models with continuous predictors
As with ordinary linear models, in the presence of a continuous predictor the intercept is tied to the fitted line rather than to a simple group mean - it's the (link-scale) expected value of the response when the predictor is zero, conditional on the line of best fit that we generate. First we generate another $\mathbf{y}$ with a larger sample size and a slightly lower value of $\lambda$. Then we generate Gamma random noise, and define the predictor $\mathbf{X} = 3.1\mathbf{y} + \epsilon$.
```{r}
y <- rpois(300, lambda = 3)
y <- matrix(y, ncol = 1)
X <- 3.1 * y + rgamma(300, shape = 20, rate = 4)
X <- matrix(X, ncol = 1)
```
We can plot $\mathbf{X}$ and $\mathbf{y}$ to get an idea of what their relationship is. Since the noise we added to $\mathbf{X}$ is Gamma-distributed (and $\mathbf{y}$ is non-negative), $\mathbf{X}$ is strictly positive - the minimum value we've generated is `r min(X[, 1])`. As such, the intercept will be an extrapolation beyond the data that we do have.
```{r}
data.frame(y = y,
           X = X[, 1]) %>%
  ggplot(aes(x = X, y = y)) +
  geom_point(alpha = 0.8) +
  labs(x = "X", y = "Y") +
  theme_classic(base_size = 14)
```
We fit another Poisson GLM and check the coefficients. The intercept in this case is interpreted as the log of the expected value of the response when our predictor variable is equal to zero. However - since $\mathbf{X}$ is essentially a [Gamma random variable](https://en.wikipedia.org/wiki/Gamma_distribution), it can actually never take a value of zero, as its support is $(0, \infty)$. As such, the interpretation of $\beta_0$ is somewhat useless here, and thus it makes sense to center $\mathbf{X}$ as we did earlier with the ordinary linear model.
```{r}
cont_mod_pois <- glm(y ~ X, family = poisson(link = "log"))
coef(cont_mod_pois)
```
After centering and refitting the Poisson GLM, the intercept takes on a new interpretation - the log of the expected value of $\mathbf{y}$ when $\mathbf{X}$ is at its mean. Note that the coefficient for $\mathbf{X}$ does not change, as centering does not change the units of $\mathbf{X}$.
```{r}
X_cent <- scale(X, scale = FALSE)
cont_mod_pois_centered <- glm(y ~ X_cent, family = poisson(link = "log"))
coef(cont_mod_pois_centered)
```
Let's plot both the raw data and the fitted values obtained from our model. The blue horizontal line shows $\text{exp}(\beta_0)$ i.e., the expected value of $\mathbf{y}$ when $\mathbf{X}$ is at its mean value (again, conditional on the line of best fit that we obtained). The vertical yellow line shows where the mean of $\mathbf{X}$ is; since we centered $\mathbf{X}$ this is equal to zero. Lastly, the green line shows the predicted values from our model, and this line intersects nicely with the value of the intercept as we would expect.
```{r}
intercept_exp <- exp(coef(cont_mod_pois_centered)[1])
data.frame(y = y,
           y_pred = fitted(cont_mod_pois_centered),
           X = X_cent[, 1]) %>%
  ggplot(aes(x = X, y = y)) +
  geom_point(alpha = 0.8) +
  geom_hline(yintercept = intercept_exp, color = "dodgerblue", size = 1) +
  geom_vline(xintercept = 0, color = "goldenrod", size = 1) +
  geom_line(aes(y = y_pred), color = "forestgreen", size = 1) +
  labs(x = "X (centered)", y = "Y") +
  theme_classic()
```
## Models with categorical & continuous predictors
Lastly, we'll again examine the most common real-life situation - models with both categorical and continuous predictors. Consider a dataset where the response is a Poisson-distributed random variable, say the number of copies of viral RNA in a sample taken from a patient. First, let's say that our sample size is set to a generous $n = 1000$. Next, imagine that we have two continuous predictors; the first being patient age in years, and the second being the number of months that the patient has been in treatment. We'll simulate age as a Poisson random variable with a mean of 25 years, and months in treatment as a negative binomial random variable with a mean of 3 months and overdispersion parameter (denoted `size` in the `rnbinom()` function) of 4. Lastly, let's define a categorical predictor with 3 possible treatment categories. We store all 3 predictors in a `data.frame`.
```{r}
X_age <- rpois(1000, 25)
X_months <- rnbinom(1000, size = 4, mu = 3)
X_treat <- sample(c("Drug_A", "Drug_B", "Drug_C"),
                  size = 1000,
                  replace = TRUE)
mod_df <- data.frame(age = X_age,
                     months = X_months,
                     treat = X_treat)
slice_head(mod_df, n = 5)
```
Finally, we'll define $\mathbf{y}$ to be a function of all 3 predictor variables along with a large amount of Poisson-distributed random noise following the distribution $\epsilon \sim \text{Poisson}(50)$. We use the handy `dplyr::case_when()` function to create $\mathbf{y}$ as a piecewise function whose relationship to the predictor variable changes based on the treatment each patient was given. After generating $\mathbf{y}$, we round it to ensure that it is integer-valued, since we're focused here on using Poisson GLMs.
```{r}
epsilon <- rpois(1000, 50)
mod_df <- mod_df %>%
  mutate(y = case_when(treat == "Drug_A" ~ 2.25 * age - 1.2 * months,
                       treat == "Drug_B" ~ 2 * age - 3 * months,
                       treat == "Drug_C" ~ 1.75 * age - 5 * months),
         y = round(y + epsilon))
slice_head(mod_df, n = 5)
```
We know that for our two continuous predictors, the intercept will indicate the predicted value of $\mathbf{y}$ for patients with values of zero for those two variables. Since we also have a categorical variable, the intercept will refer to the reference group of that variable - in this case patients who were assigned **Drug_A**. Lastly, since we're using a Poisson GLM, we know that the intercept will be the natural log of that quantity. Our model will be of the following form:
$$
\text{log}(\mathbb{E}[\text{Viral RNA}]) = \beta_0 + \beta_{\text{age}} X_{\text{age}} + \beta_{\text{months}} X_{\text{months}} + \beta_{\text{treat}} X_{\text{treat}}
$$ {#eq-18}
Let's visualize the data so that we can get some idea of what the relationships between our variables are, and what the intercept of the model we're going to fit will tell us. First let's look at the relationship between **age** and the response, splitting by treatment group. We add a purple vertical line showing where the overall mean of **age** is.
```{r}
ggplot(mod_df, aes(x = age, y = y, color = treat)) +
  geom_vline(aes(xintercept = mean(age)), color = "purple", size = 1) +
  geom_point(alpha = 0.8) +
  labs(x = "Age", y = "Viral RNA", color = "Treatment") +
  theme_classic(base_size = 14)
```
We repeat the visualization for **months**.
```{r}
ggplot(mod_df, aes(x = months, y = y, color = treat)) +
  geom_vline(aes(xintercept = mean(months)), color = "purple", size = 1) +
  geom_point(alpha = 0.8) +
  labs(x = "Months Treated", y = "Viral RNA", color = "Treatment") +
  theme_classic(base_size = 14)
```
Lastly, let's compare the distribution of the response between the three treatment groups using a beeswarm plot, a variation on the classic violin plot that I've come to prefer recently. Check out the [`ggbeeswarm` package docs](https://github.com/eclarke/ggbeeswarm) for more. We also add a horizontal line showing $\bar{\mathbf{y}}$ for each group.
```{r}
ggplot(mod_df, aes(x = treat, y = y, color = treat)) +
  ggbeeswarm::geom_quasirandom() +
  stat_summary(fun = "mean",
               geom = "crossbar",
               width = 0.5,
               size = 0.75,
               color = "black") +
  labs(x = "Treatment", y = "Viral RNA", color = "Treatment") +
  theme_classic(base_size = 14)
```
Let's fit a model. As previously noted, we use the log link function. Since our treatment variable has three categories, we now have two treatment coefficients, one for each of the non-reference levels. The intercept $\beta_0$ gives us the log of the expected value of the response for a patient that is 0 years old, has spent 0 months in treatment, and is assigned to be treated with **Drug_A**. We can tell right away that this isn't the most useful interpretation, as there are no patients in our study who are zero years old (and even if there were, it probably wouldn't be that useful to know that quantity).
```{r}
full_mod_pois <- glm(y ~ age + months + treat,
                     data = mod_df,
                     family = poisson(link = "log"))
coef(full_mod_pois)
```
Since we're using the log-link, exponentiating the intercept gives us the expected value of $\mathbf{y}$ under the conditions we described just above.
```{r}
exp(coef(full_mod_pois))[1]
```
Clearly, the **age** variable is a prime target for centering. While it could be useful to center **months** (depending on the goals of the study), there are a total of `r sum(mod_df$months == 0)` patients in the study who have spent zero months in treatment, so it's at least somewhat anchored in reality. We center **age**, then refit the model. We see that the estimate of $\beta_0$ changes, while none of the coefficients for our predictors do (as expected).
```{r}
mod_df <- mutate(mod_df,
                 age_cent = scale(age, scale = FALSE))
full_mod_pois_centered <- glm(y ~ age_cent + months + treat,
                              data = mod_df,
                              family = poisson(link = "log"))
coef(full_mod_pois_centered)
```
Exponentiating $\beta_0$ now gives us the expected value of $\mathbf{y}$ for a patient of average age (which is `r mean(mod_df$age)` years) who was assigned treatment **Drug_A** and has spent zero months being treated. This quantity is useful because it tells us the baseline pre-treatment value for the average patient who takes **Drug_A**.
```{r}
exp(coef(full_mod_pois_centered))[1]
```
# Conclusions {#sec-conclusions}
While regression intercepts can seem tricky at first and don't always carry a useful interpretation by default, I think they're more useful than introductory & intermediate statistics courses give them credit for. I understand those courses' emphasis on interpreting estimates for covariates of interest, of course, as those are what generally determine the success or failure of our scientific hypotheses, but the intercept can often help establish a useful baseline value against which you can compare your other estimates.
There are a couple of important takeaways I'd like to re-emphasize. **First** - always make sure you know the reference group of your categorical variables, and set it to something more useful if you need to! The alphabetical order used by R is often not quite what you want. **Second** - keep centering (or even pseudo-centering around another useful value) in mind & use it often. There are many situations outside of patient age where a value of 0 makes little sense in the context of your study. **Lastly**, when using GLMs, make sure you know your link function, its inverse, and how to interpret your coefficients on each scale.
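To illustrate that second point, here's a minimal sketch of pseudo-centering using the Poisson GLM data from the last section. Centering **age** around 40 years (an arbitrary reference value chosen purely for illustration) makes the intercept describe a 40-year-old patient on **Drug_A** with zero months of treatment, while leaving the other coefficients untouched:
```{r}
# center age around a chosen reference value (40 years) instead of the mean
mod_df <- mutate(mod_df, age_40 = age - 40)
full_mod_pois_40 <- glm(y ~ age_40 + months + treat,
                        data = mod_df,
                        family = poisson(link = "log"))
exp(coef(full_mod_pois_40)[1]) # expected viral RNA for a 40-year-old on Drug_A, pre-treatment
```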
Finally, I've included some references below that have been helpful to me, and which I would recommend as further reading if you're interested in the technical details of regression & other modeling methods.
# References {#sec-refs}
1. Faraway, J. [Extending the Linear Model with R, 2nd Edition](https://doi.org/10.1201/9781315382722). *Chapman and Hall*. 2016.
2. Hastie, T. *et al*. [The Elements of Statistical Learning, 2nd Edition](https://doi.org/10.1007/b94608). *Springer*. 2009.
3. Wakefield, J. [Bayesian and Frequentist Regression Methods](https://doi.org/10.1007/978-1-4419-0925-1). *Springer*. 2016.
4. Klein, P. N. [Coding the Matrix: Linear Algebra Through Computer Science Applications, 1st Edition](http://codingthematrix.com). *Newtonian Press*. 2013.
5. Strang, G. [Introduction to Linear Algebra, 5th Edition](https://math.mit.edu/~gs/linearalgebra/ila5/indexila5.html). *Wellesley-Cambridge Press*. 2016.
# Session info {#sec-SI}
```{r}
sessioninfo::session_info()
```