---
title: "Benchmarking Negative Binomial GEE Model Backends"
author: 
  name: Jack Leary
  email: j.leary@ufl.edu
  orcid: 0009-0004-8821-3269
  affiliations:
    - name: University of Florida
      department: Biostatistics
      city: Gainesville
      state: FL
date: today
date-format: long
format: 
  html:
    code-fold: show
    code-copy: true
    code-tools: true
    toc: true
    embed-resources: true
    fig-format: retina
    df-print: kable
    link-external-newwindow: true
execute: 
  cache: true
  freeze: auto
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE, comment = NA)
```
# Introduction
In the course of my recent work on [trajectory differential expression](https://jr-leary7.github.io/quarto-site/tutorials/scLANE_Trajectory_DE.html), I've spent a fair amount of time fitting [generalized estimating equation](https://online.stat.psu.edu/stat504/lesson/12/12.1) (GEE) models. These models are like classical GLMs in that they can handle non-normally distributed response variables via some specified link function & iterative estimation of coefficients, but they differ in that they can account for the variation inherent to longitudinal or otherwise clustered datasets, in which observations are measured repeatedly from multiple subjects & are thus not independent. Since scRNA-seq experiments are almost always composed of samples from multiple subjects nowadays, GEEs can be of great use when building models of gene expression, as they allow us to be more confident that our standard errors are accurate. Note that, unlike (generalized) linear mixed models, the GEE is a *marginal* model and thus provides only a population-level fit, whereas GLMMs are conditional models that provide subject-specific fits.
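To make the marginal vs. conditional distinction concrete, here's a rough sketch of the two mean models for observation $j$ from subject $i$ (notation mine, not tied to any particular package):

$$
\begin{aligned}
\text{GEE (marginal):} \quad g\left(\mathbb{E}[Y_{ij}]\right) &= \mathbf{x}_{ij}^T\boldsymbol{\beta} \\
\text{GLMM (conditional):} \quad g\left(\mathbb{E}[Y_{ij} \mid \mathbf{b}_i]\right) &= \mathbf{x}_{ij}^T\boldsymbol{\beta} + \mathbf{z}_{ij}^T\mathbf{b}_i, \quad \mathbf{b}_i \sim \mathcal{N}(\mathbf{0},\: \boldsymbol{\Sigma})
\end{aligned}
$$

The GEE handles within-subject dependence through a working correlation structure rather than through subject-level random effects, which is why its coefficients have a population-averaged interpretation.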
Since [single cell mRNA abundance follows the negative binomial distribution](https://doi.org/10.1186/s13059-021-02584-9), it's necessary that whatever GEE fitting algorithm we use support that distribution. Since the negative binomial is a two-parameter distribution, this does complicate things; the most commonly-used R packages for GEEs do not support NB models. We do have a couple of options in R though; the Python `statsmodels` ecosystem supports NB GEEs as well, and there's a Julia package too. Our goal here will be to benchmark the available options on runtime, and on model accuracy as measured by the fitted values & the estimated parameters. We'll keep our simulated data simple, with a negative binomial response $Y$ and a single continuous covariate $X$; thus we'll estimate the coefficient vector $\boldsymbol{\beta} = [\beta_0, \beta_1]^T$ using each method for each set of simulations.
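Concretely, the complication comes from that second parameter. As a reminder (standard notation, not specific to any one package), under the mean & dispersion parameterization we'll use throughout, the dispersion $\theta$ controls how much extra-Poisson variation the counts exhibit:

$$
\begin{aligned}
Y \sim \text{NB}(\mu,\: \theta): \quad \mathbb{E}[Y] = \mu, \quad \text{Var}(Y) = \mu + \frac{\mu^2}{\theta}
\end{aligned}
$$

Smaller values of $\theta$ imply more overdispersion, and as $\theta \to \infty$ the distribution approaches the Poisson.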
# Libraries
## R
```{r, results='hide'}
library(dplyr) # data manipulation
library(ggplot2) # plots
library(JuliaCall) # call Julia from R
library(reticulate) # call Python from R
```
## Python
We'll call the Python code from R using `reticulate`, but we'll need to make sure to use the virtual environment I set up previously that has the `statsmodels` and `pandas` libraries (and their various dependencies) installed.
```{r, results='hide'}
use_virtualenv("~/Desktop/Python/science/venv/", required = TRUE)
sm <- import("statsmodels.api")
pd <- import("pandas")
```
## Julia
We'll also need to activate a Julia environment, into which I've installed the [`GEE.jl` package](https://github.com/kshedden/GEE.jl/tree/main) and its necessary dependencies.
```{r, results='hide'}
julia_setup(verbose = FALSE)
julia_command('using Pkg; Pkg.activate("/Users/jack/Desktop/Julia/science/");')
julia_command("using Distributions, GLM, GEE, DataFrames;")
```
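Before moving on, a quick (optional) sanity check that both foreign backends are actually reachable from this R session; this is a defensive sketch I'm adding here, not something the later code depends on.

```{r}
# verify the Python & Julia backends are available
py_module_available("statsmodels") # TRUE if the virtualenv is set up correctly
julia_exists("gee")                # TRUE if GEE.jl loaded & exported gee()
```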
# Data
Let's first simulate an example dataset to show what each of the later simulations will look like. Our parameters are generated from the following distributions (note that we use the mean & dispersion parameterization of the negative binomial):
$$
\begin{aligned}
\boldsymbol{\beta} &\sim \mathcal{N}(\mu = 0,\: \sigma^2 = 0.25) \\
X &\sim \mathcal{N}(\mu = 0,\: \sigma^2 = 1) \\
Y &\sim \text{NB}(\mu = e^{\beta_0 + X\beta_1},\: \theta = 3)
\end{aligned}
$$
We'll generate $n = 25$ samples to start, though our actual sample size in the simulations will be larger than this. We also apply uniformly-sampled dropout at a rate of 10% to our response $Y$ to mimic the technical variation inherent to scRNA-seq experiments. Lastly, I'll note that we're making this example slightly less complex by not including multiple observations per subject, which we will do in our actual experiment.
```{r}
set.seed(312)
sim_beta <- rnorm(2, mean = 0, sd = 0.5)
sim_df <- data.frame(X = rnorm(25, mean = 0, sd = 1)) %>%
mutate(Beta_X = sim_beta[1] + sim_beta[2] * X,
Exp_Beta_X = exp(Beta_X)) %>%
rowwise() %>%
mutate(Y = rnbinom(1, mu = Exp_Beta_X, size = 3)) %>%
ungroup()
dropout_idx <- sample(seq(nrow(sim_df)), size = round(0.1 * nrow(sim_df)))
sim_df$Y[dropout_idx] <- 0
```
Next, since we're dealing with independent observations, we fit an NB GLM via the `MASS` package, which we'll then use to generate fitted values and standard errors.
```{r}
nb_glm <- MASS::glm.nb(Y ~ X,
data = sim_df,
link = "log")
```
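As a quick check (purely illustrative), we can compare the fitted coefficients against the true values of $\boldsymbol{\beta}$ we simulated above; they should be reasonably close given only $n = 25$ observations with dropout.

```{r}
# compare estimated coefficients to the true simulated values
data.frame(True_Beta = sim_beta,
           Est_Beta = unname(coef(nb_glm)))
```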
Let's visualize the results. We see that the confidence interval around the fitted mean widens as $X$ and $Y$ increase. This data looks similar enough to what I see in real scRNA-seq studies, so I'm satisfied with using the above simulation scheme for a more rigorous experiment.
```{r}
sim_df %>%
mutate(glm_pred = predict(nb_glm),
glm_pred_se = predict(nb_glm, se.fit = TRUE)$se.fit,
glm_fit = exp(glm_pred),
glm_fit_ci_lo = exp(glm_pred - 1.96 * glm_pred_se),
glm_fit_ci_hi = exp(glm_pred + 1.96 * glm_pred_se)) %>%
ggplot(aes(x = X, y = Y)) +
geom_point(size = 3) +
geom_line(aes(x = X, y = glm_fit),
color = "forestgreen",
size = 1) +
geom_ribbon(aes(x = X, ymin = glm_fit_ci_lo, ymax = glm_fit_ci_hi, color = NULL),
fill = "forestgreen",
alpha = 0.4) +
labs(x = "X",
y = "Y",
title = "Simulated Negative Binomial mRNA Counts",
subtitle = "Fitted values from NB GLM with Wald 95% C.I.") +
theme_classic(base_size = 14) +
theme(plot.subtitle = element_text(face = "italic", size = 12))
```
# Simulations
## Helper Functions
Using the same general simulation procedure from the example above, we'll write a function to simulate negative binomial data suitable for fitting GEEs. We'll generate multiple observations per subject, though the resulting data will be in the same format, albeit with an added subject ID column. We'll keep things simple by allocating our total sample size evenly across subjects & keeping $\boldsymbol{\beta}$ constant across all subjects.
```{r}
sim_gee_data <- function(n.subjects = NULL,
n.per.subject = NULL,
dropout.rate = 0.1,
theta.y = 3) {
# simulate true coefficients
sim_beta <- rnorm(2, mean = 0, sd = 0.5)
# simulate X & Y per subject
subject_names <- paste0("S", seq(n.subjects))
subject_sims <- purrr::map(subject_names, function(s) {
sim_df <- data.frame(ID = s,
X = rnorm(n.per.subject, mean = 0, sd = 1)) %>%
mutate(Beta_X = sim_beta[1] + sim_beta[2] * X,
Exp_Beta_X = exp(Beta_X)) %>%
rowwise() %>%
mutate(Y = rnbinom(1, mu = Exp_Beta_X, size = theta.y)) %>%
ungroup()
})
subject_sim_df <- purrr::reduce(subject_sims, rbind) %>%
mutate(ID = as.factor(ID))
# add stochastic dropout
dropout_idx <- sample(seq(nrow(subject_sim_df)), size = round(dropout.rate * nrow(subject_sim_df)))
subject_sim_df$Y[dropout_idx] <- 0
res_list <- list(beta = sim_beta,
sim_df = subject_sim_df)
return(res_list)
}
```
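As a small usage example (illustrative only, not part of the experiment), a call with 2 subjects and 5 observations each returns the true coefficient vector along with the long-format data:

```{r}
example_sim <- sim_gee_data(n.subjects = 2,
                            n.per.subject = 5,
                            dropout.rate = 0.1,
                            theta.y = 3)
example_sim$beta
example_sim$sim_df
```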
Next, we'll write a function to run all of our different model types. If we use the R implementations of the GEE framework, we need to provide an estimate of $\theta$, since the estimation procedure depends on $\theta$ being "known". To approximate this, we provide the value of $\theta$ estimated via the method of moments under an intercept-only model, which in my experience has been a good enough approximation for simpler datasets like this one.
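For instance, on the example dataset from above, the method-of-moments estimate of $\theta$ under a constant-mean model looks like this (a standalone sketch of the estimator used inside the function below, with arguments passed positionally as response, mean, & residual degrees of freedom):

```{r}
# method-of-moments estimate of theta assuming a constant (intercept-only) mean
MASS::theta.mm(sim_df$Y, mean(sim_df$Y), nrow(sim_df) - 1)
```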
```{r, results='hide'}
run_model <- function(sim.data = NULL, model.type = NULL) {
# run correct model framework
if (model.type == "NB GLM") {
start_time <- Sys.time()
model_fit <- MASS::glm.nb(Y ~ X,
data = sim.data,
link = "log")
diff_time <- Sys.time() - start_time
model_preds <- predict(model_fit, type = "link")
model_coef <- coef(model_fit)
model_est_alpha <- NA_real_
} else if (model.type == "geeM") {
    # estimate theta before starting the timer, as in the mmmgee branch below
    theta_hat <- MASS::theta.mm(y = sim.data$Y,
                                mu = mean(sim.data$Y),
                                dfr = nrow(sim.data) - 1)
    start_time <- Sys.time()
model_fit <- geeM::geem(Y ~ X,
id = sim.data$ID,
data = sim.data,
family = MASS::negative.binomial(theta = theta_hat, link = "log"),
corstr = "exchangeable",
sandwich = TRUE)
diff_time <- Sys.time() - start_time
model_preds <- predict(model_fit, type = "link")
model_coef <- coef(model_fit)
model_est_alpha <- model_fit$alpha
} else if (model.type == "mmmgee") {
theta_hat <- MASS::theta.mm(y = sim.data$Y,
mu = mean(sim.data$Y),
dfr = nrow(sim.data) - 1)
start_time <- Sys.time()
model_fit <- mmmgee::geem2(Y ~ X,
id = sim.data$ID,
data = sim.data,
family = MASS::negative.binomial(theta = theta_hat, link = "log"),
corstr = "exchangeable",
sandwich = TRUE)
diff_time <- Sys.time() - start_time
model_preds <- predict(model_fit, type = "link")
model_coef <- coef(model_fit)
model_est_alpha <- model_fit$alpha
} else if (model.type == "statsmodels") {
start_time <- Sys.time()
nb_family <- sm$families$NegativeBinomial(link = sm$genmod$families$links$log)
cor_structure <- sm$cov_struct$Exchangeable()
py_gee <- sm$GEE$from_formula("Y ~ X",
"ID",
data = sim.data,
cov_struct = cor_structure,
family = nb_family)
model_fit <- py_gee$fit()
diff_time <- Sys.time() - start_time
model_preds <- model_fit$get_prediction()$linpred$predicted_mean
model_coef <- model_fit$params
model_est_alpha <- model_fit$cov_struct$dep_params
} else if (model.type == "GEE.jl") {
start_time <- Sys.time()
JuliaCall::julia_assign("sim_data", sim.data)
JuliaCall::julia_command('model_fit = gee(@formula(Y ~ X), sim_data, sim_data.ID, NegativeBinomial(), ExchangeableCor(), LogLink(), cov_type="robust");')
diff_time <- Sys.time() - start_time
model_est_alpha <- JuliaCall::julia_eval("GEE.corparams(model_fit)")
model_coef <- JuliaCall::julia_eval("GEE.coef(model_fit)")
model_preds <- log(JuliaCall::julia_eval("GEE.predict(model_fit)")) # GEE.predict() gives fitted values
}
# format results & compute model error
names(model_coef) <- c("B0", "B1")
model_rmse <- yardstick::rmse_vec(truth = sim.data$Y,
estimate = exp(model_preds))
model_huber_loss <- yardstick::huber_loss_vec(truth = sim.data$Y,
estimate = exp(model_preds),
delta = 2)
res_list <- list(model_type = model.type,
model_runtime = as.numeric(diff_time),
model_runtime_units = attributes(diff_time)$units,
model_rmse = model_rmse,
model_huber_loss = model_huber_loss,
model_beta = model_coef,
model_alpha = model_est_alpha)
return(res_list)
}
```
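As a quick smoke test of the wrapper (illustrative; the full experiment follows below), we can fit a single `geeM` model on one freshly simulated dataset and inspect the returned list:

```{r}
smoke_test <- sim_gee_data(n.subjects = 3,
                           n.per.subject = 100,
                           dropout.rate = 0.1,
                           theta.y = 3)
run_model(sim.data = smoke_test$sim_df, model.type = "geeM")
```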
## Running the Experiment
We fit each model over 100 iterations, and save the results in a list (`purrr` is great for this kind of thing).
```{r}
sim_list <- purrr::map(seq(100), function(i) {
set.seed(i)
sim_data <- sim_gee_data(n.subjects = 3,
n.per.subject = 400,
dropout.rate = 0.1,
theta.y = 5)
glm_res <- run_model(sim.data = sim_data$sim_df, model.type = "NB GLM")
geeM_res <- run_model(sim.data = sim_data$sim_df, model.type = "geeM")
mmmgee_res <- run_model(sim.data = sim_data$sim_df, model.type = "mmmgee")
statsmodels_res <- run_model(sim.data = sim_data$sim_df, model.type = "statsmodels")
julia_GEE_res <- run_model(sim.data = sim_data$sim_df, model.type = "GEE.jl")
all_models <- list(glm_res,
geeM_res,
mmmgee_res,
statsmodels_res,
julia_GEE_res)
overall_res_df <- data.frame(Iter = i,
Beta_0 = sim_data$beta[1],
Beta_1 = sim_data$beta[2],
Method = purrr::map_chr(all_models, \(x) x$model_type),
Est_Beta_0 = purrr::map_dbl(all_models, \(x) x$model_beta[1]),
Est_Beta_1 = purrr::map_dbl(all_models, \(x) x$model_beta[2]),
Est_Alpha = purrr::map_dbl(all_models, \(x) x$model_alpha),
RMSE = purrr::map_dbl(all_models, \(x) x$model_rmse),
Huber_Loss = purrr::map_dbl(all_models, \(x) x$model_huber_loss),
Runtime = purrr::map_dbl(all_models, \(x) x$model_runtime),
Runtime_Units = purrr::map_chr(all_models, \(x) x$model_runtime_units))
return(overall_res_df)
})
```
Let's coerce the results to a dataframe & add a couple more features.
```{r}
sim_res_df <- purrr::reduce(sim_list, rbind) %>%
mutate(Abs_Error_Beta0 = abs(Beta_0 - Est_Beta_0),
Abs_Error_Beta1 = abs(Beta_1 - Est_Beta_1),
Runtime_Seconds = if_else(Runtime_Units == "secs", Runtime, Runtime * 60))
```
# Analysis
Now that we have everything simulated & recorded, we can analyze the data. Let's start by just looking at the distribution of runtime per method. While we expected the NB GLM to be quicker, I'm pretty surprised by just how much faster the `statsmodels` GEEs were than either R implementation. The `GEE.jl` implementation is very fast too, but we can see a known issue with Julia: [the first execution takes way longer than anything else](https://julialang.org/blog/2021/01/precompile_tutorial/).
```{r}
ggplot(sim_res_df, aes(x = Method, y = Runtime_Seconds, fill = Method)) +
geom_violin(scale = "width",
draw_quantiles = 0.5,
size = 1) +
scale_fill_manual(values = paletteer::paletteer_d("ggsci::category10_d3")) +
labs(y = "Runtime (seconds)", fill = "Method") +
theme_classic(base_size = 14) +
theme(axis.title.x = element_blank()) +
guides(fill = guide_legend(override.aes = list(color = NULL)))
```
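To put a rough number on that compilation overhead (a quick illustrative summary using the columns we already have), we can compare the first iteration's `GEE.jl` runtime to the median of the remaining iterations:

```{r}
# quantify the first-fit compilation cost for GEE.jl
sim_res_df %>%
  filter(Method == "GEE.jl") %>%
  summarise(First_Run_Secs = Runtime_Seconds[Iter == min(Iter)],
            Median_Other_Secs = median(Runtime_Seconds[Iter > min(Iter)]))
```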
Moving on to model accuracy, let's look at the RMSE & Huber loss. All methods seem to have essentially the same distribution.
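As a reminder (standard definitions, with the per-observation Huber loss averaged over observations as `yardstick` does, and using the $\delta = 2$ we pass to `yardstick::huber_loss_vec()`), for observed counts $y_i$ and fitted values $\hat{y}_i$:

$$
\begin{aligned}
\text{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^n \left(y_i - \hat{y}_i\right)^2} \\
L_\delta\left(y_i,\: \hat{y}_i\right) &= \begin{cases} \frac{1}{2}\left(y_i - \hat{y}_i\right)^2 & \text{if } |y_i - \hat{y}_i| \leq \delta \\ \delta\left(|y_i - \hat{y}_i| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}
\end{aligned}
$$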
```{r}
sim_res_df %>%
select(Iter, Method, RMSE, Huber_Loss) %>%
tidyr::pivot_longer(cols = c(RMSE, Huber_Loss), names_to = "Metric", values_to = "Metric_Value") %>%
mutate(Metric = if_else(Metric == "RMSE", "RMSE", "Huber")) %>%
ggplot(aes(x = Method, y = Metric_Value, fill = Method)) +
facet_wrap(~Metric,
nrow = 2,
scales = "free_y") +
geom_violin(scale = "width",
draw_quantiles = 0.5,
size = 1) +
scale_fill_manual(values = paletteer::paletteer_d("ggsci::category10_d3")) +
labs(y = "Loss", fill = "Method") +
theme_classic(base_size = 14) +
theme(axis.title.x = element_blank()) +
guides(fill = guide_legend(override.aes = list(color = NULL)))
```
How about the estimated intercepts? It again seems that all methods have essentially the same error distribution.
```{r}
ggplot(sim_res_df, aes(x = Method, y = Abs_Error_Beta0, fill = Method)) +
geom_violin(scale = "width",
draw_quantiles = 0.5,
size = 1) +
scale_fill_manual(values = paletteer::paletteer_d("ggsci::category10_d3")) +
labs(y = latex2exp::TeX(r"(Absolute Error for $\beta_0$)"), fill = "Method") +
theme_classic(base_size = 14) +
theme(axis.title.x = element_blank()) +
guides(fill = guide_legend(override.aes = list(color = NULL)))
```
As with the estimated values for $\beta_0$, there is no clear difference between methods in how well they estimate $\beta_1$.
```{r}
ggplot(sim_res_df, aes(x = Method, y = Abs_Error_Beta1, fill = Method)) +
geom_violin(scale = "width",
draw_quantiles = 0.5,
size = 1) +
scale_fill_manual(values = paletteer::paletteer_d("ggsci::category10_d3")) +
labs(y = latex2exp::TeX(r"(Absolute Error for $\beta_1$)"), fill = "Method") +
theme_classic(base_size = 14) +
theme(axis.title.x = element_blank()) +
guides(fill = guide_legend(override.aes = list(color = NULL)))
```
Lastly, let's check whether any of the GEE methods differs meaningfully from the others in terms of the estimated within-subject correlation parameter $\alpha$.
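Recall that every GEE here uses an exchangeable working correlation structure, under which $\alpha$ is a single common within-subject correlation (sketching the general form, with $i$ indexing subjects):

$$
\text{Corr}\left(Y_{ij},\: Y_{ik}\right) = \alpha \quad \text{for all } j \neq k
$$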
```{r}
sim_res_df %>%
filter(!is.na(Est_Alpha)) %>%
ggplot(aes(x = Est_Alpha, fill = Method, color = Method)) +
facet_wrap(~Method, scales = "free_y") +
geom_density(alpha = 0.6, size = 1) +
scale_color_manual(values = paletteer::paletteer_d("ggsci::category10_d3")[-4]) +
scale_fill_manual(values = paletteer::paletteer_d("ggsci::category10_d3")[-4]) +
labs(x = latex2exp::TeX(r"(Estimated Correlation Parameter $\alpha$)"),
y = "Density",
fill = "Method") +
theme_classic(base_size = 14) +
guides(fill = guide_legend(override.aes = list(alpha = 1, color = NULL)))
```
# Conclusions
Since the estimated coefficients and correlation parameters seem to be pretty much the same between methods, and the RMSE / Huber loss don't differ much either, it would make sense to choose the modeling framework with the fastest runtime as the best option. For GEEs this is either `statsmodels` or `GEE.jl`, both of which outperformed the R-based `geeM` and `mmmgee` quite handily (mean and median values are shown below for each method). The Julia implementation is fast too, if slower than `statsmodels` on average, and the time it takes to perform the first fit is somewhat annoying. In addition, the `statsmodels` implementation varies less in runtime than all the other options. This would seem to be an easy decision, except that it might be difficult to release & support R packages that rely on a Python library rather than a native R backend, since not all R users know Python, have it installed, or are capable of debugging its output. `GEE.jl` carries the same drawbacks, with the additional caveat that Julia is a much less widely-used language than either R or Python. This is a tricky decision & I'm not sure of the right course to take yet, but I am impressed with how well `statsmodels` performs and will absolutely be considering using it more in the future.
```{r}
sim_res_df %>%
with_groups(Method,
summarise,
mu = mean(Runtime_Seconds),
med = median(Runtime_Seconds),
sd = sd(Runtime_Seconds)) %>%
arrange(mu) %>%
kableExtra::kbl(digits = 3,
booktabs = TRUE,
caption = "Runtime measured in seconds",
col.names = c("Method", "Mean Runtime", "Median Runtime", "S.D. Runtime")) %>%
kableExtra::kable_classic(full_width = FALSE)
```
# Session Info
```{r}
sessioninfo::session_info()
```