Code
::install_github("jr-leary7/scLANE") remotes
scLANE
April 16, 2024
In this tutorial we’ll walk through a basic trajectory differential expression analysis. We’ll use the scLANE
R package, which we developed with the goal of providing accurate and biologically interpretable models of expression over pseudotime. At the end are some best-practices recommendations, along with a short list of references we used in developing the method & writing the accompanying manuscript. If you want to skip the preprocessing steps and get right into the analysis, head to Section 5.
If you haven’t already, install the development version (currently v0.7.9) of scLANE
from the GitHub repository.
Next, we’ll load the packages we need to process, analyze, & visualize our data.
library(dplyr) # data manipulation
library(scLANE) # trajectory DE
library(Seurat) # scRNA-seq tools
library(ggplot2) # plot creation
library(patchwork) # plot combination
library(slingshot) # pseudotime estimation
library(reticulate) # Python interface
library(ComplexHeatmap) # heatmaps
rename <- dplyr::rename
We’ll also define a utility function to make our plots cleaner to read & easier to make.
And consistent color palettes will make our plots easier to understand.
We’ll load the well-known pancreatic endocrinogenesis data from Bastidas-Ponce et al (2019), which comes with the scVelo
Python library & has been used in several pseudotime inference / RNA velocity method papers as a good benchmark dataset due to the simplicity of the underlying trajectory manifold.
The AnnData
object contains data that we’ll need to extract, specifically the spliced & unspliced mRNA counts matrices (stored in AnnData.layers
) and the cell-level metadata (which is in AnnData.obs
).
AnnData object with n_obs × n_vars = 3696 × 27998
obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score'
var: 'highly_variable_genes'
uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
layers: 'spliced', 'unspliced'
obsp: 'distances', 'connectivities'
The reticulate
package allows us to pass the counts matrices & metadata from Python back to R.
While downloading this dataset requires a Python installation as well as the installation of the scVelo
Python library (and its dependencies), running scLANE
is done purely in R & requires no Python whatsoever.
We’ll use the spliced mRNA counts as our default assay, and also define a new assay containing the total (spliced + unspliced) mRNA in each cell. Lastly, we remove genes with non-zero spliced mRNA in 3 or fewer cells.
spliced_counts <- Matrix::Matrix(t(py$spliced_counts), sparse = TRUE)
unspliced_counts <- Matrix::Matrix(t(py$unspliced_counts), sparse = TRUE)
rna_counts <- spliced_counts + unspliced_counts
colnames(rna_counts) <- colnames(spliced_counts) <- colnames(unspliced_counts) <- py$adata$obs_names$to_list()
rownames(rna_counts) <- rownames(spliced_counts) <- rownames(unspliced_counts) <- py$adata$var_names$to_list()
spliced_assay <- CreateAssayObject(counts = spliced_counts)
spliced_assay@key <- "spliced_"
unspliced_assay <- CreateAssayObject(counts = unspliced_counts)
unspliced_assay@key <- "unspliced_"
rna_assay <- CreateAssayObject(counts = rna_counts)
rna_assay@key <- "rna_"
meta_data <- py$adata$obs %>%
mutate(cell_name = rownames(.),
.before = 1) %>%
rename(celltype = clusters,
celltype_coarse = clusters_coarse) %>%
mutate(nCount_spliced = colSums(spliced_counts),
nFeature_spliced = colSums(spliced_counts > 0),
nCount_unspliced = colSums(unspliced_counts),
nFeature_unspliced = colSums(unspliced_counts > 0),
nCount_RNA = colSums(rna_counts),
nFeature_RNA = colSums(rna_counts > 0))
seu <- CreateSeuratObject(counts = spliced_assay,
assay = "spliced",
project = "Mm_Panc_Endo",
meta.data = meta_data)
seu@assays$unspliced <- unspliced_assay
seu@assays$RNA <- rna_assay
seu <- seu[rowSums(seu@assays$spliced) > 3, ]
We preprocess the counts using a typical pipeline with QC, normalization & scaling, dimension reduction, and graph-based clustering via the Leiden algorithm.
seu <- PercentageFeatureSet(seu,
pattern = "^mt-",
col.name = "percent_mito",
assay = "spliced") %>%
PercentageFeatureSet(pattern = "^Rp[sl]",
col.name = "percent_ribo",
assay = "spliced") %>%
NormalizeData(assay = "spliced", verbose = FALSE) %>%
NormalizeData(assay = "unspliced", verbose = FALSE) %>%
NormalizeData(assay = "RNA", verbose = FALSE) %>%
FindVariableFeatures(assay = "spliced",
nfeatures = 3000,
verbose = FALSE) %>%
ScaleData(assay = "spliced",
vars.to.regress = c("percent_mito", "percent_ribo"),
model.use = "poisson",
verbose = FALSE) %>%
RunPCA(assay = "spliced",
npcs = 30,
approx = TRUE,
seed.use = 312,
verbose = FALSE) %>%
RunUMAP(reduction = "pca",
dims = 1:30,
n.components = 2,
metric = "cosine",
seed.use = 312,
verbose = FALSE) %>%
FindNeighbors(reduction = "pca",
k.param = 30,
nn.method = "annoy",
annoy.metric = "cosine",
verbose = FALSE) %>%
FindClusters(algorithm = 4,
method = "igraph",
resolution = 0.5,
random.seed = 312,
verbose = FALSE)
Let’s visualize the results on our UMAP embedding. The clustering generally agrees with the celltype labels, though there is some overclustering in the ductal cells & underclustering in the mature endocrine celltypes.
p0 <- Embeddings(seu, "umap") %>%
as.data.frame() %>%
magrittr::set_colnames(c("UMAP_1", "UMAP_2")) %>%
mutate(leiden = seu$seurat_clusters) %>%
ggplot(aes(x = UMAP_1, y = UMAP_2, color = leiden)) +
geom_point(size = 1.5,
alpha = 0.75,
stroke = 0) +
scale_color_manual(values = palette_cluster) +
labs(x = "UMAP 1",
y = "UMAP 2",
color = "Leiden") +
theme_scLANE(umap = TRUE) +
guide_umap()
p1 <- Embeddings(seu, "umap") %>%
as.data.frame() %>%
magrittr::set_colnames(c("UMAP_1", "UMAP_2")) %>%
mutate(celltype = seu$celltype) %>%
ggplot(aes(x = UMAP_1, y = UMAP_2, color = celltype)) +
geom_point(size = 1.5,
alpha = 0.75,
stroke = 0) +
scale_color_manual(values = palette_celltype) +
labs(x = "UMAP 1",
y = "UMAP 2",
color = "Celltype") +
theme_scLANE(umap = TRUE) +
guide_umap()
p2 <- (p0 / p1) +
plot_layout(guides = "collect", axes = "collect")
p2
We’ll start by fitting a trajectory using the slingshot
R package. We define cluster 5 as the starting cluster, since in this case we’re already aware of the dataset’s underlying biology. After generating the estimates for each cell, we rescale the ordering to be defined on \([0, 1]\). This has no effect on the trajectory DE results however, and is mostly an aesthetic choice.
sling_res <- slingshot(Embeddings(seu, "umap"),
start.clus = "5",
clusterLabels = seu$seurat_clusters,
approx_points = 500)
sling_pt <- slingPseudotime(sling_res) %>%
as.data.frame() %>%
magrittr::set_colnames(c("PT")) %>%
mutate(PT = (PT - min(PT)) / (max(PT) - min(PT)))
seu <- AddMetaData(seu,
metadata = sling_pt,
col.name = "sling_pt")
Let’s visualize the results on our UMAP embedding. They match what we would expect (knowing the biological background of the data), with ductal cells at the start of the process and endocrine celltypes such as alpha, beta, & delta cells at the end of it.
p3 <- Embeddings(seu, "umap") %>%
as.data.frame() %>%
magrittr::set_colnames(c("UMAP_1", "UMAP_2")) %>%
mutate(PT = sling_pt$PT) %>%
ggplot(aes(x = UMAP_1, y = UMAP_2, color = PT)) +
geom_point(size = 1.5,
alpha = 0.75,
stroke = 0) +
labs(x = "UMAP 1",
y = "UMAP 2",
color = "Pseudotime") +
scale_color_gradientn(colors = palette_heatmap,
labels = scales::label_number(accuracy = 0.01)) +
theme_scLANE(umap = TRUE)
p4 <- (p3 / p1) +
plot_layout(guides = "collect", axes = "collect")
p4
Next, we prepare the primary inputs to scLANE
: our Seurat
object with the spliced counts set as the default assay, a dataframe containing our estimated pseudotime ordering, a vector of size factors to use as an offset for sequencing depth in each model, and a set of genes whose dynamics we want to model. scLANE
parallelizes over genes in order to speed up the computation at the expense of using a little more memory. The models are fit using NB GLMs with optimal spline knots identified empirically, and differential expression is quantified using a likelihood ratio test of the fitted model vs. a constant (intercept-only) model. In practice, genes designated as HVGs are usually the best candidates for modeling, so we choose the top 3,000 HVGs as our input.
The testing of the HVG set on its own is also justified by the reality that almost all trajectories are inferred using some sort of dimension-reduced space, and those embeddings are nearly universally generated using a set of HVGs. As such, genes not included in the HVG set actually have no direct relationship with the estimated trajectory, & it’s generally safe to exclude them from trajectory analyses.
Registered S3 method overwritten by 'bit':
method from
print.ri gamlss
scLANE testing completed for 3000 genes across 1 lineage in 22.681 mins
After tidying the TDE results with getResultsDE()
, we pull a sample of 6 genes from the results & display their test statistics. By default, any gene with an adjusted p-value less than 0.01 is predicted to be dynamic, though this threshold can be easily adjusted.
select(scLANE_res_tidy,
Gene,
Test_Stat,
P_Val,
P_Val_Adj,
Gene_Dynamic_Overall) %>%
mutate(Gene_Dynamic_Overall = if_else(Gene_Dynamic_Overall == 1, "Dynamic", "Static")) %>%
with_groups(Gene_Dynamic_Overall,
slice_sample,
n = 3) %>%
kableExtra::kbl(digits = 4,
booktabs = TRUE,
col.names = c("Gene", "LRT stat.", "P-value", "Adj. p-value", "Predicted gene status")) %>%
kableExtra::kable_classic(full_width = FALSE, "hover")
Gene | LRT stat. | P-value | Adj. p-value | Predicted gene status |
---|---|---|---|---|
Prc1 | 1002.5987 | 0.0000 | 0.0000 | Dynamic |
Ddx3y | 34.2841 | 0.0000 | 0.0000 | Dynamic |
Ceacam10 | 31.5036 | 0.0000 | 0.0001 | Dynamic |
Lims2 | 2.4696 | 0.1161 | 1.0000 | Static |
Gm9801 | NA | NA | NA | Static |
Jakmip3 | 15.8731 | 0.0004 | 0.1152 | Static |
Next, we can use the plotModels()
function to visualize the fitted models from scLANE
and compare them to other modeling methods. The gene Neurog3 is strongly associated with epithelial cell differentiation, and indeed we see a clear, nonlinear transcriptional dynamic across pseudotime for that gene. A traditional GLM fails to capture that nonlinearity, and while a GAM fits the trend smoothly, it seems to overfit the dynamics near the boundaries of pseudotime - a known issue with additive models. Only the scLANE
model accurately models the rapid upregulation and equally swift downregulation of the transcription factor neurogenin-3 (Neurog3) over pseudotime, in addition to identifying the cutpoint in pseudotime at which downregulation begins.
Using the getFittedValues()
function allows us to generate predictions from the models we fit, which we then use to visualize the dynamics of a few genes that are known to be strongly associated with the differentiation of immature progenitors into mature endocrine phenotypes (source 1, source 2). For all four genes, the fitted models show knots chosen in the area of pseudotime around the pre-endocrine cells. This tells us that these driver genes are being upregulated in precursor celltypes & are driving differentiation into the mature celltypes such as alpha & beta cells, after which the genes are downregulated.
p6 <- getFittedValues(scLANE_models,
genes = c("Chga", "Chgb", "Fev", "Cck"),
pt = sling_pt,
expr.mat = seu,
size.factor.offset = cell_offset,
cell.meta.data = select(seu@meta.data, celltype, celltype_coarse)) %>%
ggplot(aes(x = pt, y = rna_log1p)) +
facet_wrap(~gene,
ncol = 2,
scales = "free_y") +
geom_point(aes(color = celltype),
size = 2,
alpha = 0.75,
stroke = 0) +
geom_vline(data = data.frame(gene = "Chga", knot = unique(scLANE_models$Chga$Lineage_A$MARGE_Slope_Data$Breakpoint)),
mapping = aes(xintercept = knot),
linetype = "dashed",
color = "grey20") +
geom_vline(data = data.frame(gene = "Chgb", knot = unique(scLANE_models$Chgb$Lineage_A$MARGE_Slope_Data$Breakpoint)),
mapping = aes(xintercept = knot),
linetype = "dashed",
color = "grey20") +
geom_vline(data = data.frame(gene = "Cck", knot = unique(scLANE_models$Cck$Lineage_A$MARGE_Slope_Data$Breakpoint)),
mapping = aes(xintercept = knot),
linetype = "dashed",
color = "grey20") +
geom_vline(data = data.frame(gene = "Fev", knot = unique(scLANE_models$Fev$Lineage_A$MARGE_Slope_Data$Breakpoint)),
mapping = aes(xintercept = knot),
linetype = "dashed",
color = "grey20") +
geom_ribbon(aes(ymin = scLANE_ci_ll_log1p, ymax = scLANE_ci_ul_log1p),
linewidth = 0,
fill = "grey70",
alpha = 0.9) +
geom_line(aes(y = scLANE_pred_log1p),
color = "black",
linewidth = 0.75) +
scale_x_continuous(labels = scales::label_number(accuracy = 0.01)) +
scale_color_manual(values = palette_celltype) +
labs(x = "Pseudotime",
y = "Normalized Expression") +
theme_scLANE() +
theme(legend.title = element_blank(),
strip.text.x = element_text(face = "italic")) +
guide_umap()
p6
On the other hand, if we use additive models the “peak” of expression is placed among the mature endocrine celltypes - which doesn’t make biological sense if we know that these genes are driving that process of differentiation. This can of course be tweaked by changing the degree or degrees of freedom of the underlying basis spline, but choosing a “best” value for those hyperparameters can be difficult, whereas scLANE
identifies optimal parameters internally by default. In addition, the knots chosen by scLANE
for each gene can be informative with respect to the underlying biology, whereas the knots from GAMs are evenly spaced at quantiles & carry no biological interpretation.
p7 <- getFittedValues(scLANE_models,
genes = c("Chga", "Chgb", "Fev", "Cck"),
pt = sling_pt,
expr.mat = seu,
size.factor.offset = cell_offset,
cell.meta.data = select(seu@meta.data, celltype, celltype_coarse)) %>%
mutate(rna_raw = rna / size_factor, .before = 7) %>%
with_groups(gene,
mutate,
GAM_fitted_link = predict(nbGAM(expr = rna_raw,
pt = sling_pt,
Y.offset = cell_offset,
spline.df = 3)),
GAM_se_link = predict(nbGAM(expr = rna_raw,
pt = sling_pt,
Y.offset = cell_offset,
spline.df = 3), se.fit = TRUE)[[2]]) %>%
mutate(GAM_pred = exp(GAM_fitted_link) * cell_offset,
GAM_ci_ll = exp(GAM_fitted_link - qnorm(0.975) * GAM_se_link) * cell_offset,
GAM_ci_ul = exp(GAM_fitted_link + qnorm(0.975) * GAM_se_link) * cell_offset,
GAM_pred_log1p = log1p(GAM_pred),
GAM_ci_ll_log1p = log1p(GAM_ci_ll),
GAM_ci_ul_log1p = log1p(GAM_ci_ul)) %>%
ggplot(aes(x = pt, y = rna_log1p)) +
facet_wrap(~gene,
ncol = 2,
scales = "free_y") +
geom_point(aes(color = celltype),
size = 2,
alpha = 0.75,
stroke = 0) +
geom_ribbon(aes(ymin = GAM_ci_ll_log1p, ymax = GAM_ci_ul_log1p),
linewidth = 0,
fill = "grey70",
alpha = 0.9) +
geom_line(aes(y = GAM_pred_log1p),
color = "black",
linewidth = 0.75) +
scale_x_continuous(labels = scales::label_number(accuracy = 0.01)) +
scale_color_manual(values = palette_celltype) +
labs(x = "Pseudotime",
y = "Normalized Expression") +
theme_scLANE() +
theme(legend.title = element_blank(),
strip.text.x = element_text(face = "italic")) +
guide_umap()
p7
Let’s take a broader view of the dataset by examining the distribution of adaptively chosen knots from our models. We limit the analysis to the set of 2452 genes determined to be dynamic.
We’ll plot a histogram of the knot values along with a ridgeplot of the pseudotime distribution for each celltype. We see that a large number of the selected knots are placed at the beginning of the trajectory, around where the ductal cells transition into endocrine progenitors. A smaller set of knots is placed about halfway through the trajectory, which we’ve annotated as the point at which pre-endocrine cells begin differentiating into mature endocrine phenotypes.
p8 <- ggplot(knot_df, aes(x = knot)) +
geom_histogram(aes(y = after_stat(density)),
color = "black",
fill = "white",
linewidth = 0.5) +
geom_density(fill = "deepskyblue3",
alpha = 0.5,
color = "deepskyblue4",
linewidth = 1) +
scale_x_continuous(limits = c(0, 1), labels = scales::label_number(accuracy = 0.01)) +
labs(x = "Pseudotime") +
theme_scLANE() +
theme(axis.ticks.y = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank())
p9 <- data.frame(celltype = seu$celltype,
pt = seu$sling_pt) %>%
ggplot(aes(x = pt, y = celltype, fill = celltype, color = celltype)) +
ggridges::geom_density_ridges(alpha = 0.75, size = 1, scale = 0.95) +
scale_x_continuous(labels = scales::label_number(accuracy = 0.01), limits = c(0, 1)) +
scale_fill_manual(values = palette_celltype) +
scale_color_manual(values = palette_celltype) +
labs(x = "Pseudotime") +
theme_scLANE() +
theme(axis.title.y = element_blank(),
legend.title = element_blank()) +
guide_umap()
p10 <- (p8 / p9) +
plot_layout(heights = c(1/4, 3/4), axes = "collect")
p10
We can extract a matrix of fitted dynamics using smoothedCountsMatrix()
. Next, the embedGenes()
function reduces dimensionality with PCA, clusters the genes using the Leiden algorithm, & embeds the genes in two dimensions with UMAP.
First we’ll visualize the gene clusters on the first two PCs.
The UMAP embedding shows that even with the relatively small number of genes, clear patterns are visible.
We can also plot a heatmap of the dynamic genes; this requires a bit of setup, for which we’ll use the ComplexHeatmap
package. We scale each gene, and clip values to be on \([-6, 6]\). The columns (cells) of the heatmap are ordered by estimated pseudotime, and the rows (genes) are ordered by expression peak.
col_anno_df <- select(seu@meta.data,
cell_name,
celltype,
sling_pt) %>%
mutate(celltype = as.factor(celltype)) %>%
arrange(sling_pt)
gene_order <- sortGenesHeatmap(smoothed_counts$Lineage_A, pt.vec = sling_pt$PT)
heatmap_mat <- t(scale(smoothed_counts$Lineage_A))
heatmap_mat[heatmap_mat > 6] <- 6
heatmap_mat[heatmap_mat < -6] <- -6
colnames(heatmap_mat) <- seu$cell_name
heatmap_mat <- heatmap_mat[, col_anno_df$cell_name]
heatmap_mat <- heatmap_mat[gene_order, ]
palette_celltype_hm <- as.character(palette_celltype[1:length(unique(seu$celltype))])
names(palette_celltype_hm) <- levels(col_anno_df$celltype)
col_anno <- HeatmapAnnotation(Celltype = col_anno_df$celltype,
Pseudotime = col_anno_df$sling_pt,
col = list(Celltype = palette_celltype_hm,
Pseudotime = circlize::colorRamp2(seq(0, 1, by = 0.25), palette_heatmap)),
show_legend = TRUE,
show_annotation_name = FALSE,
gap = unit(1, "mm"),
border = TRUE)
palette_cluster_hm <- as.character(paletteer::paletteer_d("ggsci::default_igv")[1:length(unique(gene_embed$leiden))])
names(palette_cluster_hm) <- as.character(unique(gene_embed$leiden))
row_anno <- HeatmapAnnotation(Cluster = as.factor(gene_embed$leiden),
col = list(Cluster = palette_cluster_hm),
show_legend = TRUE,
show_annotation_name = FALSE,
annotation_legend_param = list(title = "Gene\nCluster"),
gap = unit(1, "mm"),
border = TRUE,
which = "row")
The heatmap shows clear dynamic patterns across pseudotime; these patterns are often referred to as expression cascades, and represent periodic up- and down-regulation of different gene programs during the course of the underlying biological process.
Heatmap(matrix = heatmap_mat,
name = "Spliced\nmRNA",
col = circlize::colorRamp2(colors = viridis::inferno(50),
breaks = seq(min(heatmap_mat), max(heatmap_mat), length.out = 50)),
cluster_columns = FALSE,
width = 12,
height = 6,
column_title = "",
cluster_rows = FALSE,
top_annotation = col_anno,
left_annotation = row_anno,
show_column_names = FALSE,
show_row_names = FALSE,
use_raster = TRUE,
raster_by_magick = TRUE,
raster_quality = 5)
Using our gene clusters & the gprofiler2
package, we run an enrichment analysis against the biological process (BP) set of gene ontologies. We make sure to order the genes in each cluster by their test statistics by joining to the results table from scLANE
.
gene_clust_list <- purrr::map(unique(gene_embed$leiden), \(x) {
filter(gene_embed, leiden == x) %>%
inner_join(scLANE_res_tidy, by = c("gene" = "Gene")) %>%
arrange(desc(Test_Stat)) %>%
pull(gene)
})
names(gene_clust_list) <- paste0("Leiden_", unique(gene_embed$leiden))
enrich_res <- gprofiler2::gost(gene_clust_list,
organism = "mmusculus",
ordered_query = TRUE,
multi_query = FALSE,
sources = "GO:BP",
significant = TRUE)
A look at the top 5 most-significant GO terms for each gene cluster reveals heterogeneous functionalities across groups of genes.
mutate(enrich_res$result,
query = gsub("Leiden_", "", query)) %>%
rename(cluster = query) %>%
with_groups(cluster,
slice_head,
n = 5) %>%
select(cluster,
term_name,
p_value,
term_size,
query_size,
intersection_size,
term_id) %>%
kableExtra::kbl(digits = 3,
booktabs = TRUE,
col.names = c("Gene Cluster", "Term Name", "Adj. P-value", "Term Size",
"Query Size", "Intersection Size", "Term ID")) %>%
kableExtra::kable_classic(c("hover"), full_width = FALSE)
Gene Cluster | Term Name | Adj. P-value | Term Size | Query Size | Intersection Size | Term ID |
---|---|---|---|---|---|---|
0 | tube development | 0 | 1181 | 335 | 66 | GO:0035295 |
0 | anatomical structure development | 0 | 6269 | 326 | 157 | GO:0048856 |
0 | multicellular organism development | 0 | 4879 | 329 | 136 | GO:0007275 |
0 | developmental process | 0 | 6854 | 326 | 164 | GO:0032502 |
0 | tube morphogenesis | 0 | 927 | 335 | 55 | GO:0035239 |
1 | signal release | 0 | 587 | 178 | 34 | GO:0023061 |
1 | export from cell | 0 | 982 | 316 | 55 | GO:0140352 |
1 | secretion by cell | 0 | 913 | 289 | 50 | GO:0032940 |
1 | peptide hormone secretion | 0 | 316 | 214 | 28 | GO:0030072 |
1 | hormone transport | 0 | 405 | 316 | 36 | GO:0009914 |
2 | nitrogen compound transport | 0 | 1959 | 160 | 46 | GO:0071705 |
2 | regulation of secretion by cell | 0 | 667 | 156 | 27 | GO:1903530 |
2 | regulation of hormone levels | 0 | 639 | 156 | 26 | GO:0010817 |
2 | secretion by cell | 0 | 913 | 156 | 30 | GO:0032940 |
2 | regulation of transport | 0 | 1749 | 274 | 56 | GO:0051049 |
3 | organonitrogen compound metabolic process | 0 | 5744 | 523 | 220 | GO:1901564 |
3 | system development | 0 | 4117 | 522 | 173 | GO:0048731 |
3 | positive regulation of biological process | 0 | 6589 | 528 | 234 | GO:0048518 |
3 | positive regulation of cellular process | 0 | 6078 | 528 | 218 | GO:0048522 |
3 | multicellular organism development | 0 | 4879 | 523 | 187 | GO:0007275 |
4 | cell cycle | 0 | 1808 | 303 | 173 | GO:0007049 |
4 | cell cycle process | 0 | 1256 | 285 | 149 | GO:0022402 |
4 | mitotic cell cycle | 0 | 852 | 323 | 125 | GO:0000278 |
4 | mitotic cell cycle process | 0 | 718 | 285 | 112 | GO:1903047 |
4 | chromosome segregation | 0 | 408 | 263 | 86 | GO:0007059 |
5 | multicellular organism development | 0 | 4879 | 314 | 126 | GO:0007275 |
5 | system development | 0 | 4117 | 314 | 112 | GO:0048731 |
5 | anatomical structure development | 0 | 6269 | 307 | 141 | GO:0048856 |
5 | developmental process | 0 | 6854 | 341 | 160 | GO:0032502 |
5 | animal organ development | 0 | 3260 | 341 | 99 | GO:0048513 |
We can use the geneProgramScoring()
function to add module scores for each gene cluster to our Seurat
object.
peptide_program <- filter(enrich_res$result, term_name == "peptide hormone secretion") %>%
arrange(p_value) %>%
slice_head(n = 1) %>%
pull(query)
peptide_program_name <- gsub("Leiden_", "cluster_", peptide_program)
seu <- geneProgramScoring(seu,
genes = gene_embed$gene,
gene.clusters = gene_embed$leiden)
Visualizing the scores on our UMAP embedding shows us that the peptide program is highly-enriched only in mature endocrine cells. This makes sense biologically as mature endocrine celltypes’ primary roles are to produce peptides such as glucagon (alpha cells), insulin (beta cells), somatostatin (ductal cells), and pancreatic polypeptide (gamma cells).
p13 <- Embeddings(seu, "umap") %>%
as.data.frame() %>%
magrittr::set_colnames(c("UMAP_1", "UMAP_2")) %>%
mutate(peptide_program_score = seu@meta.data[, peptide_program_name]) %>%
ggplot(aes(x = UMAP_1, y = UMAP_2, color = peptide_program_score)) +
geom_point(size = 1.5, alpha = 0.75, stroke = 0) +
labs(x = "UMAP 1",
y = "UMAP 2",
color = "Program Score") +
scale_color_gradientn(colors = palette_heatmap,
labels = scales::label_number(accuracy = 0.1)) +
theme_scLANE(umap = TRUE)
p14 <- (p13 / p1) +
plot_layout(guides = "collect", axes = "collect")
p14
We can also visualize the trend in the peptide program scores over time, which confirms the biological conclusions we came to by inspecting the UMAP.
p15 <- data.frame(PT = sling_pt$PT,
peptide_program_score = seu@meta.data[, peptide_program_name],
celltype = seu$celltype) %>%
ggplot(aes(x = PT, y = peptide_program_score, color = celltype)) +
geom_point(alpha = 0.75,
stroke = 0,
size = 2) +
geom_smooth(color = "black",
method = "loess",
linewidth = 0.75) +
scale_color_manual(values = palette_celltype) +
labs(x = "Pseudotime", y = "Peptide Program Score") +
theme_scLANE() +
theme(legend.title = element_blank()) +
guide_umap()
p15
Next, in order to identify which genes are “drivers” of a certain gene program, we can use the geneProgramDrivers()
function to correlate normalized expression with program scores.
We display the top 10 most-correlated genes here. For example, carboxypeptidase E (Cpe) is involved in the production of insulin (source).
Gene | Correlation | P-value | Adj. P-value |
---|---|---|---|
Cpe | 0.849 | 0 | 0 |
Gnas | 0.824 | 0 | 0 |
Fam183b | 0.824 | 0 | 0 |
Pcsk1n | 0.810 | 0 | 0 |
Sparc | -0.774 | 0 | 0 |
Aplp1 | 0.767 | 0 | 0 |
Hmgn3 | 0.757 | 0 | 0 |
Isl1 | 0.757 | 0 | 0 |
1700086L19Rik | 0.755 | 0 | 0 |
Spp1 | -0.752 | 0 | 0 |
Indeed, normalized expression of Cpe is high across all mature endocrine celltypes, with alpha and beta cells showing the highest overall mean expression.
p16 <- data.frame(celltype = seu$celltype,
rna = seu@assays$spliced@data["Cpe", ]) %>%
ggplot(aes(x = celltype, y = rna, color = celltype)) +
ggbeeswarm::geom_quasirandom(alpha = 0.75,
size = 2,
stroke = 0,
show.legend = FALSE) +
scale_color_manual(values = palette_celltype) +
stat_summary(fun = "mean",
geom = "point",
color = "black",
size = 3) +
labs(y = "Normalized Expression") +
theme_scLANE() +
theme(axis.title.x = element_blank())
p16
Lastly, we perform an enrichment analysis for our set of dynamic genes. We’ll focus on terms from the GO biological process (BP) set, as those are generally the easiest to interpret.
Overall, there are 1341 unique significantly-enriched GO:BP terms for our trajectory.
Term | P-value | Source |
---|---|---|
multi-organism reproductive process | 0.000 | GO:BP |
response to electrical stimulus | 0.006 | GO:BP |
regulation of nuclear division | 0.000 | GO:BP |
meiotic nuclear division | 0.000 | GO:BP |
response to transforming growth factor beta | 0.000 | GO:BP |
regulation of fibroblast proliferation | 0.000 | GO:BP |
cellular response to carbohydrate stimulus | 0.000 | GO:BP |
double-strand break repair via homologous recombination | 0.018 | GO:BP |
carbohydrate homeostasis | 0.000 | GO:BP |
positive regulation of calcium ion transport | 0.046 | GO:BP |
Hopefully this vignette has been a useful introduction to running the scLANE
software and using its outputs to help better understand biology at single-cell resolution. If you have questions about how the models work or are interpreted, software issues, or simply want to compare results feel free to open an issue on the GitHub repository or reach out via email to j.leary@ufl.edu.
Bastidas-Ponce, Aimée et al. Comprehensive single cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis. Development (2019).
Street, Kelly et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics (2018).
Stoklosa, Jakub & David Warton. A generalized estimating equation approach to multivariate adaptive regression splines. Journal of Computational and Graphical Statistics (2018).
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.2 (2023-10-31)
os macOS Sonoma 14.3.1
system x86_64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2024-04-16
pandoc 3.1.12.3 @ /usr/local/bin/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
abind 1.4-5 2016-07-21 [1] CRAN (R 4.3.0)
backports 1.4.1 2021-12-13 [1] CRAN (R 4.3.0)
beeswarm 0.4.0 2021-06-01 [1] CRAN (R 4.3.0)
bigassertr 0.1.6 2023-01-10 [1] CRAN (R 4.3.0)
bigparallelr 0.3.2 2021-10-02 [1] CRAN (R 4.3.0)
bigstatsr 1.5.12 2022-10-14 [1] CRAN (R 4.3.0)
Biobase * 2.62.0 2023-10-24 [1] Bioconductor
BiocGenerics * 0.48.1 2023-11-01 [1] Bioconductor
BiocNeighbors 1.20.1 2023-12-18 [1] Bioconductor 3.18 (R 4.3.2)
BiocParallel 1.36.0 2023-10-24 [1] Bioconductor
bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.0)
bitops 1.0-7 2021-04-24 [1] CRAN (R 4.3.0)
bluster 1.12.0 2023-10-24 [1] Bioconductor
boot 1.3-28.1 2022-11-22 [1] CRAN (R 4.3.2)
broom 1.0.5 2023-06-09 [1] CRAN (R 4.3.0)
broom.mixed 0.2.9.4 2022-04-17 [1] CRAN (R 4.3.0)
circlize 0.4.15 2022-05-10 [1] CRAN (R 4.3.0)
cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.0)
clue 0.3-65 2023-09-23 [1] CRAN (R 4.3.0)
cluster 2.1.6 2023-12-01 [1] CRAN (R 4.3.0)
coda 0.19-4 2020-09-30 [1] CRAN (R 4.3.0)
codetools 0.2-19 2023-02-01 [1] CRAN (R 4.3.2)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)
ComplexHeatmap * 2.18.0 2023-10-24 [1] Bioconductor
cowplot 1.1.2 2023-12-15 [1] CRAN (R 4.3.0)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)
data.table 1.14.10 2023-12-08 [1] CRAN (R 4.3.0)
DelayedArray 0.28.0 2023-10-24 [1] Bioconductor
DelayedMatrixStats 1.24.0 2023-10-24 [1] Bioconductor
deldir 2.0-2 2023-11-23 [1] CRAN (R 4.3.0)
digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)
doParallel 1.0.17 2022-02-07 [1] CRAN (R 4.3.0)
doSNOW 1.0.20 2022-02-04 [1] CRAN (R 4.3.0)
dotCall64 1.1-1 2023-11-28 [1] CRAN (R 4.3.0)
dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.0)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.3.0)
emmeans 1.10.0 2024-01-23 [1] CRAN (R 4.3.2)
estimability 1.4.1 2022-08-05 [1] CRAN (R 4.3.0)
evaluate 0.23 2023-11-01 [1] CRAN (R 4.3.0)
fansi 1.0.6 2023-12-08 [1] CRAN (R 4.3.0)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)
fastDummies 1.7.3 2023-07-06 [1] CRAN (R 4.3.0)
fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
ff 4.0.9 2023-01-25 [1] CRAN (R 4.3.0)
fitdistrplus 1.1-11 2023-04-25 [1] CRAN (R 4.3.0)
flock 0.7 2016-11-12 [1] CRAN (R 4.3.0)
forcats 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)
foreach 1.5.2 2022-02-02 [1] CRAN (R 4.3.0)
furrr 0.3.1 2022-08-15 [1] CRAN (R 4.3.0)
future 1.33.1 2023-12-22 [1] CRAN (R 4.3.0)
future.apply 1.11.1 2023-12-21 [1] CRAN (R 4.3.0)
gamlss 5.4-20 2023-10-04 [1] CRAN (R 4.3.0)
gamlss.data 6.0-2 2021-11-07 [1] CRAN (R 4.3.0)
gamlss.dist 6.1-1 2023-08-23 [1] CRAN (R 4.3.0)
geeM 0.10.1 2018-06-18 [1] CRAN (R 4.3.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
GenomeInfoDb * 1.38.5 2023-12-28 [1] Bioconductor 3.18 (R 4.3.2)
GenomeInfoDbData 1.2.11 2023-12-22 [1] Bioconductor
GenomicRanges * 1.54.1 2023-10-29 [1] Bioconductor
GetoptLong 1.0.5 2020-12-15 [1] CRAN (R 4.3.0)
ggbeeswarm 0.7.2 2023-04-29 [1] CRAN (R 4.3.0)
ggh4x 0.2.8 2024-01-23 [1] CRAN (R 4.3.2)
ggplot2 * 3.4.4 2023-10-12 [1] CRAN (R 4.3.0)
ggrepel 0.9.5 2024-01-10 [1] CRAN (R 4.3.0)
ggridges 0.5.5 2023-12-15 [1] CRAN (R 4.3.0)
glm2 * 1.2.1 2018-08-11 [1] CRAN (R 4.3.0)
glmmTMB 1.1.8 2023-10-07 [1] CRAN (R 4.3.0)
GlobalOptions 0.1.2 2020-06-10 [1] CRAN (R 4.3.0)
globals 0.16.2 2022-11-21 [1] CRAN (R 4.3.0)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)
goftest 1.2-3 2021-10-07 [1] CRAN (R 4.3.0)
gprofiler2 0.2.2 2023-06-14 [1] CRAN (R 4.3.0)
gridExtra 2.3 2017-09-09 [1] CRAN (R 4.3.0)
gtable 0.3.4 2023-08-21 [1] CRAN (R 4.3.0)
highr 0.10 2022-12-22 [1] CRAN (R 4.3.0)
htmltools 0.5.7 2023-11-03 [1] CRAN (R 4.3.0)
htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.3.0)
httpuv 1.6.13 2023-12-06 [1] CRAN (R 4.3.0)
httr 1.4.7 2023-08-15 [1] CRAN (R 4.3.0)
ica 1.0-3 2022-07-08 [1] CRAN (R 4.3.0)
igraph 2.0.1.1 2024-01-30 [1] CRAN (R 4.3.2)
IRanges * 2.36.0 2023-10-24 [1] Bioconductor
irlba 2.3.5.1 2022-10-03 [1] CRAN (R 4.3.0)
iterators 1.0.14 2022-02-05 [1] CRAN (R 4.3.0)
jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.3.0)
kableExtra 1.3.4 2021-02-20 [1] CRAN (R 4.3.0)
KernSmooth 2.23-22 2023-07-10 [1] CRAN (R 4.3.2)
knitr 1.45 2023-10-30 [1] CRAN (R 4.3.0)
labeling 0.4.3 2023-08-29 [1] CRAN (R 4.3.0)
later 1.3.2 2023-12-06 [1] CRAN (R 4.3.0)
lattice 0.22-5 2023-10-24 [1] CRAN (R 4.3.0)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.3.0)
leiden 0.4.3.1 2023-11-17 [1] CRAN (R 4.3.0)
lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.0)
listenv 0.9.0 2022-12-16 [1] CRAN (R 4.3.0)
lme4 1.1-35.1 2023-11-05 [1] CRAN (R 4.3.0)
lmtest 0.9-40 2022-03-21 [1] CRAN (R 4.3.0)
magick 2.8.2 2023-12-20 [1] CRAN (R 4.3.0)
magrittr * 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
MASS 7.3-60 2023-05-04 [1] CRAN (R 4.3.0)
Matrix 1.6-5 2024-01-11 [1] CRAN (R 4.3.0)
MatrixGenerics * 1.14.0 2023-10-24 [1] Bioconductor
matrixStats * 1.2.0 2023-12-11 [1] CRAN (R 4.3.0)
mgcv 1.9-1 2023-12-21 [1] CRAN (R 4.3.0)
mime 0.12 2021-09-28 [1] CRAN (R 4.3.0)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.3.0)
minqa 1.2.6 2023-09-11 [1] CRAN (R 4.3.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)
mvtnorm 1.2-4 2023-11-27 [1] CRAN (R 4.3.0)
nlme 3.1-164 2023-11-27 [1] CRAN (R 4.3.0)
nloptr 2.0.3 2022-05-26 [1] CRAN (R 4.3.0)
numDeriv 2016.8-1.1 2019-06-06 [1] CRAN (R 4.3.0)
paletteer 1.6.0 2024-01-21 [1] CRAN (R 4.3.0)
parallelly 1.36.0 2023-05-26 [1] CRAN (R 4.3.0)
patchwork * 1.2.0 2024-01-08 [1] CRAN (R 4.3.0)
pbapply 1.7-2 2023-06-27 [1] CRAN (R 4.3.0)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
plotly 4.10.4 2024-01-13 [1] CRAN (R 4.3.0)
plyr 1.8.9 2023-10-02 [1] CRAN (R 4.3.0)
png 0.1-8 2022-11-29 [1] CRAN (R 4.3.0)
polyclip 1.10-6 2023-09-27 [1] CRAN (R 4.3.0)
princurve * 2.1.6 2021-01-18 [1] CRAN (R 4.3.0)
prismatic 1.1.1 2022-08-15 [1] CRAN (R 4.3.0)
progressr 0.14.0 2023-08-10 [1] CRAN (R 4.3.0)
promises 1.2.1 2023-08-10 [1] CRAN (R 4.3.0)
ps 1.7.5 2023-04-18 [1] CRAN (R 4.3.0)
purrr 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
RANN 2.6.1 2019-01-08 [1] CRAN (R 4.3.0)
RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.3.0)
Rcpp 1.0.11 2023-07-06 [1] CRAN (R 4.3.0)
RcppAnnoy 0.0.21 2023-07-02 [1] CRAN (R 4.3.0)
RcppEigen 0.3.3.9.4 2023-11-02 [1] CRAN (R 4.3.0)
RcppHNSW 0.5.0 2023-09-19 [1] CRAN (R 4.3.0)
RCurl 1.98-1.13 2023-11-02 [1] CRAN (R 4.3.0)
rematch2 2.1.2 2020-05-01 [1] CRAN (R 4.3.0)
reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.3.0)
reticulate * 1.34.0 2023-10-12 [1] CRAN (R 4.3.0)
rjson 0.2.21 2022-01-09 [1] CRAN (R 4.3.0)
rlang 1.1.2 2023-11-04 [1] CRAN (R 4.3.0)
rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.0)
rmio 0.4.0 2022-02-17 [1] CRAN (R 4.3.0)
ROCR 1.0-11 2020-05-02 [1] CRAN (R 4.3.0)
RSpectra 0.16-1 2022-04-24 [1] CRAN (R 4.3.0)
rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)
Rtsne 0.17 2023-12-07 [1] CRAN (R 4.3.0)
rvest 1.0.3 2022-08-19 [1] CRAN (R 4.3.0)
S4Arrays 1.2.0 2023-10-24 [1] Bioconductor
S4Vectors * 0.40.2 2023-11-23 [1] Bioconductor
scales 1.3.0 2023-11-28 [1] CRAN (R 4.3.0)
scattermore 1.2 2023-06-12 [1] CRAN (R 4.3.0)
scLANE * 0.7.9 2024-04-13 [1] Bioconductor
sctransform 0.4.1 2023-10-19 [1] CRAN (R 4.3.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
Seurat * 5.0.1 2023-11-17 [1] CRAN (R 4.3.0)
SeuratObject * 5.0.1 2023-11-17 [1] CRAN (R 4.3.0)
shape 1.4.6 2021-05-19 [1] CRAN (R 4.3.0)
shiny 1.8.0 2023-11-17 [1] CRAN (R 4.3.0)
SingleCellExperiment * 1.24.0 2023-10-24 [1] Bioconductor
slingshot * 2.10.0 2023-10-24 [1] Bioconductor
snow 0.4-4 2021-10-27 [1] CRAN (R 4.3.0)
sp * 2.1-2 2023-11-26 [1] CRAN (R 4.3.0)
spam 2.10-0 2023-10-23 [1] CRAN (R 4.3.0)
SparseArray 1.2.3 2023-12-25 [1] Bioconductor 3.18 (R 4.3.2)
sparseMatrixStats 1.14.0 2023-10-24 [1] Bioconductor
spatstat.data 3.0-3 2023-10-24 [1] CRAN (R 4.3.0)
spatstat.explore 3.2-5 2023-10-22 [1] CRAN (R 4.3.0)
spatstat.geom 3.2-7 2023-10-20 [1] CRAN (R 4.3.0)
spatstat.random 3.2-2 2023-11-29 [1] CRAN (R 4.3.0)
spatstat.sparse 3.0-3 2023-10-24 [1] CRAN (R 4.3.0)
spatstat.utils 3.0-4 2023-10-24 [1] CRAN (R 4.3.0)
stringi 1.8.3 2023-12-11 [1] CRAN (R 4.3.0)
stringr 1.5.1 2023-11-14 [1] CRAN (R 4.3.0)
SummarizedExperiment * 1.32.0 2023-10-24 [1] Bioconductor
survival 3.5-7 2023-08-14 [1] CRAN (R 4.3.2)
svglite 2.1.3 2023-12-08 [1] CRAN (R 4.3.0)
systemfonts 1.0.5 2023-10-09 [1] CRAN (R 4.3.0)
tensor 1.5 2012-05-05 [1] CRAN (R 4.3.0)
tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
tidyr 1.3.1 2024-01-24 [1] CRAN (R 4.3.2)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
TMB 1.9.11 2024-04-03 [1] CRAN (R 4.3.2)
TrajectoryUtils * 1.10.0 2023-10-24 [1] Bioconductor
UCell 2.6.2 2023-11-06 [1] Bioconductor
utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.0)
uwot 0.1.16 2023-06-29 [1] CRAN (R 4.3.0)
vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.0)
vipor 0.4.7 2023-12-18 [1] CRAN (R 4.3.0)
viridis 0.6.4 2023-07-22 [1] CRAN (R 4.3.0)
viridisLite 0.4.2 2023-05-02 [1] CRAN (R 4.3.0)
webshot 0.5.5 2023-06-26 [1] CRAN (R 4.3.0)
withr 2.5.2 2023-10-30 [1] CRAN (R 4.3.0)
xfun 0.41 2023-11-01 [1] CRAN (R 4.3.0)
xml2 1.3.6 2023-12-04 [1] CRAN (R 4.3.0)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.3.0)
XVector 0.42.0 2023-10-24 [1] Bioconductor
yaml 2.3.8 2023-12-11 [1] CRAN (R 4.3.0)
zlibbioc 1.48.0 2023-10-24 [1] Bioconductor
zoo 1.8-12 2023-04-13 [1] CRAN (R 4.3.0)
[1] /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library
─ Python configuration ───────────────────────────────────────────────────────
python: /Users/jack/Desktop/PhD/Research/Python_Envs/personal_site/bin/python
libpython: /usr/local/opt/python@3.11/Frameworks/Python.framework/Versions/3.11/lib/python3.11/config-3.11-darwin/libpython3.11.dylib
pythonhome: /Users/jack/Desktop/PhD/Research/Python_Envs/personal_site:/Users/jack/Desktop/PhD/Research/Python_Envs/personal_site
version: 3.11.6 (main, Nov 2 2023, 04:52:24) [Clang 14.0.3 (clang-1403.0.22.14.1)]
numpy: /Users/jack/Desktop/PhD/Research/Python_Envs/personal_site/lib/python3.11/site-packages/numpy
numpy_version: 1.26.3
NOTE: Python version was forced by use_python() function
──────────────────────────────────────────────────────────────────────────────
---
title: "Interpretable scRNA-seq Trajectory DE with `scLANE`"
author:
name: Jack R. Leary
email: j.leary@ufl.edu
orcid: 0009-0004-8821-3269
affiliations:
- name: University of Florida
department: Department of Biostatistics
city: Gainesville
state: FL
date: today
date-format: long
format:
html:
code-fold: show
code-copy: true
code-tools: true
toc: true
toc-depth: 2
embed-resources: true
fig-format: retina
fig-width: 9
fig-height: 6
df-print: kable
link-external-newwindow: true
tbl-cap-location: bottom
fig-cap-location: bottom
number-sections: true
execute:
cache: false
freeze: auto
---
```{r setup}
#| include: false
knitr::opts_chunk$set(comment = NA)
reticulate::use_virtualenv("~/Desktop/PhD/Research/Python_Envs/personal_site/", required = TRUE)
set.seed(312)
```
# Introduction {#sec-intro}
In this tutorial we'll walk through a basic trajectory differential expression analysis. We'll use the `scLANE` R package, which we developed with the goal of providing accurate and biologically interpretable models of expression over pseudotime. At the end are some best-practices recommendations, along with a short list of references we used in developing the method & writing the accompanying manuscript. If you want to skip the preprocessing steps and get right into the analysis, head to @sec-traj.
# Libraries {#sec-libs}
If you haven't already, install the development version (currently v`r packageVersion("scLANE")`) of `scLANE` from [the GitHub repository](https://github.com/jr-leary7/scLANE).
```{r}
#| eval: false
remotes::install_github("jr-leary7/scLANE")
```
Next, we'll load the packages we need to process, analyze, & visualize our data.
```{r}
#| message: false
#| warning: false
#| results: hide
library(dplyr) # data manipulation
library(scLANE) # trajectory DE
library(Seurat) # scRNA-seq tools
library(ggplot2) # plot creation
library(patchwork) # plot combination
library(slingshot) # pseudotime estimation
library(reticulate) # Python interface
library(ComplexHeatmap) # heatmaps
rename <- dplyr::rename
```
# Helper functions {#sec-fns}
We'll also define a utility function to make our plots cleaner to read & easier to make.
```{r}
guide_umap <- function(key.size = 4) {
ggplot2::guides(color = ggplot2::guide_legend(override.aes = list(size = key.size,
alpha = 1,
stroke = 0.25)))
}
```
And consistent color palettes will make our plots easier to understand.
```{r}
palette_cluster <- paletteer::paletteer_d("ggsci::default_jama")
palette_celltype <- paletteer::paletteer_d("ggsci::category20_d3")
palette_heatmap <- paletteer::paletteer_d("wesanderson::Zissou1")
```
# Data {#sec-data}
We'll load the well-known pancreatic endocrinogenesis data from [Bastidas-Ponce *et al* (2019)](https://doi.org/10.1242/dev.173849), which comes with the `scVelo` Python library & has been used in several pseudotime inference / RNA velocity method papers as a good benchmark dataset due to the simplicity of the underlying trajectory manifold.
```{python}
#| results: hide
import scvelo as scv
adata = scv.datasets.pancreas()
```
The `AnnData` object contains data that we'll need to extract, specifically the spliced & unspliced mRNA counts matrices (stored in `AnnData.layers`) and the cell-level metadata (which is in `AnnData.obs`).
```{python}
adata
```
## Conversion from Python
The `reticulate` package allows us to pass the counts matrices & metadata from Python back to R.
```{python}
spliced_counts = adata.layers['spliced'].toarray()
unspliced_counts = adata.layers['unspliced'].toarray()
```
::: {.callout-note}
While downloading this dataset requires a Python installation as well as the installation of the `scVelo` Python library (and its dependencies), running `scLANE` is done purely in R & requires no Python whatsoever.
:::
We'll use the spliced mRNA counts as our default assay, and also define a new assay containing the total (spliced + unspliced) mRNA in each cell. Lastly, we remove genes with non-zero spliced mRNA in 3 or fewer cells.
```{r}
spliced_counts <- Matrix::Matrix(t(py$spliced_counts), sparse = TRUE)
unspliced_counts <- Matrix::Matrix(t(py$unspliced_counts), sparse = TRUE)
rna_counts <- spliced_counts + unspliced_counts
colnames(rna_counts) <- colnames(spliced_counts) <- colnames(unspliced_counts) <- py$adata$obs_names$to_list()
rownames(rna_counts) <- rownames(spliced_counts) <- rownames(unspliced_counts) <- py$adata$var_names$to_list()
spliced_assay <- CreateAssayObject(counts = spliced_counts)
spliced_assay@key <- "spliced_"
unspliced_assay <- CreateAssayObject(counts = unspliced_counts)
unspliced_assay@key <- "unspliced_"
rna_assay <- CreateAssayObject(counts = rna_counts)
rna_assay@key <- "rna_"
meta_data <- py$adata$obs %>%
mutate(cell_name = rownames(.),
.before = 1) %>%
rename(celltype = clusters,
celltype_coarse = clusters_coarse) %>%
mutate(nCount_spliced = colSums(spliced_counts),
nFeature_spliced = colSums(spliced_counts > 0),
nCount_unspliced = colSums(unspliced_counts),
nFeature_unspliced = colSums(unspliced_counts > 0),
nCount_RNA = colSums(rna_counts),
nFeature_RNA = colSums(rna_counts > 0))
seu <- CreateSeuratObject(counts = spliced_assay,
assay = "spliced",
project = "Mm_Panc_Endo",
meta.data = meta_data)
seu@assays$unspliced <- unspliced_assay
seu@assays$RNA <- rna_assay
seu <- seu[rowSums(seu@assays$spliced) > 3, ]
```
```{r}
#| echo: false
#| message: false
#| warning: false
#| results: hide
system("rm -rf ./data") # remove data/ directory created by downloading pancreas .h5ad file
rm(meta_data, rna_assay, rna_counts, spliced_assay, spliced_counts, unspliced_assay, unspliced_counts)
```
## Preprocessing
We preprocess the counts using a typical pipeline with QC, normalization & scaling, dimension reduction, and graph-based clustering via the Leiden algorithm.
```{r}
#| message: false
#| warning: false
seu <- PercentageFeatureSet(seu,
pattern = "^mt-",
col.name = "percent_mito",
assay = "spliced") %>%
PercentageFeatureSet(pattern = "^Rp[sl]",
col.name = "percent_ribo",
assay = "spliced") %>%
NormalizeData(assay = "spliced", verbose = FALSE) %>%
NormalizeData(assay = "unspliced", verbose = FALSE) %>%
NormalizeData(assay = "RNA", verbose = FALSE) %>%
FindVariableFeatures(assay = "spliced",
nfeatures = 3000,
verbose = FALSE) %>%
ScaleData(assay = "spliced",
vars.to.regress = c("percent_mito", "percent_ribo"),
model.use = "poisson",
verbose = FALSE) %>%
RunPCA(assay = "spliced",
npcs = 30,
approx = TRUE,
seed.use = 312,
verbose = FALSE) %>%
RunUMAP(reduction = "pca",
dims = 1:30,
n.components = 2,
metric = "cosine",
seed.use = 312,
verbose = FALSE) %>%
FindNeighbors(reduction = "pca",
k.param = 30,
nn.method = "annoy",
annoy.metric = "cosine",
verbose = FALSE) %>%
FindClusters(algorithm = 4,
method = "igraph",
resolution = 0.5,
random.seed = 312,
verbose = FALSE)
```
Let's visualize the results on our UMAP embedding. The clustering generally agrees with the celltype labels, though there is some overclustering in the ductal cells & underclustering in the mature endocrine celltypes.
```{r}
#| code-fold: true
#| fig-cap: Cluster & celltype labels in UMAP space.
#| label: fig-umap_labels
p0 <- Embeddings(seu, "umap") %>%
as.data.frame() %>%
magrittr::set_colnames(c("UMAP_1", "UMAP_2")) %>%
mutate(leiden = seu$seurat_clusters) %>%
ggplot(aes(x = UMAP_1, y = UMAP_2, color = leiden)) +
geom_point(size = 1.5,
alpha = 0.75,
stroke = 0) +
scale_color_manual(values = palette_cluster) +
labs(x = "UMAP 1",
y = "UMAP 2",
color = "Leiden") +
theme_scLANE(umap = TRUE) +
guide_umap()
p1 <- Embeddings(seu, "umap") %>%
as.data.frame() %>%
magrittr::set_colnames(c("UMAP_1", "UMAP_2")) %>%
mutate(celltype = seu$celltype) %>%
ggplot(aes(x = UMAP_1, y = UMAP_2, color = celltype)) +
geom_point(size = 1.5,
alpha = 0.75,
stroke = 0) +
scale_color_manual(values = palette_celltype) +
labs(x = "UMAP 1",
y = "UMAP 2",
color = "Celltype") +
theme_scLANE(umap = TRUE) +
guide_umap()
p2 <- (p0 / p1) +
plot_layout(guides = "collect", axes = "collect")
p2
```
# Trajectory inference {#sec-traj}
## Pseudotime estimation
We'll start by fitting a trajectory using the `slingshot` R package. We define cluster 5 as the starting cluster, since in this case we're already aware of the dataset's underlying biology. After generating the estimates for each cell, we rescale the ordering to be defined on $[0, 1]$. This has no effect on the trajectory DE results however, and is mostly an aesthetic choice.
```{r}
sling_res <- slingshot(Embeddings(seu, "umap"),
start.clus = "5",
clusterLabels = seu$seurat_clusters,
approx_points = 500)
sling_pt <- slingPseudotime(sling_res) %>%
as.data.frame() %>%
magrittr::set_colnames(c("PT")) %>%
mutate(PT = (PT - min(PT)) / (max(PT) - min(PT)))
seu <- AddMetaData(seu,
metadata = sling_pt,
col.name = "sling_pt")
```
Let's visualize the results on our UMAP embedding. They match what we would expect (knowing the biological background of the data), with ductal cells at the start of the process and endocrine celltypes such as alpha, beta, & delta cells at the end of it.
```{r}
#| code-fold: true
#| fig-cap: Estimated pseudotime in UMAP space.
#| label: fig-umap_pseudotime
p3 <- Embeddings(seu, "umap") %>%
as.data.frame() %>%
magrittr::set_colnames(c("UMAP_1", "UMAP_2")) %>%
mutate(PT = sling_pt$PT) %>%
ggplot(aes(x = UMAP_1, y = UMAP_2, color = PT)) +
geom_point(size = 1.5,
alpha = 0.75,
stroke = 0) +
labs(x = "UMAP 1",
y = "UMAP 2",
color = "Pseudotime") +
scale_color_gradientn(colors = palette_heatmap,
labels = scales::label_number(accuracy = 0.01)) +
theme_scLANE(umap = TRUE)
p4 <- (p3 / p1) +
plot_layout(guides = "collect", axes = "collect")
p4
```
## Trajectory DE testing
Next, we prepare the primary inputs to `scLANE`: our `Seurat` object with the spliced counts set as the default assay, a dataframe containing our estimated pseudotime ordering, a vector of size factors to use as an offset for sequencing depth in each model, and a set of genes whose dynamics we want to model. `scLANE` parallelizes over genes in order to speed up the computation at the expense of using a little more memory. The models are fit using NB GLMs with optimal spline knots identified empirically, and differential expression is quantified using a likelihood ratio test of the fitted model vs. a constant (intercept-only) model. In practice, genes designated as HVGs are usually the best candidates for modeling, so we choose the top 3,000 HVGs as our input.
::: {.callout-note}
The testing of the HVG set on its own is also justified by the reality that almost all trajectories are inferred using some sort of dimension-reduced space, and those embeddings are nearly universally generated using a set of HVGs. As such, genes not included in the HVG set actually have no direct relationship with the estimated trajectory, & it's generally safe to exclude them from trajectory analyses.
:::
```{r, results='hold'}
top3k_hvg <- HVFInfo(seu) %>%
arrange(desc(variance.standardized)) %>%
slice_head(n = 3000) %>%
rownames(.)
cell_offset <- createCellOffset(seu)
scLANE_models <- testDynamic(seu,
pt = sling_pt,
genes = top3k_hvg,
size.factor.offset = cell_offset,
n.cores = 6L,
verbose = FALSE)
scLANE_res_tidy <- getResultsDE(scLANE_models)
```
After tidying the TDE results with `getResultsDE()`, we pull a sample of 6 genes from the results & display their test statistics. By default, any gene with an adjusted *p*-value less than 0.01 is predicted to be dynamic, though this threshold can be easily adjusted.
```{r}
#| code-fold: true
#| tbl-cap: TDE test results from scLANE.
#| label: tbl-scLANE_res
select(scLANE_res_tidy,
Gene,
Test_Stat,
P_Val,
P_Val_Adj,
Gene_Dynamic_Overall) %>%
mutate(Gene_Dynamic_Overall = if_else(Gene_Dynamic_Overall == 1, "Dynamic", "Static")) %>%
with_groups(Gene_Dynamic_Overall,
slice_sample,
n = 3) %>%
kableExtra::kbl(digits = 4,
booktabs = TRUE,
col.names = c("Gene", "LRT stat.", "P-value", "Adj. p-value", "Predicted gene status")) %>%
kableExtra::kable_classic(full_width = FALSE, "hover")
```
Next, we can use the `plotModels()` function to visualize the fitted models from `scLANE` and compare them to other modeling methods. The gene [*Neurog3*](https://www.ncbi.nlm.nih.gov/gene/11925) is strongly associated with epithelial cell differentiation, and indeed we see a clear, nonlinear transcriptional dynamic across pseudotime for that gene. A traditional GLM fails to capture that nonlinearity, and while a GAM fits the trend smoothly, it seems to overfit the dynamics near the boundaries of pseudotime - [a known issue with additive models](https://www.routledge.com/Generalized-Additive-Models-An-Introduction-with-R-Second-Edition/Wood/p/book/9781498728331). Only the `scLANE` model accurately models the rapid upregulation and equally swift downregulation of the transcription factor neurogenin-3 (*Neurog3*) over pseudotime, in addition to identifying the cutpoint in pseudotime at which downregulation begins.
```{r}
#| code-fold: true
#| fig-cap: Modeling framework comparison.
#| fig-width: 9
#| fig-height: 5
#| label: fig-model_comp
p5 <- plotModels(scLANE_models,
gene = "Neurog3",
pt = sling_pt,
expr.mat = seu,
size.factor.offset = cell_offset,
plot.glm = TRUE,
plot.gam = TRUE) +
scale_color_manual(values = c("forestgreen"))
p5
```
# Downstream analysis {#sec-analysis}
## Gene dynamics plots
Using the `getFittedValues()` function allows us to generate predictions from the models we fit, which we then use to visualize the dynamics of a few genes that are known to be strongly associated with the differentiation of immature progenitors into mature endocrine phenotypes ([source 1](https://doi.org/10.1038/s41467-018-06176-3), [source 2](https://doi.org/10.1242/dev.173849)). For all four genes, the fitted models show knots chosen in the area of pseudotime around the pre-endocrine cells. This tells us that these driver genes are being upregulated in precursor celltypes & are driving differentiation into the mature celltypes such as alpha & beta cells, after which the genes are downregulated.
```{r}
#| code-fold: true
#| fig-cap: scLANE models of endocrinogenesis drivers.
#| fig-width: 12
#| fig-height: 6
#| label: fig-model_scLANE
p6 <- getFittedValues(scLANE_models,
genes = c("Chga", "Chgb", "Fev", "Cck"),
pt = sling_pt,
expr.mat = seu,
size.factor.offset = cell_offset,
cell.meta.data = select(seu@meta.data, celltype, celltype_coarse)) %>%
ggplot(aes(x = pt, y = rna_log1p)) +
facet_wrap(~gene,
ncol = 2,
scales = "free_y") +
geom_point(aes(color = celltype),
size = 2,
alpha = 0.75,
stroke = 0) +
geom_vline(data = data.frame(gene = "Chga", knot = unique(scLANE_models$Chga$Lineage_A$MARGE_Slope_Data$Breakpoint)),
mapping = aes(xintercept = knot),
linetype = "dashed",
color = "grey20") +
geom_vline(data = data.frame(gene = "Chgb", knot = unique(scLANE_models$Chgb$Lineage_A$MARGE_Slope_Data$Breakpoint)),
mapping = aes(xintercept = knot),
linetype = "dashed",
color = "grey20") +
geom_vline(data = data.frame(gene = "Cck", knot = unique(scLANE_models$Cck$Lineage_A$MARGE_Slope_Data$Breakpoint)),
mapping = aes(xintercept = knot),
linetype = "dashed",
color = "grey20") +
geom_vline(data = data.frame(gene = "Fev", knot = unique(scLANE_models$Fev$Lineage_A$MARGE_Slope_Data$Breakpoint)),
mapping = aes(xintercept = knot),
linetype = "dashed",
color = "grey20") +
geom_ribbon(aes(ymin = scLANE_ci_ll_log1p, ymax = scLANE_ci_ul_log1p),
linewidth = 0,
fill = "grey70",
alpha = 0.9) +
geom_line(aes(y = scLANE_pred_log1p),
color = "black",
linewidth = 0.75) +
scale_x_continuous(labels = scales::label_number(accuracy = 0.01)) +
scale_color_manual(values = palette_celltype) +
labs(x = "Pseudotime",
y = "Normalized Expression") +
theme_scLANE() +
theme(legend.title = element_blank(),
strip.text.x = element_text(face = "italic")) +
guide_umap()
p6
```
On the other hand, if we use additive models the "peak" of expression is placed among the mature endocrine celltypes - which doesn't make biological sense if we know that these genes are driving that process of differentiation. This can of course be tweaked by changing the degree or degrees of freedom of the underlying basis spline, but choosing a "best" value for those hyperparameters can be difficult, whereas `scLANE` identifies optimal parameters internally by default. In addition, the knots chosen by `scLANE` for each gene can be informative with respect to the underlying biology, whereas the knots from GAMs are evenly spaced at quantiles & carry no biological interpretation.
```{r}
#| code-fold: true
#| fig-cap: Additive models of endocrinogenesis drivers.
#| fig-width: 12
#| fig-height: 6
#| label: fig-model_GAM
p7 <- getFittedValues(scLANE_models,
genes = c("Chga", "Chgb", "Fev", "Cck"),
pt = sling_pt,
expr.mat = seu,
size.factor.offset = cell_offset,
cell.meta.data = select(seu@meta.data, celltype, celltype_coarse)) %>%
mutate(rna_raw = rna / size_factor, .before = 7) %>%
with_groups(gene,
mutate,
GAM_fitted_link = predict(nbGAM(expr = rna_raw,
pt = sling_pt,
Y.offset = cell_offset,
spline.df = 3)),
GAM_se_link = predict(nbGAM(expr = rna_raw,
pt = sling_pt,
Y.offset = cell_offset,
spline.df = 3), se.fit = TRUE)[[2]]) %>%
mutate(GAM_pred = exp(GAM_fitted_link) * cell_offset,
GAM_ci_ll = exp(GAM_fitted_link - qnorm(0.975) * GAM_se_link) * cell_offset,
GAM_ci_ul = exp(GAM_fitted_link + qnorm(0.975) * GAM_se_link) * cell_offset,
GAM_pred_log1p = log1p(GAM_pred),
GAM_ci_ll_log1p = log1p(GAM_ci_ll),
GAM_ci_ul_log1p = log1p(GAM_ci_ul)) %>%
ggplot(aes(x = pt, y = rna_log1p)) +
facet_wrap(~gene,
ncol = 2,
scales = "free_y") +
geom_point(aes(color = celltype),
size = 2,
alpha = 0.75,
stroke = 0) +
geom_ribbon(aes(ymin = GAM_ci_ll_log1p, ymax = GAM_ci_ul_log1p),
linewidth = 0,
fill = "grey70",
alpha = 0.9) +
geom_line(aes(y = GAM_pred_log1p),
color = "black",
linewidth = 0.75) +
scale_x_continuous(labels = scales::label_number(accuracy = 0.01)) +
scale_color_manual(values = palette_celltype) +
labs(x = "Pseudotime",
y = "Normalized Expression") +
theme_scLANE() +
theme(legend.title = element_blank(),
strip.text.x = element_text(face = "italic")) +
guide_umap()
p7
```
## Distribution of knot locations
Let's take a broader view of the dataset by examining the distribution of adaptively chosen knots from our models. We limit the analysis to the set of `r sum(scLANE_res_tidy$Gene_Dynamic_Overall)` genes determined to be dynamic.
```{r}
dyn_genes <- filter(scLANE_res_tidy, Gene_Dynamic_Overall == 1) %>%
pull(Gene)
knot_df <- getKnotDist(scLANE_models, dyn_genes)
```
We'll plot a histogram of the knot values along with a ridgeplot of the pseudotime distribution for each celltype. We see that a large number of the selected knots are placed at the beginning of the trajectory, around where the ductal cells transition into endocrine progenitors. A smaller set of knots is placed about halfway through the trajectory, which we've annotated as the point at which pre-endocrine cells begin differentiating into mature endocrine phenotypes.
```{r, message=FALSE, warning=FALSE}
#| code-fold: true
#| fig-cap: Distribution of adaptively-chosen knots from scLANE.
#| fig-width: 9
#| fig-height: 6
#| label: fig-knot_dist
p8 <- ggplot(knot_df, aes(x = knot)) +
geom_histogram(aes(y = after_stat(density)),
color = "black",
fill = "white",
linewidth = 0.5) +
geom_density(fill = "deepskyblue3",
alpha = 0.5,
color = "deepskyblue4",
linewidth = 1) +
scale_x_continuous(limits = c(0, 1), labels = scales::label_number(accuracy = 0.01)) +
labs(x = "Pseudotime") +
theme_scLANE() +
theme(axis.ticks.y = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank())
p9 <- data.frame(celltype = seu$celltype,
pt = seu$sling_pt) %>%
ggplot(aes(x = pt, y = celltype, fill = celltype, color = celltype)) +
ggridges::geom_density_ridges(alpha = 0.75, size = 1, scale = 0.95) +
scale_x_continuous(labels = scales::label_number(accuracy = 0.01), limits = c(0, 1)) +
scale_fill_manual(values = palette_celltype) +
scale_color_manual(values = palette_celltype) +
labs(x = "Pseudotime") +
theme_scLANE() +
theme(axis.title.y = element_blank(),
legend.title = element_blank()) +
guide_umap()
p10 <- (p8 / p9) +
plot_layout(heights = c(1/4, 3/4), axes = "collect")
p10
```
## Dynamic gene clustering
We can extract a matrix of fitted dynamics using `smoothedCountsMatrix()`. Next, the `embedGenes()` function reduces dimensionality with PCA, clusters the genes using the Leiden algorithm, & embeds the genes in two dimensions with UMAP.
```{r}
#| message: false
#| warning: false
smoothed_counts <- smoothedCountsMatrix(scLANE_models,
pt = sling_pt,
genes = dyn_genes,
size.factor.offset = cell_offset)
gene_embed <- embedGenes(log1p(smoothed_counts$Lineage_A), resolution.param = 0.2)
```
First we'll visualize the gene clusters on the first two PCs.
```{r}
#| code-fold: true
#| fig-cap: Unsupervised clustering of genes in PCA space.
#| label: fig-gene_pca
p11 <- ggplot(gene_embed, aes(x = pc1, y = pc2, color = leiden)) +
geom_point(size = 2,
alpha = 0.75,
stroke = 0) +
labs(x = "PC 1",
y = "PC 2",
color = "Leiden Cluster") +
paletteer::scale_color_paletteer_d("ggsci::default_igv") +
theme_scLANE(umap = TRUE) +
guide_umap()
p11
```
The UMAP embedding shows that even with the relatively small number of genes, clear patterns are visible.
```{r}
#| code-fold: true
#| fig-cap: Unsupervised clustering of genes in UMAP space.
#| label: fig-gene_umap
p12 <- ggplot(gene_embed, aes(x = umap1, y = umap2, color = leiden)) +
geom_point(size = 2,
alpha = 0.75,
stroke = 0) +
labs(x = "UMAP 1",
y = "UMAP 2",
color = "Gene Cluster") +
paletteer::scale_color_paletteer_d("ggsci::default_igv") +
theme_scLANE(umap = TRUE) +
guide_umap()
p12
```
## Expression cascades
We can also plot a heatmap of the dynamic genes; this requires a bit of setup, for which we'll use the `ComplexHeatmap` package. We scale each gene, and clip values to be on $[-6, 6]$. The columns (cells) of the heatmap are ordered by estimated pseudotime, and the rows (genes) are ordered by expression peak.
```{r}
col_anno_df <- select(seu@meta.data,
cell_name,
celltype,
sling_pt) %>%
mutate(celltype = as.factor(celltype)) %>%
arrange(sling_pt)
gene_order <- sortGenesHeatmap(smoothed_counts$Lineage_A, pt.vec = sling_pt$PT)
heatmap_mat <- t(scale(smoothed_counts$Lineage_A))
heatmap_mat[heatmap_mat > 6] <- 6
heatmap_mat[heatmap_mat < -6] <- -6
colnames(heatmap_mat) <- seu$cell_name
heatmap_mat <- heatmap_mat[, col_anno_df$cell_name]
heatmap_mat <- heatmap_mat[gene_order, ]
palette_celltype_hm <- as.character(palette_celltype[1:length(unique(seu$celltype))])
names(palette_celltype_hm) <- levels(col_anno_df$celltype)
col_anno <- HeatmapAnnotation(Celltype = col_anno_df$celltype,
Pseudotime = col_anno_df$sling_pt,
col = list(Celltype = palette_celltype_hm,
Pseudotime = circlize::colorRamp2(seq(0, 1, by = 0.25), palette_heatmap)),
show_legend = TRUE,
show_annotation_name = FALSE,
gap = unit(1, "mm"),
border = TRUE)
palette_cluster_hm <- as.character(paletteer::paletteer_d("ggsci::default_igv")[1:length(unique(gene_embed$leiden))])
names(palette_cluster_hm) <- as.character(unique(gene_embed$leiden))
row_anno <- HeatmapAnnotation(Cluster = as.factor(gene_embed$leiden),
col = list(Cluster = palette_cluster_hm),
show_legend = TRUE,
show_annotation_name = FALSE,
annotation_legend_param = list(title = "Gene\nCluster"),
gap = unit(1, "mm"),
border = TRUE,
which = "row")
```
The heatmap shows clear dynamic patterns across pseudotime; these patterns are often referred to as *expression cascades*, and represent periodic up- and down-regulation of different gene programs during the course of the underlying biological process.
```{r}
#| code-fold: true
#| message: false
#| fig-width: 12
#| fig-height: 6
#| fig-cap: Expression cascades of dynamic genes.
#| label: fig-exp_cascade
Heatmap(matrix = heatmap_mat,
name = "Spliced\nmRNA",
col = circlize::colorRamp2(colors = viridis::inferno(50),
breaks = seq(min(heatmap_mat), max(heatmap_mat), length.out = 50)),
cluster_columns = FALSE,
width = 12,
height = 6,
column_title = "",
cluster_rows = FALSE,
top_annotation = col_anno,
left_annotation = row_anno,
show_column_names = FALSE,
show_row_names = FALSE,
use_raster = TRUE,
raster_by_magick = TRUE,
raster_quality = 5)
```
## Gene programs
Using our gene clusters & the `gprofiler2` package, we run an enrichment analysis against [the biological process (BP) set of gene ontologies](http://geneontology.org/docs/ontology-documentation/). We make sure to order the genes in each cluster by their test statistics by joining to the results table from `scLANE`.
```{r}
gene_clust_list <- purrr::map(unique(gene_embed$leiden), \(x) {
filter(gene_embed, leiden == x) %>%
inner_join(scLANE_res_tidy, by = c("gene" = "Gene")) %>%
arrange(desc(Test_Stat)) %>%
pull(gene)
})
names(gene_clust_list) <- paste0("Leiden_", unique(gene_embed$leiden))
enrich_res <- gprofiler2::gost(gene_clust_list,
organism = "mmusculus",
ordered_query = TRUE,
multi_query = FALSE,
sources = "GO:BP",
significant = TRUE)
```
A look at the top 5 most-significant GO terms for each gene cluster reveals heterogeneous functionalities across groups of genes.
```{r}
#| code-fold: true
#| tbl-cap: The top-5 biological process GO terms per cluster.
#| label: tbl-go_bp
mutate(enrich_res$result,
query = gsub("Leiden_", "", query)) %>%
rename(cluster = query) %>%
with_groups(cluster,
slice_head,
n = 5) %>%
select(cluster,
term_name,
p_value,
term_size,
query_size,
intersection_size,
term_id) %>%
kableExtra::kbl(digits = 3,
booktabs = TRUE,
col.names = c("Gene Cluster", "Term Name", "Adj. P-value", "Term Size",
"Query Size", "Intersection Size", "Term ID")) %>%
kableExtra::kable_classic(c("hover"), full_width = FALSE)
```
We can use the `geneProgramScoring()` function to add module scores for each gene cluster to our `Seurat` object.
```{r}
peptide_program <- filter(enrich_res$result, term_name == "peptide hormone secretion") %>%
arrange(p_value) %>%
slice_head(n = 1) %>%
pull(query)
peptide_program_name <- gsub("Leiden_", "cluster_", peptide_program)
seu <- geneProgramScoring(seu,
genes = gene_embed$gene,
gene.clusters = gene_embed$leiden)
```
Visualizing the scores on our UMAP embedding shows us that the peptide program is highly-enriched only in mature endocrine cells. This makes sense biologically as mature endocrine celltypes' primary roles are to produce peptides such as glucagon (alpha cells), insulin (beta cells), somatostatin (ductal cells), and pancreatic polypeptide (gamma cells).
```{r}
#| code-fold: true
#| fig-cap: Enrichment of peptide production gene program.
#| label: fig-peptide_enr
p13 <- Embeddings(seu, "umap") %>%
as.data.frame() %>%
magrittr::set_colnames(c("UMAP_1", "UMAP_2")) %>%
mutate(peptide_program_score = seu@meta.data[, peptide_program_name]) %>%
ggplot(aes(x = UMAP_1, y = UMAP_2, color = peptide_program_score)) +
geom_point(size = 1.5, alpha = 0.75, stroke = 0) +
labs(x = "UMAP 1",
y = "UMAP 2",
color = "Program Score") +
scale_color_gradientn(colors = palette_heatmap,
labels = scales::label_number(accuracy = 0.1)) +
theme_scLANE(umap = TRUE)
p14 <- (p13 / p1) +
plot_layout(guides = "collect", axes = "collect")
p14
```
We can also visualize the trend in the peptide program scores over time, which confirms the biological conclusions we came to by inspecting the UMAP.
```{r}
#| code-fold: true
#| message: false
#| fig-cap: Peptide production gene program scores over pseudotime.
#| label: fig-peptide_smoother
p15 <- data.frame(PT = sling_pt$PT,
peptide_program_score = seu@meta.data[, peptide_program_name],
celltype = seu$celltype) %>%
ggplot(aes(x = PT, y = peptide_program_score, color = celltype)) +
geom_point(alpha = 0.75,
stroke = 0,
size = 2) +
geom_smooth(color = "black",
method = "loess",
linewidth = 0.75) +
scale_color_manual(values = palette_celltype) +
labs(x = "Pseudotime", y = "Peptide Program Score") +
theme_scLANE() +
theme(legend.title = element_blank()) +
guide_umap()
p15
```
Next, in order to identify which genes are "drivers" of a certain gene program, we can use the `geneProgramDrivers()` function to correlate normalized expression with program scores.
```{r}
program_drivers <- geneProgramDrivers(seu,
genes = gene_embed$gene,
gene.program = seu@meta.data[, peptide_program_name],
verbose = FALSE)
```
We display the top 10 most-correlated genes here. For example, carboxypeptidase E (*Cpe*) is involved in the production of insulin ([source](https://doi.org/10.1073/pnas.1323066111)).
```{r}
#| code-fold: true
#| tbl-cap: The top-10 driver genes for the peptide production gene program.
#| label: tbl-driver_genes
slice_head(program_drivers, n = 10) %>%
kableExtra::kbl(digits = 3,
booktabs = TRUE,
row.names = FALSE,
col.names = c("Gene", "Correlation", "P-value", "Adj. P-value")) %>%
kableExtra::kable_classic(c("hover"), full_width = FALSE)
```
Indeed, normalized expression of *Cpe* is high across all mature endocrine celltypes, with alpha and beta cells showing the highest overall mean expression.
```{r}
#| code-fold: true
#| message: false
#| fig-cap: Beeswarm plots of driver gene expression by celltype.
#| label: fig-driver_gene_exp
#| fig-width: 9
#| fig-height: 5
p16 <- data.frame(celltype = seu$celltype,
rna = seu@assays$spliced@data["Cpe", ]) %>%
ggplot(aes(x = celltype, y = rna, color = celltype)) +
ggbeeswarm::geom_quasirandom(alpha = 0.75,
size = 2,
stroke = 0,
show.legend = FALSE) +
scale_color_manual(values = palette_celltype) +
stat_summary(fun = "mean",
geom = "point",
color = "black",
size = 3) +
labs(y = "Normalized Expression") +
theme_scLANE() +
theme(axis.title.x = element_blank())
p16
```
## Dynamic gene enrichment
Lastly, we perform an enrichment analysis for our set of dynamic genes. We'll focus on terms from the GO biological process (BP) set, as those are generally the easiest to interpret.
```{r}
dyn_gene_enrich <- enrichDynamicGenes(scLANE_res_tidy, species = "Mmusculus")
dyn_go_bp_terms <- filter(dyn_gene_enrich$result,
source == "GO:BP",
p_value < 0.05)
```
Overall, there are `r length(unique(dyn_go_bp_terms$term_id))` unique significantly-enriched GO:BP terms for our trajectory.
```{r}
#| code-fold: true
#| tbl-cap: Random sample of the biological processes enriched for the dynamic gene set.
#| label: tbl-bp_terms
select(dyn_go_bp_terms, term_name, p_value, source) %>%
slice_sample(n = 10) %>%
kableExtra::kbl(digits = 3,
booktabs = TRUE,
row.names = FALSE,
col.names = c("Term", "P-value", "Source")) %>%
kableExtra::kable_classic(c("hover"), full_width = FALSE)
```
# Conclusions {#sec-conclusions}
Hopefully this vignette has been a useful introduction to running the `scLANE` software and using its outputs to help better understand biology at single-cell resolution. If you have questions about how the models work or are interpreted, software issues, or simply want to compare results feel free to open an issue on [the GitHub repository](https://github.com/jr-leary7/scLANE) or reach out via email to <j.leary@ufl.edu>.
# References {#sec-refs}
1. Bastidas-Ponce, Aimée *et al*. [Comprehensive single cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis](https://doi.org/10.1242/dev.173849). *Development* (2019).
2. Street, Kelly *et al*. [Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics](https://doi.org/10.1186/s12864-018-4772-0). *BMC Genomics* (2018).
3. Stoklosa, Jakub & David Warton. [A generalized estimating equation approach to multivariate adaptive regression splines](https://doi.org/10.1080/10618600.2017.1360780). *Journal of Computational and Graphical Statistics* (2018).
# Session Info {#sec-SI}
```{r}
sessioninfo::session_info()
```