Code
:: install_github("jr-leary7/SCISSORS") remotes
SCISSORS
April 16, 2024
In this tutorial we’ll walk through a basic single cell analysis, with a focus on fine-tuning clustering results using the SCISSORS
package, which I wrote during my time at UNC Chapel Hill.
If you haven’t already, install the development version (currently v1.2.0) of SCISSORS
from the GitHub repository.
Next, we’ll load the packages we need to process our single cell data, recluster the cells, and visualize the results.
We’ll load in the well-known PBMC3k data from 10X Genomics, which is often used in example workflows such as the Seurat
clustering vignette and the scanpy
vignette. If you haven’t already downloaded the dataset, this function will download the raw data for you and load it into your R session.
We’ll do some minor quality-control checking first by filtering out cells with a high percentage of mitochondrial reads or very low or high numbers of detected genes.
We’ll process the raw counts in the usual fashion: QC, normalization, identification of highly variable genes (HVGs), linear & non-linear dimension reduction, and a broad clustering that will (hopefully) capture our major celltypes. When computing the shared nearest-neighbor (SNN) graph, we use the heuristic \(k = \sqrt{n}\) for the number of nearest-neighbors to consider for each cell. This ensures that the clustering will be broad i.e., a smaller number of large clusters will be returned instead of a larger number of small clusters.
pbmc <- NormalizeData(pbmc,
normalization.method = "LogNormalize",
verbose = FALSE) %>%
FindVariableFeatures(selection.method = "vst",
nfeatures = 3000,
verbose = FALSE) %>%
CellCycleScoring(s.features = cc.genes.updated.2019$s.genes,
g2m.features = cc.genes.updated.2019$g2m.genes,
set.ident = FALSE) %>%
AddMetaData(metadata = c(.$S.Score - .$G2M.Score), col.name = "CC_difference") %>%
ScaleData(vars.to.regress = "CC_difference", verbose = FALSE) %>%
RunPCA(features = VariableFeatures(.),
npcs = 50,
verbose = FALSE,
seed.use = 312) %>%
RunUMAP(reduction = "pca",
dims = 1:20,
n.components = 2,
metric = "cosine",
seed.use = 312,
verbose = FALSE) %>%
FindNeighbors(reduction = "pca",
dims = 1:20,
k.param = sqrt(ncol(.)),
nn.method = "annoy",
annoy.metric = "cosine",
verbose = FALSE) %>%
FindClusters(resolution = 0.3,
random.seed = 312,
verbose = FALSE)
Let’s visualize the principal components. Notable genes in PC 1 include MALAT1, high abundance of which is a common artifact of 10X-sequenced data. PC 2 seems to separate NK cells (NKG7, GZMB) and myeloid cells (HLA-DRA, CD79A). PC 3 is composed of variation that could originate from platelets (PPBP). PCs 4-6 look like they separate several types of monocytic, T, NK, and dendritic cells.
We visualize the Louvain clustering via a UMAP plot. We see 5 major clusters, which we’ll annotate next.
First we identify CD8+ T-cells via CD8A, and CD4+ T-cells with IL7R. Lastly, FCGR3A (aka CD16) is specific to CD16+ monocytes. We can combine the plots using the excellent patchwork
package.
p1 <- FeaturePlot(pbmc, features = "CD8A", pt.size = 1) +
scale_color_gradientn(colours = paletteer_d("wesanderson::Zissou1")) +
labs(x = "UMAP 1", y = "UMAP 2") +
theme(axis.ticks = element_blank(),
axis.text = element_blank())
p2 <- FeaturePlot(pbmc, features = "IL7R", pt.size = 1) +
scale_color_gradientn(colours = paletteer_d("wesanderson::Zissou1")) +
labs(x = "UMAP 1", y = "UMAP 2") +
theme(axis.ticks = element_blank(),
axis.text = element_blank())
p1 / p2
Next, we use HLA-DRA to broadly identify monocytic cells, and FCGR3A (aka CD16) to single out the CD16+ monocytes.
p1 <- FeaturePlot(pbmc, features = "HLA-DRA", pt.size = 1) +
scale_color_gradientn(colours = paletteer_d("wesanderson::Zissou1")) +
labs(x = "UMAP 1", y = "UMAP 2") +
theme(axis.ticks = element_blank(),
axis.text = element_blank())
p2 <- FeaturePlot(pbmc, features = "FCGR3A", pt.size = 1) +
scale_color_gradientn(colours = paletteer_d("wesanderson::Zissou1")) +
labs(x = "UMAP 1", y = "UMAP 2") +
theme(axis.ticks = element_blank(),
axis.text = element_blank())
p1 / p2
Lastly, abundance of MS4A1 points out a cluster of B cells.
We’ll add broad celltype labels to our object’s metadata.
pbmc@meta.data <- mutate(pbmc@meta.data,
broad_celltype = case_when(seurat_clusters == 0 ~ "CD4+ T",
seurat_clusters == 1 ~ "Monocyte",
seurat_clusters == 2 ~ "CD8+ T",
seurat_clusters == 3 ~ "B",
seurat_clusters == 4 ~ "CD16+ Monocyte",
TRUE ~ NA_character_),
broad_celltype = factor(broad_celltype, levels = c("CD4+ T",
"Monocyte",
"CD8+ T",
"B",
"CD16+ Monocyte")))
And visualize the results.
From the plot above, there appears to be some visible subgroups in the monocyte cluster. With that being said - I would generally be very cautious about using UMAPs alone to define heterogeneous groups. In general, I would suggest using something like silhouette score distributions, other clustering statistics, or biological knowledge to determine subclustering targets. We can do this below using ComputeSilhouetteScores()
, which returns a silhouette score for each individual cell. Visualizing the results can help us identify which clusters are “poor” fits. For more information, check out the Wikipedia article on clustering scores.
We can see that the B cell and CD16+ monocyte clusters seem to be well-fit, but the other clusters are less so. We’ll focus on the other monocyte cluster, as it seems to have the highest variance.
sil_scores %>%
left_join(distinct(pbmc@meta.data, seurat_clusters, broad_celltype),
by = c("Cluster" = "seurat_clusters")) %>%
ggplot(aes(x = broad_celltype, y = Score, fill = broad_celltype)) +
geom_violin(scale = "width",
color = "black",
draw_quantiles = 0.5,
size = 0.75) +
scale_fill_paletteer_d("ggsci::nrc_npg") +
labs(y = "Silhouette Score", fill = "Broad Celltype") +
theme_classic(base_size = 14) +
theme(panel.grid.major.y = element_line(),
axis.title.x = element_blank())
[1] "Reclustering cells in cluster 1 using k = 20 & resolution = 0.2; S = 0.419"
Let’s check out the UMAP embedding:
Highly-specific abundance of FCER1A allows us to identify the dendritic cells in cluster 2.
data.frame(exp = mono_reclust@assays$RNA@data["FCER1A", ],
label = mono_reclust$seurat_clusters) %>%
ggplot(aes(x = label, y = exp, fill = label)) +
geom_violin(scale = "width",
color = "black",
draw_quantiles = 0.5,
size = 0.75) +
scale_fill_paletteer_d("MetBrewer::Egypt") +
labs(y = "FCER1A", fill = "Subcluster") +
theme_classic(base_size = 14) +
theme(panel.grid.major.y = element_line(),
axis.title.x = element_blank())
Both cluster 0 & cluster 1 seem to be CD14+, and cluster 1 appears to have slightly higher (but still low) abundance of FCGR3A.
p1 <- data.frame(exp = mono_reclust@assays$RNA@data["CD14", ],
label = mono_reclust$seurat_clusters) %>%
filter(label %in% c(0, 1)) %>%
ggplot(aes(x = label, y = exp, fill = label)) +
geom_violin(scale = "width",
color = "black",
draw_quantiles = 0.5,
size = 0.75) +
scale_fill_paletteer_d("MetBrewer::Egypt") +
labs(y = "CD14", fill = "Subcluster") +
theme_classic(base_size = 14) +
theme(panel.grid.major.y = element_line(),
axis.title.x = element_blank())
p2 <- data.frame(exp = mono_reclust@assays$RNA@data["FCGR3A", ],
label = mono_reclust$seurat_clusters) %>%
filter(label %in% c(0, 1)) %>%
ggplot(aes(x = label, y = exp, fill = label)) +
geom_violin(scale = "width",
color = "black",
draw_quantiles = 0.5,
size = 0.75) +
scale_fill_paletteer_d("MetBrewer::Egypt") +
labs(y = "FCGR3A", fill = "Subcluster") +
theme_classic(base_size = 14) +
theme(panel.grid.major.y = element_line(),
axis.title.x = element_blank())
p1 / p2
From Kapellos et al (2019), we know that intermediate monocytes have high abundance of CD14, low but non-zero abundance of CD16 (which again is denoted FCGR3A in this dataset), and can be identified through higher abundance of other markers like HLA-DPB1 and CD74 in comparison to CD14+ monocytes. With all this information, we’ll conclude that cluster 0 is likely composed of CD14+ monocytes and cluster 1 of intermediate monocytes.
p1 <- data.frame(exp = mono_reclust@assays$RNA@data["HLA-DQB1", ],
label = mono_reclust$seurat_clusters) %>%
filter(label %in% c(0, 1)) %>%
ggplot(aes(x = label, y = exp, fill = label)) +
geom_violin(scale = "width",
color = "black",
draw_quantiles = 0.5,
size = 0.75) +
scale_fill_paletteer_d("MetBrewer::Egypt") +
labs(y = "HLA-DQB1", fill = "Subcluster") +
theme_classic(base_size = 14) +
theme(panel.grid.major.y = element_line(),
axis.title.x = element_blank())
p2 <- data.frame(exp = mono_reclust@assays$RNA@data["CD74", ],
label = mono_reclust$seurat_clusters) %>%
filter(label %in% c(0, 1)) %>%
ggplot(aes(x = label, y = exp, fill = label)) +
geom_violin(scale = "width",
color = "black",
draw_quantiles = 0.5,
size = 0.75) +
scale_fill_paletteer_d("MetBrewer::Egypt") +
labs(y = "CD74", fill = "Subcluster") +
theme_classic(base_size = 14) +
theme(panel.grid.major.y = element_line(),
axis.title.x = element_blank())
p1 / p2
Lastly, we can tell that cluster 3 is composed of platelets thanks to high abundance of PPBP.
data.frame(exp = mono_reclust@assays$RNA@data["PPBP", ],
label = mono_reclust$seurat_clusters) %>%
ggplot(aes(x = label, y = exp, fill = label)) +
geom_violin(scale = "width",
color = "black",
draw_quantiles = 0.5,
size = 0.75) +
scale_fill_paletteer_d("MetBrewer::Egypt") +
labs(y = "PPBP", fill = "Subcluster") +
theme_classic(base_size = 14) +
theme(panel.grid.major.y = element_line(),
axis.title.x = element_blank())
We can add the new subcluster labels back in to our original object using IntegrateSubclusters()
. We also add labels to the original object reflecting the subcluster annotations.
pbmc <- IntegrateSubclusters(pbmc, reclust.results = mono_reclust)
pbmc@meta.data <- mutate(pbmc@meta.data,
celltype = case_when(seurat_clusters == 0 ~ "CD4+ T",
seurat_clusters == 1 ~ "Platelet",
seurat_clusters == 2 ~ "CD8+ T",
seurat_clusters == 3 ~ "B",
seurat_clusters == 4 ~ "CD16+ Monocyte",
seurat_clusters == 5 ~ "CD14+ Monocyte",
seurat_clusters == 6 ~ "Intermediate Monocyte",
seurat_clusters == 7 ~ "Dendritic Cell",
TRUE ~ NA_character_))
Here’s the final celltype annotations on our original UMAP embedding.
We were able to use SCISSORS
to estimate a biologically meaningful subclustering that is to some degree supported by canonical marker genes from the literature. I’ll note that while these monocyte subtype annotations might be useful, we didn’t analyze the entire dataset; those familiar with the PBMC3k dataset will know that what we labelled the CD8+ T cell cluster actually also contains NK cells. This analysis is not meant to be exhaustive or final, and serves mostly to show how and why SCISSORS
is used. Please reach out to me with questions on the package via email, or by opening an issueon the GitHub repository.
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.2.1 (2022-06-23)
os macOS Big Sur ... 10.16
system x86_64, darwin17.0
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2023-11-02
pandoc 2.19.2 @ /usr/local/bin/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
abind 1.4-5 2016-07-21 [1] CRAN (R 4.2.0)
AnnotationDbi 1.58.0 2022-04-26 [1] Bioconductor
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0)
Biobase 2.56.0 2022-04-26 [1] Bioconductor
BiocFileCache 2.4.0 2022-04-26 [1] Bioconductor
BiocGenerics 0.42.0 2022-04-26 [1] Bioconductor
biomaRt 2.52.0 2022-04-26 [1] Bioconductor
Biostrings 2.64.1 2022-08-18 [1] Bioconductor
bit 4.0.4 2020-08-04 [1] CRAN (R 4.2.0)
bit64 4.0.5 2020-08-30 [1] CRAN (R 4.2.0)
bitops 1.0-7 2021-04-24 [1] CRAN (R 4.2.0)
blob 1.2.3 2022-04-10 [1] CRAN (R 4.2.0)
cachem 1.0.6 2021-08-19 [1] CRAN (R 4.2.0)
cli 3.6.1 2023-03-23 [1] CRAN (R 4.2.0)
cluster 2.1.4 2022-08-22 [1] CRAN (R 4.2.0)
codetools 0.2-18 2020-11-04 [1] CRAN (R 4.2.1)
colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.0)
coop 0.6-3 2021-09-19 [1] CRAN (R 4.2.0)
cowplot 1.1.1 2020-12-30 [1] CRAN (R 4.2.0)
crayon 1.5.1 2022-03-26 [1] CRAN (R 4.2.0)
curl 4.3.2 2021-06-23 [1] CRAN (R 4.2.0)
data.table 1.14.2 2021-09-27 [1] CRAN (R 4.2.0)
DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.0)
dbplyr 2.2.1 2022-06-27 [1] CRAN (R 4.2.0)
deldir 1.0-6 2021-10-23 [1] CRAN (R 4.2.0)
digest 0.6.29 2021-12-01 [1] CRAN (R 4.2.0)
doParallel 1.0.17 2022-02-07 [1] CRAN (R 4.2.0)
dplyr * 1.0.9 2022-04-28 [1] CRAN (R 4.2.0)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
evaluate 0.16 2022-08-09 [1] CRAN (R 4.2.0)
fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.2.0)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
filelock 1.0.2 2018-10-05 [1] CRAN (R 4.2.0)
fitdistrplus 1.1-8 2022-03-10 [1] CRAN (R 4.2.0)
foreach 1.5.2 2022-02-02 [1] CRAN (R 4.2.0)
future 1.27.0 2022-07-22 [1] CRAN (R 4.2.0)
future.apply 1.9.0 2022-04-25 [1] CRAN (R 4.2.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0)
GenomeInfoDb 1.32.3 2022-08-09 [1] Bioconductor
GenomeInfoDbData 1.2.8 2022-08-29 [1] Bioconductor
ggplot2 * 3.4.2 2023-04-03 [1] CRAN (R 4.2.0)
ggrepel 0.9.1 2021-01-15 [1] CRAN (R 4.2.0)
ggridges 0.5.3 2021-01-08 [1] CRAN (R 4.2.0)
globals 0.16.1 2022-08-28 [1] CRAN (R 4.2.1)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
goftest 1.2-3 2021-10-07 [1] CRAN (R 4.2.0)
gridExtra 2.3 2017-09-09 [1] CRAN (R 4.2.0)
gtable 0.3.0 2019-03-25 [1] CRAN (R 4.2.0)
hms 1.1.2 2022-08-19 [1] CRAN (R 4.2.0)
htmltools 0.5.3 2022-07-18 [1] CRAN (R 4.2.0)
htmlwidgets 1.5.4 2021-09-08 [1] CRAN (R 4.2.0)
httpuv 1.6.5 2022-01-05 [1] CRAN (R 4.2.0)
httr 1.4.4 2022-08-17 [1] CRAN (R 4.2.0)
ica 1.0-3 2022-07-08 [1] CRAN (R 4.2.0)
igraph 1.3.4 2022-07-19 [1] CRAN (R 4.2.0)
IRanges 2.30.1 2022-08-18 [1] Bioconductor
irlba 2.3.5 2021-12-06 [1] CRAN (R 4.2.0)
iterators 1.0.14 2022-02-05 [1] CRAN (R 4.2.0)
jsonlite 1.8.0 2022-02-22 [1] CRAN (R 4.2.0)
KEGGREST 1.36.3 2022-07-14 [1] Bioconductor
KernSmooth 2.23-20 2021-05-03 [1] CRAN (R 4.2.1)
knitr 1.40 2022-08-24 [1] CRAN (R 4.2.0)
labeling 0.4.2 2020-10-20 [1] CRAN (R 4.2.0)
later 1.3.0 2021-08-18 [1] CRAN (R 4.2.0)
lattice 0.20-45 2021-09-22 [1] CRAN (R 4.2.1)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.2.0)
leiden 0.4.2 2022-05-09 [1] CRAN (R 4.2.0)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0)
listenv 0.8.0 2019-12-05 [1] CRAN (R 4.2.0)
lmtest 0.9-40 2022-03-21 [1] CRAN (R 4.2.0)
magrittr * 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
MASS 7.3-58.1 2022-08-03 [1] CRAN (R 4.2.0)
Matrix 1.4-1 2022-03-23 [1] CRAN (R 4.2.1)
matrixStats 0.62.0 2022-04-19 [1] CRAN (R 4.2.0)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.2.0)
mgcv 1.8-40 2022-03-29 [1] CRAN (R 4.2.1)
mime 0.12 2021-09-28 [1] CRAN (R 4.2.0)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.2.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0)
nlme 3.1-159 2022-08-09 [1] CRAN (R 4.2.0)
paletteer * 1.5.0 2022-10-19 [1] CRAN (R 4.2.0)
parallelly 1.32.1 2022-07-21 [1] CRAN (R 4.2.0)
patchwork * 1.1.2 2022-08-19 [1] CRAN (R 4.2.0)
pbapply 1.5-0 2021-09-16 [1] CRAN (R 4.2.0)
pbmc3k.SeuratData * 3.1.4 2022-11-11 [1] local
phateR 1.0.7 2021-02-12 [1] CRAN (R 4.2.0)
pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
plotly 4.10.0 2021-10-09 [1] CRAN (R 4.2.0)
plyr 1.8.7 2022-03-24 [1] CRAN (R 4.2.0)
png 0.1-7 2013-12-03 [1] CRAN (R 4.2.0)
polyclip 1.10-0 2019-03-14 [1] CRAN (R 4.2.0)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.2.0)
prismatic 1.1.1 2022-08-15 [1] CRAN (R 4.2.0)
progress 1.2.2 2019-05-16 [1] CRAN (R 4.2.0)
progressr 0.10.1 2022-06-03 [1] CRAN (R 4.2.0)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.2.0)
purrr 0.3.4 2020-04-17 [1] CRAN (R 4.2.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
RANN 2.6.1 2019-01-08 [1] CRAN (R 4.2.0)
rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.2.0)
RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.2.0)
Rcpp 1.0.9 2022-07-08 [1] CRAN (R 4.2.0)
RcppAnnoy 0.0.19 2021-07-30 [1] CRAN (R 4.2.0)
RCurl 1.98-1.8 2022-07-30 [1] CRAN (R 4.2.0)
rematch2 2.1.2 2020-05-01 [1] CRAN (R 4.2.0)
reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.2.0)
reticulate 1.28 2023-01-27 [1] CRAN (R 4.2.0)
rgeos 0.5-9 2021-12-15 [1] CRAN (R 4.2.0)
rlang 1.1.1 2023-04-28 [1] CRAN (R 4.2.0)
rmarkdown 2.16 2022-08-24 [1] CRAN (R 4.2.0)
ROCR 1.0-11 2020-05-02 [1] CRAN (R 4.2.0)
rpart 4.1.16 2022-01-24 [1] CRAN (R 4.2.1)
RSQLite 2.2.16 2022-08-17 [1] CRAN (R 4.2.0)
Rtsne 0.16 2022-04-17 [1] CRAN (R 4.2.0)
S4Vectors 0.34.0 2022-04-26 [1] Bioconductor
scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.0)
scattermore 0.8 2022-02-14 [1] CRAN (R 4.2.0)
SCISSORS * 1.2.0 2023-04-10 [1] local
sctransform 0.3.4 2022-08-20 [1] CRAN (R 4.2.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
Seurat * 4.1.1 2022-05-02 [1] CRAN (R 4.2.0)
SeuratData * 0.2.2 2022-08-29 [1] Github (satijalab/seurat-data@d6a8ce6)
SeuratObject * 4.1.0 2022-05-01 [1] CRAN (R 4.2.0)
shiny 1.7.2 2022-07-19 [1] CRAN (R 4.2.0)
sp * 1.5-0 2022-06-05 [1] CRAN (R 4.2.0)
spatstat.core 2.4-4 2022-05-18 [1] CRAN (R 4.2.0)
spatstat.data 3.0-0 2022-10-21 [1] CRAN (R 4.2.0)
spatstat.geom 3.0-3 2022-10-25 [1] CRAN (R 4.2.0)
spatstat.random 3.0-1 2022-11-03 [1] CRAN (R 4.2.0)
spatstat.sparse 3.0-0 2022-10-21 [1] CRAN (R 4.2.0)
spatstat.utils 3.0-1 2022-10-19 [1] CRAN (R 4.2.0)
stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.0)
stringr 1.4.1 2022-08-20 [1] CRAN (R 4.2.0)
survival 3.4-0 2022-08-09 [1] CRAN (R 4.2.0)
tensor 1.5 2012-05-05 [1] CRAN (R 4.2.0)
tibble 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)
tidyr 1.2.0 2022-02-01 [1] CRAN (R 4.2.0)
tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.2.0)
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0)
uwot 0.1.16 2023-06-29 [1] CRAN (R 4.2.0)
vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.2.0)
viridisLite 0.4.1 2022-08-22 [1] CRAN (R 4.2.0)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
xfun 0.32 2022-08-10 [1] CRAN (R 4.2.0)
XML 3.99-0.10 2022-06-09 [1] CRAN (R 4.2.0)
xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.0)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.2.0)
XVector 0.36.0 2022-04-26 [1] Bioconductor
yaml 2.3.5 2022-02-21 [1] CRAN (R 4.2.0)
zlibbioc 1.42.0 2022-04-26 [1] Bioconductor
zoo 1.8-10 2022-04-15 [1] CRAN (R 4.2.0)
[1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
──────────────────────────────────────────────────────────────────────────────
---
title: "scRNA-seq Reclustering with `SCISSORS`"
author:
name: Jack Leary
email: j.leary@ufl.edu
orcid: 0009-0004-8821-3269
affiliations:
- name: University of Florida
department: Biostatistics
city: Gainesville
state: FL
date: today
date-format: long
format:
html:
code-fold: show
code-copy: true
code-tools: true
toc: true
embed-resources: true
fig-format: retina
df-print: kable
link-external-newwindow: true
execute:
cache: true
freeze: auto
---
# Introduction
In this tutorial we'll walk through a basic single cell analysis, with a focus on fine-tuning clustering results using the `SCISSORS` package, which I wrote during my time at UNC Chapel Hill.
# Libraries
If you haven't already, install the development version (currently v`r packageVersion("SCISSORS")`) of `SCISSORS` from [the GitHub repository](https://github.com/jr-leary7/SCISSORS).
```{r, eval=FALSE}
remotes:: install_github("jr-leary7/SCISSORS")
```
Next, we'll load the packages we need to process our single cell data, recluster the cells, and visualize the results.
```{r, message=FALSE, warning=FALSE, results='hide'}
library(dplyr) # data manipulation
library(Seurat) # scRNA-seq tools
library(ggplot2) # plot utilities
library(SCISSORS) # scRNA-seq reclustering
library(paletteer) # color palettes
library(patchwork) # plot combination
library(SeuratData) # datasets
```
# Data
We'll load in the well-known [PBMC3k data from 10X Genomics](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k), which is often used in example workflows such as the [`Seurat` clustering vignette](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html) and the [`scanpy` vignette](https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html). If you haven't already downloaded the dataset, this function will download the raw data for you and load it into your R session.
```{r}
pbmc <- LoadData("pbmc3k")
```
# Analysis
## Preprocessing
We'll do some minor quality-control checking first by filtering out cells with a high percentage of mitochondrial reads or very low or high numbers of detected genes.
```{r}
pbmc <- PercentageFeatureSet(pbmc,
pattern = "^MT-",
col.name = "percent_MT")
pbmc <- pbmc[, pbmc$nFeature_RNA >= 200 & pbmc$nFeature_RNA <= 2500 & pbmc$percent_MT <= 10]
```
We'll process the raw counts in the usual fashion: QC, normalization, identification of highly variable genes (HVGs), linear & non-linear dimension reduction, and a broad clustering that will (hopefully) capture our major celltypes. When computing the shared nearest-neighbor (SNN) graph, we use the heuristic $k = \sqrt{n}$ for the number of nearest-neighbors to consider for each cell. This ensures that the clustering will be broad i.e., a smaller number of large clusters will be returned instead of a larger number of small clusters.
```{r, results='hide', message=FALSE, warning=FALSE}
pbmc <- NormalizeData(pbmc,
normalization.method = "LogNormalize",
verbose = FALSE) %>%
FindVariableFeatures(selection.method = "vst",
nfeatures = 3000,
verbose = FALSE) %>%
CellCycleScoring(s.features = cc.genes.updated.2019$s.genes,
g2m.features = cc.genes.updated.2019$g2m.genes,
set.ident = FALSE) %>%
AddMetaData(metadata = c(.$S.Score - .$G2M.Score), col.name = "CC_difference") %>%
ScaleData(vars.to.regress = "CC_difference", verbose = FALSE) %>%
RunPCA(features = VariableFeatures(.),
npcs = 50,
verbose = FALSE,
seed.use = 312) %>%
RunUMAP(reduction = "pca",
dims = 1:20,
n.components = 2,
metric = "cosine",
seed.use = 312,
verbose = FALSE) %>%
FindNeighbors(reduction = "pca",
dims = 1:20,
k.param = sqrt(ncol(.)),
nn.method = "annoy",
annoy.metric = "cosine",
verbose = FALSE) %>%
FindClusters(resolution = 0.3,
random.seed = 312,
verbose = FALSE)
```
Let's visualize the principal components. Notable genes in PC 1 include *MALAT1*, high abundance of which is [a common artifact of 10X-sequenced data](https://kb.10xgenomics.com/hc/en-us/articles/360004729092-Why-do-I-see-high-levels-of-Malat1-in-my-gene-expression-data-). PC 2 seems to separate NK cells (*NKG7*, *GZMB*) and myeloid cells (*HLA-DRA*, *CD79A*). PC 3 is composed of variation that could originate from platelets (*PPBP*). PCs 4-6 look like they separate several types of monocytic, T, NK, and dendritic cells.
```{r}
DimHeatmap(pbmc,
reduction = "pca",
dims = 1:6,
nfeatures = 15,
combine = TRUE)
```
We visualize the Louvain clustering via a UMAP plot. We see 5 major clusters, which we'll annotate next.
```{r, message=FALSE, warning=FALSE}
DimPlot(pbmc, pt.size = 1) +
scale_color_paletteer_d("ggsci::nrc_npg") +
labs(x = "UMAP 1", y = "UMAP 2") +
theme(axis.ticks = element_blank(),
axis.text = element_blank())
```
## Broad Annotations
First we identify CD8+ T-cells via *CD8A*, and CD4+ T-cells with *IL7R*. Lastly, *FCGR3A* (aka *CD16*) is specific to CD16+ monocytes. We can combine the plots using the excellent `patchwork` package.
```{r, message=FALSE, warning=FALSE}
p1 <- FeaturePlot(pbmc, features = "CD8A", pt.size = 1) +
scale_color_gradientn(colours = paletteer_d("wesanderson::Zissou1")) +
labs(x = "UMAP 1", y = "UMAP 2") +
theme(axis.ticks = element_blank(),
axis.text = element_blank())
p2 <- FeaturePlot(pbmc, features = "IL7R", pt.size = 1) +
scale_color_gradientn(colours = paletteer_d("wesanderson::Zissou1")) +
labs(x = "UMAP 1", y = "UMAP 2") +
theme(axis.ticks = element_blank(),
axis.text = element_blank())
p1 / p2
```
Next, we use *HLA-DRA* to broadly identify monocytic cells, and *FCGR3A* (aka *CD16*) to single out the CD16+ monocytes.
```{r, message=FALSE, warning=FALSE}
p1 <- FeaturePlot(pbmc, features = "HLA-DRA", pt.size = 1) +
scale_color_gradientn(colours = paletteer_d("wesanderson::Zissou1")) +
labs(x = "UMAP 1", y = "UMAP 2") +
theme(axis.ticks = element_blank(),
axis.text = element_blank())
p2 <- FeaturePlot(pbmc, features = "FCGR3A", pt.size = 1) +
scale_color_gradientn(colours = paletteer_d("wesanderson::Zissou1")) +
labs(x = "UMAP 1", y = "UMAP 2") +
theme(axis.ticks = element_blank(),
axis.text = element_blank())
p1 / p2
```
Lastly, abundance of *MS4A1* points out a cluster of B cells.
```{r, message=FALSE, warning=FALSE}
FeaturePlot(pbmc, features = "MS4A1", pt.size = 1) +
scale_color_gradientn(colours = paletteer_d("wesanderson::Zissou1")) +
labs(x = "UMAP 1", y = "UMAP 2") +
theme(axis.ticks = element_blank(),
axis.text = element_blank())
```
We'll add broad celltype labels to our object's metadata.
```{r}
pbmc@meta.data <- mutate(pbmc@meta.data,
broad_celltype = case_when(seurat_clusters == 0 ~ "CD4+ T",
seurat_clusters == 1 ~ "Monocyte",
seurat_clusters == 2 ~ "CD8+ T",
seurat_clusters == 3 ~ "B",
seurat_clusters == 4 ~ "CD16+ Monocyte",
TRUE ~ NA_character_),
broad_celltype = factor(broad_celltype, levels = c("CD4+ T",
"Monocyte",
"CD8+ T",
"B",
"CD16+ Monocyte")))
```
And visualize the results.
```{r, message=FALSE, warning=FALSE}
DimPlot(pbmc, pt.size = 1, group.by = "broad_celltype") +
scale_color_paletteer_d("ggsci::nrc_npg") +
labs(x = "UMAP 1",
y = "UMAP 2",
color = "Broad Celltype") +
theme(axis.ticks = element_blank(),
axis.text = element_blank(),
plot.title = element_blank())
```
## Reclustering
From the plot above, there appears to be some visible subgroups in the monocyte cluster. With that being said - I would generally be very cautious about using UMAPs alone to define heterogeneous groups. In general, I would suggest using something like silhouette score distributions, other clustering statistics, or biological knowledge to determine subclustering targets. We can do this below using `ComputeSilhouetteScores()`, which returns a silhouette score for each individual cell. Visualizing the results can help us identify which clusters are "poor" fits. For more information, check out [the Wikipedia article on clustering scores](https://en.wikipedia.org/wiki/Cluster_analysis#Evaluation_and_assessment).
```{r}
sil_scores <- ComputeSilhouetteScores(pbmc, avg = FALSE)
```
We can see that the B cell and CD16+ monocyte clusters seem to be well-fit, but the other clusters are less so. We'll focus on the other monocyte cluster, as it seems to have the highest variance.
```{r, message=FALSE, warning=FALSE}
sil_scores %>%
left_join(distinct(pbmc@meta.data, seurat_clusters, broad_celltype),
by = c("Cluster" = "seurat_clusters")) %>%
ggplot(aes(x = broad_celltype, y = Score, fill = broad_celltype)) +
geom_violin(scale = "width",
color = "black",
draw_quantiles = 0.5,
size = 0.75) +
scale_fill_paletteer_d("ggsci::nrc_npg") +
labs(y = "Silhouette Score", fill = "Broad Celltype") +
theme_classic(base_size = 14) +
theme(panel.grid.major.y = element_line(),
axis.title.x = element_blank())
```
### Monocytes
```{r, message=FALSE, warning=FALSE}
mono_reclust <- ReclusterCells(pbmc,
which.clust = 1,
use.parallel = FALSE,
n.HVG = 3000,
n.PC = 15,
k.vals = c(20, 30, 40),
resolution.vals = c(.2, .3, .4),
random.seed = 312)
```
Let's check out the UMAP embedding:
```{r, message=FALSE, warning=FALSE}
DimPlot(mono_reclust) +
scale_color_paletteer_d("MetBrewer::Egypt") +
labs(x = "UMAP 1",
y = "UMAP 2",
color = "Subcluster") +
theme(axis.ticks = element_blank(),
axis.text = element_blank(),
plot.title = element_blank())
```
Highly-specific abundance of *FCER1A* allows us to identify the dendritic cells in cluster 2.
```{r, warning=FALSE, message=FALSE}
data.frame(exp = mono_reclust@assays$RNA@data["FCER1A", ],
label = mono_reclust$seurat_clusters) %>%
ggplot(aes(x = label, y = exp, fill = label)) +
geom_violin(scale = "width",
color = "black",
draw_quantiles = 0.5,
size = 0.75) +
scale_fill_paletteer_d("MetBrewer::Egypt") +
labs(y = "FCER1A", fill = "Subcluster") +
theme_classic(base_size = 14) +
theme(panel.grid.major.y = element_line(),
axis.title.x = element_blank())
```
Both cluster 0 & cluster 1 seem to be CD14+, and cluster 1 appears to have slightly higher (but still low) abundance of *FCGR3A*.
```{r, message=FALSE, warning=FALSE}
p1 <- data.frame(exp = mono_reclust@assays$RNA@data["CD14", ],
label = mono_reclust$seurat_clusters) %>%
filter(label %in% c(0, 1)) %>%
ggplot(aes(x = label, y = exp, fill = label)) +
geom_violin(scale = "width",
color = "black",
draw_quantiles = 0.5,
size = 0.75) +
scale_fill_paletteer_d("MetBrewer::Egypt") +
labs(y = "CD14", fill = "Subcluster") +
theme_classic(base_size = 14) +
theme(panel.grid.major.y = element_line(),
axis.title.x = element_blank())
p2 <- data.frame(exp = mono_reclust@assays$RNA@data["FCGR3A", ],
label = mono_reclust$seurat_clusters) %>%
filter(label %in% c(0, 1)) %>%
ggplot(aes(x = label, y = exp, fill = label)) +
geom_violin(scale = "width",
color = "black",
draw_quantiles = 0.5,
size = 0.75) +
scale_fill_paletteer_d("MetBrewer::Egypt") +
labs(y = "FCGR3A", fill = "Subcluster") +
theme_classic(base_size = 14) +
theme(panel.grid.major.y = element_line(),
axis.title.x = element_blank())
p1 / p2
```
From [Kapellos *et al* (2019)](https://doi.org/10.3389/fimmu.2019.02035), we know that intermediate monocytes have high abundance of *CD14*, low but non-zero abundance of *CD16* (which again is denoted *FCGR3A* in this dataset), and can be identified through higher abundance of other markers like *HLA-DPB1* and *CD74* in comparison to CD14+ monocytes. With all this information, we'll conclude that cluster 0 is likely composed of CD14+ monocytes and cluster 1 of intermediate monocytes.
```{r, message=FALSE, warning=FALSE}
p1 <- data.frame(exp = mono_reclust@assays$RNA@data["HLA-DQB1", ],
label = mono_reclust$seurat_clusters) %>%
filter(label %in% c(0, 1)) %>%
ggplot(aes(x = label, y = exp, fill = label)) +
geom_violin(scale = "width",
color = "black",
draw_quantiles = 0.5,
size = 0.75) +
scale_fill_paletteer_d("MetBrewer::Egypt") +
labs(y = "HLA-DQB1", fill = "Subcluster") +
theme_classic(base_size = 14) +
theme(panel.grid.major.y = element_line(),
axis.title.x = element_blank())
p2 <- data.frame(exp = mono_reclust@assays$RNA@data["CD74", ],
label = mono_reclust$seurat_clusters) %>%
filter(label %in% c(0, 1)) %>%
ggplot(aes(x = label, y = exp, fill = label)) +
geom_violin(scale = "width",
color = "black",
draw_quantiles = 0.5,
size = 0.75) +
scale_fill_paletteer_d("MetBrewer::Egypt") +
labs(y = "CD74", fill = "Subcluster") +
theme_classic(base_size = 14) +
theme(panel.grid.major.y = element_line(),
axis.title.x = element_blank())
p1 / p2
```
Lastly, we can tell that cluster 3 is composed of platelets thanks to high abundance of *PPBP*.
```{r, message=FALSE, warning=FALSE}
data.frame(exp = mono_reclust@assays$RNA@data["PPBP", ],
label = mono_reclust$seurat_clusters) %>%
ggplot(aes(x = label, y = exp, fill = label)) +
geom_violin(scale = "width",
color = "black",
draw_quantiles = 0.5,
size = 0.75) +
scale_fill_paletteer_d("MetBrewer::Egypt") +
labs(y = "PPBP", fill = "Subcluster") +
theme_classic(base_size = 14) +
theme(panel.grid.major.y = element_line(),
axis.title.x = element_blank())
```
We can add the new subcluster labels back in to our original object using `IntegrateSubclusters()`. We also add labels to the original object reflecting the subcluster annotations.
```{r}
pbmc <- IntegrateSubclusters(pbmc, reclust.results = mono_reclust)
pbmc@meta.data <- mutate(pbmc@meta.data,
celltype = case_when(seurat_clusters == 0 ~ "CD4+ T",
seurat_clusters == 1 ~ "Platelet",
seurat_clusters == 2 ~ "CD8+ T",
seurat_clusters == 3 ~ "B",
seurat_clusters == 4 ~ "CD16+ Monocyte",
seurat_clusters == 5 ~ "CD14+ Monocyte",
seurat_clusters == 6 ~ "Intermediate Monocyte",
seurat_clusters == 7 ~ "Dendritic Cell",
TRUE ~ NA_character_))
```
Here's the final celltype annotations on our original UMAP embedding.
```{r, message=FALSE, warning=FALSE}
DimPlot(pbmc, group.by = "celltype", pt.size = 1) +
scale_color_paletteer_d("ggsci::default_nejm") +
labs(x = "UMAP 1",
y = "UMAP 2",
color = "Celltype") +
theme(axis.ticks = element_blank(),
axis.text = element_blank(),
plot.title = element_blank())
```
# Conclusions
We were able to use `SCISSORS` to estimate a biologically meaningful subclustering that is to some degree supported by canonical marker genes from the literature. I'll note that while these monocyte subtype annotations might be useful, we didn't analyze the entire dataset; those familiar with the PBMC3k dataset will know that what we labelled the CD8+ T cell cluster actually also contains NK cells. This analysis is not meant to be exhaustive or final, and serves mostly to show how and why `SCISSORS` is used. Please reach out to me with questions on the package [via email](mailto:jrleary@live.unc.edu), or by [opening an issueon the GitHub repository](https://github.com/jr-leary7/SCISSORS/issues).
# Session Info
```{r}
sessioninfo::session_info()
```