scRNA-seq Resources

Author

Affiliation

Jack Leary

University of Florida

Published

April 16, 2024

1 Purpose

Here I’ll catalog useful papers, preprints, method vignettes, Twitter discussions, etc. that I’ve found helpful while learning how to process and analyze single-cell RNA-seq data. When possible, I’ll link to static versions of things in the hope that the links don’t break. I’ll categorize resources according to the problem they address (e.g., raw data processing, clustering, visualization), though some resources will of course touch on multiple topics.

2 Introductory Guides

Orchestrating Single Cell Analysis with Bioconductor
- This book is a comprehensive guide on using Bioconductor tools built around the SingleCellExperiment data structure to perform scRNA-seq analysis. It covers basic & advanced versions of everything from quality control to many different types of downstream analyses in both the single- and multi-sample cases.
Seurat guided clustering tutorial
- The famous Seurat PBMC3k vignette that everyone has looked at at least once. Useful for beginners to get an idea of how Seurat works, but glosses over a ton of details & makes things look perhaps a bit too easy.

3 Processing Raw Reads

10XGenomics cellranger documentation
- Shows how to use the official cellranger pipeline maintained by 10X to turn your raw reads into a gene-cell counts matrix.
Quantifying unspliced & spliced RNA with alevin-fry
- Use this pipeline if you want to perform RNA velocity analysis, or just want to have your read counts quantified by spliced / unspliced / ambiguous status (called USA mode in the documentation).
kallisto | bustools documentation
- This command line tool & Python downstream analysis suite are maintained by the Pachter Lab and are very well-written. This framework also allows for unspliced vs. spliced RNA count quantification, which is necessary to run RNA velocity. I would recommend reading through the docs just for the tips on scRNA-seq analysis even if you don’t end up using the tool.

4 Quality Control

Vignette for Seurat cell cycle scoring
- Assigns an S- and G2M-phase score to each cell based on its expression of human cell cycle marker genes from this 2016 paper, then shows how to regress those scores out of the data.
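
To make the scoring idea concrete, here’s a minimal sketch in Python. It is not Seurat’s implementation (which, as I understand it, compares marker expression against expression-matched control genes); it just takes the mean expression of a phase’s marker genes minus the mean of some control genes, then calls the phase from the two scores. The function names, control genes, and expression values are all hypothetical.

```python
from statistics import mean

def phase_score(expr, marker_genes, control_genes):
    """Mean marker expression minus mean control expression for one cell."""
    markers = [expr[g] for g in marker_genes if g in expr]
    controls = [expr[g] for g in control_genes if g in expr]
    return mean(markers) - mean(controls)

def assign_phase(s_score, g2m_score):
    """Call G1 when neither score is positive, otherwise take the larger score."""
    if s_score <= 0 and g2m_score <= 0:
        return "G1"
    return "S" if s_score > g2m_score else "G2M"

# Hypothetical normalized expression values for a single cell.
cell = {"MCM5": 3.0, "PCNA": 2.5, "TOP2A": 0.2, "MKI67": 0.1, "ACTB": 1.0, "GAPDH": 1.2}

s_score = phase_score(cell, ["MCM5", "PCNA"], ["ACTB", "GAPDH"])
g2m_score = phase_score(cell, ["TOP2A", "MKI67"], ["ACTB", "GAPDH"])
phase = assign_phase(s_score, g2m_score)
print(phase, round(s_score, 2), round(g2m_score, 2))
```

The two scores are what you would then consider regressing out, or at least plot against your clusters to see whether cell cycle is driving the structure.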

5 Normalization

SCnorm: Robust normalization of single cell data
- Written by my PI Dr. Rhonda Bacher, this paper presents a robust normalization package & details many of the challenges / tradeoffs of performing normalization & variance stabilization.
Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression
- This paper details SCTransform, a Satija Lab method & R package for normalization that uses regularized negative binomial GLMs to model gene expression variance while accounting for overdispersion & sharing information across genes. Whether the method performs as well as advertised is still debated; for example, this paper argues that the SCTransform model is overspecified, and recommends the usage of GLM-PCA as proposed here.
- Original SCTransform vignette & SCTransform V2 vignette with methodological improvements.
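
The quantity at the heart of this debate is the Pearson residual from a negative binomial model, which is simple enough to sketch. Below, the fitted mean `mu` and the overdispersion parameter `theta` are assumed to be given; in practice they come from the fitted (& regularized) regression, which this sketch does not attempt to reproduce.

```python
import math

def nb_pearson_residual(count, mu, theta):
    """Pearson residual under a negative binomial with mean mu and dispersion theta.

    The NB variance is mu + mu^2 / theta, so larger theta means less overdispersion.
    """
    variance = mu + mu ** 2 / theta
    return (count - mu) / math.sqrt(variance)

# Hypothetical values: a gene with fitted mean 4 and theta 2 in three cells.
observed = [0, 4, 10]
residuals = [nb_pearson_residual(x, mu=4.0, theta=2.0) for x in observed]
print([round(r, 3) for r in residuals])
```

Cells with counts near the fitted mean get residuals near zero, while counts far above it get large positive residuals, which is why these residuals are used as the normalized expression values.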

6 Integration

7 Clustering

From Louvain to Leiden: guaranteeing well-connected communities
- Most scRNA-seq analysis packages apply some type of graph-based clustering algorithm as the clustering method of choice. While many other clustering methods exist, graph-based methods are easy to use & interpret, computationally efficient, & their assumptions generally match the structure of scRNA-seq data. The Leiden algorithm described in the paper above is one of the most widely-used options; a good tutorial on using it via scanpy in Python can be found here, and one for R using Seurat is here.
Sub-Cluster Identification through Semi-Supervised Optimization of Rare-cell Silhouettes (SCISSORS) in Single-Cell Sequencing
- This is a bit of a self-plug, but I wrote the SCISSORS R package to make obtaining well-fit, reproducible clustering results using the graph-based algorithms in Seurat a little easier. The method iterates across several combinations of clustering parameters and finds the set with the best silhouette score. Performance was validated via simulations, & we were able to discover new biological information in a pancreatic ductal adenocarcinoma (PDAC) dataset using the method. The GitHub repository has several examples as well as the raw code, and there’s also an introductory vignette on this site.
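
The core idea of scoring several candidate clusterings by silhouette & keeping the best one can be sketched in a few lines of plain Python. This is not the SCISSORS code; the candidate label vectors below are hypothetical stand-ins for the results of clustering the same cells under different parameter settings, and the points stand in for cells in a reduced-dimension space.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_silhouette(points, labels):
    """Average silhouette width over all points (requires at least 2 clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton clusters get a width of 0 by convention
            widths.append(0.0)
            continue
        # a: mean distance to the rest of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        widths.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(widths) / len(widths)

def best_labeling(points, candidates):
    """Return the (name, labels) pair with the highest mean silhouette."""
    return max(candidates.items(), key=lambda kv: mean_silhouette(points, kv[1]))

# Two well-separated groups of hypothetical cells.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
candidates = {
    "resolution_low": [0, 0, 0, 1, 1, 1],
    "resolution_high": [0, 0, 1, 2, 2, 3],
}
best_name, best_labels = best_labeling(points, candidates)
print(best_name)
```

Here the over-split labeling is penalized because its extra clusters sit right next to their neighbors, so the coarser labeling wins.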

8 Annotation

Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage
- This paper details the usage of an automated celltype annotation tool called SingleR to transfer labels from a labeled reference dataset to an unlabeled query dataset. A good vignette can be found here. If your data are clean & well-processed, and your reference is high quality, it can be a fairly accurate & very efficient tool. I generally use it as a first step in an annotation process to generate broad celltype labels (e.g., monocyte, T cell, fibroblast). Don’t treat its labels as ground truth, and definitely try multiple reference datasets. In addition, it uses correlation to measure the similarity between the query & reference, & I’ve found that processing both datasets using the same pipeline (especially with respect to normalization) leads to better results.
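
To illustrate the correlation-based idea, here’s a stripped-down sketch that assigns a query cell the label of the reference profile it correlates with best, using Spearman (rank) correlation. It leaves out everything that makes the real tool work well (marker gene selection, per-label score aggregation, fine-tuning), and the gene order, expression values, and function names are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def transfer_label(query, reference):
    """Score the query against each reference profile & return the best label."""
    scores = {label: spearman(query, profile) for label, profile in reference.items()}
    return max(scores, key=scores.get), scores

# Hypothetical expression over the genes CD3E, CD14, COL1A1, LYZ (in that order).
reference = {
    "T cell": [9.0, 0.5, 0.1, 0.3],
    "monocyte": [0.2, 8.0, 0.1, 9.5],
    "fibroblast": [0.1, 0.2, 9.0, 0.4],
}
query = [0.0, 6.5, 0.3, 7.0]
label, scores = transfer_label(query, reference)
print(label, {k: round(v, 2) for k, v in scores.items()})
```

Because only ranks enter the score, the comparison depends on the ordering of genes within each cell rather than on absolute expression values.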

9 Dimension Reduction & Visualization

10 Trajectory Inference

10.1 Via Pseudotime

Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics
- The slingshot package is my currently preferred method for estimating a pseudotemporal cellular ordering. A decent vignette can be found here. I would absolutely recommend using principal components as input to the algorithm instead of UMAP / t-SNE components.
Trajectory-based differential expression analysis for single-cell sequencing data
- This paper describes the development of the tradeSeq R package, which uses generalized additive models (GAMs) to perform differential expression over an inferred cellular trajectory. The package has some limitations, but provides a variety of tests for different patterns of gene expression, and overall strikes a good balance between running quickly & providing accurate results. A good vignette can be accessed here. A nice characteristic of the method is that it is agnostic with respect to the type of pseudotime estimation used, meaning the user can derive their cellular ordering using any pseudotime or RNA velocity method prior to running tradeSeq.

10.2 Via RNA Velocity

RNA velocity unraveled
On the mathematics of RNA velocity I: Theoretical analysis
On the mathematics of RNA velocity II: Algorithmic aspects
Unified fate mapping in multiview single-cell data
- This preprint presents the CellRank 2 Python package, which implements a flexible, modular set of tools for the analysis of cell state composition and transition. The package is an extension of CellRank (paper), and is built around the concept of kernels based on pseudotime, RNA velocity, graph connectivity, time-series, etc. that are used to create estimators that classify cells into initial, intermediate, and terminal cell states. While RNA velocity is a possible input, it’s no longer required to run the estimation routine, as it was in the original CellRank package. The great thing about this package is that you can combine kernels, i.e., you can base the cell state estimation on weighted combinations of pseudotime, RNA velocity, graph structure, etc. based on how confident you are in each input.
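
Conceptually, combining kernels amounts to taking a weighted average of cell-to-cell transition matrices. The sketch below shows that idea in plain Python under the assumption that each kernel is summarized by a row-stochastic matrix; it does not use the CellRank API, and the matrices, weights, & function name are hypothetical.

```python
def combine_kernels(matrices, weights):
    """Weighted average of row-stochastic transition matrices.

    Weights are normalized to sum to 1 so every row of the result still sums to 1.
    """
    total = sum(weights)
    norm = [w / total for w in weights]
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for w, m in zip(norm, matrices)) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical transition matrices over three cells.
velocity_T = [
    [0.1, 0.9, 0.0],
    [0.0, 0.2, 0.8],
    [0.0, 0.0, 1.0],
]
connectivity_T = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.5, 0.5],
]

# Trust the velocity-based transitions more than the graph structure.
combined = combine_kernels([velocity_T, connectivity_T], weights=[0.8, 0.2])
for row in combined:
    print([round(p, 3) for p in row])
```

Shifting the weights toward the connectivity matrix makes the combined transitions more symmetric, which is the sort of tradeoff you make when you don’t fully trust the velocity estimates.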

11 Differential Expression

12 Pathway Analysis

Variance-adjusted Mahalanobis (VAM): a fast and accurate method for cell-specific gene set scoring
- VAM is an easy-to-use method for assigning each cell in a sample a score bounded on $[0, 1]$ for a user-provided gene set of interest. It’s conveniently wrapped around the Seurat framework, and I’ve found it to be accurate & useful in my single cell work. In addition, the author was very helpful when I reached out to him with some questions when I was first starting out. Vignettes can be found here.

13 Simulation

Enhancing biological signals and detection rates in single-cell RNA-seq experiments with cDNA library equalization
- This paper details the scaffold simulation method & accompanying R package (detailed vignette here) built by my PhD adviser Dr. Rhonda Bacher. scRNA-seq data simulation can often be tricky, and most packages have odd defaults, don’t generate data that follows true scRNA-seq count distributions, or are just really hard to set up & use. scaffold addresses these problems, and allows the user to generate UMI or non-UMI data, as well as data containing multiple populations or dynamic trends over pseudotime. Ground truth cluster labels and cell temporal orderings are provided, which makes it a great choice when benchmarking new or extant methods. It’s fairly simple to use & computationally efficient as well, which is a huge plus when running large simulation studies.
Splatter: simulation of single-cell RNA sequencing data
- Splatter is another R package that implements relatively simple simulation of scRNA-seq counts data. While the package works well & has decent documentation, and gives good ground-truth values for gene differential expression & cell pseudotime, the counts it generates are less sparse than true single cell data typically are. This could lead to overly-optimistic method benchmarks if not accounted for when testing new methods. With that caveat, it’s a good method & a great way to jump into simulation studies. A nice introductory vignette can be found here.
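
A quick way to check the sparsity caveat on your own simulations is to compare the fraction of zero counts in the simulated matrix against a real dataset of similar size. Here’s a minimal sketch with tiny made-up matrices (rows are genes, columns are cells); the values & function name are hypothetical.

```python
def zero_fraction(counts):
    """Fraction of entries in a counts matrix (list of rows) that are exactly zero."""
    total = sum(len(row) for row in counts)
    zeros = sum(1 for row in counts for value in row if value == 0)
    return zeros / total

# Hypothetical simulated vs. observed count matrices.
simulated = [
    [0, 3, 1, 0],
    [2, 0, 4, 1],
    [1, 1, 0, 5],
]
observed = [
    [0, 0, 1, 0],
    [0, 0, 4, 0],
    [1, 0, 0, 0],
]
print(round(zero_fraction(simulated), 3), round(zero_fraction(observed), 3))
```

If the simulated fraction is much lower than the observed one, that’s a hint to adjust the simulation parameters before trusting any benchmark built on it.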

14 Computational Reproducibility & Scalability

Reproducible pipelines in R using {targets}
- The targets package is one of my favorite R tools, & the well-written docs above show how to create version-controlled pipelines entirely using R. This framework is an absolute godsend for large projects, simulation studies, etc., and I’ve used it on every long-term computational project I’ve worked on in the past 2 years. It makes tracking, reproducing, & parallelizing the execution of large codebases very easy, and makes reproducible research accessible to anyone with a good handle on R.
Parallelization in Seurat vignette
- Shows usage of the future R package to parallelize & speed up tasks such as normalization, integration, clustering, & differential expression.

15 Miscellaneous

Dr. Ming (Tommy) Tang’s genomics tools GitHub repository
- Dr. Ming Tang is a wet lab scientist-turned-computational biologist who used to work at Dana-Farber Cancer Institute & is now a director at Immunitas. He’s put together a ton of resources for anyone (biologists, bioinformatics newbies, even experienced computational biologists) who’s in the process of learning more about bioinformatics & biostatistics. The repository collects books, blog posts, online courses, etc. on statistics, computing, biology, and other topics. I’d recommend perusing it even if you’re not a beginner, as you’ll likely find something interesting or useful to read.
