scRNA-seq Resources
1 Purpose
Here I’ll catalog useful papers, preprints, method vignettes, Twitter discussions, etc. that I’ve found helpful while learning how to process and analyze single cell RNA-seq data. When possible, I’ll link to static versions of things in the hopes that links don’t break. I’ll categorize resources according to which problem they address e.g., raw data processing, clustering, visualization, etc., though some resources will of course touch on multiple topics.
2 Introductory Guides
- Orchestrating Single Cell Analysis with Bioconductor
  - This book is a comprehensive guide to using Bioconductor tools built around the SingleCellExperiment data structure to perform scRNA-seq analysis. It covers basic & advanced versions of everything from quality control to many different types of downstream analyses in both the single- and multi-sample cases.
- Seurat guided clustering tutorial
  - The famous Seurat PBMC3k vignette that everyone has looked at at least once. Useful for beginners to get an idea of how Seurat works, but glosses over a ton of details & makes things look perhaps a bit too easy.
3 Processing Raw Reads
- 10X Genomics cellranger documentation
  - Shows how to use the official cellranger pipeline maintained by 10X to turn your raw reads into a gene-cell counts matrix.
- Quantifying unspliced & spliced RNA with alevin-fry
  - Use this pipeline if you want to perform RNA velocity analysis, or just want to have your read counts quantified by spliced / unspliced / ambiguous status (called USA mode in the documentation).
- kallisto | bustools documentation
  - This command line tool & Python downstream analysis suite are maintained by the Pachter Lab and are very well-written. This framework also allows for unspliced vs. spliced RNA count quantification, which is necessary to run RNA velocity. I would recommend reading through the docs just for the tips on scRNA-seq analysis even if you don’t end up using the tool.
4 Quality Control
- Vignette for Seurat cell cycle scoring
  - Uses regression to assign an S- and G2M-phase score to each cell based on human cell cycle genes from this 2019 paper.
5 Normalization
- SCnorm: Robust normalization of single cell data
  - Written by my PI Dr. Rhonda Bacher, this paper presents a robust normalization package & details many of the challenges / tradeoffs of performing normalization & variance stabilization.
- Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression
  - This paper details SCTransform, a Satija Lab method & R package for normalization that uses regularized negative binomial GLMs to model gene expression variance while accounting for overdispersion & sharing information across genes. There’s been some controversy over whether this method is actually that good, such as this paper which posits that the SCTransform model is overspecified, and recommends the usage of GLM-PCA as proposed here.
  - Original SCTransform vignette & SCTransform V2 vignette with methodological improvements.
6 Integration
7 Clustering
- From Louvain to Leiden: guaranteeing well-connected communities
  - Most scRNA-seq analysis packages use some type of graph-based clustering algorithm as the clustering method of choice. While many other clustering methods exist, graph-based methods are easy to use & interpret, computationally efficient, & their assumptions generally match the structure of scRNA-seq data. The Leiden algorithm described in the paper above is one of the most widely-used options; a good tutorial on using it in Python via scanpy can be found here, and one for R using Seurat is here.
- Sub-Cluster Identification through Semi-Supervised Optimization of Rare-cell Silhouettes (SCISSORS) in Single-Cell Sequencing
  - This is a bit of a self-plug, but I wrote the SCISSORS R package to make obtaining well-fit, reproducible clustering results using the graph-based algorithms in Seurat a little easier. The method iterates across several combinations of clustering parameters and finds the set with the best silhouette score. Performance was validated via simulations, & we were able to discover new biological information in a pancreatic ductal adenocarcinoma (PDAC) dataset using the method. The GitHub repository has several examples as well as the raw code, and there’s also an introductory vignette on this site.
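To make the idea of picking clustering parameters by silhouette score concrete, here is a minimal pure-Python sketch. It is not the SCISSORS implementation: the toy 2D embedding, the candidate labelings, and the function names `mean_silhouette` and `pick_best` are all assumptions made up for illustration, standing in for the labelings that different parameter sets would produce.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_silhouette(points, labels):
    """Average silhouette width over all points (singleton clusters score 0)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / len(widths)


def pick_best(points, candidates):
    """Return (name, score) of the labeling with the highest mean silhouette."""
    scored = {name: mean_silhouette(points, labs) for name, labs in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Toy embedding (hypothetical): three tight, well-separated pairs of cells.
cells = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.1, 0.0)]

# Hypothetical labelings, as if produced by three different parameter sets.
candidates = {
    "low_resolution": [0, 0, 0, 0, 1, 1],
    "mid_resolution": [0, 0, 1, 1, 2, 2],
    "high_resolution": [0, 1, 2, 2, 3, 3],
}

best_name, best_score = pick_best(cells, candidates)
print(best_name, round(best_score, 3))
```

In this toy case both under-clustering (merging two groups) and over-clustering (splitting a pair into singletons) lower the average silhouette, which is the behavior that makes it a reasonable selection criterion.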
8 Annotation
- Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage
  - This paper details the usage of an automated celltype annotation tool called SingleR to transfer labels from a labeled reference dataset to an unlabeled query dataset. A good vignette can be found here. If your data are clean & well-processed, and your reference is high quality, it can be a fairly accurate & very efficient tool. I generally use it as a first step in an annotation process to generate broad celltype labels e.g., monocyte, T cell, fibroblast, etc. Don’t treat its labels as ground truth, and definitely try multiple reference datasets. In addition, it uses correlation to measure the similarity between the query & reference, & I’ve found that processing both datasets using the same pipeline (especially with respect to normalization) leads to better results.
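The correlation-based label transfer described above can be sketched in a few lines of plain Python. This is only an illustration of the general idea using rank-based (Spearman) correlation, not SingleR’s actual code, which involves further steps. The expression values, the reference profiles, and the function names `spearman` and `transfer_label` are hypothetical.

```python
import math


def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            out[order[k]] = mean_rank
        pos = end + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def transfer_label(query_cell, reference_profiles):
    """Assign the reference label whose profile correlates best with the query."""
    scores = {lab: spearman(query_cell, prof) for lab, prof in reference_profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical normalized expression over the same four genes in every profile.
reference = {
    "T cell": [9.0, 0.5, 0.1, 0.3],
    "monocyte": [0.2, 8.5, 0.3, 0.1],
    "fibroblast": [0.1, 0.4, 9.2, 0.0],
}
query = [7.1, 1.0, 0.0, 0.6]

label, scores = transfer_label(query, reference)
print(label, {k: round(v, 2) for k, v in scores.items()})
```

The query and every reference profile must be measured over the same genes in the same order, which is one practical reason to run both datasets through the same processing pipeline.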
9 Dimension Reduction & Visualization
10 Trajectory Inference
10.1 Via Pseudotime
- Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics
  - The slingshot package is currently my preferred method for estimating a pseudotemporal cellular ordering. A decent vignette can be found here. I would absolutely recommend using principal components as input to the algorithm instead of UMAP / t-SNE components.
- Trajectory-based differential expression analysis for single-cell sequencing data
  - This paper describes the development of the tradeSeq R package, which uses generalized additive models (GAMs) to perform differential expression over an inferred cellular trajectory. The package has some limitations, but provides a variety of tests for different patterns of gene expression, and overall strikes a good balance between running quickly & providing accurate results. A good vignette can be accessed here. A nice characteristic of the method is that it is agnostic with respect to the type of pseudotime estimation used, meaning the user can derive their cellular ordering using any pseudotime or RNA velocity method prior to running tradeSeq.
10.2 Via RNA Velocity
- Unified fate mapping in multiview single-cell data
  - This preprint presents the CellRank 2 Python package, which implements a flexible, modular set of tools for the analysis of cell state composition and transition. The package is an extension of CellRank (paper), and is built around the concept of kernels based on pseudotime, RNA velocity, graph connectivity, time-series, etc. that are used to create estimators that classify cells into initial, intermediate, and terminal cell states. While RNA velocity is a possible input, it’s no longer necessary to run the estimation routine like it was in the original CellRank package. The great thing about this package is that you can combine kernels i.e., you can base the cell state estimation on weighted combinations of pseudotime, RNA velocity, graph structure, etc. based on how confident you are in each input.
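The idea of weighting kernels by confidence can be illustrated without the package itself. The sketch below assumes each kernel boils down to a row-stochastic cell-to-cell transition matrix and takes a weighted average of two of them. The matrices, the weights, and the helper names `row_normalize` and `combine_kernels` are made up for illustration and are not CellRank’s API.

```python
def row_normalize(matrix):
    """Rescale each row to sum to 1 so it reads as transition probabilities."""
    out = []
    for row in matrix:
        total = sum(row)
        if total <= 0:
            raise ValueError("each row needs a positive sum")
        out.append([v / total for v in row])
    return out


def combine_kernels(kernels, weights):
    """Weighted average of row-stochastic matrices (weights rescaled to sum to 1)."""
    total_w = sum(weights)
    n = len(kernels[0])
    combined = [[0.0] * n for _ in range(n)]
    for mat, w in zip(kernels, weights):
        for i in range(n):
            for j in range(n):
                combined[i][j] += (w / total_w) * mat[i][j]
    return combined


# Hypothetical 3-cell transition matrices from two sources of information.
velocity_T = row_normalize([[1, 3, 0], [0, 1, 1], [0, 0, 1]])
connectivity_T = row_normalize([[2, 1, 1], [1, 2, 1], [1, 1, 2]])

# Trust the velocity-derived transitions more than plain graph connectivity.
combined_T = combine_kernels([velocity_T, connectivity_T], [0.8, 0.2])
for row in combined_T:
    print([round(v, 3) for v in row])
```

Because the weights are rescaled to sum to 1, the combined matrix stays row-stochastic, so it can be used for downstream estimation in the same way as a single kernel.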
11 Differential Expression
12 Pathway Analysis
- Variance-adjusted Mahalanobis (VAM): a fast and accurate method for cell-specific gene set scoring
  - VAM is an easy-to-use method for assigning each cell in a sample a score bounded on \([0, 1]\) for a user-provided gene set of interest. It’s conveniently wrapped around the Seurat framework, and I’ve found it to be accurate & useful in my single cell work. In addition, the author was very helpful when I reached out to him with some questions when I was first starting out. Vignettes can be found here.
13 Simulation
- Enhancing biological signals and detection rates in single-cell RNA-seq experiments with cDNA library equalization
  - This paper details the scaffold simulation method & accompanying R package (detailed vignette here) built by my PhD adviser Dr. Rhonda Bacher. scRNA-seq data simulation can often be tricky, and most packages have weird defaults, don’t generate data that follows true scRNA-seq count distributions, or are just really hard to set up & use. scaffold fixes all that, and allows the user to generate UMI or non-UMI data, as well as data containing multiple populations or dynamic trends over pseudotime. Ground truth cluster labels and cell temporal orderings are provided, which makes it a great choice when benchmarking new or extant methods. It’s fairly simple to use, & computationally efficient as well, which is a huge plus when running large simulation studies.
- Splatter: simulation of single-cell RNA sequencing data
  - Splatter is another R package that implements relatively simple simulation of scRNA-seq counts data. While the package works well & has decent documentation, and gives good ground-truth values for gene differential expression & cell pseudotime, the counts it generates are less sparse than true single cell data typically are. This could lead to overly-optimistic method benchmarks if not accounted for when testing new methods. With that caveat, it’s a good method & a great way to jump into simulation studies. A nice introductory vignette can be found here.
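One quick way to check the sparsity caveat on your own data is to compare the fraction of zero entries in a simulated counts matrix against a real one. The sketch below is a minimal pure-Python version of that check. Both toy matrices and the function name `zero_fraction` are made up for illustration and are not output from any simulator or real dataset.

```python
def zero_fraction(counts):
    """Fraction of entries equal to zero in a genes-by-cells counts matrix."""
    total = sum(len(row) for row in counts)
    if total == 0:
        raise ValueError("counts matrix is empty")
    zeros = sum(1 for row in counts for value in row if value == 0)
    return zeros / total


# Hypothetical toy matrices (rows = genes, columns = cells).
real_counts = [
    [0, 0, 3, 0],
    [0, 1, 0, 0],
    [2, 0, 0, 0],
]
simulated_counts = [
    [1, 0, 3, 2],
    [0, 1, 4, 0],
    [2, 1, 0, 5],
]

real_sparsity = zero_fraction(real_counts)
sim_sparsity = zero_fraction(simulated_counts)
print(f"real: {real_sparsity:.2f}, simulated: {sim_sparsity:.2f}")
```

If the simulated matrix is noticeably less sparse than the real data it is meant to mimic, benchmark results based on it deserve some extra skepticism.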
14 Computational Reproducibility & Scalability
- Reproducible pipelines in R using {targets}
  - The targets package is one of my favorite R tools, & the well-written docs above show how to create version-controlled pipelines entirely using R. This framework is an absolute godsend for large projects, simulation studies, etc., and I’ve used it on every long-term computational project I’ve worked on in the past 2 years. It makes tracking, reproducing, & parallelizing the execution of large codebases very easy, and makes reproducible research accessible to anyone with a good handle on R.
- Parallelization in Seurat vignette
  - Shows usage of the future R package to speed up tasks such as normalization, integration, clustering, & differential expression.
15 Miscellaneous
- Dr. Ming (Tommy) Tang’s genomics tools GitHub repository
- Dr. Ming Tang is a wet lab scientist-turned-computational biologist who used to work at Dana Farber Cancer Institute & now is a director at Immunitas. He’s put together a ton of resources for anyone (biologists, bioinformatics newbies, even experienced computational biologists) who’s in the process of learning more about bioinformatics & biostatistics. This repository contains a ton of useful resources (books, blog posts, online courses, etc.) on statistics, computing, biology, and other topics. I’d recommend perusing it even if you’re not a beginner, as you’ll likely find something interesting or useful to read.