scRNA-seq Resources

Author: Jack Leary

Affiliation: University of Florida

Published: April 16, 2024

1 Purpose

Here I’ll catalog useful papers, preprints, method vignettes, Twitter discussions, etc. that I’ve found helpful while learning how to process and analyze single-cell RNA-seq data. When possible, I’ll link to static versions of things in the hope that the links don’t break. I’ll categorize resources according to the problem they address (e.g., raw data processing, clustering, visualization), though some resources will of course touch on multiple topics.

2 Introductory Guides

  • Orchestrating Single Cell Analysis with Bioconductor
    • This book is a comprehensive guide to using Bioconductor tools built around the SingleCellExperiment data structure to perform scRNA-seq analysis. It covers basic & advanced versions of everything from quality control to many different types of downstream analyses, in both the single- and multi-sample cases.
  • Seurat guided clustering tutorial
    • The famous Seurat PBMC3k vignette that everyone has looked at at least once. Useful for beginners to get an idea of how Seurat works, but glosses over a ton of details & makes things look perhaps a bit too easy.

3 Processing Raw Reads

  • 10x Genomics cellranger documentation
    • Shows how to use the official cellranger pipeline maintained by 10x Genomics to turn your raw reads into a gene-by-cell counts matrix.
  • Quantifying unspliced & spliced RNA with alevin-fry
    • Use this pipeline if you want to perform RNA velocity analysis, or just want to have your read counts quantified by spliced / unspliced / ambiguous status (called USA mode in the documentation).
  • kallisto | bustools documentation
    • This command-line tool & its Python downstream analysis suite are maintained by the Pachter Lab and are very well-written. This framework also allows for unspliced vs. spliced RNA count quantification, which is necessary for RNA velocity analysis. I would recommend reading through the docs just for the tips on scRNA-seq analysis, even if you don’t end up using the tool.

4 Quality Control

5 Normalization

6 Integration

7 Clustering

8 Annotation

  • Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage
    • This paper details the usage of an automated cell type annotation tool called SingleR to transfer labels from a labelled reference dataset to an unlabelled query dataset. A good vignette can be found here. If your data are clean & well-processed, and your reference is high quality, it can be a fairly accurate & very efficient tool. I generally use it as a first step in an annotation process to generate broad cell type labels e.g., monocyte, T cell, fibroblast, etc. Don’t treat its labels as ground truth, and definitely try multiple reference datasets. In addition, it uses correlation to measure the similarity between the query & reference, & I’ve found that processing both datasets using the same pipeline (especially with respect to normalization) leads to better results.
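
To make the correlation idea concrete, here is a minimal, language-agnostic sketch in plain Python of correlation-based label transfer. It is not SingleR's implementation: it assumes Spearman rank correlation computed against one averaged expression profile per reference label, and the function names (`ranks`, `spearman`, `transfer_label`) and the toy expression values are all hypothetical.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def transfer_label(query_cell, reference_profiles):
    """Score the query against every reference label and pick the best one."""
    scores = {
        label: spearman(query_cell, profile)
        for label, profile in reference_profiles.items()
    }
    return max(scores, key=scores.get), scores


# Toy, made-up expression values over the same five genes (illustration only).
reference_profiles = {
    "T cell": [9.0, 8.0, 0.5, 0.2, 1.0],
    "monocyte": [0.3, 0.5, 7.0, 9.0, 1.2],
}
query_cell = [5.0, 6.0, 0.0, 1.0, 0.5]

label, scores = transfer_label(query_cell, reference_profiles)
print(label, scores)
```

Because rank correlation depends only on the ordering of genes within a cell, it tolerates some differences in scale between datasets, but it is still sensitive to which genes are compared & how each dataset was normalized, which fits with the advice above to process query & reference the same way.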

9 Dimension Reduction & Visualization

10 Trajectory Inference

10.1 Via Pseudotime

  • Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics
    • The slingshot package is currently my preferred method for estimating a pseudotemporal cellular ordering. A decent vignette can be found here. I would absolutely recommend using principal components as input to the algorithm instead of UMAP / t-SNE components.
  • Trajectory-based differential expression analysis for single-cell sequencing data
    • This paper describes the development of the tradeSeq R package, which uses generalized additive models (GAMs) to perform differential expression over an inferred cellular trajectory. The package has some limitations, but it provides a variety of tests for different patterns of gene expression, and overall strikes a good balance between running quickly & providing accurate results. A good vignette can be accessed here. A nice characteristic of the method is that it is agnostic with respect to the type of pseudotime estimation used, meaning the user can derive their cellular ordering using any pseudotime or RNA velocity method prior to running tradeSeq.

10.2 Via RNA Velocity

  • RNA velocity unraveled
  • On the mathematics of RNA velocity I: Theoretical analysis
  • On the mathematics of RNA velocity II: Algorithmic aspects
  • Unified fate mapping in multiview single-cell data
    • This preprint presents the CellRank 2 Python package, which implements a flexible, modular set of tools for the analysis of cell state composition and transitions. The package is an extension of CellRank (paper), and is built around the concept of kernels based on pseudotime, RNA velocity, graph connectivity, time-series, etc. that are used to create estimators that classify cells into initial, intermediate, and terminal cell states. While RNA velocity is a possible input, it’s no longer necessary to run the estimation routine, as it was in the original CellRank package. The great thing about this package is that you can combine kernels, i.e., you can base the cell state estimation on weighted combinations of pseudotime, RNA velocity, graph structure, etc., depending on how confident you are in each input.
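
The idea of combining kernels can be sketched conceptually in plain Python. This does not use the CellRank API. It assumes only that each input (e.g., pseudotime or RNA velocity) yields a row-stochastic cell-to-cell transition matrix, and that the combination is a weighted average with the weights normalized to sum to one. The function name and the toy matrices below are hypothetical.

```python
def combine_transition_matrices(matrices, weights):
    """Weighted average of row-stochastic matrices (weights are normalized)."""
    if len(matrices) != len(weights):
        raise ValueError("need exactly one weight per matrix")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    normalized = [w / total for w in weights]
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for w, m in zip(normalized, matrices)) for j in range(n)]
        for i in range(n)
    ]


# Toy 3-cell transition matrices (made up for illustration); each row sums to 1.
pseudotime_transitions = [
    [0.1, 0.9, 0.0],
    [0.0, 0.2, 0.8],
    [0.0, 0.0, 1.0],
]
velocity_transitions = [
    [0.5, 0.5, 0.0],
    [0.1, 0.4, 0.5],
    [0.0, 0.1, 0.9],
]

# Trust the pseudotime-based input three times as much as the velocity-based one.
combined = combine_transition_matrices(
    [pseudotime_transitions, velocity_transitions], weights=[3, 1]
)
for row in combined:
    print([round(value, 3) for value in row])
```

Because a weighted average of row-stochastic matrices is itself row-stochastic, the combined matrix can still be read as a set of transition probabilities, with the weights encoding relative confidence in each input.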

11 Differential Expression

12 Pathway Analysis

13 Simulation

  • Enhancing biological signals and detection rates in single-cell RNA-seq experiments with cDNA library equalization
    • This paper details the scaffold simulation method & accompanying R package (detailed vignette here) built by my PhD adviser Dr. Rhonda Bacher. scRNA-seq data simulation can often be tricky, and most packages have weird defaults, don’t generate data that follow true scRNA-seq count distributions, or are just really hard to set up & use. scaffold fixes all that, and allows the user to generate UMI or non-UMI data, as well as data containing multiple populations or dynamic trends over pseudotime. Ground truth cluster labels and cell temporal orderings are provided, which makes it a great choice when benchmarking new or extant methods. It’s fairly simple to use & computationally efficient as well, which is a huge plus when running large simulation studies.
  • Splatter: simulation of single-cell RNA sequencing data
    • Splatter is another R package that implements relatively simple simulation of scRNA-seq counts data. While the package works well, has decent documentation, and gives good ground-truth values for gene differential expression & cell pseudotime, the counts it generates are less sparse than real single-cell data typically are. This could lead to overly optimistic method benchmarks if not accounted for when testing new methods. With that caveat, it’s a good method & a great way to jump into simulation studies. A nice introductory vignette can be found here.
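
One simple way to check the sparsity caveat before benchmarking is to compare the fraction of zero counts in a simulated matrix against a real dataset. The plain-Python sketch below assumes counts stored as a genes-by-cells list of lists; the helper name `zero_fraction` and both toy matrices are made up for illustration and do not come from either package.

```python
def zero_fraction(counts):
    """Fraction of entries equal to zero in a genes-by-cells count matrix."""
    total = sum(len(row) for row in counts)
    if total == 0:
        raise ValueError("count matrix is empty")
    zeros = sum(1 for row in counts for value in row if value == 0)
    return zeros / total


# Toy matrices (3 genes x 4 cells), invented for illustration only.
real_counts = [
    [0, 0, 3, 0],
    [0, 1, 0, 0],
    [2, 0, 0, 0],
]
simulated_counts = [
    [1, 0, 3, 2],
    [0, 1, 4, 0],
    [2, 5, 0, 1],
]

real_sparsity = zero_fraction(real_counts)
simulated_sparsity = zero_fraction(simulated_counts)
print(f"real: {real_sparsity:.2f}, simulated: {simulated_sparsity:.2f}")
```

A large gap between the two values suggests the simulation settings should be adjusted, or the gap at least reported, before drawing conclusions from a benchmark.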

14 Computational Reproducibility & Scalability

  • Reproducible pipelines in R using {targets}
    • The targets package is one of my favorite R tools, & the well-written docs above show how to create version-controlled pipelines entirely using R. This framework is an absolute godsend for large projects, simulation studies, etc., and I’ve used it on every long-term computational project I’ve worked on in the past 2 years. It makes tracking, reproducing, & parallelizing the execution of large codebases very easy, and makes reproducible research accessible to anyone with a good handle on R.
  • Parallelization in Seurat vignette
    • Shows how to use the future R package to parallelize & speed up tasks such as normalization, integration, clustering, & differential expression.

15 Miscellaneous

  • Dr. Ming (Tommy) Tang’s genomics tools GitHub repository
    • Dr. Ming Tang is a wet lab scientist turned computational biologist who used to work at Dana-Farber Cancer Institute & is now a director at Immunitas. He’s put together a ton of resources for anyone (biologists, bioinformatics newbies, even experienced computational biologists) who’s in the process of learning more about bioinformatics & biostatistics. This repository collects books, blog posts, online courses, etc. on statistics, computing, biology, and other topics. I’d recommend perusing it even if you’re not a beginner, as you’ll likely find something interesting or useful to read.