Analyzing Transcriptional Dynamics with CellRank

Author
Affiliation

Jack Leary

University of Florida

Published

April 16, 2024

Introduction

In this analysis we’ll use CellRank to identify multiple cell fates in pancreatic endocrinogenesis data by combining RNA velocity, pseudotime, & developmental potential information. The goal is to showcase a more advanced analysis that incorporates multiple modalities to infer a confident final trajectory.

Libraries

Code
import warnings                  # filter out warnings
import numpy as np               # matrix utilities
import scanpy as sc              # scRNA-seq processing
import pandas as pd              # dataframe tools
import scvelo as scv             # RNA velocity estimation
import anndata as ad             # scRNA-seq data structures
import cellrank as cr            # cell fate estimation
import matplotlib.pyplot as plt  # plot utilities
Code
warnings.simplefilter('ignore', category=UserWarning)

Theme for matplotlib

Code
base_size = 12
plt.rcParams.update({
    # font
    'font.size': base_size, 
    'font.weight': 'normal',
    # figure
    'figure.dpi': 320, 
    'figure.edgecolor': 'white', 
    'figure.facecolor': 'white', 
    'figure.figsize': (6, 4), 
    'figure.constrained_layout.use': True,
    # axes
    'axes.edgecolor': 'black',
    'axes.grid': False,
    'axes.labelpad': 2.75,
    'axes.labelsize': base_size * 0.8,
    'axes.linewidth': 1.5,
    'axes.spines.right': False,
    'axes.spines.top': False,
    'axes.titlelocation': 'left',
    'axes.titlepad': 11,
    'axes.titlesize': base_size,
    'axes.titleweight': 'normal',
    'axes.xmargin': 0.1, 
    'axes.ymargin': 0.1, 
    # legend
    'legend.borderaxespad': 1,
    'legend.borderpad': 0.5,
    'legend.columnspacing': 2,
    'legend.fontsize': base_size * 0.8,
    'legend.frameon': False,
    'legend.handleheight': 1,
    'legend.handlelength': 1.2,
    'legend.labelspacing': 1,
    'legend.title_fontsize': base_size, 
    'legend.markerscale': 1.25
})

Data

First, we load the data from Bastidas-Ponce et al (2019). This dataset is comprised of murine pancreas cells from timepoint E15.5, and includes both spliced & unspliced mRNA counts.

Code
ad_panc = scv.datasets.pancreas()
Code
ad_panc.raw = ad_panc

Analysis

Preprocessing

HVG Selection & Normalization

We begin by filtering out cells that have fewer than 20 total counts between the spliced & unspliced assays. Next, we select the top 3,000 most highly variable genes (HVGs), after which we normalize the counts using a log1p-transform. Lastly, we use the cell cycle gene sets from Tirosh et al (2016) to assign S-phase and G2M-phase scores to each cell in order to identify populations of cycling cells. Cell cycle variation is not always of biological interest, and can prevent trajectory structures from being as clean or simple as we’d like.

Code
scv.pp.filter_and_normalize(
    ad_panc, 
    min_shared_counts=20, 
    n_top_genes=3000,
    flavor='seurat',  
    subset_highly_variable=False, 
    log=False
)
sc.pp.log1p(ad_panc)
scv.tl.score_genes_cell_cycle(ad_panc)
Filtered out 20801 genes that are detected 20 counts (shared).
Normalized count data: X, spliced, unspliced.
Extracted 3000 highly variable genes.
calculating cell cycle phase
-->     'S_score' and 'G2M_score', scores of cell cycle phases (adata.obs)

PCA Embedding

After centering & scaling the normalized spliced mRNA counts matrix, we use the HVGs identified earlier to generate a 30-dimensional principal component analysis (PCA) embedding.

Code
sc.pp.scale(ad_panc)
sc.tl.pca(
    ad_panc, 
    n_comps=30, 
    random_state=312, 
    use_highly_variable=True
)

Visualizing the PCA embedding shows us a rather simple trajectory from endocrine precursors to mature endocrine cells.

Code
sc.pl.embedding(
    ad_panc, 
    basis='pca', 
    color='clusters',
    title='', 
    frameon=True, 
    size=20, 
    alpha=0.75, 
    show=False
)
plt.gca().set_xlabel('PC 1')
plt.gca().set_ylabel('PC 2')
plt.show()
Figure 1: PCA embedding

Graph-based Clustering

After identifying the 20 nearest neighbors (NNs) for each cell in PCA space, we use the Leiden algorithm to partition the NN graph into discrete clusters.

Code
sc.pp.neighbors(
    ad_panc, 
    use_rep='X_pca', 
    n_pcs=30, 
    n_neighbors=20, 
    metric='cosine', 
    random_state=312
)
sc.tl.leiden(
    ad_panc, 
    resolution=0.3, 
    random_state=312
)

The clusters generally correspond to our celltypes, with some minor differences e.g., overclustering in the ductal cell population (most likely due to cell cycle variation).

Code
sc.pl.embedding(
    ad_panc, 
    basis='pca', 
    color='leiden',
    title='', 
    frameon=True, 
    size=20, 
    alpha=0.75, 
    show=False
)
plt.gca().set_xlabel('PC 1')
plt.gca().set_ylabel('PC 2')
plt.show()
Figure 2: Leiden clusters in PCA space

Plotting the cell cycle scores on the PCA embedding shows us that the two ductal cell clusters are in fact split by enrichment for cell cycle genes.

Code
scv.pl.scatter(
  ad_panc, 
  basis='pca', 
  color_gradients=['S_score', 'G2M_score'], 
  smooth=True,
  perc=[5, 95], 
  frameon=True, 
  show=False
)
plt.gca().set_xlabel('PC 1')
plt.gca().set_ylabel('PC 2')
plt.show()
Figure 3: Cell cycle scores

UMAP Embedding

We generate a UMAP embedding in two dimensions.

Code
sc.tl.umap(ad_panc, random_state=312)

Upon visual inspection, the UMAP embedding appears to preserve both the transitions between celltypes and the heterogeneity of the mature endocrine cells better than the PCA embedding.

Code
sc.pl.embedding(
    ad_panc, 
    basis='umap', 
    color='clusters',
    title='', 
    frameon=True, 
    size=20, 
    alpha=0.75, 
    show=False
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
Figure 4: UMAP embedding

Diffusion Map Embedding

We’ll also generate a 2D diffusion map embedding, since that algorithm often works well for trajectory manifolds.

Code
sc.tl.diffmap(
  ad_panc, 
  n_comps=10, 
  random_state=312
)

We visualize the 2nd and 3rd components - for some reason there’s a bug in the scanpy source code that incorrectly arranges the returned array for the embedding. While the overall trajectory structure is preserved, the local structure of the mature endocrine celltypes is messy, making this embedding less than ideal for our purposes.

Code
sc.pl.embedding(
    ad_panc, 
    basis='diffmap', 
    color='clusters',
    title='', 
    frameon=True, 
    size=20, 
    alpha=0.75, 
    show=False, 
    components='2, 3'
)
plt.gca().set_xlabel('DC 1')
plt.gca().set_ylabel('DC 2')
plt.show()
Figure 5: Diffusion map embedding

Despite the imperfection of the diffusion map embedding, we’ll also estimate a diffusion pseudotime value for each cell, as the estimates might be useful later on. We set the root cell to be the ductal cell closest to the minimum of the second diffusion component (as visualized above).

Code
ad_panc.uns['iroot'] = np.argmin(ad_panc.obsm['X_diffmap'][:2])
sc.tl.dpt(ad_panc, n_dcs=10)

Visualizing the diffusion pseudotime on our UMAP embedding shows that it seems to recapitulate the underlying biological process fairly well, with ductal cells assigned lower values and endocrine cells assigned higher values.

Code
sc.pl.embedding(
    ad_panc, 
    basis='umap', 
    color='dpt_pseudotime',
    title='', 
    frameon=True, 
    size=20, 
    alpha=0.75, 
    show=False, 
    cmap='plasma'
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
Figure 6: Estimated diffusion pseudotime

Force-directed Graph Embedding

It’s often worth checking a force-directed graph embedding in addition to the UMAP embedding, as it can sometimes better preserve trajectory structure in the data.

Code
sc.tl.draw_graph(
    ad_panc, 
    layout='kk',
    random_state=312,
    n_jobs=4
)

While the overall trajectory structure is better (visually speaking) than the PCA embedding, variation in the ductal & endocrine progenitor celltypes is hidden. We’ll stick with the UMAP embedding going forwards.

Code
sc.pl.draw_graph(
    ad_panc, 
    color='clusters', 
    title='', 
    alpha=0.75, 
    size=20, 
    show=False
)
plt.gca().set_xlabel('Dim 1')
plt.gca().set_ylabel('Dim 2')
plt.show()
Figure 7: Force-directed graph embedding

RNA Velocity Estimation

Next, we’ll estimate RNA velocity vectors for each cell via scvelo. Details on the methodology can be found in the original paper, and a review of current challenges & limitations of velocity analyses can be read here.

SNN-based Imputation

We start by smoothing the spliced & unspliced counts across the NNs we identified earlier. This helps to preserve signal in otherwise very noisy velocity vectors.

Code
scv.pp.moments(
    ad_panc, 
    n_pcs=None, 
    n_neighbors=20, 
    use_rep='X_pca'
)
computing moments based on connectivities
    finished (0:00:01) --> added 
    'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)

Dynamical Velocity Computation

Next, we utilize the dynamical model to estimate velocity vectors, generate a velocity graph, and compute the uncertainty of each cells’ differentiation direction. Lastly, we project the velocity vectors onto our 2D UMAP embedding for visualization purposes.

Code
scv.tl.recover_dynamics(ad_panc, n_jobs=6)
scv.tl.velocity(
    ad_panc, 
    mode='dynamical', 
    use_highly_variable=True
)
scv.tl.velocity_graph(
    ad_panc, 
    n_jobs=4, 
    compute_uncertainties=True
)
scv.tl.velocity_confidence(ad_panc)
scv.tl.velocity_embedding(ad_panc, basis='umap')
recovering dynamics (using 6/8 cores)
  0%|          | 0/1430 [00:00<?, ?gene/s]
    finished (0:02:57) --> added 
    'fit_pars', fitted parameters for splicing dynamics (adata.var)
computing velocities
    finished (0:00:05) --> added 
    'velocity', velocity vectors for each individual cell (adata.layers)
computing velocity graph (using 4/8 cores)
  0%|          | 0/3696 [00:00<?, ?cells/s]
    finished (0:00:20) --> added 
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)
--> added 'velocity_length' (adata.obs)
--> added 'velocity_confidence' (adata.obs)
--> added 'velocity_confidence_transition' (adata.obs)
computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)

Visualizing the projected velocities with a streamline plot allows us to get a rough idea of which direction cells are differentiating towards in each local area. From the endocrine progenitors through pre-endocrine cells we see a smooth flow of differentiation, which is followed by branching trajectories towards each of the mature endocrine celltypes. The streamlines broadly correspond to known biology, which indicates that RNA velocity should provide decent biological insights about this dataset. We also see cycling in the ductal cells, which is expected as this dataset is known to have strong cell-cycle effects in the ductal cell cluster.

Code
scv.pl.velocity_embedding_stream(
    ad_panc, 
    basis='umap', 
    color='clusters',
    alpha=0.75, 
    size=20, 
    frameon=True, 
    linewidth=0.5, 
    legend_loc='right',
    arrow_size=0.8,
    title='', 
    show=False
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
Figure 8: Velocity embedding streamlines
Note

Streamline embedding plots generally provide qualitative insight at best, & their interpretation can be very subjective. Take care when forming biological conclusions or further hypotheses based on them, and verify what you see in a quantitative manner.

Visualizing the length of the velocity vectors in each cell gives us an idea of how quickly differentiation is occurring, as velocity vector length is a proxy measurement for differentiation speed. We see high scores in the pre-endocrine cells, as well as in the beta cells. This could be due to the phenomenon of self-duplication that is observed in beta cells, wherein new beta cells are produced in that way instead of through precursor differentiation. More information on beta cell specification can be read in Murtaugh (2007).

Code
sc.pl.embedding(
    ad_panc, 
    basis='umap', 
    color='velocity_length',
    title='', 
    frameon=True, 
    size=20, 
    alpha=0.75, 
    show=False, 
    legend_loc='right margin', 
    cmap='coolwarm'
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
Figure 9: Velocity length

Investigation of Velocity Genes

We begin by identifying ranking the velocity genes, then pull the top 6 highest-likelihood genes for each celltype.

Code
scv.tl.rank_dynamical_genes(ad_panc, groupby='clusters')
Code
top_dyn_genes = scv.get_df(ad_panc, 'rank_dynamical_genes/names').head(6)

Plotting the spliced vs. unspliced mRNA estimates along with the learned dynamics per-gene allows us to determine which genes have dynamics specific to mature endocrine celltypes. For example, Pak3 is present in the top 6 genes of every endocrine celltype. Interestingly, this gene is mostly known to be associated in humans with intellectual development disorders, though it’s also present in GO biological processes such as differentiation and system development. Lastly, according to Piccand et al (2013), Pak3 is linked to beta cell specification and has implications for diabetes risk, but their study does not conclude that it is necessary for the development of other endocrine celltypes.

Another interesting tidbit is the presence of Meis2 as a top gene for both delta & epsilon cells. This gene is associated with organogenesis and autism, and is known to be a vital part of neural crest development as per Machon et al (2015). It’s role in some endocrine celltypes’ development but not others could indicate that the specification of delta & epsilon cells is more similar than that of alpha and beta cells.

Alpha Cells

Code
scv.pl.scatter(
  ad_panc, 
  top_dyn_genes['Alpha'], 
  ncols=3, 
  frameon=False, 
  alpha=0.75, 
  size=20, 
  show=False
)
plt.gcf().supxlabel('Spliced mRNA')
plt.gcf().supylabel('Unspliced mRNA')
plt.show()
Figure 10: Alpha cell velocity genes

Beta Cells

Code
scv.pl.scatter(
  ad_panc, 
  top_dyn_genes['Beta'], 
  ncols=3, 
  frameon=False, 
  alpha=0.75, 
  size=20, 
  show=False
)
plt.gcf().supxlabel('Spliced mRNA')
plt.gcf().supylabel('Unspliced mRNA')
plt.show()
Figure 11: Beta cell velocity genes

Delta Cells

Code
scv.pl.scatter(
  ad_panc, 
  top_dyn_genes['Delta'], 
  ncols=3, 
  frameon=False, 
  alpha=0.75, 
  size=20, 
  show=False
)
plt.gcf().supxlabel('Spliced mRNA')
plt.gcf().supylabel('Unspliced mRNA')
plt.show()
Figure 12: Delta cell velocity genes

Epsilon Cells

Code
scv.pl.scatter(
  ad_panc, 
  top_dyn_genes['Epsilon'], 
  ncols=3, 
  frameon=False, 
  alpha=0.75, 
  size=20, 
  show=False
)
plt.gcf().supxlabel('Spliced mRNA')
plt.gcf().supylabel('Unspliced mRNA')
plt.show()
Figure 13: Epsilon cell velocity genes

Cell Fate Identification

In this part of the analysis we’ll focus on identifying a set of terminal and initial cell states, along with the probabilities of transitions between them. This is performed via CellRank; the details of the method can be read in the original paper as well as the preprint for the second version.

Velocity Kernel

We start by computing a cell-cell transition probability matrix based on our RNA velocity estimates from earlier. Using the model='monte_carlo' option allows us to propagate the velocity vector uncertainty estimates forward to the estimated transition probability matrix.

Code
vk = cr.kernels.VelocityKernel(ad_panc)
vk.compute_transition_matrix(
    model='monte_carlo', 
    similarity='cosine', 
    seed=312, 
    n_jobs=4
)

Visualizing the directed transitions on our UMAP embedding shows essentially the same patterns as our velocity vector projections from above, as expected. A drawback of using velocity on this dataset is that the inferred directionality in the ductal cell population is uncertain, so we’ll need another source of information to strengthen our inferred trajectory.

Code
vk.plot_projection(
    basis='umap', 
    recompute=True, 
    linewidth=0.5, 
    arrow_size=0.8,
    color='clusters', 
    size=20, 
    alpha=0.75, 
    legend_loc='right margin', 
    title='', 
    frameon=True, 
    show=False
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
Figure 14: Velocity kernel projection

Visualizing the velocity confidence for each cell confirms our intuition that the ductal cells aren’t fully interpretable using velocity-based information, as scores are slightly lower for the ductal cells than for e.g., the endocrine precursors or mature endocrine cells.

Code
sc.pl.embedding(
    ad_panc, 
    basis='umap', 
    color='velocity_confidence',
    title='', 
    frameon=True, 
    size=20, 
    alpha=0.75, 
    show=False, 
    legend_loc='right margin', 
    cmap='coolwarm'
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
Figure 15: UMAP colored by velocity coherence

Plotting the coherence of the velocity estimates per-celltype shows us that the alpha cells also have a relatively low velocity confidence. This could be because they’re located at the end of the trajectory / they’re terminally differentiated, or it could mean that the cells are of lower quality than those in other clusters.

Code
sc.pl.violin(
  ad_panc, 
  keys='velocity_confidence', 
  groupby='clusters', 
  rotation=30, 
  xlabel='', 
  ylabel='Velocity Confidence',
  show=False
)
plt.show()
Figure 16: Velocity coherence by celltype

CytoTRACE Kernel

In addition to the velocity kernel we’ll compute another transition probability matrix based on CytoTRACE scores. The CytoTRACE method essentially uses the number of expressed genes as a proxy for differentiation potential. The details of the method can be found in Gulati et al (2020).

Code
ctk = cr.kernels.CytoTRACEKernel(ad_panc).compute_cytotrace()
ctk.compute_transition_matrix(
    threshold_scheme='soft',
    nu=0.5, 
    n_jobs=2
)

Visualizing the CytoTRACE score shows what we would expect - ductal cells, which function as the initial cell state in this dataset, express more genes and have a higher differentiation potential than the intermediate and terminal cell states.

Code
sc.pl.embedding(
    ad_panc, 
    basis='umap', 
    color='ct_score',
    title='', 
    frameon=True, 
    size=20, 
    alpha=0.75, 
    show=False, 
    legend_loc='right margin'
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
Figure 17: CytoTRACE differentiation scores

The embedding projection based on our CytoTRACE kernel displays a confident differentiation trajectory for the initial and intermediate cell states, but for the mature cell states with fewer expressed genes the differentiation directions are much less certain.

Code
ctk.plot_projection(
    basis='umap', 
    recompute=True, 
    linewidth=0.5, 
    color='clusters', 
    size=20, 
    alpha=0.75, 
    arrow_size=0.8, 
    legend_loc='right margin', 
    title='', 
    frameon=True, 
    show=False
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
Figure 18: CytoTRACE kernel projection

Pseudotime Kernel

Lastly, we can create another kernel based on our estimated diffusion pseudotime from earlier. This should help to smooth out the overall trajectory, since it represents the transitions between celltypes well.

Code
pk = cr.kernels.PseudotimeKernel(ad_panc, time_key='dpt_pseudotime')
pk.compute_transition_matrix(threshold_scheme='soft', n_jobs=2)

We visualize the directed transition matrix on our UMAP embedding, and see that the pseudotime kernel does a bit better of a job than the CytoTRACE kernel at identifying directionality in the mature endocrine cells.

Code
pk.plot_projection(
    basis='umap', 
    recompute=True, 
    linewidth=0.5, 
    color='clusters', 
    size=20, 
    alpha=0.75, 
    arrow_size=0.8, 
    legend_loc='right margin', 
    title='', 
    frameon=True, 
    show=False
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
Figure 19: Pseudotime kernel projection

Combined Kernel

As we just saw, individual sources of information about our data can miss critical aspects, like CytoTRACE not detecting much differentiation during the latter half of the trajectory. One of the best features of CellRank is the ability to combine sources of information by combining the kernels themselves. Here we combine the velocity, CytoTRACE, & pseudotime kernels using an equal weighting scheme. This results in a new transition probability matrix, which we can then project onto our embedding.

Code
ck = 0.33 * vk + 0.33 * ctk + 0.33 * pk

With all three source of information (velocity, differentiation potential, & pseudotime) combined, we finally have a relatively smooth trajectory that more or less lines up with known biology.

Code
ck.plot_projection(
    basis='umap', 
    recompute=True, 
    linewidth=0.5, 
    arrow_size=0.8, 
    color='clusters', 
    size=20, 
    alpha=0.75, 
    legend_loc='right margin', 
    title='', 
    frameon=True, 
    show=False
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
Figure 20: Combined kernel projection

Macrostate Estimation

Terminal States

After computing a Schur decomposition of the transition probability matrix, we make the assumption that each component corresponds to a cell macrostate - a somewhat vague concept that can loosely be interpreted as a cell phenotype.

Code
g = cr.estimators.GPCCA(ck)
g.compute_schur(n_components=15, method='brandts')
Code
g.compute_macrostates(
    cluster_key='clusters', 
    n_cells=20, 
    n_states=13, 
    method='brandts'
)

On our UMAP embedding we plot the 20 cells most likely to be assigned to each macrostate.

Code
g.plot_macrostates(
    which='all', 
    discrete=True, 
    same_plot=True, 
    legend_loc='right margin', 
    title='', 
    basis='umap',
    frameon=True,
    show=False
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
Figure 21: Macrostates identified via Schur decomposition

Based on the macrostate estimates & known biology, we designate the four mature endocrine celltypes as terminal states. This step can be performed automatically using a heuristic

Code
g.set_terminal_states(
    states=['Alpha', 'Beta', 'Delta', 'Epsilon'], 
    n_cells=20, 
    cluster_key='clusters'
)
Code
g.plot_macrostates(
    which='terminal', 
    discrete=False, 
    same_plot=True, 
    legend_loc='right margin', 
    title='', 
    frameon=True,
    basis='umap',
    show=False
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
Figure 22: Terminal cell fates

Initial States

Based on our prior knowledge that the ductal cells are the root state, we set the initial state to be the ductal cell macrostate closest to the beginning of the trajectory.

Code
g.set_initial_states(
    states=['Ductal_5'], 
    n_cells=20, 
    cluster_key='clusters'
)
Code
g.plot_macrostates(
    which='initial', 
    discrete=True,
    legend_loc='right margin', 
    title='', 
    frameon=True,
    basis='umap', 
    show=False
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
Figure 23: Initial cell state

Absorption Probability Estimation

Now that we have our terminal cell fates identified, we compute the probability of and time to absorption in each state.

Code
g.compute_absorption_times(n_jobs=2)

We visualize the transition probabilities and initial & stationary distributions for each coarse macrostate.

Code
g.plot_coarse_T(
    title='', 
    show_initial_dist=True, 
    show_stationary_dist=True
)
plt.show()
Figure 24: Coarse transition probability matrix

The next step is to estimate the absorption probability of each terminal state; these probabilities are denoted fate probabilities by CellRank.

Code
g.compute_fate_probabilities()

As expected, fate probabilities are highest near each terminal state. Beta cells seem to have the highest fate probability among endocrine progenitors, but this is most likely due to the higher frequency of beta cells in the dataset along with their position at the “end” of the trajectory on the embedding.

Code
g.plot_fate_probabilities(
    same_plot=True, 
    title='', 
    legend_loc='right margin', 
    frameon=True, 
    show=False
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
Figure 25: Cell fate probabilities

Lineage Driver Identification

We compute lineage drivers for each cell fate by estimating the correlation between spliced mRNA and fate probability, after which we estimate lineage priming. Lineage priming is the degree to which a cell is “committed” towards its cell fate; details of how this is calculated can be read in Velten et al (2017).

Code
lineage_drivers = g.compute_lineage_drivers(
    seed=312, 
    cluster_key='clusters', 
    confidence_level=0.95, 
    method='fisher'
)
lineage_priming = g.compute_lineage_priming()

Visualizing lineage priming on our UMAP embedding shows that fate commitment is strongest in the mature endocrine cells, with values being particularly high among epsilon cells.

Code
sc.pl.embedding(
    ad_panc, 
    basis='umap', 
    color='priming_degree_fwd', 
    title='', 
    frameon=True, 
    show=False
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
Figure 26: Lineage priming

Latent Time Estimation

By using the initial & terminal cell state probabilities as priors, we obtain a more accurate estimate of per-cell latent time.

Code
scv.tl.latent_time(
    ad_panc, 
    end_key='term_states_fwd_probs', 
    root_key='init_states_fwd_probs'
)

The estimated latent time seems to characterize our trajectory well, which is good news for later analysis of trajectory differential expression (TDE). Overall, the beta cells seem to take the longest to reach terminal differentiation.

Code
sc.pl.embedding(
    ad_panc, 
    basis='umap', 
    color='latent_time', 
    title='', 
    frameon=True,
    show=False
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
Figure 27: Gene-shared latent time

Directed Graph Abstraction

Using the initial & terminal cell state probabilities and gene-shared latent time as priors, we estimated a directed partitioned graph abstraction (PAGA), which shows us the general direction each celltype differentiates towards.

Code
scv.tl.paga(
    ad_panc, 
    groups='clusters', 
    end_key='term_states_fwd_probs', 
    root_key='init_states_fwd_probs', 
    use_time_prior='latent_time'
)

Visualizing the PAGA on our UMAP embedding shows a transition from ductal cells to endocrine progenitors, from pre-endocrine cells to the mature endocrine celltypes. The one bit that doesn’t agree with known biology is the transition from alpha to epsilon cells, as epsilon cells should be generated from pre-endocrine cells instead. This situation demonstrates the importance of having at least some a priori biological knowledge with which your analysis can be guided, since even with sophisticated methods scRNA-seq studies are still very difficult to get right.

Code
scv.pl.paga(
    ad_panc, 
    color='clusters',
    basis='umap', 
    size=20, 
    alpha=.25,
    min_edge_width=2, 
    frameon=True, 
    title='', 
    node_size_scale=0.75, 
    arrowsize=8,
    show=False
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
Figure 28: Directed PAGA

Conclusions

This dataset is relatively simple as far as trajectory structures go - there’s only one subject, one timepoint, one developmental lineage. In reality, scRNA-seq experiments are often much more complex design-wise. Those situations are where the ability of CellRank to easily combine multiple sources of information really shines. We didn’t use it here since the cells were all from a single timepoint, but the RealTimeKernel allows you to incorporate experimental time in addition to the other modalities we used here; this functionality is a great way to make sure your trajectory structure is capturing the cellular developmental process being studied and not just the progression of time. Overall, I think CellRank is a pretty well-designed method, and it’s been very helpful to me on several scRNA-seq projects this year since the v2 release.

Session Info

Code
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.1 (2022-06-23)
 os       macOS Big Sur ... 10.16
 system   x86_64, darwin17.0
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2023-12-08
 pandoc   3.1.9 @ /usr/local/bin/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 cli           3.6.1   2023-03-23 [1] CRAN (R 4.2.0)
 digest        0.6.33  2023-07-07 [1] CRAN (R 4.2.0)
 evaluate      0.23    2023-11-01 [1] CRAN (R 4.2.1)
 fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
 here          1.0.1   2020-12-13 [1] CRAN (R 4.2.0)
 htmltools     0.5.3   2022-07-18 [1] CRAN (R 4.2.0)
 htmlwidgets   1.5.4   2021-09-08 [1] CRAN (R 4.2.0)
 jsonlite      1.8.7   2023-06-29 [1] CRAN (R 4.2.0)
 knitr         1.40    2022-08-24 [1] CRAN (R 4.2.0)
 lattice       0.20-45 2021-09-22 [1] CRAN (R 4.2.1)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.2.0)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
 Matrix        1.6-3   2023-11-14 [1] CRAN (R 4.2.1)
 png           0.1-7   2013-12-03 [1] CRAN (R 4.2.0)
 Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.2.0)
 reticulate    1.28    2023-01-27 [1] CRAN (R 4.2.0)
 rlang         1.1.2   2023-11-04 [1] CRAN (R 4.2.1)
 rmarkdown     2.16    2022-08-24 [1] CRAN (R 4.2.0)
 rprojroot     2.0.3   2022-04-02 [1] CRAN (R 4.2.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
 stringi       1.7.8   2022-07-11 [1] CRAN (R 4.2.0)
 stringr       1.5.0   2022-12-02 [1] CRAN (R 4.2.0)
 vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.2.0)
 xfun          0.32    2022-08-10 [1] CRAN (R 4.2.0)
 yaml          2.3.5   2022-02-21 [1] CRAN (R 4.2.0)

 [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library

─ Python configuration ───────────────────────────────────────────────────────
 python:         /Users/jack/Desktop/PhD/Research/Python_Envs/personal_site/bin/python
 libpython:      /usr/local/opt/python@3.11/Frameworks/Python.framework/Versions/3.11/lib/python3.11/config-3.11-darwin/libpython3.11.dylib
 pythonhome:     /Users/jack/Desktop/PhD/Research/Python_Envs/personal_site:/Users/jack/Desktop/PhD/Research/Python_Envs/personal_site
 version:        3.11.6 (main, Nov  2 2023, 04:52:24) [Clang 14.0.3 (clang-1403.0.22.14.1)]
 numpy:          /Users/jack/Desktop/PhD/Research/Python_Envs/personal_site/lib/python3.11/site-packages/numpy
 numpy_version:  1.23.5
 
 NOTE: Python version was forced by use_python function

──────────────────────────────────────────────────────────────────────────────