# Introduction

If you’re like me, you switch between R & Python pretty frequently. Despite all the hype, single-cell analysis is still a relatively nascent field, and not every method is implemented in both languages. For example: if you want to run an RNA velocity analysis you’re probably going to need Python, but when it comes to making figures for your publication it’s generally easier to go back to R and use `ggplot2`. Whatever the motivation, it’s useful to be able to switch seamlessly between programming environments - though doing so isn’t always easy.
There are several libraries that facilitate conversions from R-based scRNA-seq data structures (typically `SingleCellExperiment` or `Seurat` objects) to Python-based ones (`AnnData` and `loom` being the most common formats). I’m aware of [`{zellkonverter}`](https://theislab.github.io/zellkonverter/), [`{SeuratDisk}`](https://mojaveazure.github.io/seurat-disk/), [`{anndata2ri}`](https://icb-anndata2ri.readthedocs-hosted.com/en/latest/), [`{loomR}`](https://github.com/mojaveazure/loomR), & [`{sceasy}`](https://github.com/cellgeni/sceasy). While these packages generally work fairly well, they don’t always transfer every piece of data you need, and reading through their source code to figure out why your objects are incomplete can eat up a lot of time. In addition, the more niche / less-used dataset attributes are often undocumented or poorly documented in conversion packages, making it difficult to identify how the data types correspond between formats. As such, it’s sometimes necessary to get through the conversion on your own. This vignette will hopefully make doing so a bit easier by detailing the equivalencies between the different types of counts matrices, metadata, graphs, & more. We’ll focus on converting directly from `Seurat` to `AnnData`, since converting between formats within the same language (e.g., `AnnData` to `loom`, or `SingleCellExperiment` to `Seurat`) is generally easier & much better-documented.
# Data

First we’ll load a dataset to analyze; I’m a fan of the neural development data from [La Manno *et al* (2016)](https://doi.org/10.1016/j.cell.2016.09.027) as a good example for trajectory / velocity methods. The dataset is available in the `scRNAseq` package, though we’ll need to convert it from the default `SingleCellExperiment` format to a `Seurat` object first.
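After loading the R libraries we’ll need, the conversion is a single call to `CreateSeuratObject()`; `counts()` & `colData()` are the standard `SingleCellExperiment` accessors.

```{r}
#| results: hide
library(dplyr)                 # data manipulation
library(Seurat)                # scRNA-seq tools
library(ggplot2)               # plots
library(biomaRt)               # gene metadata
library(paletteer)             # color palettes
library(patchwork)             # plot alignment
library(reticulate)            # Python interface
library(SingleCellExperiment)  # counts() & colData() accessors
brain <- scRNAseq::LaMannoBrainData(which = "human-embryo")
brain <- CreateSeuratObject(counts = counts(brain),
                            project = "neural_development",
                            assay = "RNA",
                            meta.data = data.frame(colData(brain), row.names = colnames(brain)),
                            min.cells = 3,
                            min.features = 3)
```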
# Analysis

## Preprocessing in R

We’ll start by running the data through a very typical preprocessing pipeline composed of QC, HVG selection, dimension reduction, Leiden graph-based clustering, & visualization. We’ll do a couple of non-standard things though. First, we make sure to return the fitted UMAP model from `RunUMAP()`, which will allow us to later recompute the set of nearest neighbors that UMAP uses internally. Next, we identify nearest neighbors twice, once with `return.neighbor` enabled & once without it. The set of NNs identified is the same each time, but the first run returns the NN indices and inter-cell distances, while the second returns the shared nearest-neighbor graph that we use as input to the Leiden clustering algorithm in `FindClusters()`.
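The full chain looks like this - note `return.model = TRUE` in `RunUMAP()` & the two `FindNeighbors()` calls:

```{r}
brain <- brain %>%
  PercentageFeatureSet(pattern = "^RP[SL]", col.name = "percent_ribo") %>%
  PercentageFeatureSet(pattern = "^MT-", col.name = "percent_mito") %>%
  NormalizeData(verbose = FALSE) %>%
  ScaleData(verbose = FALSE) %>%
  FindVariableFeatures(nfeatures = 3000, verbose = FALSE) %>%
  RunPCA(features = VariableFeatures(.), npcs = 30, verbose = FALSE, seed.use = 312, approx = TRUE) %>%
  CellCycleScoring(s.features = cc.genes.updated.2019$s.genes,
                   g2m.features = cc.genes.updated.2019$g2m.genes,
                   set.ident = FALSE) %>%
  AddMetaData(.$S.Score - .$G2M.Score, col.name = "CC_difference") %>%
  RunUMAP(reduction = "pca", dims = 1:30, return.model = TRUE, n.neighbors = 20,
          n.components = 2, metric = "cosine", n.epochs = 1000, seed.use = 312, verbose = FALSE) %>%
  FindNeighbors(reduction = "pca", dims = 1:30, k.param = 20, return.neighbor = TRUE,
                nn.method = "annoy", annoy.metric = "cosine", verbose = FALSE) %>%
  FindNeighbors(reduction = "pca", dims = 1:30, k.param = 20, compute.SNN = TRUE,
                nn.method = "annoy", annoy.metric = "cosine", verbose = FALSE) %>%
  FindClusters(resolution = 0.25, algorithm = 4, method = "igraph", random.seed = 312, verbose = FALSE)
```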
We’ll also add & clean up the metadata by creating a cell age variable and sorting the fine celltype labels into slightly coarser categories for visualization. This data is mostly derived from [Figure 1](https://www.sciencedirect.com/science/article/pii/S0092867416313095?via%3Dihub#fig1) and [Supplementary Table 1](https://www.sciencedirect.com/science/article/pii/S0092867416313095?via%3Dihub#mmc1) in the original paper.
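The recoding is a pair of `case_when()` calls:

```{r}
brain@meta.data <- mutate(brain@meta.data,
                          age = case_when(Timepoint == "week_6" ~ 6,
                                          Timepoint == "week_7" ~ 7,
                                          Timepoint == "week_8" ~ 8,
                                          Timepoint == "week_9" ~ 9,
                                          Timepoint == "week_10" ~ 10,
                                          Timepoint == "week_11" ~ 11,
                                          TRUE ~ NA_real_),
                          age = as.factor(age),
                          broad_celltype = case_when(Cell_type %in% c("hOMTN") ~ "OMT Nucleus",
                                                     Cell_type %in% c("hSert") ~ "Serotonergic",
                                                     Cell_type %in% c("hGaba", "hNbGaba") ~ "Gabaergic",
                                                     Cell_type %in% c("hDA0", "hDA1", "hDA2") ~ "Dopaminergic",
                                                     Cell_type %in% c("hRgl1", "hRgl2a", "hRgl2b", "hRgl2c", "hRgl3") ~ "Radial Glia",
                                                     Cell_type %in% c("hOPC") ~ "Oligodendrocyte Precursor",
                                                     Cell_type %in% c("hPeric") ~ "Pericyte",
                                                     Cell_type %in% c("hEndo") ~ "Endothelial",
                                                     Cell_type %in% c("hNProg", "hProgBP", "hProgFPL", "hProgFPM", "hProgM") ~ "Progenitor",
                                                     Cell_type %in% c("hRN") ~ "Red Nucleus",
                                                     Cell_type %in% c("hNbM", "hNbML1", "hNbML5") ~ "Neuroblast",
                                                     Cell_type %in% c("hMgl") ~ "Microglia",
                                                     Cell_type %in% c("Unk") ~ "Unknown",
                                                     TRUE ~ NA_character_),
                          broad_celltype = as.factor(broad_celltype))
```

Before visualizing the results, we’ll define a color palette for each of our key metadata features using the `paletteer` package, along with a clean theme & legend settings for our dimension reduction plots.

```{r}
palette_cluster <- paletteer_d("ggsci::nrc_npg")
palette_celltype <- paletteer_d("ggthemes::Tableau_20")
palette_age <- paletteer_d("MetBrewer::Juarez")
palette_cc <- paletteer_d("ggsci::default_locuszoom")

theme_umap <- function(base.size = 1) {
  scLANE::theme_scLANE() +
    theme(axis.ticks = element_blank(),
          axis.text = element_blank(),
          plot.subtitle = element_text(face = "italic"),
          plot.caption = element_text(face = "italic"))
}
guide_umap <- function(key.size = 4) {
  guides(color = guide_legend(override.aes = list(size = key.size, alpha = 1)))
}
```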
We plot the unsupervised graph-based clustering, the true celltype labels, cell ages, & estimated cell cycle phase on our UMAP embedding.
```{r}
#| code-fold: true
#| fig-width: 12
#| fig-height: 6
p0 <- DimPlot(brain, group.by = "seurat_clusters", cols = alpha(0.75, colour = palette_cluster),
              shuffle = TRUE, seed = 312, pt.size = 1) +
  labs(x = "UMAP 1", y = "UMAP 2",
       title = "Leiden Graph-based Clustering",
       subtitle = "k = 20, r = 0.25") +
  theme_umap() +
  guide_umap()
p1 <- DimPlot(brain, group.by = "broad_celltype", cols = alpha(0.75, colour = palette_celltype),
              shuffle = TRUE, seed = 312, pt.size = 1) +
  labs(x = "UMAP 1", y = "UMAP 2",
       title = "Coarse Celltypes",
       subtitle = "Derived from authors' original annotations") +
  theme_umap() +
  guide_umap()
p2 <- DimPlot(brain, group.by = "age", cols = alpha(0.75, colour = palette_age),
              shuffle = TRUE, seed = 312, pt.size = 1) +
  labs(x = "UMAP 1", y = "UMAP 2",
       title = "Cell Age",
       subtitle = "Timepoints measured in weeks") +
  theme_umap() +
  guide_umap()
p3 <- DimPlot(brain, group.by = "Phase", cols = alpha(0.75, colour = palette_cc),
              shuffle = TRUE, seed = 312, pt.size = 1) +
  labs(x = "UMAP 1", y = "UMAP 2",
       title = "Cell Cycle Phase",
       subtitle = "Estimated using genes from Tirosh et al (2016)") +
  theme_umap() +
  guide_umap()
(p0 | p1) / (p2 | p3)
```
## Conversion to `AnnData`
Having done all this preprocessing already, when using tools in Python we’d ideally like to keep our clustering, annotations, embeddings, etc. as they are instead of recomputing them with `scanpy`. Recomputing isn’t the worst option, of course, but it’s pretty much impossible to get the results of stochastic algorithms such as Leiden clustering and UMAP (and even simpler ones like HVG identification) to agree across different implementations. With that in mind, let’s port our results to Python!
::: {.callout-note}
There are essentially two ways to do this. The first relies on RMarkdown & `reticulate`, which allow you to pass objects back & forth between languages. The second, more general method is to write the data to common file formats from R, read those files in with Python, & clean them up afterwards. We’ll show examples of both options, but the file-based method is the better general practice. We’ll set up a temporary directory for the conversion files now, then remove it once we’re done.
:::
```{r}
if (!dir.exists("./conversion_files/")) {
  dir.create("./conversion_files")
}
```
### Raw Counts
First, we’ll write the counts matrices to a sparse matrix market file, which can be read into Python via `scipy`. **Note**: the `AnnData` ecosystem is built around $\text{cell} \times \text{gene}$ counts matrices, which is transposed relative to how R-based single-cell data structures store them. Make sure to transpose in either R or Python before creating an `AnnData` object.
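Writing the raw counts is a one-liner (the `@layers$counts` slot here reflects the `Seurat` v5 object layout):

```{r}
#| results: hide
Matrix::writeMM(brain@assays$RNA@layers$counts, file = "./conversion_files/RNA_counts.mtx")
```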
### Cell & Gene Metadata

We’ll use CSVs to store the metadata for our cells & genes; these can be read into Python using `pandas`. The `biomaRt` package lets us pull an Ensembl ID & biotype for each gene in our dataset, which we save as well.
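The cell-level tables are written directly; for the genes, we query Ensembl & keep one row per gene symbol:

```{r}
readr::write_csv(brain@meta.data, file = "./conversion_files/cell_metadata.csv", col_names = TRUE)
readr::write_csv(data.frame(cell = colnames(brain)), file = "./conversion_files/cells.csv", col_names = TRUE)
readr::write_csv(data.frame(gene = rownames(brain)), file = "./conversion_files/genes.csv", col_names = TRUE)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
gene_mapping_table <- getBM(filters = "hgnc_symbol",
                            attributes = c("hgnc_symbol", "ensembl_gene_id", "gene_biotype"),
                            values = rownames(brain),
                            mart = mart,
                            uniqueRows = TRUE)
# keep exactly one row per gene symbol, in the same order as the Seurat object
gene_mapping_table <- data.frame(hgnc_symbol = rownames(brain)) %>%
  left_join(gene_mapping_table, by = "hgnc_symbol") %>%
  with_groups(hgnc_symbol, mutate, R = row_number()) %>%
  filter(R == 1) %>%
  dplyr::select(-R)
readr::write_csv(gene_mapping_table, file = "./conversion_files/gene_mapping.csv", col_names = TRUE)
```

### Embeddings

Next, we’ll create CSVs of our PCA & UMAP embeddings. Once we read these into Python they’ll need to be converted to `numpy` arrays.

```{r}
readr::write_csv(as.data.frame(brain@reductions$pca@cell.embeddings),
                 file = "./conversion_files/PCA.csv",
                 col_names = TRUE)
readr::write_csv(as.data.frame(brain@reductions$umap@cell.embeddings),
                 file = "./conversion_files/UMAP.csv",
                 col_names = TRUE)
```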
### Graph Structures

Lastly, we’ll save our nearest-neighbor graph data using a couple of different formats - sparse matrices for the NN distance & UMAP connectivity graphs, and a CSV for the NN indices. First we need to actually create the NN distance graph; in a `Seurat` object it’s stored as an $n \times k$ matrix containing only the distances to each cell’s nearest neighbors, while in an `AnnData` object the matrix is expanded to include a value of 0 for all non-neighbor cells, i.e., an $n \times n$ sparse matrix. Luckily this is a fairly quick conversion to make: we simply record the row & column index for each cell’s nearest neighbors along with the accompanying cosine distance, then use those triplets to build a sparse matrix.
```{r}
# number of neighbors k used in FindNeighbors()
knn_param <- ncol(brain@neighbors$RNA.nn@nn.idx)
# pre-allocate the (row, column, value) triplets
row_idx <- col_idx <- X <- vector("numeric", length = ncol(brain) * knn_param)
for (i in seq(ncol(brain))) {
  # each cell i occupies k consecutive positions in the triplet vectors
  row_idx[(knn_param * i - (knn_param - 1)):(knn_param * i)] <- i
  col_idx[(knn_param * i - (knn_param - 1)):(knn_param * i)] <- brain@neighbors$RNA.nn@nn.idx[i, ]
  X[(knn_param * i - (knn_param - 1)):(knn_param * i)] <- brain@neighbors$RNA.nn@nn.dist[i, ]
}
# assemble the n x n sparse distance matrix
knn_dist_mat <- Matrix::sparseMatrix(i = row_idx, j = col_idx, x = X)
```
The trickiest bit is the cell-level [connectivities](https://scanpy.readthedocs.io/en/stable/generated/scanpy.Neighbors.html). This term comes up often in analyses done with `scanpy` and `scvelo`, but there isn’t really a direct equivalent in R. In short, `scanpy` (and the packages built on top of it) uses part of the UMAP algorithm to compute the nearest-neighbor graph & embed the cells in one step, rather than computing the neighbors separately from the UMAP embedding as is done in R. More technically, `scanpy` borrows the computation of the [fuzzy simplicial set](https://umap-learn.readthedocs.io/en/latest/_modules/umap/umap_.html#fuzzy_simplicial_set) from UMAP, which is essentially a local approximation of the distances between points under a given distance metric (cosine, in our case). See [How UMAP Works](https://umap-learn.readthedocs.io/en/latest/how_umap_works.html), specifically the section entitled “Adapting to Real World Data”, for many more details.
Since the UMAP embeddings you get from `Seurat` & `scanpy` will differ somewhat because of differences in their underlying implementations, we’d like to extract the connectivities from our pre-computed embedding and bring them into Python, instead of recomputing them with `scanpy` and ending up with a set of neighbors that doesn’t exactly correspond to our UMAP. We can do this by refitting the UMAP model with the exact settings we used when running it through the `Seurat` wrapper function, but this time specifying `ret_extra = c("fgraph")`, which tells `uwot` to return the fuzzy simplicial set matrix. Some of the parameters aren’t stored in the returned `uwot` model, so we need to set them manually. The only tricky one is the initialization; as of this writing, `RunUMAP()` uses the spectral initialization (normalized Laplacian plus Gaussian noise). **Note**: make sure to use the same random seed as you did when calling `RunUMAP()` originally.
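The refit pulls most parameters straight out of the stored `uwot` model:

```{r}
set.seed(312)
umap_reembed <- uwot::umap(X = brain@reductions$pca@cell.embeddings,
                           n_neighbors = brain@reductions$umap@misc$model$n_neighbors,
                           n_components = 2,
                           metric = "cosine",
                           n_epochs = brain@reductions$umap@misc$model$n_epochs,
                           init = "spectral",
                           learning_rate = brain@reductions$umap@misc$model$alpha,
                           min_dist = 0.1,
                           local_connectivity = brain@reductions$umap@misc$model$local_connectivity,
                           nn_method = "annoy",
                           negative_sample_rate = brain@reductions$umap@misc$model$negative_sample_rate,
                           ret_extra = c("fgraph"),
                           n_threads = 2,
                           a = brain@reductions$umap@misc$model$a,
                           b = brain@reductions$umap@misc$model$b,
                           search_k = brain@reductions$umap@misc$model$search_k,
                           approx_pow = brain@reductions$umap@misc$model$approx_pow,
                           verbose = FALSE)
```

Now that we’ve created our graph data, we write it all to disk:

```{r}
#| results: hide
Matrix::writeMM(knn_dist_mat, file = "./conversion_files/KNN_distances.mtx")
Matrix::writeMM(umap_reembed$fgraph, file = "./conversion_files/UMAP_connectivity_matrix.mtx")
readr::write_csv(as.data.frame(brain@neighbors$RNA.nn@nn.idx),
                 file = "./conversion_files/KNN_indices.csv",
                 col_names = TRUE)
```

## Downstream Analysis in Python

We read our data into Python & do some necessary reformatting; the column names (e.g., `Cell_ID`) are specific to this dataset’s metadata.

```{python}
import numpy as np
import pandas as pd
import scanpy as sc
import anndata as ad
from scipy.io import mmread

# raw counts - transposed to cell x gene on the way in
RNA_counts = mmread('./conversion_files/RNA_counts.mtx').transpose().tocsr().astype('float64')
# cell & gene metadata
cell_metadata = pd.read_csv('./conversion_files/cell_metadata.csv') \
    .rename(columns={'Cell_ID': 'cell'}) \
    .set_index('cell', drop=False)
cell_metadata['seurat_clusters'] = cell_metadata['seurat_clusters'].astype('category')
gene_mapping_table = pd.read_csv('./conversion_files/gene_mapping.csv').set_index('hgnc_symbol', drop=False)
# reduced-dimensional embeddings, converted to numpy arrays
pca_embedding = pd.read_csv('./conversion_files/PCA.csv').to_numpy()
umap_embedding = pd.read_csv('./conversion_files/UMAP.csv').to_numpy()
# KNN graphs - remember that R is 1-indexed & Python is 0-indexed
knn_idx = pd.read_csv('./conversion_files/KNN_indices.csv').to_numpy().astype('int32') - 1
knn_dist_mat = mmread('./conversion_files/KNN_distances.mtx').tocsr().astype('float64')
knn_conn_mat = mmread('./conversion_files/UMAP_connectivity_matrix.mtx').tocsr().astype('float64')
```

Now we can finally create our `AnnData` object, making sure to stash the unprocessed version of the data in the `AnnData.raw` slot:

```{python}
adata = ad.AnnData(X=RNA_counts,
                   obs=cell_metadata,
                   var=gene_mapping_table,
                   obsm={'X_pca': pca_embedding, 'X_umap': umap_embedding},
                   obsp={'distances': knn_dist_mat, 'connectivities': knn_conn_mat},
                   uns={'neighbors': {'connectivities_key': 'connectivities',
                                      'distances_key': 'distances',
                                      'indices': knn_idx,
                                      'params': {'n_neighbors': knn_idx.shape[1],
                                                 'method': 'Seurat::FindNeighbors()',
                                                 'metric': 'cosine',
                                                 'n_pcs': pca_embedding.shape[1],
                                                 'use_rep': 'X_pca'}}})
adata.raw = adata
```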
Now we can proceed with a typical downstream analysis using the `scanpy` library. After processing the cells, we estimate a force-directed graph embedding and then a diffusion map embedding, which we’ll use to estimate pseudotime.
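The processing mirrors what we did in R (HVG selection, normalization, log-transformation, scaling) before the two embeddings:

```{python}
sc.pp.filter_genes_dispersion(adata, flavor='seurat', n_top_genes=3000)
sc.pp.normalize_per_cell(adata)
sc.pp.log1p(adata)
sc.pp.scale(adata)
sc.tl.draw_graph(adata, layout='fr', n_jobs=2, random_state=312)  # force-directed graph
sc.tl.diffmap(adata, random_state=312)                            # diffusion map
```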
We assign a root cell belonging to the progenitor population, then estimate diffusion pseudotime for each cell, and finally plot the results with our diffusion map embedding.
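One way to pick the root is to take the cell at the extreme of a diffusion component that lines up with the progenitor population - here I use the minimum of the second component, which is an assumption specific to this embedding & should be checked against the celltype labels:

```{python}
# root cell = minimum of the second diffusion component (assumed to track progenitors)
adata.uns['iroot'] = np.argmin(adata.obsm['X_diffmap'][:, 1])
sc.tl.dpt(adata)
sc.pl.scatter(adata, basis='diffmap', color='dpt_pseudotime', color_map='gnuplot')
```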
When plotting the distribution of diffusion pseudotime for each coarse celltype, we see that more immature celltypes such as progenitors have lower values, whereas more mature celltypes like endothelial cells & microglia have larger values. This indicates that our DPT estimate is a decent proxy for progression through the underlying biological process.
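Something along these lines produces the violin plot, ordering celltypes by their mean DPT:

```{python}
import matplotlib.pyplot as plt

# order celltypes by mean diffusion pseudotime
violin_order = adata.obs.groupby('broad_celltype')['dpt_pseudotime'] \
    .mean().sort_values().index.tolist()
ax = sc.pl.violin(adata,
                  keys='dpt_pseudotime',
                  groupby='broad_celltype',
                  order=violin_order,
                  stripplot=False,
                  rotation=40,
                  show=False)
ax.set_ylabel('Diffusion Pseudotime')
plt.show()
```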
## File Cleanup

Lastly, we’ll remove the temporary directory we used to store our data while converting it all to `AnnData`. **Note**: always be super careful when using `rm -rf`!
```{r}
if (dir.exists("./conversion_files")) {
  system("rm -rf ./conversion_files")
}
```
## Returning Results to R
We’ll quickly pull the diffusion pseudotime estimate for each cell back into R using `reticulate`, and add it to our `Seurat` object. We also add the diffusion map embedding to the `reductions` slot of the object.
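Within a Quarto / RMarkdown session, `reticulate` exposes the `AnnData` object as `py$adata`:

```{r}
dpt_est <- py$adata$obs$dpt_pseudotime
names(dpt_est) <- py$adata$obs$cell
brain <- AddMetaData(brain, metadata = dpt_est, col.name = "scanpy_dpt")
diffmap_embed <- py$adata$obsm["X_diffmap"]
rownames(diffmap_embed) <- colnames(brain)
colnames(diffmap_embed) <- paste0("DC_", 1:ncol(diffmap_embed))
brain@reductions$diffmap <- CreateDimReducObject(embeddings = diffmap_embed,
                                                 assay = "RNA",
                                                 key = "DC_",
                                                 global = TRUE)
```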
We can also visualize gene dynamics over pseudotime for a few genes of interest. For example, we see that fibronectin 1 (*FN1*) is highly expressed towards the end of the biological process, specifically in pericyte & endothelial cells. These two celltypes form part of the blood-brain barrier, & are closely related to one another ([source](https://en.wikipedia.org/wiki/Pericyte#Blood–brain_barrier)).
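Here’s a condensed sketch of the plotting code, shown for *FN1* alone (the full figure facets several genes the same way):

```{r}
#| code-fold: true
data.frame(cell = colnames(brain),
           celltype = brain$broad_celltype,
           dpt = brain$scanpy_dpt,
           FN1 = GetAssayData(brain, "RNA")["FN1", ]) %>%
  ggplot(aes(x = dpt, y = FN1)) +
  geom_point(aes(color = celltype), size = 1, alpha = 0.75) +
  geom_smooth(method = "gam", se = FALSE, linewidth = 1.25, color = "black") +
  scale_color_manual(values = palette_celltype) +
  labs(x = "Diffusion Pseudotime", y = "Normalized Expression (FN1)") +
  scLANE::theme_scLANE() +
  theme(legend.title = element_blank()) +
  guides(color = guide_legend(override.aes = list(alpha = 1, size = 4)))
```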
# Conclusions

With the `reticulate` package it’s relatively easy to pass data back and forth between R & Python in order to take advantage of tools & workflows that are only available in one of the two languages. For larger files, or those that require datatype conversion, writing to disk works well too. For more information, check out [the `reticulate` documentation](https://rstudio.github.io/reticulate/index.html) and [the `scanpy` documentation](https://scanpy.readthedocs.io/en/stable/).