In this analysis we’ll be looking at cells undergoing neurogenesis in the developing human frontal cortex. The data are taken from Nowakowski et al (2017). Our goal is to build a simple method for fitting single cell gene regulatory networks (GRNs) and compare the results across pseudotime lineages, keeping in mind the trajectory structure of the data. If you want to skip straight to the analysis, head to Section 6.
2 Libraries
We start by loading in a few necessary packages & resolving some function conflicts.
Unfortunately the data that are made available as part of the scRNAseq package come in a transcripts-per-million (TPM) normalized matrix - not raw counts. A rough conversion back to the raw integer counts is to multiply each cell’s normalized expression vector by the TPM size factor. This gets us most of the way there but the results are still estimates, so we round them to the nearest integer.
Note
This conversion ignores gene length, which is why the results are estimates & require rounding. For a more detailed discussion on how to un-normalize TPM counts see this StackOverflow post. While the original authors used a plate-based protocol (Fluidigm C1) they do not mention whether they used unique molecular identifiers (UMIs); if UMIs were used then gene length normalization isn’t necessary. Since simply reversing the TPM transformation doesn’t result in integer values I would guess that UMIs were not used, though their Supplementary Methods don’t mention gene length normalization.
We run the cells through a basic preprocessing pipeline: QC, normalization, highly variable gene (HVG) identification, PCA, UMAP, and graph-based Leiden clustering.
Here we examine some of our dataset’s attributes - the original celltype annotations, our new clusters, age in post-conception weeks (PCW), and the proportion of each cell’s reads aligning to ribosomal genes.
We begin by running a basic differential expression (DE) testing routine to identify some potential marker genes for each cluster. This will help us assign a celltype identity to each cluster.
Tip
As of Seurat v5, if you install the presto package using remotes::install_github("immunogenomics/presto"), the Wilcox tests performed here will be much, much faster.
We visualize the markers for each cluster with a dotplot, which shows clear differences between clusters. For example, the transcription factor (TF) aristaless related homeobox (ARX) is specifically expressed in cluster 7; this TF’s expression is known to be necessary for healthy neuronal development and helps to regulate differentiation.
The markers seem consistent with what we’d expect for excitatory neurons; for more information see e.g., Figure 4 and Suppl. Figures 4 & 11 in the original publication.
Table 1: The top-3 markers genes for each identified celltype.
6.2 Pseudotime estimation
After estimating principal curves across our UMAP embedding with slingshot, we generate per-lineage pseudotime values for each cell. We use our clustering as a source of structure in the data, and assign start and end clusters based on age in PCW. In addition, we mean-aggregate the pseudotime values for each cell in order to get a total pseudotime which represents overall differentiation regardless of lineage.
The graph structure estimate by slingshot is called a minimum spanning tree (MST), and describes the relationships between clusters in an undirected manner.
Code
p7 <-data.frame(Embeddings(seu_cortex, "umap")) %>%mutate(leiden = seu_cortex$seurat_clusters) %>%ggplot(aes(x = umap_1, y = umap_2, color = leiden)) +geom_point(size =1.25, alpha =0.75) +geom_path(data = sling_mst, mapping =aes(x = umap_1, y = umap_2, group = Lineage), linewidth =1.25, color ="black") +geom_point(data = sling_mst, mapping =aes(x = umap_1, y = umap_2, fill = Cluster), color ="black", shape =21, size =4.5, stroke =1.25, show.legend =FALSE) +scale_color_manual(values = palette_cluster) +scale_fill_manual(values = palette_cluster) +labs(x ="UMAP 1", y ="UMAP 2") +theme_scLANE(umap =TRUE) +theme(legend.title =element_blank()) +guide_umap()p7
Since our data have real-life experimental timepoints, it makes sense to normalize the pseudotime within each lineage to \([0, 1]\).
The principal curves show a trifurcating lineage structure.
Code
p9 <-data.frame(Embeddings(seu_cortex, "umap")) %>%mutate(leiden = seu_cortex$seurat_clusters) %>%ggplot(aes(x = umap_1, y = umap_2, color = leiden)) +geom_point(size =1.25, alpha =0.75) +geom_path(data = sling_curves,mapping =aes(x = umap_1, y = umap_2, group = Lineage), color ="black", linewidth =1.5, alpha =0.75, lineend ="round") +scale_color_manual(values = palette_cluster) +labs(x ="UMAP 1", y ="UMAP 2") +theme_scLANE(umap =TRUE) +theme(legend.title =element_blank()) +guides(color =guide_legend(override.aes =list(size =4, alpha =1, stroke =0.25)))p9
Lastly, we plot the distribution of mean-aggregated pseudotime per-timepoint and see that later timepoints have, in general, larger values of overall pseudotime. This indicates that our estimated pseudotime is linked to real experimental time in a meaningful (though fairly noisy) way.
Code
p10 <-select(seu_cortex@meta.data, Age_in_Weeks, PT_Overall) %>%mutate(Age_in_Weeks =as.factor(Age_in_Weeks)) %>%ggplot(aes(x = Age_in_Weeks, y = PT_Overall, color = Age_in_Weeks)) + ggbeeswarm::geom_quasirandom(size =1.25, alpha =0.75, show.legend =FALSE) +stat_summary(geom ="point", fun ="mean", color ="black",size =3) +scale_color_manual(values = palette_timepoint) +labs(x ="Age (PCW)", y ="Mean Psueodtime") +theme_scLANE()p10
6.3 Trajectory DE testing
Using scLANE (see the GitHub repository or the preprint for details) we perform trajectory DE significance testing for each HVG across each lineage.
Building a GRN requires two main inputs - a set of transcription factors, and a set of potential target genes. We load a database of TFs from Lambert et al (2018).
We identify a subset of all the TFs that are dynamic across our trajectory. Since TFs can regulate the expression of other TFs as they do non-TF target genes, we will include TFs in the set of “target genes” that we model.
The GRN has the following structure - for every target gene, a penalized model is fit with the dynamics of every possible TF as features (see below). We use the Gamma distribution with log link as our regression objective as the dynamics are continuous and strictly non-negative. The model’s internal regularization identifies a subset of TFs whose dynamics are predictive of the target gene’s dynamics. See e.g., the original XGBoost paper for details.
Our GRNs will be lineage-specific, since the matrix of gene dynamics differs by lineage. First we fit the GRN for lineage A; I’ve opted to use the LightGBM package to fit the models as it’s lightweight and in general much faster than other options such as XGBoost. In addition, I prefer this method to alternative penalized methods such as LASSO - boosted trees better capture nonlinearities in the data, though LASSO does offer advantages such as significance testing for coefficients and linear interpretability. Both methods have built-in cross-validation (CV) routines (lightgbm::lgb.cv() and glmnet::cv.glmnet(), respectively); here we use 5-fold CV to select the best set of hyperparameters before training the final model. Lastly, instead of the default booster we utilize a DART booster, which purports to improve on the overspecialization problem suffered by the default booster - see the DART preprint for details.
Note
Since we’re using the foreach and doSNOW packages to loop over the set of target genes in parallel, we need to ensure that our LightGBM model only uses a single thread when fitting the model. This is done by setting the parameters tree_learner = "serial" and num_threads = 1L in the model parameters list.
We coerce the model output into a tidy table. The two importance metrics we’re interested in are gain (the improvement in L2 loss of a certain TF) and frequency (the proportion of times a TF was included in the model). For more information see this StackExchange post or the XGBoost docs. As such, we calculate the rank of each target gene for gain and frequency, then mean-aggregate the two ranks in order to produce an average rank of how relatively strong the relationship between each TF & target gene is.
Next we add the Spearman correlation and mutual information (MI) between TF and target gene dynamics, as well as the normalized gene expression correlation and MI. We denote the relationship between TF and target gene as repression if the correlation between their dynamics is negative, and activation if the correlation is positive.
Warning
Usually the correlations between TF and target dynamics and TF and target expression are similar, or at least have the same directionality (positive or negative). However, when expression patterns are noisy, the correlation directions might differ (see Table 4 for examples). In this case, we’re opting to trust the dynamics more, as they are a smoothed version of expression - though an argument could be made for the opposite stance as well.
The output is shown below. Interestingly, the mutual information seems to be generally higher for the dynamics than for expression - this indicates that our usage of the dynamics as a de-noising technique for the GRN might be working well.
We repeat the fitting process for the other two lineages (expand this code block to see the details). Skip to Section 7 if you want a version of this logic that’s been wrapped into a function!
For each lineage we create a directed, weighted graph of the relationships between TFs and target genes using the igraph package. In the absence of a better, more sophisticated heuristic, we prune the graphs by selecting just the top 5 TFs per target gene. We use the inverse of the mean feature importance rank as edge weights.
One target gene of particular interest is LIM domain only 3 (LMO3). This gene is predominantly expressed in the brain, is involved in neurogenesis (source), and potentially acts as an oncogene during neuroblastoma (source) - so a comparison of its behavior across lineages should be pretty neat. We use a Fruchterman-Reingold layout to embed the first-order neighborhood graphs in 2D space via the layout_with_fr() function from igraph. We see that not only do the top target genes differ widely by lineage, so too do the directions of the relationships (activation vs. repression), and the strengths of those relationships. For example, we can see that in lineages A & C that LMO3 is repressed by EOMES and activated by MEF2C - though the MEF2C activation is stronger in lineage C. Conversely, in lineage B LMO3 is activated by both NFIA and NFIB, though neither TF appears to act on LMO3 in either of the other two lineages.
Code
set.seed(12)ranks <- purrr::map(list(graph_df_linA, graph_df_linB, graph_df_linC), \(x) {filter(x, Target_Gene =="LMO3") %>%pull(Mean_Rank_Weight)})rank_min <-min(purrr::reduce(ranks, c))rank_max <-max(purrr::reduce(ranks, c))LMO3_node <-which(names(V(graph_obj_linA)) =="LMO3")p11 <-fortify(make_ego_graph(graph_obj_linA, nodes = LMO3_node, order =1)[[1]], layout =layout_with_fr(graph =make_ego_graph(graph_obj_linA, nodes = LMO3_node, order =1)[[1]], grid ="nogrid", weights =E(make_ego_graph(graph_obj_linA, nodes = LMO3_node, order =1)[[1]])$weights),arrow.gap = .05) %>%mutate(Gene_Type =if_else(name %in% hs_tfs$hgnc_symbol, "TF", "Target Gene")) %>%ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +facet_wrap(~"Lineage A") +geom_edges(aes(linewidth = Mean_Rank_Weight, alpha = Mean_Rank_Weight, color = State), arrow =arrow(length =unit(8, "pt"), type ="closed")) +geom_nodelabel(aes(label = name, fill = Gene_Type), size =5) +scale_color_manual(values =c("forestgreen", "firebrick")) +scale_fill_manual(values =c("darkorange2", "dodgerblue3")) +scale_linewidth_continuous(range =c(0.5, 2), limits =c(rank_min, rank_max)) +scale_alpha_continuous(range =c(0.75, 1), limits =c(rank_min, rank_max)) +labs(x ="FR 1", y ="FR 2", color ="TF State", fill ="Gene Type", linewidth ="Rank Weight", alpha ="Rank Weight") +theme_scLANE(umap =TRUE) +guides(fill =guide_legend(override.aes =list(label =""), order =1), color =guide_legend(order =2))LMO3_node <-which(names(V(graph_obj_linB)) =="LMO3")p12 <-fortify(make_ego_graph(graph_obj_linB, nodes = LMO3_node, order =1)[[1]], layout =layout_with_fr(graph =make_ego_graph(graph_obj_linB, nodes = LMO3_node, order =1)[[1]], grid ="nogrid", weights =E(make_ego_graph(graph_obj_linB, nodes = LMO3_node, order =1)[[1]])$weights),arrow.gap = .05) %>%mutate(Gene_Type =if_else(name %in% hs_tfs$hgnc_symbol, "TF", "Target Gene")) %>%ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +facet_wrap(~"Lineage B") +geom_edges(aes(linewidth = Mean_Rank_Weight, alpha = Mean_Rank_Weight, color = State), arrow =arrow(length =unit(8, "pt"), type ="closed")) +geom_nodelabel(aes(label = name, fill = Gene_Type), size =5) +scale_color_manual(values =c("forestgreen", "firebrick")) +scale_fill_manual(values =c("darkorange2", "dodgerblue3")) +scale_linewidth_continuous(range =c(0.5, 2), limits =c(rank_min, rank_max)) +scale_alpha_continuous(range =c(0.75, 1), limits =c(rank_min, rank_max)) +labs(x ="FR 1", y ="FR 2", color ="TF State", fill ="Gene Type", linewidth ="Rank Weight", alpha ="Rank Weight") +theme_scLANE(umap =TRUE) +guides(fill =guide_legend(override.aes =list(label =""), order =1), color =guide_legend(order =2))LMO3_node <-which(names(V(graph_obj_linC)) =="LMO3")p13 <-fortify(make_ego_graph(graph_obj_linC, nodes = LMO3_node, order =1)[[1]], layout =layout_with_fr(graph =make_ego_graph(graph_obj_linC, nodes = LMO3_node, order =1)[[1]], grid ="nogrid", weights =E(make_ego_graph(graph_obj_linC, nodes = LMO3_node, order =1)[[1]])$weights),arrow.gap = .05) %>%mutate(Gene_Type =if_else(name %in% hs_tfs$hgnc_symbol, "TF", "Target Gene")) %>%ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +facet_wrap(~"Lineage C") +geom_edges(aes(linewidth = Mean_Rank_Weight, alpha = Mean_Rank_Weight, color = State), arrow =arrow(length =unit(8, "pt"), type ="closed")) +geom_nodelabel(aes(label = name, fill = Gene_Type), size =5) +scale_color_manual(values =c("forestgreen", "firebrick")) +scale_fill_manual(values =c("darkorange2", "dodgerblue3")) +scale_linewidth_continuous(range =c(0.5, 2), limits =c(rank_min, rank_max)) +scale_alpha_continuous(range =c(0.75, 1), limits =c(rank_min, rank_max)) +labs(x ="FR 1", y ="FR 2", color ="TF State", fill ="Gene Type", linewidth ="Rank Weight", alpha ="Rank Weight") +theme_scLANE(umap =TRUE) +guides(fill =guide_legend(override.aes =list(label =""), order =1), color =guide_legend(order =2))p14 <- (p11 / p12 / p13) +plot_layout(guides ="collect", axes ="collect")p14
The make_ego_graph() function allows us to pull out the first-order interactions for a given transcription factor; here we display those (admittedly complex) relationships for eomesodermin (EOMES). This highlights the multi-faceted roles of TFs during development, and the complexity of GRNs in general.
Code
EOMES_node <-which(names(V(graph_obj_linA)) =="EOMES")set.seed(5)p15 <-fortify(make_ego_graph(graph_obj_linA, nodes = EOMES_node, order =1)[[1]], layout =layout_with_fr(graph =make_ego_graph(graph_obj_linA, nodes = EOMES_node, order =1)[[1]], grid ="nogrid", weights =E(make_ego_graph(graph_obj_linA, nodes = EOMES_node, order =1)[[1]])$weights^3),arrow.gap = .05) %>%mutate(Gene_Type =if_else(name %in% hs_tfs$hgnc_symbol, "TF", "Target Gene")) %>%ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +facet_wrap(~"Lineage A") +geom_edges(aes(linewidth = Mean_Rank_Weight, alpha = Mean_Rank_Weight, color = State), arrow =arrow(length =unit(8, "pt"), type ="closed")) +geom_nodelabel(aes(label = name, fill = Gene_Type), size =5) +scale_color_manual(values =c("forestgreen", "firebrick")) +scale_fill_manual(values =c("darkorange2", "dodgerblue3")) +scale_linewidth_continuous(range =c(0.5, 2)) +scale_alpha_continuous(range =c(0.75, 1)) +labs(x ="FR 1", y ="FR 2", color ="TF State", fill ="Gene Type", linewidth ="Rank Weight", alpha ="Rank Weight") +theme_scLANE(umap =TRUE) +guides(fill =guide_legend(override.aes =list(label =""), order =1), color =guide_legend(order =2))p15
7 Conclusions
Hopefully this analysis has demonstrated the utility of trajectory GRNs for comparing gene dynamics across lineages. Beyond just visual comparison of how gene expression changes over pseudotime, trajectory GRNs allow us to identify which transcription factors are activating or suppressing target genes, how top TF targets differ between lineages, and the ways in which TF-target relationships change based on the cell fate towards which cells are progressing. To conclude, I’ve written up the methodology for creating a trajectory GRN into a function, which is shown in the code block below. It includes functionality for a progress bar if desired, and takes as input matrices of gene dynamics (dyn.mat) and normalized expression (expr.mat), a set of dynamic genes (dyn.genes), and a set of transcription factors (tx.factors). Eventually I plan to expand this functionality and turn it into an R package, but for now it’ll stay a simple function.
---title: "Building a Trajectory Gene Regulatory Network with scRNA-seq Data"author: name: Jack R. Leary email: j.leary@ufl.edu orcid: 0009-0004-8821-3269 affiliations: - name: University of Florida department: Biostatistics city: Gainesville state: FLdate: todaydate-format: longformat: html: code-fold: show code-copy: true code-tools: true toc: true toc-depth: 2 embed-resources: true fig-format: retina fig-width: 9 fig-height: 6 df-print: kable link-external-newwindow: true fig-cap-location: bottom tbl-cap-location: bottom number-sections: trueexecute: cache: false freeze: auto---```{r setup}#| include: falseknitr::opts_chunk$set(comment = NA)reticulate::use_virtualenv("~/Desktop/PhD/Research/Python_Envs/personal_site/", required = TRUE)set.seed(312)```# Introduction {#sec-intro}In this analysis we'll be looking at cells undergoing neurogenesis in the developing human frontal cortex. The data are taken from [Nowakowski *et al* (2017)](https://doi.org/10.1126/science.aap8809). Our goal is to build a simple method for fitting single cell gene regulatory networks (GRNs) and compare the results across pseudotime lineages, keeping in mind the trajectory structure of the data. If you want to skip straight to the analysis, head to @sec-analysis. # Libraries {#sec-libs}We start by loading in a few necessary packages & resolving some function conflicts. ```{r}#| message: false#| warning: falselibrary(dplyr) # data manipulationlibrary(Seurat) # scRNA-seq toolslibrary(scLANE) # trajectory DElibrary(igraph) # graph utilitieslibrary(ggplot2) # pretty plotslibrary(biomaRt) # gene annotationlibrary(foreach) # parallel loopslibrary(ggnetwork) # plotting graphslibrary(slingshot) # pseudotime estimationlibrary(patchwork) # plot alignmentlibrary(SingleCellExperiment) # scRNA-seq data structuresrename <- dplyr::renameselect <- dplyr::select```# Helper functions {#sec-fns}We define a helper function to make our legends look prettier. ```{r}guide_umap <-function(key.size =4) { ggplot2::guides(color = ggplot2::guide_legend(override.aes =list(size = key.size, alpha =1, stroke =0.25)))}```Next, we assign a few consistent color palettes that we'll use throughout. ```{r}palette_heatmap <- paletteer::paletteer_d("MetBrewer::Hiroshige", direction =-1)palette_celltype <- paletteer::paletteer_d("ggsci::category10_d3")palette_cluster <- paletteer::paletteer_d("ggsci::default_locuszoom")palette_timepoint <- paletteer::paletteer_d("ggsci::default_igv")```# Data {#sec-data}Unfortunately the data that are made available as part of the `scRNAseq` package come in a transcripts-per-million (TPM) normalized matrix - not raw counts. A rough conversion back to the raw integer counts is to multiply each cell's normalized expression vector by the TPM size factor. This gets us most of the way there but the results are still estimates, so we round them to the nearest integer. ::: {.callout-note}This conversion ignores gene length, which is why the results are estimates & require rounding. For a more detailed discussion on how to un-normalize TPM counts see [this StackOverflow post](https://stackoverflow.com/questions/60614972/converting-tpm-data-to-read-counts-for-seurat). While the original authors used a plate-based protocol (Fluidigm C1) they do not mention whether they used unique molecular identifiers (UMIs); if UMIs were used then gene length normalization isn't necessary. Since simply reversing the TPM transformation doesn't result in integer values I would guess that UMIs were not used, though [their Supplementary Methods](https://www.science.org/doi/10.1126/science.aap8809#supplementary-materials) don't mention gene length normalization. :::```{r}#| results: hide#| message: false#| warning: falsesce_cortex <- scRNAseq::NowakowskiCortexData()counts_matrix <- Matrix::t(sce_cortex@assays@data$tpm)scale_factors <- Matrix::rowSums(counts_matrix) /1e6counts_matrix <-round(Matrix::t(counts_matrix) * scale_factors)seu_cortex <-CreateSeuratObject(counts_matrix, assay ="RNA", meta.data =as.data.frame(colData(sce_cortex)), project ="cortex", min.cells =5,min.features =0)excitatory_neuron_celltypes <-c("nEN-late", "nEN-early1", "nEN-early2","EN-V1-1", "EN-V1-2", "EN-V1-3", "EN-PFC1", "EN-PFC2", "EN-PFC3")seu_cortex <- seu_cortex[, seu_cortex$WGCNAcluster %in% excitatory_neuron_celltypes]seu_cortex@meta.data <-mutate(seu_cortex@meta.data, celltype_original =case_when(WGCNAcluster =="nEN-early1"~"Early Exc. Neuron 1", WGCNAcluster =="nEN-early2"~"Early Exc. Neuron 2", WGCNAcluster =="nEN-late"~"Late Exc. Neuron", WGCNAcluster =="EN-V1-1"~"Exc. Neuron 1", WGCNAcluster =="EN-V1-2"~"Exc. Neuron 2", WGCNAcluster =="EN-V1-3"~"Exc. Neuron 3", WGCNAcluster =="EN-PFC1"~"PFC Exc. Neuron 1", WGCNAcluster =="EN-PFC2"~"PFC Exc. Neuron 2", WGCNAcluster =="EN-PFC3"~"PFC Exc. Neuron 3", TRUE~NA_character_))``````{r}#| echo: falserm(sce_cortex, counts_matrix)```# Preprocessing {#sec-proprocess}We run the cells through a basic preprocessing pipeline: QC, normalization, highly variable gene (HVG) identification, PCA, UMAP, and graph-based Leiden clustering. ```{r}#| message: false #| warning: falseseu_cortex <- seu_cortex %>%PercentageFeatureSet(pattern ="^MT-", col.name ="percent_mito") %>%PercentageFeatureSet(pattern ="^RP[SL]", col.name ="percent_ribo") %>%NormalizeData(verbose =FALSE) %>%FindVariableFeatures(nfeatures =3000, verbose =FALSE) %>%ScaleData(verbose =FALSE) %>%RunPCA(npcs =50, approx =TRUE, seed.use =312, verbose =FALSE) %>%CellCycleScoring(s.features = cc.genes.updated.2019$s.genes, g2m.features = cc.genes.updated.2019$g2m.genes, set.ident =FALSE) %>%RunUMAP(reduction ="pca", dims =1:30,n.neighbors =20, n.components =2, metric ="cosine", n.epochs =750,seed.use =312, verbose =FALSE) %>%FindNeighbors(reduction ="pca", k.param =20,nn.method ="annoy", annoy.metric ="cosine", verbose =FALSE) %>%FindClusters(algorithm =4, method ="igraph", resolution =0.5, random.seed =312, verbose =FALSE)```Here we examine some of our dataset's attributes - the original celltype annotations, our new clusters, age in post-conception weeks (PCW), and the proportion of each cell's reads aligning to ribosomal genes. ```{r}#| message: false#| warning: false#| code-fold: true#| fig-width: 10#| fig-height: 6#| fig-cap: Characteristics of the cortical development dataset.#| label: fig-EDAp1 <-DimPlot(seu_cortex, group.by ="celltype_original",pt.size =1.25, alpha =0.75) +scale_color_manual(values = palette_celltype) +theme_scLANE(umap =TRUE) +labs(color ="Celltype") +theme(axis.title =element_blank(), plot.title =element_blank()) +guide_umap()p2 <-FeaturePlot(seu_cortex, features ="Age_in_Weeks", pt.size =1.25, alpha =0.75) +scale_color_gradientn(colors = palette_heatmap) +labs(color ="Age (weeks)") +theme_scLANE(umap =TRUE) +theme(axis.title =element_blank(), plot.title =element_blank())p3 <-DimPlot(seu_cortex, group.by ="seurat_clusters", pt.size =1.25, alpha =0.75) +scale_color_manual(values = palette_cluster) +labs(color ="Leiden Cluster") +theme_scLANE(umap =TRUE) +theme(axis.title =element_blank(), plot.title =element_blank()) +guide_umap()p4 <-FeaturePlot(seu_cortex, features ="percent_ribo", pt.size =1.25, alpha =0.75) +scale_color_gradientn(colors = palette_heatmap, labels = scales::label_percent(accuracy =1, scale =1)) +labs(color ="Ribosomal\nReads") +theme_scLANE(umap =TRUE) +theme(axis.title =element_blank(), plot.title =element_blank())p5 <- (p1 / p2) | (p3 / p4)p5 <- ggpubr::annotate_figure(ggpubr::ggarrange(p5), bottom ="UMAP 1",left ="UMAP 2")p5```# Analysis {#sec-analysis}## Celltype annotationWe begin by running a basic differential expression (DE) testing routine to identify some potential marker genes for each cluster. This will help us assign a celltype identity to each cluster. ::: {.callout-tip} As of `Seurat` v5, if you install the `presto` package using `remotes::install_github("immunogenomics/presto")`, the Wilcox tests performed here will be much, much faster.:::```{r}cluster_markers <-FindAllMarkers(seu_cortex, assay ="RNA", logfc.threshold =0.25, test.use ="wilcox", slot ="data", min.pct =0.1, only.pos =TRUE, verbose =FALSE, random.seed =312)top5_cluster_markers <-arrange(cluster_markers, p_val_adj) %>%with_groups(cluster, slice_head, n =5)```We visualize the markers for each cluster with a dotplot, which shows clear differences between clusters. For example, the transcription factor (TF) aristaless related homeobox (*ARX*) is specifically expressed in cluster 7; this TF's expression is known to be necessary for healthy neuronal development and helps to regulate differentiation. ```{r}#| code-fold: true#| fig-width: 5#| fig-height: 7#| fig-cap: Putative marker genes for each cluster.#| label: fig-dotplot_clusterp6 <-DotPlot(seu_cortex, features =unique(top5_cluster_markers$gene),assay ="RNA", group.by ="seurat_clusters", dot.scale =5,cols = paletteer::paletteer_d("wesanderson::Zissou1")[c(1, 5)], scale.by ="radius") +coord_flip() +labs(y ="Leiden Cluster") +theme_scLANE() +theme(axis.title.y =element_blank(), axis.text.y =element_text(face ="italic"))p6```Now we identify markers for each celltype:```{r}Idents(seu_cortex) <-"celltype_original"celltype_markers <-FindAllMarkers(seu_cortex, assay ="RNA", logfc.threshold =0.25, test.use ="wilcox", slot ="data", min.pct =0.1, only.pos =TRUE, verbose =FALSE, random.seed =312)top3_celltype_markers <-arrange(celltype_markers, p_val_adj) %>%with_groups(cluster, slice_head, n =3)```The markers seem consistent with what we'd expect for excitatory neurons; for more information see e.g., Figure 4 and Suppl. Figures 4 & 11 in [the original publication](https://doi.org/10.1126/science.aap8809). ```{r}#| code-fold: true#| tbl-cap: The top-3 markers genes for each identified celltype.#| label: tbl-celltype_markersselect(top3_celltype_markers, celltype = cluster, gene, avg_log2FC, p_val_adj) %>% kableExtra::kbl(digits =3, booktabs =TRUE, col.names =c("Celltype", "Gene", "Mean log2FC", "Adj. P-value")) %>% kableExtra::kable_classic(full_width =FALSE, "hover")```## Pseudotime estimationAfter estimating principal curves across our UMAP embedding with `slingshot`, we generate per-lineage pseudotime values for each cell. We use our clustering as a source of structure in the data, and assign start and end clusters based on age in PCW. In addition, we mean-aggregate the pseudotime values for each cell in order to get a total pseudotime which represents overall differentiation regardless of lineage. ```{r}sling_res <-slingshot(Embeddings(seu_cortex, "umap"), clusterLabels = seu_cortex$seurat_clusters, start.clus =c("3"), end.clus =c("5", "6", "7"), approx_points =1e3)sling_curves <-slingCurves(sling_res, as.df =TRUE)sling_mst <-slingMST(sling_res, as.df =TRUE)sling_pt <-as.data.frame(slingPseudotime(sling_res)) %>%rowwise() %>%mutate(PT_Overall =mean(c_across(starts_with("Lineage")), na.rm =TRUE)) %>%ungroup() %>%mutate(across(c(starts_with("Lineage"), PT_Overall), \(x) (x -min(x, na.rm =TRUE)) / (max(x, na.rm =TRUE) -min(x, na.rm =TRUE))))seu_cortex <-AddMetaData(seu_cortex, metadata = sling_pt$PT_Overall, col.name ="PT_Overall")```The graph structure estimate by `slingshot` is called a minimum spanning tree (MST), and describes the relationships between clusters in an undirected manner. ```{r}#| code-fold: true#| fig-width: 6#| fig-height: 4#| fig-cap: UMAP embedding with MST from Slinghot overlaid.#| label: fig-umap_MSTp7 <-data.frame(Embeddings(seu_cortex, "umap")) %>%mutate(leiden = seu_cortex$seurat_clusters) %>%ggplot(aes(x = umap_1, y = umap_2, color = leiden)) +geom_point(size =1.25, alpha =0.75) +geom_path(data = sling_mst, mapping =aes(x = umap_1, y = umap_2, group = Lineage), linewidth =1.25, color ="black") +geom_point(data = sling_mst, mapping =aes(x = umap_1, y = umap_2, fill = Cluster), color ="black", shape =21, size =4.5, stroke =1.25, show.legend =FALSE) +scale_color_manual(values = palette_cluster) +scale_fill_manual(values = palette_cluster) +labs(x ="UMAP 1", y ="UMAP 2") +theme_scLANE(umap =TRUE) +theme(legend.title =element_blank()) +guide_umap()p7```Since our data have real-life experimental timepoints, it makes sense to normalize the pseudotime within each lineage to $[0, 1]$.```{r}#| code-fold: true#| fig-width: 6#| fig-height: 6#| fig-cap: UMAP embedding colored by per-lineage pseudotime.#| label: fig-umap_lineage_PTp8 <-data.frame(Embeddings(seu_cortex, "umap")) %>%bind_cols(sling_pt) %>% tidyr::pivot_longer(starts_with("Lineage"), names_to ="lineage", values_to ="pseudotime") %>%ggplot(aes(x = umap_1, y = umap_2, color = pseudotime)) +facet_wrap(~lineage, nrow =3) +geom_point(size =1.25, alpha =0.75) +labs(x ="UMAP 1", y ="UMAP 2", color ="Pseudotime") +scale_color_gradientn(colors = palette_heatmap, labels = scales::label_number(accuracy = .01)) +theme_scLANE(umap =TRUE)p8```The principal curves show a trifurcating lineage structure. ```{r}#| code-fold: true#| fig-width: 6#| fig-height: 4#| fig-cap: UMAP embedding with principal curves from Slinghot overlaid in black.#| label: fig-umap_curvesp9 <-data.frame(Embeddings(seu_cortex, "umap")) %>%mutate(leiden = seu_cortex$seurat_clusters) %>%ggplot(aes(x = umap_1, y = umap_2, color = leiden)) +geom_point(size =1.25, alpha =0.75) +geom_path(data = sling_curves,mapping =aes(x = umap_1, y = umap_2, group = Lineage), color ="black", linewidth =1.5, alpha =0.75, lineend ="round") +scale_color_manual(values = palette_cluster) +labs(x ="UMAP 1", y ="UMAP 2") +theme_scLANE(umap =TRUE) +theme(legend.title =element_blank()) +guides(color =guide_legend(override.aes =list(size =4, alpha =1, stroke =0.25)))p9```Lastly, we plot the distribution of mean-aggregated pseudotime per-timepoint and see that later timepoints have, in general, larger values of overall pseudotime. This indicates that our estimated pseudotime is linked to real experimental time in a meaningful (though fairly noisy) way. ```{r}#| code-fold: true#| fig-width: 8#| fig-height: 5#| fig-cap: Beeswarm plot displaying the distribution of mean-aggregated pseudotime for each timepoint.#| label: fig-PCW_PTp10 <-select(seu_cortex@meta.data, Age_in_Weeks, PT_Overall) %>%mutate(Age_in_Weeks =as.factor(Age_in_Weeks)) %>%ggplot(aes(x = Age_in_Weeks, y = PT_Overall, color = Age_in_Weeks)) + ggbeeswarm::geom_quasirandom(size =1.25, alpha =0.75, show.legend =FALSE) +stat_summary(geom ="point", fun ="mean", color ="black",size =3) +scale_color_manual(values = palette_timepoint) +labs(x ="Age (PCW)", y ="Mean Psueodtime") +theme_scLANE()p10```## Trajectory DE testingUsing `scLANE` (see [the GitHub repository](https://github.com/jr-leary7/scLANE) or [the preprint](https://doi.org/10.1101/2023.12.19.572477) for details) we perform trajectory DE significance testing for each HVG across each lineage. ```{r}top3k_hvg <-HVFInfo(seu_cortex) %>%arrange(desc(variance.standardized)) %>%slice_head(n =3000) %>%rownames(.)cell_offset <-createCellOffset(seu_cortex)pt_df <-select(sling_pt, -PT_Overall)scLANE_models <-testDynamic(seu_cortex, pt = pt_df, genes = top3k_hvg,size.factor.offset = cell_offset, n.cores =6L, verbose =FALSE)scLANE_de_res <-getResultsDE(scLANE_models)```We now have lineage-specific trajectory DE statistics for each gene: ```{r}#| code-fold: true#| tbl-cap: The top-10 most DE genes from scLANE.#| label: tbl-scLANE_outputselect(scLANE_de_res, Gene, Lineage, Test_Stat, P_Val, P_Val_Adj, Gene_Dynamic_Lineage, Gene_Dynamic_Overall) %>%mutate(Gene_Dynamic_Lineage =if_else(Gene_Dynamic_Lineage ==1, "Dynamic", "Static"), Gene_Dynamic_Overall =if_else(Gene_Dynamic_Overall ==1, "Dynamic", "Static")) %>%slice_head(n =10) %>% kableExtra::kbl(digits =4, booktabs =TRUE, col.names =c("Gene", "Lineage", "LRT stat.", "P-value", "Adj. p-value", "Gene status (lineage)", "Gene status (overall)")) %>% kableExtra::kable_classic(full_width =FALSE, "hover")```We identify a set of dynamic genes - dynamic meaning DE over any of the three lineages - to be used for downstream analysis. ```{r}dyn_genes <-filter(scLANE_de_res, Gene_Dynamic_Overall ==1) %>%distinct(Gene) %>%pull(Gene)```Lastly, we pull a table of gene dynamics for each lineage. ```{r}smoothed_counts <-smoothedCountsMatrix(scLANE_models, size.factor.offset = cell_offset, pt = pt_df, genes = dyn_genes, log1p.norm =TRUE, n.cores =4L)```## Constructing a GRN from scratchBuilding a GRN requires two main inputs - a set of transcription factors, and a set of potential target genes. We load a database of TFs from [Lambert *et al* (2018)](https://doi.org/10.1016/j.cell.2018.01.029). ```{r}#| message: false#| warning: falsehs_ensembl <-useMart("ensembl",dataset ="hsapiens_gene_ensembl", host ="https://useast.ensembl.org")hs_tf_raw <- readr::read_csv("http://humantfs.ccbr.utoronto.ca/download/v_1.01/DatabaseExtract_v_1.01.csv",col_select =-1,num_threads =2L,show_col_types =FALSE) %>% janitor::clean_names() %>%filter(is_tf =="Yes") %>%select(ensembl_id)hs_tfs <-getBM(attributes =c("ensembl_gene_id", "hgnc_symbol", "entrezgene_id", "description", "gene_biotype"),filters ="ensembl_gene_id",values = hs_tf_raw$ensembl_id,mart = hs_ensembl,uniqueRows =TRUE) %>%rename(ensembl_id = ensembl_gene_id,entrez_id = entrezgene_id) %>%arrange(ensembl_id) %>%mutate(hgnc_symbol =if_else(hgnc_symbol =="", NA_character_, hgnc_symbol),description =gsub("\\[Source.*", "", description))```Along with the HGNC symbol, we used `BiomaRt` to retrieve the Ensembl & Entrez IDs and a description for every TF. ```{r}#| code-fold: true#| tbl-cap: Human transcription factors.#| label: tbl-TFsslice_sample(hs_tfs, n =10) %>% kableExtra::kbl(booktabs =TRUE, col.names =c("Ensembl ID", "HGNC symbol", "Entrez ID", "Description", "Biotype")) %>% kableExtra::kable_classic(full_width =FALSE, "hover")```We identify a subset of all the TFs that are dynamic across our trajectory. Since TFs can regulate the expression of other TFs as they do non-TF target genes, we will include TFs in the set of "target genes" that we model. ```{r}dynamic_tfs <- hs_tfs$hgnc_symbol[hs_tfs$hgnc_symbol %in% dyn_genes]```The GRN has the following structure - for every target gene, a penalized model is fit with the dynamics of every possible TF as features (see below). We use the Gamma distribution with log link as our regression objective as the dynamics are continuous and strictly non-negative. The model's internal regularization identifies a subset of TFs whose dynamics are predictive of the target gene's dynamics. See e.g., [the original XGBoost paper](https://doi.org/10.48550/arXiv.1603.02754) for details. $$\text{Dynamics}_{\text{Target}} = \begin{bmatrix} \text{Dynamics}_{\text{TF}_1} & \text{Dynamics}_{\text{TF}_2} & \dots & \text{Dynamics}_{\text{TF}_N} \end{bmatrix}$$ {#eq-XGBoost}Our GRNs will be lineage-specific, since the matrix of gene dynamics differs by lineage. First we fit the GRN for lineage A; I've opted to use the `LightGBM` package to fit the models as it's lightweight and in general much faster than other options such as `XGBoost`. In addition, I prefer this method to alternative penalized methods such as LASSO - boosted trees better capture nonlinearities in the data, though LASSO does offer advantages such as significance testing for coefficients and linear interpretability. Both methods have built-in cross-validation (CV) routines (`lightgbm::lgb.cv()` and `glmnet::cv.glmnet()`, respectively); here we use 5-fold CV to select the best set of hyperparameters before training the final model. Lastly, instead of the default booster we utilize a DART booster, which purports to improve on [the overspecialization problem](https://xgboost.readthedocs.io/en/stable/tutorials/dart.html) suffered by the default booster - see [the DART preprint](https://doi.org/10.48550/arXiv.1505.01866) for details. ::: {.callout-note}Since we're using the `foreach` and `doSNOW` packages to loop over the set of target genes in parallel, we need to ensure that our `LightGBM` model only uses a single thread when fitting the model. This is done by setting the parameters `tree_learner = "serial"` and `num_threads = 1L` in the model parameters list. :::```{r}#| message: false#| warning: falsecl <- parallel::makeCluster(6L)doSNOW::registerDoSNOW(cl)lgbm_params <-list(objective ="gamma", tree_learner ="serial", metric ="l2", boosting ="dart", num_threads =1L, device_type ="cpu",seed =312)grn_res_linA <-foreach(g =seq(dyn_genes), .combine ="list",.multicombine =ifelse(length(dyn_genes) >1, TRUE, FALSE),.maxcombine =ifelse(length(dyn_genes) >1, length(dyn_genes), 2),.packages =c("lightgbm", "dplyr"),.errorhandling ="pass",.inorder =TRUE,.verbose =FALSE) %dopar% { feature_mat <- smoothed_counts$Lineage_A[, dynamic_tfs] feature_mat <- feature_mat[, colnames(feature_mat) != dyn_genes[g]] # remove TF as a predictor of itself resp_var <- smoothed_counts$Lineage_A[, dyn_genes[g]] lgbm_data <- lightgbm::lgb.Dataset(data = feature_mat, label = resp_var) lgbm_cv <- lightgbm::lgb.cv(params = lgbm_params, data = lgbm_data, nrounds =100L, nfold =5L, stratified =FALSE) lgbm_model <- lightgbm::lgb.train(params = lgbm_params, data = lgbm_data, nrounds = lgbm_cv$best_iter) imp_table <-as.data.frame(lightgbm::lgb.importance(lgbm_model)) %>% dplyr::mutate(Lineage ="A", Target_Gene = dyn_genes[g], Target_Gene_Type =if_else(Target_Gene %in% dynamic_tfs, "TF", "Target Gene"), .before =1) imp_table }names(grn_res_linA) <- dyn_genesgrn_table_linA <- purrr::imap(grn_res_linA, \(x, y) {if (!inherits(x, "data.frame")) { empty_res <-data.frame(Lineage ="A", Target_Gene = y, Target_Gene_Type =NA_character_, Feature =NA_character_, Gain =NA_real_, Cover =NA_real_, Frequency =NA_real_)return(empty_res) } else {return(x) }})```We coerce the model output into a tidy table. The two importance metrics we're interested in are gain (the improvement in L2 loss of a certain TF) and frequency (the proportion of times a TF was included in the model). For more information see [this StackExchange post](https://datascience.stackexchange.com/questions/12318/how-to-interpret-the-output-of-xgboost-importance) or [the XGBoost docs](https://xgboost.readthedocs.io/en/stable/R-package/discoverYourData.html#measure-feature-importance). As such, we calculate the rank of each target gene for gain and frequency, then mean-aggregate the two ranks in order to produce an average rank of how relatively strong the relationship between each TF & target gene is. ```{r}grn_table_linA <- purrr::reduce(grn_table_linA, rbind) %>%select(-Cover) %>%rename(Tx_Factor = Feature) %>%filter(!is.na(Target_Gene), !is.na(Tx_Factor)) %>%arrange(Tx_Factor, desc(Frequency)) %>%with_groups(Tx_Factor, mutate, Frequency_Rank =row_number()) %>%arrange(Tx_Factor, desc(Gain)) %>%with_groups(Tx_Factor, mutate, Gain_Rank =row_number()) %>%rowwise() %>%mutate(Mean_Rank =mean(c_across(c(Frequency_Rank, Gain_Rank)))) %>%ungroup() %>%arrange(Tx_Factor, Mean_Rank) %>%mutate(Mean_Rank_Weight =1/ Mean_Rank)```Next we add the Spearman correlation and mutual information (MI) between TF and target gene dynamics, as well as the normalized gene expression correlation and MI. We denote the relationship between TF and target gene as repression if the correlation between their dynamics is negative, and activation if the correlation is positive. ::: {.callout-warning}Usually the correlations between TF and target dynamics and TF and target expression are similar, or at least have the same directionality (positive or negative). However, when expression patterns are noisy, the correlation directions might differ (see @tbl-GRN_linA for examples). In this case, we're opting to trust the dynamics more, as they are a smoothed version of expression - though an argument could be made for the opposite stance as well. :::```{r}dyn_cormat <-cor(smoothed_counts$Lineage_A, method ="spearman")dyn_mimat <- minet::build.mim(smoothed_counts$Lineage_A, estimator ="mi.empirical", disc ="equalwidth")exp_cormat <-cor(as.matrix(Matrix::t(seu_cortex@assays$RNA$data[dyn_genes, which(!is.na(sling_pt$Lineage1))])), method ="spearman")exp_mimat <- minet::build.mim(as.matrix(Matrix::t(seu_cortex@assays$RNA$data[dyn_genes, which(!is.na(sling_pt$Lineage1))])), estimator ="mi.empirical", disc ="equalwidth")grn_table_linA <-mutate(grn_table_linA, Dyn_Cor =NA_real_, Exp_Cor =NA_real_, Dyn_MI =NA_real_, Exp_MI =NA_real_)for (i inseq(nrow(grn_table_linA))) { grn_table_linA$Dyn_Cor[i] <- dyn_cormat[grn_table_linA$Target_Gene[i], grn_table_linA$Tx_Factor[i]] grn_table_linA$Dyn_MI[i] <- dyn_mimat[grn_table_linA$Target_Gene[i], grn_table_linA$Tx_Factor[i]] grn_table_linA$Exp_Cor[i] <- exp_cormat[grn_table_linA$Target_Gene[i], grn_table_linA$Tx_Factor[i]] grn_table_linA$Exp_MI[i] <- exp_mimat[grn_table_linA$Target_Gene[i], grn_table_linA$Tx_Factor[i]]}grn_table_linA <-mutate(grn_table_linA, State =if_else(Dyn_Cor <0, "Repression", "Activation"))```The output is shown below. Interestingly, the mutual information seems to be generally higher for the dynamics than for expression - this indicates that our usage of the dynamics as a de-noising technique for the GRN might be working well. ```{r}#| code-fold: true#| tbl-cap: GRN for dynamic genes across Lineage A.#| label: tbl-GRN_linAselect(grn_table_linA, Lineage, Tx_Factor, Target_Gene, Gain, Frequency, Mean_Rank, Dyn_Cor, Exp_Cor, Dyn_MI, Exp_MI, State) %>%slice_sample(n =10) %>% kableExtra::kbl(digits =3, booktabs =TRUE, col.names =c("Lineage", "TF", "Target", "Gain", "Frequency", "Mean rank", "Dynamics cor.", "Expression cor.", "Dynamics MI", "Expression MI", "State")) %>% kableExtra::kable_classic(full_width =FALSE, "hover")```We repeat the fitting process for the other two lineages (expand this code block to see the details). Skip to @sec-conclusions if you want a version of this logic that's been wrapped into a function!```{r}#| code-fold: true#| message: false#| warning: falsegrn_res_linB <-foreach(g =seq(dyn_genes), .combine ="list",.multicombine =ifelse(length(dyn_genes) >1, TRUE, FALSE),.maxcombine =ifelse(length(dyn_genes) >1, length(dyn_genes), 2),.packages =c("lightgbm", "dplyr"),.errorhandling ="pass",.inorder =TRUE,.verbose =FALSE) %dopar% { feature_mat <- smoothed_counts$Lineage_B[, dynamic_tfs] feature_mat <- feature_mat[, colnames(feature_mat) != dyn_genes[g]] # remove TF as a predictor of itself resp_var <- smoothed_counts$Lineage_B[, dyn_genes[g]] lgbm_data <- lightgbm::lgb.Dataset(data = feature_mat, label = resp_var) lgbm_cv <- lightgbm::lgb.cv(params = lgbm_params, data = lgbm_data, nrounds =100L, nfold =5L, stratified =FALSE) lgbm_model <- lightgbm::lgb.train(params = lgbm_params, data = lgbm_data, nrounds = lgbm_cv$best_iter) imp_table <-as.data.frame(lightgbm::lgb.importance(lgbm_model)) %>% dplyr::mutate(Lineage ="B", Target_Gene = dyn_genes[g], Target_Gene_Type =if_else(Target_Gene %in% dynamic_tfs, "TF", "Target Gene"),.before =1) imp_table }names(grn_res_linB) <- dyn_genesgrn_table_linB <- purrr::imap(grn_res_linB, \(x, y) {if (!inherits(x, "data.frame")) { empty_res <-data.frame(Lineage ="B", Target_Gene = y, Target_Gene_Type =NA_character_, Feature =NA_character_, Gain =NA_real_, Cover =NA_real_, Frequency =NA_real_)return(empty_res) } else {return(x) }})grn_table_linB <- purrr::reduce(grn_table_linB, rbind) %>%select(-Cover) %>%rename(Tx_Factor = Feature) %>%filter(!is.na(Target_Gene), !is.na(Tx_Factor)) %>%arrange(Tx_Factor, desc(Frequency)) %>%with_groups(Tx_Factor, mutate, Frequency_Rank =row_number()) %>%arrange(Tx_Factor, desc(Gain)) %>%with_groups(Tx_Factor, mutate, Gain_Rank =row_number()) %>%rowwise() %>%mutate(Mean_Rank =mean(c_across(c(Frequency_Rank, Gain_Rank)))) %>%ungroup() %>%arrange(Tx_Factor, Mean_Rank) %>%mutate(Mean_Rank_Weight =1/ Mean_Rank)dyn_cormat <-cor(smoothed_counts$Lineage_B, method ="spearman")dyn_mimat <- minet::build.mim(smoothed_counts$Lineage_B, estimator ="mi.empirical", disc ="equalwidth")exp_cormat <-cor(as.matrix(Matrix::t(seu_cortex@assays$RNA$data[dyn_genes, which(!is.na(sling_pt$Lineage2))])), method ="spearman")exp_mimat <- minet::build.mim(as.matrix(Matrix::t(seu_cortex@assays$RNA$data[dyn_genes, which(!is.na(sling_pt$Lineage2))])), estimator ="mi.empirical", disc ="equalwidth")grn_table_linB <-mutate(grn_table_linB, Dyn_Cor =NA_real_, Exp_Cor =NA_real_, Dyn_MI =NA_real_, Exp_MI =NA_real_)for (i inseq(nrow(grn_table_linB))) { grn_table_linB$Dyn_Cor[i] <- dyn_cormat[grn_table_linB$Target_Gene[i], grn_table_linB$Tx_Factor[i]] grn_table_linB$Dyn_MI[i] <- dyn_mimat[grn_table_linB$Target_Gene[i], grn_table_linB$Tx_Factor[i]] grn_table_linB$Exp_Cor[i] <- exp_cormat[grn_table_linB$Target_Gene[i], grn_table_linB$Tx_Factor[i]] grn_table_linB$Exp_MI[i] <- exp_mimat[grn_table_linB$Target_Gene[i], grn_table_linB$Tx_Factor[i]]}grn_table_linB <-mutate(grn_table_linB, State =if_else(Dyn_Cor <0, "Repression", "Activation"))grn_res_linC <-foreach(g =seq(dyn_genes), .combine ="list",.multicombine =ifelse(length(dyn_genes) >1, TRUE, FALSE),.maxcombine =ifelse(length(dyn_genes) >1, length(dyn_genes), 2),.packages =c("lightgbm", "dplyr"),.errorhandling ="pass",.inorder =TRUE,.verbose =FALSE) %dopar% { feature_mat <- smoothed_counts$Lineage_C[, dynamic_tfs] feature_mat <- feature_mat[, colnames(feature_mat) != dyn_genes[g]] # remove TF as a predictor of itself resp_var <- smoothed_counts$Lineage_C[, dyn_genes[g]] lgbm_data <- lightgbm::lgb.Dataset(data = feature_mat, label = resp_var) lgbm_cv <- lightgbm::lgb.cv(params = lgbm_params, data = lgbm_data, nrounds =100L, nfold =5L, stratified =FALSE) lgbm_model <- lightgbm::lgb.train(params = lgbm_params, data = lgbm_data, nrounds = lgbm_cv$best_iter) imp_table <-as.data.frame(lightgbm::lgb.importance(lgbm_model)) %>% dplyr::mutate(Lineage ="C", Target_Gene = dyn_genes[g], Target_Gene_Type =if_else(Target_Gene %in% dynamic_tfs, "TF", "Target Gene"),.before =1) imp_table }names(grn_res_linC) <- dyn_genesgrn_table_linC <- purrr::imap(grn_res_linC, \(x, y) {if (!inherits(x, "data.frame")) { empty_res <-data.frame(Lineage ="C", Target_Gene = y, Target_Gene_Type =NA_character_, Feature =NA_character_, Gain =NA_real_, Cover =NA_real_, Frequency =NA_real_)return(empty_res) } else {return(x) }})grn_table_linC <- purrr::reduce(grn_table_linC, rbind) %>%select(-Cover) %>%rename(Tx_Factor = Feature) %>%filter(!is.na(Target_Gene), !is.na(Tx_Factor)) %>%arrange(Tx_Factor, desc(Frequency)) %>%with_groups(Tx_Factor, mutate, Frequency_Rank =row_number()) %>%arrange(Tx_Factor, desc(Gain)) %>%with_groups(Tx_Factor, mutate, Gain_Rank =row_number()) %>%rowwise() %>%mutate(Mean_Rank =mean(c_across(c(Frequency_Rank, Gain_Rank)))) %>%ungroup() %>%arrange(Tx_Factor, Mean_Rank) %>%mutate(Mean_Rank_Weight =1/ Mean_Rank)dyn_cormat <-cor(smoothed_counts$Lineage_C, method ="spearman")dyn_mimat <- minet::build.mim(smoothed_counts$Lineage_C, estimator ="mi.empirical", disc ="equalwidth")exp_cormat <-cor(as.matrix(Matrix::t(seu_cortex@assays$RNA$data[dyn_genes, which(!is.na(sling_pt$Lineage3))])), method ="spearman")exp_mimat <- minet::build.mim(as.matrix(Matrix::t(seu_cortex@assays$RNA$data[dyn_genes, which(!is.na(sling_pt$Lineage3))])), estimator ="mi.empirical", disc ="equalwidth")grn_table_linC <-mutate(grn_table_linC, Dyn_Cor =NA_real_, Exp_Cor =NA_real_, Dyn_MI =NA_real_, Exp_MI =NA_real_)for (i inseq(nrow(grn_table_linC))) { grn_table_linC$Dyn_Cor[i] <- dyn_cormat[grn_table_linC$Target_Gene[i], grn_table_linC$Tx_Factor[i]] grn_table_linC$Dyn_MI[i] <- dyn_mimat[grn_table_linC$Target_Gene[i], grn_table_linC$Tx_Factor[i]] grn_table_linC$Exp_Cor[i] <- exp_cormat[grn_table_linC$Target_Gene[i], grn_table_linC$Tx_Factor[i]] grn_table_linC$Exp_MI[i] <- exp_mimat[grn_table_linC$Target_Gene[i], grn_table_linC$Tx_Factor[i]]}grn_table_linC <-mutate(grn_table_linC, State =if_else(Dyn_Cor <0, "Repression", "Activation"))parallel::stopCluster(cl)```We collate all three GRNs into a single overall table:```{r}grn_table_all <-bind_rows(grn_table_linA, grn_table_linB, grn_table_linC) %>%arrange(Lineage, Tx_Factor, Mean_Rank)```For each lineage we create a directed, weighted graph of the relationships between TFs and target genes using the `igraph` package. In the absence of a better, more sophisticated heuristic, we prune the graphs by selecting just the top 5 TFs per target gene. We use the inverse of the mean feature importance rank as edge weights. ```{r}graph_df_linA <-select(grn_table_linA, Tx_Factor, Target_Gene, Target_Gene_Type, Mean_Rank, Mean_Rank_Weight, Frequency, Gain, Dyn_MI, State) %>%arrange(Target_Gene, Mean_Rank) %>%with_groups(Target_Gene, slice_head, n =5)graph_obj_linA <-graph_from_data_frame(graph_df_linA, directed =TRUE)E(graph_obj_linA)$weight <- graph_df_linA$Mean_Rank_Weightgraph_df_linB <-select(grn_table_linB, Tx_Factor, Target_Gene, Target_Gene_Type, Mean_Rank, Mean_Rank_Weight, Frequency, Gain, Dyn_MI, State) %>%arrange(Target_Gene, Mean_Rank) %>%with_groups(Target_Gene, slice_head, n =5)graph_obj_linB <-graph_from_data_frame(graph_df_linB, directed =TRUE)E(graph_obj_linB)$weight <- graph_df_linB$Mean_Rank_Weightgraph_df_linC <-select(grn_table_linC, Tx_Factor, Target_Gene, Target_Gene_Type, Mean_Rank, Mean_Rank_Weight, Frequency, Gain, Dyn_MI, State) %>%arrange(Target_Gene, Mean_Rank) %>%with_groups(Target_Gene, slice_head, n =5)graph_obj_linC <-graph_from_data_frame(graph_df_linC, directed =TRUE)E(graph_obj_linC)$weight <- graph_df_linC$Mean_Rank_Weight```One target gene of particular interest is LIM domain only 3 (*LMO3*). This gene is predominantly expressed in the brain, is involved in neurogenesis ([source](https://doi.org/10.1016/j.ydbio.2023.07.007)), and potentially acts as an oncogene during neuroblastoma ([source](https://doi.org/10.1158/0008-5472.CAN-04-4630)) - so a comparison of its behavior across lineages should be pretty neat. We use a Fruchterman-Reingold layout to embed the first-order neighborhood graphs in 2D space via the `layout_with_fr()` function from `igraph`. We see that not only do the top target genes differ widely by lineage, so too do the directions of the relationships (activation vs. repression), and the strengths of those relationships. For example, we can see that in lineages A & C that *LMO3* is repressed by *EOMES* and activated by *MEF2C* - though the *MEF2C* activation is stronger in lineage C. Conversely, in lineage B *LMO3* is activated by both *NFIA* and *NFIB*, though neither TF appears to act on *LMO3* in either of the other two lineages. ```{r}#| code-fold: true#| message: false#| warning: false#| fig-width: 10#| fig-height: 15#| fig-cap: Fruchterman-Reingold embeddings of the first-order relationships between *LMO3* and lineage-specific TFs. Lines are weighted & shaded by the weighted ranking score between TF & target gene dynamics that we computed earlier. #| label: fig-FR_embed_lineageset.seed(12)ranks <- purrr::map(list(graph_df_linA, graph_df_linB, graph_df_linC), \(x) {filter(x, Target_Gene =="LMO3") %>%pull(Mean_Rank_Weight)})rank_min <-min(purrr::reduce(ranks, c))rank_max <-max(purrr::reduce(ranks, c))LMO3_node <-which(names(V(graph_obj_linA)) =="LMO3")p11 <-fortify(make_ego_graph(graph_obj_linA, nodes = LMO3_node, order =1)[[1]], layout =layout_with_fr(graph =make_ego_graph(graph_obj_linA, nodes = LMO3_node, order =1)[[1]], grid ="nogrid", weights =E(make_ego_graph(graph_obj_linA, nodes = LMO3_node, order =1)[[1]])$weights),arrow.gap = .05) %>%mutate(Gene_Type =if_else(name %in% hs_tfs$hgnc_symbol, "TF", "Target Gene")) %>%ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +facet_wrap(~"Lineage A") +geom_edges(aes(linewidth = Mean_Rank_Weight, alpha = Mean_Rank_Weight, color = State), arrow =arrow(length =unit(8, "pt"), type ="closed")) +geom_nodelabel(aes(label = name, fill = Gene_Type), size =5) +scale_color_manual(values =c("forestgreen", "firebrick")) +scale_fill_manual(values =c("darkorange2", "dodgerblue3")) +scale_linewidth_continuous(range =c(0.5, 2), limits =c(rank_min, rank_max)) +scale_alpha_continuous(range =c(0.75, 1), limits =c(rank_min, rank_max)) +labs(x ="FR 1", y ="FR 2", color ="TF State", fill ="Gene Type", linewidth ="Rank Weight", alpha ="Rank Weight") +theme_scLANE(umap =TRUE) +guides(fill =guide_legend(override.aes =list(label =""), order =1), color =guide_legend(order =2))LMO3_node <-which(names(V(graph_obj_linB)) =="LMO3")p12 <-fortify(make_ego_graph(graph_obj_linB, nodes = LMO3_node, order =1)[[1]], layout =layout_with_fr(graph =make_ego_graph(graph_obj_linB, nodes = LMO3_node, order =1)[[1]], grid ="nogrid", weights =E(make_ego_graph(graph_obj_linB, nodes = LMO3_node, order =1)[[1]])$weights),arrow.gap = .05) %>%mutate(Gene_Type =if_else(name %in% hs_tfs$hgnc_symbol, "TF", "Target Gene")) %>%ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +facet_wrap(~"Lineage B") +geom_edges(aes(linewidth = Mean_Rank_Weight, alpha = Mean_Rank_Weight, color = State), arrow =arrow(length =unit(8, "pt"), type ="closed")) +geom_nodelabel(aes(label = name, fill = Gene_Type), size =5) +scale_color_manual(values =c("forestgreen", "firebrick")) +scale_fill_manual(values =c("darkorange2", "dodgerblue3")) +scale_linewidth_continuous(range =c(0.5, 2), limits =c(rank_min, rank_max)) +scale_alpha_continuous(range =c(0.75, 1), limits =c(rank_min, rank_max)) +labs(x ="FR 1", y ="FR 2", color ="TF State", fill ="Gene Type", linewidth ="Rank Weight", alpha ="Rank Weight") +theme_scLANE(umap =TRUE) +guides(fill =guide_legend(override.aes =list(label =""), order =1), color =guide_legend(order =2))LMO3_node <-which(names(V(graph_obj_linC)) =="LMO3")p13 <-fortify(make_ego_graph(graph_obj_linC, nodes = LMO3_node, order =1)[[1]], layout =layout_with_fr(graph =make_ego_graph(graph_obj_linC, nodes = LMO3_node, order =1)[[1]], grid ="nogrid", weights =E(make_ego_graph(graph_obj_linC, nodes = LMO3_node, order =1)[[1]])$weights),arrow.gap = .05) %>%mutate(Gene_Type =if_else(name %in% hs_tfs$hgnc_symbol, "TF", "Target Gene")) %>%ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +facet_wrap(~"Lineage C") +geom_edges(aes(linewidth = Mean_Rank_Weight, alpha = Mean_Rank_Weight, color = State), arrow =arrow(length =unit(8, "pt"), type ="closed")) +geom_nodelabel(aes(label = name, fill = Gene_Type), size =5) +scale_color_manual(values =c("forestgreen", "firebrick")) +scale_fill_manual(values =c("darkorange2", "dodgerblue3")) +scale_linewidth_continuous(range =c(0.5, 2), limits =c(rank_min, rank_max)) +scale_alpha_continuous(range =c(0.75, 1), limits =c(rank_min, rank_max)) +labs(x ="FR 1", y ="FR 2", color ="TF State", fill ="Gene Type", linewidth ="Rank Weight", alpha ="Rank Weight") +theme_scLANE(umap =TRUE) +guides(fill =guide_legend(override.aes =list(label =""), order =1), color =guide_legend(order =2))p14 <- (p11 / p12 / p13) +plot_layout(guides ="collect", axes ="collect")p14```The `make_ego_graph()` function allows us to pull out the first-order interactions for a given transcription factor; here we display those (admittedly complex) relationships for eomesodermin (*EOMES*). This highlights the multi-faceted roles of TFs during development, and the complexity of GRNs in general. ```{r}#| code-fold: true#| fig-width: 12#| fig-height: 8#| fig-cap: Fruchterman-Reingold embedding of the first-order relationships for *EOMES*. Lines are weighted and shaded as in @fig-FR_embed_lineage. #| label: fig-FR_embed_EOMESEOMES_node <-which(names(V(graph_obj_linA)) =="EOMES")set.seed(5)p15 <-fortify(make_ego_graph(graph_obj_linA, nodes = EOMES_node, order =1)[[1]], layout =layout_with_fr(graph =make_ego_graph(graph_obj_linA, nodes = EOMES_node, order =1)[[1]], grid ="nogrid", weights =E(make_ego_graph(graph_obj_linA, nodes = EOMES_node, order =1)[[1]])$weights^3),arrow.gap = .05) %>%mutate(Gene_Type =if_else(name %in% hs_tfs$hgnc_symbol, "TF", "Target Gene")) %>%ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +facet_wrap(~"Lineage A") +geom_edges(aes(linewidth = Mean_Rank_Weight, alpha = Mean_Rank_Weight, color = State), arrow =arrow(length =unit(8, "pt"), type ="closed")) +geom_nodelabel(aes(label = name, fill = Gene_Type), size =5) +scale_color_manual(values =c("forestgreen", "firebrick")) +scale_fill_manual(values =c("darkorange2", "dodgerblue3")) +scale_linewidth_continuous(range =c(0.5, 2)) +scale_alpha_continuous(range =c(0.75, 1)) +labs(x ="FR 1", y ="FR 2", color ="TF State", fill ="Gene Type", linewidth ="Rank Weight", alpha ="Rank Weight") +theme_scLANE(umap =TRUE) +guides(fill =guide_legend(override.aes =list(label =""), order =1), color =guide_legend(order =2))p15```# Conclusions {#sec-conclusions}Hopefully this analysis has demonstrated the utility of trajectory GRNs for comparing gene dynamics across lineages. Beyond just visual comparison of how gene expression changes over pseudotime, trajectory GRNs allow us to identify which transcription factors are activating or suppressing target genes, how top TF targets differ between lineages, and the ways in which TF-target relationships change based on the cell fate towards which cells are progressing. To conclude, I've written up the methodology for creating a trajectory GRN into a function, which is shown in the code block below. It includes functionality for a progress bar if desired, and takes as input matrices of gene dynamics (`dyn.mat`) and normalized expression (`expr.mat`), a set of dynamic genes (`dyn.genes`), and a set of transcription factors (`tx.factors`). Eventually I plan to expand this functionality and turn it into an R package, but for now it'll stay a simple function. ```{r}#| code-fold: trueestimateTGRN <-function(dyn.mat =NULL,expr.mat =NULL, dyn.genes =NULL, tx.factors =NULL, cor.method ="spearman",n.cores =4L, verbose =TRUE, random.seed =312) {# check inputs if (is.null(expr.mat) ||is.null(dyn.mat) ||is.null(dyn.genes) ||is.null(tx.factors)) { stop("Arguments to estimateTGRN() are missing.") }# ID transcription factors tx.factors <- tx.factors[tx.factors %in% dyn.genes]# set up progress bar if desiredif (verbose) { withr::with_output_sink(tempfile(), { pb <- utils::txtProgressBar(0, length(target_genes), style =3) }) progress_fun <-function(n) utils::setTxtProgressBar(pb, n) snow_opts <-list(progress = progress_fun) } else { snow_opts <-list() }# set up parallel processingif (n.cores >1L) { cl <- parallel::makeCluster(n.cores) doSNOW::registerDoSNOW(cl) } else { cl <- foreach::registerDoSEQ() }# set up LightGBM model settings lgbm_params <-list(objective ="gamma", tree_learner ="serial", metric ="l2", boosting_type ="dart",device ="cpu",num_threads =1L, seed = random.seed)# estimate trajectory GRN grn_res <- foreach::foreach(g =seq(dyn.genes), .combine ="list",.multicombine =ifelse(length(dyn.genes) >1, TRUE, FALSE),.maxcombine =ifelse(length(dyn.genes) >1, length(dyn.genes), 2),.packages =c("lightgbm", "dplyr"),.errorhandling ="pass",.inorder =TRUE,.verbose =FALSE,.options.snow = snow_opts) %dopar% { feature_mat <- dyn.mat[, tx.factors] feature_mat <- feature_mat[, colnames(feature_mat) != dyn.genes[g]] resp_var <- dyn.mat[, dyn.genes[g]] lgbm_data <- lightgbm::lgb.Dataset(data = feature_mat, label = resp_var) lgbm_cv <- lightgbm::lgb.cv(params = lgbm_params, data = lgbm_data, nrounds =100L, nfold =5L, stratified =FALSE) lgbm_model <- lightgbm::lgb.train(params = lgbm_params, data = lgbm_data, nrounds = lgbm_cv$best_iter) imp_table <-as.data.frame(lightgbm::lgb.importance(lgbm_model)) %>% dplyr::mutate(Target_Gene = target_genes[g], .before =1) imp_table }names(grn_res) <- target_genesif (n.cores >1L) { parallel::stopCluster(cl) }# format GRN table grn_table <- purrr::imap(grn_res, \(x, y) {if (!inherits(x, "data.frame")) { empty_res <-data.frame(Target_Gene = y, Feature =NA_character_, Gain =NA_real_, Cover =NA_real_, Frequency =NA_real_)return(empty_res) } else {return(x) } })# add mean rank of gain & frequency grn_table <- purrr::reduce(grn_table, rbind) %>% dplyr::select(-Cover) %>% dplyr::rename(Tx_Factor = Feature) %>% dplyr::filter(!is.na(Target_Gene), !is.na(Tx_Factor)) %>% dplyr::arrange(Tx_Factor, desc(Frequency)) %>% dplyr::with_groups(Tx_Factor, dplyr::mutate, Frequency_Rank =row_number()) %>% dplyr::arrange(Tx_Factor, desc(Gain)) %>% dplyr::with_groups(Tx_Factor, dplyr::mutate, Gain_Rank =row_number()) %>% dplyr::rowwise() %>% dplyr::mutate(Mean_Rank =mean(dplyr::c_across(c(Frequency_Rank, Gain_Rank)))) %>% dplyr::ungroup() %>% dplyr::arrange(Tx_Factor, Mean_Rank) %>% dplyr::mutate(Mean_Rank_Weight =1/ Mean_Rank)# add correlation & mutual information b/t TF and each target gene (dynamics and normalized expression) dyn_cormat <- stats::cor(dyn.mat[, dyn.genes], method = cor.method) expr_cormat <- stats::cor(expr.mat[, dyn.genes], method = cor.method) dyn_mimat <- minet::build.mim(dyn.mat[, dyn.genes], estimator ="mi.empirical", disc ="equalwidth") expr_mimat <- minet::build.mim(expr.mat[, dyn.genes], estimator ="mi.empirical", disc ="equalwidth") grn_table <- dplyr::mutate(grn_table, Dyn_Cor =NA_real_, Exp_Cor =NA_real_, Dyn_MI =NA_real_, Exp_MI =NA_real_)for (i inseq(nrow(grn_table))) { grn_table$Dyn_Cor[i] <- dyn_cormat[grn_table$Target_Gene[i], grn_table$Tx_Factor[i]] grn_table$Dyn_MI[i] <- dyn_mimat[grn_table$Target_Gene[i], grn_table$Tx_Factor[i]] grn_table$Exp_Cor[i] <- expr_cormat[grn_table$Target_Gene[i], grn_table$Tx_Factor[i]] grn_table$Exp_MI[i] <- expr_mimat[grn_table$Target_Gene[i], grn_table$Tx_Factor[i]] } grn_table <- dplyr::mutate(grn_table, State = dplyr::if_else(Dyn_Cor <0, "Repression", "Activation"))return(grn_table)}```# Session info {#sec-SI}```{r}sessioninfo::session_info()```