Our key mathematical advance is a generalization of Granger causality to partial orderings. We exploit the equivalence between such orderings and directed acyclic graphs (DAGs) and design a graph neural network framework to infer causality from the dynamics. We complement this with the conceptual advance of cell-state parallax from which we infer a DAG-based representation of single-cell multimodal dynamics. We combine these innovations into GrID-Net (Granger Inference on DAGs), a framework for Granger causal inference on DAGs that we apply to single-cell multimodal data for inferring causal noncoding locus-gene relationships (Fig. 1b, c). Leveraging lagged message-passing on a graph neural network, GrID-Net resolves the cell-state parallax between the epigenome and transcriptome. We demonstrate that it outperforms existing state-of-the-art approaches for accurately identifying noncoding locus-gene links across a variety of cell types and single-cell multimodal assays. We then use GrID-Net to link fine-mapped genetic variants associated with schizophrenia (SCZ) to the genes they affect. We uncover noncoding mechanisms of gene dysregulation that prominently feature disruptions of neural transcription-factor binding and synaptic pathways as putative drivers of SCZ disease pathogenesis (Fig. 1d). Our findings demonstrate the power of GrID-Net in uncovering tissue-specific causal mechanisms of disease and development, using data generated from just one sample. Our work also illustrates the utility of multimodal single-cell studies for investigating the impact of genetic variants on function and phenotype. The code for GrID-Net can be found at https://github.com/alexw16/gridnet, and the full list of SCZ variant-gene links can be found in Supplementary Data 1.
In general, inferring true causality can be a challenge, as it requires perturbations to confirm cause-and-effect relationships and validations to rule out potential confounders. Granger causality is a powerful statistical concept that can approximate causality in time-varying processes solely from observational data. It leans on the principle that a cause precedes its effect. Given two time-dependent variables (a potential cause) and (the effect), if the forecast for can be improved by accounting for the past values of , then is said to Granger cause because changes in temporally precede those in .
One of our key contributions is to broaden the set of domains to which Granger causal analyses can be applied. Specifically, we generalize Granger causal inference to dynamical systems that are characterized by partial orderings. To model the asynchrony between cause and effect, we modify standard graph neural networks to allow for lagged message passing. Each node (i.e., a cell state) in the directed acyclic graph accumulates information from its parent nodes (past states), leveraging this lagged information to forecast the present state in the context of the graph-structured dynamics. GrID-Net consists of two forecasting models: a reduced model forecasts an effect using only the past of , and a full model forecasts the effect using both the past of and the past of the candidate causal variable . The strength of the Granger causal link between and is evaluated by a one-tailed F-test, which quantifies the improvement in forecasting accuracy of the full model over that of the reduced model. A large improvement in forecasting accuracy by the full model indicates that the past values of are predictive of future values of , thereby suggesting a significant Granger causal relationship between and (Methods).
GrID-Net enables the detection of asynchronous interactions between gene regulatory elements along single-cell trajectories. For each single-cell multimodal dataset, a DAG of cell states is constructed from a k-nearest neighbor graph of cells, with edges oriented in accordance with pseudotime computed after a multimodal integration with Schema. RNA-velocity based edge orientations that capture local dynamics can also be supplied, so long as any resulting cycles are broken. We evaluated the Granger causal strength between each gene and chromatin accessible element ("peak") within one megabase (Mb) of the gene's transcription start site, ranking the peak-gene pairs by their Granger causal strengths (Methods).
We evaluated GrID-Net on four single-cell multimodal datasets (scATAC-seq and scRNA-seq) that characterize a range of biological processes, including cell differentiation and drug treatment response. We refer to these four datasets as the 10x Multiome, SHARE-seq, SNARE-seq, and sci-CAR datasets. Statistics describing the DAGs associated with each dataset can be found in Supplementary Table 1. For each dataset, we sought to validate noncoding locus-gene link predictions against expression quantitative trait (eQTL) loci profiled in matching tissue (Methods). eQTLs are an attractive choice for benchmarking such predictions, as they can show associations between genetic variation at a locus and the corresponding changes in a gene's expression. While using eQTL mapping to identify specific variant-gene associations can be prone to confounding factors such as linkage disequilibrium, we reasoned that a systematic assessment against the full collection of eQTLs specific to the matching tissue serves as a useful genome-wide proxy for perturbational validation of causal interactions between genomic loci and genes.
For each gene, we intersected all ATAC-seq peaks within 1 Mb of its transcription start site (TSS) with the set of eQTL variants linked to it, associating each peak with the most significant variant overlapping it. We then evaluated the peak-gene pairs prioritized by GrID-Net and two correlation-based methods used in previous work, Pearson correlation and pseudocell correlation. The latter computes correlations on measurements aggregated across multiple cells. In addition, to assess the advantages of GrID-Net's ability to operate on DAGs, we evaluated the performance of traditional, linear Granger causal inference using vector autoregression (Linear Granger) and a state-of-the-art generalized vector autoregressive method for nonlinear Granger causal inference (GVAR). We also evaluated the ABC model's predictions on bulk data from matching tissues, comparing it against the predictions from the single-cell datasets (Methods).
We evaluated the accuracy of each method in retrieving peak-gene pairs that overlap significant eQTLs (P < 10). GrID-Net was significantly more accurate than all of the other methods across all datasets (Fig. 2a, Welch's one-tailed t-test P < 0.01), with GVAR's performance on the sci-CAR dataset as the only exception. GrID-Net also exhibited 7-22% higher area under the receiver operating characteristic curve (AUROC) compared to the next best method in 3 of the 4 datasets. GrID-Net's ability to outperform these methods was robust to the choice of the eQTL significance threshold (Supplementary Fig. S1). GrID-Net's accuracy was also highly consistent across various hyperparameter choices that influence the construction of the DAG (Supplementary Fig. S2, Methods). GrID-Net demonstrated the smallest improvement in AUROC for the sci-CAR dataset, although all methods performed poorly on this dataset, likely due to the lower data quality of sci-CAR as the earliest single-cell multimodal assay developed. Meanwhile, the correlation-based methods exhibited near-random performance across all datasets, consistently underperforming not only GrID-Net but also the two Granger causal methods and the bulk data-based ABC model. A similar comparison against a recent multiome-specific approach, TRIPOD demonstrates GrID-Net's superior performance (Supplementary Fig. S3). GrID-Net's ability to substantially outperform widely used correlation-based methods and existing Granger causal approaches points to its potential to unravel much richer regulatory insights from these multimodal assays than has been possible.
To further benchmark GrID-Net using functional indicators of noncoding locus-gene interactions, we examined three-dimensional chromatin contacts between gene promoters and noncoding loci. Compared to the two correlation-based methods and the existing Granger causal methods, we found that GrID-Net more accurately prioritized peak-gene pairs that were in frequent three-dimensional contact (Fig. 2b, Methods). The ABC method uses bulk Hi-C data as an input, so we did not evaluate its agreement with chromatin contacts. GrID-Net also demonstrated significantly higher accuracy in prioritizing the most well-supported subset of peak-gene pairs that were corroborated by both three-dimensional contacts and eQTL data (Supplementary Fig. S4). GrID-Net's ability to consistently outperform existing methods in detecting peak-gene pairs supported by different but complementary functional indicators further highlights its ability to hone in on potential causal interactions between noncoding loci and genes. In the future, the GrID-Net framework could be expanded to also utilize chromatin contacts data when available -- as we describe later, incorporating one-dimensional sequence proximity as a signal into GrID-Net improves its results. However, chromatin contact data is currently available only for a limited set of tissues, and the ability to make regulatory inferences from just a single multimodal assay enables the convenient study of rare cell types, cancer mutations, and drug responses, etc.
To specifically investigate the benefit of representing tree-like structures of complex trajectories with a DAG, we next compared GrID-Net to applications of standard, linear Granger causality on individual branches of a trajectory that are more linear in nature. We focused this analysis on the SHARE-seq dataset, which profiles a tree-like trajectory with three separate branches that define the transition of transit-amplifying cells to cortex, medulla, or inner root sheath (IRS) cells. We applied linear Granger causality to the full trajectory as well as to each of the three distinct branches separately. While all of the linear Granger causality approaches yielded better-than-random performance in predicting eQTLs, GrID-Net showcased significantly higher accuracy than all of these approaches. This result not only demonstrates the limitations of studying locally linear components of a trajectory in isolation but also provides further evidence for GrID-Net's ability to harness the more complex tree-like structures of full trajectories to more fully capture critical regulatory biology.
We also assessed the robustness of the single-cell methods to sparsity in single-cell data. GrID-Net better handled sparse data (Supplementary Fig. S5), yielding higher accuracies than the other evaluated correlation-based methods for peak-gene predictions involving genes with sparse data (non-zero transcripts in fewer than 10% of cells) or sparse peaks (under 1% of cells). Notably, pseudocell correlation failed to consistently outperform standard Pearson correlation on these sparse peak-gene cases, even though the former is explicitly designed to compensate for the sparsity of measurements at the level of individual cells. This may be because pseudocell averaging reduces the effective number of independent observations and thus relinquishes the statistical advantages of single-cell heterogeneity. Altogether, these analyses indicate not only GrID-Net's robustness to the current sparsity of single-cell data but also the potential for its increased effectiveness as the sensitivity of these multimodal assays continues to improve.
To better understand the biological basis of GrID-Net's superior accuracy, we applied it on the 10x Multiome dataset profiling human corticogenesis. We then honed in on peak-gene pairs where GrID-Net and correlation-based methods disagree. Specifically, peak-gene pairs were ranked according to the difference between their GrID-Net and correlation-based prioritization scores (Methods), with higher ranking pairs being the ones that were more strongly prioritized by GrID-Net. We then investigated the distinguishing biological characteristics of these differentially-prioritized pairs.
One of our motivations behind GrID-Net is the observation that regions of chromatin around lineage-determining genes become accessible prior to gene expression, foreshadowing lineage choice. GrID-Net was designed to resolve this cell-state parallax between chromatin accessibility and gene expression. We assessed GrID-Net's ability to do so by comparing the lags associated with GrID-Net's detected peak-gene pairs to the lags of peak-gene pairs prioritized by correlation-based methods, which inherently focus on synchronous interactions. We first associated each peak and each gene with the pseudotime corresponding to its highest value and used the difference between these pseudotime points as an estimate of the lag associated with each peak-gene pair. For both correlation-based methods, we found that peak-gene pairs differentially prioritized by GrID-Net are marked by significantly longer pseudotime lags between chromatin accessibility and gene expression (P < 10, Welch's one-tailed t-test, Fig. 2d, Methods). This result supports GrID-Net's ability to account for the asynchrony between noncoding regulatory activity and downstream gene expression control that are expected to characterize causal interactions between the two. GrID-Net identifies these asynchronous relationships for peak-gene pairs that span a broad range of genomic distances. For example, GrID-Net was able to detect the lagged relationship between the expression of CBWD5, a gene implicated in human neuronal development, and the accessibility of a promoter region roughly 100 bp upstream of its TSS. In addition, GrID-Net identified longer-ranged lagged relationships, including a link between the expression of the neuroprotective gene TRAF1 and a peak 13.5 kb away from its TSS that has also been linked to TRAF1 via eQTL analyses (Fig. 2e).
Moreover, to explore the regulatory implications of this asynchrony, we obtained data on histone modification states of enhancers in neural progenitor cells and cross-referenced it with ATAC-seq peaks. In peak-gene pairs differentially prioritized by GrID-Net, more poised enhancers (H3K4me1 only) overlapped with the peaks than active enhancers (both H3K4me1 and H3K27ac) (Fig. 2f, Methods). This was not the case for peak-gene pairs differentially prioritized by correlation-based methods. Poised enhancers are regulatory sites that can trigger future gene expression changes but have not yet affected expression; in contrast, active enhancers correspond to ongoing transcription. This suggests that resolving cell-state parallax enables the detection of lagged relationships between the activities of enhancers and their target genes. GrID-Net can thus leverage the concept of chromatin regulatory potential to facilitate a deeper understanding of tissue-specific causal regulatory mechanisms.
To evaluate the biological significance of capturing these putative causal interactions, we examined the functions of genes implicated in peak-gene pairs differentially prioritized by GrID-Net. Genes in the peak-gene pairs that were specifically prioritized by GrID-Net over the correlation-based methods were highly enriched for gene sets associated with neurogenesis regulation and neural progenitor cells (FDR < 10, GSEA Preranked), which are directly relating to the process of corticogenesis profiled in this dataset. In comparison, genes associated with peak-gene pairs that were not specific to GrID-Net were functionally enriched for housekeeping genes rather than neuronal activity. This disparity in enriched gene sets between the GrID-Net specific and non-specific peak-gene pairs suggests that GrID-Net's ability to detect poised enhancers is key to discovering cell type-specific regulatory relationships. Gene expression control during differentiation has been reported to depend highly on asynchronous enhancer-gene interactions. The enrichment of these differentiation-related gene sets thus underscores GrID-Net's ability to construct accurate genome-wide maps of noncoding locus-gene links that go beyond correlative analyses and towards causality.
The ability to link noncoding loci to the genes they regulate enables the discovery of molecular mechanisms underlying complex diseases. Genome-wide association studies (GWAS) have identified thousands of loci associated with diseases. However, over 90% of GWAS variants lie in noncoding regions, and interpreting the functions of these noncoding variants involves the challenge of linking them to the specific genes that they regulate in a cell type-specific manner. With GrID-Net, we can leverage data from a single-cell assay on matching tissue to uncover these cell type-specific variant-gene links and help bridge the gap between genetic variants and gene dysregulation in disease.
We used GrID-Net to study variants associated with schizophrenia (SCZ), a psychiatric disorder for which inherited genetic variants are believed to underlie its pathogenesis. SCZ's high prevalence of around 1%, its substantial morbidity and mortality, and the lack of effective treatments are largely due to our limited understanding of SCZ pathophysiology. We applied the GrID-Net peak-gene links inferred from human corticogenesis, to interpret and analyze variants reported in the largest GWAS of SCZ to date. We refer to this original study as the Trubetskoy et al. study.
To boost GrID-Net's accuracy on shorter-range peak-gene interactions, we incorporated one-dimensional genomic distance as an additional feature for predicting peak-gene associations. We observed a higher likelihood of peak-gene interaction if the two are located close to each other, in line with reported patterns of genomic distance-dependent regulatory effects. We calibrated the relative importance of this feature on a dataset profiling mouse skin differentiation where we fitted a logistic regression model that combines the raw GrID-Net score with a genomic distance score (Methods). We found that integrating genomic distance with GrID-Net improved predictions of peak-gene links for mouse skin cells. The discriminative power on short-range peak-gene interactions improved, while the ability to detect long-range interactions was preserved. Furthermore, this model trained on mouse skin cell data performed well across species and cell types in human corticogenesis as well (Supplementary Fig. S6). This genomic distance-dependent model's scores were used for all remaining analyses.
From the Trubetskoy et al. SCZ GWAS study, we retrieved all fine-mapped genetic variants with posterior probability (PP) over 5%. We intersected these variants with peaks from the 10x Multiome human corticogenesis dataset. We then generated an initial set of candidate SCZ genes by associating each variant with up to 3 of its overlapping peak's highest scoring genes, retaining only variant-gene associations with scores in the top 50% of all peak-gene scores. This approach yielded a set of 132 genes that not only enriched for genes implicated in the original SCZ GWAS study (FDR = 0.037, Enrichr) but also included previously undetected SCZ gene candidates (Supplementary Data 1). For example, of the 12 SCZ variants that were prioritized by both GrID-Net and Trubetskoy et al., 9 were linked to the same target genes by both approaches (Fig. 3a). This overlap included the association between rs324017 and NAB2, a gene implicated in stress-induced neuroprotective maintenance that also is the downstream target of the SCZ drug cinnarizine. For these same 12 SCZ variants, GrID-Net also proposed 20 additional SCZ candidate genes, of which 12 (60%) were found to be specifically expressed in brain tissues and thus potentially relevant in SCZ etiology (Methods). Some of these additional candidates include NXPH4, a neurexophilin involved in synapse function and cerebellar motor control, and EPN2, an epsin linked to neural fate commitment. Moreover, of the 78 SCZ variants with PP over 50%, only 22 (28%) were linked to genes by the original SCZ GWAS study (Fig. 3b). Inclusion of SCZ variants that were linked to genes by GrID-Net increased this total to 30 (38%) SCZ variants. GrID-Net can, therefore, be used to complement standard strategies for linking variants to genes, such as those used in the original SCZ GWAS study. It can propose gene associations for variants with no annotations and also augment existing annotations by suggesting additional genes that a variant may be affecting.
Importantly, GrID-Net helps address a key challenge in GWAS analyses: the common strategy of assigning GWAS variants to their closest gene misses many biologically meaningful variant-gene relationships. By leveraging single-cell heterogeneity, GrID-Net can evaluate longer-range interactions. Many of the variant-gene links we identify are characterized by large genomic distances between the variant and the TSS of the linked gene. Moreover, the variants are often linked to genes that are multiple genes over from their closest gene (Fig. 3c). For example, DCP1B and NCKAP5L, each of which are connected to only a single variant, are separated from their respective variants by 18 and 11 genes (235 and 242 kilobases (kb)), respectively. DCP1B has been implicated in SCZ, hypomania, and memory performance, while NCKAP5L, a cytoskeletal gene involved in neuronal activity, has previously been associated with SCZ as well as amyotrophic lateral sclerosis, indicating possible roles for these genes in neuron-specific dysregulation in SCZ.
In addition, 52 candidate genes (39.4% of the overall set) were associated with more than one variant (Fig. 3d). These genes may be controlled by the coordinated activity of multiple regulatory elements and co-regulators relevant to SCZ. IGSF9B was linked to 7 noncoding SCZ variants -- the most of any gene. Coding variants in IGSF9B have been found to be associated with major depression and the negative symptoms of SCZ, underscoring its role in regulating affective symptoms in SCZ. Similarly, KCNG2, a potassium channel-encoding gene, was linked to 6 noncoding SCZ variants, and has been proposed as a drug target for SCZ, given the known disruption of potassium channel-related pathways in SCZ. Neither of these genes was prioritized in the original GWAS analysis.
We also assessed the cell type specificity of the genes linked to SCZ variants. We found that 104 candidate SCZ genes (78.8% of the overall set) are differentially expressed across the cell types profiled in the 10x Multiome human corticogenesis dataset (Welch's t-test with Benjamini-Hochberg correction P < 0.05, Fig. 3e). Further analysis of the SCZ variants also revealed that 59% of the variants lie in intronic regions, while only 7% of variants lie in intergenic regions more than 5 kb from any gene (Fig. 3f, Methods). As enhancers with tissue-specific effects are highly enriched in introns while intergenic enhancers more often regulate housekeeping genes, these findings support the potential cell-type-specificity of the regulatory disruptions underlying SCZ pathogenesis.
Transcription factor (TF) binding at the cis-regulatory elements located in noncoding genomic loci is a key mechanism of gene regulation. We observed that SCZ variants with high PP are associated with larger TF binding affinity changes (Supplementary Fig. S7a). To investigate the dysregulatory role of SCZ variants prioritized by GrID-Net, we, therefore, evaluated their effect on TF binding. For each variant associated with a gene by GrID-Net, we calculated the difference in the predicted TF binding affinity between the reference allele and the SCZ risk allele (Methods, Supplementary Data 2). Of the 115 SCZ variants linked to genes, 28 (these were associated with 49 target genes) were predicted to cause a 20-fold or stronger change in at least one TF's binding affinity (Fig. 4a). The TFs we identified as having the most widespread binding affinity changes are known to be implicated in SCZ, exemplifying GrID-Net's ability to deliver de novo mechanistic insight. VEZF1, a vasculature-related TF reported to play a role in SCZ dysregulation, displayed diminished binding affinity most frequently (in five variants). In addition, KLF15 and KLF6, two Kruppel-like family (KLF) TFs known to regulate neuronal growth and linked to SCZ, were each associated with reduced binding affinity in 4 variants (Supplementary Fig. S7b). Meanwhile, the TFs with the most widespread increase in binding affinity -- SP1, SP3, and KLF16 -- have also been reported to be disrupted in SCZ (Supplementary Fig. S7c). More broadly, of the TFs with a 20-fold binding affinity change in at least one variant, 57.9% (=33/57) have cell type-specific expression patterns in corticogenesis (Welch's t-test with Benjamini-Hochberg correction P < 0.05). Collectively, these results point to genetic disruption of neuron-specific regulatory elements as potential drivers of SCZ pathogenesis.
We reasoned that if the TF binding affinity changes at a SCZ variant are phenotypically meaningful, they should result in expression changes of the corresponding gene(s) (as identified by GrID-Net). Accordingly, we obtained gene expression data from glial progenitor cells (GPCs) produced from SCZ patient-derived induced pluripotent cells, as well as matching controls, and computed differential expression patterns in SCZ GPCs. Focusing on glial differentiation genes, we observed that the differential expression of a gene in SCZ GPCs was indeed significantly correlated with TF binding affinity changes at the corresponding SCZ variant (Pearson correlation coefficient = 0.59, Fig. 4b, Methods). The consistency between these independent observations provides support to our overall analytic approach and underscores the potential to leverage GrID-Net for examining causal, gene-specific regulatory mechanisms.
We drilled down into specific genes reported by our analysis, focusing on those linked to SCZ variants associated with at least 20-fold TF binding affinity changes (Fig. 4c). NAB2, the aforementioned gene implicated in neuroprotective processes, was linked to rs324017, which contributes to a 200-fold decline in binding affinity for the TF PURA. Intriguingly, PURA is also critical for neural function and development, and its dysfunction is implicated in several neurological and cognitive impairments. In addition, rs12991836 was linked to ZEB2, whose TSS is positioned 137 kb from the variant. ZEB2 has been previously reported as a target of long-range regulation, and heterozygous ZEB2 mutations are known to cause Mowat-Wilson syndrome, which features structural brain abnormalities and intellectual disability.
Furthermore, GrID-Net prioritized two potassium transport genes -- KCNG2 and SLC12A6 -- a class of genes which has been associated with compromised potassium uptake when dysregulated. This phenotype is characteristic of not only SCZ but also other neuron-related diseases, such as Huntington's disease, multiple sclerosis, and hemiplegic migraine. KCNG2 was linked to 6 SCZ variants, the second most of any gene, as described earlier. Using TF binding affinity changes to hone in on the most relevant variants, we identified rs8091497 as the variant with the largest TF binding affinity change. rs8091497 was associated with a 40-fold decline in binding affinity for SP1, which itself has been linked to behavioral abnormalities in SCZ. As for SLC12A6, its corresponding variant (rs117799466) increases the binding affinity for many TFs with repressive capabilities. The three TFs with the greatest binding affinity gains at this locus -- SP1, ZN281, and MAZ -- all play a role in neuronal differentiation, and both SP1 and ZN281 have also shown increased expression levels in SCZ as well. Combined with their increased binding affinity incurred by rs117799466, the upregulation of these TFs may contribute to the repression of SLC12A6, an outcome that has been previously linked to the potassium uptake impairments that characterize SCZ glia. Together, these findings suggest a possible mechanism for SLC12A6 downregulation in SCZ that implicates enhanced binding of neuron-related repressors to a SLC12A6-targeting enhancer to reduce its expression levels.
We then broadened our analysis to gene modules, exploring if the genes identified by GrID-Net are systematically involved in SCZ-related pathways. We focused on the 14 most disruptive SCZ variants that cause at least 40-fold changes in TF binding affinity. These SCZ variants are linked by GrID-Net to 26 unique genes. To identify pathways that may be functionally impacted by disruptions of this set of genes, we searched for additional genes that are co-expressed with these genes in SCZ GPCs. In doing so, we detected 3 distinct gene modules that are highly co-expressed with several of the 26 GrID-Net-linked genes (Fig. 4d, Methods). Module A (SLC12A6, DLX1, and DLX2) is enriched for genes related to noradrenergic neuron differentiation (FDR < 0.05) and morphogenesis (FDR < 0.02); Module B (MFAP4, STAT6, NXPH4) is enriched for extracellular matrix organization (FDR < 10) and angiogenesis regulation (FDR < 0.02) genes, and Module C (EPN2, P2RX7, ICA1) is enriched for synaptic transmission (FDR < 10) and postsynaptic membrane potential (FDR < 0.05) (Enrichr). Each of these pathways has been previously associated with SCZ. Noradrenergic neurons have long been implicated in disrupted neuromodulation in SCZ, neuronal angiogenesis underlies the vascular abnormalities observed in SCZ, and genetic studies have pointed to the critical role of synaptic function in SCZ. These pathways, therefore, enable the interpretation of SCZ variants beyond the dysregulation of the specific genes that they directly affect, presenting an opportunity to discover systems-level mechanisms of SCZ pathogenesis.