Integrating many co-splicing networks to reconstruct splicing regulatory modules
© Dai et al.; licensee BioMed Central Ltd. 2012
Published: 16 July 2012
Skip to main content
© Dai et al.; licensee BioMed Central Ltd. 2012
Published: 16 July 2012
Alternative splicing is a ubiquitous gene regulatory mechanism that dramatically increases the complexity of the proteome. However, the mechanism for regulating alternative splicing is poorly understood, and study of coordinated splicing regulation has been limited to individual cases. To study genome-wide splicing regulation, we integrate many human RNA-seq datasets to identify splicing module, which we define as a set of cassette exons co-regulated by the same splicing factors.
We have designed a tensor-based approach to identify co-splicing clusters that appear frequently across multiple conditions, thus very likely to represent splicing modules - a unit in the splicing regulatory network. In particular, we model each RNA-seq dataset as a co-splicing network, where the nodes represent exons and the edges are weighted by the correlations between exon inclusion rate profiles. We apply our tensor-based method to the 38 co-splicing networks derived from human RNA-seq datasets and indentify an atlas of frequent co-splicing clusters. We demonstrate that these identified clusters represent potential splicing modules by validating against four biological knowledge databases. The likelihood that a frequent co-splicing cluster is biologically meaningful increases with its recurrence across multiple datasets, highlighting the importance of the integrative approach.
Co-splicing clusters reveal novel functional groups which cannot be identified by co-expression clusters, particularly they can grant new insights into functions associated with post-transcriptional regulation, and the same exons can dynamically participate in different pathways depending on different conditions and different other exons that are co-spliced. We propose that by identifying splicing module, a unit in the splicing regulatory network can serve as an important step to decipher the splicing code.
Alternative splicing provides an important means for generating proteomic diversity. Recent estimates indicate that nearly 95% of human multi-exon genes are alternatively spliced . The mechanism for regulating alternative splicing is still poorly understood, and its complexity attributes to the combinatorial regulation of many factors, e.g. splicing factors, cis-regulatory elements, and RNA secondary structure [2, 3]. A fundamental task of alternative splicing research is to decipher splicing code and understand the mechanism of how an exon is alternatively spliced in tissue-specific manner.
A central concept in transcription regulation is the transcription module, defined as a set of genes that are co-regulated by the same transcription factor(s). Analogously, such coordinated regulation also occurs at the splicing level [4–6]. For example, the splicing factor Nova regulates exon splicing of a set of genes that shape the synapse . However, the study of such coordinated splicing regulation has thus far been limited to individual cases [5–9]. In this paper, we define a splicing module as a set of exons that are regulated by the same splicing factors. The exons in a splicing module can belong to different genes, but they exhibit correlated splicing patterns (in terms of being included or excluded in their respective transcripts) across different conditions, thus form an exon co-splicing cluster.
The recent development of RNA-seq technology provides a revolutionary tool to study alternative splicing. From each RNA-seq dataset, we can derive not only the expression levels of genes, but also those of exons and transcripts (i.e., splicing isoforms). Given an RNA-seq dataset containing a set of samples, we can calculate the inclusion rate of each exon (In this study we only consider cassette exons, which are common in alternative splicing events. Henceforth, the term "exon" always means "cassette exons".) In every sample, as the ratio between its expression level and that of the host gene. A recent study provided a nice example of studying splicing regulatory relationships using a network of exon-exon, exon-gene, and gene-gene links . Here, we construct from each RNA-seq dataset a weighted co-splicing network where the nodes represent exons and the edge weights are correlations between the inclusion rates of two exons across all samples in the dataset. While directly comparing the inclusion rates for the same exon in different datasets could be biased by platforms and protocols, the correlations between inclusion rates for a given exon pair are comparable across datasets. From a series of RNA-seq datasets, we can therefore derive a series of co-splicing networks, which can be subjected to comparative network analysis and provide an effective way to integrate a large number of RNA-seq experiments conducted in different laboratories and using different technology platforms.
A heavy subgraph in a weighted co-splicing network represents a set of exons that are highly correlated in their inclusion rate profiles; i.e., they are co-spliced. A set of exons which frequently form a heavy subgraph in multiple datasets are likely to be regulated by the same splicing factors, and thus form a splicing module. We call such patterns frequent co-splicing clusters. Due to the enhanced signal to noise separation, frequent clusters are more robust and are more likely to be regulated by the same splicing factors (thus more likely to represent splicing modules) than those heavy subgraphs derived from a single dataset. In our previous research , we showed that the likelihood for a gene co-expression cluster to be a transcription module increases significantly with the recurrence of clusters in multiple datasets. A similar principle applies to splicing modules.
We applied our tensor algorithm to 38 weighted exon co-splicing networks derived from human RNA-seq datasets. We identified an atlas of frequent co-splicing clusters and validated them against four biological knowledge bases: Gene Ontology annotations, RNA-binding motif database, 191 ENCODE genome-wide ChIP-seq profiles, and protein complex database. We demonstrate that the likelihood for an exon cluster to be biologically meaningful increases with its recurrence across multiple datasets, highlighting the benefit of the integrative approach. Moreover, we show that co-splicing clusters can reveal novel functional groups that cannot be identified by co-expression clusters. Finally, we show that the same exons can dynamically participate in different pathways, depending on different conditions and different other exons that are co-spliced.
We identified 38 human RNA-seq datasets from the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra) (see additional file 1: Section S6 to get details of these datasets), each with at least 6 samples providing transcriptome profiling under multiple experimental conditions, such as diverse tissues or various diseases. For each dataset, we used Tophat  tool to map short reads to the hg19 reference genome and applied the transcript assembly tool Cufflinks  to estimate expressions for all transcripts with known UCSC transcript annotations . We calculated the inclusion rate of each exon, as the ratio between its expression (the sum of FPKM (FPKM stands for "Fragments Per Kilobase of exon per Million fragments mapped", as defined in .) over all transcripts that cover the exon) and the host gene's expression (the sum of FPKM over all transcripts of the gene). It is worth noting that in RNA-seq experiments, a gene expression with low FPKM is usually not precisely estimated because the number of reads mapped to the gene is quite small. In order to work with reasonably accurate estimates of exon inclusion rates, as pointed out by , we calculated inclusion rates only for those genes whose expressions are above 80th percentile across at least 6 samples. This criterion resulted in inclusion rate profiles for 16,024 exons covering 9,532 genes. Based on these profiles, we constructed an exon co-splicing network from each RNA-seq dataset by using Pearson's correlation between exons' inclusion rate profiles. Details of data processing refer to additional file 1 (Section S5).
We applied our method to 38 RNA-seq datasets generated under various experimental conditions. Adopting the empirical criteria of "heaviness" ≥ 0.4 and cluster size ≥5 exons, we identified 7,194/3,104/1,422/594 co-splicing clusters with recurrences ≥3/4/5/6.
To assess the biological significance of the identified patterns, we evaluate the extent to which these exon clusters represent functional modules, splicing modules, transcriptional regulatory modules, and protein complexes. Due to the difference of background "gene" numbers, we set different p-value thresholds for significance test.
By construction, the exons in our identified co-splicing clusters have highly correlated inclusion rate profiles across different experimental conditions. Clusters meeting this criterion are likely to consist of exons co-regulated by the same splicing factors. It has been shown that splicing factors can affect alternative splicing by interacting with cis-regulatory elements in a position-dependent manner . We collected the experimental RNA target motifs (2220 RNA binding sites) of 62 splicing factors from the SpliceAid2 database . To identify possible splicing factors associated with a co-splicing cluster, for each exon of a co-splicing cluster, we retrieved the internal exon region and its 50bp flanking intron region which are enriched in the motifs of those 62 splicing factors by performing BLAST search (E-score < 0.001). If the exons of a cluster are highly enriched in the targets of a splicing factor, we consider the cluster to be "splicing homogeneous". Although the collection of known splicing motifs is very limited, at the p-value < 0.05 level (based on hypergeometric test), we still observed that 4.9% clusters with ≥5 exons and ≥6 recurrences are splicing homogenous, compared to 1.6% of randomly generated patterns with the same size distribution. Its enrichment fold ratio is 3.0. Performing the same analysis for less frequent clusters, we found that as the recurrence increases, so does the enrichment fold ratio (Figure 2B). The five most frequently enriched splicing factors are hnRNP E2, 9G8, hnRNP U, SRp75 and SRp30c. hnRNP E2 and hnRNP U both belong to heterogeneous nuclear ribonu-cleoprotein family, which generally suppress splicing through binding to exonic splicing silencer . Studies show that hnRNP E2 can repress exon usage when present at high levels in vitro , and hnRNP U bind pre-mRNA as well as nuclear mRNA and play an important role in processing and transport of mRNA . 9G8, SRp75 and SRp30c all belong to the SR family of splicing regulators. 9G8 protein excluding other SR factors can rescue the splicing activity of a 9G8-depleted nuclear extract, indicating 9G8 plays a crucial role in splicing . SRp75 is present in messenger ribonucleoproteins in both cycling and differentiated cells, and shuttles between nucleus and cytoplasm, implicating its widespread roles in splicing regulation . SRp30c can function as a repressor of 3' splice site utilization and SRp30c-CE9 interaction may contribute to the control of hnRNP A1 alternative splicing .
We found that some splicing factors tend to co-bind to the cis-regulatory regions of exons in a co-splicing cluster, suggesting the combinatorial regulation of those splicing factors. Trans-acting SR proteins Tra2α and SRp30c are simultaneously enriched in 18 clusters (with recurrence ≥3), whose major functions (by GO term enrichment) are related to post-transcriptional regulation, such as "ribonucleoprotein binding" (p-value = 2.11E-5), "nuclear mRNA splicing, via spliceosome" (p-value = 7.66E-5), "RNA export from nucleus" (p-value = 4.81E-5), and "translational initiation" (p-value = 2.48E-5). Current study indeed shows that there is a cooperative interaction between Tra2α and SRp30c in exonic splicing enhancer dependent GnRG pre-mRNA splicing . Splicing regulators SRp20, SRp30c and SRp75 are simultaneously enriched in 2 clusters (with recurrence ≥3), which are also associated with post-transcriptional regulation. For example, "RNA splicing" (p-value = 3.25E-6), "translation initiation factor activity" (p-value = 7.42E-5), and "eukaryotic translation initiation factor 3 complex" (p-value = 2.17E-4). Our results suggest that combinatorial splicing regulation can occur in post-transcriptional processes.
To evaluate how co-splicing is affected by transcriptional regulation, we used 191 ChIP-seq profiles generated by the Encyclopedia of DNA Elements (ENCODE) consortium . This dataset includes the genome-wide bindings of 40 transcription factors (TF), 9 histone modification marks, and 3 other markers (DNase, FAIRE, and DNA methylation) on 25 different cell lines. For a detailed description of the signal extraction procedure, see the additional file 1 (Section S7). If the host genes of an exon cluster are highly enriched in the targets of any regulatory factor, we consider the cluster to be "transcription homogenous". At the significance level p-value <0.01, 74.9% clusters with recurrences ≥3 are transcription homogenous, compared to only 21.2% of randomly generated clusters with the same sizes. As expected, the enrichment fold ratio increases with recurrence (Figure 2C). This result suggests a strong association between transcription and splicing. The four most frequently enriched regulatory factors are TAF8, GABP, FOS and NFYB. TAF8 is a subunit of transcription initiation factor TFIID, which is required for accurate and regulated initiation by RNA polymerase II . As an ETS transcription factor, GABP plays a key role in regulating genes which are intimately involved in cell cycle control, protein synthesis and cellular metabolism . FOS can dimerise with c-Jun to form AP-1 transcription factor, which upregulates transcription of a wide range of genes involved in proliferation and differentiation to defense against invasion and cell damage . NFYB is a subunit of an ubiquitous heteromeric transcription factor NF-Y, which regulates 30% of mammalian promoters .
We evaluate the extent to which host genes of our identified exon clusters are protein complexes by using the Comprehensive Resource of Mammalian protein complexes database (CORUM, September 2009 version) . At the significance level p-value <0.05, 18.1% of co-splicing clusters with recurrences ≥3 are enriched in genes belonging to a protein complex, versus only 0.7% of randomly generated clusters with the same sizes. The enrichment fold ratio for protein complexes increases with the cluster recurrence (Figure 2D). The five most frequently enriched protein complexes are "Spliceosome", "CCT micro-complex", "large Drosha complex", "Nop56p-associated pre-rRNA complex", and "C complex spliceosome". At least 1/3 of subunits in the enriched complex "large Drosha complex" contain proteins associated with splicing function, especially heterogeneous nuclear ribonucleoproteins such as HNRNPH1, HNRNPM, HNRNPU, HNRNPUL1 and HNRNPDL .
Studies have shown that genes that are co-regulated transcriptionally do not necessarily overlap with those that are co-spliced . Therefore, the identification of co-splicing clusters can reveal functionally related genes that could not be discovered from transcription analysis. In order to identify novel functions associated with co-splicing but not co-expression, we complement the above analysis by constructing a gene co-expression network from each RNA-seq dataset. The nodes of these networks represent genes, and the edges are weighted by Pearson's correlation between two gene expression profiles. We then apply our tensor-based pattern mining algorithm to identify frequent co-expression clusters in the 38 co-expression networks with the same criteria as those of identifying co-splicing clusters. The same functional enrichment analysis described above for co-splicing clusters was performed on the resulting co-expression clusters. We found that 98.8% of co-splicing clusters with recurrences ≥ 3 have low expression correlations (average correlations ≤ 0.2). Therefore, many of the functions associated with post-transcriptional regulation are enriched in co-splicing clusters but not in co-expression clusters. These functions include "maintenance of protein location", "regulation of protein catabolic process", "cytoplasmic sequestering of protein", "regulation of intracellular protein transport", "regulation of ubiquitin-protein ligase activity", "ribonucleoprotein complex assembly", "RNA splicing, via transesterification reactions", and "RNA export from nucleus".
For example, one co-splicing cluster has seven host genes: HNRNPUL1, HNRNPC, DHX9, BAT1, PSMA5, RAD23 and RPS9. This cluster cannot be found from co-expression data, for the expression profiles of the host genes have low correlations. However, this set of host genes is enriched with several splicing associated functions, including "RNA splicing" (p-value = 1.89E-6) and "RNA helicase activity" (p-value = 4.68E-5). Out of seven host genes, HNRNPUL1 and HNRNPC belong to heterogeneous nuclear ribonucleoprotein family, which generally suppress splicing through binding to exonic splicing silencer . DHX9, known as RNA helicase A, is a highly conserved DEAD-box protein that activates transcription, modulates RNA splicing, binds the nuclear pore complex and involves in spliceosome assembly [34, 35]. Previous research illustrated that DHX9 mediates association of CBP and RNA polymerase II , and current study further shows that DHX9 interacts with post-transcriptional control element RNA in the nucleus and cytoplasm to facilitate efficient translation . Interestingly, HNRNPC and DHX9 are indeed tightly functionally associated: silencing of DHX9 seriously disturbed the nuclear distribution of the hnRNP C protein . As an essential splicing factor, BAT1 also belongs to DEAD-box protein family, and plays an important role in mRNA export from the nucleus to the cytoplasm, supported by recent experimental evidence that knocked down BAT1 induces spliced mRNA, as well as total polyA RNA accumulating in nuclear speckle domains, not exporting to the cytoplasm . Clearly, co-splicing clusters can provide complementary information on functionally related gene groups in addition to co-trancriptional clusters. In particular, co-splicing clusters can grant new insights into functions associated with post-transcriptional regulation.
Alternatively skipping or including a cassette exon can change the functions of a protein by deleting or inserting a protein domain. In other words, protein isoforms alternatively spliced from the same gene may participate in different pathways. In our results, we observed that 70.3%/52.3%/38.3%/27.1% of exons are members of at least two clusters (recurrence≥3/4/5/6) with different functions. For example, exon8 of the gene Rela appears in four co-splicing clusters with recurrences ≥3, which are enriched with the following distinct functions respectively: "ER-associated protein catabolic process" (p-value = 2.20E-5), "response to extracellular stimulus" (p-value = 3.80E-5), "regulation of gene-specific transcription" (p-value = 8.89E-5), and "positive regulation of intracellular protein kinase cascade" (p-value = 2.49E-5). Rela encodes the transcription factor p65, which is an important subunit of the NF-κB complex that affects several hundred genes by NF-κB signaling. Recent research has identified several alternative splice variants of Rela, e.g. p65A, p65A2 and p65A3. In fact, p65A arises by the use of an alternative splice site located 30 nucleotides into exon8, and p65A3 was identified as a splice variant lacking exon7 and exon8 . These facts are consistent with our finding that exon8 is dynamically included in multiple co-splicing clusters. As another example, exon2 of the gene EIF5 appears in three co-splicing clusters with recurrences ≥3, which are enriched with following distinct functions respectively: "RNA splicing" (p-value = 6.27E-5), "mRNA polyadenylation" (p-value = 1.57E-5), and "regulation of translational initiation" (p-value = 8.18E-5). As a translation initiation factor, EIF5 plays critical roles for the accurate recognition of correct start codon during translation initiation . Our result suggests that except for translation initiation regulation, EIF5 may also involve in post-transcriptional regulation, such as RNA splicing and mRNA polyadenylation by dynamically including exon2 in multiple co-splicing clusters. These examples demonstrate that exons can contribute to different functionalities of proteins depending on different splicing regulatory mechanisms.
Splicing code is determined by a combination of many factors, such as cis-regulatory elements and transacting factors. If some exons share the same splicing code, they may form a splicing module: a unit in the splicing regulatory network. Therefore, identifying co-splicing clusters first and then investigating their cis-regulatory elements and associated trans-acting factors can serve as an important step to decipher the splicing code. Our tensor-based approach can identify co-spliced exon clusters that frequently appear in multiple RNA-seq datasets. The exons in a frequent co-splicing cluster can belong to different genes, but are very likely to be co-regulated by the same splicing factors, thus forming a splicing module. We demonstrated that the identified clusters represent meaningful biological modules, i.e. functional modules, splicing modules, transcriptional modules, and protein complexes, by validating against four biological knowledge databases. In all four types of enrichment results, the likelihood that a co-splicing cluster is biologically meaningful increases with its recurrence. This consistent behavior highlights the importance of the integrative approach. We also showed that the co-splicing clusters can reveal novel functional related genes that cannot be identified by co-expression clusters, and that the same exons can dynamically participate in different pathways depending on different conditions and different other exons that are co-spliced. The NCBI Sequence Read Archive database currently stores 8099 RNA-seq profiles, and this number is expected to dramatically increase in the near future. We expect to apply our approach to the rapidly accumulating RNA-seq data of multiple organisms, and to identify a large number of splicing modules and their associated phenotype conditions. This analysis can serve as a first step towards the reconstruction of tissue- and disease-specific splicing regulatory networks.
Note that only the weights of edges a ijk with x i = x j = y k = 1 are counted in . Thus, measures the "heaviness" of the FSC defined by x and y. The problem of discovering a frequent co-splicing cluster can be formulated as a discrete combinatorial optimization problem: among all patterns of fixed size (K 1 member exons and K 2 member networks), we look for the heaviest. This is also an integer programming problem: find the binary membership vectors x and y that jointly maximize under the constraints and . However, there are several major drawbacks to this discrete formulation.
where ℝ+ is a non-negative real space, and f(x) and g(y) are vector norms. After solving Eq. (2), users can easily identify the top-ranking networks (after sorting the tensor by y) and top-ranking exons (after sorting each network by x) contributing to the objective function. After rearranging the networks in this manner, the FSC with the largest heaviness occupies a corner of the 3D tensor. We can then mask all edges in the heaviest FSC with zeros, and optimize Eq. (2) again to search for the next FSC.
We performed simulations to determine suitable values for the parameters p, α, and q, by applying our tensor method to collections of random weighted networks. We randomly placed FSCs of varying size, recurrence, and heaviness in a subset of the random networks. We then tried different combinations of p, α, and q, and adopted the combination (p = 0.8, α = 0.2, and q = 10) that led to the discovery of the most FSCs. More details on these simulations are provided in the additional file 1 (Section S4).
Since the vector norm f(x) is non-convex, our tensor method requires an optimization protocol that can deal with non-convex constraints. The quality of the optimum discovered for a non-convex problem depends heavily on the numerical procedure. Standard numerical techniques such as gradient descent converge to a local minimum of the solution space, and different procedures often find different local minima. Thus, it is important to find a theoretically justified numerical procedure. We use an advanced framework known as multi-stage convex relaxation, which has good numerical properties for non-convex optimization problems . In this framework, concave duality is used to construct a sequence of convex relaxations that give increasingly accurate approximations to the original non-convex problem. We approximate the sparse constraint function f(x) by the convex function , where h(x) is a specific convex function h(x) = x 2 and is the concave dual of the function (defined as ). The vector v contains coefficients that will be automatically generated during the optimization process. After each optimization, the new coefficient vector v yields a convex function that more closely approximates the original non-convex function f(x). Details of our tensor-based optimization method can be found in the additional file 1 (Section S3).
Once the membership vectors (i.e., the solution of Eq. (2)) have been found by optimization, the frequent co-splicing clusters can be intuitively obtained by including those exons and networks with large membership values. However, any given solution can result in multiple overlapping patterns whose "heaviness" is greater than a specified threshold. Here, heaviness is defined as the average weight of all edges in the pattern. To identify the most representative pattern, we first rank exons and networks in decreasing order of their membership values in and . Then we extract two representative patterns that satisfy the heaviness threshold: the pattern that occurs in the most networks while having at least the minimum number of top-ranking exons (e.g., 5), and the pattern with the largest number of top-ranking exons while appearing in at least the minimum number of top-ranking networks (e.g., 3). Both patterns are included as co-splicing clusters in our results. After discovering a pattern, we can mask its edges in those networks where they occur (replacing those elements of the tensor with zeroes) and optimize Eq. (2) again to search for the next frequent co-splicing cluster.
The work presented in this paper was supported by National Institutes of Health Grants R01GM074163 and NSF Grant 0747475 to XJZ, and National Science Foundation of China 60970063, Program for New Century Excellent Talents in University NCET-10-0644, the Ph.D. Programs Foundation of Ministry of Education of China 20090141110026 and the Fundamental Research Funds for the Central Universities 6081007 to JL.
This article has been published as part of BMC Systems Biology Volume 6 Supplement 1, 2012: Selected articles from The 5th IEEE International Conference on Systems Biology (ISB 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S1.