Integrating many co-splicing networks to reconstruct splicing regulatory modules

Background Alternative splicing is a ubiquitous gene regulatory mechanism that dramatically increases the complexity of the proteome. However, the mechanisms regulating alternative splicing are poorly understood, and studies of coordinated splicing regulation have been limited to individual cases. To study genome-wide splicing regulation, we integrate many human RNA-seq datasets to identify splicing modules, which we define as sets of cassette exons co-regulated by the same splicing factors. Results We have designed a tensor-based approach to identify co-splicing clusters that appear frequently across multiple conditions and are therefore very likely to represent splicing modules, the units of the splicing regulatory network. In particular, we model each RNA-seq dataset as a co-splicing network, in which the nodes represent exons and the edges are weighted by the correlations between exon inclusion-rate profiles. We apply our tensor-based method to the 38 co-splicing networks derived from human RNA-seq datasets and identify an atlas of frequent co-splicing clusters. We demonstrate that these clusters represent potential splicing modules by validating them against four biological knowledge databases. The likelihood that a frequent co-splicing cluster is biologically meaningful increases with its recurrence across multiple datasets, highlighting the importance of the integrative approach. Conclusions Co-splicing clusters reveal novel functional groups that cannot be identified by co-expression clusters. In particular, they grant new insights into functions associated with post-transcriptional regulation, and they show that the same exons can dynamically participate in different pathways depending on the condition and on which other exons they are co-spliced with. We propose that identifying splicing modules, the units of the splicing regulatory network, can serve as an important step toward deciphering the splicing code.


S1 NP-hardness of the Heaviest (K_1, K_2)-Frequent Heavy Subgraph Problem

The frequent co-splicing cluster can also be described in graph terminology as a "frequent heavy subgraph". In this supplementary material, we will use the term "frequent heavy subgraph" rather than "frequent co-splicing cluster". Given a set of m undirected graphs G_1, . . . , G_m over the same n vertices V but with different topologies (and without self-loops), i.e., G = {G_1(V, E_1), . . . , G_m(V, E_m)} with V = {v_1, . . . , v_n} and non-negative weights a_ijk for edges (v_i, v_j) ∈ E_k in the k-th graph, the (K_1, K_2)-FHS problem is formally defined as follows.

Problem S1.1. Given G, the (K_1, K_2)-Frequent Heavy Subgraph (FHS) problem is to determine a subset S_V ⊆ V of K_1 vertices and a subset S_G ⊆ G of K_2 graphs such that the total sum of edge weights of the subgraphs induced by S_V in each graph of S_G is maximized.

A straightforward cubic 0-1 formulation of (K_1, K_2)-FHS is

    max Σ_{i,j,k} a_ijk x_i x_j y_k
    s.t. Σ_{i=1}^n x_i = K_1, Σ_{k=1}^m y_k = K_2,
         x_i ∈ {0, 1} for any 1 ≤ i ≤ n,
         y_k ∈ {0, 1} for any 1 ≤ k ≤ m.

Next we prove the NP-hardness of this problem.
Proof. We can reduce the well-known NP-complete K-clique problem (i.e., is there a clique with K vertices in a graph?) to this problem, and thereby prove its NP-hardness. Let G(V, E) be an undirected unweighted graph without self-loops. We copy this graph m times to generate a graph set consisting of the m graphs G = {G_1(V, E), . . . , G_m(V, E)}, in which all graphs have the same vertex set V and edge set E. The question "is there a clique with K vertices in G?" can then be answered by solving the (K, K_2)-FHS problem on the graph set G, because of the following two claims: • If the heaviest (K, K_2)-FHS found in the graph set G is a clique with K vertices recurring in K_2 graphs, then there exists a clique with K vertices in G. This claim is obvious and straightforward.
• If the heaviest (K, K_2)-FHS found in the graph set G is not a clique with K vertices recurring in K_2 graphs, then a clique with K vertices does not exist in G. This claim can be proved by contradiction: suppose there exists a clique with K vertices in G. Since every graph G_i in the graph set G is a copy of G, this K-clique must exist in all m graphs of the set, and hence in at least K_2 of them. Among all (K, K_2)-FHSs in the m unweighted graphs, a K-clique recurring in K_2 graphs attains the largest possible total sum of edge weights, so it must be the solution of the heaviest (K, K_2)-FHS problem. This contradicts the statement that the heaviest (K, K_2)-FHS found is not a clique with K vertices recurring in K_2 graphs.
Since the K-clique problem is NP-complete, the (K_1, K_2)-FHS problem is NP-hard.

Figure S1. Two-dimensional contour plots of the vector norm constraints. (A) Contour of the vector norm L_p (0 < p < 1). As p → 0, L_p → L_0, and therefore the L_p constraint makes x = (x_1, x_2)^T sparser. (B) Contour of the vector norm L_p (p > 1). As p → ∞, L_p → L_∞, and therefore the L_p constraint makes x = (x_1, x_2)^T more even.
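To make the (K_1, K_2)-FHS objective concrete, the following sketch enumerates all candidate vertex and graph subsets on a tiny hypothetical tensor. The data and function name are illustrative only; exhaustive enumeration is feasible solely at toy sizes, consistent with the NP-hardness shown above.

```python
import itertools
import numpy as np

def fhs_bruteforce(A, K1, K2):
    """Exhaustively search the heaviest (K1, K2)-frequent heavy subgraph.

    A is an (n, n, m) array: A[i, j, k] is the weight of edge (v_i, v_j)
    in the k-th network (symmetric in i, j; zero diagonal)."""
    n, _, m = A.shape
    best, best_sets = -np.inf, None
    for S_V in itertools.combinations(range(n), K1):
        for S_G in itertools.combinations(range(m), K2):
            sub = A[np.ix_(S_V, S_V, S_G)]
            # each undirected edge appears twice in the symmetric slice
            weight = sub.sum() / 2.0
            if weight > best:
                best, best_sets = weight, (S_V, S_G)
    return best, best_sets

# toy example: a heavy 3-vertex clique on vertices {0,1,2} in networks {0,1}
A = np.zeros((5, 5, 3))
for k in (0, 1):
    for i, j in itertools.combinations((0, 1, 2), 2):
        A[i, j, k] = A[j, i, k] = 1.0
best, (S_V, S_G) = fhs_bruteforce(A, 3, 2)
print(best, S_V, S_G)  # 6.0 (0, 1, 2) (0, 1): the planted clique is recovered
```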

S2 Details of vector norms
The L_p (p > 0) norm of a vector x ∈ R^{n×1} is defined as ||x||_p = (Σ_{i=1}^n |x_i|^p)^{1/p}. Two extreme cases of the L_p norm are the zero norm L_0 = card{x_i | x_i ≠ 0}, where card is the set cardinality (i.e., L_0 counts the non-zero elements of x), and the infinity norm (or maximum norm) L_∞ = max{|x_1|, . . . , |x_n|}. When the L_p vector norm is used as an optimization constraint, the set of all vectors with norm 1 (i.e., ||x||_p = 1) defines the feasible region, whose two-dimensional case is shown in Figure S1. It can be observed that the closer p is to zero, the sparser x is, while the closer p is to ∞, the smoother or more even x is. In practice, L_p with p < 1 is often used to approximate L_0, and L_p with p ≥ 2 is often used to approximate L_∞.
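These definitions can be checked numerically; a minimal sketch with illustrative values:

```python
import numpy as np

def lp_norm(x, p):
    """L_p norm ||x||_p = (sum_i |x_i|^p)^(1/p) for p > 0."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0, 0.0])

# p = 2 gives the familiar Euclidean norm
print(lp_norm(x, 2))        # 5.0

# L_0 "norm": number of non-zero entries
print(np.count_nonzero(x))  # 2

# L_infinity norm: the largest absolute entry; L_p approaches it as p grows
print(np.max(np.abs(x)))    # 4.0
print(lp_norm(x, 50))       # close to 4.0
```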

S3 Tensor-Based Optimization Method
Since the vector norm constraint f(x) is non-convex, our tensor method requires an optimization protocol that can deal with non-convex constraints. While the global optimum of a convex problem can be easily computed, the quality of the optimum discovered for a non-convex problem depends heavily on the numerical procedure. Standard numerical techniques such as gradient descent converge to a local minimum of the solution space, and different procedures often find different local minima. Given that our sparse constraint is non-convex, it is important to use a theoretically justified numerical procedure. We use an advanced optimization framework known as multi-stage convex relaxation (MSCR), which has good numerical properties for non-convex optimization problems [1]. In this framework, concave duality is used to construct a sequence of convex relaxations that give increasingly accurate approximations to the original non-convex problem. We approximate the sparse constraint function f(x) by the convex function f_v(x) = Σ_j v_j |x_j|^h; in practice, h = 2 is an effective choice for the convex upper bound of f(x). The vector v contains coefficients that are automatically generated during the optimization process. After each optimization, the new coefficient vector v yields a convex function f_v(x) that more closely approximates the original non-convex function f(x).

S3.1 Concave duality
Given a continuous regularization function f(x), which may be non-convex, we are interested in rewriting it using concave duality; for details, refer to [1]. Let h(x) : R^n → Ω ⊂ R^n be a vector function. It need not be a one-to-one map. However, we assume that there exists a function f_h(u) defined on Ω such that f(x) = f_h(h(x)) holds.
We assume that we can find h so that the function f_h(u) is a concave function of u on Ω. Under this assumption, we can rewrite the regularization function f(x), using concave duality (page 308 in [2]), as

    f(x) = min_{v ∈ R^n} [ v^T h(x) + f*_h(v) ],    (2)

where the function f*_h(v) given below is the concave dual of f_h(u):

    f*_h(v) = sup_{u ∈ Ω} [ f_h(u) − v^T u ].    (3)

Moreover, it is well known that the minimum on the right-hand side of Eq. (2) is attained at

    v̂ = ∇_u f_h(u) |_{u = h(x)}.    (4)

For the choice h(x) = (|x_1|^h, . . . , |x_n|^h)^T with h = 2 and f(x) = ||x||_p used below, this gives

    v̂_j = (1/2) (Σ_i |x_i|^p)^{1/p − 1} |x_j|^{p − 2}.    (5)
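A small numeric sanity check of this construction, with f(x) = ||x||_p, p = 0.8, and h(x) = x^2 chosen for illustration (so f_h(u) = (Σ_j u_j^{p/2})^{1/p}, which is concave on the positive orthant for p < 2). The tangent plane of f_h at u_0 = h(x_0), with slope v̂ = ∇f_h(u_0) as in Eq. (4), yields a convex quadratic in x that upper-bounds f everywhere and touches it at x_0:

```python
import numpy as np

p = 0.8

def f(x):
    """Non-convex regularizer f(x) = ||x||_p."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

def f_h(u):
    """f_h(u) = (sum_j u_j^(p/2))^(1/p), so that f(x) = f_h(x**2)."""
    return np.sum(u ** (p / 2.0)) ** (1.0 / p)

def grad_f_h(u):
    """Gradient of f_h; this is the closed-form v-hat of Eq. (4)."""
    s = np.sum(u ** (p / 2.0))
    return 0.5 * s ** (1.0 / p - 1.0) * u ** (p / 2.0 - 1.0)

x0 = np.array([0.5, 1.5, 2.0])
u0 = x0 ** 2
v_hat = grad_f_h(u0)
offset = f_h(u0) - v_hat @ u0  # plays the role of f*_h(v_hat) at the optimum

rng = np.random.default_rng(0)
for _ in range(100):
    x = rng.uniform(0.1, 3.0, size=3)
    relaxed = v_hat @ (x ** 2) + offset  # convex (quadratic) in x
    assert relaxed >= f(x) - 1e-9        # upper bound everywhere
assert abs(v_hat @ u0 + offset - f(x0)) < 1e-9  # tight at x0
print("relaxation upper-bounds f and is tight at x0")
```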

S3.2 Multi-stage convex relaxation
The solution of our tensor formulation is a stationary point of the following regularized optimization problem:

    max_{x,y} Σ_{i,j,k} a_ijk x_i x_j y_k − λ f(x) − µ g(y),    (6)

where λ > 0 and µ > 0 are Lagrange multipliers. Since f(x) is non-convex and g(y) is convex, we consider a numerical procedure for solving Eq. (6) with a convex loss and the non-convex regularizer f(x). Let h(x) = Σ_j h_j(x) be a convex relaxation of f(x) that dominates f(x) (for example, the smallest convex upper bound, i.e., the infimum over all convex upper bounds). A simple convex relaxation of Eq. (6) then becomes

    [x̂, ŷ] = arg max_{x,y} Σ_{i,j,k} a_ijk x_i x_j y_k − λ Σ_j h_j(x) − µ g(y).    (7)

It is possible that this simple relaxation yields a solution that is not close to the solution of Eq. (6). However, if h satisfies the condition of Subsection S3.1, then it is possible to write f(x) as Eq. (2). In this new representation, we can rewrite Eq. (6) as

    [x̂, ŷ] = arg max_{x,y} min_v [ Σ_{i,j,k} a_ijk x_i x_j y_k − λ (Σ_j v_j h_j(x) + f*_h(v)) − µ g(y) ].    (8)

This is clearly equivalent to Eq. (6) because of Eq. (2). If we can find a good approximation of v that improves upon the initial value v̂ = [1, . . . , 1]^T, then the above formulation leads to a refined problem in x that is a better relaxation than Eq. (7).

Figure S2. Multi-stage convex relaxation method for the tensor-based problem. Outputs: the exon membership vector x and the network membership vector y. Initialize v̂_j = 1, then repeat the following two steps (a stage) until convergence.
Our numerical procedure exploits the above fact, improving the estimate of v_j over the initial choice v_j = 1 in Eq. (8) with an iterative algorithm. This is done by repeatedly applying the following two steps: • First, optimize x and y with v fixed.
• Second, optimize v with x and y fixed. This problem has the closed-form solution given by Eq. (4). Figure S2 presents our two-step protocol for solving the regularized form of our problem. The procedure can be regarded as a generalization of concave-convex programming [3], which takes h(x) = x. By repeatedly refining the parameters in v, we obtain better and better convex relaxations, leading to a solution superior to that of the initial convex relaxation with v_j = 1. The initial values of x and y can be uniform, randomly chosen, or taken from prior knowledge. In practice, with an appropriate solver for Step 1, the time complexity of MSCR is often linear in the total number of edges in the tensor.
Let h(x) = (|x_1|^h, . . . , |x_n|^h)^T with h = 2. This function is practically effective as a small convex upper bound of f(x). The problem in Step 1 of Figure S2 can be approximately solved by the power method presented in Figure S3. Instead of tuning λ and µ explicitly, we use a gradient-projection scheme so that the constraints f(x) = 1 and g(y) = 1 are enforced after each iteration via normalization. The idea is to optimize x with y fixed and then normalize x so that it satisfies the constraint f(x) = 1; similarly, we optimize y with x fixed and then normalize y so that it satisfies the constraint g(y) = 1. The initial values x^(0) and y^(0) can be the all-ones vector 1 = [1, . . . , 1]^T. As discussed in Section S3.1, the solution of Step 2 is given in Eq. (5) when f(x) is relaxed to h(x).
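As a concrete illustration of this alternating scheme, the sketch below runs power-method-style updates with normalization on a toy tensor containing a planted heavy block. The update rules, the choice g(y) = ||y||_2, and the data are illustrative assumptions, not the exact solver of the paper:

```python
import numpy as np

def tensor_power_method(A, p=0.8, n_iter=50):
    """Alternating power-method sketch for max sum_ijk a_ijk x_i x_j y_k
    subject to ||x||_p = 1 and ||y||_2 = 1 (illustrative constraints).

    A: non-negative symmetric tensor of shape (n, n, m)."""
    n, _, m = A.shape
    x, y = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        # update y with x fixed: y_k ~ sum_ij a_ijk x_i x_j, then normalize
        y = np.einsum('ijk,i,j->k', A, x, x)
        y /= np.linalg.norm(y)
        # update x with y fixed: x_i ~ sum_jk a_ijk x_j y_k, then project to ||x||_p = 1
        x = np.einsum('ijk,j,k->i', A, x, y)
        x /= np.sum(np.abs(x) ** p) ** (1.0 / p)
    return x, y

# toy tensor: exons {0,1,2} are strongly co-spliced in networks {0,1}
n, m = 8, 4
A = np.full((n, n, m), 0.05)
A[np.ix_([0, 1, 2], [0, 1, 2], [0, 1])] = 1.0
A = (A + A.transpose(1, 0, 2)) / 2  # enforce symmetry
for i in range(n):
    A[i, i, :] = 0.0  # no self-loops
x, y = tensor_power_method(A)
print(np.argsort(-x)[:3], np.argsort(-y)[:2])  # indices of the planted block
```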
Figure S3. Power method for Step 1 of Figure S2. Repeat the following two updates until convergence: update y with x fixed and normalize y so that g(y) = 1; then update x with y fixed and normalize x so that f(x) = 1.

In total, 4 × 5 × 5 = 100 FHS patterns were generated. Each FHS pattern was then planted into a set of networks by randomly selecting K_1 exons and K_2 networks and replacing the corresponding edge weights with the FHS pattern's edge weights. These simulated networks and patterns take into account various factors that may affect performance, and we can evaluate the performance by counting the number of these predefined patterns hit by the method under each parameter combination. We ran our tensor method with different values of the parameters p and α on each set of networks and obtained the results shown in Table S1. In this table, we chose α = 0.2, whose hit counts are always at least those of the other α values given the same p; similarly, we chose p = 0.8, whose hit counts are always at least those of the other p values given the same α. They are highlighted by graying the first column and the last row.
the location of each remaining exon to determine whether it is contained in all the known transcripts belonging to the gene model. If an exon is not contained in all the transcripts, it is a cassette exon. We applied this method to all the UCSC genes and identified 22,310 cassette exons in the human genome. It is worth noting that some cassette exons are always contained in the same transcripts, which means they are always co-spliced as long as those transcripts are expressed. Since this trivial case dominates the heaviness calculation in co-splicing clusters and produces biased results, we combined the cassette exons that are always included in the same transcripts into a single cluster and used one cassette exon to represent the whole cluster.

S5.2 RNA-Seq Datasets Selection and Processing
From NCBI's Sequence Read Archive (SRA), we selected all human RNA-seq datasets containing at least six samples (the minimum for robust correlation estimation), for a total of 38 datasets. For each dataset, we used the TopHat tool [5] to map short reads to the hg19 reference genome and applied the transcript assembly tool Cufflinks [6] to estimate expression for all transcripts with known UCSC transcript annotations [4]. We calculated the inclusion rate of each exon in every sample as the ratio between its expression (i.e., the sum of FPKM over all transcripts that cover the exon) and the expression of its host gene (i.e., the sum of FPKM over all transcripts of the gene). It is worth noting that in RNA-seq experiments, a gene's expression is usually not precisely estimated at low FPKM, because the number of reads mapped to the gene is quite small. To work with reasonably accurate estimates of exon inclusion rates, as pointed out by [7], we calculated inclusion rates only for those exons whose host genes' expression was above the 80th percentile in at least 6 samples. Throughout this study, we only considered the genes containing cassette exons whose inclusion-rate profiles met the above criterion. This resulted in 16,024 exons covering 9,532 genes. The 38 datasets that met these criteria on January 30, 2011 were used for the analysis described herein.
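The inclusion-rate definition above can be sketched as follows; the FPKM values and isoform names are hypothetical:

```python
def inclusion_rate(transcript_fpkm, covering_transcripts):
    """Inclusion rate of an exon = (sum of FPKM over transcripts covering
    the exon) / (sum of FPKM over all transcripts of the host gene)."""
    gene_expr = sum(transcript_fpkm.values())
    exon_expr = sum(transcript_fpkm[t] for t in covering_transcripts)
    return exon_expr / gene_expr if gene_expr > 0 else float('nan')

# hypothetical gene with three isoforms; the cassette exon is covered by two
fpkm = {'isoform_1': 12.0, 'isoform_2': 6.0, 'isoform_3': 2.0}
rate = inclusion_rate(fpkm, ['isoform_1', 'isoform_2'])
print(rate)  # 18.0 / 20.0 = 0.9
```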

S5.3 Network Construction
For each RNA-seq dataset containing a set of samples, a weighted exon co-splicing network can be constructed, in which the nodes represent exons and the edge weights are correlations between the inclusion rates of pairs of exons. To determine the weights, we first compute the correlation between two exons as the leave-one-out Pearson correlation coefficient estimate [8]. The resulting correlation estimate is conservative and sensitive to similarities in the patterns, yet robust to single experimental outliers. To make the correlation estimates comparable across datasets, we then applied Fisher's z transform [9]. Given a correlation estimate r from n samples, Fisher's transformation score is calculated as

    z = (√(n − 3) / 2) ln((1 + r) / (1 − r)).

Note that the sample size n may differ between datasets, and even between exon pairs, due to missing values. Because in practice the distributions of z-scores may still vary from dataset to dataset, we standardized the z-scores to zero mean and unit variance in each dataset, following the normalization procedure introduced in [10,11]. The "normalized" correlations r̃ are then obtained by inverting the z-score with the same virtual sample size n = 10:

    r̃ = (e^{2z/√(n−3)} − 1) / (e^{2z/√(n−3)} + 1) = tanh(z / √(n − 3)).

Finally, the absolute value of r̃ is used as the edge weight of the network.
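A minimal sketch of this weighting scheme on hypothetical inclusion-rate profiles. The leave-one-out variant shown (keep the smallest-magnitude correlation over all single-sample deletions) is an assumed reading of [8], chosen because it is conservative and robust to one outlier:

```python
import numpy as np

def loo_pearson(u, v):
    """Conservative leave-one-out Pearson correlation: compute the Pearson
    correlation with each single sample removed and keep the value with the
    smallest magnitude (assumed variant of [8])."""
    n = len(u)
    best = None
    for i in range(n):
        mask = np.arange(n) != i
        r = np.corrcoef(u[mask], v[mask])[0, 1]
        if best is None or abs(r) < abs(best):
            best = r
    return best

def fisher_z(r, n):
    """Fisher transformation score z = (sqrt(n-3)/2) * ln((1+r)/(1-r))."""
    return (np.sqrt(n - 3) / 2.0) * np.log((1 + r) / (1 - r))

def invert_z(z, n=10):
    """Invert the z-score back to a correlation with virtual sample size n."""
    return np.tanh(z / np.sqrt(n - 3))

# two inclusion-rate profiles over 7 samples; the last sample is an outlier
u = np.array([0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70])
v = np.array([0.15, 0.18, 0.33, 0.38, 0.52, 0.60, 0.10])
r_loo = loo_pearson(u, v)
print(round(r_loo, 3))  # the conservative, small-magnitude estimate

# the z-transform round-trip recovers the correlation
print(invert_z(fisher_z(0.5, n=10), n=10))  # 0.5
```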

S5.4 Non-Uniform Sampling for Fast Computation
Even though our optimization method is efficient, its computation time can still be long for large sets of networks with many edges. In such cases, edge sampling can provide an efficient approximation to many graph problems [12,13]. From the perspective of matrix or tensor computation, such sampling methods can also be viewed as matrix/tensor sparsification [14]. As FSCs predominantly contain edges with large weights, we designed a non-uniform sampling method that preferentially selects edges with large weights. Specifically, given a tensor A, each edge a_ijk is sampled with probability p_ijk:

    p_ijk = 1 if a_ijk ≥ ã;  p_ijk = p̃ (a_ijk / ã)^b if a_ijk < ã,    (9)

where ã ∈ (0, 1), b ∈ [1, ∞) and p̃ ∈ (0, ã^{b−1}] are constants that control the number of sampled edges. Note that Eq. (9) always samples edges with weights ≥ ã; it selects an edge of weight a_ijk < ã with probability p_ijk proportional to the b-th power of the weight. We chose ã = 0.4, b = 3, and p̃ = 0.1 as a reasonable trade-off between computational efficiency and the quality of the sampled tensor, while satisfying the conditions of Theorem S5.1, i.e., p̃ ≤ ã^{b−1} and b ≥ 1.
To correct the bias introduced by this sampling, the weight of each sampled edge is scaled by the inverse of its sampling probability: â_ijk = a_ijk / p_ijk. The expected weight of each sampled edge, E(â_ijk), is therefore equal to the weight in the original network. In practice, however, when the adjusted edge weight â_ijk > ã (but the original edge weight a_ijk < ã), we truncated it to â_ijk = ã to avoid overly large edge weights. The overall edge-sampling procedure adopts the simple random-sampling-based single-pass sparsification procedure introduced in [14]; details are given below. This single-pass sampling procedure's time complexity is O(n^2 m). The sparsification procedure in [14] is a special case of our sampling procedure when all entries of A are non-negative, p̃ = 1 and b = 1 (with ã chosen as in [14]). After edge sampling, the procedure described above uses the corrected tensor Â = (â_ijk)_{n×n×m} instead of the original tensor A.
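A minimal sketch of this biased sampling with inverse-probability weighting, using the parameter values from the text; the random test tensor is illustrative, and the truncation at ã follows the description above:

```python
import numpy as np

def sparsify_tensor(A, p_tilde=0.1, a_tilde=0.4, b=3, seed=0):
    """Non-uniform edge sampling with bias correction.

    Edges with weight >= a_tilde are always kept; an edge of weight
    a < a_tilde is kept with probability p_tilde * (a / a_tilde)**b and,
    if kept, its weight is divided by that probability (then truncated
    at a_tilde so corrected small-edge weights cannot exceed it)."""
    assert b >= 1 and p_tilde <= a_tilde ** (b - 1)  # conditions of Theorem S5.1
    rng = np.random.default_rng(seed)
    p = np.where(A >= a_tilde, 1.0, p_tilde * (A / a_tilde) ** b)
    keep = rng.random(A.shape) < p
    A_hat = np.where(keep, A / np.maximum(p, 1e-300), 0.0)
    # truncate corrected weights of originally small edges at a_tilde
    A_hat = np.where((A < a_tilde) & (A_hat > a_tilde), a_tilde, A_hat)
    return A_hat

A = np.random.default_rng(1).random((50, 50, 5)) * 0.5  # weights in [0, 0.5)
A_hat = sparsify_tensor(A)
print(np.count_nonzero(A_hat), "of", A.size, "entries kept")
```

Heavy edges (weight at least ã = 0.4) survive with their original weights, while most light edges are dropped, which is what makes the sampled tensor sparse.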
Edge sampling procedure for the tensor A = (a_ijk)_{n×n×m}:

Procedure Sampling(A, p̃, ã, b)
for each i ∈ [1, n], j ∈ [1, n], k ∈ [1, m] do
    if a_ijk ≥ ã then â_ijk = a_ijk
    else, with probability p_ijk = p̃ (a_ijk / ã)^b, set â_ijk = a_ijk / p_ijk; otherwise set â_ijk = 0

Based on the well-known Chernoff-Hoeffding bounds [15], Theorem S5.1 bounds the number of non-zero entries in the corrected tensor Â after sampling. The computational complexity of the tensor MSCR algorithm on the sampled tensor Â is therefore linear in its number of non-zero entries.

Theorem S5.1. Suppose b ≥ 1 and p̃ ≤ ã^{b−1}. Then, with probability at least 1 − e^{−S/ã}, the number of non-zero entries of Â is at most (1 + e) S/ã, where S = Σ_{i,j,k} a_ijk.

Proof. This proof is similar to Lemma 1 in [14]. Since S = Σ_{i,j,k} a_ijk, the number of entries a_ijk that are no less than ã is at most S/ã; otherwise, the sum of the entries ≥ ã alone would already exceed S. Now consider all non-zero entries a_ijk that are smaller than ã.
The Chernoff bound [15] asserts that if X_1, X_2, . . . , X_N are indicator random variables and X = Σ_i X_i with E[X] ≤ µ, then for any δ > 0,

    Pr[X > (1 + δ)µ] < ( e^δ / (1 + δ)^{1+δ} )^µ.

In our case, we set up indicator random variables X_ijk that are 1 or 0 depending on whether â_ijk is non-zero or not. Then X = Σ_{i,j,k} X_ijk is the number of non-zero entries of Â among the entries with a_ijk < ã, and

    E[X] = Σ_{i,j,k} p_ijk = Σ_{i,j,k} p̃ (a_ijk / ã)^b.

Since p̃ ≤ ã^{b−1} and a_ijk^b ≤ a_ijk (because b ≥ 1 and 0 ≤ a_ijk ≤ 1), we have

    p̃ (a_ijk / ã)^b ≤ ã^{b−1} a_ijk^b / ã^b = a_ijk^b / ã ≤ a_ijk / ã.

Then we have E[X] ≤ Σ_{i,j,k} a_ijk / ã = S/ã. We use the Chernoff bound with µ = S/ã and (1 + δ) = e and arrive at

    Pr[X > e S/ã] < e^{−S/ã}.

Together with the at most S/ã entries of weight ≥ ã, the number of non-zero entries of Â is at most (1 + e) S/ã with probability at least 1 − e^{−S/ã}. So the claim holds.

S6 Descriptions of 38 RNA-Seq Datasets
See Table S2. The recovered entries (accession, number of samples, and description from the corresponding SRA record) are:

• (accession missing): We used a total of 251,933,381 short-read sequence tags generated from various types of transcriptome analyses in order to characterize 6039 iTSCs.
• SRP000626 (6 samples): We carried out the first analysis of alternative splicing complexity in human tissues using mRNA-Seq data. New splice junctions were detected in 20% of multiexon genes, many of which are tissue specific. By combining mRNA-Seq and EST-cDNA sequence data, we estimate that transcripts from 95% of multiexon genes undergo alternative splicing and that there are 100,000 intermediate- to high-abundance alternative splicing events in major human tissues. From a comparison with quantitative alternative splicing microarray profiling data, we also show that mRNA-Seq data provide reliable measurements for exon inclusion levels. 32-nucleotide sequence reads from six human tissues: brain, cerebral cortex, heart, liver, lung and skeletal muscle.
• (accession missing): Illumina RNA-Seq analysis to survey transcriptome profiles from total brain, frontal and temporal lobe of healthy and AD post-mortem tissue.
• SRP004903 (8 samples): Examination of exon/gene expression of liver and muscle in quadruplicates using both array technology and RNA-Seq.
• SRP005242 (21 samples): The second wave of next-generation sequencing technologies, referred to as single-molecule sequencing (SMS), carries the promise of profiling samples directly without the polymerase chain reaction steps used by amplification-based sequencing (AS) methods. To examine the merits of both technologies, we examine mRNA sequencing results from single-molecule and amplification-based sequencing in a set of human cancer cell lines and tissues.
• SRP005408 (31 samples): Using Illumina's Genome Analyzer, we profiled gene expression in postmortem hippocampus using RNA-seq. Comparison was performed among different groups of addicted human samples (Cocaine Overdose, Excited Delirium, Alcohol Abuse and Control).
S7 Signal extraction procedure of the Encyclopedia of DNA Elements (ENCODE) data