Structured sparse CCA for brain imaging genetics via graph OSCAR

Background: Structured sparse canonical correlation analysis (SCCA) has recently received increased attention in brain imaging genetics studies. It can identify bi-multivariate imaging genetic associations and select relevant features with the desired structure information. Existing SCCA methods either use the fused lasso regularizer to induce smoothness between ordered features, or use a signed pairwise difference that depends on the estimated sign of the sample correlation. Several other structured SCCA models use the group lasso or graph fused lasso to encourage group structure, but they require the structure/group information to be provided in advance, which is sometimes not available.
Results: We propose a new structured SCCA model, which employs the graph OSCAR (GOSCAR) regularizer to encourage highly correlated features to have similar or equal canonical weights. Our GOSCAR based SCCA has two advantages: 1) It does not require the sign of the sample correlation to be pre-defined, and thus could reduce the estimation bias. 2) It could pull highly correlated features together regardless of whether they are positively or negatively correlated. We evaluate our method using both synthetic data and real data. Using the 191 ROI measurements of amyloid imaging data, and 58 genetic markers within the APOE gene, our method identifies a strong association between APOE SNP rs429358 and the amyloid burden measure in the frontal region. In addition, the estimated canonical weights present a clear pattern, which is preferable for further investigation.
Conclusions: Our proposed method shows better or comparable performance on the synthetic data in terms of the estimated correlations and canonical loadings. It has successfully identified an important association between an Alzheimer's disease risk SNP rs429358 and the amyloid burden measure in the frontal region.


Background
In recent years, bi-multivariate analysis techniques [1], especially sparse canonical correlation analysis (SCCA) [2][3][4][5][6][7][8], have been widely used in brain imaging genetics studies. These methods are powerful in identifying bi-multivariate associations between genetic biomarkers, e.g., single nucleotide polymorphisms (SNPs), and imaging factors such as quantitative traits (QTs).
*Correspondence: shenli@iu.edu. School of Medicine, Indiana University, Indianapolis, USA. Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf
Witten et al. [3,9] first employed the penalized matrix decomposition (PMD) technique to handle the SCCA problem, which had a closed form solution. This SCCA imposed the $\ell_1$-norm on the traditional CCA model to induce sparsity. Since the $\ell_1$-norm tends to select only one of a set of correlated features, essentially at random, it performed poorly in finding the structure information that usually exists in biological data. Witten et al. [3,9] also implemented the fused lasso based SCCA, which penalized the differences between adjacent features in a given order. This SCCA could capture some structure information, but it demanded that the features be ordered. As a result, many structured SCCA approaches arose. Lin et al. [7] imposed the group lasso regularizer on the SCCA model, which could make use of non-overlapping group information. Chen et al. [10] proposed a structure-constrained SCCA (ssCCA), which used a graph-guided fused $\ell_2$-norm penalty for one canonical loading according to the features' biological relationships. Du et al. [8] proposed a structure-aware SCCA (S2CCA) to identify group-level bi-multivariate associations, which combined both the covariance matrix information and the prior group information via the group lasso regularizer. On one hand, these structured SCCA methods can generate a good result when the prior knowledge fits the hidden structure within the data well. On the other hand, they become inapplicable when the prior knowledge is incomplete or not available. Moreover, it is hard to capture the prior knowledge precisely in real world biomedical studies.
To facilitate structural learning via grouping the weights of highly correlated features, graph theory has been widely utilized in sparse regression analysis [11][12][13]. More recently, graph theory has also been employed to address the grouping issue in SCCA. Let each graph vertex correspond one-to-one to a feature, and let $\rho_{ij}$ be the sample correlation between features $i$ and $j$. Chen et al. [4,5] proposed a network-structured SCCA (NS-SCCA), which used the $\ell_1$-norm of $|\rho_{ij}|(u_i - \mathrm{sign}(\rho_{ij})u_j)$ to pull positively correlated features together and to fuse negatively correlated features in opposite directions. The knowledge-guided SCCA (KG-SCCA) [14] was an extension of both NS-SCCA [4,5] and S2CCA [8]. It used the $\ell_2$-norm of $\rho_{ij}^2(u_i - \mathrm{sign}(\rho_{ij})u_j)$ for one canonical loading, similar to what Chen proposed, and employed the $\ell_{2,1}$-norm penalty for the other canonical loading. Both NS-SCCA and KG-SCCA could be used as group-pursuit methods if the prior knowledge was not available. However, one limitation of both models is that they depend on the sign of the pairwise sample correlation to recover the structure pattern. This may incur undesirable bias, since the sign of the correlations could be wrongly estimated due to possible graph misspecification caused by noise [13].
To address the issues above, we propose a novel structured SCCA which requires neither prior knowledge nor the sign of the sample correlations to be specified. It also works well if prior knowledge is provided. GOSC-SCCA, named from Graph Octagonal Selection and Clustering algorithm for Sparse Canonical Correlation Analysis, is inspired by the outstanding feature grouping ability of the octagonal selection and clustering algorithm for regression (OSCAR) [11] regularizer and the graph OSCAR (GOSCAR) [13] regularizer in regression tasks. Our contributions can be summarized as follows. 1) GOSC-SCCA could pull highly correlated features together when no prior knowledge is provided. While positively correlated features are encouraged to have similar weights, negatively correlated ones are encouraged to have weights of similar magnitude but with opposite signs. 2) GOSC-SCCA could reduce the estimation bias, since it does not require the sign of the sample correlation to be specified. 3) We provide a theoretical quantitative description of the grouping effect of GOSC-SCCA. We use both synthetic data and real imaging genetic data to evaluate GOSC-SCCA. The experimental results show that our method is better than or comparable to state-of-the-art methods, i.e., L1-SCCA, FL-SCCA [3] and KG-SCCA [14], in identifying stronger imaging genetic correlations and more accurate and cleaner canonical loading patterns. Note that the PMA software package was used to implement the L1-SCCA (SCCA with lasso penalty) and FL-SCCA (SCCA with fused lasso penalty) methods. Please refer to http://cran.r-project.org/web/packages/PMA/ for more details.

Methods
We denote a vector as a boldface lowercase letter, and a matrix as a boldface uppercase letter. $m_i$ indicates the $i$-th row of matrix $M = (m_{ij})$. Matrices $X = \{x_1; \ldots; x_n\} \subseteq \mathbb{R}^p$ and $Y = \{y_1; \ldots; y_n\} \subseteq \mathbb{R}^q$ denote two separate datasets collected from the same population. Imposing the lasso on the traditional CCA model [15], the L1-SCCA model is formulated as follows [3,9]:
$$\min_{u,v} \; -u^\top X^\top Y v \quad \text{s.t.} \quad \|u\|_2^2 = 1, \; \|v\|_2^2 = 1, \; \|u\|_1 \le c_1, \; \|v\|_1 \le c_2, \tag{1}$$
where $\|u\|_1 \le c_1$ and $\|v\|_1 \le c_2$ are sparsity penalties controlling the complexity of the SCCA model. The fused lasso [2][3][4]9] can also be used instead of the lasso. In order to make the problem convex, the equality constraints are usually relaxed to inequalities, i.e. $\|u\|_2^2 \le 1$, $\|v\|_2^2 \le 1$ [3].
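To make the role of the bilinear objective concrete, the following minimal sketch checks numerically that $u^\top X^\top Y v$ coincides with the sample correlation between the canonical variables $Xu$ and $Yv$ once the columns of $X$ and $Y$ are centered and the canonical variables are rescaled to unit norm. The random data, the sizes and the helper name `canonical_correlation` are illustrative assumptions and are not taken from the study.

```python
import numpy as np

def canonical_correlation(X, Y, u, v):
    """Pearson correlation between the canonical variables Xu and Yv."""
    a, b = X @ u, Y @ v
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative random data (assumption): n samples, p and q features.
rng = np.random.default_rng(0)
n, p, q = 80, 100, 120
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))

# Center the columns, as is usual before CCA.
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

# Arbitrary weight vectors, rescaled so that ||Xu||_2 = ||Yv||_2 = 1.
u = rng.standard_normal(p)
v = rng.standard_normal(q)
u_s = u / np.linalg.norm(X @ u)
v_s = v / np.linalg.norm(Y @ v)

objective = float(u_s @ X.T @ Y @ v_s)       # bilinear CCA objective
corr = canonical_correlation(X, Y, u, v)     # correlation of Xu and Yv
```

Because correlation is invariant to positive rescaling, `corr` can be computed from the unscaled weights, while `objective` needs the rescaled ones.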

The graph OSCAR regularization
The OSCAR regularizer was first introduced by Bondell et al. [11], and it has been shown to group features automatically by encouraging highly correlated features to have similar weights. Formally, the OSCAR penalty is defined as follows,
$$\Omega_{\mathrm{OSCAR}}(u) = \sum_{i<j} \max\{|u_i|, |u_j|\}, \qquad \Omega_{\mathrm{OSCAR}}(v) = \sum_{i<j} \max\{|v_i|, |v_j|\}. \tag{2}$$
Note that this penalty is applied to each feature pair.
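To make OSCAR more flexible, Yang et al. [13] introduce the GOSCAR,
$$\Omega_{\mathrm{GOSCAR}}(u) = \sum_{(i,j) \in E_u} \max\{|u_i|, |u_j|\}, \qquad \Omega_{\mathrm{GOSCAR}}(v) = \sum_{(i,j) \in E_v} \max\{|v_i|, |v_j|\}, \tag{3}$$
where $E_u$ and $E_v$ are the edge sets of the $u$-related and $v$-related graphs, respectively. The GOSCAR reduces to OSCAR when both graphs are complete [13].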
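Applying $\max\{|u_i|, |u_j|\} = \frac{1}{2}(|u_i - u_j| + |u_i + u_j|)$, the GOSCAR regularizer takes the following form,
$$\Omega_{\mathrm{GOSCAR}}(u) = \frac{1}{2} \sum_{(i,j) \in E_u} \left(|u_i - u_j| + |u_i + u_j|\right), \qquad \Omega_{\mathrm{GOSCAR}}(v) = \frac{1}{2} \sum_{(i,j) \in E_v} \left(|v_i - v_j| + |v_i + v_j|\right). \tag{4}$$

The GOSC-SCCA model
Since the grouping effect is also an important consideration in SCCA learning, we propose to extend L1-SCCA to GOSC-SCCA by imposing GOSCAR in addition to the $\ell_1$ penalty, as follows,
$$\min_{u,v} \; -u^\top X^\top Y v \quad \text{s.t.} \quad \|Xu\|_2^2 \le 1, \; \|Yv\|_2^2 \le 1, \; \Omega_{\mathrm{GOSCAR}}(u) \le c_1, \; \Omega_{\mathrm{GOSCAR}}(v) \le c_2, \; \|u\|_1 \le c_3, \; \|v\|_1 \le c_4, \tag{5}$$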
where $(c_1, c_2, c_3, c_4)$ are parameters that control the solution path of the canonical loadings. Since S2CCA [8] has shown that the covariance matrix information can help improve the prediction ability, we also use $\|Xu\|_2^2 \le 1$ and $\|Yv\|_2^2 \le 1$ rather than $\|u\|_2^2 \le 1$ and $\|v\|_2^2 \le 1$. As a structured sparse model, GOSC-SCCA will encourage $u_i \doteq u_j$ if the $i$-th feature and the $j$-th feature are highly correlated. We will give a quantitative description of this later.
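The GOSCAR penalty can be evaluated directly from an edge list. The sketch below computes it in two ways, as a sum of pairwise maxima and through the identity $\max\{|a|, |b|\} = \frac{1}{2}(|a - b| + |a + b|)$, and checks that on a complete graph it coincides with the pairwise-max OSCAR term. The weight vector and edge sets are small hypothetical examples chosen for illustration only.

```python
import itertools
import numpy as np

def goscar_penalty(u, edges):
    """Sum of max(|u_i|, |u_j|) over the edges (i, j) of the graph."""
    return sum(max(abs(u[i]), abs(u[j])) for i, j in edges)

def goscar_penalty_split(u, edges):
    """Same penalty written with max{|a|,|b|} = (|a-b| + |a+b|) / 2."""
    return 0.5 * sum(abs(u[i] - u[j]) + abs(u[i] + u[j]) for i, j in edges)

def oscar_pairwise(u):
    """Pairwise-max term taken over all feature pairs i < j."""
    p = len(u)
    return sum(max(abs(u[i]), abs(u[j]))
               for i in range(p) for j in range(i + 1, p))

# Hypothetical weights: features 0 and 1 have equal magnitude, opposite sign.
u = np.array([0.5, -0.5, 0.0, 0.2])
sparse_edges = [(0, 1), (2, 3)]
complete_edges = list(itertools.combinations(range(len(u)), 2))

pen_sparse = goscar_penalty(u, sparse_edges)
pen_split = goscar_penalty_split(u, sparse_edges)
pen_complete = goscar_penalty(u, complete_edges)
pen_oscar = oscar_pairwise(u)
```

For the pair (0, 1) the term $|u_i + u_j|$ vanishes, which illustrates why weights of equal magnitude and opposite sign are not penalized more than identical weights.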

The proposed algorithm
We can rewrite the objective function in an unconstrained form via the Lagrange multiplier method, which gives Eq. (6).
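Written out under the assumption that each constraint of the GOSC-SCCA model receives one multiplier ($\lambda_1, \lambda_2$ for the GOSCAR terms, $\beta_1, \beta_2$ for the $\ell_1$ terms and $\gamma_1, \gamma_2$ for the variance constraints, matching the symbols used in the proofs and in the parameter tuning), and with additive constants dropped, the unconstrained objective would take a form such as the following. The exact constant factors are an assumption and may differ from the authors' convention.

```latex
\mathcal{L}(u, v) = -u^\top X^\top Y v
  + \lambda_1 \Omega_{\mathrm{GOSCAR}}(u) + \lambda_2 \Omega_{\mathrm{GOSCAR}}(v)
  + \beta_1 \|u\|_1 + \beta_2 \|v\|_1
  + \gamma_1 \|Xu\|_2^2 + \gamma_2 \|Yv\|_2^2
```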
Taking the derivative with respect to $u$ and $v$ respectively, and setting them to zero, we obtain Eqs. (7)-(8), where $\Lambda_1$ is a diagonal matrix with the $k_1$-th element $\frac{1}{2\|u_{k_1}\|_1}$ ($k_1 \in [1, p]$), and $\Lambda_2$ is a diagonal matrix with the $k_2$-th element $\frac{1}{2\|v_{k_2}\|_1}$ ($k_2 \in [1, q]$); $L_1$ is the Laplacian matrix obtained from $L_1 = D_1 - W_1$; $\hat{L}_1$ is the matrix obtained from $\hat{L}_1 = \hat{D}_1 + \hat{W}_1$. $L_2$ and $\hat{L}_2$ are defined in the same way as $L_1$ and $\hat{L}_1$, respectively, but based on $v$.
In the initialization, $W_1$ and $\hat{W}_1$ are identical, with every off-diagonal element equal to $\frac{1}{2}$. But $W_1$ and $\hat{W}_1$ become different after each iteration, i.e., $w_{ij} = \frac{1}{2\|u_i - u_j\|_1}$ and $\hat{w}_{ij} = \frac{1}{2\|u_i + u_j\|_1}$. If $\|u_i - u_j\|_1 = 0$, the corresponding element in matrix $W_1$ does not exist. So we regularize it as $\frac{1}{2\sqrt{\|u_i - u_j\|_1^2 + \zeta}}$, where $\zeta$ is a very small positive number. We also approximate $\|u_i\|_1 = 0$ with $\sqrt{\|u_i\|_1^2 + \zeta}$ for $\Lambda_1$. The resulting objective function regarding $u$ is denoted $L^*(u)$. It is easy to prove that $L^*(u)$ reduces to problem (6) regarding $u$ when $\zeta \to 0$. The cases of $\|v_i\|_1 = 0$ and $\|v_i - v_j\|_1 = 0$ can be addressed using a similar regularization method. $D_1$ is a diagonal matrix whose $i$-th diagonal element is obtained by summing the $i$-th row of $W_1$, i.e. $d_i = \sum_j w_{ij}$.
The diagonal matrix $\hat{D}_1$ is obtained likewise from $\hat{d}_i = \sum_j \hat{w}_{ij}$.
Likewise, we can calculate $W_2$, $\hat{W}_2$, $D_2$ and $\hat{D}_2$ by the same method in terms of $v$.
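The construction of these intermediate matrices can be sketched as follows. The sketch assumes the reweighting entries $w_{ij} = 1/(2\sqrt{(u_i - u_j)^2 + \zeta})$ and $\hat{w}_{ij} = 1/(2\sqrt{(u_i + u_j)^2 + \zeta})$ on the graph edges, which is the usual choice in iteratively reweighted schemes and is taken here as an assumption; the function name `reweighting_matrices`, the symbol `Lam` for the diagonal matrix and the toy inputs are hypothetical. It then checks that the quadratic forms recover the GOSCAR penalty and half of the $\ell_1$-norm when $\zeta$ is small.

```python
import numpy as np

def reweighting_matrices(u, edges, zeta=1e-12):
    """Build L = D - W, L_hat = D_hat + W_hat and the diagonal matrix Lam."""
    p = len(u)
    W = np.zeros((p, p))
    W_hat = np.zeros((p, p))
    for i, j in edges:
        # Assumed reweighting entries, smoothed by zeta to avoid division by 0.
        W[i, j] = W[j, i] = 1.0 / (2.0 * np.sqrt((u[i] - u[j]) ** 2 + zeta))
        W_hat[i, j] = W_hat[j, i] = 1.0 / (2.0 * np.sqrt((u[i] + u[j]) ** 2 + zeta))
    D = np.diag(W.sum(axis=1))          # d_i = sum_j w_ij
    D_hat = np.diag(W_hat.sum(axis=1))  # d_hat_i = sum_j w_hat_ij
    L = D - W                           # graph Laplacian
    L_hat = D_hat + W_hat               # signless counterpart
    Lam = np.diag(1.0 / (2.0 * np.sqrt(u ** 2 + zeta)))
    return L, L_hat, Lam

u = np.array([0.5, -0.5, 0.0, 0.2])
edges = [(0, 1), (2, 3)]
L, L_hat, Lam = reweighting_matrices(u, edges)

# u^T (L + L_hat) u approximates (1/2) * sum over edges of |u_i-u_j| + |u_i+u_j|.
quad_goscar = float(u @ (L + L_hat) @ u)
goscar = 0.5 * sum(abs(u[i] - u[j]) + abs(u[i] + u[j]) for i, j in edges)

# u^T Lam u approximates (1/2) * ||u||_1.
quad_l1 = float(u @ Lam @ u)
half_l1 = 0.5 * float(np.abs(u).sum())
```

This is what allows the non-smooth penalties to be handled through linear systems: with the matrices held fixed, the penalized objective is quadratic in $u$.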
Then, according to Eqs. (7)-(8), we can obtain the solutions to our problem with respect to $u$ and $v$ separately, given in Eqs. (10)-(11). We observe that $L_1$, $\hat{L}_1$ and $\Lambda_1$ depend on $u$, which is unknown, and that $L_2$, $\hat{L}_2$ and $\Lambda_2$ depend on $v$, which is also unknown. Thus we propose an effective iterative algorithm to solve this problem: we first fix $v$ to solve for $u$, and then fix $u$ to solve for $v$.
Algorithm 1 shows the pseudo code of the proposed GOSC-SCCA algorithm. For the key calculation steps, i.e., Step 5 and Step 10, we solve a system of linear equations with quadratic complexity rather than computing the matrix inverse with cubic complexity. Thus the whole algorithm can work with the desired efficiency. In addition, the algorithm is guaranteed to converge, and we will prove this in the next subsection.
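The alternating scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the update is assumed to solve $(\lambda_1 (L_1 + \hat{L}_1) + \beta_1 \Lambda_1 + \gamma_1 X^\top X)\, u = X^\top Y v$ followed by a rescaling to $\|Xu\|_2 = 1$, the reweighting entries are the assumed ones, and all function names, default parameter values, the random initialization and the toy data are hypothetical.

```python
import numpy as np

def _weights(w, edges, zeta):
    """Reweighting matrices for one weight vector (assumed forms)."""
    k = len(w)
    W, W_hat = np.zeros((k, k)), np.zeros((k, k))
    for i, j in edges:
        W[i, j] = W[j, i] = 0.5 / np.sqrt((w[i] - w[j]) ** 2 + zeta)
        W_hat[i, j] = W_hat[j, i] = 0.5 / np.sqrt((w[i] + w[j]) ** 2 + zeta)
    L = np.diag(W.sum(axis=1)) - W
    L_hat = np.diag(W_hat.sum(axis=1)) + W_hat
    Lam = np.diag(0.5 / np.sqrt(w ** 2 + zeta))
    return L + L_hat, Lam

def _update(A, B, other, w, edges, lam, beta, gamma, zeta):
    """Solve the linear system for one side, then rescale so ||A w||_2 = 1."""
    G, Lam = _weights(w, edges, zeta)
    M = lam * G + beta * Lam + gamma * (A.T @ A)
    # A linear solve is used instead of forming the inverse of M explicitly.
    w_new = np.linalg.solve(M, A.T @ (B @ other))
    return w_new / np.linalg.norm(A @ w_new)

def gosc_scca_sketch(X, Y, edges_u, edges_v, lam=(0.1, 0.1), beta=(0.1, 0.1),
                     gamma=(1.0, 1.0), n_iter=50, tol=1e-6, zeta=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(X.shape[1])
    v = rng.standard_normal(Y.shape[1])
    for _ in range(n_iter):
        u_old, v_old = u, v
        u = _update(X, Y, v, u, edges_u, lam[0], beta[0], gamma[0], zeta)  # fix v
        v = _update(Y, X, u, v, edges_v, lam[1], beta[1], gamma[1], zeta)  # fix u
        if max(np.abs(u - u_old).max(), np.abs(v - v_old).max()) < tol:
            break
    return u, v

# Toy data (assumption): a shared latent signal in the first few features.
rng = np.random.default_rng(1)
n, p, q = 50, 12, 10
z = rng.standard_normal(n)
X = rng.standard_normal((n, p)); X[:, :4] += z[:, None]
Y = rng.standard_normal((n, q)); Y[:, :3] += z[:, None]
X -= X.mean(axis=0); Y -= Y.mean(axis=0)
edges_u = [(i, j) for i in range(p) for j in range(i + 1, p)]  # complete graph
edges_v = [(i, j) for i in range(q) for j in range(i + 1, q)]

u, v = gosc_scca_sketch(X, Y, edges_u, edges_v)
objective = float(u @ X.T @ Y @ v)  # correlation of Xu and Yv after rescaling
```

Using complete graphs here mirrors the group-pursuit setting in which no prior structure is supplied; a sparser edge list can be passed when prior knowledge is available.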

Convergence analysis
We first introduce the following lemma.
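A standard inequality of this kind, used in convergence proofs for iteratively reweighted $\ell_1$ schemes and assumed here to be the one the proof relies on, can be stated and verified in a few lines. The symbols $a$ and $\tilde{a}$ are generic scalars standing for a quantity before and after an update.

```latex
\text{For any real } \tilde{a} \text{ and any nonzero real } a: \qquad
|\tilde{a}| - \frac{\tilde{a}^2}{2|a|} \;\le\; |a| - \frac{a^2}{2|a|}.

\text{Proof sketch: } (|\tilde{a}| - |a|)^2 \ge 0
\;\Rightarrow\; \tilde{a}^2 - 2|\tilde{a}||a| + a^2 \ge 0
\;\Rightarrow\; |\tilde{a}| - \frac{\tilde{a}^2}{2|a|} \le \frac{|a|}{2}
= |a| - \frac{a^2}{2|a|}.
```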

Theorem 1
The objective function value of GOSC-SCCA will monotonically decrease in each iteration till the algorithm converges.
Proof The proof consists of two parts.
(1) Part 1: From Step 3 to Step 7 in Algorithm 1, $u$ is the only unknown variable to be solved. The objective function (6) can be equivalently transformed into a function of $u$ alone, and according to Step 5 we have Eq. (15). We first multiply both sides of Eq. (13) by $2\lambda_1$ for each feature pair separately, and do the same for Eq. (14). After that, we multiply both sides of Eq. (12) by $\beta_1$. Finally, by adding all these inequalities to both sides of Eq. (15) accordingly, we arrive at Eq. (16). Therefore, GOSC-SCCA decreases the objective function in each iteration, i.e., $L(\tilde{u}, v) \le L(u, v)$.
(2) Part 2: From Step 8 to Step 12, the only unknown variable is $v$. Similarly, we can arrive at Eq. (17). Thus GOSC-SCCA also decreases the objective function in each iteration during the second phase, i.e., $L(\tilde{u}, \tilde{v}) \le L(\tilde{u}, v)$. Based on the analysis above, we have $L(\tilde{u}, \tilde{v}) \le L(u, v)$ by the transitive property of inequalities. Therefore, the objective value monotonically decreases in each iteration. Note that the CCA objective $\frac{u^\top X^\top Y v}{\sqrt{u^\top X^\top X u}\,\sqrt{v^\top Y^\top Y v}}$ lies in $[-1, 1]$, and both $u^\top X^\top X u$ and $v^\top Y^\top Y v$ are bounded by 1. Thus $-u^\top X^\top Y v$ is lower bounded by $-1$, and so Eq. (6) is lower bounded by $-1$. In addition, Eqs. (16)-(17) imply that the KKT condition is satisfied. Therefore, the GOSC-SCCA algorithm will converge to a local optimum.

The grouping effect of GOSC-SCCA
For structured sparse learning in high-dimensional settings, the automatic feature grouping property is of great importance [18]. In regression analysis, Zou and Hastie [18] have suggested that a regressor exhibits the grouping effect when it sets the regression coefficients of the same group to similar weights. This is also the case for structured SCCA methods. It is therefore important and meaningful to investigate the theoretical bound of the grouping effect.
We have the following theorem in terms of GOSC-SCCA.
Theorem 2 Let $X$ and $Y$ be two data sets, and $(\lambda, \beta, \gamma)$ be the pre-tuned parameters. Let $\tilde{u}$ be the solution to our SCCA problem of Eqs. (10)-(11). Suppose the $i$-th feature and the $j$-th feature only link to each other on the graph, $\tilde{u}_i$ and $\tilde{u}_j$ are their optimal solutions, and $\mathrm{sgn}(\tilde{u}_i) = \mathrm{sgn}(\tilde{u}_j)$ holds. Then $\tilde{u}_i$ and $\tilde{u}_j$ satisfy an upper bound on their pairwise difference, where $\rho_{ij}$ is the sample correlation between features $i$ and $j$, and $w_{ij}$ is the corresponding element in the $u$-related matrix $W_1$.
Proof Let $\tilde{u}$ be the solution to our problem Eq. (6). We obtain two equations after taking the partial derivative with respect to $\tilde{u}_i$ and $\tilde{u}_j$, respectively.
We know that features $i$ and $j$ are only linked to each other, thus $D_{ii} = D_{jj} = A_{ij} = w_{ij}$ for those intermediate matrices. Besides, we also know that $\mathrm{sgn}(\tilde{u}_i) = \mathrm{sgn}(\tilde{u}_j)$. Then, according to the definitions of $L_1$, $\hat{L}_1$ and $\Lambda_1$, we can simplify both equations. Subtracting these two equations, we obtain Eq. (20). Then we take the $\ell_2$-norm on both sides of Eq. (20), apply the triangle inequality, and use the equality $\|x_i - x_j\|_2^2 = 2(1 - \rho_{ij})$ for standardized features. Since our problem implies $\|Yv\|_2^2 \le 1$, we arrive at the bound stated in the theorem. An analogous upper bound for the canonical loading $v$ can be obtained in the same way,
where $\rho_{ij}$ is the sample correlation between the $i$-th and $j$-th features associated with $v$, and $w_{ij}$ is the corresponding element in the $v$-related matrix $W_2$. Theorem 2 provides a theoretical upper bound for the difference between the estimated coefficients of the $i$-th feature and the $j$-th feature. This bound may not appear tight. It is deliberately slack, however, in that it places little restriction on the pairwise difference of features $i$ and $j$ when $\rho_{ij} \ll 1$. This is desirable for two irrelevant features [19]. If two features have a very small correlation, i.e. $\rho_{ij} \approx 0$, their coefficients do not need to be the same or similar, so we do not impose a tight bound on their pairwise difference. This quantitative description of the grouping effect makes the GOSCAR penalty an ideal choice for structured SCCA.

Results
We compare GOSC-SCCA with several state-of-the-art SCCA and structured SCCA methods, including L1-SCCA [3], FL-SCCA [3] and KG-SCCA [14]. We do not compare GOSC-SCCA with S2CCA [8], ssCCA [10] and CCA-SG (CCA Sparse Group) [7], since they require prior knowledge to be available in advance. We do not choose NS-SCCA [5] as a benchmark either, for the following two reasons. (1) NS-SCCA generates many intermediate variables during its iterative procedure. As the authors stated, NS-SCCA's per-iteration complexity is linear in $(p + |E|)$, and thus the complexity becomes $O(p^2)$ when it is in the group pursuit mode. (2) Its penalty term is similar to that of KG-SCCA, which has been selected for comparison.
There are six parameters to be decided before using GOSC-SCCA, so tuning them blindly would take too much time. We tune the parameters following two principles. On one hand, Chen and Liu [5] found that the result is not very sensitive to $\gamma_1$ and $\gamma_2$, so we choose them from the small set $\{0.1, 1, 10\}$. On the other hand, if the parameters are too small, SCCA reduces to CCA because the penalties have only a subtle influence, while parameters that are too large over-penalize the results. Therefore, we tune the rest of the parameters within the range $\{10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}\}$. In this study, we conduct all the experiments using the nested 5-fold cross-validation strategy, and the parameters are tuned on the training set only. In order to save time, we only tune these parameters on the first run of the cross-validation, that is, when the first four folds are used as the training set. Then we directly use the tuned parameters for all the remaining experiments. All methods use the same partition for cross-validation in the experiment.
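The size of the search space and the fold layout described above can be illustrated with a short sketch. The assignment of $\gamma_1, \gamma_2$ to the small set and of the remaining four parameters to the logarithmic grid follows the text; the random permutation, the seed and the helper name `five_fold_indices` are illustrative assumptions, and the sample size 282 is the number of participants retained in the real data experiment.

```python
import itertools
import numpy as np

gamma_grid = [0.1, 1, 10]                         # candidates for gamma_1, gamma_2
penalty_grid = [10.0 ** k for k in range(-3, 4)]  # 1e-3 ... 1e3 for the other four

# Exhaustive grid over (gamma_1, gamma_2, lambda_1, lambda_2, beta_1, beta_2).
grid = list(itertools.product(gamma_grid, gamma_grid,
                              penalty_grid, penalty_grid,
                              penalty_grid, penalty_grid))
n_combinations = len(grid)

def five_fold_indices(n, seed=0):
    """Split sample indices 0..n-1 into five roughly equal random folds."""
    perm = np.random.default_rng(seed).permutation(n)
    return np.array_split(perm, 5)

folds = five_fold_indices(282)

# First cross-validation run: the first four folds form the training set,
# which is the only run on which the parameters are tuned.
train_first_run = np.concatenate(folds[:4])
test_first_run = folds[4]
```

With more than twenty thousand candidate settings, tuning once on the first run and reusing the result keeps the cost of the remaining runs to a single fit each.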

Evaluation on synthetic data
We generate four synthetic datasets to investigate the performance of GOSC-SCCA and the benchmarks. Following [4,5], these datasets are generated in four steps: 1) We predefine the structures and use them to create $u$ and $v$ respectively. 2) We create a latent vector $z$ from $N(0, I_{n \times n})$. 3) We create $X$ from $u$ and $z$. 4) We create $Y$ from $v$ and $z$ in the same way. For the first group of nonzero features in $u$, we change half of their signs, and also change the signs of the corresponding data. Since the synthetic datasets are order-independent, this setup is equivalent to randomly changing the signs of a portion of the features in $u$. Because we change the signs of the coefficients and of the data simultaneously, we still have $X'u' = Xu$, where $X'$ and $u'$ denote the data and coefficients after the sign swap. We do the same on the $Y$ side to make our simulation more challenging [13]. In addition, we set all four datasets with $n = 80$, $p = 100$ and $q = 120$. They also have different correlation coefficients and different group structures. Therefore, the simulation is designed to cover a set of diverse cases for a fair comparison.
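The sign swap can be checked numerically. In the sketch below, the way $X$ is generated (a rank-one signal $z u^\top$ plus Gaussian noise), the noise level, the group of 20 nonzero features and the seed are assumptions made for illustration, since only the sizes $n = 80$ and $p = 100$ are taken from the text. The check itself, that flipping the signs of some coefficients together with the corresponding data columns leaves $Xu$ unchanged, does not depend on these choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 80, 100

# Hypothetical ground truth: one group of 20 nonzero features.
u = np.zeros(p)
u[:20] = 1.0

# Assumed generative model: latent vector z, rank-one signal plus noise.
z = rng.standard_normal(n)
X = np.outer(z, u) + 0.5 * rng.standard_normal((n, p))

# Flip the sign of half of the first group, in both coefficients and data.
flip = np.ones(p)
flip[:10] = -1.0
u_swapped = u * flip
X_swapped = X * flip          # broadcasting flips the selected columns

# The canonical variable is unchanged by the swap.
same_projection = np.allclose(X_swapped @ u_swapped, X @ u)

# A flipped and an unflipped feature of the same group become negatively
# correlated, which is the case a sign-dependent penalty has to get right.
corr_before = np.corrcoef(X[:, 0], X[:, 15])[0, 1]
corr_after = np.corrcoef(X_swapped[:, 0], X_swapped[:, 15])[0, 1]
```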
The estimated correlation coefficients of each method on the four datasets are contained in Table 1. The best values and those that are not significantly worse than the best values are shown in bold. On the training results, we observe that GOSC-SCCA either estimates the largest correlation coefficients (Dataset 1 and Dataset 4), or is not significantly worse than the best method (Dataset 2 and Dataset 3). GOSC-SCCA also has the best average correlation coefficients. On the testing results, GOSC-SCCA also outperforms the benchmarks in terms of the average correlation coefficients, though KG-SCCA does not perform significantly worse than our method. For the overall average obtained across the four datasets, GOSC-SCCA obtains better correlation coefficients than the competing methods on both the training set and the testing set. Figure 1 shows the estimated canonical loadings of all four SCCA methods in a typical run. As we can see, L1-SCCA cannot accurately recover the true signals, and it fails to recognize the coefficients whose signs were swapped. FL-SCCA slightly improves on L1-SCCA's performance but cannot identify the coefficients with changed signs either. Our GOSC-SCCA successfully groups the nonzero features together, and accurately recognizes the coefficients whose signs were changed. No matter what structures are within the dataset, GOSC-SCCA is able to estimate signals that are very close to the ground truth. Although KG-SCCA also recognizes the coefficients with swapped signs, it is unable to recover every group of nonzero coefficients. For example, KG-SCCA misses two groups of nonzero features in terms of $v$ for the second dataset. The results on the synthetic datasets reveal that GOSC-SCCA not only estimates stronger correlation coefficients than the competing methods, but also identifies more accurate and cleaner canonical loadings.

Evaluation on real neuroimaging genetics data
Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD).
Table 1 notes: The estimated correlation coefficients and their mean are shown. 'NaN' means that a method fails to estimate a pair of canonical loadings. '0.00' means a very small correlation coefficient. 'AVG.' denotes the mean across all four datasets. The best values and those that are not significantly worse than the best ones (t-test with p-value smaller than 0.05) are shown in bold.
Table 2 contains the characteristics of the ADNI dataset used in this work. Participants include 568 non-Hispanic Caucasian subjects, including 196 healthy control (HC), 343 MCI and 28 AD participants. However, many participants' data are incomplete due to various factors such as data loss. After removing the participants with incomplete information, we retain 282 participants in our experiments. The genotype data were downloaded from LONI (adni.loni.usc.edu), and the preprocessed [18F] Florbetapir PET scans (i.e., amyloid imaging data) were also obtained from LONI. Before conducting the experiment, the amyloid imaging data had been preprocessed, and the specific pipeline can be found in [14]. These imaging measures were adjusted by removing the effects of baseline age, gender, education, and handedness via the regression weights derived from HC participants. We finally obtained 191 region-of-interest (ROI) level amyloid measurements, which were extracted using the MarsBaR AAL atlas. We included four genetic markers, i.e., rs429358, rs439401, rs445925 and rs584007, from the known AD risk gene APOE. We intend to investigate whether GOSC-SCCA can identify the widely known associations between amyloid deposition and APOE SNPs. Shown in Table 3 are the 5-fold cross-validation results of the various SCCA methods. We observe that GOSC-SCCA and KG-SCCA obtain similar correlation coefficients on every run, in both training performance and testing performance. Besides, they are both significantly better than L1-SCCA and FL-SCCA, which is consistent with the analysis in [14]. This result shows that GOSC-SCCA can improve the ability to identify interesting imaging genetic associations compared with L1-SCCA and FL-SCCA. Figure 2 contains the estimated canonical loadings obtained from 5-fold cross-validation.
To facilitate the interpretation, we employ a heat map for the real data. Each row denotes a method; $u$ (genetic markers) is shown in the left panel and $v$ (imaging markers) in the right. As we can see, on the genetic side, all four SCCA methods exhibit similar canonical loading patterns. Since every SCCA method here incorporates the lasso ($\ell_1$-norm), they select only the APOE e4 SNP (rs429358), which is a widely known AD risk marker, with the irrelevant ones discarded to assure sparsity. On the imaging side, L1-SCCA identifies many signals, which are hard to interpret. FL-SCCA fuses adjacent features together due to its pairwise smoothness, which can be easily observed from the figure, but its result is also difficult to interpret. GOSC-SCCA and KG-SCCA perform similarly again in this run. They both identify imaging signals in accordance with the findings in [20]. It is easy to observe that they estimate a very clean signal pattern, which facilitates further investigation. Recall from the results in Table 3 that the association between the marker rs429358 and the amyloid accumulation in the brain is relatively strong, and thus the signal can be well captured by both KG-SCCA and GOSC-SCCA. In addition, the correlations among the imaging variables and those among the genetic variables are high enough that the signs of these correlations are hardly affected by noise. That is, the signs of the sample correlations tend to be correctly estimated. Therefore, KG-SCCA does not suffer from the sign directionality issue, and so performs similarly to GOSC-SCCA. However, if some sample correlations are not very strong and their signs are mis-estimated, KG-SCCA may not work very well (see the results of the second synthetic dataset).
Table 3 notes: The estimated correlation coefficients and their mean are shown. The best correlation coefficients and those that are not significantly worse than the best ones (t-test with p-value smaller than 0.05) are shown in bold.
In summary, this reveals that our method has better generalization ability, and could identify biologically meaningful imaging genetic associations.

Discussion
In this paper, we have proposed a structured SCCA method, GOSC-SCCA, which is intended to reduce the estimation bias caused by an incorrect sign of the sample correlation. GOSC-SCCA employs the GOSCAR (Graph OSCAR) regularizer, which is an extension of the popular OSCAR penalty. GOSC-SCCA can pull highly correlated features together regardless of whether they are positively or negatively correlated. We also provided a theoretical quantitative description of the grouping effect of our SCCA method. An effective algorithm was proposed to solve the GOSC-SCCA problem, and the algorithm is guaranteed to converge. We evaluated GOSC-SCCA and three other popular SCCA methods on both synthetic datasets and a real imaging genetics dataset. The synthetic datasets had different ground truths, i.e. different correlation coefficients and canonical loadings. GOSC-SCCA was capable of consistently identifying strong correlation coefficients on both the training set and the testing set, and either outperformed or performed similarly to the competing methods. Besides, GOSC-SCCA recovered signals that were the closest to the ground truth when compared with the competing methods.
The results on the real data showed that both GOSC-SCCA and KG-SCCA could find an important association between the APOE SNPs and the amyloid burden measure in the frontal region of the brain. KG-SCCA performs similarly to GOSC-SCCA on this real data largely because of the strong correlations between the variables within the genetic data, as well as those within the imaging data. In this case, the signs of the correlation coefficients between these variables tend to be correctly calculated, and so KG-SCCA does not have the sign directionality issue. On the other hand, if the correlations among some variables are not very strong, the performance of KG-SCCA can be affected by the mis-estimation of some correlation signs. In this case, GOSC-SCCA, which is designed to overcome the sign directionality issue, is expected to perform better than KG-SCCA. This fact has already been validated by the results of the second synthetic dataset.
The satisfactory performance of GOSC-SCCA, coupled with its theoretical convergence and grouping effect, demonstrates the promise of our method as an effective structured SCCA method for identifying meaningful bi-multivariate imaging genetic associations. The following are a few possible future directions. (1) Note that the identified pattern between the APOE genotype and amyloid deposition is a well-known and relatively strong imaging genetic association. Thus one direction is to apply GOSC-SCCA to more complex imaging genetic data to reveal novel but less obvious associations. (2) The data tested in this study are brain-wide but targeted only at APOE SNPs. Another direction is to apply GOSC-SCCA to imaging genetic data with higher dimensionality, where more effective and efficient strategies for parameter tuning and cross-validation warrant further investigation. (3) The third direction is to employ GOSC-SCCA as a knowledge-driven approach, where pathways, networks or other relevant biological knowledge can be incorporated into the model to aid association discovery. In this case, a comparative study can also be done between GOSC-SCCA and other state-of-the-art knowledge-guided SCCA methods in bi-multivariate imaging genetics analyses.

Conclusions
We have presented a new structured sparse canonical correlation analysis (SCCA) model for analyzing brain imaging genetics data and identifying interesting imaging genetic associations. This SCCA model employs a regularization term based on the graph octagonal selection and clustering algorithm for regression (GOSCAR). The goal is twofold: (1) to encourage highly correlated features to have similar canonical weights, and (2) to reduce the estimation bias by removing the requirement of pre-defining the sign of the sample correlation. As a result, it could pull highly correlated features together regardless of whether they are positively or negatively correlated. Empirical results on both synthetic and real data have demonstrated the promise of the proposed method.

Declarations
Publication charges for this article have been funded by the corresponding author. This article has been published as part of BMC Systems Biology Volume 10 Supplement 3, 2016: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM) 2015: systems biology. The full contents of the supplement are available online at http://bmcsystbiol.biomedcentral.com/articles/supplements/volume-10-supplement-3.

Availability of data and materials
Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu).