Volume 10 Supplement 3
Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM) 2015: systems biology
Structured sparse CCA for brain imaging genetics via graph OSCAR
- Lei Du^{1},
- Heng Huang^{2},
- Jingwen Yan^{1},
- Sungeun Kim^{1},
- Shannon Risacher^{1},
- Mark Inlow^{3},
- Jason Moore^{4},
- Andrew Saykin^{1},
- Li Shen^{1}Email author and
- for the Alzheimer’s Disease Neuroimaging Initiative
https://doi.org/10.1186/s12918-016-0312-1
© The Author(s) 2016
Published: 26 August 2016
Abstract
Background
Recently, structured sparse canonical correlation analysis (SCCA) has received increased attention in brain imaging genetics studies. It can identify bi-multivariate imaging genetic associations as well as select relevant features with desired structure information. Existing SCCA methods either use the fused lasso regularizer to induce smoothness between ordered features, or use the signed pairwise difference, which depends on the estimated sign of the sample correlation. In addition, several other structured SCCA models use the group lasso or graph fused lasso to encourage group structure, but they require the structure/group information to be provided in advance, which is sometimes unavailable.
Results
We propose a new structured SCCA model, which employs the graph OSCAR (GOSCAR) regularizer to encourage highly correlated features to have similar or equal canonical weights. Our GOSCAR-based SCCA has two advantages: 1) It does not require pre-defining the sign of the sample correlation, and thus could reduce the estimation bias. 2) It can pull highly correlated features together no matter whether they are positively or negatively correlated. We evaluate our method using both synthetic data and real data. Using 191 ROI measurements of amyloid imaging data and 58 genetic markers within the APOE gene, our method identifies a strong association between the APOE SNP rs429358 and the amyloid burden measure in the frontal region. In addition, the estimated canonical weights present a clear pattern which is preferable for further investigation.
Conclusions
Our proposed method shows better or comparable performance on the synthetic data in terms of the estimated correlations and canonical loadings. It has successfully identified an important association between an Alzheimer’s disease risk SNP rs429358 and the amyloid burden measure in the frontal region.
Keywords
Brain imaging genetics; Canonical correlation analysis; Structured sparse model; Machine learning
Background
In recent years, bi-multivariate analysis techniques [1], especially sparse canonical correlation analysis (SCCA) [2–8], have been widely used in brain imaging genetics studies. These methods are powerful in identifying bi-multivariate associations between genetic biomarkers, e.g., single nucleotide polymorphisms (SNPs), and imaging factors such as quantitative traits (QTs).
Witten et al. [3, 9] first employed the penalized matrix decomposition (PMD) technique to handle the SCCA problem, which had a closed-form solution. This SCCA imposed the ℓ _{1}-norm on the traditional CCA model to induce sparsity. Since the ℓ _{1}-norm randomly chooses only one of a set of correlated features, it performed poorly in recovering the structure information that usually exists in biological data. Witten et al. [3, 9] also implemented a fused lasso based SCCA which penalized adjacent features in order. This SCCA could capture some structure information, but it required the features to be ordered. As a result, many structured SCCA approaches arose. Lin et al. [7] imposed the group lasso regularizer on the SCCA model, which could make use of non-overlapping group information. Chen et al. [10] proposed a structure-constrained SCCA (ssCCA) which used a graph-guided fused ℓ _{2}-norm penalty for one canonical loading according to the features' biological relationships. Du et al. [8] proposed a structure-aware SCCA (S2CCA) to identify group-level bi-multivariate associations, which combined both the covariance matrix information and the prior group information via the group lasso regularizer. These structured SCCA methods, on one hand, can generate good results when the prior knowledge fits the hidden structure within the data well. On the other hand, they become inapplicable when the prior knowledge is incomplete or not available. Moreover, it is hard to precisely capture the prior knowledge in real-world biomedical studies.
To facilitate structural learning by grouping the weights of highly correlated features, graph theory has been widely utilized in sparse regression analysis [11–13]. Recently, graph theory has also been employed to address the grouping issue in SCCA. Let each graph vertex correspond one-to-one to a feature, and let ρ _{ ij } be the sample correlation between features i and j. Chen et al. [4, 5] proposed a network-structured SCCA (NS-SCCA) which used the ℓ _{1}-norm of |ρ _{ ij }|(u _{ i }−s i g n(ρ _{ ij })u _{ j }) to pull positively correlated features together, and fused negatively correlated features in opposite directions. The knowledge-guided SCCA (KG-SCCA) [14] is an extension of both NS-SCCA [4, 5] and S2CCA [8]. It used the ℓ _{2}-norm of \(\rho _{ij}^{2}(u_{i}-sign(\rho _{ij})u_{j})\) for one canonical loading, similar to what Chen proposed, and employed the ℓ _{2,1}-norm penalty for the other canonical loading. Both NS-SCCA and KG-SCCA can be used as group-pursuit methods when prior knowledge is not available. However, one limitation of both models is that they depend on the sign of the pairwise sample correlation to recover the structure pattern. This may incur undesirable bias, since the sign of a correlation can be wrongly estimated due to possible graph misspecification caused by noise [13].
To address the issues above, we propose a novel structured SCCA which requires neither prior knowledge nor the signs of the sample correlations to be specified. It also works well if prior knowledge is provided. GOSC-SCCA, named after the Graph Octagonal Selection and Clustering algorithm for Sparse Canonical Correlation Analysis, is inspired by the outstanding feature-grouping ability of the octagonal selection and clustering algorithm for regression (OSCAR) [11] and graph OSCAR (GOSCAR) [13] regularizers in regression tasks. Our contributions can be summarized as follows: 1) GOSC-SCCA can pull highly correlated features together when no prior knowledge is provided: positively correlated features are encouraged to have similar weights, and negatively correlated ones are encouraged to have similar weights with opposite signs. 2) GOSC-SCCA can reduce the estimation bias, since it does not require specifying the sign of the sample correlation. 3) We provide a theoretical quantitative description of the grouping effect of GOSC-SCCA. We use both synthetic data and real imaging genetic data to evaluate GOSC-SCCA. The experimental results show that our method is better than or comparable to state-of-the-art methods, i.e., L1-SCCA, FL-SCCA [3] and KG-SCCA [14], in identifying stronger imaging genetic correlations and more accurate and cleaner canonical loading patterns. Note that the PMA software package was used to implement the L1-SCCA (SCCA with lasso penalty) and FL-SCCA (SCCA with fused lasso penalty) methods. Please refer to http://cran.r-project.org/web/packages/PMA/ for more details.
Methods
The SCCA model seeks canonical loadings maximizing the bi-multivariate association: \(\max_{\mathbf{u},\mathbf{v}} \mathbf{u}^{T}\mathbf{X}^{T}\mathbf{Y}\mathbf{v}\) subject to \(||\mathbf{u}||_{2}^{2}=1\), \(||\mathbf{v}||_{2}^{2}=1\), ||u||_{1}≤c _{1} and ||v||_{1}≤c _{2}, where the ℓ _{1} constraints are sparsity penalties controlling the complexity of the SCCA model. The fused lasso [2–4, 9] can also be used instead of the lasso. To make the problem convex, the equal sign is usually replaced by a less-than-or-equal sign, i.e., \(||\mathbf {u}||_{2}^{2} \leq 1, ||\mathbf {v}||_{2}^{2} \leq 1\) [3].
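To make the formulation concrete, the sketch below evaluates the SCCA objective and its constraints numerically. The data matrices and weight vectors are synthetic stand-ins, not the paper's data; this is only an illustration of the quantities involved.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 80, 10, 12
X = rng.standard_normal((n, p))   # e.g. genetic data: n subjects x p SNPs
Y = rng.standard_normal((n, q))   # e.g. imaging data: n subjects x q QTs

u = rng.standard_normal(p)
v = rng.standard_normal(q)
# Scale so the quadratic constraints ||Xu||_2^2 <= 1, ||Yv||_2^2 <= 1 are tight.
u /= np.linalg.norm(X @ u)
v /= np.linalg.norm(Y @ v)

# SCCA objective: the bilinear form u' X' Y v maximized by the model.
objective = u @ X.T @ Y @ v
# l1 sparsity terms bounded by c1, c2 in the constrained formulation.
l1_u, l1_v = np.abs(u).sum(), np.abs(v).sum()
print(objective, l1_u, l1_v)
```

With the loadings rescaled this way, the objective equals the sample correlation between the two canonical variates Xu and Yv.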
The graph OSCAR regularization
Note that this penalty is applied to each feature pair.
where E _{ u } and E _{ v } are the edge sets of the u-related and v-related graphs, respectively. Obviously, the GOSCAR will reduce to OSCAR when both graphs are complete [13].
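As a minimal sketch of the GOSCAR penalty described in [13], the function below combines an ℓ _{1} term with a pairwise ℓ _{∞} term over the edge set, using the identity max{|a|,|b|} = (|a−b|+|a+b|)/2. The weighting parameters `lam1`, `lam2` and the edge-list representation are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def goscar_penalty(u, edges, lam1=1.0, lam2=1.0):
    """GOSCAR penalty on a weight vector u over an edge set E.

    Uses the identity max{|a|, |b|} = (|a - b| + |a + b|) / 2 for the
    pairwise L-infinity term, so the penalty is sign-agnostic: it pulls
    |u_i| and |u_j| together whether u_i and u_j agree in sign or not.
    """
    u = np.asarray(u, dtype=float)
    pairwise = sum(0.5 * (abs(u[i] - u[j]) + abs(u[i] + u[j]))
                   for i, j in edges)
    return lam1 * np.abs(u).sum() + lam2 * pairwise

# On a complete graph GOSCAR reduces to OSCAR, as noted above.
u = np.array([1.0, -1.0, 0.5])
complete = [(0, 1), (0, 2), (1, 2)]
print(goscar_penalty(u, complete))   # 2.5 (l1) + 3.0 (pairwise) = 5.5
```

Note that the pair (u[0], u[1]) = (1, −1) contributes the same pairwise cost as (1, 1) would, which is exactly the sign-independence property emphasized in the text.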
The GOSC-SCCA model
where (c _{1},c _{2},c _{3},c _{4}) are parameters that control the solution path of the canonical loadings. Since S2CCA [8] has shown that the covariance matrix information can help improve the prediction ability, we use \(||\mathbf {Xu}||_{2}^{2} \leq 1\) and \(||\mathbf {Yv}||_{2}^{2} \leq 1\) instead of \(||\mathbf {u}||_{2}^{2} \leq 1, ||\mathbf {v}||_{2}^{2} \leq 1\).
As a structured sparse model, GOSC-SCCA will encourage \(u_{i} \doteq u_{j}\) if the i-th feature and the j-th feature are highly correlated. We will give a quantitative description of this later.
The proposed algorithm
where (λ _{1},λ _{2},β _{1},β _{2}) are tuning parameters, and they have a one-to-one correspondence to parameters (c _{1},c _{2},c _{3},c _{4}) in GOSC-SCCA model [4].
where Λ _{1} is a diagonal matrix with the k _{1}-th element \(\frac {1}{2||u_{k_{1}}||_{1}} (k_{1} \in [1,p])\), and Λ _{2} with the k _{2}-th element \(\frac {1}{2||v_{k_{2}}||_{1}} (k_{2} \in [1,q])\); L _{1} is the Laplacian matrix obtained from L _{1}=D _{1}−W _{1}; \(\hat {\mathbf {L}}_{1}\) is the matrix obtained from \(\hat {\mathbf {L}}_{1} = \hat {\mathbf {D}}_{1} + \hat {\mathbf {W}}_{1}\). L _{2} and \(\hat {\mathbf {L}}_{2}\) are defined analogously to L _{1} and \(\hat {\mathbf {L}}_{1}\), respectively, based on v.
If ||u _{ i }−u _{ j }||_{1}=0, the corresponding element of W _{1} is undefined, so we regularize it as \(\frac {1}{2\sqrt {||u_{i}-u_{j}||_{1}^{2}+\zeta }}\), where ζ is a very small positive number. We similarly approximate ||u _{ i }||_{1} with \(\sqrt {||u_{i}||_{1}^{2}+\zeta }\) for Λ _{1}. The objective with respect to u then becomes the smoothed function \(\mathcal {L}^{*}(\mathbf {u}) = -\mathbf {u}^{T}\mathbf {X}^{T}\mathbf {Y}\mathbf {v} + \lambda _{1}\,\Omega _{\zeta }(\mathbf {u}) + \frac {\beta _{1}}{2}\sum _{i=1}^{p}\sqrt {u_{i}^{2}+\zeta } + \frac {\gamma _{1}}{2}||\mathbf {Xu}||_{2}^{2}\), where \(\Omega _{\zeta }\) denotes the ζ-smoothed GOSCAR term. It is easy to prove that \(\mathcal {L^{*}}(\mathbf {u})\) reduces to problem (6) with respect to u when ζ→0. The cases of ||v _{ i }||_{1}=0 and ||v _{ i }−v _{ j }||_{1}=0 can be addressed using the same regularization.
D _{1} is a diagonal matrix and its i-th diagonal element is obtained by summing the i-th row of W _{1}, i.e. \(d_{i} = \sum _{j} w_{ij}\). The diagonal matrix \(\hat {\mathbf {D}}_{1}\) is also obtained from \({\hat {d}}_{i} = \sum _{j} {\hat {w}}_{ij}\). Likewise, we can calculate W _{2}, \(\hat {\mathbf {W}}_{2}\), D _{2} and \(\hat {\mathbf {D}}_{2}\) by the same method in terms of v.
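The construction of the reweighted Laplacians can be sketched as follows. This assumes, per the smoothing rule above, that each edge weight is the ζ-regularized reciprocal 1/(2√((u _{ i }∓u _{ j })²+ζ)); the function name and edge-list input are illustrative.

```python
import numpy as np

def build_laplacians(u, edges, zeta=1e-8):
    """Build L = D - W and L_hat = D_hat + W_hat from the current u.

    Assumed reweighting (following the text): w_ij and w_hat_ij are the
    zeta-smoothed reciprocals of 2|u_i - u_j| and 2|u_i + u_j| on the
    edge set; D and D_hat are diagonal matrices of the row sums.
    """
    p = len(u)
    W = np.zeros((p, p))
    W_hat = np.zeros((p, p))
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / (2.0 * np.sqrt((u[i] - u[j]) ** 2 + zeta))
        W_hat[i, j] = W_hat[j, i] = 1.0 / (2.0 * np.sqrt((u[i] + u[j]) ** 2 + zeta))
    D = np.diag(W.sum(axis=1))         # d_i = sum_j w_ij
    D_hat = np.diag(W_hat.sum(axis=1))
    return D - W, D_hat + W_hat        # L, L_hat

L, L_hat = build_laplacians(np.array([1.0, 0.9, -1.0]), [(0, 1), (1, 2)])
```

Because L is a graph Laplacian, its rows sum to zero, which is a convenient sanity check when implementing the update equations.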
We observe that L _{1}, \(\hat {\mathbf {L}}_{1}\) and Λ _{1} depend on u which is an unknown variable, and v is also unknown which is used to calculate L _{2}, \(\hat {\mathbf {L}}_{2}\) and Λ _{2}. Thus we propose an effective iterative algorithm to solve this problem. We first fix v to solve u; and then fix u to solve v.
Algorithm 1 exhibits the pseudocode of the proposed GOSC-SCCA algorithm. For the key calculation steps, i.e., Step 5 and Step 10, we solve a system of linear equations rather than computing the matrix inverse, which has cubic complexity. Thus the whole algorithm runs with the desired efficiency. In addition, the algorithm is guaranteed to converge, which we prove in the next subsection.
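A schematic of the per-iteration u-update (Step 5) is sketched below. The form of the system matrix, λ _{1}(L+\(\hat{L}\))+β _{1}Λ+γ _{1}X ^{T} X, is an assumed reading of the fixed-point condition derived from the smoothed objective; consult the paper's update equations for the authoritative form. A dense solve is shown for simplicity rather than the paper's most efficient scheme.

```python
import numpy as np

def u_update(X, Y, v, L, L_hat, Lam, lam1=0.1, beta1=0.1, gamma1=1.0):
    """One u-update of the alternating algorithm (v held fixed).

    Solves the linear system A u = X' Y v directly instead of forming
    A^{-1}, then rescales u onto the constraint ||Xu||_2^2 <= 1.
    The system matrix A below is an assumption for illustration.
    """
    A = lam1 * (L + L_hat) + beta1 * Lam + gamma1 * (X.T @ X)
    b = X.T @ (Y @ v)
    u = np.linalg.solve(A, b)
    return u / max(np.linalg.norm(X @ u), 1e-12)

# Tiny demonstration with random data and trivial Laplacians.
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 5))
Y = rng.standard_normal((20, 4))
v = rng.standard_normal(4)
u = u_update(X, Y, v, np.zeros((5, 5)), np.zeros((5, 5)), np.eye(5))
```

The v-update is symmetric: fix u, rebuild L _{2}, \(\hat{L}_{2}\) and Λ _{2} from the current v, and solve the corresponding system.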
Convergence analysis
We first introduce the following lemma.
Lemma 1
Proof
Given the lemma in [16], we have \(||\tilde {\mathbf {u}}||_{2}-\frac {||\tilde {\mathbf {u}}||_{2}^{2}}{2||\mathbf {u}||_{2}} \leq ||\mathbf {u}||_{2}-\frac {||\mathbf {u}||_{2}^{2}}{2||\mathbf {u}||_{2}}\) for any two nonzero vectors. We also have \(||\tilde {u}||_{1}=||\tilde {u}||_{2}\) and ||u||_{1}=||u||_{2} for any two nonzero real numbers, which completes the proof. □
when \(|\tilde {u}' - u'|\), \(|\tilde {u} - u|\), \(|\tilde {u}' + u'|\) and \(|\tilde {u} + u|\) are nonzero.
We now have the following theorem regarding GOSC-SCCA algorithm.
Theorem 1
The objective function value of GOSC-SCCA will monotonically decrease in each iteration till the algorithm converges.
Proof
The proof consists of two parts.
Therefore, GOSC-SCCA will decrease the objective function in each iteration, i.e., \(\mathbf {\mathcal {L}(}\tilde {\mathbf {u}}\mathbf {,v)} \leq \mathbf {\mathcal {L}(u,v)}\).
Thus GOSC-SCCA also decreases the objective function in each iteration during the second phase, i.e., \(\mathbf {\mathcal {L}}(\tilde {\mathbf {u}},\tilde {\mathbf {v}}) \leq \mathbf {\mathcal {L}}(\tilde {\mathbf {u}},\mathbf {v})\).
Based on the analysis above, we have \(\mathbf {\mathcal {L}}(\tilde {\mathbf {u}},\tilde {\mathbf {v}}) \leq \mathbf {\mathcal {L}(u,v)}\) by the transitive property of inequalities. Therefore, the objective value monotonically decreases in each iteration. Note that the CCA objective \(\mathbf {\frac {u^{T}X^{T}Yv}{\sqrt {u^{T}X^{T}Xu}\sqrt {v^{T}Y^{T}Yv}}}\) lies in [−1,1], and both u ^{ T } X ^{ T } X u and v ^{ T } Y ^{ T } Y v are constrained to be 1. Thus −u ^{ T } X ^{ T } Y v is lower bounded by −1, and so is Eq. (6). In addition, Eqs. (16–17) imply that the KKT condition is satisfied. Therefore, the GOSC-SCCA algorithm converges to a local optimum. □
Based on the convergence analysis, we set the stopping criterion of Algorithm 1 to max{|δ|∣δ∈(u _{ t+1}−u _{ t })}≤τ and max{|δ|∣δ∈(v _{ t+1}−v _{ t })}≤τ, where τ is a predefined estimation error. We set τ=10^{−5} empirically based on our experiments.
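The stopping rule above amounts to checking the largest absolute per-coordinate change in both loading vectors, for example:

```python
import numpy as np

def converged(u_new, u_old, v_new, v_old, tau=1e-5):
    """Stopping criterion of Algorithm 1: the largest absolute change in
    both canonical loading vectors falls below the tolerance tau."""
    return (np.max(np.abs(u_new - u_old)) <= tau and
            np.max(np.abs(v_new - v_old)) <= tau)
```

Using the max-norm of the change (rather than, say, the relative objective decrease) ties the tolerance directly to the estimation error of the loadings.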
The grouping effect of GOSC-SCCA
For structured sparse learning in high-dimensional settings, the automatic feature grouping property is of great importance [18]. In regression analysis, Zou and Hastie [18] suggested that a method exhibits the grouping effect when it assigns similar coefficients to features of the same group. This also applies to structured SCCA methods. It is therefore important and meaningful to investigate the theoretical bound of the grouping effect.
We have the following theorem in terms of GOSC-SCCA.
Theorem 2
where ρ _{ ij } is the sample correlation between features i and j, and w _{ ij } is the corresponding element of the u-related matrix W _{1}.
Proof
□
where \(\rho ^{\prime }_{ij}\) is the sample correlation between the i-th and j-th features on the v side, and \(w^{\prime }_{ij}\) is the corresponding element of the v-related matrix W _{2}.
Theorem 2 provides a theoretical upper bound for the difference between the estimated coefficients of the i-th and j-th features. This bound may not appear tight, but the slackness is intentional: when ρ _{ ij }≪1, the bound does not force the pairwise difference of features i and j to be small, which is desirable for two irrelevant features [19]. If two features have a very small correlation, their coefficients need not be the same or similar, so there is no reason to impose a tight bound on their pairwise difference. This quantitative description of the grouping effect makes the GOSCAR penalty an ideal choice for structured SCCA.
Results
We compare GOSC-SCCA with several state-of-the-art SCCA and structured SCCA methods, including L1-SCCA [3], FL-SCCA [3] and KG-SCCA [14]. We do not compare GOSC-SCCA with S2CCA [8], ssCCA [10] and CCA-SG (CCA sparse group) [7], since they require prior knowledge to be available in advance. We do not choose NS-SCCA [5] as a benchmark either, for two reasons. (1) NS-SCCA generates many intermediate variables during its iterative procedure; as its authors state, NS-SCCA's per-iteration complexity is linear in (p+|E|), and thus becomes O(p ^{2}) in group-pursuit mode. (2) Its penalty term is similar to that of KG-SCCA, which has been selected for comparison.
There are six parameters to be decided before running GOSC-SCCA, so blindly tuning all of them would be prohibitively time-consuming. We tune the parameters following two principles. On one hand, Chen and Liu [5] found that the result is not very sensitive to γ _{1} and γ _{2}, so we choose them from the small set {0.1, 1, 10}. On the other hand, if the parameters are too small, SCCA reduces to CCA due to the negligible influence of the penalties, while overly large parameters over-penalize the results. We therefore tune the remaining parameters within the range {10^{−3},10^{−2},10^{−1},10^{0},10^{1},10^{2},10^{3}}. In this study, we conduct all experiments using a nested 5-fold cross-validation strategy, with parameters tuned only on the training set. To save time, we tune the parameters only on the first run of the cross-validation, i.e., when the first four folds serve as the training set, and then reuse the tuned parameters for all remaining experiments. All methods use the same cross-validation partition.
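The size of the resulting search space can be enumerated directly; this sketch only builds the candidate grid described above (γ _{1}, γ _{2} from the small scope; λ _{1}, λ _{2}, β _{1}, β _{2} from the log-scale range) and is not the paper's tuning code.

```python
import itertools

# Candidate grid for (gamma_1, gamma_2, lambda_1, lambda_2, beta_1, beta_2).
gammas = [0.1, 1, 10]
log_grid = [10.0 ** k for k in range(-3, 4)]   # 1e-3 ... 1e3

grid = list(itertools.product(gammas, gammas,
                              log_grid, log_grid, log_grid, log_grid))
print(len(grid))   # 3 * 3 * 7**4 = 21609 candidate settings
```

With over twenty thousand candidate settings per fold, tuning only on the first cross-validation run (as described above) is what keeps the experiment tractable.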
Evaluation on synthetic data
We generate four synthetic datasets to investigate the performance of GOSC-SCCA and the benchmarks. Following [4, 5], each dataset is generated in four steps: 1) We predefine the structures and use them to create u and v, respectively. 2) We create a latent vector z from N(0,I _{ n×n }). 3) We create X with each \(\mathbf {x}_{i} \sim N(z_{i}\mathbf {u},\sum _{x})\) where \((\sum _{x})_{jk}=\exp ^{-|u_{j}-u_{k}|}\), and Y with each \(\mathbf {y}_{i} \sim N(z_{i}\mathbf {v},\sum _{y})\) where \((\sum _{y})_{jk}=\exp ^{-|v_{j}-v_{k}|}\). 4) For the first group of nonzero features in u, we flip the signs of half of them and also flip the signs of the corresponding data columns. Since the synthetic datasets are order-independent, this setup is equivalent to randomly flipping the signs of a portion of the features in u. Because we change the signs of both the coefficients and the data simultaneously, we still have X ^{′} u ^{′}=X u, where X ^{′} and u ^{′} denote the data and coefficients after the sign swap. We do the same on the Y side to make the simulation more challenging [13]. In addition, all four datasets have n=80, p=100 and q=120, but different correlation coefficients and group structures. The simulation is thus designed to cover a diverse set of cases for a fair comparison.
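Steps 1–3 of the generation procedure can be sketched as follows. The specific group structures in u and v below are illustrative placeholders, and step 4 (the sign swap) is omitted for brevity since it leaves Xu unchanged by construction.

```python
import numpy as np

def make_synthetic(u, v, n=80, seed=0):
    """Generate one synthetic (X, Y) pair following steps 1-3 above:
    a latent vector z drives both views, with covariance
    (Sigma_x)_jk = exp(-|u_j - u_k|) and likewise for Sigma_y."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)                          # latent vector ~ N(0, I)
    cov_x = np.exp(-np.abs(u[:, None] - u[None, :]))
    cov_y = np.exp(-np.abs(v[:, None] - v[None, :]))
    X = np.array([rng.multivariate_normal(zi * u, cov_x) for zi in z])
    Y = np.array([rng.multivariate_normal(zi * v, cov_y) for zi in z])
    return X, Y

# Illustrative predefined group structures (step 1), not the paper's exact ones.
u = np.zeros(100); u[10:30] = 1.0
v = np.zeros(120); v[40:70] = -1.0
X, Y = make_synthetic(u, v)
```

Because both views share the latent vector z, the canonical variates Xu and Yv carry a genuine bi-multivariate correlation for the methods to recover.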
5-fold cross-validation results on synthetic data
Training results

| Methods | Dataset 1 (5 folds) | MEAN | Dataset 2 (5 folds) | MEAN | Dataset 3 (5 folds) | MEAN | Dataset 4 (5 folds) | MEAN | AVG. |
|---|---|---|---|---|---|---|---|---|---|
| L1-SCCA | 0.52 0.56 0.52 0.53 0.51 | 0.53 | 0.25 0.29 0.16 0.20 0.23 | 0.23 | 0.56 0.24 0.57 0.53 0.52 | 0.48 | 0.46 0.50 0.53 0.48 0.35 | 0.46 | 0.43 |
| FL-SCCA | 0.52 0.60 0.52 0.53 0.50 | 0.53 | NaN NaN 0.17 NaN 0.23 | 0.08 | 0.63 0.43 0.56 0.55 0.55 | 0.54 | 0.51 0.56 NaN 0.53 0.40 | 0.40 | 0.39 |
| KG-SCCA | 0.52 0.55 0.52 0.53 0.53 | 0.53 | 0.25 0.29 0.15 0.20 0.22 | 0.22 | 0.56 0.24 0.43 0.52 0.52 | 0.45 | 0.51 0.56 0.48 0.52 0.40 | 0.49 | 0.42 |
| GOSC-SCCA | 0.57 0.62 0.57 0.59 0.63 | 0.60 | 0.26 0.30 0.15 0.21 0.17 | 0.22 | 0.64 0.31 0.42 0.61 0.59 | 0.51 | 0.51 0.56 0.55 0.54 0.41 | 0.52 | 0.46 |

Testing results

| Methods | Dataset 1 (5 folds) | MEAN | Dataset 2 (5 folds) | MEAN | Dataset 3 (5 folds) | MEAN | Dataset 4 (5 folds) | MEAN | AVG. |
|---|---|---|---|---|---|---|---|---|---|
| L1-SCCA | 0.57 0.43 0.58 0.49 0.59 | 0.53 | 0.00 0.21 0.32 0.17 0.08 | 0.16 | 0.36 0.20 0.37 0.49 0.46 | 0.38 | 0.45 0.29 0.20 0.40 0.67 | 0.40 | 0.37 |
| FL-SCCA | 0.56 0.38 0.57 0.49 0.59 | 0.52 | NaN NaN 0.48 NaN 0.08 | 0.11 | 0.30 0.80 0.36 0.51 0.41 | 0.47 | 0.55 0.30 NaN 0.46 0.72 | 0.40 | 0.38 |
| KG-SCCA | 0.56 0.43 0.57 0.49 0.58 | 0.53 | 0.00 0.21 0.31 0.18 0.07 | 0.15 | 0.37 0.20 0.45 0.50 0.45 | 0.39 | 0.52 0.29 0.34 0.46 0.71 | 0.46 | 0.38 |
| GOSC-SCCA | 0.73 0.39 0.68 0.56 0.45 | 0.56 | 0.02 0.09 0.57 0.20 0.38 | 0.25 | 0.23 0.18 0.43 0.44 0.43 | 0.34 | 0.53 0.31 0.31 0.36 0.72 | 0.45 | 0.40 |
Evaluation on real neuroimaging genetics data
Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). For up-to-date information, see www.adni-info.org.
Real data characteristics
| | HC | MCI | AD |
|---|---|---|---|
| Num | 196 | 343 | 28 |
| Gender (M/F) | 102/94 | 203/140 | 18/10 |
| Handedness (R/L) | 178/18 | 309/34 | 23/5 |
| Age (mean ± std.) | 74.77 ± 5.39 | 71.92 ± 7.47 | 75.23 ± 10.66 |
| Education (mean ± std.) | 15.61 ± 2.74 | 15.99 ± 2.75 | 15.61 ± 2.74 |
5-fold cross-validation results on real data
| Methods | Training (5 folds) | MEAN | Testing (5 folds) | MEAN |
|---|---|---|---|---|
| L1-SCCA | 0.50 0.50 0.53 0.53 0.54 | 0.52 | 0.56 0.61 0.45 0.47 0.38 | 0.49 |
| FL-SCCA | 0.44 0.43 0.46 0.45 0.46 | 0.45 | 0.49 0.56 0.39 0.43 0.37 | 0.45 |
| KG-SCCA | 0.53 0.52 0.55 0.54 0.56 | 0.54 | 0.56 0.61 0.47 0.52 0.45 | 0.52 |
| GOSC-SCCA | 0.53 0.52 0.55 0.55 0.56 | 0.54 | 0.56 0.62 0.47 0.51 0.45 | 0.52 |
Discussion
In this paper, we have proposed a structured SCCA method, GOSC-SCCA, which is intended to reduce the estimation bias caused by incorrectly estimated signs of sample correlations. GOSC-SCCA employs the GOSCAR (graph OSCAR) regularizer, an extension of the popular OSCAR penalty. GOSC-SCCA can pull highly correlated features together no matter whether they are positively or negatively correlated. We also provided a theoretical quantitative description of the grouping effect of our SCCA method, and proposed an effective algorithm, guaranteed to converge, to solve the GOSC-SCCA problem.
We evaluated GOSC-SCCA and three other popular SCCA methods on both synthetic datasets and a real imaging genetics dataset. The synthetic datasets had different ground truths, i.e., different correlation coefficients and canonical loadings. GOSC-SCCA consistently identified strong correlation coefficients on both the training and testing sets, and either outperformed or performed similarly to the competing methods. Moreover, among all the methods compared, GOSC-SCCA recovered the signals closest to the ground truth.
The results on the real data showed that both GOSC-SCCA and KG-SCCA could find an important association between the APOE SNPs and the amyloid burden measure in the frontal region of the brain. KG-SCCA performs similarly to GOSC-SCCA on this real data largely because of the strong correlations between the variables within the genetic data, as well as those within the imaging data. In this case, the signs of the correlation coefficients between these variables tend to be correctly calculated, and so KG-SCCA does not have the sign directionality issue. On the other hand, if the correlations among some variables are not very strong, the performance of KG-SCCA can be affected by the mis-estimation of some correlation signs. In this case, GOSC-SCCA, which is designed to overcome the sign directionality issue, is expected to perform better than KG-SCCA. This fact has already been validated by the results of the second synthetic dataset.
The satisfactory performance of GOSC-SCCA, coupled with its theoretical convergence and grouping effect, demonstrates the promise of our method as an effective structured SCCA method for identifying meaningful bi-multivariate imaging genetic associations. A few possible future directions follow. (1) The identified pattern between the APOE genotype and amyloid deposition is a well-known and relatively strong imaging genetic association, so one direction is to apply GOSC-SCCA to more complex imaging genetic data to reveal novel but less obvious associations. (2) The data tested in this study are brain-wide but target only APOE SNPs. Another direction is to apply GOSC-SCCA to imaging genetic data of higher dimensionality, where more effective and efficient strategies for parameter tuning and cross-validation warrant further investigation. (3) The third direction is to employ GOSC-SCCA as a knowledge-driven approach, where pathways, networks or other relevant biological knowledge can be incorporated into the model to aid association discovery. In that case, comparative studies can also be done between GOSC-SCCA and other state-of-the-art knowledge-guided SCCA methods in bi-multivariate imaging genetics analyses.
Conclusions
We have presented a new structured sparse canonical correlation analysis (SCCA) model for analyzing brain imaging genetics data and identifying interesting imaging genetic associations. This SCCA model employs a regularization term based on the graph octagonal selection and clustering algorithm for regression (GOSCAR). The goal is twofold: (1) encourage highly correlated features to have similar canonical weights, and (2) reduce the estimation bias by removing the requirement of pre-defining the sign of the sample correlation. As a result, it can pull highly correlated features together no matter whether they are positively or negatively correlated. Empirical results on both synthetic and real data have demonstrated the promise of the proposed method.
Notes
Declarations
Acknowledgements
At Indiana University, this work was supported by NIH R01 LM011360, U01 AG024904, RC2 AG036535, R01 AG19771, P30 AG10133, UL1 TR001108, R01 AG 042437, R01 AG046171, and R03 AG050856; NSF IIS-1117335; DOD W81XWH-14-2-0151, W81XWH-13-1-0259, and W81XWH-12-2-0012; NCAA 14132004; and CTSI SPARC Program. At University of Texas at Arlington, this work was supported by NSF CCF-0830780, CCF-0917274, DMS-0915228, and IIS-1117965. At University of Pennsylvania, the work was supported by NIH R01 LM011360, R01 LM009012, and R01 LM010098.
Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
Declarations
Publication charges for this article have been funded by the corresponding author.
This article has been published as part of BMC Systems Biology Volume 10 Supplement 3, 2016: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM) 2015: systems biology. The full contents of the supplement are available online at
http://bmcsystbiol.biomedcentral.com/articles/supplements/volume-10-supplement-3.
Availability of data and materials
Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu).
Authors’ contributions
LD, JM, AS, and LS: overall design. LD, HH, and MI: modeling and algorithm design. LD and JY: experiments. SK, SR, and AS: data preparation and result evaluation. LD, JY and LS: manuscript writing. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
- Vounou M, Nichols TE, Montana G. Discovering genetic associations with high-dimensional neuroimaging phenotypes: a sparse reduced-rank regression approach. NeuroImage. 2010; 53(3):1147–59.
- Parkhomenko E, Tritchler D, Beyene J. Sparse canonical correlation analysis with application to genomic data integration. Stat Appl Genet Mol Biol. 2009; 8(1):1–34.
- Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009; 10(3):515–34.
- Chen X, Liu H, Carbonell JG. Structured sparse canonical correlation analysis. In: International Conference on Artificial Intelligence and Statistics, JMLR Proceedings 22, JMLR.org; 2012.
- Chen X, Liu H. An efficient optimization algorithm for structured sparse CCA, with applications to eQTL mapping. Stat Biosci. 2012; 4(1):3–26.
- Chi EC, Allen G, Zhou H, Kohannim O, Lange K, Thompson PM, et al. Imaging genetics via sparse canonical correlation analysis. In: Biomedical Imaging (ISBI), 2013 IEEE 10th International Symposium on; 2013. p. 740–3. doi:10.1109/ISBI.2013.6556581.
- Lin D, Calhoun VD, Wang YP. Correspondence between fMRI and SNP data by group sparse canonical correlation analysis. Medical Image Analysis. 2014; 18(6):891–902.
- Du L, Yan J, Kim S, Risacher SL, Huang H, Inlow M, Moore JH, Saykin AJ, Shen L. A novel structure-aware sparse learning algorithm for brain imaging genetics. In: International Conference on Medical Image Computing and Computer Assisted Intervention. Berlin, Germany: Springer; 2014. p. 329–36.
- Witten DM, Tibshirani RJ. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat Appl Genet Mol Biol. 2009; 8(1):1–27.
- Chen J, Bushman FD, et al. Structure-constrained sparse canonical correlation analysis with an application to microbiome data analysis. Biostatistics. 2013; 14(2):244–58.
- Bondell HD, Reich BJ. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics. 2008; 64(1):115–23.
- Li C, Li H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008; 24(9):1175–82.
- Yang S, Yuan L, Lai YC, Shen X, Wonka P, Ye J. Feature grouping and selection over an undirected graph. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, USA: ACM; 2012. p. 922–30.
- Yan J, Du L, Kim S, Risacher SL, Huang H, Moore JH, Saykin AJ, Shen L. Transcriptome-guided amyloid imaging genetic analysis via a novel structured sparse learning algorithm. Bioinformatics. 2014; 30(17):564–71.
- Hardoon D, Szedmak S, Shawe-Taylor J. Canonical correlation analysis: an overview with application to learning methods. Neural Comput. 2004; 16(12):2639–64.
- Nie F, Huang H, Cai X, Ding CH. Efficient and robust feature selection via joint ℓ2,1-norms minimization. In: Advances in Neural Information Processing Systems. Massachusetts, USA: The MIT Press; 2010. p. 1813–21.
- Grosenick L, Klingenberg B, Katovich K, Knutson B, Taylor JE. Interpretable whole-brain prediction analysis with GraphNet. NeuroImage. 2013; 72:304–21.
- Zou H, Hastie T. Regularization and variable selection via the elastic net. J Royal Stat Soc Ser B (Stat Method). 2005; 67(2):301–20.
- Lorbert A, Eis D, Kostina V, Blei DM, Ramadge PJ. Exploiting covariate similarity in sparse regression via the pairwise elastic net. In: International Conference on Artificial Intelligence and Statistics, JMLR Proceedings 9, JMLR.org; 2010. p. 477–84.
- Ramanan VK, Risacher SL, Nho K, Kim S, Swaminathan S, Shen L, Foroud TM, Hakonarson H, Huentelman MJ, Aisen PS, et al. APOE and BCHE as modulators of cerebral amyloid deposition: a florbetapir PET genome-wide association study. Mol Psychiatry. 2014; 19(3):351–7.