Analyzing networks of phenotypes in complex diseases: methodology and applications in COPD

Background The investigation of complex disease heterogeneity has been challenging. Here, we introduce a network-based approach, using partial correlations, that analyzes the relationships among multiple disease-related phenotypes. Results We applied this method to two large, well-characterized studies of chronic obstructive pulmonary disease (COPD). We also examined the associations between these COPD phenotypic networks and other factors, including case-control status, disease severity, and genetic variants. Using these phenotypic networks, we have detected novel relationships between phenotypes that would not have been observed using traditional epidemiological approaches. Conclusion Phenotypic network analysis of complex diseases could provide novel insights into disease susceptibility, disease severity, and genetic mechanisms.


Background
Complex diseases like diabetes, stroke, many types of cancer, and chronic obstructive pulmonary disease (COPD) are likely heterogeneous syndromes composed of multiple disease subtypes that manifest a similar pathological or physiological outcome. These subtypes may have different genetic determinants. In order to understand this heterogeneity, a variety of clinical, physiological, imaging, pathological, and biochemical disease-related phenotypes have been analyzed [1]. In standard clinical epidemiological approaches, univariate and multivariate regression analyses are performed to determine significant and independent predictors of disease development. However, the available disease-related phenotypes may be crude assessments of disease pathophysiology; any analyses that are performed may be confounded by grouping multiple Full list of author information is available at the end of the article subtypes together. The challenge we face, in part, is deconvoluting these disease-related phenotypes and defining their relationships to one another and to specific genetic determinants.
Network analysis has the potential to provide a holistic approach to the understanding of disease complexity, rather than focusing on individual components of disease [2]. Network approaches can capture emergent properties that are not apparent when network components are analyzed in a pair-wise manner. However, network medicine approaches to complex diseases have largely focused on relating a disease to the underlying cellular and molecular interaction network [3]. Correlation-based networks have been frequently used to analyze gene expression data [4,5], but these methods have not been widely applied to the study of disease-related phenotypes. Barabási and colleagues [6] used diagnostic coding data to assess phenotypic network relationships between different disease categories, but not to analyze multiple quantitative phenotypes within one complex disease. Using COPD as an example, we describe the application of network inference http://www.biomedcentral.com/1752-0509/8/78 methods to explore the relationships between diseaserelated phenotypes that have been found to be relevant in determining disease severity and outcome, and, ultimately, to begin to define the complex heterogeneity of the disease.

Network inference and comparison
To infer phenotypic networks, we used the Gaussian graphical model (GGM) introduced by [7] and [8]. Briefly, the model, which is based on the assumption that the variables have Gaussian distributions, infers the connection between each pair of variables and creates a phenotypic network based on partial correlations.
Assume that we have P phenotype variables and K subjects. We begin by constructing a P × K matrix, Y , where we assume that the elements of Y follow a multivariate normal distribution: Here, y ji represents the jth phenotype variable in the ith subject, μ is the mean vector and is the covariance matrix. The covariance matrix Y and the partial correlation matrix (denoted by ) for Y are estimated (see [9]). The partial correlation (PCOR) ω jk measures the correlation between variable j and variable k while controlling for all other variables. Therefore, ω jk represents the conditional dependency between variable j and variable k, with ω jk = 0 if the two variables are independent conditional on all other variables and ω jk = 0 if they are conditionally correlated. For each pair of variables that are conditionally dependent, the presumed causal relationship between the variables is a direct one and independent of all other variables. We assume that these partial correlations represent the hidden connections between phenotypic variables that may help to refine disease subtypes.
Under the null hypothesis in which all variables are independent, Hotelling [10] gives the null distribution of sample partial correlation ω as where κ is the degrees of freedom (K − P + 1). Therefore, we can compute the p-values for the estimated partial correlation coefficients for each pair of phenotypic variables and test for the presence of a significant connection between those variables in the phenotypic network. In addition, we can also test for differences in the network connectivity between two groups of subjects by permutation tests. For example, to test for differential connectivity between COPD cases and controls, we randomly swap the labels of cases and controls and calculate the PCORs in the shuffled groups, repeated 10,000 times, to obtain the distribution of PCORs under the null hypothesis in which the presence or absence of connections is not associated with the case-control status. The empirical p-values are reported. Analogously, we have also tested differential connectivity between different genotypes for two previously identified genome-wide significant SNPs associated with COPD using the same approach.
Opgen-Rhein and Strimmer [11] have extended the GGM method to infer the directionality of the edges between each pair of variables. They proposed a test of directionality based on the log-ratios of standardized partial variances. This method enables identification of a "partially directed graph" where some of the significant edges identified by GGM methods will have directions, which might imply causality, while other edges remain undirected.

Study populations and phenotypic variable selection
COPD is a disease defined by abnormal physiology, with chronic airflow obstruction as the common, key feature [12]. Chronic airflow obstruction is characterized by reductions in the forced expiratory volume in one second (FEV1) and in the ratio of the FEV1 to the forced vital capacity (FVC), which are assessed by spirometry. Clinical epidemiological studies have identified multiple factors that contribute to COPD, including cigarette smoking (often quantified as pack-years, where an average of one pack of cigarettes smoked per day for one year is one packyear) and increasing age. In addition, a variety of diseaserelated phenotypes have been studied related to imaging, exercise capacity, respiratory symptoms, and physiology. Computerized tomography (CT) imaging enables assessment of the severity and distribution of emphysema-the destruction of lung parenchyma-as well as thickening of airways [13][14][15]. The underlying assumption in our analysis is that these phenotypic variables are not independent, but, rather, interact to define distinct groups of patients (subtypes). By defining these subtypes, we might better be able to classify patients, understand their unique disease characteristics, and ultimately direct them to appropriate therapies.
The COPDGene Study [16] is a multi-center genetic and epidemiologic investigation to study COPD and other smoking-related lung diseases. In this study, 10,192 smokers (including 6,784 non-Hispanic Whites (NHW) and 3,408 African-Americans (AA)) have completed a detailed protocol, including questionnaires, pre-and postbronchodilator spirometry, high-resolution CT scanning of the chest, exercise capacity (assessed by six minute walk distance), and blood samples for genotyping. Samples were genotyped using the Illumina OmniExpress platform, which assayed genetic polymorphisms at over 700,000 sites along the genome; the genotype data have gone through standard quality-control procedures for genome-wide association analysis. Briefly, a total of 221 http://www.biomedcentral.com/1752-0509/8/78 subjects and 83,423 markers were excluded for quality control reasons, including identity-by-descent, gender mismatches, genotype missingness, Hardy-Weinberg disequilibrium in controls, and low minor allele frequency. The details of the quality control procedures are available at http://www.copdgene.org/sites/default/files/ GWAS_QC_Methodology_20121115.pdf.
For phenotypic network analysis, we selected 10 key quantitative COPD-related phenotypes based on clinical experts' opinions (co-authors EKS, CPH, and MHC). The phenotypes were chosen to represent major diseaserelated components, including imaging, physiology, exercise capacity, and exacerbations, as well as important demographic variables (Table 1). Although over 300 variables were captured by questionnaires, clinical assessments, and CT scanning in COPDGene, we chose phenotypes to avoid duplicate assessment of the same aspect of the disease (e.g., lung function, emphysema severity, and airway wall thickness). For example, we included FEV1 but excluded FEV1/FVC, as they are both lung function phenotypes which assess airflow obstruction. Subjects with missing data in any of the 10 quantitative variables were excluded. Therefore, a complete set of 8,141 subjects were used in the following analyses, including 5,478 NHWs and 2,514 AAs. Case subjects were defined by FEV1 < 80% predicted and FEV1/FVC < 0.7, while control subjects were defined by FEV1 ≥ 80% predicted and FEV1/FVC ≥ 0.7. In addition to assessment based on case-control status, we compared groups of subjects homozygous for risk-and non-risk alleles at known GWAS SNPs, excluding heterozygotes from the genotypestratified phenotypic networks to maximize phenotypic effects. To assess the impact of including phenotypic variables that are not closely related to COPD on our phenotypic networks, we also created networks including heart rate and systolic blood pressure as well as networks including two randomly generated variables. Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints (ECLIPSE, [17]) is a large longitudinal study of COPD patients and controls with comprehensive phenotyping similar to COPDGene. Therefore, we used a subset of 1,705 COPD cases (including 1,667 white subjects) with complete data for the 10 quantitative variables at their baseline study visit to build phenotypic networks. All variables in Table 1 were available in ECLIPSE, except for Emphysema Distribution and Gas Trapping. Therefore, networks with 8 variables were built for both COPDGene and ECLIPSE for comparison.

Whole population phenotypic network in COPDGene
The ten selected COPD-related phenotypes in COPDGene were found to be highly connected in the whole study population. Out of 45 pairs of phenotypes, 37 had significant PCORs with p-values< 0.05, and 29 pairs were significant with p-values< 0.001 (density = 64.44%, where the density of a network is defined by the portion of all possible connections in a network that are actual connections, see Figure 1 and Table 2). The most highly connected nodes were FEV1 and Gas Trapping (see Figure 1), with Gas Trapping significantly connected with all of the analyzed phenotypes. In addition, the 16 pairs that were not directly connected (p-values > 0.001) were connected through only one transitive node based on shortest path analysis [18]. The majority of shortest paths connected through gas trapping (9 out of 16), suggesting that gas trapping is a "hub" in the phenotypic network. This finding is consistent with the high correlation observed between CT gas trapping and spirometric measures [19], and also with the observation that CT gas trapping encompasses the two major pathological processes in COPD-emphysema and small airway disease. Most edges in this whole population network remained statistically significant after we stratified by race, while the NHW network edges were slightly more significant than the AA network likely due to larger sample size and better power. FEV1 and Gas Trapping remained highly connected in the race-stratified networks. The top four pairs (CT Emphysema/Gas Trapping, FEV1/Gas Trapping, FEV1/Pi10, and Gas Trapping/Age) all stayed consistently top-ranked for the whole population and race-stratified networks and were all highly significant (see Table 2).
Since the ten variables were chosen based on their association with COPD, it was not surprising to find that http://www.biomedcentral.com/1752-0509/8/78 Whole Population Network, p<0.001 most of the variables were highly connected. To assess the effects of variable selection, we repeated the analysis with two scenarios: (1) we added two extra variables randomly generated from a standard Gaussian distribution; and (2) we added two "extraneous" variables that were presumably less closely related to COPD: systolic blood pressure (SysBP) and heart rate (HR). As expected, the Gaussian random variables were not connected to any other variables. However, when the two "extraneous" clinical variables were introduced they were found to be connected to some of the other variables, but they were not an integral part of the graph. The network was sparser, as there were fewer edges between these presumably less related variables and other network components. Using the same threshold (p < 0.001), 37 pairs of variables were significantly connected (density = 56.06%), including all 29 pairs that were significantly connected in the original network analysis. The extra 8 edges resulting from the two presumably unrelated variables included some clinically expected pairs including demographic variables, such as SysBP/BMI, SysBP/Age, and HR/Age. There were also a few connections between HR and COPD phenotypes which could be of potential interest (See Additional file 1: Figure S1). Therefore, variables selection does play an important role in the degree in which the variables are connected.
Although our primary phenotypic network analysis was based on undirected edges, we also created a phenotypic network using directed edges-where they could be defined with certainty. The partially directed network analysis showed that for 9 out of the 29 edges directionality could be established, with 7 variables directed toward Gas Trapping (except for FEV1 and Emphysema, See Figure 2).

Case-control phenotypic network comparison in COPDGene
We then built phenotypic networks for COPD case and control groups separately to examine the similarities and differences in phenotypic relationships between these two groups. Separate GGM networks were estimated in COPDGene for smoking controls with normal spirometry (n = 3597) and COPD cases with moderate to very severe airflow obstruction (GOLD Stage ≥ 2, n = 2894) to explore the impact of COPD on phenotypic relationships. Using p-value = 0.05 as the threshold for statistical significance, the case and control networks each had 30 significant edges. The top pairs of variables were consistent in these two phenotypic networks, including CT Emphysema/Gas Trapping, Gas Trapping/BMI, Gas Trapping/Age, and CT Emphysema/Pi10, with a total of 17 edges present in both subgroups. However, the presence http://www.biomedcentral.com/1752-0509/8/78 of these edges in both groups does not exclude the possibility that these partial correlations could be associated with case-control status. The permutation tests showed some additional differences between the networks, where 24 pairs with significantly different p-values in the comparison between case and control networks were observed (See Additional file 2: Table S1). For example, the Gas-Trapping/BMI pair had significant negative connections in both groups, but was more strongly connected in the case group than the control group. There were 32 pairs significantly associated with case-control status (See Figure 3). While most pairs had very similar patterns of correlation in both groups, one of the notable exceptions was between CT Emphysema and BMI. Higher CT emphysema was associated with higher BMI in the control group but was associated with lower BMI in the case group, and both associations were statistically significant.

Moderate/Severe COPD network comparison in COPDGene
Next, we constructed phenotypic networks in COPDGene moderate COPD cases (GOLD= 2, n = 1563) and severe to very severe COPD cases (GOLD ≥ 3, n = 1331) to test for association between the phenotypic networks and disease severity. The moderate COPD network had 25 edges under the p-value < 0.05 threshold, while the severe COPD network had 24-slightly fewer connections than the case-control networks in the section above, likely due to smaller sample size (See Additional file 2: Table S1). Globally, the differences between the two networks of COPD cases were less pronounced than the http://www.biomedcentral.com/1752-0509/8/78  population (N=8,141). The topology of the network is identical to the correlation graph in Figure 1, but the edges with significant directionality are oriented.

Whole Population Network, directed
case-control network comparison, with only 17 pairs with significant differential connections according to the permutation testing (Figure 4). However, when we compared the smoking controls with the moderate and severe COPD case groups separately, we observed many pronounced differences in the control/severe COPD comparison, with multiple pairs significantly positively correlated in one network and negatively correlated in another. In many cases, we find the control group and severe COPD group at the opposite ends of the distribution, with the moderate COPD group in the middle. For example, CT Emphysema/6MWD had a negative correlation in the severe COPD network, no correlation in the moderate COPD group, and a strongly positive correlation in the control group. We also found that CT Emphysema and BMI were positively correlated in the control group, not correlated in the moderate COPD group, and negatively correlated in the severe COPD group. Figure 5 shows the BMI-CT Emphysema partial residual plot (the residuals of BMI and CT Emphysema from regressing out the other 8 variables) in the three groups, and we can see that the negative association between BMI and emphysema was only present in severe COPD cases. Table 3 shows the partial correlation coefficients and Pearson correlation coefficients between BMI and emphysema, and we observe that the opposite relationships between the control and COPD groups only became apparent after we regressed out the other 8 variables in the partial correlation framework. These results suggest that partial correlations could provide additional insight about the relationships between these phenotypes beyond standard epidemiological analyses.

Genetic-based network comparison in COPDGene
We also constructed phenotypic networks for COPDGene subjects defined by their genotypes at two SNPs previously associated with COPD in genome-wide association studies: rs1980057 (HHIP) [20,21] and rs7671167 (FAM13A) [22]. Separate networks were built for homozygous samples (2 copies of the COPD-risk allele or 2 copies of the non-risk allele) for each of these SNPs. Note that in both loci, the minor allele has been associated with COPD protection. We only built genotype-stratified phenotypic networks for NHW subjects, as this FAM13A SNP did not have a significant association with COPD in the AA population in previous GWAS [23], and the HHIP SNP was a relatively uncommon variant in AA population (MAF=0.10) with few homozygous minor allele subjects. Using permutation tests, we observed that only a few phenotype pairs significantly differed between these genotype-stratified phenotypic networks, and none of the edges was significant in opposite directions (See Figure 6 and Additional file 3: Table S2, Additional file 4: Table S3). http://www.biomedcentral.com/1752-0509/8/78 The most discordant phenotype pair for FAM13A was FEV1/Emphysema, which was negatively correlated in the FAM13A COPD-non-risk group but not correlated in the FAM13A COPD-risk group. Other pairs that showed differential connection between the two groups include Pi10/Exacerbation Frequency and Age/CT Emphysema. For HHIP, we found FEV1/Exacerbation Frequency to be negatively correlated in both homozygous groups, but the partial correlation was significantly stronger in the COPD-non-risk group than in the COPD-risk group (−0.21 vs. −0.13). There were a few other pairs with only one homozygous group deviating from the null distribution (p-values < 0.05) based on the permutation tests, and the signal was not as strong. Overall, the genetic variables did not have effects on the phenotypic networks that were as great as case-control status or COPD severity.

ECLIPSE network comparison
Finally, we constructed phenotypic networks from ECLIPSE, another independent COPD population, and compared the results between the ECLIPSE and COPDGene networks. The major difference between these two cohorts is that ECLIPSE contains mostly moderate to severe COPD samples (GOLD 2-4) and mostly Caucasians. Therefore, we performed the comparative studies on two sets of sub-populations: (1) All COPD cases (GOLD 2-4, n=2,894 for COPDGene and n=1,705 for ECLIPSE); (2) NHW COPD cases only (n=2,264 for COPDGene and n=1,667 for ECLIPSE). Only 8 out of 10 variables in Table 1 were available in ECLIPSE, therefore the networks were built with 8 nodes and 28 possible edges. The results show that the networks from two populations were very similar (see Additional file 5: Table S4, Additional file 6: Table S5, and Additional file 7: Figure S2.), with minor differences. In all COPD cases, 15 pairs were significant with p<0.001 in COPDGene, out of which 12 were also significant in ECLIPSE (all of them were in the same direction). In white COPD cases, 16 pairs were significant with p<0.001 in COPDGene, out of which 13 were also significant in ECLIPSE. The most striking difference is that Pi10/BMI had the second highest correlation in ECLIPSE (in both analyses), but it was not significant in COPDGene. Overall, the networks from these two populations are reasonably comparable, and most of the strongest connections from COPDGene can be found in another independent population.

Discussion
Complex diseases are assessed using an array of diseaserelated phenotypic variables, which may have subtle, hidden http://www.biomedcentral.com/1752-0509/8/78 relationships that are not captured by standard epidemiological analyses. Understanding the relationships between these disease-related phenotypes in large, wellcharacterized study populations may provide insight into disease heterogeneity. Different approaches have been proposed to study the relationship between multiple phenotypes, including structural equation modeling [24] and mutual information [25]. We have developed an approach for constructing networks of phenotypic variables based on partial correlations between quantitative, diseaserelated phenotypes; for testing the statistical significance of those partial correlations within one phenotypic network; and for comparing those partial correlations between phenotypic networks constructed using different groups of subjects. The correlation-based networks that we analyzed are highly connected and not scale-free, as opposed to the sparse, scale-free networks that are observed in many biological and physical phenomena [2]. This is not surprising, as we built the networks based on a modest number of pre-selected variables closely related to the complex disease of interest.

Moderate vs Severe COPD cases
The correlation-based networks have enabled us to detect novel relationships between disease-related phenotypes that would not have been observed in a singlevariable analysis. Network based approaches are particularly useful in the studies of COPD, which is a complex disease with diverse clinical and molecular phenotypic profiles that might represent different subtypes [26]. In our study, the COPD network in the whole COPDGene study population provided a variety of clinically intuitive observations, such as the central location of gas trapping in the network-which includes both of the major COPD-related causes of airflow obstruction, emphysema and small airway disease. This key role of gas trapping was especially notable in the partially directed network (Figure 2). Comparisons between COPD cases and control subjects showed similar relationships for most variables, but an intriguing switch in the direction of the relationship between body mass index and CT emphysema was observed in controls compared to cases. Cachexia can accompany advanced COPD with severe emphysema, so a negative relationship between BMI and emphysema in COPD cases is clinically reasonable. Since the same radiation dose was used in all COPDGene subjects, the positive relationship between BMI and emphysema in control subjects could relate to the increased radiation noise with http://www.biomedcentral.com/1752-0509/8/78  Figure 5 Partial residual plot. The partial residual plot between BMI and CT Emphysema for the smoking controls (black), moderate COPD cases (green), and severe COPD cases (red) networks. The partial residuals are the residuals of BMI and CT Emphysema from regressing out the other 8 variables.
higher BMI, which could flatten the density histogram and artifactually increase the estimated degree of emphysema using densitometric thresholds. Other phenotypic relationships are less intuitive but may point to important biological pathways. Comparison between moderate and severe/very severe COPD subjects showed a variety of interesting correlations between phenotypes. For example, increased emphysema was associated with reduced exercise capacity (6MWD) in severe COPD subjects but not in the moderate COPD group. Exercise capacity, assessed by 6MWD, includes many components but is likely significantly related to inspiratory capacity. In severe disease, inspiratory capacity is limited by baseline hyperinflation, which is observed by emphysema on CT scan. However, in moderate disease, other parameters are major determinants of 6MWD. Inspiratory capacity may be limiting with dynamic hyperinflation in moderate COPD subjects, but inspiratory capacity will likely not be closely associated with emphysema in this subgroup. It is unclear why airway wall area (Pi10) was significantly correlated with body mass index (BMI) in one of our study populations (ECLIPSE) but not the other (COPDGene). One possible explanation is that the CT radiation dose for ECLIPSE was substantially lower than in COPDGene, and this difference in radiation dose could have impacted how BMI influenced airway wall measurements.
Phenotypic networks have previously been studied in the context of multi-dimensional analyses that have included both phenotypic and genetic information [27,28]. Our method can also be applied in such integrative analyses. In particular, we examined the effects of genetic perturbations on the relationships between the phenotypes. In our COPD example, relatively few phenotypic interactions were different between homozygotes for alternative alleles of COPD GWAS regions near HHIP and FAM13A. Given the modest effects of these variants, and most other complex disease GWAS regions, these results are not surprising. However, the observed differences, such as the FEV1-emphysema relationship in alternate FAM13A genotypes, could provide clues regarding the underlying mechanisms by which these GWAS regions influence disease susceptibility. These results suggest that FAM13A may lead to reduced FEV1 through mechanisms other than increased emphysema, which is  (1) HHIP in non-Hispanic White (NHW) subjects (2 copies of the COPD-risk or non-risk allele) and (2) FAM13A NHW (2 copies of the COPD-risk or non-risk allele). The green edges are present in both groups (p < 0.05) and the partial correlations have the same sign, but the magnitude of effect is significantly different between genotype groups. The black edges are present in one group but not the other.
a testable hypothesis for future research. Similarly, the weaker relationship of FEV1 and exacerbation frequency in the COPD-associated group could indicate that any relationship of the HHIP locus to exacerbations may not be mediated through reduced FEV1. Although published studies have described methods for assessing relationships between disease diagnostic categories in a network context [6,29], we instead focused on multiple disease-related phenotypes within one complex disease. While we believe this represents an important new approach, several limitations of our work need to be acknowledged. It is not clear whether it is preferable to use a weighted network, in which all edges are present but of variable magnitude, or an unweighted network, with an admittedly somewhat arbitrary threshold for placing an edge. Further work will also be required to determine the optimal approach for assessing the impact of genetic factors on phenotypic networks. We have compared alternate homozygous classes, but that approach eliminates the information in the typically larger heterozygous genotype group.

Conclusion
In conclusion, we have presented a framework for analyzing and comparing partial correlations between multiple, quantitative disease-related phenotypes in networks. These phenotypic networks could provide insights into disease susceptibility, disease severity, and genetic mechanisms. Future directions will involve refining the approaches for selecting phenotypes to include in such networks as well as improved approaches for incorporating genetic information. Ultimately, these phenotypic networks may prove useful in developing novel classification systems for complex diseases. http://www.biomedcentral.com/1752-0509/8/78