To identify a reliable gene signature of breast cancer by integrating existing gene signatures, we propose a graph-centrality-based method that identifies disease genes from a context-constrained protein interaction network (PIN). An overview of the method is provided in Figure 1. Briefly, the method has three steps:
1. Collect genes from different breast cancer gene signatures, and discard the genes that appear in only one signature.
2. Project the genes collected in Step 1 onto the human PIN to construct a context-constrained PIN. To some extent, all genes in this context-constrained network are therefore related to breast cancer. However, it is not yet known which of these genes are the most important to breast cancer and can be used to predict clinical outcome.
3. Calculate the graph centrality of each gene in the constrained PIN to determine the relationship between genes and breast cancer. Because the constrained PIN is built from breast cancer gene signatures, the graph centrality of a gene in this network indicates its relationship to breast cancer. Output a given number of genes with the highest graph centrality as the new unified breast cancer signature.
Details of the three steps are described in the following three subsections, followed by the validation methods.
Collecting genes from different signatures
GeneSigDB (http://compbio.dfci.harvard.edu/genesigdb/) [17] is a curated database of gene signatures for various species and diseases. We searched GeneSigDB for human breast cancer signatures using the keywords "breast cancer" for disease and "human" for species. This search returned 94 distinct human breast cancer signatures, reported in 58 publications. Because genes that are included in only one gene signature may have been selected by chance, we discarded these unreliable genes.
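The filtering rule above can be sketched as follows. This is a minimal illustration, assuming each signature is available as a list of gene symbols; the function name `recurrent_genes`, the `min_signatures` parameter, and the toy signatures are hypothetical and are not taken from GeneSigDB.

```python
from collections import Counter

def recurrent_genes(signatures, min_signatures=2):
    """Return genes that appear in at least `min_signatures` signatures.

    `signatures` maps a signature name to a list of gene symbols
    (hypothetical input format, assumed for this sketch).
    """
    counts = Counter()
    for genes in signatures.values():
        # Count each gene once per signature, even if it is listed twice.
        counts.update(set(genes))
    return {gene for gene, c in counts.items() if c >= min_signatures}

# Toy input, invented for illustration only.
toy_signatures = {
    "sigA": ["BRCA1", "TP53", "ESR1"],
    "sigB": ["TP53", "ERBB2"],
    "sigC": ["ESR1", "TP53", "TP53"],
}
kept = recurrent_genes(toy_signatures)
```

In this toy input, BRCA1 and ERBB2 each occur in a single signature and are dropped.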
Construction of context-constrained PIN
A complete human PIN was constructed by integrating protein interaction data from the Human Protein Reference Database (HPRD) and the BioGRID interaction database [18, 19]. After removing duplicate edges and self-interactions, we obtained a PIN consisting of 51057 distinct interactions among 11465 proteins.
Then, the genes collected in the first step were projected onto the complete human PIN to obtain a constrained PIN for human breast cancer. This constrained PIN contains 2924 proteins and 4698 interactions.
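A minimal sketch of these two operations is given below. It assumes that each interaction source is a list of protein pairs and that projection keeps only the interactions whose two endpoints are both in the collected gene set; proteins left without any retained interaction are dropped, which is an assumption of this sketch rather than a detail stated above. The names `build_pin` and `constrain` and the toy edges are hypothetical.

```python
def build_pin(edge_lists):
    """Merge several edge lists into one set of undirected interactions,
    removing duplicate edges and self-interactions."""
    edges = set()
    for edge_list in edge_lists:
        for a, b in edge_list:
            if a != b:                       # drop self-interactions
                edges.add(frozenset((a, b)))  # (a, b) and (b, a) collapse
    return edges

def constrain(edges, genes):
    """Keep interactions whose endpoints are both in `genes` and
    return the result as an adjacency dict of sets."""
    adjacency = {}
    for edge in edges:
        a, b = tuple(edge)
        if a in genes and b in genes:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    return adjacency

# Toy edge lists standing in for the two interaction sources.
source_1 = [("A", "B"), ("B", "A"), ("C", "C"), ("B", "C")]
source_2 = [("A", "B"), ("C", "D")]
pin = build_pin([source_1, source_2])
constrained = constrain(pin, {"A", "B", "C"})
```

Storing each interaction as a `frozenset` makes the edge direction irrelevant, so duplicates across sources are removed by the set itself.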
Use graph centrality to quantify the relationship between genes and breast cancer
Various definitions of graph centrality have been proposed from different perspectives to evaluate the importance of nodes in a graph. The concept has been widely used in bioinformatics, for example in the discovery of essential proteins in protein networks [20]. Because it is difficult to infer which definition is best for identifying disease genes in the context-constrained network, we evaluated six different definitions in our work.
For a protein interaction network G(V,E), the six measures of centrality used in this study are defined as follows:
• Degree centrality (DC): The degree centrality DC(i) of node i is the number of edges connecting node i to its neighbors [21]:
$$DC(i) = \mathrm{Deg}(i) \tag{1}$$
where Deg(i) is the degree of node i.
• Betweenness centrality (BC): The betweenness centrality BC(i) of node i is the average fraction of shortest paths that pass through node i [22]:
$$BC(i) = \sum_{s \neq i \neq t} \frac{\sigma_{st}(i)}{\sigma_{st}} \tag{2}$$
where $\sigma_{st}$ denotes the total number of shortest paths between s and t, and $\sigma_{st}(i)$ denotes the number of shortest paths from s to t that pass through node i.
• Closeness centrality (CC): The closeness centrality CC(i) of node i can be defined as [23]:
(3)
CC is a global metric which describes how the given node i connects to other nodes.
• Subgraph centrality (SC): The subgraph centrality SC(i) of node i can be defined as [24]:
$$SC(i) = \sum_{l=0}^{\infty} \frac{\mu_l(i)}{l!} \tag{4}$$
where $\mu_l(i)$ denotes the number of closed walks of length l that start and end at node i.
• Eigenvector centrality (EC): The eigenvector centrality EC(i) of node i is defined as the i-th component of the principal eigenvector of A, where A is the adjacency matrix of the network. Let λ be an eigenvalue of A and e the corresponding eigenvector, so that λe = Ae. Then EC(i) = $e_1(i)$, where $e_1$ is the eigenvector corresponding to the largest eigenvalue of A [25].
• Information centrality (IC): The information centrality IC(i) of node i is defined as [26]:
$$IC(i) = \left[ \frac{1}{n} \sum_{j} \frac{1}{I_{ij}} \right]^{-1} \tag{5}$$
where n is the number of nodes in graph G and $I_{ij} = (r_{ii} + r_{jj} - r_{ij})^{-1}$, where $r_{ij}$ is an element of the matrix R. Let D be the diagonal matrix of the weighted degree of each node and J be a matrix with all elements equal to one. Then $R = (r_{ij}) = [D - A + J]^{-1}$. For computational purposes, $I_{ii}$ is defined as infinite. Thus, $1/I_{ii} = 0$.
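To make two of the measures above concrete, the following sketch computes degree centrality and betweenness centrality on a small unweighted, undirected network stored as an adjacency dict. It is an illustration under stated assumptions, not the implementation used in this study: betweenness is accumulated with Brandes' algorithm, each unordered pair (s, t) is counted once, and no normalization is applied. The function names and the toy graphs are hypothetical.

```python
from collections import deque

def degree_centrality(adjacency):
    """DC(i): number of edges incident to node i."""
    return {v: len(neighbors) for v, neighbors in adjacency.items()}

def betweenness_centrality(adjacency):
    """BC(i): sum over pairs (s, t) of the fraction of shortest s-t
    paths through i. Assumed convention: unordered pairs, unnormalized."""
    bc = {v: 0.0 for v in adjacency}
    for s in adjacency:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adjacency}
        sigma = {v: 0 for v in adjacency}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies from the farthest nodes.
        delta = {v: 0.0 for v in adjacency}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair was visited from both endpoints, so halve.
    return {v: value / 2 for v, value in bc.items()}

# Toy graphs: a path a-b-c and a 4-cycle a-b-c-d-a.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
square = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
dc_path = degree_centrality(path)
bc_path = betweenness_centrality(path)
bc_square = betweenness_centrality(square)
```

In the 4-cycle, each node lies on one of the two shortest paths between its two neighbors, which gives a betweenness of 0.5 under the convention assumed here.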
High centrality of a gene indicates that it is important to the constrained PIN and probably plays an important role in the mechanism of breast cancer development. Therefore, ranking genes by graph centrality yields a gene list ordered by importance to human breast cancer. Depending on the specific purpose, a given number of top-ranked genes can be selected to construct a reliable gene signature of breast cancer. This reliable gene signature integrates the disjoint original signatures.
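The selection step can be sketched as a sort followed by a cut-off. The snippet assumes centrality scores are already available as a dict from gene to score; the function name `top_genes`, the alphabetical tie-breaking rule, and the toy scores are assumptions made for this illustration.

```python
def top_genes(centrality, k):
    """Return the k genes with the highest centrality.

    Ties are broken alphabetically by gene name (an assumed rule,
    chosen only to make the ranking deterministic).
    """
    ranked = sorted(centrality.items(), key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in ranked[:k]]

# Toy scores, invented for illustration only.
toy_scores = {"A": 1, "B": 3, "C": 2, "D": 3}
signature = top_genes(toy_scores, 2)
```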
KEGG pathway and Gene Ontology enrichment analysis
The p-value based on the hypergeometric distribution is widely used to measure the extent to which clusters are annotated by a specific GO term [27–30]. The p-value is defined as follows:
$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{n}{i}\binom{G-n}{C-i}}{\binom{G}{C}} \tag{6}$$
where C is the size of the gene set, which contains k genes with a given GO term, and G is the size of the universal set of known genes, which contains n genes with the annotation.
A low P in Equation 6 indicates that the module closely corresponds to the GO annotation, because the network is unlikely to produce such a module by chance. To simplify our analysis, we define the p-score for the annotation as the negative of log(P) [31].
Gene set enrichment analysis for KEGG pathways is very similar to that for GO annotations. In Equation 6, C is the size of the gene set, which contains k genes that exist in a given KEGG pathway; G is the size of the universal set of known genes, which contains n genes that exist in the pathway. Similarly, the p-score can be used to measure the relationship between the gene set and a specific KEGG pathway.
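As a worked illustration of the hypergeometric p-value and the p-score, the sketch below evaluates the upper tail of the hypergeometric distribution, that is, the probability of observing at least k annotated genes in a set of size C drawn from G genes of which n are annotated. The function names are hypothetical, and the base-10 logarithm in `p_score` is an assumption, since the base of the logarithm is not specified above.

```python
import math

def enrichment_p_value(G, n, C, k):
    """Probability of at least k annotated genes in a gene set of size C,
    when n of the G known genes carry the annotation."""
    if not (0 <= n <= G and 0 <= C <= G and 0 <= k <= min(n, C)):
        raise ValueError("inconsistent set sizes")
    total = math.comb(G, C)
    # Sum the upper tail directly; this equals one minus the sum of the
    # terms for 0..k-1 and avoids cancellation for very small p-values.
    tail = sum(
        math.comb(n, i) * math.comb(G - n, C - i)
        for i in range(k, min(n, C) + 1)
    )
    return tail / total

def p_score(p, base=10):
    """Negative logarithm of the p-value (base 10 is an assumption)."""
    return -math.log(p, base)

# Toy numbers: 10 known genes, 4 annotated, gene set of 3 with 2 annotated.
p = enrichment_p_value(G=10, n=4, C=3, k=2)
```

With these toy numbers the tail is (6·6 + 4·1)/120 = 1/3.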
In this study, both the KEGG and the GO enrichment analyses were performed using DAVID [32].
Validate on microarray dataset
To evaluate the signature's ability to predict clinical outcome, we used the expression intensities of the genes in the signature to cluster microarray datasets of breast cancer patients with different pathologic parameters. Patients with similar pathologic parameters should be clustered together. For a given pathologic parameter, the p-value of the clustering result indicates the signature's ability to predict that parameter.
In this study, the Euclidean distances between samples were calculated from the expression intensities of the genes in the gene signature. Hierarchical clustering was then used to cluster the microarray datasets of breast cancer patients.
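The clustering step can be sketched as follows. This is a small illustration under assumptions, not the procedure used in this study: the linkage criterion is not stated above, so average linkage is assumed, the tree is cut at a fixed number of clusters, and the sample names and expression values are invented.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hierarchical_clusters(profiles, n_clusters=2):
    """Agglomerative clustering with average linkage (assumed).

    `profiles` maps a sample name to the expression values of the
    signature genes, in the same gene order for every sample.
    """
    clusters = [[sample] for sample in sorted(profiles)]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(euclidean(profiles[a], profiles[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the closest pair of clusters.
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

# Toy expression profiles for four hypothetical patients and two genes.
toy_profiles = {
    "p1": [1.0, 1.1],
    "p2": [0.9, 1.0],
    "p3": [5.0, 5.2],
    "p4": [5.1, 4.9],
}
groups = hierarchical_clusters(toy_profiles, n_clusters=2)
```

The resulting groups can then be compared with the patients' pathologic parameters; the statistical test used for that comparison is not covered by this sketch.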