The emergence of next-generation sequencing and sophisticated microarray technologies has increased the diversity of high-throughput biological experiments. In addition to gene expression profiling, epigenetic data, including DNA methylation and histone modifications, and cancer mutations have been studied comprehensively on a genome-wide scale. Analysis methods based on biological knowledge are essential for translating the results of these experiments into a better understanding of the underlying phenomena and for planning the next stages of research.
Biological knowledge, such as pathways or gene sets, is compiled in various databases. In these databases, biological knowledge is represented as precompiled, discrete sets of genes, such as the “P53 signaling pathway” or the “apoptotic signaling pathway”. These pathways are used by various knowledge-based analysis methods. Over-representation analysis (ORA) is a widely used method that automatically maps a list of genes onto these pathways and determines which pathways or functional gene sets are enriched in an experimentally obtained gene list. ORA is frequently implemented as a web application, such as the NCI-Nature Pathway Interaction Database [1, 2] and the DAVID bioinformatics resources, which receive an input list of genes and calculate p-values based on how frequently the input genes appear in each precompiled gene set. However, ORA merely characterizes the input list of genes with respect to already-known pathways. Thus, researchers can rarely discover anything new related to their input.
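The p-value calculation described above can be sketched as follows. This is a minimal illustration that assumes a one-sided hypergeometric test, a common choice for ORA; the text does not state which test the cited web applications use, and the function name and parameters here are hypothetical.

```python
from math import comb


def ora_p_value(universe_size, gene_set_size, input_size, overlap):
    """Upper-tail hypergeometric p-value, P(X >= overlap).

    Assumption: genes are drawn without replacement from a universe of
    `universe_size` genes, of which `gene_set_size` belong to the pathway.
    `input_size` is the length of the input list and `overlap` is the
    number of input genes that fall inside the pathway.
    """
    max_overlap = min(gene_set_size, input_size)
    total = comb(universe_size, input_size)
    tail = sum(
        comb(gene_set_size, k)
        * comb(universe_size - gene_set_size, input_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def ora(input_genes, gene_sets, universe):
    """Return {gene_set_name: p_value} for every precompiled gene set."""
    universe = set(universe)
    hits = set(input_genes) & universe
    results = {}
    for name, members in gene_sets.items():
        members = set(members) & universe
        results[name] = ora_p_value(
            len(universe), len(members), len(hits), len(hits & members)
        )
    return results
```

A small p-value indicates that the input list overlaps a gene set more than chance would predict. In practice the p-values would also need a multiple-testing correction, which is omitted here.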
Another type of knowledge-based analysis is network-based analysis, which uses an interaction network of biomolecules as the knowledge source. In this type of network, the biomolecules (proteins or genes) correspond to the nodes, and the edges indicate the relationships between the molecules (e.g., “protein A induces protein B” or “protein B phosphorylates protein C”). The assembled network is often called a protein-protein interaction (PPI) network or a biomolecular network, and several methodologies are available for analyzing experimental results using this network-based biological knowledge [4–6]. Many network-based analysis methods extract modules, which are sets of tightly connected nodes containing the input genes, and the genes in a module are expected to carry out a biological function in a coordinated manner. In addition, these modules sometimes include nodes that were not present in the input list. Thus, network-based analysis methods partially overcome the main limitation of ORA, namely its restriction to predefined pathways or gene sets. However, these module-centric methods confine the results of the analysis to the region covered by each module, even though the input genes are spread over the whole biomolecular network. Furthermore, as the modules become larger or more complex, their biological meaning becomes very difficult to interpret.
Consequently, it would be beneficial to identify individual nodes in the network as key molecules that are relevant to the input list of genes. One of the most prominent characteristics of a node in a network is its degree, or number of neighbors. However, the degree carries information only about a node's immediate neighbors, and similarly, other network measures, such as the clustering coefficient and assortativity, merely reflect the node's local neighborhood. In contrast, certain node centralities can determine the importance of each node by taking into consideration the topology of the entire network. Although there are various types of centralities, such as degree centrality, closeness centrality, eigenvector centrality, and betweenness centrality, almost all of them are known to correlate with the degree of the node. Partly because the role of hub nodes in biomolecular networks remains an intensive research target [9–11], methods based on these centralities [12–14] tend to produce analysis results that are biased toward major hub nodes.
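The distinction between local measures and whole-network centralities can be illustrated with a small sketch. The toy graph, its node names, and the helper functions below are hypothetical and are not taken from the study; the network is assumed to be undirected and unweighted, and closeness uses the standard definition based on shortest-path distances.

```python
from collections import deque

# Hypothetical toy network as an undirected adjacency map.
toy_network = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}


def degree(adj, node):
    """Local measure: number of direct neighbors."""
    return len(adj[node])


def clustering_coefficient(adj, node):
    """Local measure: fraction of neighbor pairs that are themselves linked."""
    neighbors = list(adj[node])
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in adj[neighbors[i]]
    )
    return 2.0 * links / (k * (k - 1))


def closeness_centrality(adj, node):
    """Global measure: reachable node count divided by total BFS distance."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbor in adj[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0
```

The first two functions inspect only a node's neighbors, whereas closeness requires a traversal of every reachable node, which is why it reflects the topology of the entire network.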
In this study, we present a new network-based method for identifying hidden key molecules, that is, molecules that are biologically relevant to the input but do not have as many neighbors as the major hub nodes. We have developed a centrality measure derived from betweenness centrality [15, 16], named node-limited betweenness centrality (nlBC). First, we validated the method using a well-known pathway, the ErbB (EGFR) signaling pathway. Next, we applied it to a practical cancer mutation dataset and explored the applicability of our method.
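For orientation, the standard betweenness centrality from which nlBC is derived can be sketched as below. This is the ordinary, unrestricted measure computed with a Brandes-style accumulation on an undirected, unweighted graph; it is not the nlBC definition, which this section does not give. The function name and the example graph are hypothetical.

```python
from collections import deque


def betweenness_centrality(adj):
    """Standard (unnormalized) betweenness for an undirected, unweighted graph.

    `adj` maps each node to the set of its neighbors. For every node, the
    result sums, over all unordered pairs of other nodes, the fraction of
    shortest paths between the pair that pass through that node.
    """
    score = {v: 0.0 for v in adj}
    for source in adj:
        order = []                          # nodes in order of BFS discovery
        preds = {v: [] for v in adj}        # shortest-path predecessors
        n_paths = {v: 0 for v in adj}       # number of shortest paths from source
        dist = {v: -1 for v in adj}
        n_paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:             # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        dependency = {v: 0.0 for v in adj}
        for w in reversed(order):           # accumulate from the farthest nodes back
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1.0 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each unordered pair was counted from both endpoints.
    return {v: s / 2.0 for v, s in score.items()}


# Hypothetical example: a triangle A-B-C with a tail C-D-E.
example = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
scores = betweenness_centrality(example)
```

In this example, node D has only two neighbors yet lies on every shortest path to E, which shows how betweenness can highlight nodes that a degree-based ranking would place low.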