Predicting implicit associated cancer genes from OMIM and MEDLINE by a new probabilistic model
© Zhu et al; licensee BioMed Central Ltd. 2007
Published: 8 May 2007
Discovering cancer-associated genes can facilitate the understanding of tumour pathogenesis, medical diagnosis and the treatment of patients. Here we mined OMIM and MEDLINE to discover implicitly associated cancer genes by applying a new probabilistic model, the mixture aspect model (MAM), to cancer-gene co-occurrence data in OMIM and MEDLINE. In cross-validation experiments, incorporating gene-gene co-occurrence pairs from MEDLINE into the cancer-gene co-occurrence pairs from OMIM improved the accuracy of predicting associated cancer genes. Furthermore, some implicitly associated cancer genes were predicted and given a preliminary analysis. The detailed results are available online at http://www.bic.kyoto-u.ac.jp/pathway/zhusf/CancerInformatics/Supplemental2006.html for reference by interested researchers and for further validation by biologists.
Materials and methods
We extracted cancer-gene and cancer-cancer co-occurrence pairs from OMIM, a human-curated knowledgebase on human genes and inherited diseases. The software tool CGMIM was used to process the description sections of OMIM entries to obtain cancers and their associated genes. This software maps genetic disorders to 21 different types of cancer. To avoid the difficulty of recognizing gene names in free text, we used a human-curated database, Entrez Gene, to obtain a subset of high-quality MEDLINE records, from which we obtained gene-gene co-occurrence data.
We previously proposed MAM to mine implicit "chemical compound-gene" relations by integrating three types of co-occurrence data (compound-compound, gene-gene and compound-gene) from the literature. The main advantage of MAM is its ability to integrate different types of co-occurrence data from heterogeneous data sources. MAM was first estimated by an EM algorithm to fit the existing co-occurrence data of cancers and genes, and was then used to predict the likelihood of an association for an unobserved cancer-gene pair. See Table 1.
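The fit-then-predict procedure described above can be illustrated with a much simpler relative of MAM. The following is a minimal sketch, not the authors' model: it fits a single aspect model, P(c, g) = sum_z P(z) P(c|z) P(g|z), to one type of co-occurrence count by EM, whereas MAM mixes several aspect models over several co-occurrence types. The function names, the number of aspects and the toy counts (with placeholder labels such as `cancer_A` and `gene_1`) are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def fit_aspect_model(counts, n_aspects=2, n_iter=100, seed=0):
    """Fit P(c, g) = sum_z P(z) P(c|z) P(g|z) to co-occurrence counts by EM.

    counts maps (cancer, gene) pairs to positive co-occurrence counts.
    Returns the fitted model and the log-likelihood at each iteration.
    """
    rng = random.Random(seed)
    cancers = sorted({c for c, _ in counts})
    genes = sorted({g for _, g in counts})

    def random_dist(keys):
        # Random, strictly positive, normalized starting distribution.
        weights = {k: rng.random() + 0.1 for k in keys}
        total = sum(weights.values())
        return {k: w / total for k, w in weights.items()}

    p_z = [1.0 / n_aspects] * n_aspects
    p_c = [random_dist(cancers) for _ in range(n_aspects)]
    p_g = [random_dist(genes) for _ in range(n_aspects)]
    history = []

    for _ in range(n_iter):
        acc_z = [0.0] * n_aspects
        acc_c = [defaultdict(float) for _ in range(n_aspects)]
        acc_g = [defaultdict(float) for _ in range(n_aspects)]
        loglik = 0.0
        for (c, g), n in counts.items():
            # E-step: posterior over latent aspects for this observed pair.
            joint = [p_z[z] * p_c[z][c] * p_g[z][g] for z in range(n_aspects)]
            norm = sum(joint)
            loglik += n * math.log(norm)
            for z in range(n_aspects):
                resp = n * joint[z] / norm
                acc_z[z] += resp
                acc_c[z][c] += resp
                acc_g[z][g] += resp
        history.append(loglik)
        # M-step: re-estimate parameters from the expected counts.
        total = sum(acc_z)
        for z in range(n_aspects):
            p_z[z] = acc_z[z] / total
            if acc_z[z] > 0.0:
                p_c[z] = {c: acc_c[z][c] / acc_z[z] for c in cancers}
                p_g[z] = {g: acc_g[z][g] / acc_z[z] for g in genes}
    return (p_z, p_c, p_g), history


def pair_probability(model, cancer, gene):
    """Model probability of a (cancer, gene) pair, observed or not."""
    p_z, p_c, p_g = model
    return sum(p_z[z] * p_c[z].get(cancer, 0.0) * p_g[z].get(gene, 0.0)
               for z in range(len(p_z)))


# Toy counts with placeholder labels (hypothetical, not data from OMIM).
counts = {
    ("cancer_A", "gene_1"): 3, ("cancer_A", "gene_2"): 2,
    ("cancer_B", "gene_1"): 2,
    ("cancer_C", "gene_3"): 3, ("cancer_C", "gene_4"): 2,
    ("cancer_D", "gene_3"): 2, ("cancer_D", "gene_4"): 1,
}
model, history = fit_aspect_model(counts)

# Rank the genes never observed with cancer_B by their model probability.
unseen = [g for g in ("gene_2", "gene_3", "gene_4")]
ranking = sorted(unseen, key=lambda g: pair_probability(model, "cancer_B", g),
                 reverse=True)
print(ranking)
```

Because the latent aspects are shared across cancers and genes, the fitted model assigns a nonzero probability to pairs that never co-occur in the input, which is what makes ranking unobserved pairs possible. The ranking obtained on such a toy example depends on the random initialization and the number of aspects, so it should be read as an illustration of the mechanism only.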
Table 1. AUCs and t-values obtained in the cross-validation experiment. Headings: the ratio of training to test data; the size of co-occurrence datasets.
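For readers unfamiliar with the two statistics reported for the cross-validation experiment, the sketch below shows one common way to compute them. It rests on assumptions: the source does not say how the t-values were obtained, so a paired t-statistic over per-run AUCs of two settings is used here as a plausible reading, and the function names and all numbers are illustrative rather than results from this work.

```python
import math


def auc(pos_scores, neg_scores):
    """AUC as the probability that a held-out true pair outscores a
    non-associated pair (ties count one half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def paired_t(a, b):
    """Paired t-statistic for matched measurements a[i] and b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    k = len(diffs)
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)
    return mean / math.sqrt(var / k)


# Made-up model scores for held-out associated and non-associated pairs.
print(auc([0.9, 0.4], [0.5, 0.1]))

# Made-up per-run AUCs for two settings, e.g. with and without extra data.
with_extra = [0.80, 0.82, 0.84]
without_extra = [0.70, 0.71, 0.72]
print(paired_t(with_extra, without_extra))
```

A large positive t-value under this reading would indicate that the first setting gives consistently higher AUCs across runs than the second.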
For each type of cancer, we list the top specific implicitly associated gene.
This work is partly supported by JSPS (Japan Society for the Promotion of Science) Postdoctoral Fellowship.
- Zhu S, Okuno Y, Tsujimoto G, Mamitsuka H: A probabilistic model for mining implicit 'chemical compound-gene' relations from literature. Bioinformatics. 2005, 21 (Suppl 2): ii245-ii251. 10.1093/bioinformatics/bti1141
- Bajdik CD, Kuo B, Rusaw S, Jones S, Brooks-Wilson A: CGMIM: automated text-mining of Online Mendelian Inheritance in Man (OMIM) to identify genetically-associated cancers and candidate genes. BMC Bioinformatics. 2005, 6: 78-84. 10.1186/1471-2105-6-78
- Bharaj BB, Luo LY, Jung K, Stephen C, Diamandis EP: Identification of single nucleotide polymorphisms in the human kallikrein 10 (KLK10) gene and their association with prostate, breast, testicular, and ovarian cancers. Prostate. 2002, 51 (1): 35-41. 10.1002/pros.10076
- Zhu S, Okuno Y, Tsujimoto G, Mamitsuka H: Application of a new probabilistic model for mining implicit associated cancer genes from OMIM and Medline. Cancer Informatics. 2006, 2: 361-371.
This article is published under license to BioMed Central Ltd.