Data extraction and processing
Papers related to E. coli in PubMed were obtained by searching PubMed for “E. coli” OR “Escherichia coli". Papers related to Brucella in PubMed were obtained by searching PubMed for “Brucella OR brucellosis". The PubMed IDs (PMIDs), titles, abstracts, and MeSH terms of all articles related to E. coli and Brucella that had been parsed from PubMed using the PubMed literature XML format were downloaded from PubMed, including over 300,000 E. coli-related papers and over 15,000 Brucella-related papers. The parsed and downloaded literature information was then stored in a pre-defined MySQL database.
The community-based EcoGene database [16] was utilized to obtain the information of a comprehensive list of E. coli genes. For each gene, the information obtained from the EcoGene database includes EcoGene ID, gene symbol, gene symbol synonyms, protein name, and different protein synonyms. Normalized Brucella gene names were obtained from genome-wide ortholog Brucella gene analysis and gene name normalization as described in our previous study [41]. Basically, those ortholog genes with different names were grouped, and the different names become synonyms. A manual annotation was also applied to confirm the results of the ortholog-based grouping. In this study, each bacterial gene was identified by a primary symbol and protein name, together with a list of possible gene and protein synonyms. During text searching, gene symbols were defined as case-sensitive, except for the first letter. This approach identified and distinguished genes such as “folD” or “FolD” from the word “fold". Hypothetical and unknown genes lacking distinct gene symbols or protein names were not discussed in publications and hence discarded (Step 1 in Figure 1). For each E. coli or Brucella gene, the name matching method was used to identify all publications that contained specific gene or protein names (or their synonyms) shown in the title or abstract of each manuscript.
These publications were defined as related to the gene (Step 2 in Figure 1). From each publication identified, the MeSH terms assigned to the publication were retrieved and updated according to the MeSH term weighting as described below. From this information the gene-MeSH matrix that contains the frequency of occurrences of all MeSH terms listed for individual E. coli genes was formulated (Step 3). The gene-gene matrix was generated by calculating the dissimilarity score between every gene pair based on the methods described below (Step 4). Once all gene pair-wise dissimilarities were computed, all the dissimilarities were sorted, and the empirical P-value for each gene pair were calculated based on its ranked position in the sorted dissimilarity scores. Hierarchical clustering was implemented using the R hclust program (Step 5 in Figure 1).
Optimization of weighting and dissimilarity calculations
(1) MeSH term weighting:
MeSH term weighting is based on TF*IDF [17]. Specifically, TF is the MeSH term frequency in all PubMed articles associated with a specific E. coli gene. IDF is the Inverse Document Frequency (IDF) used to weigh the value for each MeSH term. For a specific MeSH term i, IDF is first implemented using the classical logarithm method shown below:
The number of occurrences of all MeSH terms in the database is calculated by counting the total occurrences of this MeSH term in all 560,757 PubMed articles related to 33 representative bacteria or viruses as described in our publication concerned with a pathogen-host interaction data integration and analysis system (PHIDAS) [42]. Additional file 1 provides the full list of these 33 bacteria and viruses. The selection of 33 organisms other than E. coli alone was to make the MeSH term analysis broader in scope. The number of occurrences of the MeSH term i is defined as the frequency of the MeSH term in the database associated with E. coli only.
A separate, square root-based IDF weighting scheme was also implemented and tested:
All the terms defined in this scheme are the same as the classical logarithm method. As described in the Results section, this square root-based IDF weighting method was compared with the classical logarithm method in a Receiver Operating Characteristic (ROC) study (Figure 2).
(2) Six functions for calculating the dissimilarity score between two genes:
Six widely cited functions used for calculating distances or dissimilarity scores were explored [19–21]. The terms used are defined as follows:
n = number of unique MeSH terms
X = (x
1
, ..., x
n
), where x
n
= number of papers associated with term i for gene a,
Y = (y
1
, ..., y
n
), where y
n
= number of papers associated with term i for gene b,
X and Y are defined as vector representations of two genes, denoting the frequencies of MeSH terms associated with each gene. Given these definitions, the four similarity functions shown below were evaluated:
Two dissimilarity functions were also implemented:
(3) Calculation of dissimilarity scores based on weighted MeSH terms:
The revised dissimilarity measure (D
M
) based on the Cosine coefficient is defined as:
where i is a specific MeSH term, w
i
is the weight assigned to the i th MeSH term (one of the two IDF-based weighting schemes). In the Cosine coefficient formula, the x
i
and y
i
have been changed to (w
i;
x
i
) and (w
i;
y
i
), respectively. The dissimilarity scores based on other similarity coefficients are defined in a similar manner.
The revised dissimilarity measure (D
M
) based on the Manhattan distance is defined as:
where the variables are defined as the same as shown above. The revised dissimilarity measure based on the Euclidean distance is defined similarly.
(4) Verification and optimization of MeSH term weighting and dissimilarity score calculation
To test whether the actual quantitative value in the MeSH term dissimilarity measure is indicative of the relationships of the two selected genes, the ROC analysis was applied [23]. Genes from 13,549 gene pairs of transcriptional factors and their individually regulated genes available in RegulonDB [22] were used as the gold standard. The calculation methods described above were used to calculate the specificity and sensitivity of analyzing the gene-gene relationships using the true gene pairs in the gold standard data compared to the same number of randomized gene pairs in the GenoMesh database. One hundred gene pairs were selected randomly from the standard set, and 100 pairs were selected randomly from the GenoMesh database. The true positive rate (Sensitivity) and false positive rate (1-Specificity) were then calculated based on gradually increasing dissimilarity cut-off values (between 0 and 1). The calculations were repeated 100 times and the averages recorded. A ROC curve was plotted for all sets of data to verify the GenoMesh algorithm and to optimize the method of calculating a MeSH-based dissimilarity score based on data in the literature.
Development of the GenoMesh web server
The GenoMesh web server (http://genomesh.hegroup.org) was developed using a three-tier architecture built on two HP ProLiant DL380 G6 servers which run the Redhat Linux operating system (Redhat Enterprise Linux ES 5). Users can submit database or analysis queries through the web. The queries are processed using PHP/Perl/SQL (middle-tier, application server based on Apache) against a MySQL (version 5.0) relational database (back-end, database server). The result of each query is presented to the user in the web browser. Two servers are regularly scheduled to backup each other's data. The GenoMesh system currently contains five programs: 1) GeneMesh, searching MeSH terms (or genes) from a gene (or MeSH) query; 2) GenePair, analysing a designated gene pair; 3) GeneCluster, displaying the hierarchical clustering results; 4) GeneNet, predicting a gene interaction network based on a user-defined gene list; 5) MeSHBrowse, browsing MeSH tree for MeSH terms and predicted genes and gene interaction network for each MeSH term. General MeSH terms and structures are extracted from the MeSH website (http://www.nlm.nih.gov/mesh). The images of the interaction networks are generated automatically with an internally developed script based on the graph visualization software Graphviz (http://www.graphviz.org).
Prediction of gene-to-gene relationships and networks using GenoMesh
To test the ability of GenoMesh to predict gene-to-gene interactions lacking direct literature support, all E. coli literature data were separated into two parts, literature published before January 1, 2004 and after January 1, 2004. The literature published before 2004 was used for predicting gene-to-gene interactions. The results were verified using the results published after 2004.
To evaluate whether gene pairs in the same pathway have lower GenoMesh dissimilarity scores than gene pairs from a random group of genes, a list of known E. coli pathways from the EcoCyc pathway website [43] was collected. To avoid uncertainties attributed to minor pathways, pathways containing less than ten genes were excluded. For biological pathways containing N related genes, the GenoMesh dissimilarity value for all n(n-1)/2 gene pairs d
ij
, i, j = 1, ..., n, was calculated, and the average
taken as the average GenoMesh value for the pathway. N genes were randomly selected from the E. coli genome and their pair-wise dissimilarity values calculated. The average of these values is denoted as d
0
. The same procedure was repeated 100,000 times to obtain d
i,
0 i = 1, ..., 100,000. The value obtained was used to approximate the null distribution of the average GenoMesh value for gene groups of size N. The empirical p-value was calculated as
The sample mean
and variance
of the sample of all pair-wise GenoMesh values can be estimated. Basically, such a p-value is a permutation p-value determined empirically by repeating the same process many times to see how many times the test result was significant. There is only one test. Therefore, a multiple test correction is not required.
For pathways with large n(n-1)/2 values, the central limit theorem can be used to derive the asymptotic distribution for average GenoMesh values for a random group of n genes, which is normal with mean μ and variance 2σ
0
2/(n(n-1)). Hence the asymptotic z-value can be calculated as
The exhaustive MeSH term dissimilarity value calculations for all of the possible E. coli gene pairs allows analysis of the relatedness of gene pairs without using reported studies (no overlapped references).