Vectorial representation of chemical compounds
We looked for a metabolic representation that: (i) allows all chemical species to be represented in the same space regardless of their composition, molecular size, etc. (in order to compare and operate with them); (ii) imposes no a priori definition of important chemical features (functional groups, etc.), since a representation based on a pre-defined set of functional groups would necessarily restrict the results obtained to those that can be given in terms of those groups; and (iii) allows chemical reactions and pathways to be represented concomitantly in the same space (see next point).
We represent a chemical structure with a vector in which each component represents a triplet of neighboring atoms, and its value indicates the frequency of that triplet in the molecule (Figure 1). Using pairs of atoms instead of triplets would lead to shorter vectors (fewer possible combinations), but would not capture many aspects of the molecule's functionality; the chemical properties of an atom depend mostly on its neighbors up to distance 2. For example, the different properties of the carbonyl group C = O depend on the neighbors of the C: OH-CR = O (acid), H-CR = O (aldehyde), R-CR = O (ketone), etc. On the other hand, groups of 4, 5, ... neighboring atoms would produce vectors with too many components (many possible combinations). We previously used a similar representation to train a machine learning system for predicting the biodegradability features of chemical compounds [26] (see Conclusion).
This representation is not perfectly univocal, since in some cases different molecules can have the same vector (stereoisomers, positional isomers, etc.). Nevertheless, it provides a good balance between information content and utility for the goal of this work. The representation contains more structural information than is first apparent. For example the bond types, although not explicitly considered in the representation, are implicitly taken into account to some extent: as we go from hydrocarbons with few double and triple bonds (highly saturated) to hydrocarbons with many multiple bonds (unsaturated), we gain CCC triplets at the expense of HCH and HCC. In the "Results" section, we demonstrate that this representation captures important physico-chemical properties of compounds and can be used for predictive purposes.
Vectorial representation of biochemical reactions
The vector for a reaction that transforms one chemical compound into another is just the difference between the vectors of the two compounds (product minus substrate) (Figure 1). The stoichiometry is taken into account by multiplying the compound vectors by the corresponding stoichiometric coefficients: e.g. for a reaction A -> 2B the reaction vector would be VA->B = 2·VB - VA.
Since this vectorial representation of reactions is based on that of the compounds, it inheres many of its characteristics: (i) any chemical reaction can be represented in the same space regardless of its characteristics (and hence operations and comparisons can be performed on them); (ii) there is no a priori definition of important reaction features. Moreover, the components of the reaction vector represent the triplets that are changed in the chemical reaction so they neatly encode the characteristics of the transformation (Figure 1). Another intuitive feature is that the vector of a given reaction is the opposite of the vector of the inverse reaction (VA->B = -VB->A) (Figure 1).
In this representation, all chemical reactions have to be simplified as transformations of one single compound into another. For a reaction with more than one reactant and/or product, a "main" pair substrate ->product has to be chosen and information about the other substrates/products becomes part of the vector components. This is not a problem since in metabolism we can usually define a "main transformation" and as such is annotated in metabolic databases (see "Datasets" below).
So, in this multidimensional "metabolic space", chemical compounds are points (or the corresponding vectors), chemical reactions can be seen as arrows (difference vectors) going from one point (substrate) to another (product), and biochemical pathways can be seen as sets of consecutive reaction vectors (a "walk" within this space) (Figure 1).
Datasets
We took from KEGG [14] all the metabolic reactions annotated for the model organism E. coli. For each reaction, the substrate→product transformations annotated as "main" in the "RPAIR" field in KEGG are taken. The molecular structures for the two compounds involved in this transformation are also retrieved from KEGG in .mol format. Hydrogens are added with the OpenBabel package http://www.openbabel.org. Open Babel is also used to transform the .mol files into .pdb since the bond information required to calculate the triplets is easier to obtain from .pdb files. Finally, we calculate the two triplet vectors for the compounds, and the corresponding triplet vector for the reaction as the difference (Figure 1). We end up with 1972 compound vectors and 1033 reaction vectors. We can have more than one vector for a given EC code, depending on whether that enzyme activity is associated with (acting in) more than one reaction.
Both reaction and compound vectors are defined by 60 components (triplets). These triplets were not imposed a priori. They are the ones with the highest frequencies in our dataset. We excluded some triplets with very low frequency, involving rare atoms and appearing in very few compounds. Some triplets involve "unknown" atoms ("X"). These come from molecules in KEGG for which some parts are not detailed at the atomic level but indicated by a general group "-R". These "R" groups (and hence our "X") usually represent biological polymers (peptides, poly-sugars, DNA, etc).
In order to study the relationships between our vectorial representation of reactions and their functional characteristics, additional information on the enzymes catalyzing these reactions was obtained from other resources. From BRENDA [15] we retrieved the "reaction type", a set of one or more keywords describing the transformation carried out by the enzyme. The total number of unique keywords in this dataset is 165. From BRENDA we also retrieved the PROSITE motifs associated with the enzymes. PROSITE [27] is a database of short sequence motifs. We retrieved 183 unique PROSITE motifs associated with the enzymes in our dataset. Interpro [28] is a resource that integrates information from diverse databases defining protein domains according to different structural and sequence criteria, such as Pfam, Prodom, SMART, SCOP, etc. From Interpro, we retrieved 1107 unique signatures (domains) associated with the enzymes in our dataset. Gene Ontology [29] defines three sets of terms for representing three different and independent aspects of the complex phenomena of protein function ("molecular function", "biological process", and "cellular compartment"). These terms are related by parenthood relationships defining three hierarchies which go from very general to more specific terms. A protein (gene product) becomes annotated by associating it to one or more of these terms. From Gene Ontology we retrieved the GO terms in the "molecular function" (GO:MF) and "biological process" (GO:BP) sub-ontologies for our enzymes: 736 GO terms (439 GO:MF and 297 GO:BP). In the following, we use the general term "keyword" referring to BRENDA keywords, PROSITE motifs, Interpro domains and GO terms.
Additionally, to study in more detail the relationships between the compound vectors and the physico-chemical properties of the molecules they represent, a supplementary set of chemical compounds with associated physico-chemical information was obtained from http://cheminformatics.org/ That dataset was previously used by Karthikeyan et al. to develop a melting-point prediction method [30]. The molecules in it are provided in SMILES format http://www.daylight.com/smiles. We convert from SMILES to .pdb with OpenBabel, and then to triplet vectors as explained above. For this dataset, owing to the more diverse atomic compositions of the molecules, we needed 96 atom triplets. From the original 4450 molecules in the dataset, 43 could not be represented in vectorial form for several reasons (e.g. SMILES not readable by OpenBabel), resulting in a final dataset of 4407 compounds. Hence, our final dataset contains 4407 molecules with very diverse chemical properties; e.g. the range of molecular weights is 84-815 and the range of logP is -3.18 to +14.16. We used three physicochemical properties in this work: logP (log ratio of concentrations between octanol and water, a measure of hydrophobicity); MW (molecular weight); and TPSA (Topological Polar Surface Area, the surface of all polar atoms). The program SPSS was used to perform a multivariate linear regression analysis to relate the compound vectors to these chemical properties. The WEKA package [31] was used to perform classification studies aimed at evaluating the predictive value of this vectorial representation. For that, the dataset of 4407 compounds was randomly divided into 10 subsets. Nine of these (training set) were used to perform the linear regression analysis, and the result was then used to predict the chemical property of the remaining group (test set). Cycling this procedure for the other nine groups allowed predictions for all the compounds to be obtained. The predicted values for the three properties were compared with the experimental values by linear correlation.
Clusters of reaction vectors
A distance tree for all the reaction vectors was generated by the UPGMA algorithm implemented in the R package http://www.r-project.org, based on the Euclidean distances between the vectors. Cutting this tree at different levels (going from the root to the leaves) produces different groupings with increasing number of clusters. The "Mclust" function of the "mclust" library of the same package was used to calculate the optimal number of clusters. This function is based on the BIC parameter ("Bayesian information criterion"). We explored from 1 to 100 clusters and the optimal number according to BIC criteria was 21.
We compared this clustering with equivalent classifications defined by the EC hierarchy using the enzyme annotations in the BRENDA, PROSITE, InterPro and GO resources (See "Datasets" above). For our reaction vectors (reactome) we can have any number of clusters depending on the level at which we cut the UPGMA tree, although the optimal number of clusters is 21. On the contrary, the EC classification defines four possible groupings only, depending on the EC level (6 clusters at the 1st level, 39 at the second, etc.). In order to compare EC with reactome for equivalent number of clusters we obtain the groupings of the reactome with the same number of clusters (6, 39, ...).
For a given set of keywords (BRENDA, GO, Interpro, ...) and a given clustering we created a contingency table containing the frequencies of each "keyword" (annotation) in each cluster, that is, the number of enzymes within that cluster associated with the keyword. A chi-squared test was applied to each contingency table. We used the Yates correction [32], which is applied to account for the low frequencies of some keywords and results in a more conservative test. Since different classifications have different degrees of freedom, we convert the chi-squared values (Χ2i) to z-scores (zi) using Fisher's approximation [33], which approximates the chi-squared distribution by a normal distribution with mean 0 and σ = 1.
where ni are the degrees of freedom: (N-1)*(k-1), k being the number of keywords in each dataset and N the number of clusters. A high z-score indicates a relationship between the clustering and the set of keywords: elements (enzymes) clustering together tend to have the same keywords, and vice versa.
To elucidate the biological meanings and characteristics of the 21 groups which form the "optimal" clustering of the reactome (see above), we extracted the keywords and the components of the reaction vectors (atom triplets) specifically associated with each. We did that by calculating, for each keyword/triplet in each of the 21 clusters, the p-value of its frequency within that cluster, taking into account the background frequency of that keyword/triplet. For that we used the hypergeometric distribution and the Poisson distribution for low frequency cases (n*p < 5 and p < 0.1, n being the frequency of the keyword/triplet in the cluster and p the probability of that keyword/triplet in the 21 clusters) [34].
Comments
View archived comments (1)