In this paper, we calculate structural neighboring property on protein-protein interface, through Voronoi diagram. We propose a novel statistical method to extract interacting residues, and interacting patches can be clustered as predicted interface residues. In addition, structural neighboring property can be adopted to construct an energy function to evaluate docking solutions, which includes new statistical property as well as existing energy items.
Data set
To obtain statistical property on interface, we adopt a high quality, non-redundant experimental data set. We select 6,438 complexes from Protein Data Bank [28]; each complex consists of two or more subunits. These complexes are determined from X-ray data with resolution less than 2.2Å. Any two complexes share no more than 30% identity.
A complex may contain several subunits and multiple interfaces. Each interface in a complex occurs in a pair of subunits. Two residues between a pair of subunits are called interface residues, if any two atoms, one from each residue, interact. By interact, we mean distance between two heavy atoms is less than 6 Å.
Structural neighboring property
Most of those features only describe current interacting residues, but cannot represent real situation well, thus are insufficient to predict binding sites with high accuracy. Here, we develop a method to calculate structural neighboring property on interface, using Voronoi diagram.
Site features
The physicochemical features are used to characterize potential interacting residues. The most interesting features are described as follows.
-
Hydrophobicity: a numerical hydrophobicity of an amino acid [29],
-
Electrostatic potential: the number of electrostatic charge in an amino acid [30],
-
Hydrogen bonds: the number of potential hydrogen bonds for all atoms in an amino acid [31].
We use Voronoi diagram to evaluate polygonal face area of each surface residue, and calculate structural neighboring property on these physicochemical features.
Polygonal face area
We use VLDP [32] for geometrically analyzing protein 3D structures, based on Voronoi Tessellation. Voronoi Tessellation is a partition of space into polyhedra, whereas Delaunay diagram builds a graph with vertices at atoms. These graphs define nearest neighbours for each atom of one protein. It calculates Delaunay diagram by using an optimized incremental algorithm. The weights can be interpreted as squared radius. The surface residues appear as a packing of polyhedra, that is necessary to have a reasonable Tessellation throughout entire system.
VLDP can be used to evaluate residue contacts and residue volumes, defined as polygonal face area and polyhedral volume for each atom. In particular, contact area of two residues is sum of atomic interface areas on pairs of atoms; surface area of one residue is sum of surface areas exposed to solvent in this residue; total area of one residue is sum of all areas in this residue. We calculate structural neighboring property, based on polygonal face area.
Property function
Given a protein, structural neighboring property of one surface residue x is defined as follow:
where p(x) is site feature of each residue, contact(x, y) >0 means that contact area between residues x and y is greater than zero, shows surface area of residue x divided by its total area, as shown in Figure 1.
We use a normal distribution F (x) to estimate probability of structural neighboring property on one side of interface. We also use a bivariate normal distribution F (x1, x2) [33] to estimate probability of structural neighboring property on both sides of interface. Given a pair of proteins, effective free energy of interacting residue pair can be calculated as:
where R is a set of all residue pairs on interface.
Extracting interface residues
We propose a statistical method to extract interacting residues, and interacting patches can be clustered as predicted interface residues. The threshold value s
th
is used to harvest all possible residue pairs between two proteins. The residue pairs with S(x1, x2) ≤ s
th
are called interacting residues.
Considering neighboring residues, we construct a sphere with a radius of 10Å for all interacting residues. The updated statistical property of each interacting residue pair is calculated as follows.
where dis(x, r) is 3D distance of two C
α
atoms in residues x and r. The residues r
i
are from protein having residue x1, and the residues r
j
are from protein having residue x2.
We rank interacting residues by using updated statistical property. Top interacting residues can be grouped into different regions. All interacting residues served as graph nodes, and each undirected edge is built when two nodes are within distance 10 Å. Strongly connected components are considered as interacting patches. One region, containing a very small number of interacting residues, indicates a weak signal and can be discarded. We cluster interacting patches as predicted interface residues.
Energy function for docking
Given two input proteins, our task is to find the protein-protein interface between them. In first step, we identify docking solutions of two subunits. It performs a large number of rigid transformations to enumerate poses. Top ranking poses are selected through a linear combination of energy items. In second step, we generate possible conformational changes of interface residues from their unbound states to bound states, based on multidimensional scaling method. In third step, we use trained SVM models to further select best poses for input proteins. Structural neighboring property can be effectively applied to identify docking solutions.
Here, we construct a new energy function to evaluate docking solutions, which includes new statistical property as well as existing energy items [27]. The following lists all energy items, and how they are computed:
-
Structural neighborhood energy is calculated by probability of structural neighboring property on interface.
-
π-π interaction energy is calculated by geometrical property on π-π interaction [27].
-
Dihedral angle energy is calculated by statistical analysis of dihedral angle frequency and correlation on interface [27].
-
Amino acid energy is constructed by probabilities of interface residues.
-
Side-chain atoms of interface residues are packed by SCWRL4 [34], and sidechain energy is extracted.
We use a linear combination of these energy items, referred to as initial energy function, to rank poses. The coefficient of each item is optimized by using a linear combination method in [35]. We output top 100 poses with lowest energy values. For conformational changed structures, our method calculates a set of possibly changed conformations of interfaces.
As in [27], we use a training set consisting of 79 complexes from Dockground [36] to produce 79 SVM models, one for each complex, based on these energy items. Finally, we use trained SVM models to further select best 10 poses with lowest energy values for two input proteins.
Assessment of interface prediction
According to CAPRI evaluation criteria [37], three evaluation measures are commonly used in identifying protein-protein interface. A pair of residues on interface is considered to be in contact if any of their atoms are within 6 Å One is the fraction of native contacts F
nat
, defined as the number of correct residue-residue contacts in predicted complex divided by the number of contacts in native complex. The other is the fraction of non-native contacts F
non−nat
, defined as the number of incorrect residues-residue contacts in predicted complex divided by the total number of contacts in that predicted complex. The third is root-mean-square deviation of interface I
rmsd
, defined as the rmsd value between all backbone atoms of interfaces in predicted structure and in native complex, after two interfaces are superimposed.
We also calculate P value for binding sites prediction. The calculation of P value should be probability of obtaining not less than n correctly predicted interface residues by randomly picking out N predicted interface residues. The probability that a random method obtains success in one trial is, where M is the number of all surface residues, and m is the number of correctly interface residues among them. Therefore, P value for binding sites prediction is given by