DNA-binding proteins
DNA-binding proteins were obtained from our previous study [10]. In that study, we extracted all protein-DNA complexes from the PDB [16] and culled the dataset using PISCES [17]. The resulting dataset consisted of 171 proteins with mutual sequence identity ≤ 30%, each with at least 40 amino acid residues. All structures have a resolution better than 3.0 Å and an R factor less than 0.3. In the current study, seven features were calculated for each protein and evaluated for their usefulness in predicting DNA-binding sites. Among them, structural conservation was calculated from the alignment of structural neighbors (see Section 2.2 for details). Twenty-five proteins were discarded because no structural neighbors were found. In the end, 146 DNA-binding proteins were used to evaluate the proposed method in cross-validation.
Features
DNA was removed from the protein-DNA complexes and seven features were calculated for each amino acid of the protein: (1) Relative solvent accessibility was calculated using NACCESS [18]; (2) Electrostatic potential was calculated using Delphi [19] with the same parameters used in the study of Jones et al. [1]. The electrostatic potential of a residue is defined as the average of the electrostatic potentials at the locations of all its atoms, as described in Jones et al. [1]; (3) Sequence entropy at each residue position (the sequence entropy for the corresponding column in the multiple sequence alignment) was extracted from the HSSP database [20]. Sequence entropy is a measure of sequence conservation: the lower the value, the more conserved the corresponding residue position; (4) Surface curvature at each residue position was calculated using MSP (http://connolly.best.vwh.net/); (5) Pockets on the protein surface were detected using Proshape (http://csb.stanford.edu/~koehl/ProShape/download.php). The pocket size of a residue is defined as the size of the pocket in which the residue is located. If a residue is not located in any pocket, its pocket size is set to 0; (6) The DALI server [21] was used to search the PDB for structural neighbors of each DNA-binding protein. The DALI server returned a multiple alignment of the query structure and its structural neighbors. A structural conservation score was then calculated for each residue position using Scorecons [22] based on this multiple alignment; and (7) A position-specific scoring matrix (PSSM) was built for each protein by running 4 iterations of PSI-BLAST [23] against the NCBI non-redundant (nr) database. In the PSSM, each residue position corresponds to 20 values. Thus, in total, each amino acid residue is associated with 26 attributes (six single-valued features plus the 20 PSSM values). All these attributes were normalized to the range of 0 to 1.
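As an illustration of the final normalization step, the following minimal sketch scales a residue-by-attribute matrix to the range 0 to 1. It assumes column-wise min-max scaling over all residues, which the text does not specify; the function name `normalize_attributes` and the column layout are hypothetical.

```python
import numpy as np

def normalize_attributes(raw):
    """Min-max scale each attribute column to [0, 1].

    raw: array of shape (n_residues, n_attributes), e.g. 26 columns made of
    six single-valued features followed by the 20 PSSM values.
    Column-wise min-max scaling is an assumption, not the stated method.
    """
    raw = np.asarray(raw, dtype=float)
    lo = raw.min(axis=0)
    span = raw.max(axis=0) - lo
    span[span == 0] = 1.0  # constant columns map to 0 instead of dividing by zero
    return (raw - lo) / span
```

Other schemes (for example a logistic transform of the PSSM scores) would also map values into this range; the sketch only shows one plausible choice.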
Interface residues and surface residues
Interface residues are defined as in Jones et al. [1]. Solvent accessible surface area (ASA) was computed for each residue in the unbound protein (in the absence of DNA) and in the protein-DNA complex. A residue is defined to be an interface residue if its ASA in the protein-DNA complex is less than its ASA in the unbound protein by at least 1 Å². A residue is defined to be a surface residue if its relative accessibility in the unbound protein is > 5%. In total, 4,337 interface residues and 27,248 surface residues were obtained.
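The two definitions above can be sketched as a small function. The input dictionaries (residue identifier to ASA in Å², and to relative accessibility in percent) and the function name `classify_residues` are hypothetical; in practice the values would come from a tool such as NACCESS.

```python
def classify_residues(asa_unbound, asa_complex, rel_acc_unbound,
                      min_asa_loss=1.0, min_rel_acc=5.0):
    """Return (interface, surface) sets of residue identifiers.

    asa_unbound, asa_complex: dict residue id -> ASA in square angstroms.
    rel_acc_unbound: dict residue id -> relative accessibility in percent.
    """
    # Interface: ASA drops by at least min_asa_loss upon DNA binding.
    interface = {r for r in asa_unbound
                 if asa_unbound[r] - asa_complex[r] >= min_asa_loss}
    # Surface: relative accessibility in the unbound protein above the cutoff.
    surface = {r for r in rel_acc_unbound if rel_acc_unbound[r] > min_rel_acc}
    return interface, surface
```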
Interface patches and non-interface patches
For each DNA-binding protein, an interface patch and a non-interface patch were obtained. The interface patch included all the interface residues. The non-interface patch was randomly taken from the protein surface such that (1) it consisted of a group of contiguous surface residues; (2) it had the same number of residues as the interface patch; and (3) it did not include any interface residue.
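The text does not describe how the random contiguous patch was drawn. One possible procedure, shown here purely as an assumption, grows a patch from a random seed residue by repeatedly adding a random contacting residue that is on the surface and not in the interface. The function name and the `contacts` dictionary (residue to set of contacting residues) are hypothetical.

```python
import random

def sample_non_interface_patch(contacts, surface, interface, rng=None, max_tries=100):
    """Draw a connected patch of len(interface) non-interface surface residues.

    contacts: dict residue id -> set of contacting residue ids.
    Returns a set of residue ids, or None if no patch was found.
    """
    rng = rng or random.Random()
    allowed = sorted(set(surface) - set(interface))
    allowed_set = set(allowed)
    size = len(interface)
    if size == 0 or len(allowed) < size:
        return None
    for _ in range(max_tries):
        patch = {rng.choice(allowed)}          # random seed residue
        while len(patch) < size:
            # Candidate residues: allowed neighbors of the current patch.
            frontier = sorted({n for r in patch for n in contacts.get(r, ())
                               if n in allowed_set} - patch)
            if not frontier:
                break                          # dead end, retry with a new seed
            patch.add(rng.choice(frontier))
        if len(patch) == size:
            return patch
    return None
```

Growing the patch over residue contacts guarantees that conditions (1) to (3) hold by construction whenever a patch is returned.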
Graph representation of patches
Each amino acid residue is represented using a node labeled with the 26 attributes of the residue. Two residues are considered contacting if the closest distance between their heavy atoms is less than the sum of the radii of the atoms plus 0.5 Å. An edge is added between two nodes if the corresponding residues are contacting. In this way, a surface patch of residues is represented as a labeled graph.
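A minimal sketch of the contact test and edge construction follows. It reads the criterion as "some pair of heavy atoms lies closer than the sum of their radii plus 0.5 Å", which is one interpretation of the sentence above; the atomic radii are supplied by the caller, and the function names are hypothetical.

```python
import numpy as np

def residues_in_contact(coords_a, radii_a, coords_b, radii_b, margin=0.5):
    """True if any heavy-atom pair is closer than r_i + r_j + margin (angstroms)."""
    coords_a = np.asarray(coords_a, dtype=float)
    coords_b = np.asarray(coords_b, dtype=float)
    radii_a = np.asarray(radii_a, dtype=float)
    radii_b = np.asarray(radii_b, dtype=float)
    # Pairwise distances between every atom of A and every atom of B.
    dist = np.linalg.norm(coords_a[:, None, :] - coords_b[None, :, :], axis=-1)
    cutoff = radii_a[:, None] + radii_b[None, :] + margin
    return bool((dist < cutoff).any())

def build_patch_edges(residues):
    """residues: dict residue id -> (heavy-atom coords, heavy-atom radii).

    Returns the set of undirected edges (a, b) between contacting residues.
    """
    ids = list(residues)
    edges = set()
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if residues_in_contact(*residues[a], *residues[b]):
                edges.add((a, b))
    return edges
```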
Graph kernel
Kernel methods are widely used in data mining. Put simply, a kernel function can be regarded as defining a positive definite matrix that measures the similarity between each pair of input data. In the current study, a graph kernel method, namely the shortest-path kernel developed by Borgwardt and Kriegel [24], is used to compute the similarities between graphs.
The first step of the shortest-path kernel is to transform original graphs into shortest-path graphs. A shortest-path graph has the same nodes as its original graph, and between each pair of nodes, there is an edge labeled with the shortest distance between the two nodes in the original graph. In the current study, the edge label will be referred to as the weight of the edge. This transformation can be done using any algorithm that solves the all-pairs-shortest-paths problem. In the current study, the Floyd-Warshall algorithm was used.
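The transformation can be sketched with a plain Floyd-Warshall implementation. The sketch assumes that every edge of the original graph has unit length, so that the shortest distance is a hop count; the text does not state the edge lengths used. The function name is hypothetical.

```python
import math

def shortest_path_graph(nodes, edges):
    """All-pairs shortest paths via Floyd-Warshall on an undirected graph.

    Returns dict {(u, v): distance} with one entry per connected node pair,
    i.e. the weighted edges of the shortest-path graph.
    Unit edge length in the original graph is an assumption.
    """
    nodes = list(nodes)
    idx = {n: i for i, n in enumerate(nodes)}
    n = len(nodes)
    dist = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for u, v in edges:
        dist[idx[u]][idx[v]] = dist[idx[v]][idx[u]] = 1.0
    for k in range(n):                 # allow node k as an intermediate stop
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return {(nodes[i], nodes[j]): dist[i][j]
            for i in range(n) for j in range(i + 1, n)
            if math.isfinite(dist[i][j])}
```

Pairs of nodes with no connecting path are omitted, so a disconnected patch yields fewer edges in its shortest-path graph.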
Let G1 and G2 be two original graphs. They are transformed into shortest-path graphs S1(V1, E1) and S2(V2, E2), where V1 and V2 are the sets of nodes in S1 and S2, and E1 and E2 are the sets of edges in S1 and S2. Then a kernel function is used to calculate the similarity between G1 and G2 by comparing all pairs of edges between S1 and S2:

$$K(G_1, G_2) = \sum_{e_1 \in E_1} \sum_{e_2 \in E_2} k_{\mathrm{edge}}(e_1, e_2)$$

where $k_{\mathrm{edge}}(\cdot,\cdot)$ is a kernel function for comparing two edges (including the node labels and the edge weight).
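Let e1 be the edge between nodes v1 and w1, and e2 be the edge between nodes v2 and w2. Then,

$$k_{\mathrm{edge}}(e_1, e_2) = k_{\mathrm{node}}(v_1, v_2) \cdot k_{\mathrm{weight}}(e_1, e_2) \cdot k_{\mathrm{node}}(w_1, w_2)$$

where $k_{\mathrm{node}}(\cdot,\cdot)$ is a kernel function for comparing the labels of two nodes, and $k_{\mathrm{weight}}(\cdot,\cdot)$ is a kernel function for comparing the weights of two edges. These two functions are defined as in Borgwardt et al. [25]: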
$$k_{\mathrm{node}}(v, w) = \exp\left(-\frac{\lVert \mathrm{labels}(v) - \mathrm{labels}(w) \rVert^2}{2\sigma^2}\right)$$

where labels(v) returns the vector of attributes associated with node v. When n features are used to label the nodes, labels(v) and labels(w) can be considered as the coordinates of two points in n-dimensional space, and ||labels(v) − labels(w)|| is the Euclidean distance between the two points. Note that $k_{\mathrm{node}}(\cdot,\cdot)$ is a Gaussian kernel function. We tried different values of σ between 32 and 128 in increments of 2; the accuracy varied from 86% to 88.7% when all features were used in the cross-validation. The best accuracy was achieved when σ was set to 72.

$$k_{\mathrm{weight}}(e_1, e_2) = \max\bigl(0,\; c - \lvert \mathrm{weight}(e_1) - \mathrm{weight}(e_2) \rvert\bigr)$$

where weight(e) returns the weight of edge e. $k_{\mathrm{weight}}(\cdot,\cdot)$ is a Brownian bridge kernel that assigns the highest value to edges that are identical in length. The constant c was set to 2 as in Borgwardt et al. [25]. We tried different values of c between 1 and 5 in increments of 1; the change in accuracy was less than 1%.
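To make the kernel computation concrete, the sketch below sums an edge kernel over all pairs of shortest-path edges. The functional forms are assumptions based on the standard shortest-path kernel: a Gaussian node kernel exp(−d²/(2σ²)), a Brownian bridge weight kernel max(0, c − |Δ|), and an edge kernel equal to the product of the two node kernels and the weight kernel. The exact parametrization of the Gaussian width is not recoverable from the text, and edge endpoints are compared in stored order only, which is a simplification.

```python
import numpy as np

def k_node(x, y, sigma=72.0):
    """Gaussian kernel on two attribute vectors (assumed parametrization)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-float(diff @ diff) / (2.0 * sigma ** 2)))

def k_weight(w1, w2, c=2.0):
    """Brownian bridge kernel: largest when the two path lengths are equal."""
    return max(0.0, c - abs(w1 - w2))

def sp_kernel(labels1, sp1, labels2, sp2, sigma=72.0, c=2.0):
    """Shortest-path kernel between two graphs.

    labels*: dict node -> attribute vector.
    sp*: dict (v, w) -> shortest-path length (edges of the shortest-path graph).
    """
    total = 0.0
    for (v1, w1), len1 in sp1.items():
        for (v2, w2), len2 in sp2.items():
            kw = k_weight(len1, len2, c)
            if kw == 0.0:
                continue  # lengths differ by c or more: no contribution
            total += (k_node(labels1[v1], labels2[v2], sigma) * kw
                      * k_node(labels1[w1], labels2[w2], sigma))
    return total
```

Because every term is non-negative, the value grows with the number of edge pairs, which is the size effect addressed in the classification step.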
Classification
When the shortest-path graph kernel is used to compute similarities between graphs, the results are affected by the sizes of the graphs. Consider the case in which graph G is compared with graphs Gx and Gy separately using the graph kernel:

$$K(G, G_x) = \sum_{e \in E} \sum_{e_x \in E_x} k_{\mathrm{edge}}(e, e_x), \qquad K(G, G_y) = \sum_{e \in E} \sum_{e_y \in E_y} k_{\mathrm{edge}}(e, e_y)$$

where E, Ex and Ey are the sets of edges in the shortest-path graphs of G, Gx and Gy. If Gx has more nodes than Gy, then |Ex| > |Ey|. Therefore, the summation in K(G, Gx) includes more terms than the summation in K(G, Gy). Each term (i.e., $k_{\mathrm{edge}}(\cdot,\cdot)$) inside the summation has a non-negative value. Consequently, K(G, Gx) > K(G, Gy) does not necessarily indicate that Gx is more similar to G than Gy is; it could instead be an artifact of Gx having more nodes than Gy. To overcome this problem, a voting strategy was developed for predicting whether a graph (or a patch) is an interface patch:
Algorithm Voting_Strategy(G)
Input: graph G
Output: G is an interface patch or a non-interface patch
Let T be the set of proteins in the training set
Let n be the number of proteins in T
Let v be the number of votes given to "G is an interface patch"
v = 0
While (T is not empty)
{
    Take one protein (P) out of T
    Let Gint and Gnon-int be the interface and non-interface patches from P
    If K(G, Gint) > K(G, Gnon-int), then increase v by 1
}
If v > n/2, then G is an interface patch
Else G is a non-interface patch
Using this strategy, when K(G, Gint) is compared with K(G, Gnon-int), Gint and Gnon-int are guaranteed to have an identical number of nodes, since they are the interface and non-interface patches extracted from the same protein (see section 2.4 for details). Each time K(G, Gint) > K(G, Gnon-int) is true, one vote is given to "G is an interface patch". In the end, G is predicted to be an interface patch if "G is an interface patch" gets more than half of the total votes, i.e., v > n/2, where v is the number of such votes and n is the number of proteins in the training set.
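The voting strategy translates almost directly into code. In the sketch below, `kernel` is any callable that returns the similarity of two graphs, and `training_pairs` holds one (interface patch, non-interface patch) pair per training protein; both names and the function name are hypothetical.

```python
def predict_interface(kernel, g, training_pairs):
    """Majority vote over training proteins.

    kernel: callable (graph, graph) -> float similarity.
    training_pairs: list of (g_int, g_non_int), one pair per training protein.
    Returns True if g is predicted to be an interface patch.
    """
    votes = sum(1 for g_int, g_non_int in training_pairs
                if kernel(g, g_int) > kernel(g, g_non_int))
    # Strict majority: exactly half of the votes is not enough.
    return votes > len(training_pairs) / 2
```

Each comparison involves two patches of equal size from the same protein, so the vote is not biased by the number of nodes in the training graphs.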