### DNA-binding proteins

DNA-binding proteins were obtained from our previous study [10]. In that study, we extracted all protein-DNA complexes from the PDB [16] and culled the dataset using PISCES [17]. The resulting dataset consisted of 171 proteins with mutual sequence identity ≤ 30%, each with at least 40 amino acid residues. All the structures have a resolution better than 3.0 Å and an R factor less than 0.3. In the current study, seven features are evaluated for their usefulness in the prediction of DNA-binding sites, so all seven were calculated for each protein. Among them, structural conservation was calculated based on the alignment of structural neighbors (see section 2.2 for details). Twenty-five proteins were discarded because no structural neighbors were found. In the end, 146 DNA-binding proteins were used to evaluate the proposed method in cross-validation.

### Features

DNA was removed from the protein-DNA complexes and seven features were calculated for each amino acid of the protein: (1) Relative solvent accessibility was calculated using NACCESS [18]; (2) Electrostatic potential was calculated using Delphi [19] with the same parameters used in the study of Jones *et al*. [1]. The electrostatic potential of a residue is defined as the average of the electrostatic potentials at the locations of all its atoms, as described in Jones *et al*. [1]; (3) Sequence entropy at each residue position (the sequence entropy for the corresponding column in the multiple sequence alignment) was extracted from the HSSP database [20]. Sequence entropy is a measure of sequence conservation: the lower the value, the more conserved the corresponding residue position; (4) Surface curvature at each residue position was calculated using MSP (http://connolly.best.vwh.net/); (5) Pockets on the protein surface were detected using Proshape (http://csb.stanford.edu/~koehl/ProShape/download.php). The pocket size of a residue is defined as the size of the pocket in which the residue is located. If a residue is not located in any pocket, a value of 0 is assigned to its pocket size; (6) The DALI server [21] was used to search for structural neighbors in the PDB for each of the DNA-binding proteins. The DALI server returned a multiple alignment of the query structure and its structural neighbors. A structural conservation score was then calculated for each residue position using Scorecons [22] based on the multiple alignment; and (7) A position-specific scoring matrix (PSSM) of a protein was built by running 4 iterations of PSI-BLAST [23] against the NCBI non-redundant (nr) database. In the PSSM, each residue position corresponds to 20 values. Thus, in total, each amino acid residue is associated with 26 attributes (six single-valued features plus the 20 PSSM values). All these attributes were normalized to the range [0, 1].
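The text states only that the attributes were scaled to [0, 1], not how. As an illustration, the sketch below applies per-attribute min-max scaling, which is an assumption rather than the documented procedure. The function name `min_max_normalize` and the toy values are hypothetical.

```python
def min_max_normalize(rows):
    """Scale each attribute (column) to [0, 1] by min-max scaling.

    Assumption: min-max scaling per attribute; the text does not say
    which normalization was actually used.
    rows: list of equal-length attribute vectors, one per residue.
    """
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # A constant column carries no information and is mapped to 0.0.
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

# Toy example with 2 attributes per residue (the study uses 26).
raw = [[10.0, -2.0], [20.0, 0.0], [30.0, 2.0]]
scaled = min_max_normalize(raw)
```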

### Interface residues and surface residues

Interface residues are defined as in Jones *et al*. [1]. Solvent accessible surface area (ASA) was computed for each residue in the unbound protein (in the absence of DNA) and in the protein-DNA complex. A residue is defined to be an interface residue if its ASA in the protein-DNA complex is less than its ASA in the unbound protein by at least 1 Å^{2}. A residue is defined to be a surface residue if its relative accessibility in the unbound protein is >5%. In total, 4,337 interface residues and 27,248 surface residues were obtained.
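The two definitions above reduce to simple threshold tests. The following minimal sketch shows them, assuming the ASA values have already been computed (for example with NACCESS). The record type `ResidueASA`, its field names, and the numeric values are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass

ASA_LOSS_CUTOFF = 1.0     # Å^2, minimum ASA lost upon DNA binding (from the text)
RSA_SURFACE_CUTOFF = 5.0  # percent relative accessibility (from the text)

@dataclass
class ResidueASA:
    """Hypothetical per-residue record; field names are illustrative."""
    residue_id: str
    asa_unbound: float  # Å^2, protein without DNA
    asa_complex: float  # Å^2, protein in the protein-DNA complex
    rsa_unbound: float  # percent, relative accessibility without DNA

def is_interface(r: ResidueASA) -> bool:
    # Interface residue: loses at least 1 Å^2 of ASA in the complex.
    return (r.asa_unbound - r.asa_complex) >= ASA_LOSS_CUTOFF

def is_surface(r: ResidueASA) -> bool:
    # Surface residue: relative accessibility above 5% in the unbound protein.
    return r.rsa_unbound > RSA_SURFACE_CUTOFF

# Made-up values for three residues.
residues = [
    ResidueASA("ARG45", 120.0, 35.5, 50.2),  # buried by DNA: interface
    ResidueASA("LEU12", 3.0, 3.0, 1.8),      # buried in the core
    ResidueASA("SER77", 60.0, 59.6, 48.0),   # exposed, loses only 0.4 Å^2
]
interface_ids = [r.residue_id for r in residues if is_interface(r)]
surface_ids = [r.residue_id for r in residues if is_surface(r)]
```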

### Interface patches and non-interface patches

For each DNA-binding protein, an interface patch and a non-interface patch were obtained. The interface patch included all the interface residues. The non-interface patch was randomly taken from the protein surface such that (1) it consisted of a group of contiguous surface residues; (2) it had the same number of residues as the interface patch; and (3) it did not include any interface residue.
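The text gives the three constraints on the non-interface patch but not the sampling procedure. One way to satisfy them is to grow a patch outward from a random seed residue. The sketch below does this over a residue contact map; the growth scheme, the function name `sample_non_interface_patch`, and the retry limit are all assumptions.

```python
import random

def sample_non_interface_patch(adjacency, interface, size, rng, max_tries=1000):
    """Grow a contiguous patch of `size` non-interface surface residues.

    adjacency: dict mapping each surface residue to the set of surface
               residues it contacts (hypothetical input format).
    interface: set of interface residues, which must be excluded.
    Returns a set of residues, or None if no patch was found.
    """
    candidates = [r for r in adjacency if r not in interface]
    if len(candidates) < size:
        return None
    for _ in range(max_tries):
        seed = rng.choice(candidates)
        patch = {seed}
        frontier = [n for n in adjacency[seed] if n not in interface]
        while frontier and len(patch) < size:
            # Pick a random neighbor of the current patch and add it.
            nxt = frontier.pop(rng.randrange(len(frontier)))
            if nxt in patch:
                continue
            patch.add(nxt)
            frontier.extend(n for n in adjacency[nxt]
                            if n not in interface and n not in patch)
        if len(patch) == size:
            return patch
    return None

# Toy surface: 10 residues arranged in a ring, residues 0-2 form the interface.
ring = {i: {(i - 1) % 10, (i + 1) % 10} for i in range(10)}
interface_set = {0, 1, 2}
patch = sample_non_interface_patch(ring, interface_set, size=3, rng=random.Random(0))
```

Because every added residue contacts a residue already in the patch, the result is contiguous by construction.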

### Graph representation of patches

Each amino acid residue is represented using a node labeled with the 26 attributes of the residue. Two residues are considered contacting if the closest distance between their heavy atoms is less than the sum of the radii of the atoms plus 0.5 Å. An edge is added between two nodes if the corresponding residues are contacting. In this way, a surface patch of residues is represented as a labeled graph.
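The graph construction can be sketched as follows. Two assumptions are made here: the contact criterion is read as "some pair of heavy atoms, one from each residue, lies closer than the sum of their radii plus 0.5 Å", and each atom is given as a coordinate triple with a radius. The input layout and the names `in_contact` and `build_patch_graph` are hypothetical.

```python
import math
from itertools import combinations

CONTACT_MARGIN = 0.5  # Å, from the text

def in_contact(atoms_a, atoms_b, margin=CONTACT_MARGIN):
    """atoms_*: list of ((x, y, z), radius) tuples for the heavy atoms."""
    return any(
        math.dist(pos_a, pos_b) < rad_a + rad_b + margin
        for pos_a, rad_a in atoms_a
        for pos_b, rad_b in atoms_b
    )

def build_patch_graph(residues):
    """residues: dict name -> {"attrs": [...], "atoms": [...]}.

    Returns (nodes, edges): nodes maps each residue to its attribute
    vector; edges is a set of unordered residue pairs in contact.
    """
    nodes = {name: r["attrs"] for name, r in residues.items()}
    edges = {
        frozenset((a, b))
        for a, b in combinations(residues, 2)
        if in_contact(residues[a]["atoms"], residues[b]["atoms"])
    }
    return nodes, edges

# Toy patch: one atom per residue, 2 attributes instead of 26.
toy = {
    "A": {"attrs": [0.1, 0.9], "atoms": [((0.0, 0.0, 0.0), 1.7)]},
    "B": {"attrs": [0.2, 0.8], "atoms": [((3.5, 0.0, 0.0), 1.7)]},
    "C": {"attrs": [0.7, 0.3], "atoms": [((10.0, 0.0, 0.0), 1.7)]},
}
nodes, edges = build_patch_graph(toy)
```

Here A and B are 3.5 Å apart, which is below the 3.9 Å cutoff (1.7 + 1.7 + 0.5), so they share an edge; C is too far from both.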

### Graph kernel

Kernel methods are widely used in data mining. Informally, a kernel function measures the similarity between each pair of inputs, and the resulting matrix of pairwise similarities is positive definite. In the current study, a graph kernel method, namely the shortest-path kernel developed by Borgwardt and Kriegel [24], is used to compute the similarities between graphs.

The first step of the shortest-path kernel is to transform original graphs into shortest-path graphs. A shortest-path graph has the same nodes as its original graph, and between each pair of nodes, there is an edge labeled with the shortest distance between the two nodes in the original graph. In the current study, the edge label will be referred to as the weight of the edge. This transformation can be done using any algorithm that solves the all-pairs-shortest-paths problem. In the current study, the Floyd-Warshall algorithm was used.
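The transformation into a shortest-path graph can be sketched with a plain Floyd-Warshall implementation. The text does not state the length assigned to each edge of the original graph; the sketch assumes every contact edge has length 1, so that weights are hop counts. The function name `shortest_path_lengths` is hypothetical.

```python
import math

def shortest_path_lengths(nodes, edges):
    """All-pairs shortest path lengths via Floyd-Warshall.

    Assumption: every edge of the original graph has length 1.
    nodes: list of node names; edges: iterable of (u, v) pairs.
    Returns dist[u][v]; unreachable pairs keep math.inf.
    """
    dist = {u: {v: (0 if u == v else math.inf) for v in nodes} for u in nodes}
    for u, v in edges:
        dist[u][v] = dist[v][u] = 1  # undirected graph
    for k in nodes:          # allow k as an intermediate node
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Toy graph: a chain A - B - C - D.
chain_nodes = ["A", "B", "C", "D"]
chain_edges = [("A", "B"), ("B", "C"), ("C", "D")]
sp = shortest_path_lengths(chain_nodes, chain_edges)
```

The shortest-path graph then has one edge per node pair (u, v) with u ≠ v, weighted by `sp[u][v]`.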

Let G_{1} and G_{2} be two original graphs. They are transformed into shortest-path graphs S_{1}(V_{1}, E_{1}) and S_{2}(V_{2}, E_{2}), where V_{1} and V_{2} are the sets of nodes in S_{1} and S_{2}, and E_{1} and E_{2} are the sets of edges in S_{1} and S_{2}. Then a kernel function is used to calculate the similarity between G_{1} and G_{2} by comparing all pairs of edges between S_{1} and S_{2}.

K\left({G}_{1},{G}_{2}\right)=\sum _{{e}_{1}\in {E}_{1}}\sum _{{e}_{2}\in {E}_{2}}{k}_{edge}\left({e}_{1},{e}_{2}\right)

where *k*_{edge}( ) is a kernel function for comparing two edges (including the node labels and the edge weight).

Let e_{1} be the edge between nodes v_{1} and w_{1}, and e_{2} be the edge between nodes v_{2} and w_{2}. Then,

{k}_{edge}\left({e}_{1},{e}_{2}\right)={k}_{node}\left({v}_{1},{v}_{2}\right)*{k}_{weight}\left({e}_{1},{e}_{2}\right)*{k}_{node}\left({w}_{1},{w}_{2}\right)

where *k*_{node}( ) is a kernel function for comparing the labels of two nodes, and *k*_{weight}( ) is a kernel function for comparing the weights of two edges. These two functions are defined as in Borgwardt *et al*. [25]:

{k}_{node}\left(v,w\right)=\mathsf{\text{exp}}\left(-\frac{{\left\|labels\left(v\right)-labels\left(w\right)\right\|}^{2}}{2{\delta}^{2}}\right)

where *labels*(*v*) returns the vector of attributes associated with node *v*. When *n* features are used to label the nodes, *labels*(*v*) and *labels*(*w*) can be considered as the coordinates of two points in *n*-dimensional space, and ||*labels*(*v*)-*labels*(*w*)|| is the Euclidean distance between the two points. Note that *k*_{node}( ) is a Gaussian kernel function. We tried values of \frac{1}{2{\delta}^{2}} between 32 and 128 in increments of 2; the accuracy varied from 86% to 88.7% when all features were used in cross-validation. The best accuracy was achieved when \frac{1}{2{\delta}^{2}} was set to 72.

{k}_{weight}\left({e}_{1},{e}_{2}\right)=\mathsf{\text{max}}\left(0,c-|weight\left({e}_{1}\right)-weight\left({e}_{2}\right)|\right)

where *weight*(*e*) returns the weight of edge *e*. *k*_{weight}( ) is a Brownian bridge kernel that assigns the highest value to edges that are identical in length. Constant *c* was set to 2 as in Borgwardt *et al*. [25]. We tried values of *c* between 1 and 5 in increments of 1; the change in accuracy was less than 1%.
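The three kernel functions and their combination can be written compactly. This is a minimal sketch that follows the equations above, using the reported settings 1/(2δ²) = 72 and *c* = 2. The edge representation (a tuple of the two endpoint label vectors and the shortest-path weight) and the function names are illustrative assumptions, and each undirected edge is listed once in a fixed orientation, a detail the text does not specify.

```python
import math

GAMMA = 72.0  # 1 / (2 * delta^2), the best-performing value reported in the text
C = 2.0       # Brownian bridge constant c, as in the text

def k_node(labels_v, labels_w, gamma=GAMMA):
    """Gaussian kernel on two attribute vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(labels_v, labels_w))
    return math.exp(-gamma * sq_dist)

def k_weight(weight_1, weight_2, c=C):
    """Brownian bridge kernel on two shortest-path lengths."""
    return max(0.0, c - abs(weight_1 - weight_2))

def k_edge(e1, e2):
    """e = (labels of first endpoint, labels of second endpoint, weight)."""
    (v1, w1, d1), (v2, w2, d2) = e1, e2
    return k_node(v1, v2) * k_weight(d1, d2) * k_node(w1, w2)

def shortest_path_kernel(edges_1, edges_2):
    """Sum k_edge over all pairs of edges from two shortest-path graphs."""
    return sum(k_edge(e1, e2) for e1 in edges_1 for e2 in edges_2)

# Toy shortest-path graphs with a single edge each and 2 attributes per node.
s1 = [((0.1, 0.2), (0.3, 0.4), 1)]
s2 = [((0.1, 0.2), (0.3, 0.4), 3)]  # same labels, path length differs by 2
same = shortest_path_kernel(s1, s1)
different = shortest_path_kernel(s1, s2)
```

With *c* = 2, two edges whose lengths differ by 2 or more contribute nothing, regardless of how similar their endpoint labels are.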

### Classification

When the shortest-path graph kernel is used to compute similarities between graphs, the results are affected by the sizes of the graphs. Consider the case that graph G is compared with graphs G_{x} and G_{y} separately using the graph kernel:

K\left(G,{G}_{x}\right)=\sum _{e\in E}\sum _{{e}_{x}\in {E}_{x}}{k}_{edge}\left(e,{e}_{x}\right)

K\left(G,{G}_{y}\right)=\sum _{e\in E}\sum _{{e}_{y}\in {E}_{y}}{k}_{edge}\left(e,{e}_{y}\right)

If G_{x} has more nodes than G_{y} does, then |E_{x}|>|E_{y}|, where E_{x} and E_{y} are the sets of edges in the shortest-path graphs of G_{x} and G_{y}. Therefore, the summation in *K(G, G*_{x}*)* includes more terms than the summation in *K(G, G*_{y}*)* does. Each term (i.e., *k*_{edge}*( )*) inside the summation has a non-negative value. Consequently, *K(G, G*_{x}*)>K(G, G*_{y}*)* does not necessarily indicate that G_{x} is more similar to G than G_{y} is; it could instead be an artifact of G_{x} having more nodes than G_{y}. To overcome this problem, a voting strategy was developed for predicting whether a graph (or a patch) is an interface patch:

*Algorithm Voting_Strategy (G)*

*Input: graph G*

*Output: G is an interface patch or a non-interface patch*

*Let T be the set of proteins in the training set*

*Let v be the number of votes given to "G is an interface patch"*

*v = 0*

*For each protein P in T*

*{*

*Let G*_{int} *and G*_{non-int} *be the interface and non-interface patches from P*

*If K(G, G*_{int}*)>K(G, G*_{non-int}*), then increase v by 1*

*}*

*If* v>\left|T\right|/2, *then G is an interface patch*

*Else G is a non-interface patch*

Using this strategy, when *K(G, G*_{int}*)* is compared with *K(G, G*_{non-int}*)*, G_{int} and G_{non-int} are guaranteed to have the same number of nodes, since they are the interface and non-interface patches extracted from the same protein (see section 2.4 for details). Each time *K(G, G*_{int}*)>K(G, G*_{non-int}*)* is true, one vote is given to "G is an interface patch". In the end, G is predicted to be an interface patch if "G is an interface patch" receives more than half of the total votes, i.e., v>\left|T\right|/2.
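The voting strategy is independent of the particular kernel, which the sketch below makes explicit by taking the kernel as a parameter. The function name `predict_interface`, the pair-based layout of the training data, and the toy kernel are assumptions made for illustration. The toy kernel compares plain numbers and is only a stand-in for the shortest-path graph kernel.

```python
def predict_interface(kernel, g, training_pairs):
    """Majority vote over the training proteins.

    kernel: function (graph, graph) -> similarity.
    training_pairs: list of (G_int, G_non_int), one pair per training
                    protein; both patches of a pair have the same size.
    Returns True if g is predicted to be an interface patch.
    """
    votes = sum(
        1
        for g_int, g_non_int in training_pairs
        if kernel(g, g_int) > kernel(g, g_non_int)
    )
    return votes > len(training_pairs) / 2

def toy_kernel(a, b):
    # Stand-in similarity on numbers; NOT the shortest-path kernel.
    return -abs(a - b)

# Three hypothetical training proteins, each reduced to a pair of numbers.
pairs = [(1.0, 5.0), (1.2, 4.0), (3.0, 0.9)]
near_interface = predict_interface(toy_kernel, 1.1, pairs)  # 2 of 3 votes
near_other = predict_interface(toy_kernel, 4.5, pairs)      # 1 of 3 votes
```

Since each comparison involves two patches of equal size, the number of terms in the two kernel sums is the same, and the size artifact described above does not affect the vote.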