### Datasets

We downloaded 20,199 human protein sequences from the UniProt database [11]. The PPI data come from several resources, including MatrixDB, BioGRID, DIP, IntAct and InnateDB [12,13,14,15,16]. To build the SIP data set, we collected the PPI records in which a protein interacts with itself, which yielded 2,994 human SIP sequences.

To build the datasets systematically, the human SIP dataset was screened by the following steps [17]: (1) protein sequences longer than 5000 residues or shorter than 50 residues were removed from the whole human proteome; (2) to construct the positive data set, a selected SIP had to meet at least one of the following conditions: (a) at least two large-scale experiments or one small-scale experiment have shown that the protein interacts with itself; (b) the protein is annotated as a homo-oligomer in UniProt; (c) the self-interaction has been reported by more than one publication; (3) to construct the negative data set, all known SIPs were removed from the whole human proteome.
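The screening steps above can be sketched as a small filter. This is only an illustration under stated assumptions, not the pipeline used in the study: the record fields (`n_large_scale`, `n_small_scale`, `is_homooligomer`, `n_publications`, `known_sip`) and the function names are hypothetical, and criterion (a) is read as "two large-scale or one small-scale experiment".

```python
from dataclasses import dataclass

MIN_LEN, MAX_LEN = 50, 5000  # length bounds from step (1)


@dataclass
class ProteinRecord:
    # All field names are hypothetical; they only mirror the criteria in the text.
    accession: str
    sequence: str
    n_large_scale: int = 0         # large-scale experiments reporting self-interaction
    n_small_scale: int = 0         # small-scale experiments reporting self-interaction
    is_homooligomer: bool = False  # annotated as a homo-oligomer in UniProt
    n_publications: int = 0        # publications reporting the self-interaction
    known_sip: bool = False        # listed as self-interacting in any source


def passes_length_filter(rec):
    """Step (1): drop sequences shorter than 50 or longer than 5000 residues."""
    return MIN_LEN <= len(rec.sequence) <= MAX_LEN


def is_positive(rec):
    """Step (2): at least one of criteria (a), (b), (c) must hold."""
    a = rec.n_large_scale >= 2 or rec.n_small_scale >= 1
    b = rec.is_homooligomer
    c = rec.n_publications > 1
    return a or b or c


def split_dataset(records):
    """Return (positives, negatives) after length filtering."""
    kept = [r for r in records if passes_length_filter(r)]
    positives = [r for r in kept if r.known_sip and is_positive(r)]
    # Step (3): negatives are the proteins left after removing all known SIPs.
    negatives = [r for r in kept if not r.known_sip]
    return positives, negatives
```

Note that a known SIP that fails all three criteria ends up in neither set, which matches the rule that every known SIP is removed before the negative set is built.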

As a result, 1441 human SIPs were selected as the positive data set and 15,938 non-interacting human proteins as the negative data set. In addition, to further verify the usefulness of the proposed scheme, we constructed an *S. cerevisiae* SIP dataset covering 710 SIPs and 5511 non-SIPs using the same strategy.

### Position specific weight matrix

PSWM [18] was first adopted for detecting distantly related proteins. It has since been applied successfully in bioinformatics, including the prediction of protein disulfide connectivity, protein structural classes, subnuclear localization, and DNA or RNA binding sites [19,20,21,22,23]. In this study, we used the PSWM to predict SIPs. The PSWM of a query protein is a Y×20 matrix *M* = {*m*_{ij}: *i* = 1 ⋯ Y *and* *j* = 1 ⋯ 20}, where Y is the length of the protein sequence and the 20 columns of *M* correspond to the 20 amino acids. To construct the PSWM, a position frequency matrix is first created by counting the occurrences of each amino acid at each position. This frequency matrix can be written as *p*(*i*, *k*), where *i* is the position and *k* indexes the *k*th amino acid. The PSWM is then expressed as \( {m}_{ij}={\sum}_{k=1}^{20}p\left(i,k\right)\times w\left(j,k\right) \), where *w*(*j*, *k*) is a matrix whose elements represent the mutation value between amino acids *j* and *k*. Consequently, high scores indicate highly conserved positions and low scores indicate weakly conserved positions.
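The weighted sum that defines the PSWM can be illustrated with a toy example. The sketch below takes the position index as *i* and the amino-acid column as *j*; the alignment and the mutation matrix `toy_w` are made up for illustration (an identity matrix stands in for a real substitution matrix), and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def position_frequency_matrix(aligned_seqs):
    """p[i][k]: fraction of aligned sequences with residue k at position i."""
    length = len(aligned_seqs[0])
    p = [[0.0] * 20 for _ in range(length)]
    for seq in aligned_seqs:
        for i, aa in enumerate(seq):
            if aa in INDEX:  # skip gaps and non-standard symbols
                p[i][INDEX[aa]] += 1.0 / len(aligned_seqs)
    return p


def pswm(p, w):
    """m[i][j] = sum_k p[i][k] * w[j][k], giving a Y x 20 matrix."""
    return [[sum(p_i[k] * w[j][k] for k in range(20)) for j in range(20)]
            for p_i in p]


# Toy mutation matrix (assumption): 1 for identical residues, 0 otherwise.
# A real run would use a substitution matrix instead.
toy_w = [[1.0 if j == k else 0.0 for k in range(20)] for j in range(20)]

alignment = ["ACD", "ACE", "ACD"]  # made-up alignment of equal-length sequences
M = pswm(position_frequency_matrix(alignment), toy_w)
```

With the identity stand-in, each row of `M` simply reproduces the residue frequencies at that position, so a fully conserved position gives a single high score.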

In this paper, the PSWM of each protein sequence was generated using Position-Specific Iterated BLAST (PSI-BLAST) [24]. To obtain broad, high-quality homology information, we used three iterations and set the e-value to 0.001.
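As a rough illustration of these settings, the sketch below builds (but does not run) a command line for the NCBI BLAST+ `psiblast` program. The file names and the database name are placeholders, since the text does not state which database was searched, and mapping the stated e-value to the `-evalue` option rather than to the inclusion threshold is an assumption.

```python
import shlex


def psiblast_command(query_fasta, database, pssm_out,
                     iterations=3, evalue=0.001):
    """Build a psiblast argument list that writes an ASCII scoring matrix."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", str(iterations),
        "-evalue", str(evalue),
        "-out_ascii_pssm", pssm_out,
    ]


# Placeholder inputs: none of these names come from the paper.
cmd = psiblast_command("protein.fasta", "swissprot", "protein.pssm")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run` once BLAST+ and a formatted protein database are available locally.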

### Zernike moments

In this paper, Zernike moments are used to extract meaningful information from the protein sequence and to generate the feature vector [25,26,27,28,29,30]. To define the Zernike moments clearly, we first introduce the Zernike functions. Zernike introduced a set of complex polynomials that form a complete orthogonal set within the unit circle. These polynomials are denoted *V*_{xy}(*n*, *m*) and have the following form:

$$ {V}_{xy}\left(n,m\right)={V}_{xy}\left(\rho, \theta \right)={R}_{xy}\left(\rho \right){e}^{jy\theta}\quad \mathrm{for}\ \rho \le 1 $$

(1)

where *x* is a non-negative integer and *y* is an integer satisfying |*y*| ≤ *x* with *x* − |*y*| even. *ρ* is the length of the vector from the origin (0, 0) to the pixel (*n*, *m*), and *θ* is the counterclockwise angle between this vector and the *n* axis. The radial polynomial *R*_{xy}(*ρ*) is

$$ {R}_{xy}\left(\rho \right)=\sum \limits_{s=0}^{\left(x-|y|\right)/2}{\left(-1\right)}^s\frac{\left(x-s\right)!}{s!\left(\frac{x+\left|y\right|}{2}-s\right)!\left(\frac{x-\left|y\right|}{2}-s\right)!}{\rho}^{x-2s} $$

(2)
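A direct evaluation of the radial polynomial may help when checking the indices. The sketch assumes the standard form of the sum, with upper limit (*x* − |*y*|)/2 and the factorials of (*x* + |*y*|)/2 − *s* and (*x* − |*y*|)/2 − *s* in the denominator; the function name is hypothetical.

```python
from math import factorial


def radial_poly(x, y, rho):
    """Zernike radial polynomial R_xy(rho) for order x and repetition y."""
    y = abs(y)  # R depends only on |y|
    if y > x or (x - y) % 2:
        raise ValueError("need |y| <= x and x - |y| even")
    total = 0.0
    for s in range((x - y) // 2 + 1):
        coeff = ((-1) ** s * factorial(x - s)
                 / (factorial(s)
                    * factorial((x + y) // 2 - s)
                    * factorial((x - y) // 2 - s)))
        total += coeff * rho ** (x - 2 * s)
    return total
```

For example, `radial_poly(2, 0, rho)` evaluates 2ρ² − 1 and `radial_poly(3, 1, rho)` evaluates 3ρ³ − 2ρ, the familiar low-order radial polynomials.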

From equation (2), it follows that *R*_{x, − y}(*ρ*) = *R*_{xy}(*ρ*). These polynomials satisfy the orthogonality relation:

$$ \int_0^{2\pi }\int_0^1{V}_{xy}^{\ast}\left(\rho, \theta \right){V}_{pq}\left(\rho, \theta \right)\rho \, d\rho \, d\theta =\frac{\pi }{x+1}{\delta}_{xp}{\delta}_{yq} $$

(3)

with

$$ {\delta}_{ab}=\begin{cases}1, & a=b\\ 0, & \mathrm{otherwise}\end{cases} $$

(4)

The Zernike moments are obtained by evaluating Eq. (5):

$$ {A}_{xy}=\frac{x+1}{\pi }\sum_{\left(\rho, \theta \right)\in \mathrm{unit\ circle}}f\left(\rho, \theta \right){V}_{xy}^{\ast}\left(\rho, \theta \right) $$

(5)

To calculate the ZMs of a protein sequence represented by a PSWM, the origin is placed at the center of the matrix and the points of the matrix are mapped into the unit circle, i.e., *n*^{2} + *m*^{2} ≤ 1. Here *f*(*ρ*, *θ*) is the matrix value at the mapped point. Values falling outside the unit circle are not used in the calculation [31,32,33,34,35]. Note that \( {A}_{xy}^{\ast }={A}_{x,-y} \).
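The mapping and summation described above can be sketched in plain Python. The way indices are scaled to [−1, 1] on each axis is an assumption (the text only states that the origin is at the matrix center and that points outside the unit circle are skipped), and the function names are hypothetical.

```python
import cmath
import math
from math import factorial


def radial_poly(x, y, rho):
    """Radial polynomial R_xy(rho); assumes |y| <= x and x - |y| even."""
    y = abs(y)
    return sum((-1) ** s * factorial(x - s)
               / (factorial(s) * factorial((x + y) // 2 - s)
                  * factorial((x - y) // 2 - s))
               * rho ** (x - 2 * s)
               for s in range((x - y) // 2 + 1))


def zernike_moment(matrix, x, y):
    """Moment of order x and repetition y for a 2-D list of real values."""
    rows, cols = len(matrix), len(matrix[0])
    total = 0j
    for r in range(rows):
        for c in range(cols):
            # Map indices to [-1, 1] on each axis (assumed mapping),
            # which puts the matrix centre at the origin.
            n = (2 * c - (cols - 1)) / max(cols - 1, 1)
            m = (2 * r - (rows - 1)) / max(rows - 1, 1)
            rho = math.hypot(n, m)
            if rho > 1.0:  # outside the unit circle: not used
                continue
            theta = math.atan2(m, n)
            v_conj = radial_poly(x, y, rho) * cmath.exp(-1j * y * theta)
            total += matrix[r][c] * v_conj
    return (x + 1) / math.pi * total
```

Because the input is real, the moment with repetition −*y* is the complex conjugate of the moment with repetition *y*, so only one of the two needs to be computed.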

### Feature selection

In summary, Zernike moments can extract important information. One question that must be considered when using them is how large the maximum order *x*_{max} should be. Low-order moments capture coarse features, whereas high-order moments capture fine details. Figure 1 shows magnitude plots of the low-order Zernike moments. We need enough information for accurate classification, but must also limit the feature dimension to keep the computational cost down. In this experiment, *x*_{max} is set to 30 [36,37,38,39,40]. These moments constitute the feature vector of a protein sequence:

$$ \overrightarrow{F}={\left[\left|{A}_{11}\right|,\left|{A}_{22}\right|,\dots \dots, \left|{A}_{NM}\right|\right]}^T $$

(6)

where |*A*_{nm}| is the magnitude of the Zernike moment; here *n* and *m* denote the order and repetition, written *x* and *y* in Eqs. (1) to (5). The zeroth order moments are not computed because they do not contain any valuable information, and moments with *m* < 0 are omitted because they can be inferred from \( {A}_{n,-m}={A}_{nm}^{\ast } \).
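The index set behind the feature vector can be enumerated explicitly. The sketch below rests on one interpretation that the text does not spell out: moments with zero repetition are the ones left out, so the vector starts with |*A*_{11}|, |*A*_{22}| as in Eq. (6). The function name is hypothetical.

```python
def moment_indices(max_order):
    """(order, repetition) pairs with 1 <= repetition <= order <= max_order
    and order - repetition even."""
    return [(order, rep)
            for order in range(1, max_order + 1)
            for rep in range(1, order + 1)
            if (order - rep) % 2 == 0]


pairs = moment_indices(30)
print(len(pairs), pairs[:4])
```

Under this reading, a maximum order of 30 gives 240 index pairs.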

Finally, to suppress noise as far as possible and to reduce the computational complexity, the feature dimension was reduced from 240 to 150 by principal component analysis (PCA) [41].
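A minimal sketch of this reduction step, assuming NumPy is available and that PCA is computed from the singular value decomposition of the mean-centered feature matrix. The random matrix is only a placeholder for real feature vectors, and `pca_reduce` is a hypothetical helper, not the implementation used in the study.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Project the rows of `features` onto the top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 240))   # placeholder: 300 proteins x 240 moments
X_reduced = pca_reduce(X, 150)    # 300 x 150 after projection
```

In practice the projection would be fitted on the training set and then applied unchanged to the test set, so that no test information leaks into the components.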

### Long short-term memory

Long Short-Term Memory (LSTM), a special kind of recurrent neural network, performs much better than standard recurrent neural networks in many tasks. Almost all notable results based on recurrent neural networks have been achieved with LSTMs. In this work, a deep LSTM network is introduced for the first time to predict self-interacting proteins.

The main difference between an LSTM network and other networks is its use of complex memory blocks in place of the neurons of an ordinary network. A memory block contains three multiplicative ‘gate’ units (the input, forget, and output gates) along with one or more memory cells. The gate units control the information flow, and the memory cells store the historical information [42,43,44]. The structure of the memory block is shown in Fig. 2; to make the work of the gate units easier to follow, the memory cells are not shown. The gates remove information from, or add information to, the cell state by controlling the information flow. More specifically, the input and output of the information flow are handled by the input and output gates, respectively. The forget gate determines how much of the previous unit’s information is retained in the current unit. In addition, to enable memory blocks to store earlier information, we add peephole connections that link the memory cell to the gates [45, 46].

To complete the mapping from input *x* to output *h*, the information flow passing through a memory block undergoes the following operations:

$$ {i}_t= sigm\left({W}_i\bullet \left[{C}_{t-1},{x}_t,{h}_{t-1}\right]+{b}_i\right) $$

(7)

$$ {f}_t= sigm\left({W}_f\bullet \left[{C}_{t-1},{x}_t,{h}_{t-1}\right]+{b}_f\right) $$

(8)

$$ {o}_t= sigm\left({W}_o\bullet \left[{C}_t,{x}_t,{h}_{t-1}\right]+{b}_o\right) $$

(9)

$$ {\check{C}}_t=\tanh \left({W}_C\bullet \left[{x}_t,{h}_{t-1}\right]+{b}_C\right) $$

(10)

$$ {C}_t={C}_{t-1}\ast {f}_t+{\check{C}}_t\ast {i}_t $$

(11)

$$ {h}_t=\tanh \left({C}_t\right)\ast {o}_t $$

(12)

Here, the symbols involving *C* are cell activation vectors: *C*_{t} is the cell state and the accented *C* in Eq. (10) is the candidate state. The symbols *f*, *i* and *o* denote the forget, input and output gates, respectively, and *h* is the block output. The *W* terms (*W*_{i}, *W*_{f}, *W*_{o}, *W*_{C}) are weight matrices, the *b* terms (*b*_{i}, *b*_{f}, *b*_{o}, *b*_{C}) are biases, sigm is the sigmoid function, and ∗ is the element-wise product of vectors.
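One forward step through a memory block can be sketched with NumPy. The sketch follows the gate equations with peephole inputs and uses the standard output rule, in which the block output is tanh of the cell state multiplied element-wise by the output gate. The parameter layout, the initialization scale and the function names are assumptions made for illustration.

```python
import numpy as np


def sigm(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of a peephole LSTM block."""
    gate_in = np.concatenate([c_prev, x_t, h_prev])   # [C_{t-1}, x_t, h_{t-1}]
    i_t = sigm(params["W_i"] @ gate_in + params["b_i"])
    f_t = sigm(params["W_f"] @ gate_in + params["b_f"])
    c_hat = np.tanh(params["W_C"] @ np.concatenate([x_t, h_prev])
                    + params["b_C"])                  # candidate state
    c_t = c_prev * f_t + c_hat * i_t
    out_in = np.concatenate([c_t, x_t, h_prev])       # output gate sees C_t
    o_t = sigm(params["W_o"] @ out_in + params["b_o"])
    h_t = np.tanh(c_t) * o_t
    return h_t, c_t


def init_params(n_in, n_hidden, rng):
    """Small random weights and zero biases (illustrative choice)."""
    gate_cols = n_hidden + n_in + n_hidden
    p = {f"W_{g}": rng.normal(scale=0.1, size=(n_hidden, gate_cols))
         for g in "ifo"}
    p["W_C"] = rng.normal(scale=0.1, size=(n_hidden, n_in + n_hidden))
    p.update({f"b_{g}": np.zeros(n_hidden) for g in ("i", "f", "o", "C")})
    return p


rng = np.random.default_rng(0)
params = init_params(n_in=4, n_hidden=3, rng=rng)
h, c = np.zeros(3), np.zeros(3)
for x_t in rng.normal(size=(5, 4)):   # a made-up sequence of five inputs
    h, c = lstm_step(x_t, h, c, params)
```

Carrying `h` and `c` from one call to the next is what lets the block retain information from earlier inputs.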

### Stacked long short-term memory

A large body of theoretical and practical results shows that deep hierarchical network models can be more competent for complex tasks than shallow ones. We construct the Stacked Long Short-Term Memory (SLSTM) network by stacking multiple LSTM hidden layers on top of each other; it contains one input layer, three LSTM hidden layers and one output layer. Figure 3 shows an SLSTM network. The number of neurons in the input layer equals the dimension of the input data. Each LSTM hidden layer consists of 16 memory blocks. The number of neurons in the output layer equals the number of classes. The numbers of neurons or memory blocks in the layers of the network are therefore *200-16-16-16-2*. In the output layer, the softmax function is used to generate probabilistic results.
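To make the layer sizes concrete, the sketch below counts trainable parameters for the stated layout and shows a softmax over two class scores. It assumes one memory cell per block, gate weight matrices that act on the concatenated cell state, input and previous output as in Eqs. (7) to (9), and a dense output layer; the example scores are made up.

```python
import math

LAYER_SIZES = [200, 16, 16, 16, 2]  # input, three LSTM layers, output


def lstm_layer_params(n_in, n_blocks):
    """Weights and biases of one LSTM layer with peephole gate inputs."""
    gate = n_blocks * (n_blocks + n_in + n_blocks) + n_blocks  # i, f or o
    candidate = n_blocks * (n_in + n_blocks) + n_blocks
    return 3 * gate + candidate


def softmax(scores):
    """Turn raw scores into probabilities that sum to one."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


hidden = sum(lstm_layer_params(n_in, n_out)
             for n_in, n_out in zip(LAYER_SIZES[:-2], LAYER_SIZES[1:-1]))
output = LAYER_SIZES[-2] * LAYER_SIZES[-1] + LAYER_SIZES[-1]
total_params = hidden + output
probs = softmax([1.2, -0.3])  # made-up scores for the two classes
```

Most of the parameters sit in the first hidden layer, because its gates read the full input vector, while the two deeper layers only read 16 values each.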

### Preventing overfitting

Overfitting affects many prediction and classification models, and deep learning models with superior performance are no exception. A great deal of theoretical and practical work has shown that overfitting can be reduced or avoided by applying the “dropout” operation to a neural network. Dropout provides a way to approximately combine exponentially many different neural network architectures [47]. More specifically, dropout involves two operations: (1) during training, it randomly discards hidden units, together with the edges connected to them, with a fixed probability for each training case; (2) at test time, it integrates the many neural networks generated during training. The first operation produces a different network for almost every training case, and these networks share the same weights for the hidden units. Fig. 4 shows a network model after dropout has been applied. At test time, all hidden-layer neurons are used without dropout, but the weights of the network are scaled-down versions of the trained weights. The scaling factor equals the probability of the unit being retained [48]. With this weight scaling, a large number of dropout networks can be merged into a single neural network whose performance is similar to averaging over all the networks [49].
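The two phases can be sketched in a few lines of plain Python. The retention probability `p_keep = 0.5`, the activations and the weights are illustrative values only (the text does not state the dropout rate that was used), and the function names are hypothetical.

```python
import random


def dropout_train(activations, p_keep, rng):
    """Training: keep each unit with probability p_keep, zero it otherwise."""
    return [a if rng.random() < p_keep else 0.0 for a in activations]


def scale_weights_for_test(weights, p_keep):
    """Test time: all units stay active and outgoing weights are scaled."""
    return [w * p_keep for w in weights]


def weighted_sum(activations, weights):
    return sum(a * w for a, w in zip(activations, weights))


rng = random.Random(0)
p_keep = 0.5                         # illustrative retention probability
acts = [1.0, 2.0, 3.0, 4.0]          # made-up hidden activations
weights = [0.5, -1.0, 0.25, 2.0]     # made-up outgoing weights to one unit

# One randomly thinned network, as seen by a single training case.
train_out = weighted_sum(dropout_train(acts, p_keep, rng), weights)

# The single scaled network used at test time.
test_out = weighted_sum(acts, scale_weights_for_test(weights, p_keep))
```

Averaging `train_out` over many random masks approaches `test_out`, which is the sense in which the scaled network approximates the average of the thinned networks for this linear unit.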