The proposed method optimizes gene set annotations using gene expression data and structure information of gene ontology. The process of this algorithm is as follows:
(i) In the process of data preprocessing, gene expression data will be normalized by Z-score. For cancer gene expression data, the differentially expressed genes will be selected for the follow analysis.
(ii) The extension of gene (gene product) annotations is quantified by integrating the random walks with restart algorithm [16] and the true path rule [17] and the extended annotation matrix will contain GO structure information.
(iii) In the process of gene annotation optimization, Genes are divided into several categories created by the clustering algorithm using gene expression data. Then a cluster contained the largest number of annotations for each GO term will be retained as a result of the gene annotations. Different from the previous methods, we convert the clustering results into the probability values and use it to optimize the annotation matrix. After optimizing by the GO structure and gene expression data, genes in the same cluster (gene set) in the annotation matrix will be given similar annotations, namely, the optimization algorithm successfully improves the quality of gene annotations.
Annotation matrix construction
The classical annotation matrix is generally a binary matrix and it places all annotations on the same horizontal line, e.g. all the gene annotation information is converted into 0 or 1 in the annotation matrix [15]. In the paper, the elements of annotation matrix are probability values which more fully reflects the annotation information based on the GO structure from downward random walk algorithm [10] as shown Eq. (1).
$$ {\begin{aligned} A &= \left[ \begin{array}{ccc} a_{1,1} & \cdots & a_{1,n}\\ \vdots & \ddots & \vdots\\ a_{t,1} & \cdots & a_{t,n} \end{array} \right ] \\ a_{i,j}&= \left\{ \begin{array}{ll} 1\ \ &gene\ j\ is\ annotated\ by\ term\ i\ \\ &in\ the\ existing\ GO\ data\\ p(i,j) & the\ probability\ of\ \\ & unknown\ annotation\ \\ & for\ gene\ j\ and\ term\ i \end{array}\right. \end{aligned}} $$
(1)
where ai,j denotes gene j is annotated by term i. The p(i,j) denotes the predicted probability that gene j is annotated by term i using downward random walks algorithm for the unknown annotation for gene j and term i. (i=1,⋯,t, j=1,⋯,n).
The dRW algorithm has been used to obtain the GO structure information in this article. The first step in the construction of annotation matrix is to quantify the similarity between different GO terms. In this article, we used Gene Ontology semantic similarity Tool (GOssTo) [18] to calculate semantic similarity and selected Lin’s [19] similarity measuring method.
After calculating the semantic similarity, downward random walks with restart [16] on the GO directed acyclic graph (DAG) has been introduced to find out unknown gene annotations. GO annotations abide by the true path rule, namely, if a gene is annotated with a GO term, it will be annotated with all of the ancestor terms of this specific term in the GO DAG. Thus gene annotation extension is mainly how to establish links between genes and the offspring of existing annotated terms. Downward random walks with restart algorithm can chase down the potential functions of a gene for the available terms associated to the same gene.
The adjacency matrix of GO DAG is constructed to show the association of terms. It is defined as follows:
$$ \begin{aligned} X &= \left[ \begin{array}{ccc} x_{1,1} & \cdots & x_{1,t}\\ \vdots & \ddots & \vdots\\ x_{t,1} & \cdots & x_{t,t} \end{array} \right]\\ x_{p,q} &= \left\{ \begin{array}{ll} 1\ & term\ q\ is\ child\ of\ term\ p\\ 0 & otherwise \end{array}\right. \end{aligned} $$
(2)
where p=1,⋯,t, q=1,⋯,t.
After getting the adjacency matrix X, the corresponding probability matrix T is constructed using the semantic similarity matrix SSM and adjacency matrix X. For term p and term q, we use trans(p,q)=SSM(p,q)×X(p,q) to filter the semantic similarity between different terms and represent the random walk probability between different terms. Since we need to get the probability values, we normalize the corresponding probability matrix T by the following formula:
$$ T(p,q)=\frac{trans(p,q)}{\sum_{m\in ancestor(term q)} trans(m,q)} $$
(3)
where ancestor(termq) denotes all ancestor terms ofterm q.
The downward random walkers iteratively reach the descendant nodes of the starting term according to the corresponding probability matrix T. The final iteration process converges to the steady state. The process is defined as follows:
$$ R^{iter+1}=\beta R^{iter} T + (1-\beta)I $$
(4)
where β∈[0,1] is the restart probability between different terms, 1−β is the probability of a random walker staying the current term. I is an identity matrix. Riter represents the current transition matrix and Riter+1 represents the transition matrix at the next moment. When the difference between Riter+1 and Riter is less than a fixed value, the transition matrix R reaches a steady state, and at this point, the iterative process is terminated, and R∗ is obtained. Here the difference between Riter+1 and Riter will be calculated using the following formula:
$$ \sum_{p=1}^{t}\sum_{q=1}^{t}\left|R_{p,q}^{iter+1}-R_{p,q}^{iter}\right|<\epsilon $$
(5)
where t denotes the number of terms. The ε denotes a preset value. When the value on the left side of the inequality is less than ε, the iterative process will be terminated.(according to our empirical study, the number of iteration is less than 15) And the transition matrix at this time is said to be in a steady state. (p=1,⋯,t, q=1,⋯,t)
The process of downward random walk algorithm will achieve the steady state after several iterations. Finally, the transition probability matrix in the steady state will be combined to construct gene annotation probability matrix. It is defined as follows:
$$ A(i,j) = \sum_{e \in \chi_{j}}{R^{*}(e,i)} \ \ \ \ s.t.\ R^{*}(e,i)>\theta $$
(6)
where θ is the row mean of the transition probability matrix in the steady state. Its main purpose is to remove those very small probability values. χj represents the set of GO terms annotated to the gene j. The term e represents the terms that do not annotate the gene j. A(i,j) is the probability that the gene j is annotated by the term i. R∗ is the transition probability matrix in the steady state.
Finally, we can get the gene annotation matrix contained the GO structure information. Compared to the traditional method, threshold is not directly introduced to decide whether to annotate genes using GO terms in this article. Probability values are retained as inputs to the gene annotation optimization algorithm to compensate for the inability of the optimization algorithm to increase annotation data.
Gene annotations optimization algorithm
In this section, we introduce the process of gene set annotation optimization algorithm. Our main goal is to annotate the same function for genes with similar expression patterns based on the gene expression data, and filter out inconsistent functional annotation. The algorithm is designed as follows:
-
Genes are divided into k partitional clusters through clustering algorithm (e.g. K-means) based on gene expression data and establish a category matrix C according to cluster result.
$$ C= \left[ \begin{array}{ccc} c_{1,1} & \cdots & c_{1,k}\\ \vdots & \ddots & \vdots\\ c_{n,1} & \cdots & c_{n,k} \end{array} \right ] \ c_{j,l}= \left\{ \begin{array}{ll} 1\ \ &gene\ j\ belongs\ to\ cluster\ l\\ 0 & otherwise \end{array}\right. $$
(7)
where cj,l denotes gene j belongs to cluster l. (j=1,⋯,n, l=1,⋯,k)
-
Let A×C, get annotation statistic matrix S.
$$S= \left[ \begin{array}{ccc} s_{1,1} & \cdots & s_{1,k}\\ \vdots & \ddots & \vdots\\ s_{t,1} & \cdots & s_{t,k} \end{array} \right] $$
-
Normalize the annotation statistic matrix S by Eq. (8)
$$ s_{i,j}=\frac{s_{i,j}}{sum(s_{i,*})} $$
(8)
where sum(si,∗) is the sum of row i.
-
Calculate the optimized annotation matrix A∗
$$ A^{*}= \left[ \begin{array}{ccc} a_{1,1}^{*} & \cdots & a_{1,n}^{*}\\ \vdots & \ddots & \vdots\\ a_{t,1}^{*} & \cdots & a_{t,n}^{*} \end{array} \right] a_{i,j}^{*}=a_{i,j}c_{j,kmax}s_{i,kmax} $$
(9)
where kmax is the column with the max probability of each row in the matrix S. Each row of S will be used to find the largest subset of annotations. If there are multiple clusters with same values, the algorithm will select one cluster randomly.
-
Through Z−1 iterations, get Z−1 optimized annotation matrix \(A_{k}^{*}\) (k=2,⋯,Z) and the average optimized annotation matrix.
$$ A_{ave}^{*}=\frac{1}{Z-1}\sum_{k=2}^{z} A_{k}^{*} $$
(10)
-
Calculate the final optimized annotation matrix \(A_{final}^{*}\) after N bootstrap resample datasets
$$ A_{final}^{*}=\frac{1}{N}\sum_{k=2}^{z} A_{ave}^{*} $$
(11)