We construct a phylogenetic tree for a set of DNA sequences. Our method generally contains following processes: aligning sequence [14–16], building phylogenetic trees, and selecting phylogenetic tree.
Aligning sequence
The genetic information storage location has some differences on distinct species, such as information length and carrier of genetic information. These differences will affect our subsequent analysis. Therefore, we should arrange all possible similar sites in the same position, via a progressive algorithm of multiple sequence alignment. We adopt representational evolutionary multiple sequence alignment algorithm, called ClustalW [17–19]. It displays the alignment score, in form of identities, similarities and differences, and a guide tree of evolutionary relationship between aligned sequences.
Building phylogenetic trees
The phylogenetic tree consists of many nodes and branches, where the node represents a taxon, namely species or sequence; the branch represents the evolutionary relationship between species [20, 21]. All nodes are divided into external nodes and internal nodes. In general, the external node represents actual observed taxon, the internal node represents location of evolutionary event.
Phylogeny model
Given the genetic information, we need the specific phylogeny model to predict evolutionary tree. First, we use the substitution model in terms of conversion rate. In general, the instantaneous conversion matrix is expressed as follows.
where this matrix specifies the rate of change from nucleotide i-row to nucleotide j-column. The nucleotides are in the order A,C,G,T. The stationary frequencies of nucleotides (π
A
, π
C
, π
G
, π
g
) are obtained by letting the substitution process run for a very long time.
The instantaneous conversion rate matrix describes the ratio of substitutions in a short period of time, but we need to calculate probabilities of changes in a certain period of time. Then, the probability matrix can be calculated as follows.
where Q is the instantaneous rate matrix, t is the branch length.
For a variety of evolutionary trees, we can calculate the likelihood of each phylogenetic tree. We need to consider the transformation between one external node and one internal node, and also consider the transformation between two internal nodes. For a specific site, we can calculate the likelihood of phylogenetic tree as follows.
$$L = \sum_{y}\pi_{y_{2s-1}}\prod_{k=1}^{s}p_{y_{\sigma(k)},x_{k}}(v_{k})\prod_{k=s+1}^{2s-2}p_{y_{\sigma(k)},y_{k}}(v_{k}) $$
where x and y are the external node and the internal node, respectively. σ(k) is the prefix index of k, v
k
is the branch length between y
σ(k) and x
k
/y
k
. External nodes are s input sequences, that is, s species; according to the graph theory, we can get a total of 2s−1 internal nodes.
Log-likelihood
We assume that all sites are independent with each other. We can calculate the likelihood of each site [22], and then multiply them together to get final likelihood of phylogenetic tree.
We put all possible permutations, and then calculate the likelihood of all possibilities. For a specific site, the likelihood is the sum possibility of all internal nodes, denoted by L
j
. The likelihood of all sites can be calculated as follows.
$$\ln L = \sum_{j=1}^{N}L_{j} $$
where N refers to the length of sequence and the total number of sites.
Bayesian inference
We can use Bayesian inference [23] to produce the posterior probability of i-th phylogenetic tree, τ
i
, as follows.
$$f(\tau_{i}|X) = \frac{f(X|\tau_{i})f(\tau_{i})}{\sum_{j=1}^{B(s)}f(X|\tau_{j})f(\tau_{j})} $$
where f(τ
i
|X) is the posterior probability of τ
i
, f(X|τ
i
) is the likelihood of τ
i
, and f(τ
i
) is the prior probability of τ
i
. B(s) is the number of all possible trees.
Selecting phylogenetic tree
Typically, the posterior probability of phylogenies cannot be calculated analytically, but it can be approximated by sampling phylogenetic trees from the posterior probability distribution.
Markov chain Monte Carlo
Markov chain Monte Carlo (MCMC) [24] can be used to sample phylogenies according to their posterior probabilities. The Metropolis-Hastings-Green (MHG) algorithm is an MCMC method that has been used successfully to approximate posterior probabilities of trees. MHG algorithm constructs a Markov chain with the stationary frequency of posterior probability. The current state is denoted as τ, and a new state is proposed as \(\tau ^{'}\phantom {\dot {i}\!}\). The new state is accepted with probability as follows.
$$\begin{aligned} R & = min\left(1, \frac{f(\tau^{'}|X)}{f(\tau|X)} \times \frac{f(\tau|\tau^{'})}{f(\tau^{'}|\tau)} \right) \\ & = min\left(1, \frac{f(X|\tau^{'})f(\tau^{'})/f(X)}{f(X|\tau)f(\tau)/f(X)} \times \frac{f(\tau|\tau^{'})}{f(\tau^{'}|\tau)}\right) \\ & = min\left(1, \frac{f(X|\tau^{'})}{f(X|\tau)} \times \frac{f(\tau^{'})}{f(\tau)} \times \frac{f(\tau|\tau^{'})}{f(\tau^{'}|\tau)}\right) \end{aligned} $$
One important problem of MCMC method is that we can only get the local optimal result, but not the global optimum. As shown in Fig. 1, if the current state is at the peak of T
1, because of the jump decision, the probability of next state must be less than one of current state, so MCMC method may get T
1, but miss better T
2.
Multi-chain Markov chain Monte Carlo
When the distribution becomes flat, Multi-Chain Markov Chain Monte Carlo (MCMCMC) is easy to get down from the peak of local optimum, and then try to get more states. We set a cold chain, and rest of heat chains obtained by heat values. The heat value is obtained as follows.
$$ \beta_{i} = \frac{1}{1 + c(i-1)} $$
where c is the heat coefficient according to the specific experimental data, i is the chain number. The state value of i-chain is calculated as \(\phantom {\dot {i}\!}f_{i}(s) = f_{1}(s)^{\beta _{i}}\). Easy to see, the distribution is more gentle, as shown in Fig. 2.
Exchange occurs between two selected chains, and the exchange rate is determined as follows.
$$ R = \frac{f_{i}(s_{j})f_{j}(s_{i})}{f_{i}(s_{i})f_{j}(s_{j})} $$
where s is the state of chain, f(s) is the state s corresponding to the state value in the special chain. When R is more than or equal to 1, it must be exchanged; when R is less than 1, it may be exchanged with probability value.
Conscious detection
The correlation model of phylogenetic analysis [9] assumes that phylogenetic tree is built on homogeneous data, therefore there exists a large deviation in the presence of heterogeneous data. We use conscious detection to solve compositional heterogeneity. We make a conscious detection of phylogenetic tree, analyze multiple sampling, and compare different samples obtained from estimated values. We extract the partial data from original data and form a new data set. Hundreds of data sets are used to generate different phylogenetic trees, and then get the support rate of different branches in the phylogenetic tree generated by actual data.
For m×n data set matrix, we select a random number from 1 to n, and obtain the column corresponding to this random number as re-sampling data for the first column; then repeat the above step to obtain re-sampling data of the second column, and so on. After N-loops selection, we get the final data set with same length of the original data set. For obtained data set, we analyze the phylogenetic tree according to phylogenetic analysis. Finally, we get N phylogenetic trees and their posterior probabilities, and analyze the genetic information.