Quartet-based methods to reconstruct phylogenetic networks

Background Phylogenetic networks are employed to visualize evolutionary relationships among a group of nucleotide sequences, genes or species when reticulate events like hybridization, recombination, reassortant and horizontal gene transfer are believed to be involved. In comparison to traditional distance-based methods, quartet-based methods consider more information in the reconstruction process and thus have the potential to be more accurate. Results We introduce QuartetSuite, which includes a set of new quartet-based methods, namely QuartetS, QuartetA, and QuartetM, to reconstruct phylogenetic networks from nucleotide sequences. We tested their performances and compared them with other popular methods on two simulated nucleotide sequence data sets: one generated from a tree topology and the other from a complicated evolutionary history containing three reticulate events. We further validated these methods to two real data sets: a bacterial data set consisting of seven concatenated genes of 36 bacterial species and an influenza data set related to recently emerging H7N9 low pathogenic avian influenza viruses in China. Conclusion QuartetS, QuartetA, and QuartetM have the potential to accurately reconstruct evolutionary scenarios from simple branching trees to complicated networks containing many reticulate events. These methods could provide insights into the understanding of complicated biological evolutionary processes such as bacterial taxonomy and reassortant of influenza viruses.


Background
In the natural history of life, recombination, reassortant, hybridization and horizontal gene transfer (HGT) represent four types of important reticulate evolutionary events. For example, recombination plays a significant role in driving human genome evolution [1]; reassortant occurs frequently in influenza viruses and has facilitated the generation of 1957 H2N2, 1968 H3N2, and 2009 H1N1 pandemic influenza viruses [2]; hybridization is crucial in the evolution of plants and fish [3]; and HGT is a significant evolutionary mechanism in shaping the diversification of bacteria genomes [4]. Although most evolutionary events can be modeled as tree-like relationships, these reticulate events can be expressed much more http://www.biomedcentral.com/1752-0509/8 /21 In past decades, a number of methods have been proposed for reconstructing implicit networks either from pairwise distances, e.g. Split-Decomposition [10,11], Median Network [12], and Neighbor-Net [13] or from weighted triplets and quartets, e.g. QNet [14], SuperQ [15], and QuartetNet [16]. Distance-based methods are usually computationally efficient but tend to be inaccurate since pairwise distances reflect only small information of two taxa, while quartet-based methods are more accurate since quartet contains the information of four taxa. However, simulation studies show that current quartet-based methods are suffered from identifying false-positive and false-negative splits of the taxa set, and the trivial split weights are usually systematically overestimated [16].
Consistency is an important criteria to evaluate the reconstruction performance of a reconstruction method. A reconstruction method is considered as consistent on a special set of trees or networks if the method reconstructs precisely every tree or network in the set provided that the input data are generated from it and that sufficient data are available. Theoretically, one can prove that Neighbor-Joining is consistent on compatible split systems (trees) [17]; Split-Decomposition is consistent on weakly compatible split systems [10,11]; Neighbor-Net, QNet and SuperQ are consistent on circular split systems [13][14][15]; and Quartet-Net is consistent on 2-weakly compatible split systems [16]. Since 2-weakly compatible split system contains trees, circular split system, and weakly compatible split system as proper subsets, Neighbor-Joining, Neighbor-Net, Split-Decomposition, QNet, and superQ tend to have more false-negatives because these methods restrict the splits they reconstruct to be compatible, weakly compatible, and circular. Any split does not fit the criterion will be removed even if they are true split. Quartet-Net also have the potential to generate false negatives since it takes minimum in calculating the weight of a split whenever there are many possible scenarios [16]. A looser criterion like second minimum, average, or maximum should be able to keep more true splits. In addition, all methods tend to generate some false-positive splits when there are some random mutations or sequence errors, and thus a filtering strategy should be applied to remove the false positives while keeping the final results interpretable.
In this paper, we present three quartet-based methods QuartetS, QuartetA, and QuartetM in QuartetSuite to reconstruct split networks from a collection of weighted triplets and quartets. These methods first calculate triplet and quartet weights directly from multiple sequence alignments (MSAs) by a parsimony method and then functions by iteratively decomposing all triplet and quartet weights into simple components based on full splits. The three methods QuartetS, QuartetA, and QuartetM are designed in the same manner with slight differences.
Specifically, QuartetM is a maximum method, in which we take the maximum whenever there are several possible weights of a split. Similarly, QuartetA is an average method and QuartetS is a minimum method. Analyses on simulation data and real data show that these methods are capable of reconstructing accurate phylogenies from branching trees to complicated scenarios containing many reticulate events. In addition, the methods are effective in inferring phylogenetic distances.

Results and discussion
The proposed quartet-based methods were validated through application into two artificial data and two real data sets. The first artificial data set was simulated from a simple tree phylogeny, whereas the second one was from a complicated phylogenetic scenario containing three reticulate events. The purpose was to show that the quartet-based methods are competent in accurately reconstructing a wide range of phylogenetic networks, from branching trees to very complicated reticulate phylogenies. We chose a bacterial data set consisting of seven concatenated genes of 36 bacterial species whose evolutionary history is generally believed to contain very few reticulate events. We also chose an influenza data set containing 22 selected influenza A viruses related to the evolutionary pathways for the recently emerging H7N9 low pathogenic avian influenza virus. Since the confirmation of the first H7N9 case on March 27, 2013 [18], H7N9 has caused more than 136 human cases in China (http://www.who.int/influenza/human_animal_ interface/influenza_h7n9/Data_Reports/en/). The readers are referred to the online website http://sysbio. cvm.msstate.edu/QuartetMethods/ for the nucleotide sequences and nexus files used and generated in this study.

Simulated data
The software Dawg [19] was applied to generate six DNA sequences from a phylogenetic tree in Figure 1 and seven DNA sequences from a phylogenetic network containing three reticulate events in Figure 2. Specifically, the model is set to be GTR+Gamma+I; the substitution rate is 0.01 and the sequence length is 10,000 bp for tree and 80,000 bp for network since it is a concatenation of eight underlying trees. To avoid randomness, we completed 100 runs of Dawg and the 100 multiple sequence alignment (MSA) were applied to six methods QuartetS, QuartetA, QuartetM, Quartet-Net [16], Neighbor-Net [13], and Neighbor-Joining [20] for phylogenetic tree and network reconstruction. Table 1 lists all true splits and splits reconstructed by the six methods with bootstrap values larger than or equal to http://www.biomedcentral.com/1752-0509/8/21 15 in 100 runs for the tree data. The value in the column "wei" denotes the averaged weight on 100 runs. We only listed one block of a split in Table 1 to increase clarity. For instance, split ab|cedf is listed as ab. We normalized each split with the weight of split ab since it was successfully reconstructed by all of these six methods. All methods successfully reconstructed the true splits in all 100 runs. However, Neighbor-Net and Quartet-Net reconstructed a few false-positive splits. For example, Neighbor-Net Each node labeled in lower case letter indicates a taxon and r is the common ancestor. There are three reticulate events labeled by A, B, and C. For example, the sequence of taxa h is resulted from concatenating partial sequence from j and partial sequence from k in reticulate event A. reconstructed 10 additional splits with bootstrap values varying from 15 to 40. These false-positive splits might be due to some random mutations in the data. In addition, except for QuartetA and QuartetM, all other methods systematically over-estimate the weights of trivial splits and the weights from QuartetA is closer to the true weights with a root mean square error of 0.016. Thus, QuartetA might be good for reconstructing tree-like phylogenies.

Analysis on a phylogenetic network with three reticulate events
In Table 2, all true splits and splits reconstructed by the six methods with bootstrap value larger than or equal to 10 for the network data are listed. The weight of a true split in Figure 2 is calculated as summing up the split weights of this split in eight underlying trees by switching off one branch in a reticulate event. For example, an underlying tree is obtained by switching off three branches jA, hC, and iB in reticulate events A, B, and C. Similarly, we also normalize each split with the weight of split abc and then multiple it by four for convenience. As can be seen from the table, QuartetS, QuartetA, QuartetM, and Quartet-Net accurately reconstruct all the true splits in all 100 runs, while Neighbor-Net and Neighbor-Joining fail to reconstruct a lot of true splits. For example, Neighbor-Net fails to reconstruct the split acefg in more than 90 out of 100 runs and Neighbor-Joining fails to reconstruct splits abd, abce, abcg, aefg, abceg, and acefg in all 100 runs. This occurs because that Neighbor-Joining only reconstruct trees and Neighbor-Net also reduces the splits to make the split system planar. As a result, the reconstructed trees and networks are much simpler than the true phylogeny, distorting the originally complicated evolutionary histories. In addition, Neighbor-Net and Quartet-Net also reconstruct a few false-positive splits with low weights. It is worth noting that QuartetS reconstructs the closest non-trivial split weights with a root mean square error of 0.054 and QuartetA reconstructs the most accurate trivial split weights with a root mean square error of 0.124. Thus, QuartetS and QuartetA could be useful in reconstructing phylogenetic networks with a lot of reticulate events.
For a better comparison of the reconstruction performance of QuartetA, Quartet-Net, Neighbor-Net, and Neighbor-Joining on simulated data sets, we plot the sensitivity and specificity of these methods over the 100 runs in Figure 3(A) and 3(B), respectively. We only plot Quar-tetA because the sensitivity and specificity of QuartetS, QuartetA, and QuartetM are the same by our definition (see Methods). For tree reconstruction, the sensitivities of all methods and the specificities of Neighbor-Joining and QuartetA are equal to 1, while Neighbor-Net has the lowest median specificity. For network reconstruction, only QuartetA has the perfect performance, indicating http://www.biomedcentral.com/1752-0509/8/21 The column "True Phylo" denotes the underlying truth and the columns "QuartetS", "QuartetA", "QuartetM", "Quartet-Net", "Neighbor-Net", and "Neighbor-Joining" denote the reconstructed results by each method respectively. In addition, "Wei" denotes the average weight of the corresponding split over 100 runs; "BV" denote bootstrap value.
its potential in reconstructing complex phylogenetic networks, whereas the sensitivity of Neighbor-Joining is the lowest.

Real data
To remain concise, we only showed the phylogenetic networks constructed by QuartetS as it performs well in reconstructing non-trivial splits in the artificial data above.

Analysis on bacteria data
The bacteria sequence data set consists of concatenated sequences of seven important genes (16S rRNA, 23S rRNA, gyrB, phyH, recA, rpoA, and rpoD) from 36 bacteria species, with lengths around 9200 ∼ 12700 base pairs. Each sequence falls into three groups (GC-poor, GC-median, GC-rich) according to their percentage levels ( ≈ 30%, ≈ 50% and ≈ 60%) of GC content. There are 14 GC-poor, 11 GC-median, and 11 GC-rich bacteria respectively. The readers are referred to [21] for the detailed sequence information of concatenated as well as each gene of the species.
We use Clustal-W [22] to align 11 GC-rich sequences, 25 GC-poor and GC-rich sequences, and all 36 sequences, respectively. The obtained multiple alignments are taken as inputs to QuartetS, QuartetA, QuartetM, Quartet-Net [16], Neighbor-Joining [20], and Neighbor-Net [13]. We ran the program in a Dell desktop with 2.93G HZ processor and 4 GB memory. In practice, Neighbor-joining is the fastest, and the time for QuartetS, QuartetA, Quar-tetM, and Quartet-Net vary from seconds to around 2 minutes for different MSA sequences. Then, the reconstructed weighted split systems are viewed by SplitsTree [23]. Due to page limitations, we only show the three split networks reconstructed by QuartetS. Figure 4 shows the split network reconstructed by Quar-tetS on 11 GC-rich bacteria. As commonly believed, it is generally a tree structure with reticulate blocks of very small weights. Figure 5 and Figure 6 show the QuartetS split network on 25 GC-poor and GC-rich bacteria, and all 36 bacteria respectively, which are also generally tree-like with a few reticulation blocks. The bacteria in the same genus are classified together, supporting the current bacterial taxonomy. An interesting observation is that there is http://www.biomedcentral.com/1752-0509/8/21 The column "True Phylo" denotes the underlying truth and the columns "QuartetS", "QuartetA", "QuartetM", "Quartet-Net", "Neighbor-Net", and "Neighbor-Joining" denote the reconstructed results by each method respectively. In addition, "Wei" denotes the average weight of the corresponding split over 100 runs; "BV" denote bootstrap value. a split in Figure 5, which divides the GC-poor and GC-rich bacteria. And the GC-median bacteria are mixed among GC-rich and GC-poor bacteria in Figure 6. The results suggest that GC content might be an important factor in the evolution of bacteria.

Analysis on flu data
We downloaded whole genome sequences of 22 influenza A viruses related to the recently emerging H7N9 low pathogenic avian influenza viruses in China [18]. Similarly, we aligned them using Clustal-W [22] and reconstructed the split network by QuartetS.
The reconstructed network are shown in Figure 6. In Figure 7, the three viruses that caused human a virus D associated with H7N3 and H7N7 viruses. Virus B were shown to be genetically linked to H9N2 viruses.
Our results suggested that these novel H7N9 viruses in China are a triple reassortant from avian-origin H7N9 viruses, H9N2 viruses, and H7N3 viruses, which are consistent with other reports [18]. However, the detailed evolutionary process for these H7N9 viruses may be much more complicated, and we will further analyze the details, including the reassortment temporal order and each potential scenario, in the future. A possible direction is to reconstruct the ancestral sequences at the internal nodes by minimizing the number of substitutions. Since there are many methods for ancestral states reconstruction along a tree topology, e.g. [24], we need only reconstruct ancestral states at reticulate blocks, which could be resolved by considering all underlying tree structures and applying parsimony or maximum likelihood criterion at the block.
We have shown that the proposed quartet based methods achieved good sensitivity and specificity on simulated data. However, the comparison of reconstruction methods on real data is hard since the real evolutionary history is unknown. Thus, we only listed the number of full splits reconstructed by QuartetS, QuartetA, Quar-tetM, Quartet-Net, Neighbor-Net, and Neighbor-Joining in Table 3. It is interesting that sometimes quartet-based methods reconstructed even fewer splits than Neighbor-Joining, indicating that the full resolution of taxa is not achieved. This scenario also happens for other reconstruction methods, e.g. Split-Decomposition [16]. Generally, QuartetS and QuartetM reconstruct very few number of full splits, while QuartetA reconstructs moderate number among the comparison methods.

Conclusions
We have introduced and implemented three quartetbased methods QuartetS, QuartetA, and QuartetM in QuartetSuite to infer phylogenetic networks from multiple sequence alignment. As can be seen from both simulation data in which the evolutionary histories are known and two real data sets, these three methods can accurately reconstruct phylogenetic scenarios from branching trees to complicated reticulation events. In addition, QuartetS and QuartetM are also good at estimating evolutionary distances between ancestral taxa and current taxa. A comparison study shows that QuartetA is useful in reconstructing tree-like phylogenies, while QuartetS performs well in reconstructing phylogenies with a lot of reticulation events. Our methods have the potential to help untangle the complicated mechanisms underlying evolution. http://www.biomedcentral.com/1752-0509/8/21

Splits and their weights
A split on a taxa set X is a bipartition of X into 2 nonempty disjoint subsets (or blocks). The split with two blocks A and B is denoted by A|B. If A and B contain all taxa in X, then A|B is called a full split; otherwise, it is called a partial split. A split A|B is called trivial if |A| = 1 or |B| = 1. For example, the phylogenetic tree in Figure 1 contains six trivial full splits e.g. a|bcdef , three non-trivial full splits ab|cdef , abcd|ef , and abef |cd and many partial splits, e.g. ab|ef . For any split A|B, we define w(A|B) to be the evolutionary distance between taxa set A and B. For example, w(abcd|ef ) = 3 in Figure 1. Specifically, if A = {a} and B = {b} contain only 1 taxon in each set, then w(a|b) is the distance between taxon a and b. A split of type a|bc is called a triplet and thus w(a|bc) is called a triplet weight. Similarly, a split of type ab|cd is called a quartet and w(ab|cd) is called a quartet weight. In general, a split A|B with |A| = m and |B| = n is called an m|n-split. For any four taxa a, b, c, and d, there are three different quartets denoted by ab|cd, ac|bd, and ad|bc, respectively, and thus there are overall 3 n 4 different quartets for a taxa set of size n.
We define a weighted split system to be a set of full splits together with their split weights. There is a correspondence between a split network and a weighted split system [10]. For example, a phylogenetic tree defines a natural weighted split system, where each edge in the tree defines a full split and the edge length defines the weight of that split. On the contrary, if all the edges and their lengths are provided, then the tree is fixed [25]. Thus, we can formulate the problem of reconstructing phylogenetic networks as calculating the weight of all full splits for a given taxa set. After the weighted split system is calculated, a software called splitstree [23] is used to visualize the network.

Calculating triplet and quartet weights
We provide a naive parsimony method to estimate the triplet and quartet weights from the multiple sequence alignment of a taxa set. For any quartet ab|cd, we first collect the sub-alignment consisting only of the sequences a, b, c, and d from the multiple sequence alignment. w(ab|cd) is defined as the proportion of sites such that a and b share a same character c 1 , and c and d share a character c 2 , but c 1 = c 2 . This could be considered as a generalization of uncorrected P distance for a pair of taxa. Similarly, one can define triplet weight. It is worth noting that there are a lot of methods to calculate quartet and triplet weight, for example likelihood methods, and the weights from all these methods can be directly applied to our algorithms.

Quartet based methods
Using the quartet and triplet weights calculated from the multiple sequence alignment, we compute the full split weights from three methods QuartetS, QuartetA, and QuartetM. We used QuartetA as an example and listed the general steps in the following, http://www.biomedcentral.com/1752-0509/8/21

Output: Weighted split system
Here, "Avg" means taking the average of all the values. The difference between QuartetS, QuartetA, and QuartetM lies in how the weights of large splits were calculated from those of small ones. In QuartetM, the "Avg" is replaced by maximum value in all possible scenarios, whereas in QuartetS it is the second minimum value.
The parameter c is a user defined threshold value for filtering random splits. We test the performance of c on two simulated data and the bacteria data. Specifically, we check the variation of sensitivity and specificity of Quar-tetS by letting c vary from 0 to 100 with a step of 5. The results are shown in Figure 8. As can be seen, QuartetS achieves perfect performances when c varies from 5 to 70 for tree, and when c varies from 5 to 10 for network. Since the simulated tree and network are two extreme cases, we believe that c = 10 performs well in removing some splits incurred by random mutations and sequencing errors in general. As for bacteria data, we plotted in Additional file 1: Figure S1 the variation of the number of full splits reconstructed by QuaretS and QuartetA with the increasing of c since the true evolutionary history is unknown. In Additional file 1: Figure S1, the numbers of http://www.biomedcentral.com/1752-0509/8/21 full splits reconstructed by both methods decrease with the increase of c. c being 5 to 10 seems to be able to reduce the complexity of the final network while keeping interpretable results, and all curves becomes flat when c is larger than 30. In addition, QuartetA is more affected by the choice of c than QuartetS on bacteria data. Theoretically, c is related to mutation rate and branch length, and a detailed analysis will be performed in the future.

Implementation
A direct implementation of these methods will lead to exponential algorithms. Similar to [10] and [16], we applied an alternative way that will improve them to polynomial algorithms in most cases. A split A|B is said to display another split A |B if either A ⊆ A and B ⊆ B, or A ⊆ B and B ⊆ A. For example, ab|cdef displays ab|cd.
The following lemma is proved in [10].

Lemma 1. If a split A|B displays another split A |B , then w(A|B) ≤ w(A |B ).
By this lemma, if a partial split receives weight 0, then all the splits displaying this split will be associated with weight 0. To make use of this property, we implemented quartet methods in the following way: Suppose there are n taxa and they are ordered by number 1, 2, 3, · · · , n. There are only three quartets 12|34, 13|24, and 14|23 for the first 4 taxa 1, 2, 3, 4. We stored the quartets together with their weights in an active set, say S. After that, iteratively we added i = 5, 6, · · · , n to the left and right blocks of the splits stored in S and calculated the weights of newly generated splits from those of splits already resolved. Noticing that the only splits that cannot be generated in this way were ki|1 · · · k −1 k +1 · · · i−1 for k = 1, · · · , i−1, we also calculated their weights and added them to S. At the end of each iteration, we removed the splits with weight 0 from S since they cannot be further extended to splits with positive weights. After the last iteration, only full non-trivial splits with nonzero weights were left in S. We also calculated the trivial splits and added to S. The process is illustrated in the following algorithm,