# Quartet-based methods to reconstruct phylogenetic networks

- Jialiang Yang
^{1, 2}, - Stefan Grünewald
^{3}, - Yifei Xu
^{1}and - Xiu-Feng Wan
^{1}Email author

**8**:21

https://doi.org/10.1186/1752-0509-8-21

© Yang et al.; licensee BioMed Central Ltd. 2014

**Received: **16 November 2013

**Accepted: **12 February 2014

**Published: **20 February 2014

## Abstract

### Background

Phylogenetic networks are employed to visualize evolutionary relationships among a group of nucleotide sequences, genes or species when reticulate events like hybridization, recombination, reassortant and horizontal gene transfer are believed to be involved. In comparison to traditional distance-based methods, quartet-based methods consider more information in the reconstruction process and thus have the potential to be more accurate.

### Results

We introduce QuartetSuite, which includes a set of new quartet-based methods, namely QuartetS, QuartetA, and QuartetM, to reconstruct phylogenetic networks from nucleotide sequences. We tested their performances and compared them with other popular methods on two simulated nucleotide sequence data sets: one generated from a tree topology and the other from a complicated evolutionary history containing three reticulate events. We further validated these methods to two real data sets: a bacterial data set consisting of seven concatenated genes of 36 bacterial species and an influenza data set related to recently emerging H7N9 low pathogenic avian influenza viruses in China.

### Conclusion

QuartetS, QuartetA, and QuartetM have the potential to accurately reconstruct evolutionary scenarios from simple branching trees to complicated networks containing many reticulate events. These methods could provide insights into the understanding of complicated biological evolutionary processes such as bacterial taxonomy and reassortant of influenza viruses.

## Keywords

## Background

In the natural history of life, recombination, reassortant, hybridization and horizontal gene transfer (HGT) represent four types of important reticulate evolutionary events. For example, recombination plays a significant role in driving human genome evolution [1]; reassortant occurs frequently in influenza viruses and has facilitated the generation of 1957 H2N2, 1968 H3N2, and 2009 H1N1 pandemic influenza viruses [2]; hybridization is crucial in the evolution of plants and fish [3]; and HGT is a significant evolutionary mechanism in shaping the diversification of bacteria genomes [4]. Although most evolutionary events can be modeled as tree-like relationships, these reticulate events can be expressed much more effectively by networks. Moreover, even if the history is not reticulate, parallel evolution, model heterogeneity, and sampling or inference errors may also cause the ambiguity in reconstructing a unique tree. In such cases, networks are efficient in visualizing a group of feasible trees.

Generally speaking, a phylogenetic network is a generalization of a phylogenetic tree, allowing non tree-like structures to represent conflicting signals or alternative evolutionary histories of a group of taxa. Under this umbrella concept, there are two fundamentally different types of phylogenetic networks according to the interpretation of reticulate blocks. An explicit network describes an explicit evolutionary scenario, e.g. hybridization network [3, 5], recombination network [6, 7] and HGT network [8, 9]. In contrast, an implicit network aims to capture incompatibilities in the data, e.g. split network [10, 11]. We focus on implicit network in this study.

In past decades, a number of methods have been proposed for reconstructing implicit networks either from pairwise distances, e.g. Split-Decomposition [10, 11], Median Network [12], and Neighbor-Net [13] or from weighted triplets and quartets, e.g. QNet [14], SuperQ [15], and QuartetNet [16]. Distance-based methods are usually computationally efficient but tend to be inaccurate since pairwise distances reflect only small information of two taxa, while quartet-based methods are more accurate since quartet contains the information of four taxa. However, simulation studies show that current quartet-based methods are suffered from identifying false-positive and false-negative splits of the taxa set, and the trivial split weights are usually systematically overestimated [16].

Consistency is an important criteria to evaluate the reconstruction performance of a reconstruction method. A reconstruction method is considered as consistent on a special set of trees or networks if the method reconstructs precisely every tree or network in the set provided that the input data are generated from it and that sufficient data are available. Theoretically, one can prove that Neighbor-Joining is consistent on compatible split systems (trees) [17]; Split-Decomposition is consistent on weakly compatible split systems [10, 11]; Neighbor-Net, QNet and SuperQ are consistent on circular split systems [13–15]; and Quartet-Net is consistent on 2-weakly compatible split systems [16]. Since 2-weakly compatible split system contains trees, circular split system, and weakly compatible split system as proper subsets, Neighbor-Joining, Neighbor-Net, Split-Decomposition, QNet, and superQ tend to have more false-negatives because these methods restrict the splits they reconstruct to be compatible, weakly compatible, and circular. Any split does not fit the criterion will be removed even if they are true split. Quartet-Net also have the potential to generate false negatives since it takes minimum in calculating the weight of a split whenever there are many possible scenarios [16]. A looser criterion like second minimum, average, or maximum should be able to keep more true splits. In addition, all methods tend to generate some false-positive splits when there are some random mutations or sequence errors, and thus a filtering strategy should be applied to remove the false positives while keeping the final results interpretable.

In this paper, we present three quartet-based methods QuartetS, QuartetA, and QuartetM in QuartetSuite to reconstruct split networks from a collection of weighted triplets and quartets. These methods first calculate triplet and quartet weights directly from multiple sequence alignments (MSAs) by a parsimony method and then functions by iteratively decomposing all triplet and quartet weights into simple components based on full splits. The three methods QuartetS, QuartetA, and QuartetM are designed in the same manner with slight differences. Specifically, QuartetM is a maximum method, in which we take the maximum whenever there are several possible weights of a split. Similarly, QuartetA is an average method and QuartetS is a minimum method. Analyses on simulation data and real data show that these methods are capable of reconstructing accurate phylogenies from branching trees to complicated scenarios containing many reticulate events. In addition, the methods are effective in inferring phylogenetic distances.

## Results and discussion

The proposed quartet-based methods were validated through application into two artificial data and two real data sets. The first artificial data set was simulated from a simple tree phylogeny, whereas the second one was from a complicated phylogenetic scenario containing three reticulate events. The purpose was to show that the quartet-based methods are competent in accurately reconstructing a wide range of phylogenetic networks, from branching trees to very complicated reticulate phylogenies. We chose a bacterial data set consisting of seven concatenated genes of 36 bacterial species whose evolutionary history is generally believed to contain very few reticulate events. We also chose an influenza data set containing 22 selected influenza A viruses related to the evolutionary pathways for the recently emerging H7N9 low pathogenic avian influenza virus. Since the confirmation of the first H7N9 case on March 27, 2013 [18], H7N9 has caused more than 136 human cases in China (http://www.who.int/influenza/human_animal_interface/influenza_h7n9/Data_Reports/en/). The readers are referred to the online website http://sysbio.cvm.msstate.edu/QuartetMethods/ for the nucleotide sequences and nexus files used and generated in this study.

### Simulated data

#### Tree analysis

*a*

*b*|

*c*

*e*

*d*

*f*is listed as

*ab*. We normalized each split with the weight of split

*ab*since it was successfully reconstructed by all of these six methods. All methods successfully reconstructed the true splits in all 100 runs. However, Neighbor-Net and Quartet-Net reconstructed a few false-positive splits. For example, Neighbor-Net reconstructed 10 additional splits with bootstrap values varying from 15 to 40. These false-positive splits might be due to some random mutations in the data. In addition, except for QuartetA and QuartetM, all other methods systematically over-estimate the weights of trivial splits and the weights from QuartetA is closer to the true weights with a root mean square error of 0.016. Thus, QuartetA might be good for reconstructing tree-like phylogenies.

**Comparison of true and reconstructed splits and weights on the artificial tree by 6 methods**

True Phylo | QuartetS | QuartetA | QuartetM | Quartet-Net | Neighbor-Net | Neighbor-Joining | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Split | Wei | Split | Wei | BV | Split | Wei | BV | Split | Wei | BV | Split | Wei | BV | Split | Wei | BV | Split | Wei | BV |

ab | 1 | ab | 1 | 100 | ab | 1 | 100 | ab | 1 | 100 | ab | 1 | 100 | ab | 1 | 100 | ab | 1 | 100 |

abcd | 3 | abcd | 3.03 | 100 | abcd | 3 | 100 | abcd | 2.98 | 100 | abcd | 3.03 | 100 | abcd | 2.98 | 100 | abcd | 2.98 | 100 |

abef | 1 | abef | 1.03 | 100 | abef | 1.03 | 100 | abef | 1.03 | 100 | abef | 1.03 | 100 | abef | 1.02 | 100 | abef | 1.02 | 100 |

a | 1 | a | 1.03 | 100 | a | 0.99 | 100 | a | 0.98 | 100 | a | 1.03 | 100 | a | 1.04 | 100 | a | 1.04 | 100 |

b | 1 | b | 1.04 | 100 | b | 1 | 100 | b | 0.99 | 100 | b | 1.04 | 100 | b | 1.05 | 100 | b | 1.05 | 100 |

c | 1 | c | 1.03 | 100 | c | 0.99 | 100 | c | 0.98 | 100 | c | 1.03 | 100 | c | 1.04 | 100 | c | 1.04 | 100 |

d | 1 | d | 1.03 | 100 | d | 0.99 | 100 | d | 0.98 | 100 | d | 1.03 | 100 | d | 1.04 | 100 | d | 1.04 | 100 |

e | 1 | e | 1.04 | 100 | e | 1.01 | 100 | e | 1.00 | 100 | e | 1.04 | 100 | e | 1.04 | 100 | e | 1.05 | 100 |

f | 1 | f | 1.06 | 100 | f | 1.03 | 100 | f | 1.02 | 100 | f | 1.06 | 100 | f | 1.06 | 100 | f | 1.06 | 100 |

abc | 0.01 | 15 | abc | 0.02 | 31 | ||||||||||||||

abd | 0.02 | 15 | abd | 0.02 | 40 | ||||||||||||||

acef | 0.01 | 15 | acef | 0.01 | 21 | ||||||||||||||

abf | 0.02 | 34 | |||||||||||||||||

acd | 0.02 | 33 | |||||||||||||||||

aef | 0.02 | 36 | |||||||||||||||||

abe | 0.02 | 35 | |||||||||||||||||

ad | 0.01 | 16 | |||||||||||||||||

ae | 0.01 | 15 | |||||||||||||||||

adef | 0.01 | 19 |

#### Analysis on a phylogenetic network with three reticulate events

*abc*and then multiple it by four for convenience. As can be seen from the table, QuartetS, QuartetA, QuartetM, and Quartet-Net accurately reconstruct all the true splits in all 100 runs, while Neighbor-Net and Neighbor-Joining fail to reconstruct a lot of true splits. For example, Neighbor-Net fails to reconstruct the split

*acefg*in more than 90 out of 100 runs and Neighbor-Joining fails to reconstruct splits

*abd*,

*abce*,

*abcg*,

*aefg*,

*abceg*, and

*acefg*in all 100 runs. This occurs because that Neighbor-Joining only reconstruct trees and Neighbor-Net also reduces the splits to make the split system planar. As a result, the reconstructed trees and networks are much simpler than the true phylogeny, distorting the originally complicated evolutionary histories. In addition, Neighbor-Net and Quartet-Net also reconstruct a few false-positive splits with low weights. It is worth noting that QuartetS reconstructs the closest non-trivial split weights with a root mean square error of 0.054 and QuartetA reconstructs the most accurate trivial split weights with a root mean square error of 0.124. Thus, QuartetS and QuartetA could be useful in reconstructing phylogenetic networks with a lot of reticulate events.

**Comparison of true and reconstructed splits and weights on the artificial network by 6 methods**

True Phylo | QuartetS | QuartetA | QuartetM | Quartet-Net | Neighbor-Net | Neighbor-Joining | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Split | Wei | Split | Wei | BV | Split | Wei | BV | Split | Wei | BV | Split | Wei | BV | Split | Wei | BV | Split | Wei | BV |

ab | 1 | ab | 1.08 | 100 | ab | 1.09 | 100 | ab | 1.12 | 100 | ab | 1.08 | 100 | ab | 0.76 | 80 | ab | 0.41 | 56 |

abc | 4 | abc | 4 | 100 | abc | 4 | 100 | abc | 4 | 100 | abc | 4 | 100 | abc | 4 | 100 | abc | 4 | 100 |

abd | 1 | abd | 1.03 | 100 | abd | 1.03 | 100 | abd | 1.03 | 100 | abd | 1.02 | 100 | abd | 1.32 | 50 | |||

abcd | 4 | abcd | 3.99 | 100 | abcd | 3.99 | 100 | abcd | 3.99 | 100 | abcd | 3.99 | 100 | abcd | 4.17 | 100 | abcd | 3.97 | 100 |

abce | 1 | abce | 1.01 | 100 | abce | 1.02 | 100 | abce | 1.01 | 100 | abce | 1.01 | 100 | abce | 0.86 | 46 | |||

abcg | 1 | abcg | 1.03 | 100 | abcg | 1.03 | 100 | abcg | 1.03 | 100 | abcg | 1.02 | 100 | abcg | 0.89 | 54 | |||

aefg | 1 | aefg | 1.02 | 100 | aefg | 1.03 | 100 | aefg | 1.02 | 100 | aefg | 1.02 | 100 | aefg | 1.39 | 50 | |||

abcde | 1 | abcde | 1.08 | 100 | abcde | 1.09 | 100 | abcde | 1.12 | 100 | abcde | 1.07 | 100 | abcde | 0.76 | 67 | abcde | 0.39 | 46 |

abcdg | 1 | abcdg | 1.07 | 100 | abcdg | 1.08 | 100 | abcdg | 1.11 | 100 | abcdg | 1.07 | 100 | abcdg | 0.71 | 68 | abcdg | 0.39 | 54 |

abceg | 2 | abceg | 2.07 | 100 | abceg | 2.09 | 100 | abceg | 2.12 | 100 | abceg | 2.07 | 100 | abceg | 2.36 | 47 | |||

acefg | 2 | acefg | 2.06 | 100 | acefg | 2.07 | 100 | acefg | 2.11 | 100 | acefg | 2.06 | 100 | acefg | 2.23 | 8 | |||

adefg | 1 | adefg | 1.08 | 100 | adefg | 1.09 | 100 | adefg | 1.12 | 100 | adefg | 1.07 | 100 | adefg | 0.76 | 73 | adefg | 0.43 | 44 |

a | 10 | a | 10.31 | 100 | a | 10.15 | 100 | a | 10.17 | 100 | a | 10.43 | 100 | a | 10.07 | 100 | a | 9.42 | 100 |

b | 6 | b | 6.25 | 100 | b | 6.11 | 100 | b | 6.07 | 100 | b | 6.33 | 100 | b | 7.62 | 100 | b | 7.15 | 100 |

c | 10 | c | 10.26 | 100 | c | 10.11 | 100 | c | 10.12 | 100 | c | 10.38 | 100 | c | 10.05 | 100 | c | 9.40 | 100 |

d | 4 | d | 4.21 | 100 | d | 4.06 | 100 | d | 4.08 | 100 | d | 4.26 | 100 | d | 6.02 | 100 | d | 6.82 | 100 |

e | 10 | e | 10.28 | 100 | e | 10.14 | 100 | e | 10.16 | 100 | e | 10.41 | 100 | e | 10.12 | 100 | e | 9.43 | 100 |

f | 6 | f | 6.25 | 100 | f | 6.11 | 100 | f | 6.06 | 100 | f | 6.34 | 100 | f | 7.22 | 100 | f | 7.11 | 100 |

g | 10 | g | 10.31 | 100 | g | 10.16 | 100 | g | 10.18 | 100 | g | 10.44 | 100 | g | 10.14 | 100 | g | 9.47 | 100 |

ac | 0.08 | 99 | abdfg | 0.04 | 15 | ||||||||||||||

abcdf | 0.08 | 97 | ag | 0.04 | 13 | ||||||||||||||

acdeg | 0.02 | 74 | ae | 0.03 | 12 | ||||||||||||||

acdef | 0.03 | 66 | abdef | 0.04 | 10 | ||||||||||||||

ae | 0.04 | 65 | |||||||||||||||||

abdeg | 0.03 | 65 | |||||||||||||||||

ag | 0.04 | 64 | |||||||||||||||||

abdfg | 0.04 | 64 | |||||||||||||||||

acdfg | 0.02 | 62 | |||||||||||||||||

abdef | 0.04 | 59 | |||||||||||||||||

af | 0.03 | 58 |

### Real data

To remain concise, we only showed the phylogenetic networks constructed by QuartetS as it performs well in reconstructing non-trivial splits in the artificial data above.

#### Analysis on bacteria data

The bacteria sequence data set consists of concatenated sequences of seven important genes (16S rRNA, 23S rRNA, gyrB, phyH, recA, rpoA, and rpoD) from 36 bacteria species, with lengths around 9200 ∼ 12700 base pairs. Each sequence falls into three groups (GC-poor, GC-median, GC-rich) according to their percentage levels (≈30*%*, ≈50*%* and ≈60*%*) of GC content. There are 14 GC-poor, 11 GC-median, and 11 GC-rich bacteria respectively. The readers are referred to [21] for the detailed sequence information of concatenated as well as each gene of the species.

We use Clustal-W [22] to align 11 GC-rich sequences, 25 GC-poor and GC-rich sequences, and all 36 sequences, respectively. The obtained multiple alignments are taken as inputs to QuartetS, QuartetA, QuartetM, Quartet-Net [16], Neighbor-Joining [20], and Neighbor-Net [13]. We ran the program in a Dell desktop with 2.93G HZ processor and 4 GB memory. In practice, Neighbor-joining is the fastest, and the time for QuartetS, QuartetA, QuartetM, and Quartet-Net vary from seconds to around 2 minutes for different MSA sequences. Then, the reconstructed weighted split systems are viewed by SplitsTree [23]. Due to page limitations, we only show the three split networks reconstructed by QuartetS.

#### Analysis on flu data

Our results suggested that these novel H7N9 viruses in China are a triple reassortant from avian-origin H7N9 viruses, H9N2 viruses, and H7N3 viruses, which are consistent with other reports [18]. However, the detailed evolutionary process for these H7N9 viruses may be much more complicated, and we will further analyze the details, including the reassortment temporal order and each potential scenario, in the future. A possible direction is to reconstruct the ancestral sequences at the internal nodes by minimizing the number of substitutions. Since there are many methods for ancestral states reconstruction along a tree topology, e.g. [24], we need only reconstruct ancestral states at reticulate blocks, which could be resolved by considering all underlying tree structures and applying parsimony or maximum likelihood criterion at the block.

**Comparison of the numbers of full splits reconstructed by six methods**

Methods | GC-Poor | GC-Poor and rich | All bacteria | Influenza |
---|---|---|---|---|

QuartetS | 20 | 43 | 56 | 38 |

QuartetA | 22 | 59 | 120 | 42 |

QuartetM | 21 | 46 | 58 | 40 |

Quartet-Net | 22 | 45 | 60 | 49 |

Neighbor-Net | 29 | 77 | 114 | 68 |

Neighbor-Joining | 19 | 47 | 69 | 41 |

## Conclusions

We have introduced and implemented three quartet-based methods QuartetS, QuartetA, and QuartetM in QuartetSuite to infer phylogenetic networks from multiple sequence alignment. As can be seen from both simulation data in which the evolutionary histories are known and two real data sets, these three methods can accurately reconstruct phylogenetic scenarios from branching trees to complicated reticulation events. In addition, QuartetS and QuartetM are also good at estimating evolutionary distances between ancestral taxa and current taxa. A comparison study shows that QuartetA is useful in reconstructing tree-like phylogenies, while QuartetS performs well in reconstructing phylogenies with a lot of reticulation events. Our methods have the potential to help untangle the complicated mechanisms underlying evolution.

## Methods

### Splits and their weights

A *split* on a taxa set *X* is a bipartition of *X* into 2 non-empty disjoint subsets (or blocks). The split with two blocks *A* and *B* is denoted by *A*|*B*. If *A* and *B* contain all taxa in *X*, then *A*|*B* is called a *full split*; otherwise, it is called a *partial split*. A split *A*|*B* is called *trivial* if |*A*|=1 or |*B*|=1. For example, the phylogenetic tree in Figure 1 contains six trivial full splits e.g. *a*|*b* *c* *d* *e* *f*, three non-trivial full splits *a* *b*|*c* *d* *e* *f*, *a* *b* *c* *d*|*e* *f*, and *a* *b* *e* *f*|*c* *d* and many partial splits, e.g. *a* *b*|*e* *f*. For any split *A*|*B*, we define *w*(*A*|*B*) to be the evolutionary distance between taxa set *A* and *B*. For example, *w*(*a* *b* *c* *d*|*e* *f*)=3 in Figure 1. Specifically, if *A*={*a*} and *B*={*b*} contain only 1 taxon in each set, then *w*(*a*|*b*) is the distance between taxon *a* and *b*. A split of type *a*|*b* *c* is called a *triplet* and thus *w*(*a*|*b* *c*) is called a triplet weight. Similarly, a split of type *a* *b*|*c* *d* is called a *quartet* and *w*(*a* *b*|*c* *d*) is called a quartet weight. In general, a split *A*|*B* with |*A*|=*m* and |*B*|=*n* is called an *m*|*n*-split. For any four taxa *a*, *b*, *c*, and *d*, there are three different quartets denoted by *a* *b*|*c* *d*, *a* *c*|*b* *d*, and *a* *d*|*b* *c*, respectively, and thus there are overall $3\left(\genfrac{}{}{0.0pt}{}{n}{4}\right)$ different quartets for a taxa set of size *n*.

We define a *weighted split system* to be a set of full splits together with their split weights. There is a correspondence between a split network and a weighted split system [10]. For example, a phylogenetic tree defines a natural weighted split system, where each edge in the tree defines a full split and the edge length defines the weight of that split. On the contrary, if all the edges and their lengths are provided, then the tree is fixed [25]. Thus, we can formulate the problem of reconstructing phylogenetic networks as calculating the weight of all full splits for a given taxa set. After the weighted split system is calculated, a software called splitstree [23] is used to visualize the network.

### Calculating triplet and quartet weights

We provide a naive parsimony method to estimate the triplet and quartet weights from the multiple sequence alignment of a taxa set. For any quartet *a* *b*|*c* *d*, we first collect the sub-alignment consisting only of the sequences *a*, *b*, *c*, and *d* from the multiple sequence alignment. *w*(*a* *b*|*c* *d*) is defined as the proportion of sites such that *a* and *b* share a same character *c*
_{1}, and *c* and *d* share a character *c*
_{2}, but *c*
_{1}≠*c*
_{2}. This could be considered as a generalization of uncorrected P distance for a pair of taxa. Similarly, one can define triplet weight. It is worth noting that there are a lot of methods to calculate quartet and triplet weight, for example likelihood methods, and the weights from all these methods can be directly applied to our algorithms.

### Quartet based methods

Using the quartet and triplet weights calculated from the multiple sequence alignment, we compute the full split weights from three methods QuartetS, QuartetA, and QuartetM. We used QuartetA as an example and listed the general steps in the following,

#### QuartetA

Here, “Avg” means taking the average of all the values. The difference between QuartetS, QuartetA, and QuartetM lies in how the weights of large splits were calculated from those of small ones. In QuartetM, the “Avg” is replaced by maximum value in all possible scenarios, whereas in QuartetS it is the second minimum value.

*c*is a user defined threshold value for filtering random splits. We test the performance of

*c*on two simulated data and the bacteria data. Specifically, we check the variation of sensitivity and specificity of QuartetS by letting

*c*vary from 0 to 100 with a step of 5. The results are shown in Figure 8. As can be seen, QuartetS achieves perfect performances when

*c*varies from 5 to 70 for tree, and when

*c*varies from 5 to 10 for network. Since the simulated tree and network are two extreme cases, we believe that

*c*=10 performs well in removing some splits incurred by random mutations and sequencing errors in general. As for bacteria data, we plotted in Additional file 1: Figure S1 the variation of the number of full splits reconstructed by QuaretS and QuartetA with the increasing of

*c*since the true evolutionary history is unknown. In Additional file 1: Figure S1, the numbers of full splits reconstructed by both methods decrease with the increase of

*c*.

*c*being 5 to 10 seems to be able to reduce the complexity of the final network while keeping interpretable results, and all curves becomes flat when

*c*is larger than 30. In addition, QuartetA is more affected by the choice of

*c*than QuartetS on bacteria data. Theoretically,

*c*is related to mutation rate and branch length, and a detailed analysis will be performed in the future.

### Sensitivity and specificity

*n*taxa, there are overall 2

^{ n−1}−1 splits. The splits in the true phylogeny are

*condition positives*, and the other splits are

*condition negatives*. The splits reconstructed by a reconstruction method are

*test outcome positives*, and the other splits are

*test outcome negative*. By definition,

*true positive*splits are the intersection of condition positives and test outcome positive and

*true negative*splits are the intersection of condition negatives and test outcome negative. The

*precision*of a reconstruction method is defined as

*specificity*is defined as

#### Implementation

A direct implementation of these methods will lead to exponential algorithms. Similar to [10] and [16], we applied an alternative way that will improve them to polynomial algorithms in most cases. A split *A*|*B* is said to *display* another split *A*
^{′}|*B*
^{′} if either *A*
^{′}⊆*A* and *B*
^{′}⊆*B*, or *A*
^{′}⊆*B* and *B*
^{′}⊆*A*. For example, *a* *b*|*c* *d* *e* *f* displays *a* *b*|*c* *d*. The following lemma is proved in [10].

#####
**Lemma**
**1**

If a split *A*|*B* displays another split *A*
^{′}|*B*
^{′}, then *w*(*A*|*B*)≤*w*(*A*
^{′}|*B*
^{′}).

By this lemma, if a partial split receives weight 0, then all the splits displaying this split will be associated with weight 0. To make use of this property, we implemented quartet methods in the following way: Suppose there are *n* taxa and they are ordered by number 1,2,3,⋯,*n*. There are only three quartets 12|34, 13|24, and 14|23 for the first 4 taxa 1,2,3,4. We stored the quartets together with their weights in an active set, say *S*. After that, iteratively we added *i*=5,6,⋯,*n* to the left and right blocks of the splits stored in *S* and calculated the weights of newly generated splits from those of splits already resolved. Noticing that the only splits that cannot be generated in this way were *k* *i*|1⋯*k*−1 *k*+1⋯*i*−1 for *k*=1,⋯,*i*−1, we also calculated their weights and added them to *S*. At the end of each iteration, we removed the splits with weight 0 from *S* since they cannot be further extended to splits with positive weights. After the last iteration, only full non-trivial splits with nonzero weights were left in *S*. We also calculated the trivial splits and added to *S*. The process is illustrated in the following algorithm,

##### An alternative way to implementquartet methods

It could be proven that for some special split systems the number of non-zero splits in each iteration are bounded [10, 16]. Thus, the time complexity of quartet methods is polynomial for these special split systems like the split system from trees. However, it is beyond the scope of this study. We implemented the algorithm in C++ and the codes are downloadable for free from http://sysbio.cvm.msstate.edu/QuartetMethods/.

## Declarations

### Acknowledgements

This work was partially supported by the Natural Science Foundation of China (No. 10971213) to SG, and the Department of Justice (2010-DD-BX-0596) and National Institutes of Health (NIAID RC1AI086830) to XW.

## Authors’ Affiliations

## References

- Meunier J, Duret L:Recombination drives the evolution of gc-content in the human genome. Mol Biol Evol. 2004, 21 (6): 984-990. 10.1093/molbev/msh070.View ArticlePubMedGoogle Scholar
- Nelson M, Vibound C, Simonsen L, Bennett R, Griesemer S, George K, Taylor J, Spiro D, Sengamalay NA, Ghedin E, Taubenberger J, Holmes E:Multiple reassortment events in the evolutionary history of h1n1 influenza a virus since 1918. PLOS Pathog. 2008, 4 (2): 1000012-10.1371/journal.ppat.1000012.View ArticleGoogle Scholar
- Linder R, Rieseberg L:Reconstructing patterns of reticulate evolution in plants. Am J Bot. 2004, 91: 1700-1708. 10.3732/ajb.91.10.1700.PubMed CentralView ArticleGoogle Scholar
- Doolittle W:How big is the iceberg of which organellar genes in nuclear genomes are but the tip?. Phil Trans R Soc Lond B Biol Sci. 2003, 358: 39-57. 10.1098/rstb.2002.1185.View ArticleGoogle Scholar
- Yu Y, Than C, Degnan J, Nakhleh L:Coalescent histories on phylogenetic networks and detection of hybridization despite lineage sorting. Syst Biol. 2011, 60 (2): 138-149. 10.1093/sysbio/syq084.PubMed CentralView ArticlePubMedGoogle Scholar
- Gusfield D, Eddhu S, Langley C:Optimal, efficient reconstruction of phylogenetic networks with constrained recombination. J Bioinform Comput Biol. 2004, 2: 173-213. 10.1142/S0219720004000521.View ArticlePubMedGoogle Scholar
- Huson D, Kloepper T:Computing recombination networks from binary sequences. Bioinformatics. 2005, 21 (Suppl 2): 159-165.View ArticleGoogle Scholar
- Kunin V, Goldovsky L, Darzentas N, Ouzounis C:The net of life: reconstructing the microbial phylogenetic network. Genome Res. 2005, 15 (7): 954-959. 10.1101/gr.3666505.PubMed CentralView ArticlePubMedGoogle Scholar
- Jin G, Nakhleh L, Snir S, Tuller T:Efficient parsimony-based methods for phylogenetic network reconstruction. Bioinformatics. 2006, 23: 123-128.View ArticleGoogle Scholar
- Bandelt H, Dress A:A canonical decomposition theory for metrics on a finite set. Adv Math. 1992, 92: 47-105. 10.1016/0001-8708(92)90061-O.View ArticleGoogle Scholar
- Bandelt H, Dress A:Split decomposition: a new and useful approach to phylogenetic analysis of distance data. Mol Phylogenet Evol. 1992, 1: 242-252. 10.1016/1055-7903(92)90021-8.View ArticlePubMedGoogle Scholar
- Bandelt H, Forster P Röhl:Median-joining networks for inferring intraspecific phylogenies. Mol Biol Evol. 1990, 16: 37-48.View ArticleGoogle Scholar
- Bryant D, Moulton V:Neighbor-net: an agglomerative method for the construction of phylogenetic networks. Mol Biol Evol. 2004, 21: 255-265.View ArticlePubMedGoogle Scholar
- Grünewald S, Forslund K, Dress A, Moulton V:Qnet: an agglomerative method for the construction of phylogenetic networks from weighted quartets. Mol Biol Evol. 2006, 24: 532-538. 10.1093/molbev/msl180.View ArticlePubMedGoogle Scholar
- Grünewald S, Spillner A, Bastkowski S, Bogershausen A, Moulton V:Superq: Computing supernetworks from quartets. IEEE/ACM Trans Comput Biol Bioinform. 2013, in press,Google Scholar
- Yang J, Grünewald S, Wan X:Quartet-net: a quartet-based method to reconstruct phylogenetic networks. Mol Biol Evol. 2013, in press,Google Scholar
- Gascuel O, Steel M:Neighbor-joining revealed. Mol Biol Evol. 2006, 23 (11): 1997-2000. 10.1093/molbev/msl072.View ArticlePubMedGoogle Scholar
- Gao R, Cao B, Shu Y:Human infection with a novel avian-origin influenza a (h7n9) virus. New England J Med. 2013, 368: 1888-1897. 10.1056/NEJMoa1304459.View ArticleGoogle Scholar
- Cartwright R:Dna assembly with gaps (dawg): simulating sequence evolution. Bioinformatics. 2005, 21: 31-38. 10.1093/bioinformatics/bth471. 3 suppl,View ArticleGoogle Scholar
- Saitou N, Nei M:The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4: 406-425.PubMedGoogle Scholar
- Takahashi M, Kryukov K, Saitou N:Estimation of bacterial species phylogeny through oligonucleotide frequency distances. Genomics. 2009, 93: 525-533. 10.1016/j.ygeno.2009.01.009.View ArticlePubMedGoogle Scholar
- Larkin M, Blackshields B, Brown N, Chenna R, McGettigan P, McWilliam H, Valentin F, Wallace I, Wilm A, Lopez R, Thompson J, Gibson T, Higgins D:Clustal w and clustal x version 2.0. Bioinformatics. 2007, 23 (21): 2947-2948. 10.1093/bioinformatics/btm404.View ArticlePubMedGoogle Scholar
- Huson D, Bryant D:Application of phylogenetic networks in evolutionary studies. Mol Biol Evol. 2006, 23: 254-267.View ArticlePubMedGoogle Scholar
- Yang Z:Paml 4 a program package for phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007, 24: 1586-1591. 10.1093/molbev/msm088.View ArticlePubMedGoogle Scholar
- Buneman P:The recovery of trees from measures of dissimilarity.Mathematics in the Archaeological and Historical Sciences. Edited by: Hodson FR, Kendall DG, Tautu P. 1971, Edinburgh University Press: Edinburgh, 387-395.Google Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedicationwaiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.