Polytomy identification in microbial phylogenetic reconstruction

Background A phylogenetic tree, showing ancestral relations among organisms, is commonly represented as a rooted tree with sets of bifurcating branches (dichotomies) for simplicity, although polytomies (multifurcating branches) may reflect more accurate evolutionary relationships. To represent the true evolutionary relationships, it is important to systematically identify the polytomies from a bifurcating tree and generate a taxonomy-compatible multifurcating tree. For this purpose we propose a novel approach, "PolyPhy", which would classify a set of bifurcating branches of a phylogenetic tree into a set of branches with dichotomies and polytomies by considering genome distances among genomes and tree topological properties. Results PolyPhy employs a machine learning technique, BLR (Bayesian logistic regression) classifier, to identify possible bifurcating subtrees as polytomies from the trees resulted from ComPhy. Other than considering genome-scale distances between all pairs of species, PolyPhy also takes into account different properties of tree topology between dichotomy and polytomy, such as long-branch retraction and short-branch contraction, and quantifies these properties into comparable rates among different sub-branches. We extract three tree topological features, 'LR' (Leaf rate), 'IntraR' (Intra-subset branch rate) and 'InterR' (Inter-subset branch rate), all of which are calculated from bifurcating tree branch sets for classification. We have achieved F-measure (balanced measure between precision and recall) of 81% with about 0.9 area under the curve (AUC) of ROC. Conclusions PolyPhy is a fast and robust method to identify polytomies from phylogenetic trees based on genome-wide inference of evolutionary relationships among genomes. The software package and test data can be downloaded from http://digbio.missouri.edu/ComPhy/phyloTreeBiNonBi-1.0.zip.


Background
Evolutionary histories (or phylogenies) are an integral part of many studies in modern biology. Phylogenies have long been used to study the relationships among species. Most phylogenetic trees are in the form of a rooted tree where the leaves represent the species, and internal nodes represent their hypothetical ancestors. Consequently, the phylogenetic tree becomes a branching diagram showing the inferred evolutionary relationships among various biological species based on similarities and differences in their physical and/or genetic characteristics. The two most commonly seen topological sub-structures of trees are dichotomy and polytomy. Although resolving phylogenetic trees into dichotomous branching patterns has been a general goal in phylogenetics [1], the enforced resulting trees might be stretched beyond the necessary evolutionary assumptions.
Polytomies are multifurcating (as opposed to bifurcating) relationships in phylogenetic hypotheses and occur for two reasons: First, polytomies can result from poor resolution of true bifurcating relationships (due to lack of sufficient data or inappropriate analysis of characters), and these are "soft" polytomies; second, polytomies can represent bona fide multifurcations (multiple, simultaneous divergence events), and these are "hard" polytomies [2]. For example, simultaneous divergences of three or more lineages can occur due to the isolation of subpopulations within a widespread species by sudden meteorological or geological events resulting in reproductive isolation due to the rapid expansion of the population into open territory. Traditionally, researchers assumed that hard polytomies were exceptions and rare in nature and treated them as expression of ignorance [3,4]. With this assumption, researchers traditionally sought to resolve polytomies in binary trees, often by increasing the number of characters analyzed. However, more recent evidence shows the existence of simultaneous speciation events such as with the Drosophila simulans species complex [5,6] shown in Figure 1 and African cichlid fishes which speciate quickly after their home lakes formed in Africa resulting in several phylogenetic polytomies [7].
Microbial taxonomy imposes many difficulties in determining whether generated phylogenetic tree branches are bifurcations or multifurcations. The major problem is that microbes, although morphologically diverse, may have fewer morphological characters than macrobes. Consequently, the rate of change in morphological characters may be greater in macrobes than in microbes [8]. Thus, microbes might not have enough diverse morphologies to do deep taxonomy accurately beyond the level of phylum. As a result, grouping of microbes into many taxonomy trees shows up as multifurcating clads instead of resolved bifurcating trees. Also, bacteria evolve in a different manner from eukaryotes. Eukaryotes predominately evolve by point mutations (changes in single base-pair in DNA) whereas bacteria often evolve by inserting or deleting large chunks of DNA [9][10][11]. Thus, microbial phylogeny analysis of available molecular sequences, rRNA and protein, have difficulties to convincingly resolve in the many specific branching orders of the microbial divisions due to incongruence of sequence diversity. This suggests that the base of the microbial tree is best seen as a polytomy, an expansive radiation that has not been resolved with current data [12]. As a result, many sources of species trees, such as the NCBI Taxonomy Database, use polytomies for species trees. In fact, 64% of the branch points in the NCBI taxonomy Database [13] have three or more children. In addition, a multifurcated species tree is indeed valuable to infer gene duplication events and orthologs [14][15][16][17]. As with any approach that imposes structure on the data [18][19][20], bifurcations are the imposition of method, not necessarily the reality. Due to the nature of algorithmic pairwise comparison of many phylogeny methods, often phylogeny visualization tools display a tree as a bifurcating cladogram, which is a directed bifurcating tree with a unique node corresponding to the (usually imputed) most recent common ancestor of all the entities at the leaves of the tree. Thus, all tree-building methods will force a binary tree on the data without considering the possibility that resulting trees might stretch beyond the assumption. A justification may be that it is easier to work on a strictly binary set of nodes although there are possibilities of the polytomies existing in trees [21]. However, even if there is a real dichotomous structure in the data, unresolved nodes will often occur mostly at or near the terminal branches due to lack of information. In this case, it is better to use polytomies instead of forcing dichotomies artificially. As suggested, polytomies might be quite common in microbial taxonomy trees since many times evolutionary relationships of interested species cannot be fully resolved to separate descending branches or difference of timeframes between two divergences.
The most popular approach to obtaining polytomies is through forcing a threshold on bootstrap values of branches generated from a consensus tree, which is produced from a set of gene trees based on different selections of gene subsets. Then multiple bifurcating branches with low bootstrap values would be combined into multifurcating branches. This method has been employed in a number of popular phylogenetic tools, such as PAUP [22], Phylip [23] and Splitstree [22][23][24]. The result is often ad hoc and coarse grain without any significant statistical basis. Another approach for polytomy identification from bifurcating branches is through literature mining and manual curation of biology experts in order to match up with the general understanding of the taxonomy tree of life [5][6][7] which is laborious and Figure 1 Phylogenetic tree of Drosophila melanogaster species group. Redrawn from [5]. This phylogeny shows the relationships among some species in the Drosophila group in which there is a hard polytomy branch. The reason could be that the island species D. mauritiana and D. sechellia branched off from the mainland species D. simulans in a narrow timeframe, such that it is impossible to distinguish which species branched off first and which second. time consuming. Until now, there is still no systematic algorithm and method to automate the process of polytomy identification from a bifurcating tree. Therefore, it is essential to construct a systematic algorithm to identify polytomies from an existing bifurcating tree to facilitate the phylogenetic analysis.
With fast accumulations of molecular sequencing data, especially using whole-genome sequences from different organisms, inferring molecular evolutionary relationships using genomic data has been a popular phylogeny analysis approach in recent years. Often subsets of gene or protein families would be selected to infer the so-called "gene trees" for interested organisms to establish phylogenetic relationships. These analyses typically result in bifurcating phylogeny trees containing only dichotomies. Though gene trees are good sources for hypothesizing species trees, they introduce other problems, such as the incongruence of tree branches in the phylogenetic trees due to different selections of gene subsets. Hence, it is logical to introduce the polytomies instead of trying to resolve every branch when there is incongruence among tree branches of different trees from the same set of species.
In this study, we propose an innovative strategy, called 'PolyPhy' (phylogeny construction with polytomy identification) to identify polytomies from a phylogenetic tree. In PolyPhy, the process starts with phylogenetic tree construction for many species and then combines a feature extraction process with a Bayesian classification method to identify polytomies from a bifurcating phylogenetic tree. The PolyPhy method takes into account different tree topological properties between dichotomy and polytomy, such as long-branch retraction and shortbranch contraction, and quantifies these properties as comparable rates among different sub-branches. More specifically, PolyPhy utilizes ComPhy [25], a microbial tree reconstruction tool that we developed based on genome-structure features. After a phylogenetic tree is built, two sets of bifurcating branches, presumably dichotomies and polytomies, are extracted based on Bergeys' Taxonomy. Then three tree structure features, 'LR' (Leaf rate), 'IntraR' (Intra-subset branch rate) and 'InterR' (Inter-subset branch rate), are calculated from bifurcating tree branch sets based on distance matrices derived from ComPhy. The next step is BLR (Bayesian Logistic regression) classification, which is a recursive training process to select best classification parameters to fit the input features and generate a model for classification. The last step is identifying potential polytomies and checking for accuracies.

Results & discussion
Optimizing Jackknife ability In ComPhy, genome structures such as breakpoints of gene ordering are constructed to infer genome distance based on orthologs defined in a pair of genomes. Therefore, we treated individual orthologs as samples for jackknife procedure. We resampled the dataset by selecting 30% to 90% of the orthologs incrementally. Figure 2 shows the distribution of average accuracy for trees generated by different percentage selections of orthologs after 50 iterations. It is noticeable that the 60% cutoff worked well since the increase of accuracy beyond this point almost reached a plateau even when more data was included. Therefore, a 60% cutoff was used as the default for genome feature jackknife phylogeny analysis. With 60% jackknife resampling, we could generate unbiased tree replicates for the next step of classification training and testing. Figure 2 Accuracy vs. different ortholog percentage selection curves. The horizontal-axis is percentage (ranging from 30% to 90%) of orthologs being selected to produce the tree and y-axis is the accuracy. This figure shows two different accuracy curves of different ortholog percentage selections. The blue curve shows the accuracy of the tree generated by ComPhy in comparison to the taxonomy. And the red curve is the average bootstrap values from the tree branches for each generated tree.

Classification signal strength in three features
To efficiently utilize each of the features for the classifier described in "Material & Methods," it is important to know how strongly each feature can individually distinguish between dichotomy and polytomy. Figure 3 represents the distribution of each feature value in each class (dichotomy or polytomy) in the two datasets introduced in "Material & Methods." Three features of Part A in Figure  3 are generated from the smaller dataset consisting of 83 genomes; and genome distances were calculated based on traditional MSA (multiple sequence alignment) using 16s rRNA sequences. Part B of Figure 3 is generated from the larger dataset, which has 398 microbial species, and genomes distances were calculated based on composite distance from ComPhy. Figures 3A-1 and 3B-1 correspond to the leaf feature rate distribution, while 3A-2 and 3B-2 correspond to intra-subset branch rate with 3A-3 and 3B-3 from inter-subset branch rate. These figures clearly demonstrate the distinguishable divisions of feature rate distributions between dichotomy and polytomy. Furthermore, similar distribution patterns were observed for all three features between set A figures and set B figures even though these distributions were generated based on different genome distance calculations. This indicates that not only extracted features have strong discerning power, but also the performance is not significantly subject to different genome distance methods.

BLR parameter optimization
As in other multivariate statistical models, the performance of BLR in classification depends on the combination of several parameters. Here, BLR involves mostly two classes of parameters: 1) the prior variance parameter V and 2) the error tuning parameter t. V is a Figure 3 Classification strength of each of three tree topological features. Part A of the figures is generated using 83 microbial species based on MSA (multiple sequence alignment) on 16S rRNA sequences. A-1 shows leaf rate topological feature generated per dichotomies and polytomies, while A-2 and A-3 use IntraR and InterR features, respectively. Part B of the figures is generated using all 398 microbial species based on composite distance calculation from ConPhy. B-1 shows leaf rate topological feature generated per dichotomies and polytomies, while B-2 and B-3 use IntraR and InterR features, respectively. regulation parameter that controls the balance and tradeoff among individual feature distribution variance and their prior skews; V also tries to minimize the overall distribution variance. How to choose a good set of prior variance is not well understood and varies from dataset to dataset. Therefore, we chose variance (V) values based on commonly used variance range and applied the best fit. For V: V i+1 = 2*V i , where V 1 = 0.002 and i = 1,..., 20, i is the number of iterations. The error tuning t is another parameter to help confirm the right selection of other parameters by trying different accuracy measurements. Five criteria can be used for error tuning of t with 0 representing a no tuning threshold: 1) sum of error = (FP + FN); 2) balanced error rate BER = (sensitivity + specificity)/2; 3) T11U = 2*TP -FP [26]; 4) F1 measure [27]; 5) T13U = 20*TP -FP [28], where TP is true positive value, FP is false positive value and FN is false negative value.
Accordingly, two parameters, V and t, should be optimized. The parameter optimization was performed by using a grid search within a limited range. To minimize over-fitting of the prediction model, three-fold crossover validation was used to investigate the training set. To evaluate predicted accuracy, an F-measure was used [27] as the weighted harmonic mean of precision and recall; and it is defined as follows: During the BLR classification model training, each data point represents a 2-level bifurcation with three leaf nodes (a triplet). If the bifurcation triplet is in the gold standard dichotomy set from Bergey tree, one must assign class label 1; otherwise -1. Figure 4 shows the profile of predicting the accuracy of the three-fold cross-validation on the training set versus the variations of the parameters V and t. Obviously, the accuracy profile has a maximum values peak at (V, t) = (0.128, 3), indicating that the optimal values of V and t for constructing BLR models are 0.128 and 3, respectively.
Another parameter we provide in the PolyPhy algorithm is to let users choose types of priors for BLR classification. Two types of prior are commonly employed, Laplace (sparse data and less features) and Gaussian (approximate normal distribution). Users can try both prior types to achieve the best prediction result based on their data property.

Prediction ability
Using the optimal values of V and t, the polytomy classification model was constructed based on the training set by using the BLR learning algorithm provided by the BBR package [29]. To minimize data dependence on the prediction model, 10 tree replicates were prepared through jackknife procedure, five from each dataset (see "Material & Methods"). Each training set from data set 1 consisted of around 200 Bergey system confirmed bifurcating triplets, while each training set from data set 2 consisted of around 60 confirmed bifurcating triplets. All training sets were half confirmed dichotomies and half confirmed polytomies. The classification results are listed in Table 1. For all test models, the balanced accuracies were > 80.24%, and the AUC value was 83.31%. Training sets from data set 2 had higher accuracy than sets from data set 1 because of their less variance from the smaller genome set.
To further improve the classification accuracy, three topological features were added per different distance to the BLR classifier. Distance measures GDD, GBD and GCD from ComPhy were used with all three tree features included per distance respectively (nine features in total). The classification accuracy reached as high as 0.87 for balanced accuracy, 0.95 precision, 0.85 recall, and 0.95 AUC value for all the training sets from data set 1, but the classification accuracy showed only around 5% improvement for training sets from data set 2.

Multifucation tree generation
With polytomies being identified through the BLR classification, PolyPhy can make a multifurcating tree. The PolyPhy team took the most intuitive approach by using the average of the branch lengths from a bifurcation identified as a polytomy. For example, from Figure 5, Part I, where S 1 , S 2 , S 3 and S 4 are branch lengths from a bifurcation, S 1 , S 2 and S 3 were averaged and the branch S 4 was removed in order to make the subtree like Part II in Figure 5. Note, in the classification process, Poly-Phy generated either independent classifier for each jackknife-generated tree or a combined classifier for one overall tree to accommodate those who just want to see one final tree produced in the end when doing analysis. Therefore, in generating the final tree, a consensus multifurcating tree was produced from several independent multifurcating trees and multiple classifiers were applied. We applied the accuracy evaluation from Com-Phy [25] and evaluated the accuracies of the multifurcating trees against Bergey's taxonomy system. A comparable accuracy was achieved with 30% more polytomies being validated due to the fact that ComPhy previously only evaluated dichotomy branches.

Conclusion
Polyphy, a standalone tool for phylogeny reconstruction and polytomy identification, is robust and easy to use for studying evolution. It can construct a consensus tree without requiring multiple sequence alignment and can also identify polytomies without requiring manual interferences. It is a fully automated tool. The phylogeny reconstruction implements a jackknife recursive sampling procedure; then recursively trains the BLR classification model for polytomy identification. It allows users to infer a multifurcating phylogenetic tree for any set of microbial genomes of interest to study their evolutionary relationships. Although the tool is built for reconstructing a multifurcating phylogeny from whole genome sequences, users can have a pre-generated bifurcating tree from any phylogenetic tool as the starting point for classification. We believe that this is a timely and necessary development as more and more microbes are sequenced daily (especially from the metagenomic analysis of microbial communities) without reliable taxonomy established.

Taxon selections
We used a set of 398 single-chromosome microbial genomes, as shown in Table 2, consisting of 31 Archaea and 367 bacterial genome sequences. These sequences were previously used in the development of ComPhy [25]. The distribution of phylum clad in this dataset represents the trend of microbial taxonomy classification [30], in which most genome sequences are from three phyla, Proteobacteria, Firmicutes and Actinobacteria.
The second dataset consists of 83 microbial genomes [31], whose 16S rRNA sequences were obtained from the Greengenes microbial database [32]. These 83  Precision is the true positive/(true positive + false positive). Recall is true positive/(true positive + false negative). AUC quantifies the overall ability of the test to discriminate between positive samples and negative samples.
genomes are also part of the first data set of 398 microbial genomes, and they were used to generate phylogenetic tree solely based on the 16S rRNA sequences.

PolyPhy (phylogeny with polytomy identification) tool
This research developed a machine-learning tool, Poly-Phy, for polytomy identification. PolyPhy starts with ComPhy reconstructing a large-scale microbial phylogeny using whole-genome structural features. It obtains multiple unbiased trees through a jackknife procedure. PolyPhy uses three tree topological features to be discussed in following sections, and through recursive trainings using a machine learning method to identify polytomy from the tree. The last step of the tool is generating a meaningful multifurcating tree. No fixed threshold is imposed on any of the patterns to decide if a two-level bifurcation is a dichotomy or polytomy since every phylogenetic tree will have a different distance value and branch. Thus, through a machine-learning method, i.e., the BLR classifier, the extracted features can be clearly used to train the classification model without needing to know the exact cutoff values.

Phylogenetic tree reconstruction
Obtaining an optimal binary phylogenetic tree as input tree for PolyPhy is as important as identifying polytomy from the binary tree. It has been traditionally considered that likelihood-based methods have more accurate phylogenetic relationship inferences than distance-based methods such as Neighbor-Joining (NJ) [33,34]. However, in 2010, Roch [35] showed that when sets of large-scale species are involved in phylogenetic analysis, the distancebased method proved much more effective, practical and surprisingly comparable in performance [36]. Thus, given the amount of genomes used in this study, distancebased phylogenetic inference was applied.

Phylogenetic tree reconstruction using ComPhy
ComPhy [25] is a stand-alone Java-based tool which utilizes a robust and efficient strategy called 'Gene Composite Distance' to combine different aspects of evolutionary relationships among genomes for producing a phylogenetic tree from a given set of whole genome sequences. Specifically, composite distance measure starts with an all-against-all pairwise genome comparison using BLASTP [37]. In the second step, a distance matrix is calculated from three components, i.e., GDD (Gene Dispersion distance), GBD (Genome Breakpoint distance) and GCD (Gene Content distance). This distance matrix is then fed into a distance-based algorithm, Neighbor-Joining (NJ) [33,34], using a third-party tool, Phylip [23], to produce phylogenetic trees.
Phylogenetic tree reconstruction using 16s rRNA sequences Besides using the whole genome structural features to infer microbial phylogeny by ComPhy, we also applied Figure 5 Polytomy identification. Part I presents a three-level bifurcation subtree with 4 genome taxa A, B, C and D. X 1 , X 2 and X 3 are genome distances calculated among A, B and C. And S 1 , S 2 , S 3 , S 4 and S 5 are branch lengths. If bifurcating subtree of A, B and C is identified as a polytomy, then the sub-tree can be converted to a multifurcation subtree, shown as Part II. multiple-sequence alignment using 16s rRNA sequences from the second dataset of microbial genome (see more detail in Taxa selections section) for evolution distance inference. MUSCLE [38] was applied for multiple sequence alignments for datasets with the 83 microbes. Then a pairwise genome distance matrix was generated from the alignments using "distMat" from EMBOSS with Kimura correction [39]. Finally, this distance matrix was fed into Neighbor-Joining (NJ) [33,34], using Phylip [23] to produce a phylogenetic tree.

Jackknife procedure in phylogeny
Jackknifing is a statistical method to estimate the accuracy of sample statistics by repeatedly using subsets of available data to produce the results. In order to obtain multiple meaningful phylogenetic microbial trees, Poly-Phy provides the user with capability to generate trees from the same dataset through a jackknife procedure to obtain robust and non-biased training datasets of bifurcating subtrees from the trees. We subsequentially selected different subsets of whole genome gene sets in which ortholog ordering was the main component used in ComPhy distance measure to generate different trees. It is not a trivial task to apply the perturbation to ortholog selection and assess the robustness of the data. Though some have tried to use jackknife in large-scale phylogeny analysis, the studies were not conclusive [40,41]. Here, we conducted some of the performance tests on jackknife procedure using 398 microbial genomes.
The major steps are performed as follows: 1) Generating a new set of orthologs by randomly selecting k% from overall orthologs of each genome. Orders of the selected orthologs are preserved with respect to their order in the original genomes. 2) Reconstructing a tree replicate from these genomes with new ortholog subsets. 3) Repeat step 1 with same k value but different randomly chosen ortholog subsets 50 times to obtain 50 different tree replicates. 4) Compute a consensus tree and corresponding confidence values on all internal branches by using the CONSENSE program in PHYLIP [23] from 50 tree replicates. 5) Repeat Step 1 with different k values, ranging from 30 to 80. Through this procedure, an optimal selection of subset of orthologs can be determined for jackknife selection.

Polytomy identification through classification
We have developed a machine-learning process to carry out the task of identifying polytomies from given bifurcation tree. Figure 5.I is an example of the bifurcating tree with three levels of bifurcating branching, (((A, B), C), D). If evolutionary relationships among leaves genomes A, B and C cannot be fully resolved due to conflict supports with low bootstrap values or simply more likely simultaneous speciation, then we would classify this bifurcating branch as a polytomy subtree, as shown in Figure 5.II. We will utilize the BLR (Bayesian Logistic Regression) classification, a machine learning approach, to automatically learn and classify bifurcating branches into dichotomies or polytomies. After a classification model is trained from extracted features, we can classify a bifurcating branch into a dichotomy subtree or a polytomy subtree. The following shows the details of the procedure: (1) Bayesian logistic regression The Bayesian approach is attractive for its ability to incorporate prior knowledge into statistical inference; and the logistic regression can provide a simple but accurate model for the predictions with calibrated probabilities without extrapolating input datasets to a complex and higher dimension. Bayesian logistic regression utilizes the best features of both methods by employing the prior information about the success probability and recursively optimizing the prediction parameters to achieve the optimal classification/prediction rather than simple regression coefficient for classifiers [42][43][44]. BBR (Bayesian Logistic Regression Software) package [29] was used to train the model and classify the bifurcations for dichotomies and polytomies.
(2) Classification training datasets The format of training data for classification is comprised of various two-level bifurcating subtrees as shown in Figure 5.I. The gold standard for deciding whether or not a binary tree branching is a dichotomy or a polytomy is the tree topology provided by Bergey's microbial taxonomy classification system. For 398 microbial taxa, we constructed a bifurcating phylogenetic tree and were able to extract 370 dichotomies and 285 polytomies as training data. The dataset was split into 90% for model training and 10% for testing the accuracy of the prediction model.

(3) Classification features
In order to successfully train a classification model for bifurcating branches from a phylogenetic tree, there must be a meaningful set of features extracted from the bifurcating branches in a tree. Figure 5.I gives a hypothetical bifurcation (a two-layer binary subtree in phylogram format) with four different genomes, A, B, C and D as leaf nodes, in which genome D is used as outgroup taxa. Genome distances among A, B and C are represented as X 1 , X 2 and X 3 , which can be calculated from ComPhy or any other phylogeny construction tool preferred by the users. The branch lengths, S 1 , S 2 , S 3 , S 4 and S 5 are usually the branch lengths generated by different tree generation algorithms. With this information given by tree topology, three topological features can be generated which are consistent and can be extracted from any bifurcating branch of the tree. They are LR (Leaf rate) as Figure 6.I, IntraR (Intra-subset branch rate) as Figure 6.II and InterR (inter-subset branch rate) as Figure 6.III.
The common misconception of the branch length in phylogenetic tree is that it is an irrelevant factor for inferring species branching history compared to other types of topology information such as hierarchy of the branches. Although this view could simplify the presentation of the tree and still give the users a somewhat coarse and high-level examination to infer the evolution history using ordering of the taxa, the sensitivity of relative evolution timing is lost due to this assumption. For the next two topological features, branch length must be considered as a strong supporting factor for evolution history inferences. The correct branch length is not a random assignment, but rather a statistical derivation from different phylogenetic tools proportional to the predicted or hypothetical evolutionary time between organisms or sequences based on the presumed evolution model [45][46][47]. Therefore, the two patterns, IntraR and InterR extracted from the tree, use the idea of longbranch retraction and short-branch contraction for polytomy identification.

(4) Feature LR generation
This pattern studies the relative genome distances among three studied genomes. In Figure 6.I, if the bifurcations is a dichotomy, then X 3 , distance between genome A and genome B, will be the shortest genome distance among all three (X 1 , X 2 and X 3 ). Therefore, the summation of distance of A to C and B to C will always be greater than twice the distance of A to B. Also by normalizing with the average distance, this pattern can indicate the difference between a dichotomy and a polytomy: , where X 1 , X 2 and X 3 are pair-wise genome distances among A, B and C (see Figure 6.I).

(5) Feature IntraR generation
This feature considers all the branches within the twolevel bifurcation (Figure 6.II). If one assumes that the two-levels of branching from the top parent node in a given bifurcation correspond to two different steps of speciation, which is dichotomy, then one should see a clear separation between speciation of C and set of (A, B), and between A and B. Therefore, the branch of S 4 attempts to retract and stretch as far as possible to have the same branch length of S 3 . On the contrary, if this subtree of bifurcation is indeed a polytomy, then length of the branch S 4 will be as short as possible so that the length S 1 will be similar to that of S 2 in order to have simultaneous speciation. Hence, the following formula captures the contraction and retraction properties and uses them to identify dichotomy and polytomy: Intra − subset branch length = min (S 1 , S 2 ) /max(S 1 , S 2 ) min (S 3 , S 4 ) /max(S 3 , S 4 ) , where S 1 , S 2 , S 3 and S 4 are the branch lengths. S 1 and S 2 could have the same length, which has no impact on the formula.

(6) InterR generation
This last classifier feature covers the branch relationship between the two-level bifurcation and its level above through branch s5, as shown in Figure 6.III. If one assumes species A, B and C speciated at the same time, then the branch contraction property will force all branches including S 1 , S 2 and S 3 to have similar lengths of branching time for species A, B and C to be clustered together into a polytomy. With another branch S 5 from one higher level of hierarchic topology, it is easier to see that the contraction of S 4 against S 5 has an observable longer branch length. Most dichotomies show that S 5 is shorter than at least one of the intra-subset branches, while S 5 is much longer than all of the intro-subset branches for a polytomy. Thus: inter − subset branch length = max(S 1 , S 2 , S 3 , S 4 ) S 5 , where S 1 , S 2 , S 3 , S 4 and S 5 are the branch lengths.

(7) Classification procedure
For BLR classifier, properly estimating the posteriori probability is the crucial step in the process. In order to have the maximum posteriori estimate of the selected class label for input data, an optimal value of prior variance is needed through a rigorous training. In PolyPhy, the recursive prior variance selection can be performed within a range of values through k-fold cross-validation to obtain the optimal parameter settings for the training model. For each prior variance value, BLR trains k logistic regression models under a prior with that hyperparameter. Each model is trained on the union of k-1 of the subsets and tested on the remaining subset with each subset used as the test subset once. BLR then selects the prior variance that maximizes the average value of the log-likelihood of a training instance when it appears in the test subset. There are two ways to use the BLR classifier. One is the individual tree training and prediction. From each jackknife generated bifurcating tree, PolyPhy extracts all bifurcating subtrees. The gold standard matched dichotomies and polytomies are used as input to the classifier on the rest subsets. The best classification model is trained through 3-fold cross validation and obtained for each tree. The classifier is then applied to the rest of the bifurcating subtrees for each tree to identify the polytomies and convert those bifurcating branches into polytomies. Subsequently a consensus tree can be obtained from all the multifurcating trees. With this final tree, the result tree can be compared to bifurcating tree or consensus tree obtained from multiple bifurcating trees. Another simpler way of using the BLR classifier is to combine all the generated trees to create an overall consensus tree and extract one set of bifurcating subtree input based on the gold standard Bergey bifurcation set. Next a classifier model is generated to classify the rest of the bifurcating subtrees for each tree to identify the polytomies and convert those bifurcating branches into polytomies.