Category Archives: Review

Is your multilocus phylogeny correct?

I am now working on a transcriptome data set for a phylogenomic project. Building trees from huge multilocus data sets obtained by NGS is now widely practiced, but how reliably we can reconstruct phylogenies from such data is still largely unknown.

In my data set, most loci are only weakly informative because exon sequences are sometimes extremely conserved (often fewer than 1 variable site per 1000 bp). Some loci span thousands of bases, so the effect of recombination within a locus may not be negligible. In addition, as is often the case with NGS data, there are a large number of missing sites and missing loci. I suspect that my transcriptome data boldly violate the common assumptions of phylogenetic inference.

Methods for tree inference also matter. Is concatenating all loci better than using multispecies coalescent methods? Should you infer gene trees first, or do all the inference simultaneously?

To check the reliability of my phylogenetic inference, I searched recent papers reporting the effects of these model violations and how different methods behave under different conditions, and summarized them here.


-Recombination

Lanier & Knowles (2012) reported that the effect of recombination is negligible, at least within the range of recombination rates they tested (ρ = 0.1 – 20) and with the programs they used. Their simulations showed that sample size and the depth of the species tree have much stronger effects on the accuracy of multispecies coalescent methods.

Springer & Gatesy (2016) questioned the use of recombining loci. They suggest that the effective lengths of the non-recombining portions of genes are infinitesimally small, and that it is illogical to use multispecies coalescent methods to infer phylogeny from frequently recombining loci.

There are only a few studies on the effects of recombination on phylogenetic inference. However, the idea that recombination is negligible looks plausible to me, because recombination in deep ancestral populations seems to rarely affect the shape of gene trees.

-Low-variation locus

There are opposing views on the usefulness of low-variation loci. Lanier et al. (2014) reported that adding low-variation loci (loci with θ = 0.001, equivalent to 1 difference in 1000 bp) did not improve the accuracy of two-step likelihood inference (that is, gene tree inference followed by species tree inference). On the other hand, a Bayesian method, *BEAST, can gain accuracy from these marginally informative loci. They concluded that handling gene tree uncertainty is particularly important when you analyse weakly informative data sets.
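To see where your own loci sit on this informativeness scale, you can simply count variable columns per locus. A minimal sketch (the list-of-sequences alignment representation is my own, not from either paper):

```python
def variable_site_fraction(alignment):
    """Fraction of columns with more than one distinct base among A/C/G/T
    (gaps and ambiguity codes are ignored). `alignment` is a list of
    equal-length sequences, one per taxon."""
    length = len(alignment[0])
    variable = 0
    for i in range(length):
        bases = {seq[i] for seq in alignment if seq[i] in "ACGT"}
        if len(bases) > 1:
            variable += 1
    return variable / length

# A locus with 1 variable site in 1000 bp, the marginal range discussed above.
locus = ["A" * 999 + "C", "A" * 999 + "T", "A" * 1000]
print(variable_site_fraction(locus))  # 0.001
```

Loci near this threshold are exactly the ones where the choice between consensus-based and ML gene trees, discussed below, may matter.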

The results of Xi et al. (2015) are quite the opposite. They reported that two-step inference, RAxML gene trees + MP-EST species tree, can indeed improve in accuracy with a large number of low-variation loci. They suggested that it is biased gene trees that compromise accuracy.

I suspect that this difference resulted from a subtle difference in the methods they used. While Lanier et al. used a majority-rule consensus from MrBayes to build gene trees, Xi et al. used maximum likelihood trees from RAxML. By taking a consensus tree, the weak signal of the low-variation loci was probably wiped out and the loci became completely uninformative. If they had used a method like the maximum clade credibility tree instead of a consensus, the results might have been different.

One surprising point is that even a small non-random bias in a program can positively mislead inference when a very large number of loci are used.

-Missing data

There has been a long tradition of debate over the effects of missing data on phylogenetic inference, so plenty of papers consider this issue. A recent thorough evaluation is Xi et al. (2015) (another work by the authors above). Their general conclusion is that missing data can reduce the accuracy of tree inference, especially when the missing data are concentrated in particular species. The detrimental effects are minimal when the missingness is “locus-based”, that is, when some loci have more missing data than others. Similar detrimental effects of “taxon-based” missingness were reported by Roure et al. (2013).

This is probably the most problematic issue when handling NGS data. Read coverage often varies widely across samples, and consequently missing data are more abundant in some samples than in others. This “biased missingness” must lead to a decline in accuracy. How this type of missingness affects tree shape is yet to be studied.
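Whether missingness in a supermatrix is taxon-biased or locus-biased is easy to check before analysis. A minimal sketch, assuming a dict-of-sequences supermatrix and a list of locus boundaries (both formats are hypothetical, for illustration only):

```python
def missingness(matrix, partitions):
    """Summarize missing data ('-', '?' or 'N') per taxon and per locus.
    `matrix` maps taxon name -> concatenated sequence; `partitions` is a
    list of (start, end) index pairs, one per locus."""
    per_taxon = {t: sum(c in "-?N" for c in seq) / len(seq)
                 for t, seq in matrix.items()}
    per_locus = []
    for start, end in partitions:
        chunks = [seq[start:end] for seq in matrix.values()]
        missing = sum(c in "-?N" for chunk in chunks for c in chunk)
        per_locus.append(missing / sum(len(chunk) for chunk in chunks))
    return per_taxon, per_locus

# Toy supermatrix: taxonA lacks the whole second locus.
matrix = {"taxonA": "ACGT----", "taxonB": "ACGTACGT"}
per_taxon, per_locus = missingness(matrix, [(0, 4), (4, 8)])
print(per_taxon)  # {'taxonA': 0.5, 'taxonB': 0.0}
print(per_locus)  # [0.0, 0.5]
```

A strongly skewed per-taxon distribution would be the “taxon-based” pattern that the papers above flag as detrimental.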

The parameters to be considered, and their combinations, are huge when we work on large-scale multilocus phylogenies. Though there are a large number of papers testing the accuracy of inference under particular model violations, they cannot cover all possibilities. Still, I found some important points, especially on the choice of methods and the treatment of missing loci/sites.

[PaperReview]Intrinsic inference difficulties for trait evolution with Ornstein-Uhlenbeck models

Ho, L.S.T. and Ane, C. (2014) Intrinsic inference difficulties for trait evolution with Ornstein-Uhlenbeck models. Methods Ecol. Evol. 5(11): 1133-1146. doi: 10.1111/2041-210X.12285

The Ornstein-Uhlenbeck (OU) model is commonly used for studying trait evolution. It extends the Brownian motion model by adding a term for the “pull” toward an optimum trait value. The OU model reasonably describes trait evolution under natural selection, which is often the main focus of evolutionary studies, and it is widely used to test whether selection acts on a studied trait. This paper reports that inference with the OU model has limitations which are often ignored.

Several limitations of the OU model are discussed. For example, the optimum trait value and the ancestral state cannot be estimated simultaneously: you cannot separately estimate the optimal trait value μ and the ancestral state y0, because the likelihood surface forms a “ridge”. This unidentifiability issue appears when the shift of the selection optimum occurs only once on the tree and the group under the same selection regime forms a connected subtree. (I don’t fully understand the maths behind the unidentifiability, but I can intuitively see that the left tree is bad and the right one is OK.)
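The ridge can be seen directly from the OU mean function E[X_T] = y0·e^(−αT) + μ·(1 − e^(−αT)): on an ultrametric tree under a single regime, any (y0, μ) pair giving the same tip expectation yields the same likelihood, because the covariances do not involve y0 or μ. A numerical sketch (parameter values are arbitrary):

```python
import math

def ou_tip_mean(y0, mu, alpha, depth):
    """Expected tip value under a single-regime OU process:
    E[X_T] = y0 * exp(-alpha*T) + mu * (1 - exp(-alpha*T))."""
    w = math.exp(-alpha * depth)
    return y0 * w + mu * (1.0 - w)

alpha, depth = 1.0, 2.0
w = math.exp(-alpha * depth)

m1 = ou_tip_mean(0.0, 10.0, alpha, depth)
# Slide along the ridge: pick a different ancestral state, compensate mu.
y0_alt = 5.0
mu_alt = (m1 - y0_alt * w) / (1.0 - w)
m2 = ou_tip_mean(y0_alt, mu_alt, alpha, depth)

# Identical tip expectations, hence identical likelihoods: unidentifiable.
print(abs(m1 - m2) < 1e-12)  # True
```

Every point on this line in (y0, μ) space fits the data equally well, which is exactly the ridge the paper describes.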


Unidentifiable and identifiable cases for estimations of selection optima from Ho and Ane (2014).

Other points include: model selection with AIC, often used to find shifts of selection regime, can be misleading because parameter-rich models are not correctly penalized. Also, the power of model selection is limited, and adding more taxa does not readily improve accuracy.

I was shocked when I first saw this paper. I have been playing with the OU model recently for a new project, and if the OU model were useless, my ideas would all be ruined. Fortunately, it is limited but not useless: it is still possible to estimate useful parameters. For example, the optimum trait value cannot be estimated, but the expected difference between two selection regimes can be.

The authors give recommendations for handling the limitations of the OU model, including adding fossil records and re-parameterization techniques. These are very useful guidelines for checking the feasibility of an OU-based analysis, and we should carefully consider the conditions under which the OU model is misleading before using it.

[PaperReview] An Intuitive, Informative, and Most Balanced Representation of Phylogenetic Topologies

One thing I didn’t do last year on this blog is review papers. The reason is not just that I was lazy; I probably put too much effort into each review post. That made every post a burden, and the frequency of reviews dropped. So keeping review articles simple seems important. The purpose of review posts on this blog is to archive what I find interesting and worth sharing with others; over-emphasizing details is unnecessary for this purpose.

I will restart writing reviews of bioinformatics papers, keeping this in mind.

Iwasaki, W., Takagi, T. (2010) An Intuitive, Informative, and Most Balanced Representation of Phylogenetic Topologies. Syst. Biol. 59(5): 584-593. doi: 10.1093/sysbio/syq044

Visualizing the information in multiple conflicting phylogenetic trees is a difficult task. The phylogenetic network is perhaps the most frequently used representation, but its visual interpretation is not always straightforward. Iwasaki & Takagi (2010) proposed an alternative method, the “centroid wheel tree”, for this task.

The centroid wheel tree (CWT) is based on a consensus tree of multiple trees. What distinguishes it from an ordinary consensus tree is the ordering of branches. Instead of being placed randomly, branches descending from an unresolved node are arranged so that branches that are grouped together more often are placed closer together. This circular ordering at each node, together with the numbers between branches, presents the frequency of occurrence of clades. Once you get used to reading a CWT, you find it contains most of the information you need to interpret the results of phylogenetic analyses.
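The core idea of the ordering step can be sketched as a toy optimization: arrange the children of an unresolved node so that branches with a high co-grouping frequency end up adjacent on the wheel. This brute-force version is my own illustration, not the paper’s algorithm (which is far more careful and efficient):

```python
import itertools

def wheel_order(children, freq):
    """Arrange child branches of an unresolved node in a circle so that the
    total grouping frequency between adjacent branches is maximized.
    Brute force over circular orders (fine for one node's small degree)."""
    first = children[0]  # fix one child to remove rotational symmetry
    best, best_score = None, -1.0
    for rest in itertools.permutations(children[1:]):
        order = (first,) + rest
        adjacent = zip(order, order[1:] + (order[0],))
        score = sum(freq.get(frozenset(pair), 0.0) for pair in adjacent)
        if score > best_score:
            best, best_score = list(order), score
    return best

# Toy frequencies: A+B and C+D are often grouped in the input trees.
freq = {frozenset(p): f for p, f in
        [(("A", "B"), 0.9), (("C", "D"), 0.8)]}
print(wheel_order(["A", "B", "C", "D"], freq))  # ['A', 'B', 'C', 'D']
```

The frequently co-grouped pairs end up adjacent, which is the visual cue a CWT exploits.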


Interpretations of the notations on a centroid wheel tree. From Iwasaki & Takagi (2010). See the paper for details.

I think the CWT is one of the best phylogenetic visualization methods I have seen. It presents complicated information in a simple but informative way. Unfortunately, it is not widely used. (At least, I haven’t seen CWTs in any phylogenetic literature.) A possible reason is that it is not implemented in major phylogenetic analysis packages, or simply that it is not known to the biologist community.

[PaperReview] Resolving Ambiguity of Species Limits and Concatenation in Multilocus Sequence Data

I will regularly review journal articles about computational or statistical methods with which non-computational biologists are probably not familiar.

In this post, I will start with a work of my former colleagues.

Chesters, D. and Vogler, A.P. (2013) Resolving Ambiguity of Species Limits and Concatenation in Multilocus Sequence Data for the Construction of Phylogenetic Supermatrices. Syst Biol., 62(3):456-66. doi:10.1093/sysbio/syt011

Finding the corresponding sequences between species is the starting point of phylogenetic inference. This can be viewed as the process of sorting out the COLUMNS of the data matrix for the inference, and it has attracted researchers’ attention because of its importance. On the other hand, how to concatenate multiple loci, the procedure for sorting out the ROWS of the data matrix, has not been seriously studied so far.

Now, with rapidly growing sequence databases, building a large supermatrix from downloaded sequences is becoming daily practice, and the concatenation of sets of partially unidentified sequences is more and more difficult. Despite the need for an objective, automated way to perform concatenation, such studies have rarely been done.

Chesters & Vogler (2013) developed a method to objectively sort the rows of the matrix.

They designed a pipeline for the automated concatenation of multiple loci as follows. Sequences of each locus are first clustered into species-level entities using BlastClust. The components within each locus (i.e. the species entities in that locus) are then connected to components of other loci based on shared names. This creates a graph structure called a “multipartite graph”: no links exist between components within a partite set, but connections between partite sets are allowed. One partite set represents one locus, and linking loci corresponds to concatenation. The task is to find an optimal matching, in which at most one link is left per component while minimizing the number of components left unconnected. This is done by a procedure called “maximal cardinality matching”.
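The matching step itself is classic graph theory. For two loci the multipartite graph reduces to a bipartite one, and maximum cardinality matching can be sketched with Kuhn’s augmenting-path algorithm (the entity names and link structure below are invented toy data, not from the paper):

```python
def max_cardinality_matching(edges, left_nodes):
    """Kuhn's augmenting-path algorithm for maximum bipartite matching.
    `edges` maps each left node (a species entity at locus 1) to the
    right nodes (entities at locus 2) it could be concatenated with."""
    match_right = {}  # right node -> currently matched left node

    def try_augment(u, visited):
        for v in edges.get(u, []):
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be re-matched elsewhere
            if v not in match_right or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    for u in left_nodes:
        try_augment(u, set())
    return {u: v for v, u in match_right.items()}

# Toy links between species entities of two loci, e.g. via shared names.
links = {"cox1_A": ["16S_A", "16S_B"], "cox1_B": ["16S_B"], "cox1_C": []}
pairs = max_cardinality_matching(links, ["cox1_A", "cox1_B", "cox1_C"])
print(pairs)  # {'cox1_A': '16S_A', 'cox1_B': '16S_B'}
```

Each matched pair becomes one concatenated row of the supermatrix; unmatched entities (like cox1_C here) are left with missing data for the other locus.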

Concatenation of 2 loci with maximal cardinality matching. From Chesters & Vogler (2013)

The strength of the evidence for connections is taken into account by adopting different scoring schemes. For instance, two points are given for perfect name matches, and one more is added if a specimen’s voucher code also matches. The problem then becomes “maximal weighted matching”.
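The scoring scheme can be sketched as a small function; the field names and dict representation are illustrative, following the example above rather than the paper’s full set of rules:

```python
def link_score(entity1, entity2):
    """Score the evidence that two species-level entities from different
    loci belong together: 2 points for an exact name match, plus 1 if the
    specimen voucher codes also match."""
    score = 0
    if entity1["name"] == entity2["name"]:
        score += 2
        v1, v2 = entity1.get("voucher"), entity2.get("voucher")
        if v1 and v1 == v2:
            score += 1
    return score

a = {"name": "Carabus granulatus", "voucher": "BMNH-1234"}
b = {"name": "Carabus granulatus", "voucher": "BMNH-1234"}
c = {"name": "Carabus granulatus", "voucher": None}
print(link_score(a, b), link_score(a, c))  # 3 2
```

These scores then become edge weights in the matching problem, so links backed by more evidence are preferred when names alone are ambiguous.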

Using sequence clustering and the matching algorithm, followed by alignment, they built supermatrices of 4 loci (CO1, 16S, 18S and 28S) from downloaded Coleoptera data under different weighting schemes and performed phylogenetic inference. The resulting trees were compared with each other and with the Linnean-name-based tree.

They obtained ~7300 concatenates with 62% missing sites from their sequence-based procedure. The weighting schemes had virtually no effect, except for the most stringent one. The Linnean-name-based concatenation gave ~7600 concatenates with 59% missing data, a slightly lower missing rate. The sequence-based tree had slightly higher congruence with the taxonomy and also higher bootstrap support, but these results are inconclusive because only the common part of the trees was compared.

I found this study very interesting. No one had seriously considered how to concatenate sequences before. One reason for this neglect is probably that, within a phylogenetic project in a single laboratory, concatenation is unmistakable because it is clearly known which specimen the sequences of all loci come from. I have actually done a similar database search and concatenation on a much smaller scale. Of course, I used Linnean names…! As mentioned in the discussion, this crude Linnean-name-based approach will soon be inadequate, or even impossible, because of the large influx of new sequences. This research is a new step toward database-driven phylogenetic inference.

It has been reported that the number of unidentified or only partially identified sequences in GenBank has grown rapidly (they are called “dark taxa”). So not only species names but also evidence such as voucher codes will be useful for reliable concatenation.