[PaperReview] Resolving Ambiguity of Species Limits and Concatenation in Multilocus Sequence Data

I will regularly review journal articles on computational or statistical methods that non-computational biologists are probably not familiar with.

In this post, I will start with a paper by my former colleagues.

Chesters, D. and Vogler, A.P. (2013) Resolving Ambiguity of Species Limits and Concatenation in Multilocus Sequence Data for the Construction of Phylogenetic Supermatrices. Syst Biol., 62(3):456-66. doi:10.1093/sysbio/syt011

Finding corresponding sequences across species is the starting point of phylogenetic inference. It can be viewed as sorting out the COLUMNS of the data matrix used for the inference, and because of its importance it has attracted a lot of attention from researchers. On the other hand, how to concatenate multiple loci, that is, how to sort out the ROWS of the data matrix, has not been seriously studied so far.

Now that sequence databases are growing rapidly, building a large supermatrix from downloaded sequences is becoming a daily practice, and concatenating sets of partially unidentified sequences is becoming more and more difficult. Despite the need for an objective, automated way to do concatenation, it has rarely been studied.

Chesters & Vogler (2013) developed a method to objectively sort the rows of the matrix.

They designed a pipeline for automated concatenation of multiple loci, which works as follows. Sequences of each locus are first clustered into species-level entities using BlastClust. The components within a locus (i.e. the species entities of that locus) are then connected to components of other loci based on shared names. This creates a graph structure called a “multipartite graph”. In a multipartite graph, no links exist between components within the same partite, but links between partites are allowed. One partite in this case represents one locus, and linking components across loci means concatenating them. The task is to find the optimal match, in which each component keeps at most one link to any other locus while the number of components left unconnected is minimized. This is done by a procedure called “maximal cardinality matching”.

Concatenation of 2 loci with maximal cardinality matching. From Chesters & Vogler (2013)
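To make the matching step more concrete, here is a minimal sketch for the two-locus case. It is my own toy illustration, not the code of Chesters & Vogler (2013): the cluster names and the links between them are made up, and I use the textbook augmenting-path approach for what graph theory usually calls maximum cardinality matching in a bipartite graph.

```python
def maximum_cardinality_matching(links):
    """Match components of locus A to components of locus B.

    links: dict mapping each component of locus A to the list of
           locus-B components it shares a name with (hypothetical input format).
    Returns a dict {a_component: b_component} in which every component
    appears in at most one pair and the number of pairs is as large as possible.
    """
    match_of_b = {}  # locus-B component -> locus-A component matched to it

    def try_augment(a, visited):
        # Try to find a partner for `a`, re-matching earlier components if needed.
        for b in links.get(a, []):
            if b in visited:
                continue
            visited.add(b)
            # `b` is free, or its current partner can move to another component.
            if b not in match_of_b or try_augment(match_of_b[b], visited):
                match_of_b[b] = a
                return True
        return False

    for a in links:
        try_augment(a, set())
    return {a: b for b, a in match_of_b.items()}


# Hypothetical species-level clusters of two loci, linked by shared names.
links = {
    "CO1_c1": ["16S_c1", "16S_c2"],
    "CO1_c2": ["16S_c1"],
    "CO1_c3": ["16S_c2", "16S_c3"],
}

matching = maximum_cardinality_matching(links)
for co1_cluster, s16_cluster in sorted(matching.items()):
    print(co1_cluster, "<->", s16_cluster)
```

In this toy example, taking the first available link for each cluster would pair CO1_c1 with 16S_c1 and leave CO1_c2 without a partner. The augmenting step moves CO1_c1 to 16S_c2 instead, so all three CO1 clusters end up concatenated with a 16S cluster.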

The strength of the evidence for each connection is taken into account by adopting different scoring schemes. For instance, two points are given for a perfect match, and one more is added if a specimen’s voucher code also matches. The problem then becomes “maximal weighted matching”.
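A small sketch of the weighted version may help here. The scoring function follows my reading of the example above (two points for a perfect match, one more if the voucher code also matches). Everything else is an assumption for illustration: the cluster names and evidence flags are invented, and the exhaustive search is only practical for tiny graphs, so it is not the algorithm one would use on thousands of clusters.

```python
def link_score(perfect_match, voucher_match):
    """Score one link: 2 for a perfect match, +1 if the voucher code also matches.
    Links without a perfect match score 0 here, which is my own assumption."""
    if not perfect_match:
        return 0
    return 2 + (1 if voucher_match else 0)


def best_weighted_matching(weighted_links):
    """Exhaustive search for the matching with the highest total score.

    weighted_links: dict {(a_component, b_component): score}.
    Returns (total_score, list_of_pairs). Only suitable for toy examples.
    """
    edges = sorted(weighted_links.items())

    def best(i, used_a, used_b):
        if i == len(edges):
            return 0, []
        (a, b), weight = edges[i]
        # Option 1: drop this link.
        best_score, best_pairs = best(i + 1, used_a, used_b)
        # Option 2: keep this link if both components are still free.
        if a not in used_a and b not in used_b:
            score, pairs = best(i + 1, used_a | {a}, used_b | {b})
            if score + weight > best_score:
                best_score, best_pairs = score + weight, [(a, b)] + pairs
        return best_score, best_pairs

    return best(0, frozenset(), frozenset())


# Hypothetical evidence for each link: (perfect match?, voucher code match?)
evidence = {
    ("CO1_c1", "16S_c1"): (True, True),
    ("CO1_c1", "16S_c2"): (True, False),
    ("CO1_c2", "16S_c1"): (True, False),
    ("CO1_c2", "16S_c2"): (True, True),
}
weighted_links = {pair: link_score(*flags) for pair, flags in evidence.items()}

total, pairs = best_weighted_matching(weighted_links)
print(total, pairs)
```

Both possible pairings in this example connect two pairs of clusters, so cardinality alone cannot choose between them. With the scores, the pairing supported by voucher codes (total 6) wins over the one supported by names only (total 4).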

Using sequence clustering and the matching algorithm, followed by alignment, they built supermatrices of 4 loci (CO1, 16S, 18S and 28S) from downloaded Coleoptera data under different weighting schemes and ran phylogenetic inference. The resulting trees were compared with each other and with the tree based on Linnean names.

Their sequence-based procedure produced ~7300 concatenates with 62% missing sites. The weighting schemes had virtually no effect except for the most stringent one. The Linnean name-based procedure produced ~7600 concatenates with 59% missing data, a slightly lower missing rate. The sequence-based tree had slightly higher congruence with the taxonomy and also higher bootstrap support, but these results are inconclusive because only the part common to both trees was compared.

I found this study very interesting. No one has seriously considered how to concatenate sequences so far. One reason for this neglect is probably that concatenation is unambiguous in a phylogenetic project run within one laboratory, where it is clearly known which specimen the sequences of each locus come from. I have actually done a similar database search and concatenation on a much smaller scale. Of course, I used Linnean names…! As mentioned in the discussion, this crude Linnean-based approach will soon be inadequate or even impossible because of the large influx of new sequences. This research is a new step toward database-driven phylogenetic inference.

It has been reported that the number of unidentified or only partially identified sequences in GenBank has increased sharply in recent years (they are called “dark taxa”). So, not only species names but also other evidence, such as voucher codes, will be useful for reliable concatenation.
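For readers who wonder where a figure like "62% missing" comes from, here is a toy sketch of one common way to count it. This is my assumption about the bookkeeping, not the authors' code: each concatenate becomes one row, a locus without a matched sequence is padded with "?", and the missing rate is the share of padded sites in the whole matrix. The sequences and IDs are invented.

```python
def build_supermatrix(concatenates, alignments, missing="?"):
    """Build one row per concatenate by joining the aligned loci.

    concatenates: list of dicts {locus: sequence_id}; absent loci are omitted.
    alignments:   dict {locus: {sequence_id: aligned_sequence}}.
    Loci without a sequence are padded with the `missing` character.
    """
    loci = sorted(alignments)
    # All sequences of one locus have the same aligned length.
    lengths = {locus: len(next(iter(alignments[locus].values()))) for locus in loci}
    rows = []
    for concatenate in concatenates:
        row = ""
        for locus in loci:
            seq_id = concatenate.get(locus)
            if seq_id is None:
                row += missing * lengths[locus]
            else:
                row += alignments[locus][seq_id]
        rows.append(row)
    return rows


def missing_fraction(rows, missing="?"):
    """Fraction of sites in the matrix that are padded as missing."""
    total_sites = sum(len(row) for row in rows)
    missing_sites = sum(row.count(missing) for row in rows)
    return missing_sites / total_sites


# Hypothetical aligned sequences for two loci.
alignments = {
    "CO1": {"s1": "ACGT", "s2": "ACGA"},
    "16S": {"t1": "GGC"},
}
# Two concatenates: the second one has no 16S sequence.
concatenates = [
    {"CO1": "s1", "16S": "t1"},
    {"CO1": "s2"},
]

rows = build_supermatrix(concatenates, alignments)
fraction = missing_fraction(rows)
print(rows, round(fraction, 3))
```

Here 3 of the 14 sites are padded, so the missing rate is about 21%. Whether alignment gaps are also counted as missing is a separate choice that this sketch leaves out.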

