Are very short sequences useful for species tree reconstruction and species delimitation?
Sequence fragments generated by some NGS platforms are often very short (for example, 100bp with RAD) and each locus contains only a few informative sites. Intuitively, they do not look useful for tree reconstruction using gene tree-based method (that is, species tree estimation using estimated gene trees. The SNP-based methods are not included in this category.) because tree of each locus is unresolved and uninformative for subsequent specie s tree inference.
I did a bit of simulation to check if the gene tree-based species tree reconstruction and delimitation work on very short sequences. Results are rather surprising.
The procedure of simulation is simple.
- Set up a species tree.
- Simulate 500 gene trees within the species tree assuming multispecies coalescent.
- Simulate sequence alignments of 100bp along the gene trees.
- Reconstruct gene trees.
- Reconstruct a species tree.
- Delimit species using the species tree as a guide.
The species tree used for simulation is below. I set the effective population size for each species in between of T3 and T2. So, I expect large amount of incomplete lineage sorting among sp1, sp2 and sp3. Sample size from each species is 5.
The picture below shows simulated gene trees (upper row), simulated alignments (middle row) and reconstructed gene trees (bottom row) for three loci.
The sequences and reconstructed gene trees appear to be barely informative. Although some distantly related species are monophyletic even in reconstructed trees, closely related ones are unresolved. The bottom left tree is almost “flat”.
The tree below is a species tree reconstructed by ASTRAL with 500 simulated loci, and a result of delimitation. The numbers on the nodes are scores of delimitation (positive values indicate nodes are between-species branchings, and “*” signs show species groups.)
One surprising thing is that ASTRAL was able to reconstruct a correct species tree even with apparently uninformative gene trees. It seems delimitation with tr2 can work correctly too once the species tree is correctly estimated.
This is only one simulation run, but I am quite sure that even very short sequences contain information and can be useful with the gene tree-based reconstruction and delimitation, at least, under ideal conditions.
Of cause, real data contain lots of missing sites/loci. There may be gene flows. I guess the ideal conditions of multispecies coalescent are rarely met. I will test how the methods respond to the violations.
Some users of tr2 reported that it crashed when they input thousands of gene trees of RAD. I found that the numerical calculation underflows when the input is very large. I made some small fixes on the program, but it accepts only about 1000 loci now.
I think the simulation above says even 500 loci can have enough information to delimit some difficult species. But, I am thinking of how to solve this limitation of the number of loci.