I recently wrote a blog post about the effect of reduced gene flow on the multilocus species delimitation program tr2. The general pattern is that, even if two populations are connected by moderate gene flow, tr2 erroneously splits them into two “species” when a very large number of loci is used. This result means we need to be cautious when applying it to big data sets.
One thing I didn’t test is the effect of unresolved gene trees. In the previous post, I assumed that gene trees are fully resolved. This assumption is rarely met in real data sets, as RAD markers or exons of RNAs are not variable enough to give us fully resolved trees. Does this also affect the pattern of oversplitting by tr2?
I just tested the effects of less-variable markers on tr2-delimitation. The simulation settings are identical to those in the previous post, except that the branch lengths of gene trees are proportional to the mutation rate. As the mutation rate gets smaller, gene trees become less resolved.
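To see why a lower mutation rate leaves gene trees unresolved, here is a minimal sketch, not the code used in these simulations. It assumes mutations fall on a branch as a Poisson process, that the expected number of substitutions on a branch is its coalescent length times Neμ times the number of sites (a proportionality constant of 1), and that a branch with no substitutions collapses into a polytomy. The function name, the branch length of 1.0, and the locus length of 100 sites are all hypothetical values chosen for illustration.

```python
import math


def prob_no_mutation(branch_len, ne_mu, n_sites):
    """Poisson probability that a branch carries zero substitutions.

    Assumption (not from the post): expected substitutions on the branch
    = branch_len * ne_mu * n_sites, i.e. a proportionality constant of 1.
    """
    expected = branch_len * ne_mu * n_sites
    return math.exp(-expected)


# Hypothetical branch of length 1.0 on a hypothetical 100-site locus,
# evaluated at a few mutation parameters within the simulated range.
for ne_mu in (0.005, 0.05, 0.5, 5):
    p = prob_no_mutation(branch_len=1.0, ne_mu=ne_mu, n_sites=100)
    print(f"Ne*mu = {ne_mu}: P(branch collapses) = {p:.3g}")
```

Under these assumptions, the chance that a branch is invisible in the inferred tree falls off exponentially as the mutation rate rises, which is the qualitative behaviour described above.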
The gene trees simulated under Neμ = 0.005 are plotted below. They look realistic, just like the gene trees we see in practice. (You may think that these trees are not informative enough to detect recent speciation. Actually, they are informative.)
So, how does tr2 oversplit populations connected by gene flow, and how does mutation rate affect the pattern of oversplits?
The above plot is the species-population tree used in the simulations. Pop.1 and pop.2 are connected by gene flow.
The plot below shows the proportion of trials in which samples from the two populations were assigned to one species. The migration parameter is Nem = 1.0, and the population mutation parameter, Neμ, ranges from 0.005 to 5.
The pattern of oversplits is similar to the previous simulations (see the plot). When Neμ = 0.5 or 5, the curves are almost identical to the curve for fully resolved gene trees. With > 500 loci, the populations are always split into two “species”. With fewer mutations, the curves diverge from the ideal curve and oversplitting slows down.
This is probably a good result. At least, less informative markers don’t lead you to false positives. However, they are probably less sensitive to the true pattern and you may miss it.
When you use tr2-delimitation, you probably need to consider two points: how informative your markers are and how many markers you use. As you increase the number and informativeness of loci, you can detect finer-scale structure, which may or may not correspond to true species.
Again, it is often hard to determine the “sweet spot” of the number and variability of markers for detecting “true” species, since speciation is always continuous. I am not sure if this problem can be solved by delimitation with the full multispecies coalescent model. (tr2 is an approximate method.) Explicit modeling of gene flow or geographic distribution is probably the better approach to tackle this problem, but it is usually time-consuming. It may be possible to use resampling of loci to check how the pattern of splits develops, and combine it with locus informativeness to find the best threshold for true species entities.
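The resampling idea could be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing tr2 feature: `lumping_curve` is a hypothetical helper, and `delimit` is a placeholder for whatever function runs a delimitation on a subset of loci and returns the number of inferred species. The toy stand-in below is not tr2; it simply "splits" once enough signal has accumulated, to mimic the behaviour seen in the simulations.

```python
import random


def lumping_curve(loci, delimit, sizes, n_reps=100, seed=0):
    """Proportion of resampled subsets in which the samples are lumped.

    loci    : list of per-locus inputs (e.g. gene trees)
    delimit : hypothetical callable; takes a list of loci and returns
              the number of delimited species
    sizes   : subset sizes to try (each must be <= len(loci))
    """
    rng = random.Random(seed)
    curve = {}
    for k in sizes:
        lumped = 0
        for _ in range(n_reps):
            subset = rng.sample(loci, k)  # resample without replacement
            if delimit(subset) == 1:
                lumped += 1
        curve[k] = lumped / n_reps
    return curve


def toy_delimit(subset):
    """Toy stand-in for a real delimitation run (NOT tr2).

    Each locus is an integer "signal"; the toy splits the samples into
    two species once the summed signal reaches an arbitrary threshold.
    """
    return 2 if sum(subset) >= 5 else 1


# Toy loci: about one in five carries a unit of divergence signal.
toy_loci = [1 if i % 5 == 0 else 0 for i in range(200)]
curve = lumping_curve(toy_loci, toy_delimit, sizes=[5, 20, 50, 100])
for k, prop in curve.items():
    print(f"{k} loci: lumped in {prop:.2f} of replicates")
```

Plotting such a curve for real data, alongside a measure of per-locus informativeness, would show how quickly the split appears as loci are added. Where to place the threshold remains a judgment call.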