Category Archives: Note

Multilocus delimitation with tr2: Guide tree approach

The second option of delimitation with the “tr2” is a guide tree approach. A guide tree is a tree which specify a hierarchical structure of species grouping. By using it, you can significantly reduce the number of possible delimitation hypotheses to search.

The tr2 implements an algorithm to find a best position of nodes which define species group under a given guide tree. As the algorithm is reasonably fast, you can search the best delimitation through a tree from each tip is distinct species to all tips are from one single species.

Now, acceptable size of the number of taxa on guide tree is around 100. If you exceed 200 taxa, the memory requirement usually becomes huge and normal desktop computers can not handle. The current limitation of the number of input trees, that is, the limit of number of loci, is ~ 1000. Theoretically, it can be larger, but a problem on numerical calculation now limits the number of loci you can use.

To run tr2 with a guide tree, you need a newick formatted guide tree file, and a gene tree file also in newick format.

Only the first line of the guide tree file is used. In the standard analysis, guide tree tips must contain all taxa found in gene trees. Guide trees can be built any methods such as concatenated ML (eg. RAxML) or coalescent-based species tree methods (eg. ASTRAL). Most importantly , it must be properly rooted. Incorrect rooting often results in over-splitting.

The gene tree file must contain one tree per line. They must be rooted too. Missing taxa are allowed.

Once two files are ready, the command below starts a search algorithm.

./run_tr2.py -g guide.tre -t genetree.tre

To test with bundled example files, use files in “sim4sp” directory.

./run_tr2.py -g sim4sp/4sp.nex10.RTC.tre -t sim4sp/simulated.gene.trees.nex10.4sp.tre

After some intermediate outputs, you will see a tree with delimitation results and a table.

write: <stdout>
species sample
1     12.3
1    11.3
1    15.3
1    14.3
1    13.3
2    7.2
2    10.2

If the tree and table are too large on your console screen, use “-o” option to output them into files.

./run_tr2.py -o out -g sim4sp/4sp.nex10.RTC.tre -t sim4sp/simulated.gene.trees.nex10.4sp.tre

This command ouputs a table and a tree into “out.table.txt” and “out.tre” respectively. You can see the results using any programs.

Now, I use R + “ape” package to check a delimitation result.

library(ape)
tr <- read.tree("./out.tre")
plot(tr)
nodelabels(text=substr(tr$node.label,1,6), bg="white")

These R commands plots the guide tree and delimitation like the  below picture.

outdelimit

The numbers on the nodes indicate average differences of posterior probability scores. If they are positive the node has between-species branches. The negative values suggests that nodes are within species. “nan” indicates there are not enough samples to split/merge the node. “*” signs show the best position of delimitation. In this case, there are 4 putative species.

Advertisements

Multilocus delimitation with tr2: Model comparison

The “tr2” currently has two options for species delimitation. One is calculating posterior probability scores for user-specified delimitation hypotheses. Another option is finding the best delimitation under a guide tree, which specifies a hierarchical structure of species grouping.

The first option is probably useful to compare multiple species groupings and find the best one (such as comparing morphological species vs. mtDNA groups) while the second option can be used without any prior assignments and find species only from gene trees.

Let’s start with the first option. (I assume you have already set up an environment for tr2.)

You must have two input files: A gene tree file in Newick format and a tab-delimited text file which specify associations of species and individual samples.

In a tree file, one line must contain one gene tree. Trees can have missing taxa. They must be rooted. (Yes. The program is based on “rooted triplet”. So, trees must be rooted. If you do not have outgroups, midpoint rooting or RAxML’s “-I f” option often works well.)

In an association file, the first column represents the names of samples. They must be identical to the names of the tree tips. The second and so forth columns are species groups. You can write as many columns as you want. Also, you can use any codes to describe species names.

For example, a table below specifies three alternative delimitations of samples 16.4-20.4

19.4     4     B     sp5
17.4     4     B     sp5
18.4     4     B     sp4
16.4     4     A     sp4
20.4     4     A     sp4

Association files must contain all sample names which appear in the tree file.

Once you have a tree file and an association file, simply run the tr2 command as follows.

./run_tr2.py -a sp_association.txt -t genetrees.tre

Some example files are stored in the “sim4sp” folder. If you use them to test tr2, the command is like this.

./run_tr2.py -a sim4sp/sp.assoc.4sp.txt -t sim4sp/simulated.gene.trees.nex10.4sp.tre

The outputs of this command must be like below.

write: <stdout>
model score
null 51391.76
model1 5.73

The score of “model1” looks much smaller than the “null” model (, which assumes all samples are from one single species). So, you can be quite confident that model1 is a better delimitation.

How to install “tr2”

Once the environment is properly set up, the installation of tr2-delimitation is fairly simple. Download the file of tr2-delimitation from the bitbucket repository and decompress it in any place in your computer.

Just click a cloud-like icon on the left side pane and then click “Download repository” to start downloading. A decompressed folder may have a long name  like “tfujisawa-tr2-delimitation-84f248b5fa48”. Simply rename it like “tr2” if you want a handy access to the folder.

To test if the tr2 (and the environment) is correctly installed, let’s run it with a test data set.

$cd /path/to/yourfolder/tr2-delimitation #move to the installed folder
$python run_tr2.py -t sim4sp/simulated.gene.trees.nex10.4sp.tre -g sim4sp/guide.tree.4sp.tre

If you see a table of delimitation like below, the installation is successful.

write: <stdout>

species sample
1 1.1
1 2.1
1 3.1
1 4.1

...

4 17.4
4 18.4
4 19.4
4 20.4

 

In Unix-like OS’s, the program can be run without calling with python.

$./run_tr2.py -t sim4sp/simulated.gene.trees.nex10.4sp.tre -g sim4sp/guide.tree.4sp.tre

 

If you installed python with Anaconda and created a python2 environment, you need to first activate it,

$activate python2
$python run_tr2.py -t sim4sp/simulated.gene.trees.nex10.4sp.tre -g sim4sp/guide.tree.4sp.tre

 

You can integrate delimitation with species tree inference using the rooted triple consensus. To set up the triplec program, download it from its website, and create a folder named “bin” in the same folder as run_tr2.py and put the triplec in the “bin” folder.

$python run_tr2.py -t sim4sp/simulated.gene.trees.nex10.4sp.tre

If the command above returns a similar outputs, the triplec is correctly called.

How to set up an environment for “tr2”

Installing Python and check versions

Python is required to run the tr2-delimitation on your computer. In many modern operating systems, Python is pre-installed and you do not need to install it by yourself.

However, the installed version of Python matters. The tr2 is written in Python 2,  while, in some recent systems, Python 3 is the default version. This is simply because Python 2 was a standard when I started writing the codes, but Python 3 is becoming a new standard now. (I am translating the tr2 codes into Python3.) So, you first need to check the version of Python installed on your system.

Type on your console,

$python --version

If you see a message like below, you are running Python 2.

Python 2.7.6

In some decent OS’s, both Python2 and 3 are pre-installed, and you can call Python 2 by

$python2

even when your default Python is Python3.

If your environment does not have Python2 at all, visit the Python website and install it, or create a Python2 environment as explained in the next section.

Installing dependencies (scipy/numpy and Java)

Python packages called numpy/scipy is required to run tr2. These packages are for numerical calculations. Visit SciPy website and follow the instructions for your operating system to install Scipy libraries. Installing all Scipy related packages following the instruction is just fine though not all Scipy packages are required to run tr2.

An alternative, easy way to install dependencies is installing an all-in-one suite like Anaconda, which includes Python and related packages. As Anaconda allows you to run multiple versions of Python, you can install it with any version of Python.

If you choose to install Anaconda with Python3, you must run Python2 codes by creating Python2 environment.

$conda create --name python2 python=2 numpy scipy

This “conda” command creates an “environment” where the python version is 2 with numpy/scipy installed. You can switch to it by calling the “activate”  command,

$activate python2

and switch back to the default environment by the “deactivate” command.

(python2) $deactivate

 

Checking packages

You can check if the packages are properly installed by loading them on the interactive shell.

$python

(Just by typing “python”, “interactive shell” starts. You can quit this mode by pressing Ctl+d.)

>>>import numpy, scipy

If it does not return errors, packages are ready to use.

Java is also required to run the consensus tree building program, triplec. Almost all modern operating systems have Java as pre-installed software.

Multilocus delimitation with very short sequences

Are very short sequences useful for species tree reconstruction and species delimitation?

Sequence fragments generated by some NGS platforms are often very short (for example, 100bp with RAD) and each locus contains only a few informative sites. Intuitively, they do not look useful for tree reconstruction using gene tree-based method (that is, species tree estimation using estimated gene trees. The SNP-based methods are not included in this category.) because tree of each locus is unresolved and uninformative for subsequent specie s tree inference.

I did a bit of simulation to check if the gene tree-based species tree reconstruction and delimitation work on very short sequences. Results are rather surprising.

The procedure of simulation is simple.

  1. Set up a species tree.
  2. Simulate 500 gene trees within the species tree assuming multispecies coalescent.
  3. Simulate sequence alignments of 100bp along the gene trees.
  4. Reconstruct gene trees.
  5. Reconstruct a species tree.
  6. Delimit species using the species tree as a guide.

The R codes are available here so you can replicate the simulation. You need following programs to replicate: fastsimcoal2, Seq-gen, RAxML, ASTRAL and tr2.

The species tree used for simulation is below. I set the effective population size for each species in between of T3 and T2. So, I expect large amount of incomplete lineage sorting among sp1, sp2 and sp3. Sample size from each species is 5.

sptree.t

The picture below shows simulated gene trees (upper row), simulated alignments (middle row) and reconstructed gene trees (bottom row) for three loci.

tree.alig

The sequences and reconstructed gene trees appear to be barely informative. Although some distantly related species are monophyletic even in reconstructed trees,  closely related ones are unresolved. The bottom left tree is almost “flat”.

The tree below is a species tree reconstructed by ASTRAL with 500 simulated loci, and a result of delimitation. The numbers on the nodes are scores of delimitation (positive values indicate nodes are between-species branchings, and “*” signs show species groups.)

result.delimitation.500One surprising thing is that ASTRAL was able to reconstruct a correct species tree even with apparently uninformative gene trees. It seems delimitation with tr2 can work correctly too once the species tree is correctly estimated.

This is only one simulation run, but I am quite sure that even very short sequences contain information and can be useful with the gene tree-based reconstruction and delimitation, at least, under ideal conditions.

Of cause, real data contain lots of missing sites/loci. There may be gene flows. I guess the ideal conditions of multispecies coalescent are rarely met. I will test how the methods respond to the violations.

Update: 14/Sep/16

Some users of tr2 reported that it crashed when they input thousands of gene trees of RAD. I found that the numerical calculation underflows when the input is very large. I made some small fixes on the program, but it accepts only  about 1000 loci now.

I think the simulation above says even 500 loci can have enough information to delimit some difficult species. But, I am thinking of how to solve this limitation of the number of loci.

image.h.R: how to color your bases

I wrote a blog post about an R function to visualize an alignment matrix long time ago. I no longer use the function because the “ape” package now has a function, called “image.DNAbin”, which does exactly the same thing.

When I was analysing my SNP data, I found the image.DNAbin does not show ambiguous bases which represent heterozygous sites.

image.snp

 

Although I can tell the function to show ambiguous letters, say, “Y” or “M”, in the image, it does not automatically assign colors to them.

If the ambiguous bases are colored by the “in-between” of two unambiguous bases, (say, if A is red and C is yellow, M=A/C is in an orangish color), the plot will be much more informative.

R has a function to generate colors between two colors, called “colorRamp”. It is an interesting function which returns a function to interpolate colors between two colors.

RY <- colorRamp(c('red', 'yellow'))
RY(seq(0, 1, 0.1))
      [,1]  [,2] [,3]
 [1,]  255   0.0    0
 [2,]  255  25.5    0
...
[10,]  255 229.5    0
[11,]  255 255.0    0

In the case above, the function “RY” returns RGB values interpolating red and yellow. You can convert these values into color codes by the “rgb” function. Using these functions, a nice color gradation is easily drawn.

g <- rgb(colorRamp(c('darkblue', 'lightblue'))(seq(0,1, 0.1)), max=255)
barplot(1:10, col=g, border=F)

color.bar

 

Using the “colorRamp”, I wrote a simple code to generate 6 colors for heterozygous bases from 4 colors for unambiguous bases.

image.h <- function(sq, col=c('red','yellow','green','blue')) {
	bases <- c('a', 'g', 'c', 't')
	dbases <- c('r', 'm', 'w', 's', 'k', 'y')

	dcol <- apply(combn(col,2), 2, function(x){rgb(colorRamp(x)(0.6), max=255)})

	image(sq, col=c(col, dcol, 'lightgrey', 'black'), what=c(bases, dbases, 'n', '-'))
}

The four colors, red, yellow, green and blue, are colors for A, G, C, and T respectively. The “colorRamp” function in the fifth line generates 6 colors from the combinations of  the 4 colors.

The image drawn by the new function looks OK. The heterozygous sites are shown in reasonable colors, like orange, purple and bluish green.

imageh.snp

However, there is one problem. The color for “K” is grey, which is the same color for “N”. This is because the colors for “G” and “T” are complementary and produce grey when they are mixed.

Is there a set of colors for AGCT which does not produce grey when they are mixed?

I guess the answer is no, at least with simple colors like red-green-blue and cyan-magenta-yellow.  Because when you choose 4 colors from these basic 6 colors, at least one pair is always complementary. (If you see an RGB cube, this is obvious. You can never choose 4 vertices  without including a pair of opposite ones. Black and white are not allowed.)

Interestingly, when I google color schemes for DNA bases, they often use complementary colors to show complementary bases, say, blue for A and yellow for T. I think this is an intuitive decision, but I want to avoid using the complementary colors for my function.

I found designing good color schemes for DNA bases is an interesting topic, and  it looks  very difficult to design schemes which are both informative and beautiful. A set of vivid, informative colors often contains complementary colors while restricting your color space to avoid using them makes the image opaque and less informative.

What I did in the script  is to use 0.6, instead of 0.5 for colorRamp argument not to take the exact mid points between two colors and to avoid producing the identical greys. But, probably, there must be better approaches.

 

Is your multilocus phylogeny correct?

I am now working on a transcriptome data set for a phylogenomic project. Building trees from huge multilocus data obtained by NGS is now widely practiced, but how reliably we can reconstruct phylogeny with them is still largely unknown.

In my data set, most loci are very little informative because sequences of exons are sometimes extremely conservative (the number of variable sites is often less than 1 in 1000bp). Some loci possibly span thousands of bases and the effect of recombination within locus may not be ignored. In addition, as is often the case with NGS data, they have a quite large number of missing sites and missing loci. I imagine that my transcriptome data boldly violate the common assumptions of phylogenetic inference.

Methods for tree inference also matter. Is concatenation of all loci better than using the multi-species coalescent methods? Do you Infer gene trees first or do all inference simultaneously?

To check the reliability of my phylogenetic inference, I searched recent papers reporting the effects of the model violations and how different methods behave under different conditions and summarized them.

-Recombination

Lanier & Knowles(2012) reported that the effect of recombination is negligible, at least, within the range of recombination rates they tested (ρ=0.1 – 20) and with programs they used.  Their simulations showed that sample size and depth of species tree have much stronger effects on the accuracy of the multispecies coalescent methods.

Springer & Gatesy (2016) questioned the use of recombining loci. They suggest that the effective lengths of non-recombining portion of genes are infinitesimally small and it’s illogical to use multispecies coalescent methods to infer phylogeny with the frequently recombining loci.

There are only a few works on the effects of recombination on phylogenetic inference. However, the idea that the recombination is negligible looks plausible for me because recombinations in deep ancestral populations seem to rarely affect the shape of gene trees.

-Low-variation locus

There are opposite views on the usefulness of low-variation loci. A paper by Lanier et al. (2014) said that adding low-variation loci (loci with θ=0.001, equivalent to 1 difference in 1000 bp) did not add accuracy to two-step likelihood inference (that is, gene tree inference followed by species tree inference). On the other hand, a Bayesian method, *BEAST, can gain the accuracy from these marginally informative loci. They concluded that handling gene tree uncertainty is particularly important when you analyse little-informative data sets.

Xi et al. (2015)‘s results are quite opposite. They reported that the two-step inference, RAxML gene trees + MP-EST species tree, can indeed improve the accuracy with a large number of low-variation loci. They suggested it is biased gene trees which compromise accuracy.

I guess that these difference resulted from a slight difference on the methods they used. While Lanier et al. used a majority rule consensus from MrBayes to build gene trees, Xi et al. used a maximum likelihood of RAxML. By taking a consensus tree, the signals of low-variation loci was probably wiped out and the loci became completely uninformative. If they had used the methods like maximum clade credibility tree instead of consensus, results might have been different.

One surprising point is that even small non-randomness in programs can positively mislead inference when you use a very large number of loci.

-Missing data

There has been a long tradition of debates over effects of missing data on phylogenetic inference.  So, a plenty of papers consider this issue. A recent thorough evaluation is Xi et al. (2015) (another work of the authors above).  Their general conclusion is missing data can reduce the accuracy of tree inference, especially when they concentrate in some species. The detrimental effects of missing data is minimal when missing is “locus-based”, that is, some loci have more missing than others. Similar detrimental effects of “taxa-based” missing were reported by Roure et al. (2013).

This is probably the most problematic issue when handling NGS data. They often have a large variation of read coverages across samples, and consequently, missing data are more abundant in some samples than others. This “biased missing” must lead to decline of accuracy. How this type of missing affect tree shape is yet to be studied.

The parameters to be considered and their combinations are huge when we work on large scale multilocus phylogeny. Though there are a large number of papers testing the accuracy of inference under particular violations of models, we can not cover all possibilities.  But, I found some important points, especially on the choice of methods and treatment of missing loci/sites.