Featured post

tr2-delimitation in python3

I just announce that the tr2-delimitation in Python3 is now available from the following repository.

https://bitbucket.org/tfujisawa/tr2-delimitation-python3

It should return results exactly the same as the Python2 version (I verified it with the data set of the Fujisawa et al. (2016) paper).

I will maintain the old Python2 version, but the new functions will be added to this Python3 version.

Advertisements

Multilocus delimitation with tr2: Guide tree approach

The second option of delimitation with the “tr2” is a guide tree approach. A guide tree is a tree which specify a hierarchical structure of species grouping. By using it, you can significantly reduce the number of possible delimitation hypotheses to search.

The tr2 implements an algorithm to find a best position of nodes which define species group under a given guide tree. As the algorithm is reasonably fast, you can search the best delimitation through a tree from each tip is distinct species to all tips are from one single species.

Now, acceptable size of the number of taxa on guide tree is around 100. If you exceed 200 taxa, the memory requirement usually becomes huge and normal desktop computers can not handle. The current limitation of the number of input trees, that is, the limit of number of loci, is ~ 1000. Theoretically, it can be larger, but a problem on numerical calculation now limits the number of loci you can use.

To run tr2 with a guide tree, you need a newick formatted guide tree file, and a gene tree file also in newick format.

Only the first line of the guide tree file is used. In the standard analysis, guide tree tips must contain all taxa found in gene trees. Guide trees can be built any methods such as concatenated ML (eg. RAxML) or coalescent-based species tree methods (eg. ASTRAL). Most importantly , it must be properly rooted. Incorrect rooting often results in over-splitting.

The gene tree file must contain one tree per line. They must be rooted too. Missing taxa are allowed.

Once two files are ready, the command below starts a search algorithm.

./run_tr2.py -g guide.tre -t genetree.tre

To test with bundled example files, use files in “sim4sp” directory.

./run_tr2.py -g sim4sp/4sp.nex10.RTC.tre -t sim4sp/simulated.gene.trees.nex10.4sp.tre

After some intermediate outputs, you will see a tree with delimitation results and a table.

write: <stdout>
species sample
1     12.3
1    11.3
1    15.3
1    14.3
1    13.3
2    7.2
2    10.2

If the tree and table are too large on your console screen, use “-o” option to output them into files.

./run_tr2.py -o out -g sim4sp/4sp.nex10.RTC.tre -t sim4sp/simulated.gene.trees.nex10.4sp.tre

This command ouputs a table and a tree into “out.table.txt” and “out.tre” respectively. You can see the results using any programs.

Now, I use R + “ape” package to check a delimitation result.

library(ape)
tr <- read.tree("./out.tre")
plot(tr)
nodelabels(text=substr(tr$node.label,1,6), bg="white")

These R commands plots the guide tree and delimitation like the  below picture.

outdelimit

The numbers on the nodes indicate average differences of posterior probability scores. If they are positive the node has between-species branches. The negative values suggests that nodes are within species. “nan” indicates there are not enough samples to split/merge the node. “*” signs show the best position of delimitation. In this case, there are 4 putative species.

Doing MCMC with PyMC and making tr2 a bit Bayesian

When you do a Bayesian inference, Markov chain Monte Carlo (MCMC) sampling is a common method to obtain a posterior probability of your model or parameter. So far, I have avoided using MCMC in my programs because I like simple and rapid algorithms. But,  MCMC now looks unavoidable when I do more sophisticated Bayesian modeling.

There are many software platforms to do MCMC with reasonable learning curves, like Stan & Rstan or BUGS. These are definitely worth studying when you want to be a serious Bayesian.

But, for now, I choose a Python package called “PyMC“. This is because it is well integrated with Python language, and there is a nice introductory textbook to learn this package. (It is freely available online, and they sell hard copies too.)

After reading some chapters of the book, I tried to solve a simple problem with PyMC, which is related to phylogenetic inference and species delimitation.

The problem is estimation of λ in the model below.

P(n_1,n_2,n_3|\lambda) = \frac{N!}{n_1!n_2!n_3!}(1-\frac{2}{3}e^{-\lambda})^{n_1}(\frac{1}{3}e^{-\lambda})^{N-n_1}

Why is this related to delimitation?  Because this is a model of distribution of tree topology when you sample 3 individuals from 3 species.

If you sample 3 individuals from 3 different species and reconstruct gene trees, you are more likely to see one particular topology, the same one as species tree, than others. On the other hand, if you sample 3 individuals from 1 species, 3 types of topology are evenly observed. Finding this skew / evenness of  topology distribution is a basic idea of the tr2-delimitation.

N in the model above is the number of sampled loci, which is a sum of n1, n2 and n3, counts of three possible tree topology of 3 samples. As λ (a branch length of species tree) increases, you more frequently observe topology 1 (species tree topology). The distribution becomes more even when λ is close to zero.

With this model, a posterior distribution of λ when you observe topology counts [n1,n2,n3] is,

P(\lambda|n_1,n_2,n_3) = \frac{P(n_1,n_2,n_3|\lambda)\pi(\lambda)}{P(n_1,n_2,n_3)}

I tried to estimate this distribution by MCMC. Luckily, there is an analytical solution to posterior distribution of λ at least with a uniform prior. So, I can check if MCMC can actually work.

The code below simulates n1, n2 and n3 with a particular λ value and does MCMC sampling to estimate λ’s posterior with simulated values, then outputs 5000 samples.

import sys
import numpy
import pymc

##simulated frequencies of triplets
l = 0.598   #true lambda = 0.598
#l = 0.162  #or lambda = 0.162
prob = [1-2*numpy.exp(-l)/3, numpy.exp(-l)/3, numpy.exp(-l)/3]
count_obs = numpy.random.multinomial(100, prob)
print(l, prob, count_obs)

##Bayesian model
lambd = pymc.Uniform("lambda", lower=0.0, upper=5.0) #Uniform prior for lambda

#A pymc function translating lambda to 3 probabilities of triplets
@pymc.deterministic
def triplet_prob(lambd=lambd):
  p1 = 1-2*numpy.exp(-lambd)/3
  p2 = p3 = numpy.exp(-lambd)/3
  return [p1, p2, p3]

#observed values were associated with the multinomial model
obs = pymc.Multinomial("obs", n=sum(count_obs), p=triplet_prob, observed=True, value=count_obs)

#run MCMC
model = pymc.Model([obs, triplet_prob, lambd])
mcmc = pymc.MCMC(model)
mcmc.sample(100000, burn=50000)

with open("trace.lambda.%0.3f.txt"%l, "w") as f:
    for i in mcmc.trace("lambda")[::10]:
            f.write("%f\n"%i)

PyMC has a quite elegant and intuitive way to abstract a Bayesian modelling.

You can easily define prior distributions of parameters and develop models by combining them. The observed data are connected to the model with the “observed=True” option. Dependency of variables can be traced by “parent” and “children” attributes. Then, you can run MCMC just by calling mcmc.sample.

The distribution of posterior probability of λ when λ = 0.598 was like this. (In this case, simulated numbers are [n1,n2,n3]=[60,21,19])

plot2

The left plot is the trace of MCMC, and the right the histogram of MCMC samples. The curve superimposed on the histogram is an analytical solution. As you can see, the MCMC distribution fitted the analytical solution surprisingly well. This is great. The 95% credible interval (CI) is (0.30, 0.78). So, I am quite sure that λ is larger than zero and  topology distribution is skewed.

When λ is smaller (λ = 0.162,[n1,n2,n3]=[44, 33, 23]), estimation became more uncertain.  The 95%CI is (0.035, 0.38). A bit difficult to say n1 is more frequent.

plot1

I think this credible interval approach is OK for this simple case to just estimate λ. But, if you seriously implement species delimitation, a model comparison with reversible jump MCMC is required. It looks much more laborious to write codes for it since PyMC doesn’t have rjMCMC functions.

Regardless, I think PyMC is a good package which is easy to learn and have nice, readable documentations. It is probably a reasonable option if you want to integrate a Bayesian inference in your Python codes. Also, I think it is a handy tool to prototype a Bayesian model before writing it from scratch with other fast languages.

Multilocus delimitation with tr2: Model comparison

The “tr2” currently has two options for species delimitation. One is calculating posterior probability scores for user-specified delimitation hypotheses. Another option is finding the best delimitation under a guide tree, which specifies a hierarchical structure of species grouping.

The first option is probably useful to compare multiple species groupings and find the best one (such as comparing morphological species vs. mtDNA groups) while the second option can be used without any prior assignments and find species only from gene trees.

Let’s start with the first option. (I assume you have already set up an environment for tr2.)

You must have two input files: A gene tree file in Newick format and a tab-delimited text file which specify associations of species and individual samples.

In a tree file, one line must contain one gene tree. Trees can have missing taxa. They must be rooted. (Yes. The program is based on “rooted triplet”. So, trees must be rooted. If you do not have outgroups, midpoint rooting or RAxML’s “-I f” option often works well.)

In an association file, the first column represents the names of samples. They must be identical to the names of the tree tips. The second and so forth columns are species groups. You can write as many columns as you want. Also, you can use any codes to describe species names.

For example, a table below specifies three alternative delimitations of samples 16.4-20.4

19.4     4     B     sp5
17.4     4     B     sp5
18.4     4     B     sp4
16.4     4     A     sp4
20.4     4     A     sp4

Association files must contain all sample names which appear in the tree file.

Once you have a tree file and an association file, simply run the tr2 command as follows.

./run_tr2.py -a sp_association.txt -t genetrees.tre

Some example files are stored in the “sim4sp” folder. If you use them to test tr2, the command is like this.

./run_tr2.py -a sim4sp/sp.assoc.4sp.txt -t sim4sp/simulated.gene.trees.nex10.4sp.tre

The outputs of this command must be like below.

write: <stdout>
model score
null 51391.76
model1 5.73

The score of “model1” looks much smaller than the “null” model (, which assumes all samples are from one single species). So, you can be quite confident that model1 is a better delimitation.

How to install “tr2”

Once the environment is properly set up, the installation of tr2-delimitation is fairly simple. Download the file of tr2-delimitation from the bitbucket repository and decompress it in any place in your computer.

Just click a cloud-like icon on the left side pane and then click “Download repository” to start downloading. A decompressed folder may have a long name  like “tfujisawa-tr2-delimitation-84f248b5fa48”. Simply rename it like “tr2” if you want a handy access to the folder.

To test if the tr2 (and the environment) is correctly installed, let’s run it with a test data set.

$cd /path/to/yourfolder/tr2-delimitation #move to the installed folder
$python run_tr2.py -t sim4sp/simulated.gene.trees.nex10.4sp.tre -g sim4sp/guide.tree.4sp.tre

If you see a table of delimitation like below, the installation is successful.

write: <stdout>

species sample
1 1.1
1 2.1
1 3.1
1 4.1

...

4 17.4
4 18.4
4 19.4
4 20.4

 

In Unix-like OS’s, the program can be run without calling with python.

$./run_tr2.py -t sim4sp/simulated.gene.trees.nex10.4sp.tre -g sim4sp/guide.tree.4sp.tre

 

If you installed python with Anaconda and created a python2 environment, you need to first activate it,

$activate python2
$python run_tr2.py -t sim4sp/simulated.gene.trees.nex10.4sp.tre -g sim4sp/guide.tree.4sp.tre

 

You can integrate delimitation with species tree inference using the rooted triple consensus. To set up the triplec program, download it from its website, and create a folder named “bin” in the same folder as run_tr2.py and put the triplec in the “bin” folder.

$python run_tr2.py -t sim4sp/simulated.gene.trees.nex10.4sp.tre

If the command above returns a similar outputs, the triplec is correctly called.

How to set up an environment for “tr2”

Installing Python and check versions

Python is required to run the tr2-delimitation on your computer. In many modern operating systems, Python is pre-installed and you do not need to install it by yourself.

However, the installed version of Python matters. The tr2 is written in Python 2,  while, in some recent systems, Python 3 is the default version. This is simply because Python 2 was a standard when I started writing the codes, but Python 3 is becoming a new standard now. (I am translating the tr2 codes into Python3.) So, you first need to check the version of Python installed on your system.

Type on your console,

$python --version

If you see a message like below, you are running Python 2.

Python 2.7.6

In some decent OS’s, both Python2 and 3 are pre-installed, and you can call Python 2 by

$python2

even when your default Python is Python3.

If your environment does not have Python2 at all, visit the Python website and install it, or create a Python2 environment as explained in the next section.

Installing dependencies (scipy/numpy and Java)

Python packages called numpy/scipy is required to run tr2. These packages are for numerical calculations. Visit SciPy website and follow the instructions for your operating system to install Scipy libraries. Installing all Scipy related packages following the instruction is just fine though not all Scipy packages are required to run tr2.

An alternative, easy way to install dependencies is installing an all-in-one suite like Anaconda, which includes Python and related packages. As Anaconda allows you to run multiple versions of Python, you can install it with any version of Python.

If you choose to install Anaconda with Python3, you must run Python2 codes by creating Python2 environment.

$conda create --name python2 python=2 numpy scipy

This “conda” command creates an “environment” where the python version is 2 with numpy/scipy installed. You can switch to it by calling the “activate”  command,

$activate python2

and switch back to the default environment by the “deactivate” command.

(python2) $deactivate

 

Checking packages

You can check if the packages are properly installed by loading them on the interactive shell.

$python

(Just by typing “python”, “interactive shell” starts. You can quit this mode by pressing Ctl+d.)

>>>import numpy, scipy

If it does not return errors, packages are ready to use.

Java is also required to run the consensus tree building program, triplec. Almost all modern operating systems have Java as pre-installed software.

Bootstrap a very large data file

A couple of years ago, I wrote a post about a method for sampling from a very large data file. The “reservoir sampling” can efficiently sample a fixed number of samples from a huge data file while you read through it only once. Now, the natural step forward is sampling with replacement.

I sometimes need to resample large data, like a huge alignment or SNP table,  with replacement for statistical analysis, that is, I need to bootstrap them.  Is there a way to efficiently bootstrap a huge file? This looks a lot more difficult than sampling a fixed number of samples without replacement because all items in a file stream must have chance to be sampled several times.

I did some google searches and found that this issue has been studied in various fields from traditional biostatistics to cutting-edge data science. As I expected, the methods for efficient bootstraping are more complicated than the reservoir sampling.

Let’s start with a simple approach. The following code does bootstrapping of lines in a file. You need to load all lines on the memory to resample lines, and bootstrapping will become memory-demanding when you process a data file with millions of lines.

import sys
import numpy

lines = [l for l in open(sys.argv[1], "r")]

for i in numpy.random.randint(0, len(lines), size=len(lines)):
	sys.stdout.write(lines[i])

 

The memory usage will be slightly reduced if you use a multinomial distribution instead of simply sampling lines. This is justified because the number of occurrence of one entry in the bootstrap procedure follows multinomial distribution. This idea appears to have been proposed even in 1980s.

import sys
import numpy

size = sum((1 for line in open(sys.argv[1], "r")))
cnt = numpy.random.multinomial(size, [1.0/size]*size)

for line, c in zip(open(sys.argv[1], "r"), cnt):
        for i in range(c):
                sys.stdout.write(line)

 

A clear problem of the code above is that it requires to read a file twice; it counts the number of lines first then bootstraps them following a multinomial distribution. Because it needs to read the file twice, it does not work on stream inputs.

Is there a way to bootstarp samples while you read a file only once? According to some literatures, an algorithm called “Poisson bootstrap”  does this job. Instead of sampling with multinomial distribution of size=N (N is the total number of samples), you can approximate the number of occurrence of an item by Poisson distribution of lambda=1.  This means that the bootstrap procedure is approximated by sampling each item n times while reading lines and drawing n from a Poisson distribution.

import sys
import numpy

for line in open(sys.argv[1], "r"):
        cnt = numpy.random.poisson(lam=1)

        if cnt == 0:
                continue
        else:
                for i in range(cnt):
                        sys.stdout.write(line)

This simple, but powerful code (it’s even simpler than the reservoir sampling code) does the approximated bootstrap while reading a file only once.

I could not fully understand the mathematical justifications of this procedure, but if you are interested, there are several papers discussing about the properties of Poisson boot strap. For example, Hanely&MacBibbon(2006) or Chamandy et al. (2012). (I think the first paper is more approachable than the second one.)

A disadvantage of this code is you can not sample a fixed number of samples. Bootstrap replicates have different numbers of samples, which probably introduces extra variances to the bootstrap estimates. However, according to  the papers above, the variation of sample size will be negligible when the sample size is huge.

I just generated bootstrap replicates of 100 and 10000 samples with three algorithms above, and checked how many times each sample is sampled. The three algorithms appears to be identical when the sample size is large (N=10000). (Dashed line is the expected number of count with Poisson distribution.)

boot.test.10k

while the variation is not quite negligible when sample size is 100.

boot.test.100.png

This Poisson bootstrap procedure appears to be particularly useful when you need to resample very large data. It gets closer to the true bootstrap as you have a larger number of samples.