Category Archives: Python

tr2-delimitation in python3

I just announce that the tr2-delimitation in Python3 is now available from the following repository.

It should return results exactly the same as the Python2 version (I verified it with the data set of the Fujisawa et al. (2016) paper).

I will maintain the old Python2 version, but the new functions will be added to this Python3 version.

Doing MCMC with PyMC and making tr2 a bit Bayesian

When you do a Bayesian inference, Markov chain Monte Carlo (MCMC) sampling is a common method to obtain a posterior probability of your model or parameter. So far, I have avoided using MCMC in my programs because I like simple and rapid algorithms. But,  MCMC now looks unavoidable when I do more sophisticated Bayesian modeling.

There are many software platforms to do MCMC with reasonable learning curves, like Stan & Rstan or BUGS. These are definitely worth studying when you want to be a serious Bayesian.

But, for now, I choose a Python package called “PyMC“. This is because it is well integrated with Python language, and there is a nice introductory textbook to learn this package. (It is freely available online, and they sell hard copies too.)

After reading some chapters of the book, I tried to solve a simple problem with PyMC, which is related to phylogenetic inference and species delimitation.

The problem is estimation of λ in the model below.

P(n_1,n_2,n_3|\lambda) = \frac{N!}{n_1!n_2!n_3!}(1-\frac{2}{3}e^{-\lambda})^{n_1}(\frac{1}{3}e^{-\lambda})^{N-n_1}

Why is this related to delimitation?  Because this is a model of distribution of tree topology when you sample 3 individuals from 3 species.

If you sample 3 individuals from 3 different species and reconstruct gene trees, you are more likely to see one particular topology, the same one as species tree, than others. On the other hand, if you sample 3 individuals from 1 species, 3 types of topology are evenly observed. Finding this skew / evenness of  topology distribution is a basic idea of the tr2-delimitation.

N in the model above is the number of sampled loci, which is a sum of n1, n2 and n3, counts of three possible tree topology of 3 samples. As λ (a branch length of species tree) increases, you more frequently observe topology 1 (species tree topology). The distribution becomes more even when λ is close to zero.

With this model, a posterior distribution of λ when you observe topology counts [n1,n2,n3] is,

P(\lambda|n_1,n_2,n_3) = \frac{P(n_1,n_2,n_3|\lambda)\pi(\lambda)}{P(n_1,n_2,n_3)}

I tried to estimate this distribution by MCMC. Luckily, there is an analytical solution to posterior distribution of λ at least with a uniform prior. So, I can check if MCMC can actually work.

The code below simulates n1, n2 and n3 with a particular λ value and does MCMC sampling to estimate λ’s posterior with simulated values, then outputs 5000 samples.

import sys
import numpy
import pymc

##simulated frequencies of triplets
l = 0.598   #true lambda = 0.598
#l = 0.162  #or lambda = 0.162
prob = [1-2*numpy.exp(-l)/3, numpy.exp(-l)/3, numpy.exp(-l)/3]
count_obs = numpy.random.multinomial(100, prob)
print(l, prob, count_obs)

##Bayesian model
lambd = pymc.Uniform("lambda", lower=0.0, upper=5.0) #Uniform prior for lambda

#A pymc function translating lambda to 3 probabilities of triplets
def triplet_prob(lambd=lambd):
  p1 = 1-2*numpy.exp(-lambd)/3
  p2 = p3 = numpy.exp(-lambd)/3
  return [p1, p2, p3]

#observed values were associated with the multinomial model
obs = pymc.Multinomial("obs", n=sum(count_obs), p=triplet_prob, observed=True, value=count_obs)

#run MCMC
model = pymc.Model([obs, triplet_prob, lambd])
mcmc = pymc.MCMC(model)
mcmc.sample(100000, burn=50000)

with open("trace.lambda.%0.3f.txt"%l, "w") as f:
    for i in mcmc.trace("lambda")[::10]:

PyMC has a quite elegant and intuitive way to abstract a Bayesian modelling.

You can easily define prior distributions of parameters and develop models by combining them. The observed data are connected to the model with the “observed=True” option. Dependency of variables can be traced by “parent” and “children” attributes. Then, you can run MCMC just by calling mcmc.sample.

The distribution of posterior probability of λ when λ = 0.598 was like this. (In this case, simulated numbers are [n1,n2,n3]=[60,21,19])


The left plot is the trace of MCMC, and the right the histogram of MCMC samples. The curve superimposed on the histogram is an analytical solution. As you can see, the MCMC distribution fitted the analytical solution surprisingly well. This is great. The 95% credible interval (CI) is (0.30, 0.78). So, I am quite sure that λ is larger than zero and  topology distribution is skewed.

When λ is smaller (λ = 0.162,[n1,n2,n3]=[44, 33, 23]), estimation became more uncertain.  The 95%CI is (0.035, 0.38). A bit difficult to say n1 is more frequent.


I think this credible interval approach is OK for this simple case to just estimate λ. But, if you seriously implement species delimitation, a model comparison with reversible jump MCMC is required. It looks much more laborious to write codes for it since PyMC doesn’t have rjMCMC functions.

Regardless, I think PyMC is a good package which is easy to learn and have nice, readable documentations. It is probably a reasonable option if you want to integrate a Bayesian inference in your Python codes. Also, I think it is a handy tool to prototype a Bayesian model before writing it from scratch with other fast languages.

Bootstrap a very large data file

A couple of years ago, I wrote a post about a method for sampling from a very large data file. The “reservoir sampling” can efficiently sample a fixed number of samples from a huge data file while you read through it only once. Now, the natural step forward is sampling with replacement.

I sometimes need to resample large data, like a huge alignment or SNP table,  with replacement for statistical analysis, that is, I need to bootstrap them.  Is there a way to efficiently bootstrap a huge file? This looks a lot more difficult than sampling a fixed number of samples without replacement because all items in a file stream must have chance to be sampled several times.

I did some google searches and found that this issue has been studied in various fields from traditional biostatistics to cutting-edge data science. As I expected, the methods for efficient bootstraping are more complicated than the reservoir sampling.

Let’s start with a simple approach. The following code does bootstrapping of lines in a file. You need to load all lines on the memory to resample lines, and bootstrapping will become memory-demanding when you process a data file with millions of lines.

import sys
import numpy

lines = [l for l in open(sys.argv[1], "r")]

for i in numpy.random.randint(0, len(lines), size=len(lines)):


The memory usage will be slightly reduced if you use a multinomial distribution instead of simply sampling lines. This is justified because the number of occurrence of one entry in the bootstrap procedure follows multinomial distribution. This idea appears to have been proposed even in 1980s.

import sys
import numpy

size = sum((1 for line in open(sys.argv[1], "r")))
cnt = numpy.random.multinomial(size, [1.0/size]*size)

for line, c in zip(open(sys.argv[1], "r"), cnt):
        for i in range(c):


A clear problem of the code above is that it requires to read a file twice; it counts the number of lines first then bootstraps them following a multinomial distribution. Because it needs to read the file twice, it does not work on stream inputs.

Is there a way to bootstarp samples while you read a file only once? According to some literatures, an algorithm called “Poisson bootstrap”  does this job. Instead of sampling with multinomial distribution of size=N (N is the total number of samples), you can approximate the number of occurrence of an item by Poisson distribution of lambda=1.  This means that the bootstrap procedure is approximated by sampling each item n times while reading lines and drawing n from a Poisson distribution.

import sys
import numpy

for line in open(sys.argv[1], "r"):
        cnt = numpy.random.poisson(lam=1)

        if cnt == 0:
                for i in range(cnt):

This simple, but powerful code (it’s even simpler than the reservoir sampling code) does the approximated bootstrap while reading a file only once.

I could not fully understand the mathematical justifications of this procedure, but if you are interested, there are several papers discussing about the properties of Poisson boot strap. For example, Hanely&MacBibbon(2006) or Chamandy et al. (2012). (I think the first paper is more approachable than the second one.)

A disadvantage of this code is you can not sample a fixed number of samples. Bootstrap replicates have different numbers of samples, which probably introduces extra variances to the bootstrap estimates. However, according to  the papers above, the variation of sample size will be negligible when the sample size is huge.

I just generated bootstrap replicates of 100 and 10000 samples with three algorithms above, and checked how many times each sample is sampled. The three algorithms appears to be identical when the sample size is large (N=10000). (Dashed line is the expected number of count with Poisson distribution.)


while the variation is not quite negligible when sample size is 100.


This Poisson bootstrap procedure appears to be particularly useful when you need to resample very large data. It gets closer to the true bootstrap as you have a larger number of samples.


New multilocus delimitation code

I just announce that the codes for the new multilocus delimitation method are available at my Bitbucket repository.

The new program is called “tr2”, which stands for “trinomial distribution of triplets” model. It delimits species from multilocus gene trees by measuring congruence of gene trees between species and incongruence within species. It calculates the posterior probabilities of two types of trinomial distribution models and test within/between-species boundaries. A closely related methods are MP-EST, the triplet-based phylogeny reconstruction method by Liu et al, and the Bayesian species delimitation method BP&P by Yang&Rannala.

The tr2 implements a rapid algorithm to find the best delimitation under a given guide tree and an optimality score (AIC or posterior probability), which can run with more than 100 tip guide trees.

If you are interested in it, please visit the repository below and try it.

I assume that users have knowledge of Python and related programming tools like Mercurial. The program is not well-decumented now. I will write down detailed instructions in following blog posts.

Extracting SNPs from a fasta alignment

I have been working on transcriptome phylogenomics data. The data were originally generated to draw a phylogeny, but we found that we can use them to infer finer-scale demographic histories since they contain a reasonably large number of SNPs within species.

So, I wrote some codes to extract SNPs from a sequence alignment I have constructed.

When you just want to extract variable sites from an alignment, the code must be very simple. You just need to find columns with variable sites by counting bases and keep them when you find two or more unique bases.

import sys

from Bio import AlignIO
from Bio.Seq import Seq
from Bio.Align import MultipleSeqAlignment
from Bio.SeqRecord import SeqRecord

s_bases = ('A', 'C', 'G', 'T')
d_bases = ('R', 'Y', 'S', 'W', 'K', 'M', 'N')
bases = s_bases + d_bases

alig =[1], 'fasta')

cutoff = 0.5

snp = [SeqRecord(Seq('', s.seq.alphabet),, description=s.description) for s in alig]
snp = MultipleSeqAlignment(snp)

for i in range(alig.get_alignment_length()):
	col = alig[:,i]
	col = col.upper()
	col = col.replace('-', 'N')	

	#print col
	base_count = {b:col.count(b) for b in bases}
	del base_count['N']

	if sum(base_count.values()) > len(col)*cutoff:

		genotypes = sorted(base_count.items(), key=lambda x:-x[1])
		genotypes = [b for b in genotypes if b[1] > 0]

		if len(genotypes) > 1:
			snp = snp + alig[:,i:i+1] 

print snp.format('phylip')

This code keeps sites which have two or more unique bases and where missing percentage is smaller than 50. The output looks like below.

$python test.fasta
 24 4
m9467      TTAG
m9092      ----
m12001     TTAG
m8773      TTMG
m9897      TTAG
m8294      CCAA
m8137      TTAG
m6561      TTAG
m8867      TTAG
m10600     CCAA
m4957      TTAG
m10093     TTMG
m12429     TTAG

The idea is very simple, but the behaviour of MultipleSeqAlignment object of Biopython is sometimes tricky and took time to understand it. For example, you can not concatenate columns like this,

snp = snp + alig[:,i]

A valid concatenation of a single column looks like this,

snp = snp + alig[:,i:i+1]

, simply because alig[:,i] is a String but alig[:,i:i+1] is a MultipleSeqAlignment.

The above code is actually not so useful because programs for population genetic inference often use information of genotypes rather than bases. When you need extract SNP genotypes from a sequence alignment, code becomes a bit more complicated.

import sys
import os

from Bio import AlignIO

s_bases = ('A', 'C', 'G', 'T')
d_bases = ('R', 'Y', 'S', 'W', 'K', 'M', 'N')
bases = s_bases + d_bases

deg_map = {'A':('A','A'),'C':('C','C'), 'G':('G','G'), 'T':('T','T'), 'R':('A','G'), 'Y':('C','T'), 'S':('C','G'), 'W':('A','T'), 'K':('G','T'), 'M':('A','C')}

alig =[1], 'fasta')

inname = os.path.basename(sys.argv[1])

colnames = [ for sq in alig]
colnames = ['file', 'pos'] + colnames
print '\t'.join(colnames)

cutoff = 0.5
for i in range(alig.get_alignment_length()):
	col = alig[:,i]
	col = col.upper()
	col = col.replace('-', 'N')	

	base_count = {b:col.count(b) for b in bases}
	del base_count['N']

	if sum(base_count.values()) > len(col)*cutoff:

		genotypes = sorted(base_count.items(), key=lambda x:-x[1])
		genotypes = [b for b in genotypes if b[1] > 0]

		ref = genotypes[0][0]

		if ref in d_bases:
			ref = deg_map[ref][0]

		allel = [deg_map[b[0]] for b in genotypes]
		allel = reduce(lambda x,y: x+y, allel)
		allel = list(set(allel))

		j = allel.index(ref)
		allel[0], allel[j] = allel[j], allel[0]

		allel_code = {b:k for k, b in enumerate(allel)}

		if len(allel) > 1:
			snps = [inname, '%d'%(i+1), ref, ','.join(allel[1:])]
			for s in col:
				if not s == 'N':
					code = [allel_code[a] for a in deg_map[s]]
					code = sorted(code)
					snps.append('%d/%d' % (code[0], code[1]))

			print '\t'.join(snps)

The output is …

$python test.fasta
file	pos	m9467	m9092	m12001	m8773	m9897	m8294	m8137	m6561	m8867	m10600	m4957	m10093	m12429	m9744	m9871	m8953	m8504	m10303	m9328	m9754	m10452	m9391	m8000	m10040
test.fa	57	T	C	0/0	-/-	0/0	0/0	0/0	1/1	0/0	0/0	0/0	1/1	0/0	0/0	0/0	1/1	1/1	0/0	0/0	0/0	1/1	1/1	1/1	0/0	1/1	0/0
test.fa	110	T	C	0/0	-/-	0/0	0/0	0/0	1/1	0/0	0/0	0/0	1/1	0/0	0/0	0/0	1/1	1/1	0/0	0/0	0/0	1/1	1/1	1/1	0/0	1/1	0/0
test.fa	114	A	C	0/0	-/-	0/0	0/1	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/1	0/0	0/0	0/0	0/1	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0
test.fa	119	G	A	0/0	-/-	0/0	0/0	0/0	1/1	0/0	0/0	0/0	1/1	0/0	0/0	0/0	1/1	1/1	0/0	0/0	0/0	1/1	1/1	1/1	0/0	1/1	0/0

Looks neat. The code looks a bit redundant, but it works.

I assume that heterozygous sites in an alignment are represented by the IUPAC ambiguous code (say, Y = C/T). The code first finds variable sites like the previous code, then re-codes IUPAC bases into genotypes like M – A/C – 0/1.   Reference base is determined by choosing the most frequent unambiguous base. The genotypes are now represented by the “0/1” style, where 0/0 is a homozygote of the reference allele and so on. This type of code must be more useful as modifying the output format enables you to write down any types of SNP input files for downstream population genetic analyses.