Category Archives: Python

tr2-delimitation in python3

I just announce that the tr2-delimitation in Python3 is now available from the following repository.

It should return results exactly the same as the Python2 version (I verified it with the data set of the Fujisawa et al. (2016) paper).

I will maintain the old Python2 version, but the new functions will be added to this Python3 version.


Doing MCMC with PyMC and making tr2 a bit Bayesian

When you do a Bayesian inference, Markov chain Monte Carlo (MCMC) sampling is a common method to obtain a posterior probability of your model or parameter. So far, I have avoided using MCMC in my programs because I like simple and rapid algorithms. But,  MCMC now looks unavoidable when I do more sophisticated Bayesian modeling.

There are many software platforms to do MCMC with reasonable learning curves, like Stan & Rstan or BUGS. These are definitely worth studying when you want to be a serious Bayesian.

But, for now, I choose a Python package called “PyMC“. This is because it is well integrated with Python language, and there is a nice introductory textbook to learn this package. (It is freely available online, and they sell hard copies too.)

After reading some chapters of the book, I tried to solve a simple problem with PyMC, which is related to phylogenetic inference and species delimitation.

The problem is estimation of λ in the model below.

P(n_1,n_2,n_3|\lambda) = \frac{N!}{n_1!n_2!n_3!}(1-\frac{2}{3}e^{-\lambda})^{n_1}(\frac{1}{3}e^{-\lambda})^{N-n_1}

Why is this related to delimitation?  Because this is a model of distribution of tree topology when you sample 3 individuals from 3 species.

If you sample 3 individuals from 3 different species and reconstruct gene trees, you are more likely to see one particular topology, the same one as species tree, than others. On the other hand, if you sample 3 individuals from 1 species, 3 types of topology are evenly observed. Finding this skew / evenness of  topology distribution is a basic idea of the tr2-delimitation.

N in the model above is the number of sampled loci, which is a sum of n1, n2 and n3, counts of three possible tree topology of 3 samples. As λ (a branch length of species tree) increases, you more frequently observe topology 1 (species tree topology). The distribution becomes more even when λ is close to zero.

With this model, a posterior distribution of λ when you observe topology counts [n1,n2,n3] is,

P(\lambda|n_1,n_2,n_3) = \frac{P(n_1,n_2,n_3|\lambda)\pi(\lambda)}{P(n_1,n_2,n_3)}

I tried to estimate this distribution by MCMC. Luckily, there is an analytical solution to posterior distribution of λ at least with a uniform prior. So, I can check if MCMC can actually work.

The code below simulates n1, n2 and n3 with a particular λ value and does MCMC sampling to estimate λ’s posterior with simulated values, then outputs 5000 samples.

import sys
import numpy
import pymc

##simulated frequencies of triplets
l = 0.598   #true lambda = 0.598
#l = 0.162  #or lambda = 0.162
prob = [1-2*numpy.exp(-l)/3, numpy.exp(-l)/3, numpy.exp(-l)/3]
count_obs = numpy.random.multinomial(100, prob)
print(l, prob, count_obs)

##Bayesian model
lambd = pymc.Uniform("lambda", lower=0.0, upper=5.0) #Uniform prior for lambda

#A pymc function translating lambda to 3 probabilities of triplets
def triplet_prob(lambd=lambd):
  p1 = 1-2*numpy.exp(-lambd)/3
  p2 = p3 = numpy.exp(-lambd)/3
  return [p1, p2, p3]

#observed values were associated with the multinomial model
obs = pymc.Multinomial("obs", n=sum(count_obs), p=triplet_prob, observed=True, value=count_obs)

#run MCMC
model = pymc.Model([obs, triplet_prob, lambd])
mcmc = pymc.MCMC(model)
mcmc.sample(100000, burn=50000)

with open("trace.lambda.%0.3f.txt"%l, "w") as f:
    for i in mcmc.trace("lambda")[::10]:

PyMC has a quite elegant and intuitive way to abstract a Bayesian modelling.

You can easily define prior distributions of parameters and develop models by combining them. The observed data are connected to the model with the “observed=True” option. Dependency of variables can be traced by “parent” and “children” attributes. Then, you can run MCMC just by calling mcmc.sample.

The distribution of posterior probability of λ when λ = 0.598 was like this. (In this case, simulated numbers are [n1,n2,n3]=[60,21,19])


The left plot is the trace of MCMC, and the right the histogram of MCMC samples. The curve superimposed on the histogram is an analytical solution. As you can see, the MCMC distribution fitted the analytical solution surprisingly well. This is great. The 95% credible interval (CI) is (0.30, 0.78). So, I am quite sure that λ is larger than zero and  topology distribution is skewed.

When λ is smaller (λ = 0.162,[n1,n2,n3]=[44, 33, 23]), estimation became more uncertain.  The 95%CI is (0.035, 0.38). A bit difficult to say n1 is more frequent.


I think this credible interval approach is OK for this simple case to just estimate λ. But, if you seriously implement species delimitation, a model comparison with reversible jump MCMC is required. It looks much more laborious to write codes for it since PyMC doesn’t have rjMCMC functions.

Regardless, I think PyMC is a good package which is easy to learn and have nice, readable documentations. It is probably a reasonable option if you want to integrate a Bayesian inference in your Python codes. Also, I think it is a handy tool to prototype a Bayesian model before writing it from scratch with other fast languages.

Bootstrap a very large data file

A couple of years ago, I wrote a post about a method for sampling from a very large data file. The “reservoir sampling” can efficiently sample a fixed number of samples from a huge data file while you read through it only once. Now, the natural step forward is sampling with replacement.

I sometimes need to resample large data, like a huge alignment or SNP table,  with replacement for statistical analysis, that is, I need to bootstrap them.  Is there a way to efficiently bootstrap a huge file? This looks a lot more difficult than sampling a fixed number of samples without replacement because all items in a file stream must have chance to be sampled several times.

I did some google searches and found that this issue has been studied in various fields from traditional biostatistics to cutting-edge data science. As I expected, the methods for efficient bootstraping are more complicated than the reservoir sampling.

Let’s start with a simple approach. The following code does bootstrapping of lines in a file. You need to load all lines on the memory to resample lines, and bootstrapping will become memory-demanding when you process a data file with millions of lines.

import sys
import numpy

lines = [l for l in open(sys.argv[1], "r")]

for i in numpy.random.randint(0, len(lines), size=len(lines)):


The memory usage will be slightly reduced if you use a multinomial distribution instead of simply sampling lines. This is justified because the number of occurrence of one entry in the bootstrap procedure follows multinomial distribution. This idea appears to have been proposed even in 1980s.

import sys
import numpy

size = sum((1 for line in open(sys.argv[1], "r")))
cnt = numpy.random.multinomial(size, [1.0/size]*size)

for line, c in zip(open(sys.argv[1], "r"), cnt):
        for i in range(c):


A clear problem of the code above is that it requires to read a file twice; it counts the number of lines first then bootstraps them following a multinomial distribution. Because it needs to read the file twice, it does not work on stream inputs.

Is there a way to bootstarp samples while you read a file only once? According to some literatures, an algorithm called “Poisson bootstrap”  does this job. Instead of sampling with multinomial distribution of size=N (N is the total number of samples), you can approximate the number of occurrence of an item by Poisson distribution of lambda=1.  This means that the bootstrap procedure is approximated by sampling each item n times while reading lines and drawing n from a Poisson distribution.

import sys
import numpy

for line in open(sys.argv[1], "r"):
        cnt = numpy.random.poisson(lam=1)

        if cnt == 0:
                for i in range(cnt):

This simple, but powerful code (it’s even simpler than the reservoir sampling code) does the approximated bootstrap while reading a file only once.

I could not fully understand the mathematical justifications of this procedure, but if you are interested, there are several papers discussing about the properties of Poisson boot strap. For example, Hanely&MacBibbon(2006) or Chamandy et al. (2012). (I think the first paper is more approachable than the second one.)

A disadvantage of this code is you can not sample a fixed number of samples. Bootstrap replicates have different numbers of samples, which probably introduces extra variances to the bootstrap estimates. However, according to  the papers above, the variation of sample size will be negligible when the sample size is huge.

I just generated bootstrap replicates of 100 and 10000 samples with three algorithms above, and checked how many times each sample is sampled. The three algorithms appears to be identical when the sample size is large (N=10000). (Dashed line is the expected number of count with Poisson distribution.)


while the variation is not quite negligible when sample size is 100.


This Poisson bootstrap procedure appears to be particularly useful when you need to resample very large data. It gets closer to the true bootstrap as you have a larger number of samples.


New multilocus delimitation code

I just announce that the codes for the new multilocus delimitation method are available at my Bitbucket repository.

The new program is called “tr2”, which stands for “trinomial distribution of triplets” model. It delimits species from multilocus gene trees by measuring congruence of gene trees between species and incongruence within species. It calculates the posterior probabilities of two types of trinomial distribution models and test within/between-species boundaries. A closely related methods are MP-EST, the triplet-based phylogeny reconstruction method by Liu et al, and the Bayesian species delimitation method BP&P by Yang&Rannala.

The tr2 implements a rapid algorithm to find the best delimitation under a given guide tree and an optimality score (AIC or posterior probability), which can run with more than 100 tip guide trees.

If you are interested in it, please visit the repository below and try it.

I assume that users have knowledge of Python and related programming tools like Mercurial. The program is not well-decumented now. I will write down detailed instructions in following blog posts.

Extracting SNPs from a fasta alignment

I have been working on transcriptome phylogenomics data. The data were originally generated to draw a phylogeny, but we found that we can use them to infer finer-scale demographic histories since they contain a reasonably large number of SNPs within species.

So, I wrote some codes to extract SNPs from a sequence alignment I have constructed.

When you just want to extract variable sites from an alignment, the code must be very simple. You just need to find columns with variable sites by counting bases and keep them when you find two or more unique bases.

import sys

from Bio import AlignIO
from Bio.Seq import Seq
from Bio.Align import MultipleSeqAlignment
from Bio.SeqRecord import SeqRecord

s_bases = ('A', 'C', 'G', 'T')
d_bases = ('R', 'Y', 'S', 'W', 'K', 'M', 'N')
bases = s_bases + d_bases

alig =[1], 'fasta')

cutoff = 0.5

snp = [SeqRecord(Seq('', s.seq.alphabet),, description=s.description) for s in alig]
snp = MultipleSeqAlignment(snp)

for i in range(alig.get_alignment_length()):
	col = alig[:,i]
	col = col.upper()
	col = col.replace('-', 'N')	

	#print col
	base_count = {b:col.count(b) for b in bases}
	del base_count['N']

	if sum(base_count.values()) > len(col)*cutoff:

		genotypes = sorted(base_count.items(), key=lambda x:-x[1])
		genotypes = [b for b in genotypes if b[1] > 0]

		if len(genotypes) > 1:
			snp = snp + alig[:,i:i+1] 

print snp.format('phylip')

This code keeps sites which have two or more unique bases and where missing percentage is smaller than 50. The output looks like below.

$python test.fasta
 24 4
m9467      TTAG
m9092      ----
m12001     TTAG
m8773      TTMG
m9897      TTAG
m8294      CCAA
m8137      TTAG
m6561      TTAG
m8867      TTAG
m10600     CCAA
m4957      TTAG
m10093     TTMG
m12429     TTAG

The idea is very simple, but the behaviour of MultipleSeqAlignment object of Biopython is sometimes tricky and took time to understand it. For example, you can not concatenate columns like this,

snp = snp + alig[:,i]

A valid concatenation of a single column looks like this,

snp = snp + alig[:,i:i+1]

, simply because alig[:,i] is a String but alig[:,i:i+1] is a MultipleSeqAlignment.

The above code is actually not so useful because programs for population genetic inference often use information of genotypes rather than bases. When you need extract SNP genotypes from a sequence alignment, code becomes a bit more complicated.

import sys
import os

from Bio import AlignIO

s_bases = ('A', 'C', 'G', 'T')
d_bases = ('R', 'Y', 'S', 'W', 'K', 'M', 'N')
bases = s_bases + d_bases

deg_map = {'A':('A','A'),'C':('C','C'), 'G':('G','G'), 'T':('T','T'), 'R':('A','G'), 'Y':('C','T'), 'S':('C','G'), 'W':('A','T'), 'K':('G','T'), 'M':('A','C')}

alig =[1], 'fasta')

inname = os.path.basename(sys.argv[1])

colnames = [ for sq in alig]
colnames = ['file', 'pos'] + colnames
print '\t'.join(colnames)

cutoff = 0.5
for i in range(alig.get_alignment_length()):
	col = alig[:,i]
	col = col.upper()
	col = col.replace('-', 'N')	

	base_count = {b:col.count(b) for b in bases}
	del base_count['N']

	if sum(base_count.values()) > len(col)*cutoff:

		genotypes = sorted(base_count.items(), key=lambda x:-x[1])
		genotypes = [b for b in genotypes if b[1] > 0]

		ref = genotypes[0][0]

		if ref in d_bases:
			ref = deg_map[ref][0]

		allel = [deg_map[b[0]] for b in genotypes]
		allel = reduce(lambda x,y: x+y, allel)
		allel = list(set(allel))

		j = allel.index(ref)
		allel[0], allel[j] = allel[j], allel[0]

		allel_code = {b:k for k, b in enumerate(allel)}

		if len(allel) > 1:
			snps = [inname, '%d'%(i+1), ref, ','.join(allel[1:])]
			for s in col:
				if not s == 'N':
					code = [allel_code[a] for a in deg_map[s]]
					code = sorted(code)
					snps.append('%d/%d' % (code[0], code[1]))

			print '\t'.join(snps)

The output is …

$python test.fasta
file	pos	m9467	m9092	m12001	m8773	m9897	m8294	m8137	m6561	m8867	m10600	m4957	m10093	m12429	m9744	m9871	m8953	m8504	m10303	m9328	m9754	m10452	m9391	m8000	m10040
test.fa	57	T	C	0/0	-/-	0/0	0/0	0/0	1/1	0/0	0/0	0/0	1/1	0/0	0/0	0/0	1/1	1/1	0/0	0/0	0/0	1/1	1/1	1/1	0/0	1/1	0/0
test.fa	110	T	C	0/0	-/-	0/0	0/0	0/0	1/1	0/0	0/0	0/0	1/1	0/0	0/0	0/0	1/1	1/1	0/0	0/0	0/0	1/1	1/1	1/1	0/0	1/1	0/0
test.fa	114	A	C	0/0	-/-	0/0	0/1	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/1	0/0	0/0	0/0	0/1	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0
test.fa	119	G	A	0/0	-/-	0/0	0/0	0/0	1/1	0/0	0/0	0/0	1/1	0/0	0/0	0/0	1/1	1/1	0/0	0/0	0/0	1/1	1/1	1/1	0/0	1/1	0/0

Looks neat. The code looks a bit redundant, but it works.

I assume that heterozygous sites in an alignment are represented by the IUPAC ambiguous code (say, Y = C/T). The code first finds variable sites like the previous code, then re-codes IUPAC bases into genotypes like M – A/C – 0/1.   Reference base is determined by choosing the most frequent unambiguous base. The genotypes are now represented by the “0/1” style, where 0/0 is a homozygote of the reference allele and so on. This type of code must be more useful as modifying the output format enables you to write down any types of SNP input files for downstream population genetic analyses.


from R to Python (2): libraries for statistical analysis

A couple of years ago, I wrote a post about differences between R and Python. It mainly reported differences on grammars of these languages, which I found interesting when I translated my codes.

However, I found the post is not so informative for readers looking for a primer for “Python as an environment for statistical analysis” (, which can be alternative to R). So, I will write down what I recently learned on statistical analysis using Python.

If you want to develop an R-like environment for interactive statistical analysis using Python, first you need to install following libraries:

The pandas library provides functions and classes for interactive data analysis, for example, R-like data frame, and the statsmodels for statistical modelling like GLM. The matplotlib and numpy/scipy are required for better plotting, numerical calculations and vector handling. They are mainly called from pandas and statsmodels, but you can also directly call their functions.

Installation of these libraries is relatively easy if you use Unix-like operating system, but may not so if you use Windows or Mac. However, there are plenty of documentations on how to install them on Mac or Win like this.

In addition to the libraries, you’d better install IPython, a better interactive shell for Python. IPython provides you with a command-line completion much better than the default Python shell. Without a good command completion, interactive analysis would become very stressful.

Armed with IPython and these libraries, you can do almost the same things as you can with R.

import pandas

iris = pandas.read_csv("./iris.txt", sep="\t")


You will read a tab-delimited file by read_csv, and show first few lines of the table on console. The pandas.read_csv reads a tab-delimited table and returns a DataFrame object. As its name indicates, the DataFrame supports similar functions to the data.frame in R. The “head” is a method of the dataframe which shows part of its rows. You will see results like below when you execute the code in IPython.

In [1]: import pandas

In [2]: iris = pandas.read_csv("./iris.txt", sep="\t")

In [3]: iris.head()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0        5.1         3.5          1.4         0.2  setosa
1        4.9         3.0          1.4         0.2  setosa
2        4.7         3.2          1.3         0.2  setosa
3        4.6         3.1          1.5         0.2  setosa
4        5.0         3.6          1.4         0.2  setosa

[5 rows x 5 columns]

You can access columns of data frame with their names as you can with R. Also, slicing with indices, accessing entries with conditions are supported.

#select a column named "Species"

#select first 5 elements of column "Petal.Length"

#select rows with which "Species" is equal to "virginica", then show first 5 rows

Commands above will return the following outputs.

In [4]: iris["Species"]
0 setosa
1 setosa
2 setosa
3 setosa
4 setosa

In [5]: iris["Petal.Length"][0:5]
0 1.4
1 1.4
2 1.3
3 1.5
4 1.4
Name: Petal.Length, dtype: float64

In [6]: iris[iris["Species"]=="virginica"].head()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
100      6.3         3.3          6.0         2.5 virginica
101      5.8         2.7          5.1         1.9 virginica
102      7.1         3.0          5.9         2.1 virginica
103      6.3         2.9          5.6         1.8 virginica
104      6.5         3.0          5.8         2.2 virginica

[5 rows x 5 columns]

Basic statistics are supported by methods of dataframe.

#count the number of occurrance

#mean of each columns

Counts of numbers of species and means of columns will be returned.

In [7]: iris["Species"].value_counts()
setosa        50
versicolor    50
virginica     50
dtype: int64

In [8]: iris.mean()
Sepal.Length    5.843333
Sepal.Width     3.057333
Petal.Length    3.758000
Petal.Width     1.199333
dtype: float64

Also, there are several types of plotting methods.

iris.boxplot(by="Species", column="Petal.Width")

#scatter plot matrix

The first line plots a boxplot of “Petal.Width” by “Species”. The second line plots a scatter plot matrix. Make sure that you start IPython with the “–matplotlib” option to enable plotting.

figure_1 figure_2

If the Python codes above are translated into R, they look like the following R code.

iris <- read.table("./iris.txt", sep="\t", header=T)




boxplot(Petal.Width ~ Species, data=iris)

plotinr1 plotinr2

The codes in Python and R are very similar. Functions of pandas are almost like object-oriented version of R functions. You can find very basic functions of pandas in their documents. It appears IPython+pandas have as many functions for basic data manipulation and visualization as R has.  So, how about statistical analysis like regression modelling? I will write about in the next post.

Sampling from a huge data file

The size of data we are handling is growing rapidly. We now have to handle  millions of reads from the Illumina sequencers or millions of trees samples from MCMC runs.

Probably, you have experienced difficulty of sampling a small number of elements from a very large data set. For example, you may have wanted to sample one hundred sequence reads from a huge Illumina fastq file. You cannot load all reads on the memory. So, normal functions to sample elements from a list is not feasible in this situation.

A naive approach is iterating over a file and deciding if an element is sampled with a fixed probability. However, this can not return a fixed number of samples. Also, you need to know the total number of elements beforehand, which requires the iteration of file twice.

After some literature search and googling, I found an elegant algorithm called “reservoir sampling” solves this problem. (If you already know the algorithm, you don’t have to read this post.)

The reservoir sampling algorithm for sampling n elements from N elements is described like this.

  1. Iterate over each element.
  2. For i-th element, If i is smaller than or equal to n, put it in the reservoir.
  3. If i is larger than n, draw a random number, r, between 0 and i.
  4. If r is smaller than n, randomly replace an element in the reservoir with the i-th element.
  5. Continue until i = N.

That’s it. If I write it in Python, the code has only 19 lines including spaces. (and can be reduced more.)

import sys
import random

def reservoir_sample(stream, n):
	res = []

	for i, el in enumerate(stream):
		if i <= n:
			rand = random.sample(xrange(i), 1)[0]
			if rand < n:
				res[random.sample(xrange(n), 1)[0]] = el
	return res

if __name__=="__main__":
	samp = reservoir_sample(sys.stdin, 10)
	for s in samp:

This code randomly samples 10 lines from standard input.

python < file.txt

If you replace sys.stdin with other types of stream, for example, a file stream or an iterator of sequences from SeqIO.parse in BioPython, it randomly samples 10 elements from them. Because file iterators of Python do not load whole file onto the memory, you do not need large memories to complete sampling as far as the size of reservoir is small. In addition, you do not need to know how many elements there are in the stream.

I found this algorithm truly useful. But, why does it work?

Let’s consider a list with N elements, from which we sample n elements.

Probability that i-th element is sampled must be a product of probability that i-th element is put into reservoir and probability that i-th element in the reservoir is not replaced by other elements sampled subsequently.

When i = N, the probability that the N-th element is sampled is simply,

P_{i=N} = \frac{n}{N}

, because i = N and there is no subsequent sampling.

When i = N-1, the probability is a product of the N-1-th element is chosen and it is not replaced by the N-th element in the reservoir.

P_{i=N-1} = \frac{n}{N-1} \cdot (1-\frac{n}{N} \cdot \frac{1}{n}) = \frac{n}{N-1} \frac{N-1}{N} = \frac{n}{N}

The probability that N-1-th element is put in reservoir is clearly n/(N-1), and that the element in reservoir is not replaced is 1-n/N*1/n because the N-th element is chosen with n/N and it replaces N-1-th element with probability 1/n.

It appears that a denominator is reduced by a numerator in the following term.
Similar arguments go down to N-2, N-3, and so forth to the i-th element.

P_{N-i} = \frac{n}{i} \cdot (1-\frac{n}{i+1} \cdot \frac{1}{n}) \cdot (1-\frac{n}{i+2} \cdot \frac{1}{n}) \cdots (1-\frac{n}{N} \cdot \frac{1}{n}) \\=\frac{n}{i} \cdot \frac{i}{i+1} \cdot \frac{i+1}{i+2} \cdots \frac{N-1}{N} = \frac{n}{N}

Great. Probability for each element to be sampled is always n/N.

Thanks to this clever algorithm, now I can save memory and time a lot.