# Superimposing maps of two cities with ggmap

I recently moved from Kyoto to Paris to do my second postdoc. In the process of flat hunting in Paris, which has been very hard  because of  the rising housing cost, I walked around some different parts of Paris. One impression I had is Paris is small. Once you walk through the city,  you will find this quite easily.

My intuition told me that its size is almost close to the size of Kyoto, which is a medium-sized city in Japan. But, is it true? I found a website comparing London and Paris, and it says London is larger than Paris, but no comparison between Paris and Japanese cities exists.

So, I just checked by superimposing maps of the two cities. I used an R package called “ggmap”, a very nice geographic visualization package built based on ggplot2.

I first downloaded maps of Paris and Kyoto in 15km squares surrounding city centers. (I chose the Cite island for Paris and Shijo-Karasuma for Kyoto as centers). This process was a bit tricky because maps are defined by longitudinal and latitudinal range. The identical longitudinal ranges do not represent the same distance when the cities’ latitudes are different. I used the “great circule distance” equation to adjust the effect of latitude on the longitudinal distance and and find appropriate ranges.

R <- 6371
#convert distance to longitudinal range given a latitude
delta.lambda <- function(d, phi=0.0) {
2*asin(sin(d/R/2)/cos(phi))
}

#create longitudinal and latitudinal ranges which define a suquare with length d
map.range <- function(d, lat=0.0, lon=0.0){
dphi <- d/R*360/2/pi
dlambd <- delta.lambda(d, phi=lat*2*pi/360)*360/2/pi

lonrange <- lon + c(-dlambd/2, dlambd/2)
latrange <- lat + c(-dphi/2, dphi/2)

rang <- c(lonrange, latrange)
names(rang) <- c("left", "right", "bottom", "top")
return (rang)
}


In the above code, the map.range function returns longitudinal and latitudinal range which covers a square with its side length = d km.

Once you can define a range to draw, downloading and plotting maps is simple thanks to ggmap functions.

d <-15 #15kmx15km map
z <- 12 #Zoom level
kyoto_center <- c(35.0, 135.76)
kyoto <- map.range(d, lat=kyoto_center[1], lon=kyoto_center[2])
kyoto_map <- get_stamenmap(kyoto, zoom=z, maptype="toner")
ggmap(kyoto_map)

paris_center <- c(48.855, 2.34)
paris <- map.range(d, lat=paris_center[1], lon=paris_center[2])
paris_map <- get_stamenmap(paris, zoom=z, maptype="toner")
ggmap(paris_map)


The two maps plotted side by side tells me a lot. For examplem the Peripherique motor way, which defines Paris, looks quite similar size as the city of Kyoto surrounded by mountains.

Superimposing two maps was another tricky point. The ggmap has two ways for superimposition of graphics, “inset_map” and “inset_raster”. It seems “inset_raster” is an appropriate function because it allows me to set coordinates for regions you superimpose graphics. One problem is that it does not allow me to change alpha value. I just added alpha values to the raster graphics of the map and plot it using the inset_raster.

translucent.raster <- function(map){
tmp <- as.matrix(map)
tmp <- gsub("\$", "55", tmp)
tmp <- gsub("FFFFFF", "0000FF", tmp)
ras <- as.raster(tmp)
attributes(ras) <- attributes(map)
ras
}

ggmap(paris_map) + inset_raster(translucent.raster(kyoto_map), xmin=paris["left"], ymin=paris["bottom"], xmax=paris["right"], ymax=paris["top"])


This is a bit awkward way, but the result looks OK. Now, I can compare landmarks more precisely.

The Golden and Silver temples (indicated by small black arrows), which define north-west and north-east edges of Kyoto area are almost on the Peripherique. This means walking from the Cite island to the Peripherique is just like walking from Shijo-Karasuma to the Silver temple. Probably, this distance similarity made me feel that the two cities are in similar size.

Of course, Paris has much larger suburban area, and the size of Paris metropolitan area is much larger than Kyoto. But, my intuition looks more or less correct.

# tr2-delimitation in python3

I just announce that the tr2-delimitation in Python3 is now available from the following repository.

https://bitbucket.org/tfujisawa/tr2-delimitation-python3

It should return results exactly the same as the Python2 version (I verified it with the data set of the Fujisawa et al. (2016) paper).

I will maintain the old Python2 version, but the new functions will be added to this Python3 version.

# Doing MCMC with PyMC and making tr2 a bit Bayesian

When you do a Bayesian inference, Markov chain Monte Carlo (MCMC) sampling is a common method to obtain a posterior probability of your model or parameter. So far, I have avoided using MCMC in my programs because I like simple and rapid algorithms. But,  MCMC now looks unavoidable when I do more sophisticated Bayesian modeling.

There are many software platforms to do MCMC with reasonable learning curves, like Stan & Rstan or BUGS. These are definitely worth studying when you want to be a serious Bayesian.

But, for now, I choose a Python package called “PyMC“. This is because it is well integrated with Python language, and there is a nice introductory textbook to learn this package. (It is freely available online, and they sell hard copies too.)

After reading some chapters of the book, I tried to solve a simple problem with PyMC, which is related to phylogenetic inference and species delimitation.

The problem is estimation of λ in the model below.

$P(n_1,n_2,n_3|\lambda) = \frac{N!}{n_1!n_2!n_3!}(1-\frac{2}{3}e^{-\lambda})^{n_1}(\frac{1}{3}e^{-\lambda})^{N-n_1}$

Why is this related to delimitation?  Because this is a model of distribution of tree topology when you sample 3 individuals from 3 species.

If you sample 3 individuals from 3 different species and reconstruct gene trees, you are more likely to see one particular topology, the same one as species tree, than others. On the other hand, if you sample 3 individuals from 1 species, 3 types of topology are evenly observed. Finding this skew / evenness of  topology distribution is a basic idea of the tr2-delimitation.

N in the model above is the number of sampled loci, which is a sum of n1, n2 and n3, counts of three possible tree topology of 3 samples. As λ (a branch length of species tree) increases, you more frequently observe topology 1 (species tree topology). The distribution becomes more even when λ is close to zero.

With this model, a posterior distribution of λ when you observe topology counts [n1,n2,n3] is,

$P(\lambda|n_1,n_2,n_3) = \frac{P(n_1,n_2,n_3|\lambda)\pi(\lambda)}{P(n_1,n_2,n_3)}$

I tried to estimate this distribution by MCMC. Luckily, there is an analytical solution to posterior distribution of λ at least with a uniform prior. So, I can check if MCMC can actually work.

The code below simulates n1, n2 and n3 with a particular λ value and does MCMC sampling to estimate λ’s posterior with simulated values, then outputs 5000 samples.

import sys
import numpy
import pymc

##simulated frequencies of triplets
l = 0.598   #true lambda = 0.598
#l = 0.162  #or lambda = 0.162
prob = [1-2*numpy.exp(-l)/3, numpy.exp(-l)/3, numpy.exp(-l)/3]
count_obs = numpy.random.multinomial(100, prob)
print(l, prob, count_obs)

##Bayesian model
lambd = pymc.Uniform("lambda", lower=0.0, upper=5.0) #Uniform prior for lambda

#A pymc function translating lambda to 3 probabilities of triplets
@pymc.deterministic
def triplet_prob(lambd=lambd):
p1 = 1-2*numpy.exp(-lambd)/3
p2 = p3 = numpy.exp(-lambd)/3
return [p1, p2, p3]

#observed values were associated with the multinomial model
obs = pymc.Multinomial("obs", n=sum(count_obs), p=triplet_prob, observed=True, value=count_obs)

#run MCMC
model = pymc.Model([obs, triplet_prob, lambd])
mcmc = pymc.MCMC(model)
mcmc.sample(100000, burn=50000)

with open("trace.lambda.%0.3f.txt"%l, "w") as f:
for i in mcmc.trace("lambda")[::10]:
f.write("%f\n"%i)


PyMC has a quite elegant and intuitive way to abstract a Bayesian modelling.

You can easily define prior distributions of parameters and develop models by combining them. The observed data are connected to the model with the “observed=True” option. Dependency of variables can be traced by “parent” and “children” attributes. Then, you can run MCMC just by calling mcmc.sample.

The distribution of posterior probability of λ when λ = 0.598 was like this. (In this case, simulated numbers are [n1,n2,n3]=[60,21,19])

The left plot is the trace of MCMC, and the right the histogram of MCMC samples. The curve superimposed on the histogram is an analytical solution. As you can see, the MCMC distribution fitted the analytical solution surprisingly well. This is great. The 95% credible interval (CI) is (0.30, 0.78). So, I am quite sure that λ is larger than zero and  topology distribution is skewed.

When λ is smaller (λ = 0.162,[n1,n2,n3]=[44, 33, 23]), estimation became more uncertain.  The 95%CI is (0.035, 0.38). A bit difficult to say n1 is more frequent.

I think this credible interval approach is OK for this simple case to just estimate λ. But, if you seriously implement species delimitation, a model comparison with reversible jump MCMC is required. It looks much more laborious to write codes for it since PyMC doesn’t have rjMCMC functions.

Regardless, I think PyMC is a good package which is easy to learn and have nice, readable documentations. It is probably a reasonable option if you want to integrate a Bayesian inference in your Python codes. Also, I think it is a handy tool to prototype a Bayesian model before writing it from scratch with other fast languages.

# Bootstrap a very large data file

A couple of years ago, I wrote a post about a method for sampling from a very large data file. The “reservoir sampling” can efficiently sample a fixed number of samples from a huge data file while you read through it only once. Now, the natural step forward is sampling with replacement.

I sometimes need to resample large data, like a huge alignment or SNP table,  with replacement for statistical analysis, that is, I need to bootstrap them.  Is there a way to efficiently bootstrap a huge file? This looks a lot more difficult than sampling a fixed number of samples without replacement because all items in a file stream must have chance to be sampled several times.

I did some google searches and found that this issue has been studied in various fields from traditional biostatistics to cutting-edge data science. As I expected, the methods for efficient bootstraping are more complicated than the reservoir sampling.

Let’s start with a simple approach. The following code does bootstrapping of lines in a file. You need to load all lines on the memory to resample lines, and bootstrapping will become memory-demanding when you process a data file with millions of lines.

import sys
import numpy

lines = [l for l in open(sys.argv[1], "r")]

for i in numpy.random.randint(0, len(lines), size=len(lines)):
sys.stdout.write(lines[i])


The memory usage will be slightly reduced if you use a multinomial distribution instead of simply sampling lines. This is justified because the number of occurrence of one entry in the bootstrap procedure follows multinomial distribution. This idea appears to have been proposed even in 1980s.

import sys
import numpy

size = sum((1 for line in open(sys.argv[1], "r")))
cnt = numpy.random.multinomial(size, [1.0/size]*size)

for line, c in zip(open(sys.argv[1], "r"), cnt):
for i in range(c):
sys.stdout.write(line)


A clear problem of the code above is that it requires to read a file twice; it counts the number of lines first then bootstraps them following a multinomial distribution. Because it needs to read the file twice, it does not work on stream inputs.

Is there a way to bootstarp samples while you read a file only once? According to some literatures, an algorithm called “Poisson bootstrap”  does this job. Instead of sampling with multinomial distribution of size=N (N is the total number of samples), you can approximate the number of occurrence of an item by Poisson distribution of lambda=1.  This means that the bootstrap procedure is approximated by sampling each item n times while reading lines and drawing n from a Poisson distribution.

import sys
import numpy

for line in open(sys.argv[1], "r"):
cnt = numpy.random.poisson(lam=1)

if cnt == 0:
continue
else:
for i in range(cnt):
sys.stdout.write(line)


This simple, but powerful code (it’s even simpler than the reservoir sampling code) does the approximated bootstrap while reading a file only once.

I could not fully understand the mathematical justifications of this procedure, but if you are interested, there are several papers discussing about the properties of Poisson boot strap. For example, Hanely&MacBibbon(2006) or Chamandy et al. (2012). (I think the first paper is more approachable than the second one.)

A disadvantage of this code is you can not sample a fixed number of samples. Bootstrap replicates have different numbers of samples, which probably introduces extra variances to the bootstrap estimates. However, according to  the papers above, the variation of sample size will be negligible when the sample size is huge.

I just generated bootstrap replicates of 100 and 10000 samples with three algorithms above, and checked how many times each sample is sampled. The three algorithms appears to be identical when the sample size is large (N=10000). (Dashed line is the expected number of count with Poisson distribution.)

while the variation is not quite negligible when sample size is 100.

This Poisson bootstrap procedure appears to be particularly useful when you need to resample very large data. It gets closer to the true bootstrap as you have a larger number of samples.

# New multilocus delimitation code

I just announce that the codes for the new multilocus delimitation method are available at my Bitbucket repository.

The new program is called “tr2”, which stands for “trinomial distribution of triplets” model. It delimits species from multilocus gene trees by measuring congruence of gene trees between species and incongruence within species. It calculates the posterior probabilities of two types of trinomial distribution models and test within/between-species boundaries. A closely related methods are MP-EST, the triplet-based phylogeny reconstruction method by Liu et al, and the Bayesian species delimitation method BP&P by Yang&Rannala.

The tr2 implements a rapid algorithm to find the best delimitation under a given guide tree and an optimality score (AIC or posterior probability), which can run with more than 100 tip guide trees.

If you are interested in it, please visit the repository below and try it.

https://bitbucket.org/tfujisawa/tr2-delimitation/

I assume that users have knowledge of Python and related programming tools like Mercurial. The program is not well-decumented now. I will write down detailed instructions in following blog posts.