Category Archives: Software

tr2-delimitation in python3

I just announce that the tr2-delimitation in Python3 is now available from the following repository.

https://bitbucket.org/tfujisawa/tr2-delimitation-python3

It should return results exactly the same as the Python2 version (I verified it with the data set of the Fujisawa et al. (2016) paper).

I will maintain the old Python2 version, but the new functions will be added to this Python3 version.

Advertisements

Multilocus delimitation with tr2: Guide tree approach

The second option of delimitation with the “tr2” is a guide tree approach. A guide tree is a tree which specify a hierarchical structure of species grouping. By using it, you can significantly reduce the number of possible delimitation hypotheses to search.

The tr2 implements an algorithm to find a best position of nodes which define species group under a given guide tree. As the algorithm is reasonably fast, you can search the best delimitation through a tree from each tip is distinct species to all tips are from one single species.

Now, acceptable size of the number of taxa on guide tree is around 100. If you exceed 200 taxa, the memory requirement usually becomes huge and normal desktop computers can not handle. The current limitation of the number of input trees, that is, the limit of number of loci, is ~ 1000. Theoretically, it can be larger, but a problem on numerical calculation now limits the number of loci you can use.

To run tr2 with a guide tree, you need a newick formatted guide tree file, and a gene tree file also in newick format.

Only the first line of the guide tree file is used. In the standard analysis, guide tree tips must contain all taxa found in gene trees. Guide trees can be built any methods such as concatenated ML (eg. RAxML) or coalescent-based species tree methods (eg. ASTRAL). Most importantly , it must be properly rooted. Incorrect rooting often results in over-splitting.

The gene tree file must contain one tree per line. They must be rooted too. Missing taxa are allowed.

Once two files are ready, the command below starts a search algorithm.

./run_tr2.py -g guide.tre -t genetree.tre

To test with bundled example files, use files in “sim4sp” directory.

./run_tr2.py -g sim4sp/4sp.nex10.RTC.tre -t sim4sp/simulated.gene.trees.nex10.4sp.tre

After some intermediate outputs, you will see a tree with delimitation results and a table.

write: <stdout>
species sample
1     12.3
1    11.3
1    15.3
1    14.3
1    13.3
2    7.2
2    10.2

If the tree and table are too large on your console screen, use “-o” option to output them into files.

./run_tr2.py -o out -g sim4sp/4sp.nex10.RTC.tre -t sim4sp/simulated.gene.trees.nex10.4sp.tre

This command ouputs a table and a tree into “out.table.txt” and “out.tre” respectively. You can see the results using any programs.

Now, I use R + “ape” package to check a delimitation result.

library(ape)
tr <- read.tree("./out.tre")
plot(tr)
nodelabels(text=substr(tr$node.label,1,6), bg="white")

These R commands plots the guide tree and delimitation like the  below picture.

outdelimit

The numbers on the nodes indicate average differences of posterior probability scores. If they are positive the node has between-species branches. The negative values suggests that nodes are within species. “nan” indicates there are not enough samples to split/merge the node. “*” signs show the best position of delimitation. In this case, there are 4 putative species.

Multilocus delimitation with tr2: Model comparison

The “tr2” currently has two options for species delimitation. One is calculating posterior probability scores for user-specified delimitation hypotheses. Another option is finding the best delimitation under a guide tree, which specifies a hierarchical structure of species grouping.

The first option is probably useful to compare multiple species groupings and find the best one (such as comparing morphological species vs. mtDNA groups) while the second option can be used without any prior assignments and find species only from gene trees.

Let’s start with the first option. (I assume you have already set up an environment for tr2.)

You must have two input files: A gene tree file in Newick format and a tab-delimited text file which specify associations of species and individual samples.

In a tree file, one line must contain one gene tree. Trees can have missing taxa. They must be rooted. (Yes. The program is based on “rooted triplet”. So, trees must be rooted. If you do not have outgroups, midpoint rooting or RAxML’s “-I f” option often works well.)

In an association file, the first column represents the names of samples. They must be identical to the names of the tree tips. The second and so forth columns are species groups. You can write as many columns as you want. Also, you can use any codes to describe species names.

For example, a table below specifies three alternative delimitations of samples 16.4-20.4

19.4     4     B     sp5
17.4     4     B     sp5
18.4     4     B     sp4
16.4     4     A     sp4
20.4     4     A     sp4

Association files must contain all sample names which appear in the tree file.

Once you have a tree file and an association file, simply run the tr2 command as follows.

./run_tr2.py -a sp_association.txt -t genetrees.tre

Some example files are stored in the “sim4sp” folder. If you use them to test tr2, the command is like this.

./run_tr2.py -a sim4sp/sp.assoc.4sp.txt -t sim4sp/simulated.gene.trees.nex10.4sp.tre

The outputs of this command must be like below.

write: <stdout>
model score
null 51391.76
model1 5.73

The score of “model1” looks much smaller than the “null” model (, which assumes all samples are from one single species). So, you can be quite confident that model1 is a better delimitation.

How to set up an environment for “tr2”

Installing Python and check versions

Python is required to run the tr2-delimitation on your computer. In many modern operating systems, Python is pre-installed and you do not need to install it by yourself.

However, the installed version of Python matters. The tr2 is written in Python 2,  while, in some recent systems, Python 3 is the default version. This is simply because Python 2 was a standard when I started writing the codes, but Python 3 is becoming a new standard now. (I am translating the tr2 codes into Python3.) So, you first need to check the version of Python installed on your system.

Type on your console,

$python --version

If you see a message like below, you are running Python 2.

Python 2.7.6

In some decent OS’s, both Python2 and 3 are pre-installed, and you can call Python 2 by

$python2

even when your default Python is Python3.

If your environment does not have Python2 at all, visit the Python website and install it, or create a Python2 environment as explained in the next section.

Installing dependencies (scipy/numpy and Java)

Python packages called numpy/scipy is required to run tr2. These packages are for numerical calculations. Visit SciPy website and follow the instructions for your operating system to install Scipy libraries. Installing all Scipy related packages following the instruction is just fine though not all Scipy packages are required to run tr2.

An alternative, easy way to install dependencies is installing an all-in-one suite like Anaconda, which includes Python and related packages. As Anaconda allows you to run multiple versions of Python, you can install it with any version of Python.

If you choose to install Anaconda with Python3, you must run Python2 codes by creating Python2 environment.

$conda create --name python2 python=2 numpy scipy

This “conda” command creates an “environment” where the python version is 2 with numpy/scipy installed. You can switch to it by calling the “activate”  command,

$activate python2

and switch back to the default environment by the “deactivate” command.

(python2) $deactivate

 

Checking packages

You can check if the packages are properly installed by loading them on the interactive shell.

$python

(Just by typing “python”, “interactive shell” starts. You can quit this mode by pressing Ctl+d.)

>>>import numpy, scipy

If it does not return errors, packages are ready to use.

Java is also required to run the consensus tree building program, triplec. Almost all modern operating systems have Java as pre-installed software.

A better plot function for GMYC result

I have received some emails and comments about the plotting function of the splits package. The plot.gmyc lacks some options. It cannot change the font size of the sample names, colors of tips, size of margins and so on. Surely, these options are important for the presentation of results. I have been thinking of updating the function, but its code is a mixture of codes written by several people and disentangling them appears a bit laborious.

So, for now, I upload a new version of the plot function.

plot.result <- function(res, cex=0.5, edge.width=1, no.margin=F, show.tip.label=T, label.offset=0) {
	plot.tree <- function(tr, mrca, cex=0.5, edge.width=1, no.margin=F, show.tip.label=T, label.offset=0) {
		traverse.subtree <- function(tr, n=1) {
			numtip <- length(tr$tip.label)
			sub.node <- tr$edge[which(tr$edge[,1] == n+numtip), 2]

			res.node <- c()
			for (sn in sub.node) {
				res.node <- c(res.node, sn)
				res.node <- c(res.node, traverse.subtree(tr, sn-numtip))
			}
			return (res.node)
		}

		numtip <- length(tr$tip.label)
		br.col <- rep(1, length(tr$edge.length))

		for (i in mrca) {
			for (j in traverse.subtree(tr, i-numtip)) {
				br.col[which(tr$edge[,2]==j)] <- 2

			}
		}

		plot(tr, edge.color=br.col, show.tip.label=show.tip.label, cex=cex, edge.width=edge.width, no.margin=no.margin, label.offset=label.offset)
	}

	plot.tree(res$tree, res$MRCA[[which.max(res$likelihood)]]+length(res$tree$tip.label), cex=cex, edge.width=edge.width, no.margin=no.margin, show.tip.label=show.tip.label, label.offset=label.offset)
}

This plot function accepts the common options of the ape’s plot.phylo function, including cex, edge.width, etc.

You can change the color of edges by replacing “2” at the end of line 20 with another number or a color name. For instance,

br.col[which(tr$edge[,2]==j)] <- "blue" 

This makes branch colors blue.

The function can be used like the old plot function, and it should be more flexible than the old one.

 >res <- gmyc(test.tr)
>plot.result(res, no.margin=T, cex=0.3, edge.width=1.5)

plot.result

exact.match.pairs.R for showing exactly matched species

One of users of the splits package asked me if there is a function to show pairs of species which have exact match. The comp.delimit function counts the number of exact matches between 2 groupings, but does not show which species matches with which one. Surely, it is useful to have this type of information.

At first, tracking pairing information looked complicated for me, but actually it is easily done by simple modifications on existing codes.

exact.match <- function(x, y) {
	if (length(x) != length(y)) {
	return (NA)
	}

	match <- 0
	for (l1 in unique(x)) {
		for (l2 in unique(y)) {
			l1_ids <- sort(which(x == l1))
			l2_ids <- sort(which(y == l2))

			if (length(l1_ids) == length(l2_ids) && all(l1_ids == l2_ids)) {
				match <- match + 1
			}
		}
	}
	return (match)
}

The code above is the “exact match” function used in the comp.delimit. It runs on 2 vectors of delimitation. It does pairwise comparisons between all pairs of species and records the number of pairs which matched exactly.

For instance, consider 2 alternative groupings x=(A,A,B,C) and y=(1,1,2,2) (,where the first element of each vector is species of the first sample, and the second, the third and so on.) The function returns 1, which means one match between species A and species 1.

Tracking the names of pairs with exact match is done just by replacing the counter of matches in the code above with a matrix which keeps names of matched pairs.
The modified code is below.

exact.match.pairs <- function(d1, d2) {
	if (nrow(d1) != nrow(d2)) {
		return (NA)
	}

	x <- d1[,1]
	y <- d2[match(d1[,2], d2[,2]), 1]

	pair <- c()
	for (l1 in unique(x)) {
		for (l2 in unique(y)) {
			l1_ids <- sort(which(x == l1))
			l2_ids <- sort(which(y == l2))

			if (length(l1_ids) == length(l2_ids) && all(l1_ids == l2_ids)) {
				pair <- rbind(pair, c(l1, l2))
			}
		}
	}
	return (pair)
}

Another small modification is the function now accepts 2 tables of delimitation instead of 2 vectors. Lines 6 and 7 were added to sort tables by sample names. The first column of the table must be species names, and the second column sample names.

For a gmyc result, res1, and a dataframe of delimitation, d, it is used as follows.

>exact.match.pairs(spec.list(res1), d)

For 2 tables of delimitation,

>exact.match.pairs(d1, d2)

These commands return a matrix containing pairs of species.

[,1] [,2]
[1,] “1” “spec1”
[2,] “2” “spec18”
[3,] “5” “spec26”
[4,] “12” “spec16”
[5,] “14” “spec25”

Splits download problem

update:
The problem was solved. Now, the splits is available for download from the R-forge.

The R-forge site for the splits is now experiencing a problem of package building. Both source files and the binary package are not available for download. The install.packages command described in “how to install” do not work either. It is unclear when this problem is solved.

I put the source code and package file on Github for now.

https://github.com/tfujisawa/splits_tmp

This is only for temporary download. I have not decided whether the package will be hosted on Github or not.

If you are not familiar with how to use Git, a simple way to install package is following:

0) Make sure you installed the latest R and dependency packages.

1) Click the “Download ZIP” button at the bottom right to download the zipped files.

2) Decompress “splits_1.0-19.tar.gz” from the downloaded zip file.

3) Call install.pacakges on your R console to install from the local file.

>install.packages(“path/to/your/directory/splits_1.0-19.tar.gz”,  type=”source”)