Code to extract node annotations from NEXUS

I found that the read.nexus function in ape does not read node annotations written within square brackets. (If you don’t know “ape”, it is a common R package for phylogenetic analysis.) Commonly used programs like FigTree or TreeAnnotator output these bracket-style annotations in Nexus format but cannot write them in simple Newick format, which the read.tree function can read.
A couple of Google searches told me that some packages like “PHYLOCH” can read bracket-style annotations in Nexus format. However, the installation of PHYLOCH got stuck with a dependency error.

This became a major obstacle in my analysis, so I decided to write code to convert the bracket-style annotations into simple Newick ones, that is, code that converts text like this,

((a:1, b:1)[&posterior=1.0]:1, (c:1,d:1)[&posterior=0.98]:1)[&posterior=0.99]:1;

into,

((a:1, b:1)1.0:1, (c:1,d:1)0.98:1)0.99:1;

The output of FigTree/TreeAnnotator usually contains annotations other than posterior probability, such as node height. So the code would be more useful if I could choose an annotation key instead of targeting only “posterior”.

This is good (or maybe painful) practice for regular expressions. Initially, I tried to write the code as a shell script using Linux’s grep and sed, but soon found Python to be an easier solution.

The first thing to do is to find text enclosed by “[&” and “]”, for example, “[&posterior=1.0]”. The slightly tricky point is that the text within the brackets must not itself include brackets. Otherwise, a long text like “[&posterior=1.0]:1, (c:1,d:1)[&posterior=0.98]” will be matched.

After a few Google searches, I found this is done with “\[&(.*?)\]”. “.*?” is the non-greedy form of a match to any characters of any length, which stops extending as soon as “]” is found. In Python, this is written as below.

re.findall(r"\[&(.*?)\]", line)
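
To see the difference between the greedy and non-greedy forms, here is a small sketch run on the example tree from above. The variable names are only for illustration, and the negated character class in the last pattern is an alternative I am adding for comparison, not part of the original approach.

```python
import re

line = "((a:1, b:1)[&posterior=1.0]:1, (c:1,d:1)[&posterior=0.98]:1)[&posterior=0.99]:1;"

# Greedy: ".*" runs to the last "]" on the line, so one long match comes back.
greedy = re.findall(r"\[&(.*)\]", line)

# Non-greedy: ".*?" stops at the first "]" after each "[&".
lazy = re.findall(r"\[&(.*?)\]", line)

# A negated character class is another way to forbid "]" inside the match.
negated = re.findall(r"\[&([^\]]*)\]", line)

print(len(greedy))  # 1
print(lazy)         # ['posterior=1.0', 'posterior=0.98', 'posterior=0.99']
print(negated == lazy)  # True
```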

Once the text within brackets is extracted, it is parsed.
The second tricky point is that the annotation text is comma-separated (each entry is separated by a comma), but commas within curly braces must be ignored when splitting entries. Otherwise, an annotation like “[&height_95={0.1,0.3},posterior=0.98]” will be split into 3 elements, “height_95={0.1”, “0.3}” and “posterior=0.98”.

I could not find a good solution to this even after hours of googling, so I resorted to replacing the commas within braces with “-” (the braces themselves are dropped as well).

re.sub(r"\{([0-9]+\.[0-9E-]+),([0-9]+\.[0-9E-]+)\}", "\\1-\\2", text)

re.sub replaces the text matching the first argument with the second argument. The pattern matches two decimal numbers surrounded by braces and separated by a comma. The “( )” groups capture the matched numbers, and the captured values are referenced by “\1” and “\2” when the substitution occurs. The “.” is escaped as “\.” so that it matches only a literal decimal point.
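
For comparison, one possible way to split directly on commas outside braces is a negative lookahead. This is only a sketch under the assumption that braces are never nested; split_outside_braces is a hypothetical helper name, and this is not what the final code below uses.

```python
import re

def split_outside_braces(text):
    # Split on a comma unless a "}" follows it before any other brace,
    # which would mean the comma sits inside a {...} group.
    # Assumes braces are not nested.
    return re.split(r",(?![^{}]*\})", text)

text = "height_95={0.1,0.3},posterior=0.98"
print(split_outside_braces(text))
# ['height_95={0.1,0.3}', 'posterior=0.98']
```

Unlike the substitution approach, this keeps the braces and the original comma inside the range value.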

Annotations are then split by commas and stored in a Python dictionary. The value for a specific key is looked up in it and used to replace the original bracket annotation.

The following code is the final version.

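import sys
import re

def node_attributes(txt):
	# Turn a brace range such as {0.1,0.3} into 0.1-0.3 so that the
	# comma split below does not cut it in two.
	txt = re.sub(r"\{([0-9]+\.[0-9E-]+),([0-9]+\.[0-9E-]+)\}", "\\1-\\2", txt)

	attr_val = {}
	for attr in txt.lstrip("&").split(","):
		# Split at the first "=" only; an entry without "=" (e.g. a
		# [&R] flag) gets an empty value instead of raising IndexError.
		name, _, value = attr.partition("=")
		attr_val[name] = value

	return attr_val

# The annotation key is an optional second argument.
if len(sys.argv) > 2:
	key = sys.argv[2]
else:
	key = "posterior"

with open(sys.argv[1], "r") as f:
	for line in f:
		# Content of every [&...] annotation on this line.
		for m in re.findall(r"\[&(.*?)\]", line):
			attr = node_attributes(m)
			# Replace the whole bracket as a plain string (not a regex),
			# with the chosen value or with nothing if the key is absent.
			line = line.replace("[&" + m + "]", attr.get(key, ""))

		print(line.rstrip("\n"))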

This code runs on a text file containing trees with bracket annotations. The second argument selects the annotation key; if you replace “posterior” with “height”, it extracts node heights, provided the annotations include them.

python replace_annotation.py tree.txt posterior
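
To check the conversion without preparing a file, the same idea can be wrapped in a function that works on a string. This is a minimal sketch, not the script above: convert_line is a hypothetical helper, and it uses a replacement function with re.sub plus a lookahead split (assuming braces are never nested) instead of the comma-to-dash substitution.

```python
import re

def convert_line(line, key="posterior"):
    # Hypothetical helper: replace each [&...] block with the value of `key`,
    # or with nothing if the block does not contain that key.
    def pick(match):
        # Split on commas outside braces, then at the first "=" of each entry.
        entries = re.split(r",(?![^{}]*\})", match.group(1))
        attrs = dict(e.partition("=")[::2] for e in entries)
        return attrs.get(key, "")
    return re.sub(r"\[&(.*?)\]", pick, line)

tree = "((a:1, b:1)[&posterior=1.0]:1, (c:1,d:1)[&posterior=0.98]:1)[&posterior=0.99]:1;"
print(convert_line(tree))
# ((a:1, b:1)1.0:1, (c:1,d:1)0.98:1)0.99:1;
```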