Extracting SNPs from a fasta alignment

I have been working on transcriptome phylogenomics data. The data were originally generated to draw a phylogeny, but we found that we can use them to infer finer-scale demographic histories since they contain a reasonably large number of SNPs within species.

So, I wrote some codes to extract SNPs from a sequence alignment I have constructed.

When you just want to extract variable sites from an alignment, the code must be very simple. You just need to find columns with variable sites by counting bases and keep them when you find two or more unique bases.

import sys

from Bio import AlignIO
from Bio.Seq import Seq
from Bio.Align import MultipleSeqAlignment
from Bio.SeqRecord import SeqRecord

s_bases = ('A', 'C', 'G', 'T')
d_bases = ('R', 'Y', 'S', 'W', 'K', 'M', 'N')
bases = s_bases + d_bases

alig = AlignIO.read(sys.argv[1], 'fasta')

cutoff = 0.5

snp = [SeqRecord(Seq('', s.seq.alphabet), id=s.id, description=s.description) for s in alig]
snp = MultipleSeqAlignment(snp)

for i in range(alig.get_alignment_length()):
	col = alig[:,i]
	col = col.upper()
	col = col.replace('-', 'N')	

	#print col
	base_count = {b:col.count(b) for b in bases}
	del base_count['N']

	if sum(base_count.values()) > len(col)*cutoff:

		genotypes = sorted(base_count.items(), key=lambda x:-x[1])
		genotypes = [b for b in genotypes if b[1] > 0]

		if len(genotypes) > 1:
			snp = snp + alig[:,i:i+1] 

print snp.format('phylip')

This code keeps sites which have two or more unique bases and where missing percentage is smaller than 50. The output looks like below.

$python alignment2snp00.py test.fasta
 24 4
m9467      TTAG
m9092      ----
m12001     TTAG
m8773      TTMG
m9897      TTAG
m8294      CCAA
m8137      TTAG
m6561      TTAG
m8867      TTAG
m10600     CCAA
m4957      TTAG
m10093     TTMG
m12429     TTAG

The idea is very simple, but the behaviour of MultipleSeqAlignment object of Biopython is sometimes tricky and took time to understand it. For example, you can not concatenate columns like this,

snp = snp + alig[:,i]

A valid concatenation of a single column looks like this,

snp = snp + alig[:,i:i+1]

, simply because alig[:,i] is a String but alig[:,i:i+1] is a MultipleSeqAlignment.

The above code is actually not so useful because programs for population genetic inference often use information of genotypes rather than bases. When you need extract SNP genotypes from a sequence alignment, code becomes a bit more complicated.

import sys
import os

from Bio import AlignIO

s_bases = ('A', 'C', 'G', 'T')
d_bases = ('R', 'Y', 'S', 'W', 'K', 'M', 'N')
bases = s_bases + d_bases

deg_map = {'A':('A','A'),'C':('C','C'), 'G':('G','G'), 'T':('T','T'), 'R':('A','G'), 'Y':('C','T'), 'S':('C','G'), 'W':('A','T'), 'K':('G','T'), 'M':('A','C')}

alig = AlignIO.read(sys.argv[1], 'fasta')

inname = os.path.basename(sys.argv[1])

colnames = [sq.id for sq in alig]
colnames = ['file', 'pos'] + colnames
print '\t'.join(colnames)

cutoff = 0.5
for i in range(alig.get_alignment_length()):
	col = alig[:,i]
	col = col.upper()
	col = col.replace('-', 'N')	

	base_count = {b:col.count(b) for b in bases}
	del base_count['N']

	if sum(base_count.values()) > len(col)*cutoff:

		genotypes = sorted(base_count.items(), key=lambda x:-x[1])
		genotypes = [b for b in genotypes if b[1] > 0]

		ref = genotypes[0][0]

		if ref in d_bases:
			ref = deg_map[ref][0]

		allel = [deg_map[b[0]] for b in genotypes]
		allel = reduce(lambda x,y: x+y, allel)
		allel = list(set(allel))

		j = allel.index(ref)
		allel[0], allel[j] = allel[j], allel[0]

		allel_code = {b:k for k, b in enumerate(allel)}

		if len(allel) > 1:
			snps = [inname, '%d'%(i+1), ref, ','.join(allel[1:])]
			for s in col:
				if not s == 'N':
					code = [allel_code[a] for a in deg_map[s]]
					code = sorted(code)
					snps.append('%d/%d' % (code[0], code[1]))

				else:
					snps.append('-/-')
			print '\t'.join(snps)

The output is …

$python alignment2snp.py test.fasta
file	pos	m9467	m9092	m12001	m8773	m9897	m8294	m8137	m6561	m8867	m10600	m4957	m10093	m12429	m9744	m9871	m8953	m8504	m10303	m9328	m9754	m10452	m9391	m8000	m10040
test.fa	57	T	C	0/0	-/-	0/0	0/0	0/0	1/1	0/0	0/0	0/0	1/1	0/0	0/0	0/0	1/1	1/1	0/0	0/0	0/0	1/1	1/1	1/1	0/0	1/1	0/0
test.fa	110	T	C	0/0	-/-	0/0	0/0	0/0	1/1	0/0	0/0	0/0	1/1	0/0	0/0	0/0	1/1	1/1	0/0	0/0	0/0	1/1	1/1	1/1	0/0	1/1	0/0
test.fa	114	A	C	0/0	-/-	0/0	0/1	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/1	0/0	0/0	0/0	0/1	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0
test.fa	119	G	A	0/0	-/-	0/0	0/0	0/0	1/1	0/0	0/0	0/0	1/1	0/0	0/0	0/0	1/1	1/1	0/0	0/0	0/0	1/1	1/1	1/1	0/0	1/1	0/0

Looks neat. The code looks a bit redundant, but it works.

I assume that heterozygous sites in an alignment are represented by the IUPAC ambiguous code (say, Y = C/T). The code first finds variable sites like the previous code, then re-codes IUPAC bases into genotypes like M – A/C – 0/1.   Reference base is determined by choosing the most frequent unambiguous base. The genotypes are now represented by the “0/1” style, where 0/0 is a homozygote of the reference allele and so on. This type of code must be more useful as modifying the output format enables you to write down any types of SNP input files for downstream population genetic analyses.

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s