If you have BLASTed your sequences against the NCBI databases, you have probably seen entries named “hypothetical protein”.
I have been working on my RNA-seq data recently. When I try to find similar sequences in protein databases, I always stumble upon “hypothetical proteins”, which often carry no information useful for further investigation.
These hypothetical entries mainly come from gene annotation pipelines run on shotgun assemblies. Based on evidence from gene prediction programs or RNA mappings to scaffolds, the pipelines predict that protein-coding sequences exist at those locations, but the functions of those sequences have not been confirmed by any experimental work.
Not all hypothetical proteins are totally uninformative. Some have entries describing domains or functions predicted from similarity searches. These are the good hypothetical proteins. Others are completely uninformative.
Annoyed by frequent encounters with uninformative hypothetical proteins, I decided to measure how many hypotheticals there are in the NCBI databases.
I used the following query on the NCBI website to count the proteins named “hypothetical protein” that were added or modified during a particular period, and I restricted the results with a database filter.
"hypothetical protein"[Protein Name] AND ("1999/01/01"[MDAT] : "2013/01/01"[MDAT]) AND genbank[filter]
I then counted the total number of entries in the same database and period with this query.
("1999/01/01"[MDAT] : "2013/01/01"[MDAT]) AND genbank[filter]
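The two queries differ only in the protein-name clause, so they can be generated from the same pieces. The following is a minimal sketch, not the procedure I actually used (I typed the queries into the NCBI website). It assumes the NCBI E-utilities `esearch` endpoint with `db=protein` and `rettype=count`; the helper names `build_term`, `build_count_url`, and `proportion` are hypothetical and made up for illustration. The sketch only builds the query strings and URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: counts are requested from the E-utilities esearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(start, end, db_filter="genbank", hypothetical_only=True):
    """Build an Entrez query term like the ones shown above.

    start, end: modification dates as "YYYY/MM/DD" strings.
    db_filter: name used in the [filter] clause, e.g. "genbank".
    hypothetical_only: if False, the protein-name clause is dropped,
    which gives the query for the total number of entries.
    """
    parts = []
    if hypothetical_only:
        parts.append('"hypothetical protein"[Protein Name]')
    parts.append(f'("{start}"[MDAT] : "{end}"[MDAT])')
    parts.append(f"{db_filter}[filter]")
    return " AND ".join(parts)


def build_count_url(term):
    """Return an esearch URL asking only for the number of matching records."""
    params = {"db": "protein", "term": term, "rettype": "count"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def proportion(hypothetical_count, total_count):
    """Fraction of entries named "hypothetical protein"."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return hypothetical_count / total_count


if __name__ == "__main__":
    hypo_term = build_term("1999/01/01", "2013/01/01")
    total_term = build_term("1999/01/01", "2013/01/01", hypothetical_only=False)
    print(hypo_term)
    print(total_term)
    print(build_count_url(hypo_term))
```

Fetching each URL and reading the count from the response would give the two numbers to pass to `proportion`. Repeating this for successive date ranges and for other filter names would give a per-period trend, assuming those filter names are accepted by the protein database.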
As everyone already knows, the number of entries in the NCBI databases is truly skyrocketing. It doubles or triples every 2 years. The whole NCBI protein database is far larger than the RefSeq database.
The proportion of hypothetical proteins is increasing in GenBank and in the whole NCBI database, as I expected. It reached about 30% in 2012. Most entries named “hypothetical protein” in GenBank are tagged as “marine metagenome”. This is unsurprising, as we can never be sure what we are sequencing in metagenomic data.
A more surprising fact for me is that about 50% of the proteins in the RefSeq Protein database are actually hypothetical proteins. (RefSeq is supposed to be a well-curated and well-annotated database.) They come not from metagenomic samples but from many recent genome projects. This very high proportion explains why I saw so many hypothetical proteins in my BLAST hits.
Considering the number of hypothetical sequences in the databases (e.g. 1.2 million hypotheticals in RefSeq), it is unlikely that they will be annotated by lab experiments in the near future. We need to live with the hypothetical proteins.
This situation is a bit similar to the story of “dark taxa”. In a well-known blog post, Rod Page reported that more than 25% of the NCBI sequences of mammals and invertebrates did not have proper taxonomic names. He called those unidentified sequences “dark taxa”.
Maybe we must do lots of biology without names, as Rod Page said in that post. This is already being done with the good hypothetical proteins.