Ruth Y Eberhardt, Daniel H Haft, Marco Punta, Maria Martin, Claire O'Donovan, Alex Bateman.
Abstract
As the deluge of genomic DNA sequence grows, the fraction of protein sequences that have been manually curated falls. In turn, as the number of laboratories able to sequence genomes in a high-throughput manner grows, those laboratories may often lack the informatics capability to accurately identify and annotate all genes within a genome. These issues have led to fears that transitive annotation errors are making sequence databases less reliable. During the lifetime of the Pfam protein families database, a number of protein families were built that were later found to be composed solely of spurious open reading frames (ORFs), either on the opposite strand or in a different, overlapping reading frame with respect to the true protein-coding or non-coding RNA gene. These families were deleted and are no longer available in Pfam. However, we realized that they may serve a useful function in identifying new spurious ORFs. We have collected these families together in AntiFam, along with additional custom-made families of spurious ORFs. This resource currently contains 23 families that identified 1310 spurious proteins in UniProtKB and a further 4119 spurious proteins in a collection of metagenomic sequences. UniProt has adopted AntiFam as part of the UniProtKB quality control process and will investigate these spurious proteins for exclusion.
Year: 2012 PMID: 22434837 PMCID: PMC3308159 DOI: 10.1093/database/bas003
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Seed alignment for the AntiFam family derived from PF10695. Amino acids are colored by average similarity according to the BLOSUM62 amino acid substitution matrix, from most similar (light blue) to less similar (gray). ‘S’ and ‘E’ in the first row stand for sequence start and sequence end, respectively. The final row features a consensus sequence. The alignment was displayed using the Belvu software (http://www.sanger.ac.uk/resources/software/seqtools/).
Figure 2. Graphical representation of exemplar overlapping and spurious proteins. (a) Two proteins from the Corynebacterium efficiens genome that encode components of a restriction system. The C-termini of the two proteins overlap by 97 nt. (b) Two highly overlapping predicted proteins from the Rhodopirellula baltica genome coded on opposite strands of DNA. The Q7UY10 protein contains two Pfam DUF1596 domains. There is no evidence that these are true expressed proteins. Green boxes represent regions matched by Pfam families, the red shaded areas represent transmembrane domains predicted by Phobius (10) and the blue shaded areas represent regions of low complexity (11).
AntiFam entries derived from Pfam families
| Pfam accession number (identifier) | Last Pfam release present | Reason for deleting from Pfam | No. of matches in UniProt | No. of matches in metagenomics data set (a) |
|---|---|---|---|---|
| PF07612 (DUF1575) | 15.0 | Proteins may not be expressed. Evidence for homology to known protein on opposite strand | 3 | 0 |
| PF07616 (DUF1578) | 15.0 | Proteins may not be expressed. Evidence for homology to known protein on opposite strand | 6 | 6 |
| PF07630 (DUF1591) | 15.0 | Proteins may not be expressed. Evidence for homology to known protein on opposite strand | 6 | 0 |
| PF07633 (DUF1594) | 15.0 | Proteins may not be expressed. Evidence for homology to known protein on opposite strand | 5 | 0 |
| PF11370 (DUF3170) | 25.0 | Probable shadow ORF of Clp protease | 16 | 7 |
| PF11194 (DUF2825) | 25.0 | Probable CRISPR (b) | 159 | 18 |
| PF11664 (DUF3264) | 25.0 | Probable CRISPR repeat regions | 21 | 13 |
| PF10695 (Cw-hydrolase) | 25.0 | Antisense to rRNA ( | 225 | 1,654 |
| PF10919 (DUF2699) | 26.0 | Shadow ORF of PF00665 (integrase core domain 1) | 25 | 11 |
| PF07641 (DUF1596) | 26.0 | Dubious genome annotation. Family comprises only three sequences from | 3 | 0 |
The final two columns show the number of matches of each AntiFam entry to UniProtKB and to a metagenomic data set.
(a) The metagenomic set of sequences is the same as that used by Pfam (14).
(b) CRISPR, Clustered Regularly Interspaced Short Palindromic Repeats.
AntiFam entries derived from custom multiple sequence alignment
| Identifier | Type of spurious family | No. of matches in UniProt | No. of matches in metagenomics data set (a) |
|---|---|---|---|
| Spurious_ORF_10 | Translated bacterial tRNA, tRNA01 | 196 | 795 |
| Spurious_ORF_11 | Translated bacterial tRNA, tRNA02 | 89 | 170 |
| Spurious_ORF_12 | Translated bacterial tRNA, tRNA03 | 143 | 408 |
| Spurious_ORF_13 | Translated bacterial tRNA, tRNA04 | 77 | 671 |
| Spurious_ORF_14 | Translated bacterial tRNA, tRNA05 | 156 | 191 |
| Spurious_ORF_15 | Translated bacterial tRNA, tRNA06 | 31 | 63 |
| Spurious_ORF_16 | Translated bacterial tRNA, tRNA07 | 40 | 17 |
| Spurious_ORF_17 | Translated bacterial tRNA, tRNA08 | 5 | 10 |
| Spurious_ORF_18 | Translated bacterial tRNA, tRNA09 | 4 | 39 |
| Spurious_ORF_19 | Translated bacterial tRNA, tRNA10 | 7 | 12 |
| Spurious_ORF_20 | Translated bacterial tRNA, tRNA11 | 43 | 28 |
| Spurious_ORF_21 | PrfB frameshift | 24 | 5 |
| Spurious_ORF_22 | From a lncRNA, LINC00174 | 26 | 1 |
(a) The metagenomic set of sequences is the same as that used by Pfam (14).
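
As a consistency check, the per-family counts in the two tables can be summed and compared with the totals quoted in the abstract (23 families, 1310 UniProtKB proteins, 4119 metagenomic proteins). The sketch below is not part of the original work. The dictionary names and the `totals` helper are illustrative, and the comparison is only meaningful if each spurious protein is counted under a single family, which is an assumption here.

```python
# Per-family match counts transcribed from the two tables above:
# (matches in UniProtKB, matches in the metagenomic data set).
PFAM_DERIVED = {
    "PF07612": (3, 0),
    "PF07616": (6, 6),
    "PF07630": (6, 0),
    "PF07633": (5, 0),
    "PF11370": (16, 7),
    "PF11194": (159, 18),
    "PF11664": (21, 13),
    "PF10695": (225, 1654),
    "PF10919": (25, 11),
    "PF07641": (3, 0),
}

CUSTOM = {
    "Spurious_ORF_10": (196, 795),
    "Spurious_ORF_11": (89, 170),
    "Spurious_ORF_12": (143, 408),
    "Spurious_ORF_13": (77, 671),
    "Spurious_ORF_14": (156, 191),
    "Spurious_ORF_15": (31, 63),
    "Spurious_ORF_16": (40, 17),
    "Spurious_ORF_17": (5, 10),
    "Spurious_ORF_18": (4, 39),
    "Spurious_ORF_19": (7, 12),
    "Spurious_ORF_20": (43, 28),
    "Spurious_ORF_21": (24, 5),
    "Spurious_ORF_22": (26, 1),
}


def totals(*tables):
    """Sum the UniProtKB and metagenomic counts across one or more tables."""
    uniprot = sum(u for table in tables for u, _ in table.values())
    metagenomic = sum(m for table in tables for _, m in table.values())
    return uniprot, metagenomic


n_families = len(PFAM_DERIVED) + len(CUSTOM)
uniprot_total, metagenomic_total = totals(PFAM_DERIVED, CUSTOM)

print(n_families, uniprot_total, metagenomic_total)  # prints: 23 1310 4119
```

The sums agree with the abstract: 469 + 841 = 1310 UniProtKB matches and 1709 + 2410 = 4119 metagenomic matches across the 10 Pfam-derived and 13 custom families.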