Liisa B Koski, Michael W Gray, B Franz Lang, Gertraud Burger.
Abstract
BACKGROUND: Assignment of function to new molecular sequence data is an essential step in genomics projects. The usual process involves similarity searches of a given sequence against one or more databases, an arduous process for large datasets.
Year: 2005 PMID: 15960857 PMCID: PMC1182349 DOI: 10.1186/1471-2105-6-151
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
AutoFACT annotation classes
| YES | N/A | N/A | N/A | N/A | N/A | |
| NO | YES | YES | YES | N/A | N/A | |
| NO | YES | YES/NO | NO | NO | N/A | |
| NO | YES/NO | NO | NO | YES | N/A | |
| NO | NO | N/A | N/A | NO | YES | |
| NO | NO | N/A | N/A | NO | NO |
Figure 1. AutoFACT methodology. Sequences are classified into one of six annotation categories (purple boxes). The user decides which bit score cutoff to use (default 40) before a BLAST hit is considered significant. For database references, see text.
Databases searched and classification information assigned by AutoFACT
| Database | Classification information | Reference |
| European Ribosomal Database | Large subunit (LSU) ribosomal RNAs | [25] |
| UniProt's UniRef90 | Gene Ontology terms; Enzyme Commission numbers; locus names | [16,26] |
| UniProt's UniRef100 | | |
| Clusters of Orthologous Groups (COG) | Functional categories | [27,28] |
| Kyoto Encyclopedia of Genes and Genomes (KEGG) | Metabolic pathways; Enzyme Commission numbers; locus names | [29,30] |
| Protein Families Database (Pfam) | Protein domains | [31] |
| SMART | Signaling domains | [32] |
| NCBI's non-redundant database (nr) | N/A | [33] |
| NCBI's est_others database | | |
Database description line formats from ACL00000101 BLAST hits
| UniRef90 | ATP synthase beta chain related cluster |
| UniRef100 | ATP synthase subunit beta [Salmonella typhimurium] |
| NCBI's nr | ATP synthase beta chain [Erwinia carotovora subsp. atroseptica SCRI1043] emb|CAG77407.1| ATP synthase beta chain [Erwinia carotovora subsp. atroseptica SCRI1043] |
| KEGG | atpD; membrane-bound ATP synthase, F1 sector, beta-subunit [EC:3.6.3.14] [KO:K02112] |
| COG | [C] COG0055 F0F1-type ATP synthase, beta subunit |
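Because each database formats its description line differently, a parser has to treat them case by case before the product names can be compared. The following is a minimal sketch of that idea in Python, based only on the five example lines above. The function name `clean_description` and the returned dictionary keys are hypothetical, and the regular expressions are assumptions inferred from these examples rather than AutoFACT's own parsing rules.

```python
import re


def clean_description(db: str, line: str) -> dict:
    """Split one BLAST description line into a product name plus identifiers.

    The patterns below are inferred from the sample lines in the table and
    are assumptions, not the rules used by AutoFACT itself.
    """
    line = line.strip()
    info = {"product": line}
    if db == "KEGG":
        # e.g. "atpD; membrane-bound ATP synthase ... [EC:3.6.3.14] [KO:K02112]"
        info["ec"] = re.findall(r"\[EC:([^\]]+)\]", line)
        info["ko"] = re.findall(r"\[KO:([^\]]+)\]", line)
        body = re.sub(r"\s*\[(?:EC|KO):[^\]]+\]", "", line).strip()
        locus, sep, product = body.partition("; ")
        if sep:
            info["locus"], info["product"] = locus, product
        else:
            info["product"] = body
    elif db == "COG":
        # e.g. "[C] COG0055 F0F1-type ATP synthase, beta subunit"
        m = re.match(r"\[([A-Z]+)\]\s+(COG\d+)\s+(.*)", line)
        if m:
            info["category"], info["cog"], info["product"] = m.groups()
    elif db == "UniRef90":
        # Drop the trailing "related cluster" suffix
        info["product"] = re.sub(r"\s+related cluster$", "", line)
    elif db in ("UniRef100", "nr"):
        # Keep the text before the first "[organism]" tag
        m = re.match(r"(.*?)\s*\[([^\]]+)\]", line)
        if m:
            info["product"], info["organism"] = m.group(1), m.group(2)
    return info
```

Applied to the KEGG line above, this sketch would separate the locus name `atpD`, the product name, the EC number and the KO identifier into distinct fields.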
Figure 2. Sample HTML output for AutoFACT annotation of Acanthamoeba castellanii EST cluster ACL00000152. Automatic annotation results are displayed at the top of the page and all data used to infer the annotation are represented in the bottom part of the table. Percent sequence identity is the extent to which two (nucleotide or amino acid) sequences, in a High Scoring Segment Pair (HSP), are invariant. In the case of the est_others data, the reported % sequence identity refers to a "translated nucleotide – translated nucleotide" comparison. The "Informative Hit" value specifies whether the first, second, etc., BLAST hit in the corresponding database was informative. The "Color Key for Alignment Scores" displayed at the top of the diagram is from NCBI's BLAST Results page. The scores for the annotation and for the source of the annotation, 627 in this example, are highlighted according to the color key. The page also contains links to relevant database entries.
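As a small worked illustration of percent sequence identity, the value is the number of identical positions in the HSP divided by the aligned length. The sketch below formats it the way such values appear in the comparison tables of this record, for example "50% (360/718)". The function name `format_identity` is hypothetical, and the use of truncation rather than rounding is an assumption inferred from the tabulated values (154/539 is listed as 28%, not 29%).

```python
def format_identity(identical: int, aligned: int) -> str:
    """Format HSP identity as 'P% (identical/aligned)'.

    Assumption: the percentage is truncated to a whole number, which matches
    the values shown in the comparison tables.
    """
    if aligned <= 0:
        raise ValueError("aligned length must be positive")
    percent = 100 * identical // aligned  # integer division truncates
    return f"{percent}% ({identical}/{aligned})"


print(format_identity(360, 718))
```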
Figure 3. Comparison of AutoFACT annotations across four phylogenetically diverse organisms previously annotated by well-established automatic pipelines. Two hundred previously annotated cDNAs from Homo sapiens [Ensembl Annotation Pipeline], Saccharomyces cerevisiae [MIPS/PEDANT], Plasmodium falciparum [TIGR] and Rickettsia prowazekii [GeneQuiz] were re-annotated with AutoFACT using a bit score cutoff of 40 and a database order of importance as follows: UniRef90, KEGG, COG, NCBI's nr, Pfam and SMART. The top 10 BLAST hits to each database were filtered for functionally uninformative terms. BLAST hits to the species itself were considered uninformative. The portion of the bar representing different results from AutoFACT (dark purple) should not be construed as false positives. For example, in the case of GeneQuiz (4.5% differences), it is the AutoFACT annotation that is the better of the two in almost all instances (see Results section). Numbers printed directly on columns represent the number of cDNA sequences (out of 200) in each category.
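The filtering described in this caption (bit score cutoff, top 10 hits per database, removal of uninformative and self-species hits, database order of importance) can be sketched roughly as follows. This is a simplified illustration and not AutoFACT's actual decision procedure, which assigns one of six annotation classes as outlined in Figure 1. The `Hit` class, the function names, and the contents of `UNINFORMATIVE_TERMS` are all hypothetical. The sketch also assumes that hits are supplied best-first and that a score equal to the cutoff counts as significant.

```python
from dataclasses import dataclass

# Database order of importance as given in the caption
DATABASE_ORDER = ["UniRef90", "KEGG", "COG", "nr", "Pfam", "SMART"]

# Hypothetical term list: the source does not enumerate the uninformative terms
UNINFORMATIVE_TERMS = ("hypothetical", "unknown", "unnamed", "uncharacterized")


@dataclass
class Hit:
    database: str
    description: str
    bit_score: float
    organism: str = ""


def is_informative(hit: Hit, query_species: str = "", cutoff: float = 40.0) -> bool:
    """Return True if the hit is significant and functionally informative."""
    if hit.bit_score < cutoff:
        return False  # below the bit score cutoff
    if query_species and hit.organism.lower() == query_species.lower():
        return False  # hits to the species itself count as uninformative
    text = hit.description.lower()
    return not any(term in text for term in UNINFORMATIVE_TERMS)


def pick_annotation(hits_by_db, query_species="", cutoff=40.0, top_n=10):
    """Return the first informative hit, walking databases in priority order."""
    for db in DATABASE_ORDER:
        for hit in hits_by_db.get(db, [])[:top_n]:
            if is_informative(hit, query_species, cutoff):
                return hit
    return None  # no informative hit in any database
```

Walking the databases in a fixed order means a lower-scoring but informative hit in a preferred database can win over a higher-scoring hit further down the list, which is the trade-off a user-defined order of importance implies.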
Differences found between AutoFACT and PEDANT annotations for Saccharomyces cerevisiae
| ID | PEDANT Annotation | AutoFACT Annotation | AutoFACT Score | AutoFACT E-value | AutoFACT % Identity |
| yal048c | vacuolar aspartic protease | 1724 | 0.0 | 50% (360/718) | |
| yhr064c | multi-domain protein | 651 | 3.00E-68 | 28% (154/539) | |
| yhr046c | *Protein qutG related cluster | 378 | 4.00E-35 | 31% (99/310) | |
| yhr143w | multi-domain protein | 229 | 2.00E-19 | 25% (70/278) | |
| yhl043w | DUP domain-containing protein | 205 | 5.00E-17 | 36% (26/72) | |
| yal047c | *Repeat organellar protein related cluster | 160 | 2.00E-09 | 20% (124/620) | |
| yhr167w | *Myosin heavy chain related cluster | 129 | 2.00E-06 | 24% (51/210) | |
| yhr154w | BRCT domain-containing protein | 118 | 4.00E-06 | 28% (24/83) | |
| yhl020c | multi-domain protein | 114 | 5.00E-06 | 24% (30/123) | |
| yhr196w | Borrelia_orfA domain-containing protein | 104 | 1.00E-04 | 19% (75/376) |
Annotations in bold are the same as the original annotations found in the Saccharomyces Genome Database. AutoFACT annotations marked with an asterisk (*) are considered false positives.
Differences found between AutoFACT and TIGR preliminary annotations for Plasmodium falciparum
| ID | TIGR Preliminary Annotation | AutoFACT Annotation | AutoFACT Score | AutoFACT E-value | AutoFACT % Identity |
| 1396.m03572 | PF14_0675 reticulocyte binding protein 2 homolog B, putative Reticulocyte Binding protein; | multi-domain protein | 157 | 1E-10 | 18% (60/320) |
| 1396.m03591 | PF14_0655 RNA helicase-1, putative | Eukaryotic translation initiation factor 4A related cluster | 1591 | 1E-177 | 79% (310/388) |
| 1396.m03721 | PF14_0530 ferlin, putative | heat shock protein DNAJ pfj4 | 534 | 6E-53 | 40% (103/252) |
| 1396.m04144 | PF14_0112 POM1, putative | Twinkle related cluster | 152 | 6E-08 | 38% (34/89) |
| 1396.m04178 | PF14_0078 HAP protein | Asp domain-containing protein | 535 | 8E-55 | 26% (100/371) |
| 1396.m04220 | PF14_0036 acid phosphatase, putative | Metallophos domain-containing protein | 134 | 2E-08 | 20% (45/220) |
| 1396.m04244 | PF14_0015 aminopeptidase, putative | hydrolase, alpha/beta fold family | 179 | 5E-12 | 22% (66/288) |
| 1396.m04296 | PF14_0382 metalloendopeptidase, putative | multi-domain protein | 118 | 0.000006 | 16% (50/297) |
Differences found between AutoFACT and GeneQuiz annotations for Rickettsia prowazekii
| ID | Gene Quiz Annotation | AutoFACT Annotation | AutoFACT Score | AutoFACT E-value | AutoFACT % Identity |
| RP103 | PKM101 CONJUGATION PROTEINS (TRAL), (TRAM), (TRAA), (TRAB), (TRAC), (TRAB), (TRAC), (TRAD), (TRAN), (TRAE), (TRAO), (TRAF), (TRAG), ENTRY EXCLUSION PROTEIN (EEX), (KIKA), (KORB), (KORA) AND ENDONUCLEASE (NUC) GENES, COMPLETE CDS (TRAM) (TRAB) (TRAB) (TRA | 4159 | 0.0 | 100% (805/805) | |
| RP151 | NEMPA PROTEIN PRECURSOR. | 2004 | 0.0 | 82% (398/483) | |
| RP259 | D-STEREOSPECIFIC PEPTIDE HYDROLASE PRECURSOR. | 2048 | 0.0 | 96% (398/414) | |
| RP268 | NADH-UBIQUINONE OXIDOREDUCTASE CHAIN 2 (EC 1.6.5.3). | 794 | 3E-84 | 74% (160/215) | |
| RP282 | *HyfB domain-containing protein related cluster | 1821 | 0.0 | 74% (380/512) | |
| RP287 | CAVEOLIN-2. | 1047 | 1E-114 | 85% (212/247) | |
| RP291 | CONJUGAL TRANSFER PROTEIN TRBI. | 2016 | 0.0 | 85% (413/483) | |
| RP293 | CONJUGAL TRANSFER PROTEIN TRAG. | 3002 | 0.0 | 97% (577/591) | |
| RP414 | LPS BIOSYNTHESIS RFBU RELATED PROTEIN. | *Glycosyltransferase related cluster | 1614 | 1E-180 | 92% (314/338) |
Annotations in bold are the same as the original annotations by Andersson et al. (1998).
AutoFACT annotations marked with an asterisk (*) are considered false positives.
Figure 4. Distribution of informative versus uninformative annotations. A. castellanii ESTs (5,130 clusters) were annotated in three ways: (A) by top BLAST hit to NCBI's nr database; (B) by top BLAST hit to UniProt's UniRef90 database; and (C) by AutoFACT. The "uninformative rule" (Andrade et al., 1999) was used to query description lines assigned by all methods. AutoFACT yields an approximately 50% increase in informative annotations compared to top BLAST hits against NCBI's nr and the UniRef90 databases. AutoFACT's annotation source is shown in parentheses.