William Klimke, Claire O'Donovan, Owen White, J Rodney Brister, Karen Clark, Boris Fedorov, Ilene Mizrachi, Kim D Pruitt, Tatiana Tatusova.
Abstract
The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI, in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries.
Year: 2011 PMID: 22180819 PMCID: PMC3236044 DOI: 10.4056/sigs.2084864
Source DB: PubMed Journal: Stand Genomic Sci ISSN: 1944-3277
Databases, tools, and resources for genomes and annotation.
| NCBI Genome Annotation Workshop | All information from this publication, the Annotation Workshop, and future announcements will be made available | ||
| Difference between Archive and Curated Databases | GenBank, RefSeq, TPA and UniProt: What’s in a Name? | Microbe Online | |
| Difference between Archive and Curated Databases | GenBank, RefSeq, TPA and UniProt: What’s in a Name? | NCBI Handbook | |
| INSDC | International Nucleotide Sequence Database Collaboration | ||
| INSDC Feature Table | Feature table document | ||
| DDBJ | DNA Databank of Japan | [ | |
| ENA | European Nucleotide Archive | [ | |
| GenBank | GenBank | [ | |
| NCBI Prokaryotic Genomes Automatic Annotation Pipeline (PGAAP) | Intended for use during the annotation of prokaryotic genomes in preparation for submission to GenBank – capable of annotating complete genomes as well as WGS genomes | ||
| JCVI Annotation Service | Anyone with a prokaryotic genome sequence in need of annotation may submit it to the JCVI Annotation Service completely free-of-charge | ||
| IGS Annotation Engine | a free resource for genomics researchers and educators bringing advanced bioinformatics tools to the lab bench and the classroom. | ||
| KAAS - KEGG Automatic Annotation Server | KAAS (KEGG Automatic Annotation Server) provides functional annotation of genes by BLAST comparisons against the manually curated KEGG GENES database with resulting KO (KEGG Orthology) assignments and automatically generated KEGG pathways | [ | |
| RAST | RAST (Rapid Annotation using Subsystem Technology) is a fully automated service for annotating bacterial and archaeal genomes – provides high quality genome annotations for these genomes across the whole phylogenetic tree | [ | |
| DOE-JGI MAP | Expert Review Data Submission: Microbial Genomes & Management | [ | |
| NCBI Submission Check Tool | For the validation of genome submissions to GenBank – utilizes a series of self-consistency checks as well as comparison of submitted annotations to computed annotations - web-based and downloadable versions available | ||
| NCBI Sequin Validation | Sequin is a standalone tool for submitting and updating sequences | [ | |
| NCBI TBL2ASN | Command-line tool for automation of sequence records to GenBank | [ | |
| NCBI Discrepancy report | Evaluation of ASN.1 files for annotation discrepancies - part of Sequin, available separately as downloadable command line version, and part of tbl2asn | [ | |
| Broad's Gene Pidgin (formerly BioName) | Pidgin is a suite of tools that evaluate and automatically assign gene product names - standardization, comparison, and then selection of the best name | ||
| JCVI's Protein Naming Utility | To address the need to generate high-quality protein names - a web-based database for storing and applying naming rules to identify and correct syntactically incorrect protein names, or to replace synonyms with their preferred name | [ | |
| Frameshift Tool | Frameshift detection in protein coding genes. | [ | |
| Annotation Report | Quantitative report on genome annotation | ||
| GenBank Bacterial Genome Submission Guidelines | Bacterial genome specific submission guide | [ | |
| Annotation Instructions | Detailed instructions on bacterial genome submissions | [ | |
| Project Submission | Bioproject submission form | [ | |
| Locus_tag proposal | Accepted locus_tag standards for INSDC | ||
| UniProt's Protein Naming Guidelines | UniProt's general protein naming guidelines | ||
| UniProt's Protein Naming Guidelines - Prokaryotes | UniProt's prokaryotic-specific protein naming guidelines - adopted by INSDC | ||
| GSC Structured Format | Accepted structured format for genome metadata including SOPs | [ | |
| Insertion Sequences | Insertion sequence finder, nomenclature, and registry | [ | |
| Transposons | Transposon nomenclature and registry | [ | |
| Enzyme Commission Numbers | Official NC-IUBMB site | ||
| UniProt ENZYME | ENZYME is a repository of information relative to the nomenclature of enzymes. | ||
| NCBI COGs | Clusters of orthologous groups - no longer actively curated | [ | |
| NCBI ProtClustDB | Cliques of related proteins - curated and uncurated – for multiple organism groups including prokaryotes and viruses | [ | |
| NCBI Cluster Comparison Tool | Protein family comparison for functional annotation | ||
| NCBI Cluster Comparison Tool - Core Mode | Protein family core comparison for functional annotation | ||
| List of Core Clusters | Protein family core list | ||
| UniProt HAMAP | system, based on manual protein annotation, that identifies and semi-automatically annotates proteins that are part of well-conserved families or subfamilies in prokaryotes and plastids | [ | |
| KEGG Orthology Groups | Manually defined ortholog groups that correspond to KEGG pathway nodes and BRITE hierarchy nodes | [ | |
| JCVI's TIGRFAMs | Protein families based on Hidden Markov Models | [ | |
| ACLAME | Database dedicated to the collection and classification of mobile genetic elements | [ | |
| E. coli CCDS Project | Comparison of annotation for model E. coli K-12 MG1655 |
Summary of structured evidence for INSDC feature annotation1
| Allowed tokens | |||
|---|---|---|---|
| | | | |
| free text | Yes | No | free text describing the experiment |
| non experimental structured format | No | Yes | structured format of TYPE + EVIDENCE_BASIS (type includes “non experimental”, |
| Yes | Yes | support for annotated coordinates | |
| Yes | Yes | support for description including function | |
| Yes | Yes | support for existence of feature in this organism | |
| Yes | No | publication describing experimental evidence | |
1. Changes proposed and accepted by INSDC to the /experiment and /inference qualifiers. The new tokens (bolded) are optional for both qualifiers.
2. A brief description of the nature of the experimental evidence that supports the feature identification or assignment.
3. A structured description of non-experimental evidence inferring and supporting feature identification or assignment.
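As a rough illustration of the structured qualifiers summarized above, the sketch below assembles an /inference value as an optional category token, a type, and an optional evidence basis, joined by colons. The table above lost its token column in extraction, so the category tokens (COORDINATES, DESCRIPTION, EXISTENCE) and the colon-separated layout are assumptions based on the INSDC feature table definition; the function name `format_inference` and the example evidence basis are hypothetical.

```python
# Assumed category tokens; they correspond to the three "support for ..."
# rows in the table above (coordinates, description, existence).
CATEGORIES = {"COORDINATES", "DESCRIPTION", "EXISTENCE"}


def format_inference(inference_type, evidence_basis=None, category=None):
    """Build an /inference qualifier string (hypothetical helper).

    Assumed layout: [CATEGORY:]TYPE[:EVIDENCE_BASIS]
    """
    if category is not None and category not in CATEGORIES:
        raise ValueError(f"unknown category token: {category!r}")
    parts = []
    if category is not None:
        parts.append(category)      # optional token
    parts.append(inference_type)    # e.g. "profile"
    if evidence_basis:
        parts.append(evidence_basis)  # e.g. a program name and version
    return '/inference="' + ":".join(parts) + '"'


if __name__ == "__main__":
    # Illustrative values only.
    print(format_inference("profile", "tRNAscan:2.1", category="COORDINATES"))
```

Because the category token is optional, calling the helper without `category` yields the older two-part form.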
Selected annotation report examples1
| 225 | 1 | 4.640 | 50.79 | 4144 | 175 | 21 | 0.89 | 316 | 14 | 20.32 | 90.54 | ||||
| 76 | 1 | 4.216 | 43.51 | 4177 | 178 | 20 | 221 | 0.99 | 294 | 20 | 26.48 | 77.76 | |||
| 17977 | Candidatus | 1 | 0.160 | 182 | 31 | 20 | 44 | 1.14 | 274 | 37 | 32.42 | 96.15 | |||
| 32135 | Candidatus | 1 | 58.39 | 18 | 12* | 37 | 1.18 | 257 | 38 | 33.73 | |||||
| 46847 | 1 | 11.937 | 70.75 | 84 | 21 | 3606 | 0.84 | 342 | 24 | 19.86 | 60.69 | ||||
| 19943 | 1 | 1.268 | 32.45 | 1384 | 37 | 19* | 607 | 1.09 | 17 | 73.55 | |||||
| 81 | 1 | 2.799 | 28.75 | 2373 | 72 | 20 | 247 | 0.85 | 336 | 12.09 | 68.27 | ||||
| 12634 | 1 | 5.013 | 4346 | 58 | 21 | 965 | 0.87 | 349 | 38 | 15.85 | 69.21 | ||||
| 49535 | 1 | 2.616 | 67.27 | 2375 | 51 | 20 | 721 | 0.91 | 317 | 21.14 | 70.57 | ||||
| 43535 | 1 | 1.828 | 32.94 | 1350 | 120 | 21 | 86 | 0.74 | 352 | 95 | 80.00 | ||||
| 105 | 2 | 3.420 | 61.93 | 3412 | 59 | 20 | 1 | 1.00 | 285 | 30 | 27.02 | ||||
| 13128 | 2 | 6.323 | 41.71 | 5413 | 209 | 21 | 2490 | 0.86 | 316 | 35 | 21.97 | 73.88 | |||
| 28711 | 1 | 9.446 | 69.48 | 6719 | 55 | 20 | 1827 | 0.71 | 32 | 13.37 | 79.67 | ||||
| 244 | 1 | 6.414 | 41.35 | 5368 | 64 | 20 | 0.84 | 326 | 17 | 25.58 | 82.41 | ||||
| 19857 | 2 | 5.969 | 45.44 | 5944 | 159 | 20 | 1.00 | 286 | 24 | 30.43 | 84.84 | ||||
| 28111 | 1 | 71.38 | 9375 | 4170 | 0.72 | 401 | 30 | 13.08 | 73.33 | ||||||
| 344 | 1 | 5.057 | 61.09 | 4700 | 247 | 0.93 | 309 | 40 | 19.57 | 80.83 | |||||
| 31271 | 1 | 3.268 | 57.80 | 1604 | 47 | 20 | 143 | 335 | 33 | 21.01 | 54.30 | ||||
| 29335 | 1 | 2.232 | 52.37 | 2662 | 67 | 20 | 324 | 240 | 32 | 41.81 | 71.22 | ||||
1. Selected genomes and categories for INSDC genomes are shown. The first two rows are for the model organisms E. coli and B. subtilis. The other genomes were selected as the minimum (bolded) or maximum (bolded and underlined) in the categories shown. Those marked with an asterisk fall below the minimal standards described in this publication.
2. INSDC Bioproject ID for each genome [57].
3. Number of proteins annotated as 'hypothetical protein'.
4. Number of proteins per Kbp ((total number of proteins/genome length (bp)) * 1000).
5. Number of amino acids for which at least one tRNA is annotated in the genome (excluding predicted or annotated pseudo tRNAs).
6. Percent of short proteins (number less than 150 amino acids in length/total number of proteins * 100).
7. Percent of standard starts for proteins (number of standard starts (ATG)/total starts * 100).
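The three ratio measures defined in notes 4, 6 and 7 can be sketched as small functions. This is a minimal illustration of the stated formulas, not the code behind the annotation report; the function names are hypothetical, and the usage example assumes that the first table row (E. coli) lists a genome size of 4.640 Mbp and 4144 proteins.

```python
def proteins_per_kbp(protein_count, genome_length_bp):
    """Note 4: (total number of proteins / genome length in bp) * 1000."""
    return protein_count / genome_length_bp * 1000


def percent_short_proteins(protein_lengths_aa, cutoff_aa=150):
    """Note 6: proteins shorter than 150 amino acids / total proteins * 100."""
    short = sum(1 for length in protein_lengths_aa if length < cutoff_aa)
    return short / len(protein_lengths_aa) * 100


def percent_standard_starts(start_codons, standard="ATG"):
    """Note 7: number of standard (ATG) starts / total starts * 100."""
    standard_count = sum(1 for codon in start_codons if codon.upper() == standard)
    return standard_count / len(start_codons) * 100


if __name__ == "__main__":
    # Assumed reading of the E. coli row: 4144 proteins, 4.640 Mbp genome.
    print(round(proteins_per_kbp(4144, 4_640_000), 2))
    # Toy inputs, for illustration only.
    print(percent_short_proteins([100, 149, 150, 400]))
    print(percent_standard_starts(["ATG", "GTG", "ATG", "TTG"]))
```

Under that reading of the columns, the computed density rounds to the 0.89 proteins per Kbp shown in the same row.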
Figure 1. Selected comparisons of genome measures. Principal component analysis showed expected relationships among the different measures (data not shown). Selected examples are plotted as double y-axis scatterplots. Legends indicate first or second y-axis for blue dots or red crosses, respectively. Linear regression analysis of each y-axis variable independently with respect to the x-axis variable was done and the trend line is drawn on each plot color-coded with respect to each measure. R² and p-values are shown for each measure. A-B. Numbers of annotated proteins and RNAs with respect to genome size from INSDC and RefSeq annotation sets for complete prokaryotic genomes. Feature counts were obtained from the Complete Microbial Genomes Annotation Report (Aug 10, 2010) and proteins and RNAs from INSDC and RefSeq are plotted with respect to genome length. The count of proteins follows a linear increase with respect to increasing genome size (blue trend line) while the RNA count, which includes all transfer, ribosomal, and non-coding RNAs, shows less of an increase with respect to genome size. Some genomes have extensively annotated RNA features, whereas others do not. A. All INSDC genomes (total of 1218 as of Aug 10, 2010). Those records that fall below minimal standards for essential RNAs are encircled (red ellipse). B. RefSeq genomes (total of 1148 genomes as of Aug 10, 2010). Note that not all INSDC genomes are copied into RefSeq records. For the cases where INSDC records were missing essential RNAs, if there was a RefSeq version, the essential RNAs have been added or properly labeled. In all cases where the full set of essential RNAs could not be annotated it appeared that the missing RNA(s) were either non-functional or completely missing from the genome sequence (Table 3; data not shown). C. Protein lengths with respect to coding density for INSDC annotations. As coding density increases (more proteins per Kbp) the average protein length decreases (blue trend line) and the ratio of short proteins increases (red trend line). D. Hypothetical proteins and start codon ratios versus coding density. The ratio of proteins named 'hypothetical' increases slightly as the coding density increases whereas the standard start codon ratio decreases. Genomes where the 'hypothetical protein' ratio is 1 or near 1 (large blue ellipse - every protein is annotated as 'hypothetical protein' in the genome) fall below the minimal annotation standards. For these particular cases, if a RefSeq version of the annotation existed, the functional assignment of a number of proteins was improved via curated clusters in the NCBI ProtClustDB (data not shown).
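The trend lines described in the Figure 1 caption come from ordinary linear regression of one measure against another. The sketch below shows a least-squares fit with its coefficient of determination in plain Python; it is a generic illustration, not the authors' analysis code. The function name `linear_fit` and the input values are made up for the example, and p-values are not computed.

```python
def linear_fit(xs, ys):
    """Least-squares fit of y = slope * x + intercept; also returns R squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, R squared is the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


if __name__ == "__main__":
    # Hypothetical data: genome size (Mbp) versus protein count.
    genome_size_mbp = [1.0, 2.0, 3.0, 4.0]
    protein_count = [950, 1900, 2800, 3700]
    slope, intercept, r_squared = linear_fit(genome_size_mbp, protein_count)
    print(slope, intercept, r_squared)
```

Each y-axis measure would be fitted separately against the same x-axis variable, which matches the independent regressions described in the caption.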
Figure 2. Heatmap of selected annotation report measures for gammaproteobacteria. A set of measures was chosen corresponding to those used in principal component analysis (data not shown) but restricted to INSDC genomes from gammaproteobacteria. A two-dimensional clustering of the selected and scaled data (subtraction of column means, division by standard deviation) yields clusters similar to those obtained in the PCA (data not shown). For Figure 2, no clustering was done and the input genomes are arranged alphabetically by organism name and shaded to indicate different genera. A color-key and histogram at bottom right indicate the relative intensities of the annotation measures (the histogram applies to all measures, color intensities apply to each cell). Genomes described in the text are in bold.
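The scaling step mentioned in the Figure 2 caption (subtract each column mean, then divide by the column standard deviation) can be sketched as follows. This is a minimal illustration under two assumptions: the sample standard deviation (n - 1 denominator) is used, which the caption does not specify, and every column has nonzero spread. The function name `scale_columns` and the input values are hypothetical.

```python
from statistics import mean, stdev


def scale_columns(rows):
    """Scale each column to zero mean and unit (sample) standard deviation.

    rows: list of equal-length lists, one list per genome, one column per measure.
    A constant column has a standard deviation of zero and raises ZeroDivisionError.
    """
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    sds = [stdev(column) for column in columns]  # assumption: sample SD
    return [
        [(value - m) / sd for value, m, sd in zip(row, means, sds)]
        for row in rows
    ]


if __name__ == "__main__":
    # Made-up measures for three genomes: [proteins per Kbp, percent short proteins].
    measures = [[0.8, 10.0], [0.9, 20.0], [1.0, 30.0]]
    for row in scale_columns(measures):
        print(row)
```

Scaling puts measures with different units on a common footing, so that no single measure dominates the color intensities in the heatmap.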
Pseudogene annotation strategies and outcomes
| 1 | Pseudogene | "/pseudo" | pseudogene | no translation; product name is in note, associated feature (CDS, tRNA, rRNA, etc.) will be annotated | No | |
| 2 | Potential pseudogene | N/A | normal gene annotated, potential pseudogene status in note | no CDS feature, not documented as a pseudogene, not trackable as protein vs. RNA-coding | No | |
| 3a | Frameshifted gene and sequence | "/pseudo" | combine intervals into a single gene with /pseudo | no translation; product name is in note | No | |
| 3b | Frameshifted gene and sequence | N/A | keep both and add a note to each CDS | two separate coding regions and two protein translations | Yes (Both) | |
| 3c* | Frameshifted gene and there are sequence | /exception="annotated by transcript or proteomic data" AND ("/experiment" OR "/inference") | experimental evidence defining the evidence that translation is correct and/or inference pointing to Accession Number with correct translation | protein sequence imported - translation does not match nucleotide | Yes | |
| 3d | Frameshifted gene and there are sequence | "/artificial_location" | locations altered for 'correct' location | all protein deflines prefaced with “LOW-QUALITY PROTEIN:” | Yes | |
| 4 | Region of similarity | N/A | misc_feature denoting location of region of similarity | no gene, no locus_tag, not systematically enumerated | No | |
| 5 | Potential unresolvable problems | N/A | note explaining the issue | no change in annotation | Yes | |
| 6 (see note 4) | Split/interrupted gene in the case of an insertion (ex. transposon insertion) | N/A | could be either a single interval, or a split interval, annotation depends on consequence of insertion | no standards for split genes, locations do not match regions of similarity | No |
1. Qualifier to be used on feature.
2. Downstream consequence of annotation decision, including impacts on presentation of the record.
3. Whether a protein sequence is encoded and will be present in protein and BLAST databases. Note that BLAST databases only provide the ability to differentiate proteins based on defline changes, i.e., cases 3b, 3c, and 5 present undifferentiated protein deflines in BLAST databases whereas case 3d has an altered protein defline.
4. Insertions can result in complicated cases such as gene fusion events. These annotation results should be due to real insertions, not simply regions of the genome that exhibit weak similarity to a part of a protein sequence.
Core proteins added to RefSeq genomes1
| Protein name | Number added | Average protein length +/- SD (amino acids) |
|---|---|---|
| 30S ribosomal protein S8 | 1 | 131.4+/-2.1 |
| 30S ribosomal protein S11 | 1 | 130.1+/-5.8 |
| 30S ribosomal protein S14 | 10 | 84.1+/-19.3 |
| 30S ribosomal protein S15 | 3 | 94.1+/-17.1 |
| 30S ribosomal protein S19 | 9 | 96.1+/-15.0 |
| 50S ribosomal protein L2 | 1 | 273.8+/-10.2 |
| 50S ribosomal protein L11 | 1 | 144.4+/-7.0 |
| 50S ribosomal protein L23 | 2 | 99.2+/-10.3 |
| 50S ribosomal protein L29 | 7 | 68.2+/-9.8 |
| elongation factor P | 1 | 185.4+/-16.9 |
| flap-1 endonuclease | 2 | 832.6+/-204.1 |
| translation initiation factor IF-1 | 4 | 77.3+/-11.1 |
1. Search for protein and nucleotide against RefSeq genomes (Aug. 10, 2010) identified cases where gene/protein were not present as either normal or non-functional. In those cases, a new gene/CDS/protein was added to the RefSeq record.
2. Protein name/functional name.
3. Number of proteins added for each category, in some cases multiple additions to the same genome.
4. The average protein length and standard deviation of lengths for all proteins from all clusters for each functional group. In some cases there are multiple protein clusters for one functional group.
Minimal annotation standards and guidelines accepted at the 2010 NCBI Genome Annotation Workshop1
| a. set of ribosomal RNAs (at least one each 5S, 16S, 23S) |
| b. a set of tRNAs (at least one for each amino acid) |
| c. protein-coding genes at expected density (not all named 'hypothetical protein' and all core genes annotated) |
| Annotation standards should follow feature table format and submission guidelines (GenBank/ENA/DDBJ - |
| a. prior to genome submission a submitted Bioproject record with a registered locus_tag prefix is required and the genome record should contain the Bioproject ID. All proper features should have genes and locus_tags |
| b. the genome submission should be valid according to feature table documentation and follow the standards |
| Information about SOPs and additional meta data can be provided in a structured comment with more specific information about experimental or inference support provided on annotated features (see |
| Exceptions (unusual annotations, annotations not within expected ranges - see |
| Annotated pseudogenes should follow the accepted formats (see |
| Additional (enriched) annotations should follow INSDC guidelines, and be documented as above (SOPs and evidence). |
| This non-exhaustive list of reliable software, sources, and databases for the production of microbial genome annotation is a useful community resource that aids in producing high quality genome annotation ( |
| Validation checks should be done prior to the submission of a new genome record. NCBI has already provided numerous tools to validate and ensure correctness of annotation and additional checks and reports will be put in place to ensure minimal standards are met (see |
1 Guidelines were created for complete genomes (all replicons closed to single contigs). In some cases the minimal set of annotations will not be found on draft genomes, but the guidelines for annotation still apply.
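The first three minimal standards listed above (a complete rRNA set, at least one tRNA per amino acid, and proteins that are not all named 'hypothetical protein') lend themselves to a simple automated check. The sketch below is illustrative only and is not one of the NCBI validation tools; the `GenomeSummary` type, its field names, and the threshold of 20 amino acids are assumptions, and the checks for core genes and expected coding density are omitted.

```python
from dataclasses import dataclass

REQUIRED_RRNAS = ("5S", "16S", "23S")
AMINO_ACID_COUNT = 20  # assumption: the 20 standard amino acids


@dataclass
class GenomeSummary:
    """Hypothetical summary of an annotated complete genome."""
    rrna_types: set          # e.g. {"5S", "16S", "23S"}
    trna_amino_acids: set    # amino acids with at least one annotated tRNA
    protein_names: list      # product names of all annotated proteins


def minimal_standard_problems(summary):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    missing_rrnas = [r for r in REQUIRED_RRNAS if r not in summary.rrna_types]
    if missing_rrnas:
        problems.append("missing rRNA(s): " + ", ".join(missing_rrnas))
    if len(summary.trna_amino_acids) < AMINO_ACID_COUNT:
        problems.append(
            f"tRNAs cover only {len(summary.trna_amino_acids)} amino acids"
        )
    names = summary.protein_names
    if not names or all(name == "hypothetical protein" for name in names):
        problems.append("no proteins with a functional name")
    return problems


if __name__ == "__main__":
    draft = GenomeSummary(
        rrna_types={"5S", "16S"},
        trna_amino_acids=set("ACDEFGHIKLMNPQRSTVW"),  # 19 one-letter codes
        protein_names=["hypothetical protein", "hypothetical protein"],
    )
    for problem in minimal_standard_problems(draft):
        print(problem)
```

Returning a list of problems rather than a single pass/fail flag lets a submitter see every unmet standard in one run.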