Daniel H Haft, Michael DiCuccio, Azat Badretdin, Vyacheslav Brover, Vyacheslav Chetvernin, Kathleen O'Neill, Wenjun Li, Farideh Chitsaz, Myra K Derbyshire, Noreen R Gonzales, Marc Gwadz, Fu Lu, Gabriele H Marchler, James S Song, Narmada Thanki, Roxanne A Yamashita, Chanjuan Zheng, Françoise Thibaud-Nissen, Lewis Y Geer, Aron Marchler-Bauer, Kim D Pruitt.
Abstract
The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) provides annotation for over 95 000 prokaryotic genomes that meet standards for sequence quality, completeness, and freedom from contamination. Genomes are annotated by a single Prokaryotic Genome Annotation Pipeline (PGAP) to provide users with a resource that is as consistent and accurate as possible. Notable recent changes include the development of a hierarchical evidence scheme, a new focus on curating annotation evidence sources, the addition and curation of protein profile hidden Markov models (HMMs), release of an updated pipeline (PGAP-4), and comprehensive re-annotation of RefSeq prokaryotic genomes. Antimicrobial resistance proteins have been reannotated comprehensively, improved structural annotation of insertion sequence transposases and selenoproteins is provided, curated complex domain architectures have given upgraded names to millions of multidomain proteins, and we introduce a new kind of annotation rule, BlastRules. Continual curation of supporting evidence and propagation of improved names onto RefSeq proteins ensure that the functional annotation of genomes is kept current. An increasing share of our annotation now derives from HMMs and other sets of annotation rules that are portable by nature, and available for download and for reuse by other investigators. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/. Published by Oxford University Press on behalf of Nucleic Acids Research 2017.
Year: 2018 PMID: 29112715 PMCID: PMC5753331 DOI: 10.1093/nar/gkx1068
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. New workflow for structural annotation by the PGAP 4.x series pipeline. Computational processes are shown in blue, data in white or gray. GeneMarkS+ provides ab initio prediction of protein-coding genes, guided by hints from homology-based evidence, which for the first time includes HMM evidence. The use of ORFfinder to produce every stop-to-stop translation, and HMM searching to find every translation with an HMM hit, are steps first introduced in the PGAP-4.1 release. The pipeline detects both disrupted genes (e.g. pseudogenes) and exceptional reading frames (e.g. selenoproteins).
Growth of RefSeq genomes and RefSeq proteins. Twenty pathogenic bacterial species account for more than half of the prokaryotic genomes included in RefSeq (54 663), and for a substantial share of incoming genomes. Consequently, the number of nonredundant RefSeq proteins is growing somewhat more slowly than the number of genomes.
| Date | Number of Genomes | Number of Non-redundant Proteins |
|---|---|---|
| 1 January 2013 | 7 503 | 0 |
| 1 January 2014 | 14 762 | 16 829 357 |
| 1 January 2015 | 28 146 | 26 883 513 |
| 1 January 2016 | 52 571 | 41 555 561 |
| 1 January 2017 | 77 292 | 55 802 502 |
| 12 September 2017 | 95 336 | 75 878 570 |
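The caption's claim that proteins grow more slowly than genomes can be checked with simple arithmetic on the table rows. The sketch below is only an illustration: it compares the 1 January 2014 and 12 September 2017 rows (the 2013 row lists 0 proteins, so it cannot serve as a baseline), and the variable names are chosen here, not taken from the paper.

```python
# Figures copied from the table above (1 January 2014 and 12 September 2017).
genomes_2014, proteins_2014 = 14_762, 16_829_357
genomes_2017, proteins_2017 = 95_336, 75_878_570

# Growth factor over the interval for each quantity.
genome_growth = genomes_2017 / genomes_2014
protein_growth = proteins_2017 / proteins_2014

# Average number of non-redundant proteins contributed per genome.
proteins_per_genome_2014 = proteins_2014 / genomes_2014
proteins_per_genome_2017 = proteins_2017 / genomes_2017

print(f"genome growth factor:      {genome_growth:.2f}")
print(f"protein growth factor:     {protein_growth:.2f}")
print(f"proteins per genome, 2014: {proteins_per_genome_2014:.0f}")
print(f"proteins per genome, 2017: {proteins_per_genome_2017:.0f}")
```

Over this interval the genome count grows by a larger factor than the non-redundant protein count, so the average number of new non-redundant proteins per genome falls, which is consistent with many incoming genomes belonging to already well-sampled species.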
Figure 2. A partially expanded view of the homology evidence and protein naming hierarchy used in RefSeq and PGAP annotation. Four families of beta-lactamases are shown (A, metallo, C, and D), each of which is more similar to various hydrolases of other substrates, such as RNA, than to any members of the other beta-lactamase classes. For each class, a protein profile HMM identifies members and suggests a protein product name, but further expansion of the hierarchy can reveal multiple child families, each identified by a more specific HMM that receives a higher precedence during annotation. The hierarchy of evidence largely follows an implicit hierarchy of protein names, although exceptions are occasionally necessary, as when unrelated proteins perform closely related functions.
Coverage of RefSeq non-redundant proteins by hierarchical evidence rules. A single protein may be supported by evidence of multiple types; this table counts each protein once, under its highest-precedence evidence, and does not count it again under any lower-precedence evidence it also has. The precedence scores shown are on an arbitrary scale, but they show how additional forms of evidence could be interleaved if an appropriate precedence is chosen. 75 878 570 proteins were analyzed; some evidence types with small protein counts are not shown. Once RefSeq or Conserved Domain Database biocurators construct and approve a protein product name, the HMM, CDD-SPARCLE, or other evidence becomes the basis of a fully automated rule for RefSeq annotation. Annotation improves over time as new rules are added that reach more proteins, or as rules capable of highly specific annotation overrule prior annotations based on less specific, lower-ranked rules.
| Evidence type | Relative precedence | Count of RefSeq proteins where the evidence is selected | Evidence level description |
|---|---|---|---|
| Allele | 100 | 2 426 | An annotation valid for exactly one protein sequence. Used only for antimicrobial resistance (AMR). |
| | 95 | 22 857 | Very close full-length homologs of reference sequences, typically ≥ 94% identity. Used mostly for virulence factors. |
| | 70 | 16 529 578 | Proteins with conserved specific function: a mature annotation rule with a curated product name |
| | 69 | 40 763 | Proteins with conserved specific function: early stage annotation rule |
| CDD-SPARCLE domain architectures | 60 | 19 107 223 | Proteins with an exact combination of domains, recognized by Conserved Domain Database's RPS-BLAST tools rather than by HMMs. |
| | 55 | 1 047 045 | Typically full-length homologs, somewhat variable in function, that deserve naming more specific than domain content provides |
| | 30 | 3 347 261 | Proteins containing an HMM-defined domain, generally an independently folding region shared by proteins of various functions. |
| Pending evidence | | 22 972 109 | An HMM or other classifier exists, but the annotation rule is not complete because the name to apply has not been curated. |
| No evidence | none | 12 775 989 | Proteins with no HMM, CDD-SPARCLE architecture, BlastRule, etc. |
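The counting scheme described in the caption (each protein counted once, under its highest-precedence evidence) can be sketched as follows. This is a minimal illustration and not PGAP's implementation: the `Evidence` class, both function names, the protein identifiers, and the evidence labels are hypothetical; only the precedence scores (70, 60, 30) come from the table.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Evidence:
    label: str       # illustrative label, not an official evidence-type name
    precedence: int  # relative precedence score, as in the table above


def select_evidence(hits: Iterable[Evidence]) -> Optional[Evidence]:
    """Return the hit with the highest precedence, or None if there are no hits."""
    return max(hits, key=lambda e: e.precedence, default=None)


def count_by_selected(proteins: dict[str, list[Evidence]]) -> Counter:
    """Count each protein once, keyed by the precedence of its selected evidence."""
    counts: Counter = Counter()
    for hits in proteins.values():
        best = select_evidence(hits)
        counts[best.precedence if best else None] += 1
    return counts


# Hypothetical proteins and hits, for illustration only.
proteins = {
    "protein_A": [Evidence("HMM-defined domain", 30),
                  Evidence("specific-function rule", 70)],
    "protein_B": [Evidence("CDD-SPARCLE domain architecture", 60)],
    "protein_C": [],  # no evidence of any kind
}

counts = count_by_selected(proteins)
print(select_evidence(proteins["protein_A"]))
print(dict(counts))
```

Here `protein_A` is counted only under precedence 70, even though it also has a domain-level hit at precedence 30, which mirrors how a more specific rule overrules a less specific one. A new evidence type could be interleaved simply by assigning it an unused precedence value.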