Nuala A O'Leary, Mathew W Wright, J Rodney Brister, Stacy Ciufo, Diana Haddad, Rich McVeigh, Bhanu Rajput, Barbara Robbertse, Brian Smith-White, Danso Ako-Adjei, Alexander Astashyn, Azat Badretdin, Yiming Bao, Olga Blinkova, Vyacheslav Brover, Vyacheslav Chetvernin, Jinna Choi, Eric Cox, Olga Ermolaeva, Catherine M Farrell, Tamara Goldfarb, Tripti Gupta, Daniel Haft, Eneida Hatcher, Wratko Hlavina, Vinita S Joardar, Vamsi K Kodali, Wenjun Li, Donna Maglott, Patrick Masterson, Kelly M McGarvey, Michael R Murphy, Kathleen O'Neill, Shashikant Pujar, Sanjida H Rangwala, Daniel Rausch, Lillian D Riddick, Conrad Schoch, Andrei Shkeda, Susan S Storz, Hanzhen Sun, Francoise Thibaud-Nissen, Igor Tolstoy, Raymond E Tully, Anjana R Vatsan, Craig Wallin, David Webb, Wendy Wu, Melissa J Landrum, Avi Kimchi, Tatiana Tatusova, Michael DiCuccio, Paul Kitts, Terence D Murphy, Kim D Pruitt.
Abstract
The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55,000 organisms (>4800 viruses, >40,000 prokaryotes and >10,000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management. Published by Oxford University Press on behalf of Nucleic Acids Research 2015. This work is written by (a) US Government employee(s) and is in the public domain in the US.
Year: 2015 PMID: 26553804 PMCID: PMC4702849 DOI: 10.1093/nar/gkv1189
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
RefSeq accession prefixes
| Prefix | Molecule type | Use context | Notes |
|---|---|---|---|
| NC_ | DNA | Chromosomes; linkage groups | 1 |
| AC_ | DNA | Chromosomes; linkage groups | 1 |
| NZ_ | DNA | Chromosomes; scaffolds. Used predominantly for prokaryotic genomes. | 2 |
| NT_ | DNA | Scaffolds | 3 |
| NW_ | DNA | Scaffolds | 3 |
| NG_ | DNA | Genomic regions. A genomic region record may represent a single or multiple genetic loci (e.g. rRNA targeted locus, RefSeqGene, non-transcribed pseudogene) | 1 |
| NM_ | mRNA | Protein-coding transcripts | 3, 4 |
| XM_ | mRNA | Protein-coding transcripts | 3, 5 |
| NR_ | RNA | Non-protein-coding transcripts including lncRNAs, structural RNAs, transcribed pseudogenes, and transcripts with unlikely protein-coding potential from protein-coding genes | 3, 4 |
| XR_ | RNA | Non-protein-coding transcripts, as above | 3, 5 |
| NP_ | protein | Proteins annotated on NM_ transcript accessions or annotated on genomic molecules without an instantiated transcript (e.g. some mitochondrial genomes, viral genomes, and reference bacterial genomes) | 3, 4 |
| AP_ | protein | Proteins annotated on AC_ genomic accessions or annotated on genomic molecules without an instantiated transcript record | 3 |
| XP_ | protein | Proteins annotated on XM_ transcript accessions or annotated on genomic molecules without an instantiated transcript record | 3, 5 |
| YP_ | protein | Proteins annotated on genomic molecules without an instantiated transcript record | 3 |
| WP_ | protein | Proteins that are non-redundant across multiple strains and species. A single protein of this type may be annotated on more than one prokaryotic genome | 6 |

1. The complete accession number format consists of the prefix, including the underscore, followed by 6 numbers followed by the sequence version number.
2. The complete accession format consists of the prefix followed by the INSDC accession number that the RefSeq record is based on followed by the RefSeq sequence version number.
3. The complete accession number format consists of the prefix, including the underscore, followed by 6 or 9 numbers followed by the sequence version number.
4. Records with this accession prefix have been curated by NCBI staff or a model organism database, or are in the pool of accessions that curators work with. These records are referred to as the ‘known’ RefSeq dataset.
5. Records with this accession prefix are generated through either the eukaryotic genome annotation pipeline, or the small eukaryotic genome annotation pipeline. Records generated via the first method are referred to as the ‘model’ RefSeq dataset.
6. The complete accession number format consists of the prefix, including the underscore, followed by 9 numbers followed by the version number. The version number is always ‘.1’ as these records are not subject to update. See online documentation for additional information: www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/.
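The accession rules in the table and footnotes above can be sketched as a small validator. This is an illustrative sketch, not an NCBI tool: the names `RefSeqAccession` and `parse_accession` are hypothetical, the version is assumed to follow a period (as in the ‘.1’ of footnote 6), and the INSDC accession inside an NZ_ record is assumed to consist only of uppercase letters and digits.

```python
import re
from typing import NamedTuple


class RefSeqAccession(NamedTuple):
    prefix: str    # e.g. "NM_"
    molecule: str  # "DNA", "mRNA", "RNA" or "protein"
    body: str      # digits, or the INSDC accession for NZ_ records
    version: int   # sequence version number


# prefix -> (molecule type, allowed digit counts), per the table and footnotes 1, 3 and 6.
_NUMERIC_PREFIXES = {
    "NC_": ("DNA", (6,)), "AC_": ("DNA", (6,)), "NG_": ("DNA", (6,)),
    "NT_": ("DNA", (6, 9)), "NW_": ("DNA", (6, 9)),
    "NM_": ("mRNA", (6, 9)), "XM_": ("mRNA", (6, 9)),
    "NR_": ("RNA", (6, 9)), "XR_": ("RNA", (6, 9)),
    "NP_": ("protein", (6, 9)), "AP_": ("protein", (6, 9)),
    "XP_": ("protein", (6, 9)), "YP_": ("protein", (6, 9)),
    "WP_": ("protein", (9,)),
}

# Assumption: "<prefix><body>.<version>", with the body limited to A-Z and 0-9.
_ACCESSION_RE = re.compile(r"([A-Z]{2}_)([A-Z0-9]+)\.([0-9]+)")


def parse_accession(accession: str) -> RefSeqAccession:
    """Split an accession into its parts, raising ValueError if it breaks a rule."""
    match = _ACCESSION_RE.fullmatch(accession.strip())
    if match is None:
        raise ValueError(f"not in prefix/body/version form: {accession!r}")
    prefix, body, version = match.group(1), match.group(2), int(match.group(3))

    if prefix == "NZ_":
        # Footnote 2: the body is the underlying INSDC accession, so it is not digit-checked.
        return RefSeqAccession(prefix, "DNA", body, version)
    if prefix not in _NUMERIC_PREFIXES:
        raise ValueError(f"unknown RefSeq prefix: {prefix!r}")

    molecule, digit_counts = _NUMERIC_PREFIXES[prefix]
    if not body.isdigit() or len(body) not in digit_counts:
        raise ValueError(f"{prefix} expects {digit_counts} digits, got {body!r}")
    if prefix == "WP_" and version != 1:
        # Footnote 6: WP_ records are never updated, so the version is always 1.
        raise ValueError("WP_ accessions always carry version 1")
    return RefSeqAccession(prefix, molecule, body, version)
```

For example, `parse_accession("NM_000546.5")` yields the prefix `NM_`, molecule type `mRNA`, body `000546` and version `5`, while a WP_ accession with any version other than 1 is rejected.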
Annual growth in the number of organisms, proteins, and transcripts represented in the comprehensive RefSeq release, per FTP release directory
| Release Directory | Organisms | % Change | Transcripts | % Change | Proteins | % Change |
|---|---|---|---|---|---|---|
| Archaea | 952 | 12 | 1109 | 318 | 1037407 | -5 |
| Bacteria | 39660 | 40 | 19650 | 488 | 40194748 | 14 |
| Fungi | 3367 | 18 | 1438749 | 17 | 1440956 | 17 |
| Invertebrate | 1786 | 29 | 1435978 | 76 | 1367317 | 74 |
| Mitochondrion | 5732 | 24 | 112 | -15 | 83208 | 24 |
| Plant | 847 | 59 | 2181963 | 86 | 2067971 | 75 |
| Plasmid | 2139 | 31 | 12 | 9 | 126725 | -62 |
| Plastid | 843 | 54 | 120 | 0 | 72579 | 50 |
| Protozoa | 273 | 27 | 849678 | 46 | 865048 | 45 |
| Vertebrate_mammalian | 776 | 14 | 3778288 | 44 | 3266845 | 39 |
| Vertebrate_other | 2755 | 26 | 2097939 | 85 | 2023378 | 84 |
| Viral | 4850 | 17 | 0 | 0 | 230360 | 15 |
| Complete | 55267 | 34 | 11803354 | 56 | 52494032 | 20 |
Counts are based on statistics reports that are available from the RefSeq FTP site at ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-statistics/ (e.g. archaea.acc_taxid_growth.txt and related files). The percent annual change is based on comparing data counts for RefSeq release 71 (July 2015) and RefSeq release 66 (July 2014).
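The percent annual change compares each release 71 count with its release 66 counterpart. A minimal sketch of that arithmetic follows, assuming the conventional relative-change formula; the function names are hypothetical, and returning 0 when both counts are zero is an assumption chosen to match the Viral transcripts row. Because the table reports rounded percentages, a prior count recovered from them is only approximate.

```python
def percent_change(current: float, previous: float) -> float:
    """Relative change from `previous` to `current`, in percent."""
    if previous == 0:
        if current == 0:
            # Assumption: no records in either release is reported as 0% change.
            return 0.0
        raise ValueError("percent change is undefined when the previous count is 0")
    return 100.0 * (current - previous) / previous


def estimate_previous(current: float, pct_change: float) -> float:
    """Invert percent_change: approximate the earlier count from a rounded percentage."""
    if pct_change <= -100:
        raise ValueError("a change of -100% or less cannot be inverted")
    return current / (1.0 + pct_change / 100.0)


# Example using the 'Complete' row: 55267 organisms, reported as a 34% increase.
approx_release_66_organisms = estimate_previous(55267, 34)
```

The inversion is useful only as a sanity check; the exact release 66 counts are in the statistics reports on the FTP site cited above.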