Kim D Pruitt, Tatiana Tatusova, Garth R Brown, Donna R Maglott.
Abstract
The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of genomic, transcript and protein sequence records. These records are selected and curated from public sequence archives and represent a significant reduction in redundancy compared to the volume of data archived by the International Nucleotide Sequence Database Collaboration. The database includes over 16,000 organisms, 2.4 × 10^6 genomic records, 13 × 10^6 proteins and 2 × 10^6 RNA records spanning prokaryotes, eukaryotes and viruses (RefSeq release 49, September 2011). The RefSeq database is maintained by a combined approach of automated analyses, collaboration and manual curation to generate an up-to-date representation of the sequence, its features, names and cross-links to related sources of information. We report here on recent growth, the status of curating the human RefSeq data set, more extensive feature annotation and current policy for eukaryotic genome annotation via the NCBI annotation pipeline. More information about the resource is available online (see http://www.ncbi.nlm.nih.gov/RefSeq/).
Year: 2011 PMID: 22121212 PMCID: PMC3245008 DOI: 10.1093/nar/gkr1079
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Annual Growth of the RefSeq release
| Release directory | Organisms: Release 43 | Organisms: Release 49 | Organisms: Increase (%) | Records: Release 43 | Records: Release 49 | Records: Increase (%) |
|---|---|---|---|---|---|---|
| Complete | 10 854 | 16 248 | 49.7 | 15 934 055 | 18 236 994 | 14.5 |
| Fungi | 280 | 301 | 7.5 | 1 178 671 | 1 319 842 | 12.0 |
| Invertebrate | 637 | 754 | 18.4 | 1 993 670 | 2 232 026 | 12.0 |
| Microbial | 5585 | 10 346 | 85.2 | 9 031 974 | 10 711 822 | 18.6 |
| mitochondrion | 2266 | 2654 | 17.1 | 34 688 | 40 664 | 17.2 |
| Plant | 182 | 229 | 25.8 | 817 648 | 842 720 | 3.1 |
| Plasmid | 952 | 1061 | 11.4 | 160 065 | 191 018 | 19.3 |
| Plastid | 186 | 233 | 25.3 | 16 908 | 21 103 | 24.8 |
| Protozoa | 134 | 146 | 9.0 | 932 990 | 956 479 | 2.5 |
| Vertebrate_mammalian | 327 | 354 | 8.3 | 1 492 157 | 1 587 895 | 6.4 |
| Vertebrate_other | 1120 | 1334 | 19.1 | 398 084 | 483 449 | 21.4 |
| Viral | 2250 | 2745 | 22.0 | 87 759 | 101 664 | 15.8 |
Note: Release 43 included data available on 7 September 2010; release 49 included data available on 5 September 2011.
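The Increase (%) columns can be reproduced from the release 43 and release 49 counts. The following is a minimal sketch, assuming the percentage is the simple relative change rounded to one decimal place; the helper name `percent_increase` is hypothetical and not part of any NCBI tooling.

```python
def percent_increase(old: int, new: int) -> float:
    """Relative growth from old to new, in percent, rounded to one decimal."""
    return round((new - old) / old * 100, 1)


# (release 43, release 49) counts copied from three rows of the table above.
organisms = {
    "Complete": (10_854, 16_248),
    "Microbial": (5_585, 10_346),
    "Viral": (2_250, 2_745),
}
records = {
    "Complete": (15_934_055, 18_236_994),
    "Microbial": (9_031_974, 10_711_822),
    "Viral": (87_759, 101_664),
}

for name in organisms:
    org_growth = percent_increase(*organisms[name])
    rec_growth = percent_increase(*records[name])
    print(f"{name}: organisms +{org_growth}%, records +{rec_growth}%")
```

Under this assumption, the three rows shown agree with the published percentages, for example 49.7% organism growth and 14.5% record growth for the Complete directory.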
Distribution of RefSeq release 49 by ftp directory
| Release directory | Organisms (% of total) | Accessions (% of total) |
|---|---|---|
| Fungi | 1.9 | 7.2 |
| Invertebrate | 4.6 | 12.2 |
| Microbial | 63.7 | 58.7 |
| Mitochondrion | 16.3 | 0.2 |
| Plant | 1.4 | 4.6 |
| Plasmid | 6.5 | 1.0 |
| Plastid | 1.4 | 0.1 |
| Protozoa | 0.9 | 5.2 |
| Vertebrate_mammalian | 2.2 | 8.7 |
| Vertebrate_other | 8.2 | 2.7 |
| Viral | 16.9 | 0.6 |
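The percentages in this table appear to be each directory's release 49 count divided by the Complete directory totals reported for release 49 (16 248 organisms and 18 236 994 records). The sketch below checks that reading for three directories; the choice of denominators is an assumption inferred from the numbers, and `share_of_total` is a hypothetical helper.

```python
# Release 49 totals taken from the "Complete" row of the growth table
# (assumed to be the denominators used for this distribution).
TOTAL_ORGANISMS = 16_248
TOTAL_RECORDS = 18_236_994


def share_of_total(part: int, total: int) -> float:
    """Share of the total in percent, rounded to one decimal."""
    return round(part / total * 100, 1)


# (organisms, records) in release 49 for three directories.
release_49 = {
    "Microbial": (10_346, 10_711_822),
    "Mitochondrion": (2_654, 40_664),
    "Viral": (2_745, 101_664),
}

for name, (n_org, n_rec) in release_49.items():
    org_share = share_of_total(n_org, TOTAL_ORGANISMS)
    rec_share = share_of_total(n_rec, TOTAL_RECORDS)
    print(f"{name}: {org_share}% of organisms, {rec_share}% of accessions")
```

The contrast is visible directly in the output: mitochondrial and viral records cover roughly a sixth of organisms each but well under 1% of accessions, while the Microbial directory dominates both measures.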
Current status of human transcripts and proteins
| Type | Total accessions (Release 49) | Curated | Percent curated |
|---|---|---|---|
| Known protein-coding transcripts | 31 933 | 29 531 | 92.5 |
| Model protein-coding transcripts | 1118 | NA | |
| Known non-coding transcripts | 5932 | 3396 | 57.2 |
| Model non-coding transcripts | 3762 | NA | |
| Total | 42 745 | 32 927 | 77.0 |
Note: Curated records have a review status of ‘Validated’ or ‘Reviewed’, which is not applied to model RefSeq records.
Figure 1. NM_145204.3 is shown in the Nucleotide Graphical display format. The display was configured to show the six-frame translation track restricted to the sense strand, and to add three markers highlighting the annotated upstream in-frame stop codon, the translation initiation codon and a second in-frame AUG codon located further downstream. The observation of a stop codon upstream of, and in the same reading frame as, the translation initiation codon suggests the annotated CDS is 5′ complete.
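The reasoning in the caption can be illustrated with a small check: walk upstream from the annotated start codon one codon at a time, staying in the same frame, and report the first stop codon found. This is a sketch under stated assumptions, not the NCBI curation procedure; the function name `upstream_inframe_stop`, the 0-based `cds_start` convention and the toy sequences are all hypothetical, and the sequences are not taken from NM_145204.3.

```python
from typing import Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}


def upstream_inframe_stop(mrna: str, cds_start: int) -> Optional[int]:
    """Return the 0-based index of the nearest stop codon that lies upstream
    of cds_start and in the same reading frame, or None if there is none.

    cds_start is assumed to be the 0-based index of the first base of the
    annotated start codon.
    """
    seq = mrna.upper().replace("U", "T")  # accept RNA or DNA alphabets
    pos = cds_start - 3
    while pos >= 0:
        if seq[pos:pos + 3] in STOP_CODONS:
            return pos
        pos -= 3  # step back one codon, preserving the frame
    return None


# Toy transcript: TGA at index 2 is in frame with the ATG at index 11.
with_stop = "CCTGAGCCACCATGGCTATGAAATAA"
# Same layout with the TGA changed to TGG, so no in-frame stop remains.
without_stop = "CCTGGGCCACCATGGCTATGAAATAA"

print(upstream_inframe_stop(with_stop, with_stop.find("ATG")))
print(upstream_inframe_stop(without_stop, without_stop.find("ATG")))
```

When a stop is found, translation in that frame cannot have started further upstream without reading through it, which is why the annotated start can be treated as 5′ complete. When the function returns `None`, the transcript alone does not settle the question, since the true start could lie in sequence that is not represented.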