Kim D Pruitt, Garth R Brown, Susan M Hiatt, Françoise Thibaud-Nissen, Alexander Astashyn, Olga Ermolaeva, Catherine M Farrell, Jennifer Hart, Melissa J Landrum, Kelly M McGarvey, Michael R Murphy, Nuala A O'Leary, Shashikant Pujar, Bhanu Rajput, Sanjida H Rangwala, Lillian D Riddick, Andrei Shkeda, Hanzhen Sun, Pamela Tamez, Raymond E Tully, Craig Wallin, David Webb, Janet Weber, Wendy Wu, Michael DiCuccio, Paul Kitts, Donna R Maglott, Terence D Murphy, James M Ostell.
Abstract
The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of annotated genomic, transcript and protein sequence records derived from data in public sequence archives and from computation, curation and collaboration (http://www.ncbi.nlm.nih.gov/refseq/). We report here on growth of the mammalian and human subsets, changes to NCBI's eukaryotic annotation pipeline and modifications affecting transcript and protein records. Recent changes to NCBI's eukaryotic genome annotation pipeline provide higher throughput, and the addition of RNAseq data to the pipeline results in a significant expansion of the number of transcripts and novel exons annotated on mammalian RefSeq genomes. Recent annotation changes include reporting supporting evidence for transcript records, modification of exon feature annotation and the addition of a structured report of gene and sequence attributes of biological interest. We also describe a revised protein annotation policy for alternatively spliced transcripts with more divergent predicted proteins and we summarize the current status of the RefSeqGene project.
Year: 2013 PMID: 24259432 PMCID: PMC3965018 DOI: 10.1093/nar/gkt1114
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Table 1. Annual growth of mammalian and human RefSeq transcript records
| Release (a) | Taxa (b) | Mammalian: total transcripts | Mammalian: percent models (c) | Mammalian: percent curated (d) | Mammalian: percent ncRNA (e) | Human: total transcripts | Human: total genes (f) | Human: percent models (c) | Human: percent curated (d) | Human: percent ncRNA (e) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 126 980 | 68 | 7 | <1 | 38 556 | na | 50 | 22 | <1 |
| 6 | 10 | 79 686 | 41 | 17 | <1 | 28 176 | na | 23 | 42 | <1 |
| 12 | 19 | 158 111 | 65 | 11 | <1 | 29 490 | na | 18 | 50 | 1 |
| 18 | 28 | 263 628 | 77 | 7 | 5 | 40 342 | 28 514 | 38 | 39 | 2 |
| 24 | 35 | 338 204 | 80 | 7 | 8 | 38 709 | 29 398 | 34 | 46 | 14 |
| 30 | 37 | 340 968 | 77 | 9 | 9 | 45 511 | 27 741 | 41 | 45 | 16 |
| 36 | 42 | 346 976 | 74 | 12 | 8 | 43 589 | 29 071 | 30 | 60 | 13 |
| 42 | 42 | 425 170 | 76 | 12 | 9 | 46 111 | 29 954 | 27 | 63 | 15 |
| 48 | 43 | 470 979 | 76 | 12 | 9 | 46 912 | 27 619 | 20 | 70 | 25 |
| 54 | 45 | 515 900 | 76 | 12 | 9 | 44 951 | 26 440 | 10 | 79 | 23 |
| 60 | 59 | 1 263 067 | 90 | 5 | 6 | 47 619 | 26 266 | 11 | 79 | 24 |
(a) Release numbers listed correspond to ∼12-month intervals beginning from the first release in June 2003. The number of human transcripts in release 60 (July 2013) reflects the November 2012 genome annotation of three assemblies (GRCh37.p10, HuRef and CHM1_1.0) plus records added through ongoing curation activities.
(b) The number of distinct NCBI Taxonomy IDs included in the RefSeq vertebrate_mammalian FTP directory that have publicly available nuclear genome records. Twelve taxa are represented by un-annotated ENCODE genomic region records only. Mammals for which only a mitochondrial genome sequence is available are excluded. Data reported in Table 1 were extracted from archived reports available at ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/ using files named ‘RefSeq-release##.catalog.gz’.
(c) The percent of total transcripts that are model RefSeqs (with XM or XR accession prefix) generated by NCBI’s eukaryotic annotation pipeline. The percent known RefSeqs (with NM or NR prefix) can be inferred from this value (100% − percent model RefSeqs = percent known RefSeqs).
(d) The percent of total transcripts that are known RefSeq records that have been curated by NCBI staff and are annotated with a ‘validated’ or ‘reviewed’ status in the COMMENT block of the RefSeq record. Validated records have undergone sequence review by NCBI staff, whereas a reviewed record includes curation of descriptive information, such as names, publications and a RefSeq summary, in addition to sequence review. Known RefSeq records that have not been curated are not included; thus, the percentages of model records and curated records do not sum to 100%.
(e) The percent of total transcripts that are not protein coding. This includes model or known long non-coding RNAs (lncRNA), small RNAs (e.g. microRNA, snoRNA, etc.), ribosomal RNAs and transcribed pseudogenes. Transfer RNAs, which are annotated on genomic records using tRNAscan but not tracked with RefSeq accessions, are not included.
(f) The number of human genes per release was derived using FTP files named ‘release##.accession2geneid.gz’. This file was not provided prior to release 14.
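The accession-prefix rules in the table notes can be illustrated with a short sketch. This is not NCBI's reporting code: the function names `transcript_percentages` and `percent_uncurated_known` are hypothetical, and the input is assumed to be a plain list of accession strings already extracted from a catalog file (the catalog file layout itself is not modeled here).

```python
# Prefix rules as stated in the table notes: XM/XR mark model transcripts,
# NM/NR mark known transcripts. Other prefixes (e.g. proteins) are ignored.
MODEL_PREFIXES = {"XM_", "XR_"}
KNOWN_PREFIXES = {"NM_", "NR_"}


def transcript_percentages(accessions):
    """Count model and known transcript accessions and return percentages.

    Hypothetical helper: 'accessions' is any iterable of accession.version
    strings; non-transcript accessions are skipped.
    """
    model = 0
    known = 0
    for acc in accessions:
        prefix = acc[:3]
        if prefix in MODEL_PREFIXES:
            model += 1
        elif prefix in KNOWN_PREFIXES:
            known += 1
    total = model + known
    if total == 0:
        return {"total": 0, "percent_model": 0.0, "percent_known": 0.0}
    percent_model = 100.0 * model / total
    # Percent known is inferred as 100% minus percent model.
    return {
        "total": total,
        "percent_model": percent_model,
        "percent_known": 100.0 - percent_model,
    }


def percent_uncurated_known(percent_model, percent_curated):
    """Known records that are not curated make up the remainder."""
    return 100 - percent_model - percent_curated


# The six GSTZ1 transcript accessions cited in the Figure 2 caption.
gstz1 = [
    "NM_145870.2", "NM_145871.2", "NM_001513.3",
    "XM_005267557.1", "XM_005267558.1", "XM_005267559.1",
]
summary = transcript_percentages(gstz1)
```

Applied to the release 60 mammalian row (90% models, 5% curated), `percent_uncurated_known(90, 5)` gives 5, which is the share of known records that are not counted as curated.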
Figure 1. Number of vertebrate and other eukaryotic genome annotations released by NCBI per year since 2001. Additional information about recently annotated genomes is available at http://www.ncbi.nlm.nih.gov/genome/annotation_euk/status/#recent.
Figure 2. Both known and model RefSeq records may be associated with the same locus. A portion of the ‘Genomic regions, transcripts, and products’ section of the Gene record for human GSTZ1 (NCBI GeneID 2954) is shown. Chromosome 14 coordinates corresponding to annotation of assembly GRCh37.p13 (NC_000014.8), NCBI annotation release 105 are shown at the top. The gene is associated with three known RefSeq transcripts (e.g. NM_145870.2, NM_145871.2 and NM_001513.3) and three model transcripts (e.g. XM_005267557.1, XM_005267558.1 and XM_005267559.1). The first exon of the overlapping POMT2 gene is also visible in this display. Supplementing curated RefSeqs (NM, NR, or NP prefixes) with model RefSeqs (XM, XR and XP accessions) enables better representation of alternative splice variants and exons.
Figure 3. Structured comments provide information on supporting evidence and biological attributes. A portion of the COMMENT section of the NM_006440.4 record is displayed, illustrating the two structured comments. (A) The Evidence Data comment reports supporting evidence for the exon combination represented in the record. (B) The RefSeq Attributes comment reports biological attributes. Each comment type includes the attribute category on the left and supporting evidence on the right. Structured comments include special formatting and are bracketed by START and END to support parsing.
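Because the structured comments are bracketed by START and END markers, they can be pulled out of a COMMENT block with a small parser. The following is a minimal sketch under stated assumptions: the caption does not spell out the exact delimiters, so the `##Name-START##` / `##Name-END##` markers and the `::` separator between the category (left) and the evidence (right) are assumed here, and the sample text uses placeholder values rather than content from a real record. The function name `parse_structured_comments` is hypothetical.

```python
import re

# Assumed delimiter form: "##<Name>-START##" ... "##<Name>-END##".
BLOCK_RE = re.compile(
    r"##(?P<name>[\w-]+?)-START##(?P<body>.*?)##(?P=name)-END##",
    re.DOTALL,
)


def parse_structured_comments(comment_text):
    """Return {comment name: {category: evidence}} for each bracketed block."""
    parsed = {}
    for match in BLOCK_RE.finditer(comment_text):
        fields = {}
        for line in match.group("body").splitlines():
            # Assumed layout: category on the left, evidence on the right.
            if "::" not in line:
                continue
            category, evidence = line.split("::", 1)
            fields[category.strip()] = evidence.strip()
        parsed[match.group("name")] = fields
    return parsed


# Placeholder text only; not copied from NM_006440.4 or any real record.
sample = """
Free-text comment that is not part of a structured block.
##Evidence-Data-START##
Transcript exon combination :: EXAMPLE_ACCESSION.1
##Evidence-Data-END##
##RefSeq-Attributes-START##
example attribute :: example supporting evidence
##RefSeq-Attributes-END##
"""

comments = parse_structured_comments(sample)
```

Matching the END marker against the name captured at START keeps the two comment types separate even when they appear back to back in the same COMMENT block.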
Number of mammalian and human transcript records annotated with evidence support, by evidence type categories
| Evidence type | Mammals (number of transcript records) | Human (number of transcript records) |
|---|---|---|
| transcript exon combination | 96 124 | 33 378 |
| CDS exon combination | 2985 | 1486 |
| Intronless | 1968 | 734 |
| RNAseq, single sample | 33 446 | 23 251 |
| RNAseq, mixed/partial | 12 918 | 11 056 |
| Total distinct transcript records | 106 895 | 38 911 |
(a) Counts as of 10 September 2013.
(b) This evidence category is supported by a combination of curation and alignment evaluation and is under-reported.
(c) RNAseq data was used as an input reagent in calculating genome annotation for nine organisms at this time.