Kelly M McGarvey, Tamara Goldfarb, Eric Cox, Catherine M Farrell, Tripti Gupta, Vinita S Joardar, Vamsi K Kodali, Michael R Murphy, Nuala A O'Leary, Shashikant Pujar, Bhanu Rajput, Sanjida H Rangwala, Lillian D Riddick, David Webb, Mathew W Wright, Terence D Murphy, Kim D Pruitt.
Abstract
Complete and accurate annotation of the mouse genome is critical to the advancement of research conducted on this important model organism. The National Center for Biotechnology Information (NCBI) develops and maintains many useful resources to assist the mouse research community. In particular, the reference sequence (RefSeq) database provides high-quality annotation of multiple mouse genome assemblies using a combinatorial approach that leverages computation, manual curation, and collaboration. Implementation of this conservative and rigorous approach, which focuses on representation of only full-length and non-redundant data, produces high-quality annotation products. RefSeq records explicitly link sequences to current knowledge in a timely manner, updating public records regularly and rapidly in response to nomenclature updates, addition of new relevant publications, collaborator discussion, and user feedback. Whole genome re-annotation is also conducted at least every 12-18 months, and often more frequently in response to assembly updates or availability of informative data. This article highlights key features and advantages of RefSeq genome annotation products and presents an overview of NCBI processes to generate these data. Further discussion of NCBI's resources highlights useful features and the best methods for accessing our data.
Year: 2015 PMID: 26215545 PMCID: PMC4602073 DOI: 10.1007/s00335-015-9585-8
Source DB: PubMed Journal: Mamm Genome ISSN: 0938-8990 Impact factor: 2.957
Fig. 1 Overview of NCBI’s eukaryotic annotation pipeline from http://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/#process. Briefly, genomic sequences are repeat masked (gray), and transcripts (blue), proteins (green), RNA-seq reads (orange), and curated RefSeq sequences (pink) are aligned to the genome. Based on these alignments, gene model predictions are calculated (brown), best models are selected, named and accessioned (purple), and finally annotation products are released publicly (yellow). During re-annotations, models and genes are given special attention and are tracked from one annotation release to the next
Fig. 2 Examples of loci benefitting from manual curation. a The first 397 nucleotides of NM_144531.3 and NM_001109684.1 are missing from the GRCm38 reference genome assembly. The 5′ portion of the chromosome 4 gene Kazn (GeneID: 71529) was screen captured from NCBI's sequence viewer in the Gene resource and labels were edited (http://www.ncbi.nlm.nih.gov/gene/?term=71529#genomic-regions-transcripts-products). The partial alignment of the 5′ end of these RefSeq records is indicated by the double black arrows, and by the qualifier statement which is revealed upon hovering the mouse over the RefSeq transcript graphic. b Supporting evidence is reported on the NM_144531.3 record (http://www.ncbi.nlm.nih.gov/nuccore/NM_144531.3). The comments section shows that the full exon combination represented by NM_144531.3 is supported by the messenger RNA transcript, AK173090.1. This type of support evidence is associated with the ECO ID:0000332. The set of ECO IDs reported has been previously described (Pruitt et al. 2014). c The Apela gene on chromosome 8 (GeneID: 100038489) was defined as a non-coding locus in Mus musculus Annotation Release 104 (represented by NR_040692.1), but manual curation resulted in an update of the locus type to protein-coding in Annotation Release 105 (represented by NM_001297554.1/NP_001284483.1). The graphical display of RefSeq genome annotation that is shown in a, c was screen captured from NCBI's sequence viewer in the Gene resource and labels were edited (http://www.ncbi.nlm.nih.gov/gene/?term=100038489#genomic-regions-transcripts-products)
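The accessions cited in this caption encode the molecule type in their prefix, which is how the Apela update from non-coding (NR_) to protein-coding (NM_/NP_) can be read directly from the identifiers. The following is a minimal sketch of that reading, not an NCBI tool: the function name `describe_accession`, the prefix table, and the regular expression are illustrative assumptions, and the table covers only a subset of RefSeq prefixes.

```python
import re

# Illustrative subset of RefSeq accession prefixes (assumption: not exhaustive).
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "NC_": "genomic (chromosome or complete molecule)",
    "XM_": "predicted model mRNA",
    "XR_": "predicted model non-coding RNA",
    "XP_": "predicted model protein",
}

# Two uppercase letters, an underscore, digits, then ".version".
ACCESSION_RE = re.compile(r"^([A-Z]{2}_)(\d+)\.(\d+)$")


def describe_accession(accession):
    """Split a versioned RefSeq accession into prefix, molecule type, and version."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not a versioned RefSeq-style accession: {accession!r}")
    prefix, _number, version = match.groups()
    return {
        "prefix": prefix,
        "molecule": REFSEQ_PREFIXES.get(prefix, "unknown"),
        "version": int(version),
    }


if __name__ == "__main__":
    # Accessions taken from the caption above.
    for acc in ("NR_040692.1", "NM_001297554.1", "NP_001284483.1", "NM_144531.3"):
        print(acc, describe_accession(acc))
```

The version suffix (for example, `.3` in NM_144531.3) is kept separate because RefSeq records are updated over time, so the same accession can point to different sequence versions.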
Summary of NCBI mouse annotation releases 104 and 105
| Organism | Mouse | |
| Assembly | GRCm38.p2 | GRCm38.p3 |
| Annotation release | 104 | 105 |
| Release date | December 2013 | March 2015 |
a Includes transcribed pseudogenes and transcripts for protein-coding genes that are deemed unlikely to be translated
b Includes guide RNA, telomerase RNA, vault RNA, scRNA, Y RNA, RNase P, and RNase MRP
c As of August 2013
d As of May 2015
Sources: http://www.ncbi.nlm.nih.gov/genome/annotation_euk/Mus_musculus/105/, http://www.ncbi.nlm.nih.gov/genome/annotation_euk/Mus_musculus/104/, http://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi?REQUEST=SHOW_STATISTICS
Fig. 3 Graphical display of mouse RefSeq transcripts using NCBI’s Gene Resource. a Genomic context. Coordinates on multiple mouse genome assemblies and a graphical display of the location and orientation of genes neighboring the Fst (GeneID: 14313) locus are shown here. b Genomic regions, transcripts, and products. Tracks displayed with the default settings are indicated with red arrows. The configure button (red circle) may be used to customize tracks. c Zoom and pan features allow easy identification of differences between transcript variants. Quantitative RNA-seq intron features data are displayed in this view. d Shown here is a subset of the links and related information displayed in the sidebar of each Gene record
Fig. 4 The UCSC Genome Browser does not accurately represent RefSeq data. a NCBI Sequence Viewer. Coordinates on mouse chromosome 17 (NC_000083.6 from 25,957,500 to 25,988,800) and a graphical display of the neighboring loci, 1700022N22Rik (GeneID: 69431) and Capn15 (GeneID: 50817), were screen captured from the NCBI sequence viewer in the Gene resource and labels were edited. b UCSC Genome Browser. Coordinates on mouse chromosome 17 (NC_000083.6) and the RefSeq Genes track were screen captured from the UCSC Genome Browser and labels were edited. No RefSeq models are displayed in the RefSeq Genes track
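For readers who want to retrieve the region shown in panel a directly from NCBI, one option is the E-utilities `efetch` endpoint, which accepts `seq_start` and `seq_stop` parameters for nucleotide records. The sketch below only builds the request URL and computes the span length; it is not a method described in the article. The helper names `region_length` and `efetch_region_url` are hypothetical, and the code assumes the caption's coordinates are 1-based and inclusive.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def region_length(start, stop):
    """Length of a region, assuming 1-based inclusive coordinates."""
    if stop < start:
        raise ValueError("stop must be >= start")
    return stop - start + 1


def efetch_region_url(accession, start, stop, rettype="fasta"):
    """Build (but do not send) an efetch URL for a sub-region of a nucleotide record."""
    params = {
        "db": "nuccore",
        "id": accession,
        "seq_start": start,
        "seq_stop": stop,
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Coordinates from the caption: NC_000083.6, 25,957,500 to 25,988,800.
    print(region_length(25_957_500, 25_988_800))
    print(efetch_region_url("NC_000083.6", 25_957_500, 25_988_800))
```

Building the URL separately from sending it keeps the example runnable offline; the resulting string can be passed to any HTTP client.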