NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

Kim D Pruitt, Tatiana Tatusova, Donna R Maglott.

Abstract

The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database (http://www.ncbi.nlm.nih.gov/RefSeq/) provides a non-redundant collection of sequences representing genomic data, transcripts and proteins. Although the goal is to provide a comprehensive dataset representing the complete sequence information for any given species, the database pragmatically includes sequence data that are currently publicly available in the archival databases. The database incorporates data from over 2400 organisms and includes over one million proteins representing significant taxonomic diversity spanning prokaryotes, eukaryotes and viruses. Nucleotide and protein sequences are explicitly linked, and the sequences are linked to other resources including the NCBI Map Viewer and Gene. Sequences are annotated to include coding regions, conserved domains, variation, references, names, database cross-references, and other features using a combined approach of collaboration and other input from the scientific community, automated annotation, propagation from GenBank and curation by NCBI staff.

Year: 2005 PMID: 15608248 PMCID: PMC539979 DOI: 10.1093/nar/gki025

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

RefSeq is a public database of nucleotide and protein sequences with corresponding feature and bibliographic annotation. The RefSeq database is built and distributed by the NCBI, a division of the National Library of Medicine located at the US National Institutes of Health. NCBI makes RefSeq publicly available, at no cost, over the internet via FTP, Entrez query (1), Basic Local Alignment Search Tool (BLAST) (2,3) programs, and incorporation in a wide range of NCBI resources. NCBI builds RefSeq from the sequence data available in the archival database GenBank (4), which is a comprehensive public repository of sequences submitted to, and exchanged among, GenBank in the US, the EMBL Data Library in the UK and the DNA Data Bank of Japan. In addition, the annotated RefSeq record and/or supplementary information may be provided by multiple collaborations established with nomenclature groups, model organism databases and other facets of the scientific community. RefSeq records indicate the source GenBank data, include references and annotations relevant to the gene, transcript and protein, and indicate curation with attribution to the curation group. The RefSeq collection is unique in providing a curated, non-redundant, explicitly linked nucleotide and protein database representing significant taxonomic diversity. Genomic and protein sequence datasets are provided for the majority of organisms included; transcript records are currently provided for a subset of the eukaryotic collection. The RefSeq database provides a critical foundation for integrating sequence, genetic and functional information, and is used internationally as a standard for genome annotation. The collection is curated on an ongoing basis by collaborating groups and by NCBI staff. Sequence records are presented in a standard format and are subject to computational validation.

DISTINCTION FROM GENBANK

The RefSeq collection is derived from the primary submissions available in GenBank. GenBank is a redundant archival database that represents sequence information generated at different times, and may represent several alternate views of the protein, names or other information. In contrast, RefSeq represents a nearly non-redundant collection that is a synthesis and summary of available information, and represents the ‘current’ view of the sequence information, names and other annotations. RefSeq records can be distinguished from GenBank records by the format of the accession series. RefSeq accession numbers are formatted as two alphabetic characters, followed by an underscore (‘_’), optionally followed by four alphabetic characters (specific to the NZ_ prefix), followed by six, eight or nine numerals. GenBank accessions never include an underscore. Different alphabetic prefixes have implied meaning in terms of both the process of generation and the type of molecule represented. A full definition of the RefSeq accession numbers is available on the RefSeq Web site (http://www.ncbi.nlm.nih.gov/RefSeq/key.html#accessions).
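The accession format described above can be expressed as a regular expression. The following is a minimal sketch rather than an official NCBI validator: the function name is hypothetical, the pattern assumes upper-case accessions, and the optional `.version` suffix is an assumption that is not part of the description in the text.

```python
import re

# Pattern transcribed from the prose description of RefSeq accessions.
# The trailing ".version" group is an assumption added for convenience.
REFSEQ_ACCESSION = re.compile(
    r"^[A-Z]{2}_"              # two alphabetic characters and an underscore
    r"(?:[A-Z]{4})?"           # optional four letters (used with the NZ_ prefix)
    r"(?:\d{9}|\d{8}|\d{6})"   # six, eight or nine numerals
    r"(?:\.\d+)?$"             # assumed optional version suffix
)


def is_refseq_accession(accession: str) -> bool:
    """Return True if the string has the RefSeq accession shape."""
    return REFSEQ_ACCESSION.match(accession.strip()) is not None


if __name__ == "__main__":
    # Example strings chosen for illustration only.
    for candidate in ("NM_000546", "NZ_AAAA01000001", "U49845"):
        print(candidate, is_refseq_accession(candidate))
```

Because GenBank accessions never include an underscore, any string without one is rejected by this check. The check covers the format only and says nothing about whether a record with that accession exists.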

GROWTH

The RefSeq database continues to grow in pace with the large-scale genome and cDNA sequencing projects (see Table 1). As new complete genome assemblies become available, they are incorporated into the RefSeq collection. Most organisms are represented in the collection only after some genomic sequence data (nuclear, plastid, mitochondrial or other genomic molecules) becomes available; however, transcript and protein records may be provided for a subset of eukaryotic model organisms prior to the availability of genomic sequence data.

Table 1.

Annual growth of the RefSeq collection

Date	FTP release	Species	Genomic records	Transcript records	Protein records
6/30/2003	1	2005	64 729	211 803	785 143
7/5/2004	6	2467	68 592	247 639	1 050 975
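The growth between the two releases in Table 1 can be worked out directly from the counts. The sketch below copies the numbers from the table; the variable and function names are illustrative only.

```python
# Counts copied from Table 1: release 1 (6/30/2003) and release 6 (7/5/2004).
release_1 = {"species": 2005, "genomic": 64729, "transcript": 211803, "protein": 785143}
release_6 = {"species": 2467, "genomic": 68592, "transcript": 247639, "protein": 1050975}


def percent_growth(old: int, new: int) -> float:
    """Relative increase from old to new, in percent."""
    return 100.0 * (new - old) / old


growth = {key: percent_growth(release_1[key], release_6[key]) for key in release_1}

for key, value in growth.items():
    print(f"{key:<10} {value:5.1f}%")
```

Over this interval of roughly one year, the protein collection grew by about a third, whereas the number of genomic records grew by about 6%.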

ANNOTATION

Annotation of RefSeq records originates from several sources including the original GenBank submission, collaborating groups, NCBI computational analysis, user feedback and manual curation at NCBI. For example, collaboration supports the RefSeq representation of Saccharomyces cerevisiae, Drosophila melanogaster and Arabidopsis thaliana, which are directly contributed by the Saccharomyces Genome Database (SGD) (5), FlyBase (6) and The Institute for Genomic Research (TIGR), respectively. Similarly, the entire viral RefSeq collection is reviewed and curated by the NCBI Viral Genome Advisors group. See the RefSeq Collaborators page for more information about contributions from collaborators (http://www.ncbi.nlm.nih.gov/RefSeq/collaborators.html).

All RefSeq records include explicit cross-links between the nucleotide and protein cognates and to Entrez Gene (7), which provides gene-oriented access to the RefSeq collection. Additional links, annotated as ‘db_xref’ notations, are provided on some records to organism-specific genome resources such as Mouse Genome Informatics (MGI) (8) or FlyBase.

For other species, including Apis mellifera (honey bee), Gallus gallus (chicken), Homo sapiens (human), Mus musculus (mouse) and Rattus norvegicus (rat), genome annotation is provided by an NCBI computational process that utilizes transcript alignments, protein support and a hidden Markov model (HMM) ab initio prediction algorithm (see the NCBI Handbook; http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books). Genomic RefSeq records that are annotated by this process represent genes, transcripts and proteins, and include additional feature annotation to represent STS markers. The available RefSeq transcript dataset, with the ‘NM_’ accession prefix, is an important reagent in this annotation pipeline.

Comprehensive representation of the proteins, explicitly linked to a RefSeq nucleotide record, is a major focus of the RefSeq project. The goal is to represent the full-length protein product; however, partial protein products are represented for some genomes when partial protein annotation is contributed by a collaborator or when proteins are predicted from incomplete genome sequence data. Proteins are annotated by computation and curation. Conserved domains are calculated by an automatic process using data maintained in the NCBI Conserved Domain Database (CDD) (9); this annotation provides hints about possible function. Likewise, variation features that are located in the coding region are automatically calculated from data available in the NCBI dbSNP database (10). Additional features including Enzyme Commission (EC) numbers, other landmark regions of the protein sequence and references may be added by curation either by an external collaborator or by NCBI staff.

Transcript records are provided for a subset of eukaryotic species, including those in the Chordata taxonomic lineage, to represent protein-coding sequences, transcribed pseudogenes, ribosomal RNAs and other small RNAs. Annotation results from a mixture of automated and curatorial analysis. Variation features are calculated automatically from data in the dbSNP database, and the nucleotide regions corresponding to the annotated protein conserved domains are also provided automatically (as a miscellaneous feature, or ‘misc_feat’). Other features, such as polyadenylation signals and sites, alternate transcription start sites and RNA editing sites, are provided by curation.

CURATION AND QUALITY CONTROL

RefSeq sequences are validated to confirm the following: (i) accurate nucleotide-to-protein sequence correspondence; (ii) valid ASN.1 format and (iii) for species supported by collaboration with official nomenclature groups, current preferred name and symbol designations. Validation of map location is available for species that are annotated via the NCBI annotation pipeline. NCBI staff review and manually modify a subset of the RefSeq collection including those provided for viruses, some bacteria, mammals and some additional species. The goal of this manual curation is to provide accurate and full-length sequence data, to ensure accurate sequence-to-gene associations, to expand the collection by adding previously unrepresented genes and/or alternate splice products, and to provide additional feature annotation to represent mature peptide products, regions of interest and/or to highlight less frequent biological events such as non-AUG initiation sites (11) or selenoproteins (12). The curation status is annotated on RefSeq records, as a COMMENT feature; the status terms used include model, predicted, provisional, inferred, validated and reviewed, with the latter two indicating that sequence-level curation has taken place. Curation status terms are documented on the RefSeq Web site (http://www.ncbi.nlm.nih.gov/RefSeq/key.html#status). Several processes are used to identify records that will benefit most from staff review. For instance, records targeted for review include those that differ relative to available genomic sequence, those with significant protein length variation compared to homologous groups calculated by the NCBI HomoloGene resource (13), and those for which there are no related proteins other than the GenBank record used to construct the RefSeq. Several additional tests for transcript and protein quality are in place but are not enumerated here. In addition, review is based on user feedback that identifies additional data or errors. 
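Check (i) above, nucleotide-to-protein correspondence, can be illustrated with a small translation test. This is a simplified sketch under stated assumptions, not the validation procedure used by NCBI: it applies only the standard genetic code, expects a coding sequence that starts in frame, and uses hypothetical function names. Records with the less frequent events mentioned above, such as selenoproteins or non-AUG initiation sites, would need explicit exceptions that this sketch does not model.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order (TTT, TTC, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate an in-frame coding sequence; unknown codons become 'X'."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    protein = "".join(CODON_TABLE.get(codon, "X") for codon in codons)
    # Drop a single terminal stop codon, if present.
    return protein[:-1] if protein.endswith("*") else protein


def cds_matches_protein(cds: str, protein: str) -> bool:
    """True when the conceptual translation equals the protein sequence."""
    return translate(cds) == protein.upper()
```

For instance, `cds_matches_protein("ATGGCCTAA", "MA")` evaluates to `True`, because ATG and GCC encode M and A and the terminal stop codon is trimmed before the comparison.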
We welcome user feedback to help maintain and improve the RefSeq collection. A feedback form is provided online, or users can contact the main NCBI Help Desk (see Table 2).

Table 2.

RefSeq information, access and feedback

Resource	URL
RefSeq home page	http://www.ncbi.nlm.nih.gov/RefSeq/
FTP—RefSeq release	ftp://ftp.ncbi.nih.gov/refseq/release/
BLAST home page	http://www.ncbi.nlm.nih.gov/BLAST/
Entrez home page	http://www.ncbi.nlm.nih.gov/Entrez/
RefSeq feedback form	http://www.ncbi.nlm.nih.gov/RefSeq/update.cgi
Contact NCBI Help Desk	info@ncbi.nlm.nih.gov
Subscribe to RefSeq announce	http://www.ncbi.nlm.nih.gov/mailman/listinfo/refseq-announce

RETRIEVING DATA

The RefSeq collection can be accessed multiple ways at NCBI, including by Entrez query, BLAST, FTP, and links provided from NCBI databases and resources (see Table 2).

Entrez query

RefSeq results are included in the results returned when performing a global query of the Entrez databases from the NCBI or Entrez homepage. Returned results can be restricted to include only RefSeq records by going to the homepage of the nucleotide or protein database and either using the Entrez Limits page to select ‘Only from RefSeq’ or adding one of the RefSeq-specific property restrictions directly to the entered text query. For example, a query to retrieve all RefSeq nucleotide records that include the name ‘BRCA1’ somewhere in the record is formatted as BRCA1 AND srcdb_refseq[prop]. The RefSeq Web site provides definitions of the available property restrictions (http://www.ncbi.nlm.nih.gov/RefSeq/key.html#query). Entrez queries from the Entrez home page, where it is possible to query against all of the Entrez databases at once, will also return results from the Entrez Gene and Genomes (14) databases, which are both components of the RefSeq project. Entrez Gene integrates gene-specific annotation from RefSeq records with other sources of information, and thus provides a gene-oriented view of data about genes (7). When there is sequence for a complete genome or chromosome, the data are also included in the Entrez Genome database, which provides multiple tools to display and analyze the information.
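The property restriction shown above can also be attached to a query programmatically. The sketch below only builds the query string and a request URL and does not contact any server. Use of the NCBI E-utilities `esearch.fcgi` endpoint is an assumption made for this example, since the text describes only the web query interface, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities esearch endpoint is used for scripted queries.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def refseq_query(term: str) -> str:
    """Restrict an Entrez text query to RefSeq records."""
    return f"{term} AND srcdb_refseq[prop]"


def esearch_url(term: str, db: str = "nucleotide") -> str:
    """Build (but do not send) a search URL for the restricted query."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": refseq_query(term)})


if __name__ == "__main__":
    print(refseq_query("BRCA1"))
    print(esearch_url("BRCA1"))
```

Passing `db="protein"` would apply the same restriction to the protein database.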

BLAST and BLink

RefSeq records are included in the main BLAST nr databases and are also made available in genome-specific BLAST database collections (listed at http://www.ncbi.nlm.nih.gov/BLAST/). Hits to RefSeq records can be immediately identified by the distinct format of the accession numbers. BLAST nr results can be configured to show only those hits to the RefSeq collection by entering the Entrez property query on the format page (e.g. srcdb_refseq[prop]). RefSeq records are also included in the pre-computed BLAST analysis that is done to provide Entrez links to related sequences (nucleotide or protein) and to BLink, a visualization tool for the related protein sequences dataset. The BLink interface includes an option to show only RefSeq proteins.

FTP

The complete RefSeq collection is made available for anonymous FTP as bi-monthly releases in conjunction with daily and cumulative updates between the release cycles. The RefSeq release is structured to provide access to the full RefSeq collection or to a portion of the collection organized by main taxonomic categories (e.g. plant, viral, vertebrate_mammalian) or molecules of interest (e.g. organelle, plasmid). Documentation includes an indication of files and sequences provided, sequences that have been removed since the previous release, and a full description of the release structure and content. Announcements about large changes, problems and the availability of a RefSeq release are emailed to the refseq-announce email list (see Table 2). Additional FTP data is provided for some organisms of interest, including the transcript and protein dataset for human, mouse and rat. Users may be interested in subscribing to refseq-announce@ncbi.nlm.nih.gov to receive information about the RefSeq releases and planned modifications as they occur over time.
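As a small illustration of the release organization, the sketch below maps the category names mentioned above to locations under the release URL listed in Table 2. It assumes that each category is a subdirectory named after itself, which the text does not state explicitly, so the release documentation should be consulted before relying on these paths. The function name is hypothetical and no connection is made.

```python
# Release root as listed in Table 2.
RELEASE_ROOT = "ftp://ftp.ncbi.nih.gov/refseq/release/"

# Only the categories named in the text; the full release contains more.
CATEGORIES = ("plant", "viral", "vertebrate_mammalian", "organelle", "plasmid")


def release_directory(category: str) -> str:
    """Return the assumed directory URL for one release category."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    # Assumption: one subdirectory per category, named after the category.
    return f"{RELEASE_ROOT}{category}/"


if __name__ == "__main__":
    for name in CATEGORIES:
        print(release_directory(name))
```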

Links

Links to RefSeq records can be found in many NCBI and Entrez databases and resources, including Gene, UniGene, HomoloGene, Map Viewer and UniSTS.

14 in total

Review 1. Regulation of gene expression by stop codon recoding: selenocysteine.

Authors: Paul R Copeland
Journal: Gene Date: 2003-07-17 Impact factor: 3.688

2. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05 Impact factor: 5.469

3. Entrez: molecular biology database and retrieval system.

Authors: G D Schuler; J A Epstein; H Ohkawa; J A Kans
Journal: Methods Enzymol Date: 1996 Impact factor: 1.600

Review 4. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

5. The Mouse Genome Database (MGD): integrating biology with the genome.

Authors: Carol J Bult; Judith A Blake; Joel E Richardson; James A Kadin; Janan T Eppig; R M Baldarelli; K Barsanti; M Baya; J S Beal; W J Boddy; D W Bradt; D L Burkart; N E Butler; J Campbell; R Corey; L E Corbani; S Cousins; H Dene; H J Drabkin; K Frazer; D M Garippa; L H Glass; C W Goldsmith; P L Grant; B L King; M Lennon-Pierce; J Lewis; I Lu; C M Lutz; L J Maltais; L M McKenzie; D Miers; D Modrusan; L Ni; J E Ormsby; D Qi; S Ramachandran; T B K Reddy; D J Reed; R Sinclair; D R Shaw; C L Smith; P Szauter; B Taylor; P Vanden Borre; M Walker; L Washburn; I Witham; J Winslow; Y Zhu
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

6. Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms.

Authors: Karen R Christie; Shuai Weng; Rama Balakrishnan; Maria C Costanzo; Kara Dolinski; Selina S Dwight; Stacia R Engel; Becket Feierbach; Dianna G Fisk; Jodi E Hirschman; Eurie L Hong; Laurie Issel-Tarver; Robert Nash; Anand Sethuraman; Barry Starr; Chandra L Theesfeld; Rey Andrada; Gail Binkley; Qing Dong; Christopher Lane; Mark Schroeder; David Botstein; J Michael Cherry
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

Review 7. Generation of protein isoform diversity by alternative initiation of translation at non-AUG codons.

Authors: Christian Touriol; Stéphanie Bornes; Sophie Bonnal; Sylvie Audigier; Hervé Prats; Anne-Catherine Prats; Stéphan Vagner
Journal: Biol Cell Date: 2003 May-Jun Impact factor: 4.458

8. Entrez Gene: gene-centered information at NCBI.

Authors: Donna Maglott; Jim Ostell; Kim D Pruitt; Tatiana Tatusova
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

9. Database resources of the National Center for Biotechnology Information.

Authors: David L Wheeler; Tanya Barrett; Dennis A Benson; Stephen H Bryant; Kathi Canese; Deanna M Church; Michael DiCuccio; Ron Edgar; Scott Federhen; Wolfgang Helmberg; David L Kenton; Oleg Khovayko; David J Lipman; Thomas L Madden; Donna R Maglott; James Ostell; Joan U Pontius; Kim D Pruitt; Gregory D Schuler; Lynn M Schriml; Edwin Sequeira; Steven T Sherry; Karl Sirotkin; Grigory Starchenko; Tugba O Suzek; Roman Tatusov; Tatiana A Tatusova; Lukas Wagner; Eugene Yaschenko
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

10. GenBank.

Authors: Dennis A Benson; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; David L Wheeler
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971
