Entrez Gene: gene-centered information at NCBI.

Donna Maglott, Jim Ostell, Kim D Pruitt, Tatiana Tatusova.

Abstract

Entrez Gene (http://www.ncbi.nlm.nih.gov/gene) is the National Center for Biotechnology Information (NCBI)'s database for gene-specific information. Entrez Gene maintains records from genomes which have been completely sequenced, which have an active research community to submit gene-specific information, or which are scheduled for intense sequence analysis. The content represents the integration of curation and automated processing from NCBI's Reference Sequence project (RefSeq), collaborating model organism databases, consortia such as Gene Ontology and other databases within NCBI. Records in Entrez Gene are assigned unique, stable and tracked integers as identifiers. The content (nomenclature, genomic location, gene products and their attributes, markers, phenotypes and links to citations, sequences, variation details, maps, expression, homologs, protein domains and external databases) is available via interactive browsing through NCBI's Entrez system, via NCBI's Entrez programming utilities (E-Utilities) and for bulk transfer by FTP.

Year:  2010        PMID: 21115458      PMCID: PMC3013746          DOI: 10.1093/nar/gkq1237

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Entrez Gene is the gene-specific database at the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine, located on the campus of the US National Institutes of Health in Bethesda, MD, USA. Entrez Gene generates unique integers (GeneID) as stable identifiers for genes and other loci for a subset of model organisms. It tracks those identifiers and uses them to integrate multiple types of information including nomenclature, summary descriptions, accessions of gene-specific and gene product-specific sequences, chromosomal localization, reports of pathways and protein interactions, associated markers and phenotypes. Because the GeneID is used to represent gene-specific information in other databases at NCBI, the full Entrez Gene report includes a wealth of links to gene-specific literature citations, sequences, variations, homologs and databases outside of NCBI. Entrez Gene is integrated with NCBI’s Entrez system for interactive query, LinkOut and access by E-Utilities (1).
Data in Entrez Gene result from integration of results from automated analyses and curation by Reference Sequence project (RefSeq) staff. Gene-specific annotation in sequences from NCBI’s RefSeq (2) or the International Nucleotide Sequence Database Collaboration (INSDC) (3) usually serves as the foundation, with value added by information from collaborating model organism databases, public users and literature review (especially the Gene References into Function, or GeneRIFs, submitted by the public and staff of the National Library of Medicine). Updates are posted daily, and corrections or suggestions are welcomed (http://www.ncbi.nlm.nih.gov/RefSeq/update.cgi).
As of September 2010, there were almost 7 million current records in Entrez Gene, distributed among more than 7300 taxa (Table 1). Not all the taxa are represented comprehensively in Entrez Gene; most of the eukaryotes, for example, have records only for their mitochondrial or plastid genomes.
The Gene Statistics site (http://www.ncbi.nlm.nih.gov/projects/Gene/gentrez_stats.cgi) reports both current and historical counts of records by taxonomic node and species. The history reports can be used to track the growth of the database. For example, the history of the eukaryotic node (http://www.ncbi.nlm.nih.gov/projects/Gene/gentrez_stats.cgi?HIS=1&TAXORG=2759) shows that from 2004 until the present the number of genes represented increased almost 10-fold (221997–2520683), with a 5-fold increase in the number of species (485–2265).
Table 1. Representative statistics

Category | Taxa | GeneIDs
Records with GO terms | 37 | 241 633
Records with GeneRIFs | 1475 | 59 627
From archaea/bacteria | 2290 | 4 090 330
From fungi | 190 | 586 394
From protozoa | 146 | 400 187
From viruses | 2404 | 82 684
From plants | 148 | 354 241
From invertebrates | 670 | 554 008
From vertebrates (non-mammalian) | 1090 | 134 355
From mammalia | 311 | 471 725

FUNCTION OF THE DATABASE

A major goal of the database is to facilitate access to gene-specific information, and thus to expedite data exchange. The unique integer identifier assigned to each record (GeneID) is species specific. In other words, the integer assigned to dystrophin in human is different from that in any other species. The GeneID is reported in RefSeq records as a ‘db_xref’ (e.g. /db_xref="GeneID:1756", in GenBank format). The GeneID is also used to define genes in multiple files available for FTP, so that the information associated with GeneIDs is provided for unrestricted public use. Entrez Gene is also key to representation of gene-specific information at NCBI. The information conveyed by establishing the relationship between sequence and a GeneID is used by many NCBI resources. For example, the names associated with GeneIDs are used in HomoloGene, UniGene and RefSeqs. The curated gene to sequence relationship reported in Entrez Gene is used to inform automated annotation of genomes and UniGene clustering.
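As a rough illustration of how the db_xref convention can be used downstream, the following Python sketch pulls GeneIDs out of GenBank-format text with a regular expression. The function name and the sample feature fragment (including its coordinates) are hypothetical and exist only for this example; a production pipeline would more likely rely on a dedicated GenBank parser.

```python
import re

# Matches a GenBank-format qualifier such as /db_xref="GeneID:1756".
GENEID_XREF = re.compile(r'/db_xref\s*=\s*"GeneID:(\d+)"')


def extract_gene_ids(genbank_text):
    """Return the GeneIDs found in db_xref qualifiers, in order of first appearance."""
    seen = []
    for match in GENEID_XREF.finditer(genbank_text):
        gene_id = int(match.group(1))
        if gene_id not in seen:  # the same GeneID is repeated on gene, mRNA and CDS features
            seen.append(gene_id)
    return seen


# Hypothetical feature-table fragment, written by hand for illustration only.
sample = '''
     gene            1..100
                     /gene="DMD"
                     /db_xref="GeneID:1756"
     CDS             10..90
                     /gene="DMD"
                     /db_xref="GeneID:1756"
'''

print(extract_gene_ids(sample))
```

Because the GeneID is an integer that is stable for a given species, the extracted values can serve directly as join keys against the FTP files mentioned above.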

WEB REPORTS

Entrez Gene provides multiple reports. For the interactive user, the defaults are web pages or files to download based on a query result, which are accessed by making selections revealed when ‘Display Settings’ or ‘Send to’ is activated (Figure 1).
Figure 1. Representative ‘Summary’ report of query results. Result (partial) of a query to retrieve information about gckr as a gene symbol in mammals or fungi. This figure illustrates several points: (i) use of field restriction in the query; (ii) the display when ‘Limits’ is invoked to restrict results, in this case by species; (iii) use of ‘Display Settings’ to report five records per page ordered by Gene Weight (computed by number of gene-specific citations and conservation) and (iv) use of MyNCBI to highlight matches to the query term in the result set in green. ‘Limits Activated’: Mammalia, Fungi indicates that Mammalia and Fungi were both checked on the form accessed from ‘Limits’ over the query bar. Of the 15 results that were returned, the information under ‘Filter your results’ at the upper right indicates that 11 are current (Current Only, highlighted), 5 have genotype information available in dbSNP (Gene Genotype), 9 can be viewed in Map Viewer (Gene Map Viewer) and 8 have expression data in UniGene (Gene UniGene). For each GeneID returned by the filtered query, the summary includes the species, preferred and alternate symbols, preferred and other descriptive names, chromosome localization, the GeneID and the MIM number when appropriate. Click on any symbol to link to the full report or click on the Entrez Gene text at the upper left to return to Entrez Gene’s home page. The ‘Find related data’ menu in the column at the right allows selecting a database in which to find data related to initial query results. For example, to look for homologs of genes in a result set, select HomoloGene from the menu, read about how these links are calculated and click on ‘Find items’.

The ‘summary display’ results from a query and provides the standard Entrez tools to navigate to information related to the set of records that matched the query (Figure 1). Each gene-specific ‘full report’ is accessed either by a gene-specific URL (e.g. http://www.ncbi.nlm.nih.gov/gene/7097) or by clicking on the symbol in the summary page (Figure 2).
Figure 2. Representative full report in Entrez Gene. This figure is based on http://www.ncbi.nlm.nih.gov/gene/7097 with several sections closed to allow the report to fit on one page. Note that the concepts enumerated in the Table of Contents at the upper right are provided explicitly in the Entrez Gene full report; concepts enumerated in the Links section are presented from other resources at NCBI. Some of the titles in the Links section do not correspond exactly to the name of an NCBI database. For example, RefSeq proteins result in a display in the Protein database; RefSeq RNA and RefSeqGene result in displays in the Nucleotide database and SNP GeneView results in the gene-specific display from dbSNP.

The ‘Gene Table’ display (e.g. http://www.ncbi.nlm.nih.gov/gene/7097?report=gene_table) reports the intron/exon organization of the gene, as annotated on a RefSeq genomic sequence, with links to access the sequence of each exon, coding region or intron. If a gene is represented on multiple RefSeq genomic sequences, a menu is provided for the user to make a selection. The user can also elect to report the coordinates relative to the selected sequence or relative to the gene. The ‘GeneRIF’ report (e.g. http://www.ncbi.nlm.nih.gov/gene/7097?report=GeneRif) provides a tabular display of GeneRIF texts, with the title and author(s) of each paper. Columns can be sorted by clicking on the column header. The ‘XML’ and ‘ASN.1’ displays are provided as a text-like display without full Entrez functionality. If these pages are opened, the user must use the browser’s back function to return to the Entrez environment. Text of the Summary, Full Report and GeneTable displays can be generated from the ‘Send to’ function at the top right, choosing File, and selecting an option from the menu.
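The gene-specific URLs quoted above follow a simple pattern, which the short Python sketch below makes explicit. It only assembles the address strings shown in the text (the full report, and the `report=gene_table` and `report=GeneRif` variants); the helper name is hypothetical, and other values of the `report` parameter are not assumed to exist.

```python
from urllib.parse import urlencode

# Base address taken from the example URLs in the text.
GENE_BASE = "http://www.ncbi.nlm.nih.gov/gene"


def gene_report_url(gene_id, report=None):
    """Build the URL of a gene-specific report for an integer GeneID."""
    gene_id = int(gene_id)
    if gene_id <= 0:
        raise ValueError("GeneID must be a positive integer")
    url = f"{GENE_BASE}/{gene_id}"
    if report:
        # Only 'gene_table' and 'GeneRif' are documented in the text above.
        url += "?" + urlencode({"report": report})
    return url


print(gene_report_url(7097))
print(gene_report_url(7097, "gene_table"))
print(gene_report_url(7097, "GeneRif"))
```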

FTP AND E-UTILITIES

In addition to these views from Entrez, Gene provides a complete database extraction as well as several special reports for FTP transfer (ftp://ftp.ncbi.nlm.nih.gov/gene/README). Most of the files on the ftp site are refreshed daily. The data are also available from the programmatic interface to Entrez, namely E-Utilities (1).
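For readers who want to try the programmatic route, here is a minimal Python sketch that builds an E-Utilities ESummary request for one or more GeneIDs. The `db`, `id` and `retmode` parameters mirror the example URL given in Table 2; the use of `https` rather than `http`, the helper names and the timeout value are assumptions of this sketch, and NCBI's current usage guidelines should be consulted before issuing requests in bulk.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Host and path follow the ESummary example in Table 2; https is an assumption.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esummary_url(gene_ids, retmode="xml"):
    """Return an ESummary URL for a sequence of integer GeneIDs."""
    ids = ",".join(str(int(g)) for g in gene_ids)
    if not ids:
        raise ValueError("at least one GeneID is required")
    # Keep the comma literal so several ids travel in a single parameter.
    query = urlencode({"db": "gene", "id": ids, "retmode": retmode}, safe=",")
    return f"{EUTILS_BASE}/esummary.fcgi?{query}"


def fetch_summary(gene_ids, timeout=30):
    """Download the summary document as text (performs a network request)."""
    with urlopen(esummary_url(gene_ids), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the URL does not touch the network.
print(esummary_url([672]))
```

Calling `fetch_summary([672])` would retrieve the XML that the Table 2 hint suggests inspecting in a browser; the structure of that document is not described here and is best checked against a live response.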

CONTENT OF THE DATABASE

When are GeneIDs assigned and how is each categorized?

A GeneID is usually assigned to what is annotated as a gene on a RefSeq record. Exceptions include RefSeqs from bacterial genomes that are annotated whole-genome shotgun sequences. A GeneID may also be assigned when no RefSeq exists. This may occur when an authoritative source for a genome, such as a model organism-specific database, assigns an identifier to what is termed a gene, mapped locus or trait, even though that entity is not completely defined by sequence. When a record in Entrez Gene is established, it is assigned a category (e.g. protein coding, pseudogene, rRNA, unknown) consistent with the molecule types defined by the INSDC. The term ‘unknown’ is used when the category is under review by RefSeq staff, as when some of the sequences defining the gene are annotated with coding regions, but the support for that annotation is inconclusive. The category can change without changing the GeneID.

A representative full record

A full record in Entrez Gene is subdivided into content-specific sections as summarized in its table of contents and the section headers (Figure 2). Each section of the record can be collapsed, and the section divider has both a link (icon: question mark) to documentation and a function to return to the top of the page. Not all records will have content in each category, but all have a GeneID, names and information supporting the creation of the record (either sequence, link to an external database or publications). Some of the content is not reviewed by NCBI staff, but integrated automatically. For example, the content in the Interactions section and in several parts of the General Gene Information section comes primarily from external groups [e.g. EcoCyc (4), Gene Ontology Consortium (5), KEGG (6), Reactome (7)]. When genomic RefSeqs annotated with the gene are available, the ‘Genomic regions, transcripts and products’ section includes an embedded, interactive sequence display that can be expanded. To expedite loading of web pages, the default display of the full record often renders only a subset of the bibliographic and interaction information. Links are provided within those sections to navigate to additional pages. To get the full report in one page, the ‘Send to’ option allows saving the record as a text file. Comprehensive and up-to-date documentation of the contents and maintenance of these sections is provided in the Gene Help Book on NCBI’s bookshelf (http://www.ncbi.nlm.nih.gov/books/NBK3839/). In addition to the content it displays directly, Entrez Gene provides numerous links to information from other databases within the text and in the Links menu at the right (Figure 2). For example, clicking on ‘RefSeq protein’, ‘RefSeq RNA’ or ‘RefSeqGene’ in the menu at the right takes users to the Protein or Nucleotide database, where the RefSeq records specific to one gene can be retrieved, reviewed and analyzed.
Similarly, users may select HomoloGene or ProteinClusters (8) links for integration of information about homologs, Map Viewer for extended genomic context and comparative maps, GENSAT, UniGene and GEO for expression data, Conserved Domain Database for domain content of proteins, OMIM (9) for human Mendelian disorders, PubMed and Books for publications. Entrez Gene also provides extensive links to species- or gene-specific databases or gene records in other browsers. Many groups also use the LinkOut (1) method to link their resources to information in Entrez Gene. The integration of explicit content links to gene-specific reports in other NCBI databases, and links to external resources all contribute to making Entrez Gene an effective site to retrieve gene-specific information.

ACCESS TO ENTREZ GENE

The information in Entrez Gene can be accessed in multiple ways at NCBI (Table 2). The simplest way is to submit an interactive query to Entrez from the NCBI home page and display the results in Gene, or enter a query in any Entrez query bar and restrict the database search to Gene. Starting from Entrez Gene directly, the ‘Limits’ and ‘Advanced Search’ pages make it easier to construct complex queries and submit them. For example, the ‘Limits’ page supports finding genes by chromosome location or in a taxonomic node and the ‘Advanced Search’ page has a query builder, a function to browse all the terms in the database and the fields in which they occur (browse index) and a tool to combine and compare previous query results (search history). All the text in the Entrez Gene record is indexed to support retrieval. For a more comprehensive discussion on how to query Entrez Gene, please refer to the Query Tips section of the help documentation. If the location in the record that matches a query term is not immediately obvious, the text of interest may be in the next page of a paginated section.
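The kind of field-restricted query described here (and illustrated in Figure 1 for gckr in mammals or fungi) can also be assembled as a string and sent through the E-Utilities ESearch endpoint. The Python sketch below is illustrative only: the field tags `[sym]` and `[orgn]` are assumptions that should be checked against the Query Tips section of the help documentation, and the helper names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# ESearch endpoint of E-Utilities; https is an assumption of this sketch.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_term(symbol, taxa=(), symbol_field="sym", organism_field="orgn"):
    """Combine a gene symbol with an optional OR-list of taxonomic restrictions.

    The default field tags are assumed, not taken from the article.
    """
    term = f"{symbol}[{symbol_field}]"
    if taxa:
        organisms = " OR ".join(f"{taxon}[{organism_field}]" for taxon in taxa)
        term += f" AND ({organisms})"
    return term


def esearch_url(term):
    """Return an ESearch URL that restricts the search to the Gene database."""
    return ESEARCH + "?" + urlencode({"db": "gene", "term": term})


term = build_gene_term("gckr", ["Mammalia", "Fungi"])
url = esearch_url(term)
print(term)
print(url)

# Decoding the query string recovers the original term unchanged.
assert parse_qs(urlsplit(url).query)["term"] == [term]
```

Keeping term construction separate from URL encoding makes it easy to paste the same term into the interactive query bar while debugging.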
Table 2. Accessing Entrez Gene

Direct query
 Enter search term(s) and select results shown in the Gene section | http://www.ncbi.nlm.nih.gov or http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi?
 Enter search term(s) and query only Entrez Gene | http://www.ncbi.nlm.nih.gov/gene or select Gene as the search option from any Entrez query bar
 E-Utilities: check the result interactively. (Hint: view source if the browser does not display the XML.) | http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=672&retmode=xml
Record-specific connections to Gene from other NCBI databases
 Gene option in the links menu located at the right of a display | Click on Gene to find gene records related to the record being displayed.
 Gene ‘ads’ | A query that looks like a gene symbol results in a Gene Ad (located above the query results) suggesting users to check Entrez Gene for additional information; or, for sequence records with explicit links, an Ad is provided in the right column to highlight the link to Entrez Gene.
 Gene symbols or GeneID | Many NCBI databases provide links to Entrez Gene anchored on either the gene symbol or the GeneID.
 Links called Gene or G | Map Viewer’s annotation of Genes; BLAST retrieval of accessions connected to Gene records.
More information
 Help documentation | http://www.ncbi.nlm.nih.gov/books/NBK3839/
 General use of Entrez | http://www.ncbi.nlm.nih.gov/books/NBK3836/

Another way to access Entrez Gene is to take advantage of links computed by the Entrez system (1). For example, users starting at PubMed may use the ‘Find related data’ or ‘All links from this record’ options to discover records in Entrez Gene connected to the publication(s). The BLAST group uses the GeneID–sequence relationship maintained by Entrez Gene to help users navigate from protein or mRNA accessions matching a sequence query to Entrez Gene via the blue G icon. Map Viewer provides links from annotated genes to Entrez Gene, and RefSeq records include the GeneID as a db_xref in the gene feature. Thus, users can navigate to Entrez Gene not only by text but also by genomic position, RefSeq annotation and sequence data (BLAST, Nucleotide, Protein). Users are encouraged to register for MyNCBI (http://www.ncbi.nlm.nih.gov/books/NBK3843/), which supports registering searches and receiving e-mails when records are created or updated. It also supports customizing the display to identify what subset of records returned by a query has particular attributes.

FUTURE DIRECTIONS

The number of records in Entrez Gene will continue to increase as new species are sequenced and genes are identified. During 2011, sections will be added to the web interface and/or the content will be enhanced so that users will be provided more information in the full report before navigating to related sites at NCBI. This transition was started in 2010 with the addition of the phenotype section. Finally, as new databases with gene-specific content are implemented at NCBI, content and/or links will be added to Entrez Gene.

FEEDBACK

We welcome feedback with respect to the Entrez Gene interface or any data contained therein. Please select from the Feedback options on any Gene page (Figure 1).

FUNDING

Funding for open access charge: The Intramural Research Program of the National Institutes of Health; National Library of Medicine.

Conflict of interest statement. None declared.

REFERENCES

1.  Database resources of the National Center for Biotechnology Information.
Authors:  Eric W Sayers; Tanya Barrett; Dennis A Benson; Evan Bolton; Stephen H Bryant; Kathi Canese; Vyacheslav Chetvernin; Deanna M Church; Michael Dicuccio; Scott Federhen; Michael Feolo; Lewis Y Geer; Wolfgang Helmberg; Yuri Kapustin; David Landsman; David J Lipman; Zhiyong Lu; Thomas L Madden; Tom Madej; Donna R Maglott; Aron Marchler-Bauer; Vadim Miller; Ilene Mizrachi; James Ostell; Anna Panchenko; Kim D Pruitt; Gregory D Schuler; Edwin Sequeira; Stephen T Sherry; Martin Shumway; Karl Sirotkin; Douglas Slotta; Alexandre Souvorov; Grigory Starchenko; Tatiana A Tatusova; Lukas Wagner; Yanli Wang; W John Wilbur; Eugene Yaschenko; Jian Ye
Journal:  Nucleic Acids Res       Date:  2009-11-12

2.  NCBI Reference Sequences: current status, policy and new initiatives.
Authors:  Kim D Pruitt; Tatiana Tatusova; William Klimke; Donna R Maglott
Journal:  Nucleic Acids Res       Date:  2008-10-16

3.  GenBank.
Authors:  Dennis A Benson; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; Eric W Sayers
Journal:  Nucleic Acids Res       Date:  2009-11-12

4.  EcoCyc: a comprehensive view of Escherichia coli biology.
Authors:  Ingrid M Keseler; César Bonavides-Martínez; Julio Collado-Vides; Socorro Gama-Castro; Robert P Gunsalus; D Aaron Johnson; Markus Krummenacker; Laura M Nolan; Suzanne Paley; Ian T Paulsen; Martin Peralta-Gil; Alberto Santos-Zavaleta; Alexander Glennon Shearer; Peter D Karp
Journal:  Nucleic Acids Res       Date:  2008-10-30

5.  The Gene Ontology in 2010: extensions and refinements.
Authors:  Gene Ontology Consortium
Journal:  Nucleic Acids Res       Date:  2009-11-17

6.  KEGG: Kyoto Encyclopedia of Genes and Genomes.
Authors:  H Ogata; S Goto; K Sato; W Fujibuchi; H Bono; M Kanehisa
Journal:  Nucleic Acids Res       Date:  1999-01-01

7.  Reactome knowledgebase of human biological pathways and processes.
Authors:  Lisa Matthews; Gopal Gopinath; Marc Gillespie; Michael Caudy; David Croft; Bernard de Bono; Phani Garapati; Jill Hemish; Henning Hermjakob; Bijay Jassal; Alex Kanapin; Suzanna Lewis; Shahana Mahajan; Bruce May; Esther Schmidt; Imre Vastrik; Guanming Wu; Ewan Birney; Lincoln Stein; Peter D'Eustachio
Journal:  Nucleic Acids Res       Date:  2008-11-03

8.  The National Center for Biotechnology Information's Protein Clusters Database.
Authors:  William Klimke; Richa Agarwala; Azat Badretdin; Slava Chetvernin; Stacy Ciufo; Boris Fedorov; Boris Kiryutin; Kathleen O'Neill; Wolfgang Resch; Sergei Resenchuk; Susan Schafer; Igor Tolstoy; Tatiana Tatusova
Journal:  Nucleic Acids Res       Date:  2008-10-21

9.  McKusick's Online Mendelian Inheritance in Man (OMIM).
Authors:  Joanna Amberger; Carol A Bocchini; Alan F Scott; Ada Hamosh
Journal:  Nucleic Acids Res       Date:  2008-10-08