GenBank.

Eric W Sayers¹, Mark Cavanaugh¹, Karen Clark¹, Kim D Pruitt¹, Conrad L Schoch¹, Stephen T Sherry¹, Ilene Karsch-Mizrachi¹.

Abstract

GenBank® (https://www.ncbi.nlm.nih.gov/genbank/) is a comprehensive, public database that contains 15.3 trillion base pairs from over 2.5 billion nucleotide sequences for 504 000 formally described species. Recent updates include resources for data from the SARS-CoV-2 virus, including a SARS-CoV-2 landing page, NCBI Datasets, NCBI Virus and the Submission Portal. We also discuss upcoming changes to GI identifiers, a new data management interface for BioProject, and advice for providing contextual metadata in submissions. Published by Oxford University Press on behalf of Nucleic Acids Research 2021.

Year: 2022 PMID: 34850943 PMCID: PMC8690257 DOI: 10.1093/nar/gkab1135

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

GenBank (1) is a comprehensive public database of nucleotide sequences and supporting bibliographic and biological annotations built and distributed by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), located on the campus of the US National Institutes of Health (NIH) in Bethesda, MD, USA. After discussing updates to SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) resources, this paper summarizes the growth of GenBank in the past year and briefly reviews recent updates and developments.

SARS-CoV-2 RESOURCES

As part of our ongoing response to the COVID-19 pandemic that emerged in early 2020, NCBI continues to update several tools and interfaces to support both submitters and consumers of sequence data for SARS-CoV-2. These include the SARS-CoV-2 landing page, NCBI Datasets, NCBI Virus, and the Submission Portal.

SARS-CoV-2 landing page

The SARS-CoV-2 landing page (https://www.ncbi.nlm.nih.gov/sars-cov-2/) collects a wide variety of data and resources related to SARS-CoV-2, including all relevant data in GenBank. Of particular interest to users seeking GenBank data are links to NCBI Datasets and NCBI Virus (see below) along with a link to download the full list of nucleotide accessions for SARS-CoV-2.

NCBI Datasets

NCBI Datasets is an experimental product that allows users to download complex genomic datasets easily using a web interface, an API or a UNIX/Linux command-line tool (https://www.ncbi.nlm.nih.gov/datasets/). The specialized coronavirus page released last year now provides downloads for almost 430 000 complete SARS-CoV-2 genomes, a 29-fold increase over the past year (https://www.ncbi.nlm.nih.gov/datasets/coronavirus/genomes). This page provides downloads of metadata tables for SARS-CoV-2 genomes as well as complete genomic datasets. Users interested in SARS-CoV-2 proteins can access these data on a separate specialized page (https://www.ncbi.nlm.nih.gov/datasets/coronavirus/proteins/). Finally, NCBI Datasets also includes a new genome interface that supports taxonomic searches and selection based on the taxonomic tree. This page may be of interest to users seeking data for other coronaviruses not included on the specialized SARS-CoV-2 pages.

NCBI Virus

The NCBI Virus resource contains a SARS-CoV-2 Data Hub (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/sars-cov-2) that organizes an extensive set of SARS-CoV-2 data and visualizations (Figure 1), including data from the Sequence Read Archive (SRA). The visualizations on the default 'Dashboard' view include a world map showing the geographical distribution of SARS-CoV-2 collection locations. Two interactive filters allow users to subset these data by collection date and release date, and these filters update the map display. A 'Tabular View' option loads an interactive table listing all SARS-CoV-2 sequences with 21 filters, including sequence length, collection date, and geographic region. Conveniently, any filters set on the Dashboard transfer to the table, allowing easy exploration of the data. From this table, users can also generate alignments and build phylogenetic trees.

Figure 1.

SARS-CoV-2 Data Hub in the NCBI Virus resource.

Submission Portal

NCBI continues to update a customized submission portal for both assembled and unassembled SARS-CoV-2 sequences (https://submit.ncbi.nlm.nih.gov/sarscov2/). On average this portal returns accessions to submitters in 1–2 h, and assembled sequences will be annotated with VADR (2). We encourage submitters to use this portal, as doing so ensures that sequence data are made available not only through the INSDC databases, but also through the NCBI Virus resource (3), RefSeq (4), and BLAST (5). We also encourage submitters to submit both reads and traditional GenBank sequences, and to submit data to BioProject and BioSample. We are actively updating all of these resources to support novel variants and to update the content of the SARS-CoV-2 pages discussed above.

GROWTH OF THE DATABASE

Divisions with notable increases

GenBank sequences are organized into 21 divisions, each of which is represented by a three-letter abbreviation (Table 1). As shown in Table 1, especially large increases occurred in the VRL, UNA and INV divisions. Not surprisingly, the large increase in the VRL division resulted from the many submissions of SARS-CoV-2 sequences (Figure 2).

Table 1.

Growth of GenBank Divisions

Division	Description	Base pairs^a	Annual increase^b
VRL	Viruses	39 351 597 469	575.68%
UNA	Unannotated	4 421 782	550.93%
INV	Invertebrates	108 680 334 593	450.00%
ROD	Rodents	23 336 550 435	93.02%
PRI	Primates	15 165 437 356	72.97%
WGS	Whole genome shotgun data	13 888 187 863 722	57.08%
TLS	Targeted Loci Studies	39 930 167 315	43.50%
MAM	Other mammals	28 568 850 588	37.06%
VRT	Other vertebrates	85 320 979 451	34.22%
BCT	Bacteria	130 518 385 589	32.07%
PLN	Plants	350 590 744 188	30.12%
TSA	Transcriptome shotgun data	454 757 992 932	19.31%
PHG	Phages	935 884 237	19.59%
PAT	Patent sequences	29 588 418 021	11.85%
ENV	Environmental samples	7 394 414 660	9.46%
SYN	Synthetic	7 994 601 379	0.78%
HTC	High-throughput cDNA	737 423 641	0.57%
HTG	High-throughput genomic	27 800 219 072	0.07%
EST	Expressed sequence tags	43 324 455 796	0.05%
GSS	Genome survey sequences	26 380 049 011	0.01%
STS	Sequence tagged sites	640 923 137	0.00%
TOTAL	All GenBank sequences	15 309 209 714 374	54.79%

^a Release 245 (8/2021).

^b Relative to release 239 (8/2020).
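
The annual increases in Table 1 can be turned back into approximate release 239 sizes with prior = current / (1 + increase/100). The following is a minimal sketch of that arithmetic for two rows of the table; the function name `previous_base_pairs` and the dictionary layout are illustrative assumptions, not part of any NCBI tool, and the results are only as precise as the rounded percentages.

```python
def previous_base_pairs(current: int, increase_percent: float) -> float:
    """Estimate the base-pair count one year earlier from the current
    count and the annual increase given as a percentage."""
    return current / (1 + increase_percent / 100)


# Two rows copied from Table 1: (base pairs in release 245, annual increase in %).
divisions = {
    "VRL": (39_351_597_469, 575.68),
    "TOTAL": (15_309_209_714_374, 54.79),
}

for name, (base_pairs, percent) in divisions.items():
    prior = previous_base_pairs(base_pairs, percent)
    added = base_pairs - prior
    print(f"{name}: about {prior / 1e9:,.1f} Gbp a year earlier, "
          f"about {added / 1e9:,.1f} Gbp added since")
```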

Figure 2.

Growth of SARS-CoV-2 sequence data in GenBank. Each data point represents the cumulative number of records (left axis) or base pairs (right axis) at each date.

Handling long sequence records

As previously discussed (1), improved sequencing technologies can now produce very long sequences, some of which are longer than a signed 32-bit integer can represent (2 147 483 647, or about 2.1 Gbp). Submitters must split such records in order to submit them to GenBank. A recent example is chromosome 1 from the West African lungfish, Protopterus annectens. The total length of this chromosome is 5.26 Gbp, so in GenBank it is represented by three records: CM033073 (2.00 Gbp), CM033074 (2.00 Gbp) and CM033075 (1.26 Gbp). We encourage GenBank users and developers of products that rely on GenBank data to be aware of the implications of representing very long sequences and to consider preparing their own tools for sequence lengths and feature locations that will require 64-bit integers.
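
For developers checking their own tools, the limit and the lungfish example above can be sketched as follows. This is an illustration only: the 2 000 000 000 bp chunk size is an assumption inferred from the rounded record sizes quoted above, the total length is the rounded 5.26 Gbp figure, and `fits_int32` and `split_lengths` are hypothetical helper names.

```python
from typing import List

INT32_MAX = 2**31 - 1  # 2 147 483 647, the largest signed 32-bit integer


def fits_int32(length: int) -> bool:
    """Return True if a sequence length can be stored in a signed 32-bit integer."""
    return 0 <= length <= INT32_MAX


def split_lengths(total: int, chunk: int) -> List[int]:
    """Split a total length into consecutive pieces of at most `chunk` bases."""
    if total < 0 or chunk <= 0:
        raise ValueError("total must be >= 0 and chunk must be > 0")
    full, rest = divmod(total, chunk)
    return [chunk] * full + ([rest] if rest else [])


chromosome_length = 5_260_000_000  # rounded length of lungfish chromosome 1
print(fits_int32(chromosome_length))
print(split_lengths(chromosome_length, 2_000_000_000))
```

Each resulting piece fits in a signed 32-bit integer, whereas a coordinate on the unsplit chromosome would not. Python integers are arbitrary precision, so the check matters mainly for tools that store lengths in fixed-width types.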

RECENT DEVELOPMENTS

Updates to integer sequence identifiers

In addition to the above issue of handling very long individual sequences, GenBank is approaching a point where the number of sequences will exhaust the space of GI identifiers provided by 32-bit integers. To mitigate this, we are taking multiple approaches. First, we continue to recommend that users shift to using accession.version identifiers to refer to all GenBank data (6). Most external NCBI interfaces, including the Entrez web interface and the E-utilities API, now accept and return accession.version identifiers for all sequences. Second, we are transitioning our internal software to use 64-bit integers for GI identifiers. Once the transition occurs, GenBank users will encounter these identifiers in the XML and ASN.1 presentations of GenBank data provided through the Entrez web interface and in GenBank FTP products (https://ncbiinsights.ncbi.nlm.nih.gov/2021/09/02/64-bit-gis/). We encourage developers who rely on GenBank data to ensure that their software is capable of handling these 64-bit identifiers. Such identifiers are easy to recognize, as they are any integer greater than 2 147 483 647.
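
As a rough aid for developers auditing their code, the sketch below classifies an identifier string as an accession.version, a GI that fits in 32 bits, or a GI that needs 64 bits. The regular expression is a deliberately simplified assumption about the shape of accession.version identifiers (uppercase letters, an optional underscore, digits, a dot and a version number), not the official INSDC accession grammar, and `classify_identifier` is a hypothetical helper name.

```python
import re

INT32_MAX = 2_147_483_647  # GIs above this value require 64-bit storage

# Simplified, assumed pattern: e.g. "CM033073.1" or "NC_045512.2".
ACCESSION_VERSION = re.compile(r"^[A-Z]{1,6}_?\d{5,}\.\d+$")


def classify_identifier(identifier: str) -> str:
    """Return 'accession.version', 'gi-32bit', 'gi-64bit' or 'unknown'."""
    text = identifier.strip()
    if ACCESSION_VERSION.match(text):
        return "accession.version"
    if text.isdigit():
        return "gi-32bit" if int(text) <= INT32_MAX else "gi-64bit"
    return "unknown"


# The version suffix ".1" on the lungfish accession is assumed for illustration.
for example in ["CM033073.1", "2147483647", "2147483648", "not-an-id"]:
    print(example, "->", classify_identifier(example))
```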

BioProject data management

When submitters register sequencing projects in the BioProject database (https://www.ncbi.nlm.nih.gov/bioproject), we can create reliable linkages between such sequencing projects and the data they produce, and in many cases to the BioSample database (7), which provides additional information about the biological materials used in the study. Submitters often create BioProject records before they have collected all relevant data and published the results of the study. We have now made it easier for submitters to update their BioProject records with such information by offering a 'Manage Data' interface in the Submission Portal (https://dataview.ncbi.nlm.nih.gov/?archive=bioproject). Using this interface, submitters can add publications and grants or edit text metadata such as the BioProject title and description. We hope this will allow BioProject to better reflect the current state of these projects and provide a better service to the community.

Advice for submitters

Contextual metadata

As discussed previously (1), we continue to encourage submitters to provide contextual metadata, particularly data that specify the sampling location (e.g. country, latitude, and longitude). The importance of basic geographic information, including the country codes displayed on public sequence records (https://insdc.org/country), has only grown with the urgency to verify and track the distribution of biodiversity in the current era. Including other data such as the isolate name or number and applicable museum/collection identifiers is also helpful. Where possible, adding links to permanent samples or vouchers at biorepositories provides access to sources with important, richly populated information. This facilitates replication and validation, while also allowing for analyses across scientific disciplines (8). GenBank has long followed the standard of structuring vouchers using Darwin Core formats (9), which allows us to link to specimen pages at external biorepositories using URLs curated in the NCBI BioCollections database (10). Recently, BioCollections added a new category, 'digital repository', that will include online aggregators of collection data that do not include physical specimens. Additionally, to prepare the way for a more comprehensive treatment of these data elements and to make their presence required as part of the submission process, GenBank and the INSDC developed a set of standardized terms to clearly indicate when submitters cannot provide voucher information, for example in cases where the data were not collected or cannot be reported because of privacy concerns (https://www.insdc.org/missing-value-reporting).

In addition to the above, there are other ways in which submitters can enhance their data. Submitters can use evidence tags to provide information about supporting evidence for annotations (https://www.ncbi.nlm.nih.gov/genbank/evidence/). They can cite within their submission the accession numbers of any publicly available sequencing reads they used to improve the quality of their assemblies. When submitting prokaryotic genomes, they can create annotated genomes with NCBI's Prokaryotic Genome Annotation Pipeline (PGAP; https://www.ncbi.nlm.nih.gov/genome/annotation_prok/) either by submitting FASTA files and requesting PGAP during submission of the genomes to GenBank or by running the public version of PGAP themselves and then submitting the GenBank-ready ASN.1 output file.
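
To illustrate the contextual metadata discussed in this section, the sketch below splits a structured voucher string into its parts and lists which contextual fields a record lacks. It assumes the colon-separated layout institution-code:collection-code:specimen_id, with the first two parts optional. The field names in `CONTEXT_FIELDS`, the example voucher and both helper names are hypothetical choices for illustration, not a description of GenBank's validation.

```python
from typing import Dict, List, Optional, Tuple

# Assumed field names for the contextual metadata named in the text.
CONTEXT_FIELDS = ("country", "lat_lon", "isolate", "specimen_voucher")


def parse_voucher(value: str) -> Tuple[Optional[str], Optional[str], str]:
    """Split a voucher into (institution code, collection code, specimen id)."""
    parts = [part.strip() for part in value.split(":", 2)]
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    if len(parts) == 2:
        return parts[0], None, parts[1]
    return None, None, parts[0]


def missing_context(record: Dict[str, str]) -> List[str]:
    """List the contextual fields that are absent or empty in a record."""
    return [field for field in CONTEXT_FIELDS if not record.get(field, "").strip()]


record = {"country": "Viet Nam", "specimen_voucher": "USNM:Birds:12345"}  # made-up values
print(parse_voucher(record["specimen_voucher"]))
print(missing_context(record))
```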

Acquiring the database

NCBI provides GenBank sequence records in both the traditional flat file format and in a structured ASN.1 format by anonymous FTP at ftp.ncbi.nlm.nih.gov/genbank. For release 245 (15 August 2021) there are 4032 files requiring 1888 GB of uncompressed disk storage. In addition, daily GenBank incremental update files containing new records and those updated since the most recent release are available in flat file format at ftp.ncbi.nlm.nih.gov/genbank/daily-nc/.
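
A minimal sketch of listing the release files over anonymous FTP with Python's standard `ftplib` module is shown below. The host and directory come from the text above; the assumption that compressed flat files end in `.seq.gz`, and the helper names `flatfile_names` and `list_release_flatfiles`, are ours. The listing function opens a network connection, so it is defined here but not called.

```python
from ftplib import FTP
from typing import Iterable, List

HOST = "ftp.ncbi.nlm.nih.gov"
RELEASE_DIR = "genbank"


def flatfile_names(names: Iterable[str]) -> List[str]:
    """Keep names that look like compressed flat files (assumed '.seq.gz' suffix)."""
    return sorted(name for name in names if name.endswith(".seq.gz"))


def list_release_flatfiles(host: str = HOST, directory: str = RELEASE_DIR,
                           timeout: float = 60.0) -> List[str]:
    """Connect anonymously and return the flat-file names in the release directory."""
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()          # anonymous login
        ftp.cwd(directory)   # change to the release directory
        return flatfile_names(ftp.nlst())


# Average uncompressed size per file for release 245, from the figures above.
average_gb = 1888 / 4032
print(f"about {average_gb:.2f} GB per file on average")
```

Calling `list_release_flatfiles()` returns the sorted file names, which can then be fetched individually, for example with `FTP.retrbinary`.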

CITING GENBANK

If you use the GenBank database in your published research, we ask that this article be cited.

10 in total

1. GenBank.

Authors: Eric W Sayers; Mark Cavanaugh; Karen Clark; James Ostell; Kim D Pruitt; Ilene Karsch-Mizrachi
Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971

2. Darwin Core: an evolving community-developed biodiversity data standard.

Authors: John Wieczorek; David Bloom; Robert Guralnick; Stan Blum; Markus Döring; Renato Giovanni; Tim Robertson; David Vieglais
Journal: PLoS One Date: 2012-01-06 Impact factor: 3.240

3. BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata.

Authors: Tanya Barrett; Karen Clark; Robert Gevorgyan; Vyacheslav Gorelenkov; Eugene Gribov; Ilene Karsch-Mizrachi; Michael Kimelman; Kim D Pruitt; Sergei Resenchuk; Tatiana Tatusova; Eugene Yaschenko; James Ostell
Journal: Nucleic Acids Res Date: 2011-12-01 Impact factor: 16.971

4. NCBI viral genomes resource.

Authors: J Rodney Brister; Danso Ako-Adjei; Yiming Bao; Olga Blinkova
Journal: Nucleic Acids Res Date: 2014-11-26 Impact factor: 16.971

5. The NCBI BioCollections Database.

Authors: Shobha Sharma; Stacy Ciufo; Elena Starchenko; Dakshesh Darji; Larry Chlumsky; Ilene Karsch-Mizrachi; Conrad L Schoch
Journal: Database (Oxford) Date: 2018-01-01 Impact factor: 3.451

6. VADR: validation and annotation of virus sequence submissions to GenBank.

Authors: Alejandro A Schäffer; Eneida L Hatcher; Linda Yankie; Lara Shonkwiler; J Rodney Brister; Ilene Karsch-Mizrachi; Eric P Nawrocki
Journal: BMC Bioinformatics Date: 2020-05-24 Impact factor: 3.169

7. BLAST: a more efficient report with usability improvements.

Authors: Grzegorz M Boratyn; Christiam Camacho; Peter S Cooper; George Coulouris; Amelia Fong; Ning Ma; Thomas L Madden; Wayne T Matten; Scott D McGinnis; Yuri Merezhuk; Yan Raytselis; Eric W Sayers; Tao Tao; Jian Ye; Irena Zaretskaya
Journal: Nucleic Acids Res Date: 2013-04-22 Impact factor: 16.971

8. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation.

Authors: Nuala A O'Leary; Mathew W Wright; J Rodney Brister; Stacy Ciufo; Diana Haddad; Rich McVeigh; Bhanu Rajput; Barbara Robbertse; Brian Smith-White; Danso Ako-Adjei; Alexander Astashyn; Azat Badretdin; Yiming Bao; Olga Blinkova; Vyacheslav Brover; Vyacheslav Chetvernin; Jinna Choi; Eric Cox; Olga Ermolaeva; Catherine M Farrell; Tamara Goldfarb; Tripti Gupta; Daniel Haft; Eneida Hatcher; Wratko Hlavina; Vinita S Joardar; Vamsi K Kodali; Wenjun Li; Donna Maglott; Patrick Masterson; Kelly M McGarvey; Michael R Murphy; Kathleen O'Neill; Shashikant Pujar; Sanjida H Rangwala; Daniel Rausch; Lillian D Riddick; Conrad Schoch; Andrei Shkeda; Susan S Storz; Hanzhen Sun; Francoise Thibaud-Nissen; Igor Tolstoy; Raymond E Tully; Anjana R Vatsan; Craig Wallin; David Webb; Wendy Wu; Melissa J Landrum; Avi Kimchi; Tatiana Tatusova; Michael DiCuccio; Paul Kitts; Terence D Murphy; Kim D Pruitt
Journal: Nucleic Acids Res Date: 2015-11-08 Impact factor: 16.971

9. GenBank.

Authors: Eric W Sayers; Mark Cavanaugh; Karen Clark; Kim D Pruitt; Conrad L Schoch; Stephen T Sherry; Ilene Karsch-Mizrachi
Journal: Nucleic Acids Res Date: 2021-01-08 Impact factor: 16.971
