The International Nucleotide Sequence Database Collaboration.

Guy Cochrane¹, Ilene Karsch-Mizrachi², Toshihisa Takagi³.

Abstract

The International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org) comprises three global partners committed to capturing, preserving and providing comprehensive public-domain nucleotide sequence information. The INSDC establishes standards, formats and protocols for data and metadata to make it easier for individuals and organisations to submit their nucleotide data reliably to public archives. This work enables the continuous, global exchange of information about living things. Here we present an update of the INSDC in 2015, including data growth and diversification, new standards and requirements by publishers for authors to submit their data to the public archives. The INSDC serves as a model for data sharing in the life sciences.

Year: 2015 PMID: 26657633 PMCID: PMC4702924 DOI: 10.1093/nar/gkv1323

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org) (1) is one of the largest, longest-standing partnerships championing public access to primary scientific data. It has spearheaded the establishment of standards, formats and protocols for the collection of nucleotide sequence data and metadata, and has provided the internationally recognised system of accession numbers for data submitters and scientific journals since the early 1980s. INSDC partners are (in alphabetical order): the DNA Data Bank of Japan (DDBJ; http://www.ddbj.nig.ac.jp/) at the National Institute of Genetics in Mishima, Japan; the European Nucleotide Archive (ENA; www.ebi.ac.uk/ena) at the EMBL European Bioinformatics Institute (EMBL-EBI) in Cambridge, UK; and GenBank (http://www.ncbi.nlm.nih.gov/genbank) at the National Center for Biotechnology Information (NCBI) in Bethesda, MD, USA. The INSDC's policy (http://www.insdc.org/policy.html), first published in 2002, emphasises the collaboration's mandate to uphold free, unrestricted access to all of the data records their databases contain. Every day, INSDC partners capture, preserve, share and exchange a comprehensive collection of nucleotide sequence and associated information. Thanks to the falling costs of sequencing, INSDC handles a staggering volume of data (2,400 trillion bases in 2015) and develops new services to handle the changing landscape of data types generated using emerging high-throughput technologies. Its repositories are built to accommodate everything from raw data (i.e. next-generation sequencing reads (2)) through assembly data, experimental design details, taxonomic information and functional annotation to information about the projects and biological samples associated with sequencing efforts. All INSDC partners provide assembled sequences and annotations and are well synchronised in their wider activities, including routine data exchange, standard formats and sharing technology.
The INSDC is responsive to the rapidly changing needs of the world's growing molecular-biology community, but its core mandate remains unchanged. This article reaffirms INSDC policy and gives an overview of recent activities of the global, public nucleotide sequence archives.

INSDC POLICY

The core of the INSDC policy is maintaining public access to the global archives of nucleotide data generated in publicly funded experiments. A key instrument for this is submission as a prerequisite for publication in scholarly journals, a convention in which INSDC partners and publishers work together to ensure the timely and smooth flow of data into repositories for release before, or at the time of, literature publication. The primary benefit of this is that scientists all over the world can access these records at any time to plan experiments, analyse published findings or support their critique. It also ensures that the author of the work receives the appropriate credit, and that this narrative context remains linked to underlying data that remain available in perpetuity. All database records submitted to the INSDC remain permanently accessible as part of the scientific record. INSDC partners do not themselves place restrictions on the redistribution or use of the data. Terms of use are available from each partner (DDBJ: http://www.ddbj.nig.ac.jp/intro-e.html#mission, ENA: http://www.ebi.ac.uk/about/terms-of-use, GenBank: http://www.ncbi.nlm.nih.gov/home/about/policies.shtml). The INSDC places responsibility for ensuring quality and accuracy firmly with the submitting authors, though the teams that maintain the databases provide a wealth of tools and support for submitters to achieve the best quality and organisation of content possible.

HIGH STANDARDS

The INSDC could not operate without the standardisation of all deposited data. The consortium's work in this area focuses on harmonising syntactical representation, supporting minimum information efforts and providing annotation style recommendations for consistency and clarity. Guidelines, data structures and systematic vocabularies developed by the INSDC include the Feature Table Definitions document (http://www.insdc.org/documents/feature-table), the INSDC country list (http://www.insdc.org/country.html) and conventions in the description of experimental support for annotated features (http://www.insdc.org/recommendations-vocabulary-insdc-experiment-qualifiers). The partners support standardisation efforts driven by the expert communities for which sequence data is an essential tool. This includes the ‘Minimum Information about any (x) Sequence’ standard (MIxS) (3), which is developed by the Genomic Standards Consortium (4), and the Minimum Contextual Data Checklist for pathogen surveillance data, which is developed by the Global Microbial Identifier (GMI) initiative. The MIxS relates to reporting on biological material provenance and experimentation procedure associated with genomes, metagenomes and marker gene sequences and has a particular importance in environmental genomics. The GMI checklist relates to instructions for genome-scale pathogen sequence submissions, enabling the global identification of microorganisms and, ultimately, detection of outbreaks and new pathogens (see http://bit.ly/mindatamatch). Using a standardised, INSDC-agreed language, submitters now report missing values for mandatory descriptors in a structured way (http://www.ebi.ac.uk/ena/about/missing-values-reporting). Such enhanced information makes it easier to interpret these cases sensibly. INSDC partners have developed submission systems that guide users through the deposition of sequences, annotations and contextual data. These systems incorporate validations to ensure that deposited data is of high quality. Adherence to agreed data standards allows INSDC partners to develop complementary data-submission tools with the same essential reporting requirements, to exchange data on a daily basis and to present the same content in different ways according to local user needs.

DOUBLING TIME

The INSDC assembled/annotated sequence dataset grew from 450 481 663 919 bases in September 2012 to 1 401 669 271 501 bases in September 2015 (see Table 1 and http://www.ebi.ac.uk/ena/about/statistics). This means that this part of the INSDC has roughly trebled in size since 2012 (a factor of about 3.1).

Table 1.

Growth in INSDC assembled/annotated sequences, 2012–2015

Year	Nucleotides in assembled/annotated sequences
September 2012	450 481 663 919
September 2013	670 004 320 378
September 2014	997 958 152 853
September 2015	1 401 669 271 501
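
The figures in Table 1 can be turned into a growth factor and an average doubling time with a few lines of arithmetic. The sketch below assumes smooth exponential growth between the first and last rows of the table; that assumption, and the function name `doubling_time_months`, are illustrative only and are not necessarily how the doubling times quoted in this article were derived.

```python
import math

# Values copied from Table 1 (nucleotides in assembled/annotated sequences).
TABLE_1 = [
    ("September 2012", 450_481_663_919),
    ("September 2013", 670_004_320_378),
    ("September 2014", 997_958_152_853),
    ("September 2015", 1_401_669_271_501),
]


def doubling_time_months(n_start: int, n_end: int, months_elapsed: float) -> float:
    """Average doubling time, assuming exponential growth over the interval.

    If N(t) = N0 * 2 ** (t / T), then T = t * ln(2) / ln(N(t) / N0).
    """
    if n_start <= 0 or n_end <= n_start or months_elapsed <= 0:
        raise ValueError("expected positive, strictly growing values")
    return months_elapsed * math.log(2) / math.log(n_end / n_start)


first_label, first_count = TABLE_1[0]
last_label, last_count = TABLE_1[-1]

# The rows are one year apart, so the first and last rows span 36 months.
months_spanned = 12 * (len(TABLE_1) - 1)

growth_factor = last_count / first_count
average_doubling = doubling_time_months(first_count, last_count, months_spanned)

print(f"{first_label} to {last_label}: grew by a factor of {growth_factor:.2f}")
print(f"Average doubling time over the period: {average_doubling:.1f} months")
```

Because this is an average over three years, it smooths over any year-to-year changes in growth rate.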

After several years of aggressive data growth, the doubling time of read data in the public archives is increasing. In other words, it is taking longer for the volume of raw data to double. The doubling time for raw data in October 2015 was 20.6 months. While this rate is slower than in previous years, it remains rapid and challenging in terms of network and data-storage management. INSDC partners continue to exploit reference-based compression models (5). CRAM (http://www.ebi.ac.uk/ena/about/compression-policy) and cSRA (https://github.com/ncbi/sra-tools/blob/master/README.md) are supported in sequence databases at EMBL-EBI and NCBI, respectively. In addition, INSDC partners work with data provider communities to control the size of submitted datasets through improved data structuring and organisation. In contrast, we have witnessed a sharp drop in doubling time for assembled/annotated sequence since 2013, representing an increase in growth rates for this section of INSDC. Figure 1 shows these rapidly increasing rates of growth in this area. We ascribe much of this effect to the adoption of whole-genome sequencing by the pathogen genomics community for surveillance, identification, typing and drug-resistance profiling, as well as for more exploratory research approaches, all of which depend on data sharing and rapid access.
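
To illustrate the general idea behind reference-based compression, the toy sketch below stores an aligned read as its start position on a reference plus the bases that differ from it. It is a simplified illustration only: it assumes gap-free alignments, ignores quality scores, and does not reflect the actual CRAM or cSRA formats. The names `encode_read` and `decode_read` are hypothetical.

```python
from typing import List, Tuple

# Toy encoding: (start on reference, read length, [(offset in read, differing base)]).
EncodedRead = Tuple[int, int, List[Tuple[int, str]]]


def encode_read(reference: str, read: str, start: int) -> EncodedRead:
    """Encode an already aligned, gap-free read as its differences from the reference."""
    if start < 0:
        raise ValueError("start must be non-negative")
    segment = reference[start:start + len(read)]
    if len(segment) != len(read):
        raise ValueError("read extends past the end of the reference")
    mismatches = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(segment, read))
        if ref_base != base
    ]
    return (start, len(read), mismatches)


def decode_read(reference: str, encoded: EncodedRead) -> str:
    """Rebuild the original read from the reference and the stored differences."""
    start, length, mismatches = encoded
    bases = list(reference[start:start + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTAC"
read = "GTTCG"  # aligns at position 2 and differs from the reference at one base

encoded = encode_read(reference, read, start=2)
restored = decode_read(reference, encoded)
print(encoded, restored)
```

A read that matches the reference exactly reduces to a position and a length, which is where the space saving comes from when most reads closely resemble the reference.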

Figure 1.

Cumulative growth in INSDC. (A) Base pairs (black, 2365.5 trillion) and sequence reads (blue, 17.8 trillion) for INSDC raw data. (B) Base pairs (black, 1449 billion) and sequences (blue, 651.5 million) in INSDC assembled/annotated data. Between January 2014 and October 2015, over 35 000 assembled prokaryotic and eukaryotic genomes were submitted to the INSDC databases.

COLLABORATION

INSDC partners work in close collaboration with one another and with countless life-science communities throughout the world. The annual meetings of the INSDC address issues spanning day-to-day operations, specific details of the Feature Table Definitions document and questions of policy and strategy—notably in managing the rapidly growing volumes of sequence data that must be archived.

REFERENCES

1. The International Nucleotide Sequence Database Collaboration.

Authors: Yasukazu Nakamura; Guy Cochrane; Ilene Karsch-Mizrachi
Journal: Nucleic Acids Res Date: 2012-11-24 Impact factor: 16.971

2. The sequence read archive.

Authors: Rasko Leinonen; Hideaki Sugawara; Martin Shumway
Journal: Nucleic Acids Res Date: 2010-11-09 Impact factor: 16.971

3. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications.

Authors: Pelin Yilmaz; Renzo Kottmann; Dawn Field; Rob Knight; James R Cole; Linda Amaral-Zettler; Jack A Gilbert; Ilene Karsch-Mizrachi; Anjanette Johnston; Guy Cochrane; Robert Vaughan; Christopher Hunter; Joonhong Park; Norman Morrison; Philippe Rocca-Serra; Peter Sterk; Manimozhiyan Arumugam; Mark Bailey; Laura Baumgartner; Bruce W Birren; Martin J Blaser; Vivien Bonazzi; Tim Booth; Peer Bork; Frederic D Bushman; Pier Luigi Buttigieg; Patrick S G Chain; Emily Charlson; Elizabeth K Costello; Heather Huot-Creasy; Peter Dawyndt; Todd DeSantis; Noah Fierer; Jed A Fuhrman; Rachel E Gallery; Dirk Gevers; Richard A Gibbs; Inigo San Gil; Antonio Gonzalez; Jeffrey I Gordon; Robert Guralnick; Wolfgang Hankeln; Sarah Highlander; Philip Hugenholtz; Janet Jansson; Andrew L Kau; Scott T Kelley; Jerry Kennedy; Dan Knights; Omry Koren; Justin Kuczynski; Nikos Kyrpides; Robert Larsen; Christian L Lauber; Teresa Legg; Ruth E Ley; Catherine A Lozupone; Wolfgang Ludwig; Donna Lyons; Eamonn Maguire; Barbara A Methé; Folker Meyer; Brian Muegge; Sara Nakielny; Karen E Nelson; Diana Nemergut; Josh D Neufeld; Lindsay K Newbold; Anna E Oliver; Norman R Pace; Giriprakash Palanisamy; Jörg Peplies; Joseph Petrosino; Lita Proctor; Elmar Pruesse; Christian Quast; Jeroen Raes; Sujeevan Ratnasingham; Jacques Ravel; David A Relman; Susanna-Assunta Sansone; Patrick D Schloss; Lynn Schriml; Rohini Sinha; Michelle I Smith; Erica Sodergren; Aymé Spo; Jesse Stombaugh; James M Tiedje; Doyle V Ward; George M Weinstock; Doug Wendel; Owen White; Andrew Whiteley; Andreas Wilke; Jennifer R Wortman; Tanya Yatsunenko; Frank Oliver Glöckner
Journal: Nat Biotechnol Date: 2011-05 Impact factor: 54.908

4. The genomic standards consortium: bringing standards to life for microbial ecology.

Authors: Pelin Yilmaz; Jack A Gilbert; Rob Knight; Linda Amaral-Zettler; Ilene Karsch-Mizrachi; Guy Cochrane; Yasukazu Nakamura; Susanna-Assunta Sansone; Frank Oliver Glöckner; Dawn Field
Journal: ISME J Date: 2011-04-07 Impact factor: 10.302

5. Efficient storage of high throughput DNA sequencing data using reference-based compression.

Authors: Markus Hsi-Yang Fritz; Rasko Leinonen; Guy Cochrane; Ewan Birney
Journal: Genome Res Date: 2011-01-18 Impact factor: 9.043
