The international nucleotide sequence database collaboration.

Ilene Karsch-Mizrachi¹, Toshihisa Takagi², Guy Cochrane³.

Abstract

For more than 30 years, the International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org/) has been committed to capturing, preserving and providing access to comprehensive public domain nucleotide sequence data and associated metadata, enabling discovery in biomedicine, biodiversity and biological sciences. Since 1987, the DNA Data Bank of Japan (DDBJ) at the National Institute of Genetics in Mishima, Japan; the European Nucleotide Archive (ENA) at the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) in Hinxton, UK; and GenBank at the National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health in Bethesda, Maryland, USA have worked collaboratively to enable access to nucleotide sequence data in standardized formats for the worldwide scientific community. In this article, we reiterate the principles of the INSDC collaboration and briefly summarize the trends of the archival content. Published by Oxford University Press on behalf of Nucleic Acids Research 2017.

Year: 2018 PMID: 29190397 PMCID: PMC5753279 DOI: 10.1093/nar/gkx1097

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The International Nucleotide Sequence Database Collaboration (1) (INSDC: http://www.insdc.org) represents one of the most celebrated global initiatives in public domain data sharing. The collaboration consists of three nodes: DNA Data Bank of Japan (DDBJ: http://www.ddbj.nig.ac.jp/) in Mishima, Japan (2); European Nucleotide Archive (ENA: http://www.ebi.ac.uk/ena/) in Hinxton, UK (3) and GenBank (https://www.ncbi.nlm.nih.gov/genbank/) in Bethesda, Maryland, USA (4). The INSDC members work together to ensure that all public domain nucleotide sequence data deposited in the archives are preserved as part of the scientific record and are accessible in standardized formats across the three sites through daily data exchange. The INSDC archives also work together to respond to emerging sequencing technologies.

The scope of data in INSDC includes raw sequence reads and alignments in the read archives (SRA), and assembled sequences with functional annotation in the traditional archives. Structured metadata describing the biological sample, including taxonomic information, experimental design and project scope, are submitted along with the sequences to provide context. The INSDC works in concert with appropriate standards communities, such as the Genomic Standards Consortium for environmental microbiology data (5) and the Global Microbial Identifier for pathogen data (http://www.globalmicrobialidentifier.org/), to ensure rich metadata capture for understanding the origin of the sequences.

Each center provides tools to facilitate the deposition of data and associated metadata, as well as gateways for the analysis and retrieval of deposited data. Routine data exchange through standardized formats provides global synchrony across the collaboration to facilitate the study of living things through sequence analysis. These long-held tenets of INSDC are a model for the FAIR Data Principles (6), which promote making published data Findable, Accessible, Interoperable and Reusable.

COLLABORATION

Members of the INSDC meet annually to discuss issues related to building and maintaining the sequence archives. The database standards and policies that result from these meetings are presented on the INSDC website (http://www.insdc.org/). The Feature Table Definitions Document (http://www.insdc.org/documents/feature-table) describes feature keys and qualifiers presented in the Flat File report format in the traditional archives. Many of the feature key qualifiers use controlled vocabularies (http://www.insdc.org/insdc-controlled-vocabularies). In addition, documents are provided that describe policies for acceptance of certain data types such as genome assemblies and Third Party (TPA), and best practices for data deposited to the public archives (http://www.insdc.org/documents).

Each center provides its user community with tools for the submission of nucleotide sequence data. Submission systems at all three sites are being improved to make submitting data easier through templated web wizards that guide the submitter to provide rich contextual information along with the sequences and annotation. Validations within the wizards ensure that minimal requirements have been met and that the data are syntactically and semantically valid. A submitter deposits data at one site and, through a coordinated exchange, the data are presented at all three sites.

Each center also provides its user community with tools for the retrieval and analysis of the sequence data. Though each center has its own tools, the data presented at each site are the same owing to the nightly exchange of data. Sequences are accessioned across a single namespace such that an accession search yields the same data content regardless of where the data are accessed.

POLICY

INSDC data are provided openly and free of charge to users. Data presented in the archive can be retrieved and incorporated in subsequent studies, which may lead to important scientific discoveries. Citing the INSDC accession numbers associated with each sequence ensures that the original data submitter is properly credited in accordance with FAIR data sharing principles. Submitters to the database may request that their sequence records be made publicly available immediately following submission. Alternatively, a submission may be kept confidential prior to publication, but data are released publicly as soon as the work is presented in a publication.

To comply with the consent agreements of human donors who have provided material for sequencing, authorization may be required for access to these data. INSDC archives do not manage these records; rather, each partner's institute works under its respective legislative system with the appropriate ethical bodies and committees to implement appropriate levels of security in its data archive (JGA at DDBJ (2); EGA at EMBL-EBI (7) and dbGaP at NCBI (8)).

INSDC databases are data hosts and not owners; data ownership, and hence editorial control of the scientific content, remains with the original data provider. However, during submission processing, database staff may make minor modifications to submitted data in an effort to provide standardized, validated records to the users. Furthermore, only data owners and their approved delegates are permitted to update their records. To ensure consistency, updates to data must be performed at the INSDC node where the data were initially submitted. The updated records are then propagated to the partner nodes.

As a requirement for publication in most journals, any new sequence described in the article should be submitted to INSDC and the accession numbers assigned to the data should be cited in the article. In the past, the INSDC worked with journal editors to establish this policy so that a reader will have access to the underlying data described in the paper. In 2016, this principle of data sharing and data citation was reaffirmed by the International Advisory Committee for INSDC in a letter to the scientific community (9,10).

CONTENT IN 2017

Since our previous report on the status of the International Nucleotide Sequence Database Collaboration (1), the assembled/annotated portion of the sequence data maintained by the INSDC grew from 1.432 trillion bases in August 2015 to 2.650 trillion bases in August 2017. Over this two-year period the archive therefore grew to 185% of its earlier size, just short of a doubling. During the same time period, the read archive grew by 233% with the addition of 3000 trillion bases. The space required to store a single copy of the reads increased from 1.5 to 3.2 Petabytes, reaching roughly 210% of its earlier size. Storage efficiency increased due to storing a greater fraction of submitted data in aligned and compressed format. The cumulative growth in the number of sequenced bases and the number of sequence records in the assembled/annotated portion of the archive over the last decade is detailed in Figure 1A. The doubling time for the number of bases during this ten-year period is 28.4 months, while the doubling time for the number of records is 37.9 months. Figure 1B depicts the growth of the read archive both in bases and in storage space.
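The growth figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes constant exponential growth between the two reported time points, which the article does not state. It shows that the 185% figure corresponds to the ratio of the 2017 total to the 2015 total, and derives the doubling time implied by that two-year interval alone.

```python
import math


def growth_ratio(start, end):
    """Return end / start, the factor by which a quantity grew."""
    return end / start


def doubling_time_months(start, end, months):
    """Doubling time implied by constant exponential growth
    from `start` to `end` over `months` months (an assumption)."""
    return months * math.log(2) / math.log(end / start)


# Assembled/annotated bases in trillions, August 2015 and August 2017,
# as reported in the text.
bases_2015 = 1.432
bases_2017 = 2.650

ratio = growth_ratio(bases_2015, bases_2017)
two_year_doubling = doubling_time_months(bases_2015, bases_2017, 24)

print(f"2017 total as a fraction of 2015 total: {ratio:.3f}")
print(f"implied doubling time: {two_year_doubling:.1f} months")
```

The two-year interval implies a doubling time of about 27 months. This sits close to, but is not the same quantity as, the 28.4-month figure in the text, which is computed over the full ten-year period.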

Figure 1.

(A) Cumulative 10-Year INSDC Growth of Assembled/Annotated Data: Sequence bases (solid) and sequence records (dashed). (B) Cumulative 10-Year INSDC Growth of SRA Data: Sequence bases (solid) and single-copy data storage (dashed).

TAXONOMY

The NCBI Taxonomy database (http://www.ncbi.nlm.nih.gov/taxonomy) provides a central organizing hub for many INSDC resources and is curated by taxonomy specialists at NCBI (11). All partners in the INSDC send a consultation request to these curators whenever a sequence is submitted with an organism name that is not present in the taxonomy database. This practice originated from a 1997 agreement by the INSDC members to resolve taxonomic issues prior to the release of new sequence data. The total number of taxa on 1 January 2017 was 512 941. The yearly increase of formal taxonomic names used by INSDC is indicated in Figure 2. The viruses and non-metazoan eukaryotes (4194 and 8940 names respectively on 1 January 2017) are not shown. The increase is steady in all cases, and the ratio between the different groups is also very stable.

Figure 2.

Cumulative 10-Year INSDC Growth of Formal Taxonomic Names (all ranks). Names are broken down into Fungi, metazoan eukaryotes (Metazoa), green plants (Viridiplantae) and prokaryotes (Archaea combined with Bacteria). For simplicity, non-metazoan eukaryotic groups and viruses are excluded.

In addition to cataloguing taxonomic names, the database also keeps track of voucher identifiers assigned by museums, herbariums and other collections that are declared as type material. This is usually a physical specimen or culture that was assigned during the formal process of describing new species names under rules defined by the various codes of biological nomenclature (12). These vouchers have a special status and are crucial for any comparative work done to determine species identity. Genome sequences derived from type material are currently used to correct the taxonomic assignments of prokaryotic genomes at NCBI. A new INSDC qualifier (/type_material) was introduced and will be added to INSDC records using information from the NCBI Taxonomy database. A list of accepted terms that describes the classes of type material is available from http://www.insdc.org/controlled-vocabulary-typematerial-qualifer
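As an illustration of how the /type_material qualifier might be read from a flat-file record, the following is a minimal sketch. The function name, the sample record and the organism name are hypothetical, and the qualifier value shown is an assumed example of the controlled vocabulary rather than a term quoted from the article. The sketch only handles qualifiers that fit on a single line; real flat files can wrap long values across lines.

```python
import re

# One qualifier per line, e.g.  /type_material="type strain of ..."
# An optional leading "FT" allows for EMBL-style feature table lines.
QUALIFIER_RE = re.compile(
    r'^(?:FT)?\s*/(?P<key>[A-Za-z_]+)="(?P<value>[^"]*)"\s*$'
)


def extract_qualifier(flatfile_text, key):
    """Return all single-line values of the qualifier `key`, in order."""
    values = []
    for line in flatfile_text.splitlines():
        match = QUALIFIER_RE.match(line)
        if match and match.group("key") == key:
            values.append(match.group("value"))
    return values


# Hypothetical source feature, laid out in GenBank flat-file style.
SAMPLE_RECORD = '''\
     source          1..1200
                     /organism="Examplea hypothetica"
                     /mol_type="genomic DNA"
                     /type_material="type strain of Examplea hypothetica"
'''

type_material = extract_qualifier(SAMPLE_RECORD, "type_material")
print(type_material)
```

A production parser would more likely rely on an established flat-file library than on a regular expression; the regular expression is used here only to keep the example self-contained.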

FUTURE OUTLOOK

While a variety of high-throughput life science assay platforms are emerging into the 'big data' limelight, we expect that interest in nucleic acid sequencing will continue to grow and that sequencing will be adopted by broader user communities. With ever higher yields and increasing affordability, nucleic acid sequencing is being adopted for new uses and to supplement other biological assay types. We see growth in community and population genomics, metagenomes and whole biome microbial surveys like the Tara Oceans project (13). Such large-scale efforts not only expand the breadth and depth of scientific knowledge but also yield actionable insights valuable to medicine and to the crop and livestock industries. Sequencing is also proving its value to clinical diagnostics and microbial pathogen surveillance (14). We expect nucleic acid sequence submissions and the need for re-analysis and re-use to continue to grow across existing and new user communities.

Given the expected onward growth, INSDC partners continue their development and maintenance of scalable data submission and retrieval systems. The INSDC assures uniform and synchronized data content. Technical implementation details and software development are managed by each partner in accordance with their stakeholders and host institutions. We will continue to engage with international initiatives and the life-sciences community as they drive application of sequencing in different domains. With the increasing value of sequence data and the time-sensitive nature of pathogen surveillance, we continue our work with the Global Microbial Identifier (GMI) initiative in building a global system for rapid sharing of well-structured whole genome sequence data across bacteria, viruses and eukaryotic parasites.

The INSDC is committed to providing well-described data sets with maximum discoverability, interoperability and reusability. To this end, we work extensively with community standards groups like the Genomic Standards Consortium on integrating the MIxS standards into submission procedures, which yields rich, yet practical, checklist- or package-based metadata standards for organismal and metagenomic data sets (15). In addition, we lead the Data Standards Working Group of the GMI, which drives metadata and data standards for shared whole genome pathogen sequencing data. Finally, while we will continue to enjoy financial support from our respective institutions, organizations and regional funders, we are actively participating in the global effort to develop a sound footing for core bioinformatics resources, through the HFSPO initiative (16).

16 in total

1. Databases: Reminder to deposit DNA sequences.

Authors: Steven L Salzberg
Journal: Nature Date: 2016-05-12 Impact factor: 49.962

2. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications.

Authors: Pelin Yilmaz; Renzo Kottmann; Dawn Field; Rob Knight; James R Cole; Linda Amaral-Zettler; Jack A Gilbert; Ilene Karsch-Mizrachi; Anjanette Johnston; Guy Cochrane; Robert Vaughan; Christopher Hunter; Joonhong Park; Norman Morrison; Philippe Rocca-Serra; Peter Sterk; Manimozhiyan Arumugam; Mark Bailey; Laura Baumgartner; Bruce W Birren; Martin J Blaser; Vivien Bonazzi; Tim Booth; Peer Bork; Frederic D Bushman; Pier Luigi Buttigieg; Patrick S G Chain; Emily Charlson; Elizabeth K Costello; Heather Huot-Creasy; Peter Dawyndt; Todd DeSantis; Noah Fierer; Jed A Fuhrman; Rachel E Gallery; Dirk Gevers; Richard A Gibbs; Inigo San Gil; Antonio Gonzalez; Jeffrey I Gordon; Robert Guralnick; Wolfgang Hankeln; Sarah Highlander; Philip Hugenholtz; Janet Jansson; Andrew L Kau; Scott T Kelley; Jerry Kennedy; Dan Knights; Omry Koren; Justin Kuczynski; Nikos Kyrpides; Robert Larsen; Christian L Lauber; Teresa Legg; Ruth E Ley; Catherine A Lozupone; Wolfgang Ludwig; Donna Lyons; Eamonn Maguire; Barbara A Methé; Folker Meyer; Brian Muegge; Sara Nakielny; Karen E Nelson; Diana Nemergut; Josh D Neufeld; Lindsay K Newbold; Anna E Oliver; Norman R Pace; Giriprakash Palanisamy; Jörg Peplies; Joseph Petrosino; Lita Proctor; Elmar Pruesse; Christian Quast; Jeroen Raes; Sujeevan Ratnasingham; Jacques Ravel; David A Relman; Susanna Assunta-Sansone; Patrick D Schloss; Lynn Schriml; Rohini Sinha; Michelle I Smith; Erica Sodergren; Aymé Spo; Jesse Stombaugh; James M Tiedje; Doyle V Ward; George M Weinstock; Doug Wendel; Owen White; Andrew Whiteley; Andreas Wilke; Jennifer R Wortman; Tanya Yatsunenko; Frank Oliver Glöckner
Journal: Nat Biotechnol Date: 2011-05 Impact factor: 54.908

3. The European Genome-phenome Archive of human data consented for biomedical research.

Authors: Ilkka Lappalainen; Jeff Almeida-King; Vasudev Kumanduri; Alexander Senf; John Dylan Spalding; Saif Ur-Rehman; Gary Saunders; Jag Kandasamy; Mario Caccamo; Rasko Leinonen; Brendan Vaughan; Thomas Laurent; Francis Rowland; Pablo Marin-Garcia; Jonathan Barker; Petteri Jokinen; Angel Carreño Torres; Jordi Rambla de Argila; Oscar Martinez Llobet; Ignacio Medina; Marc Sitges Puy; Mario Alberich; Sabela de la Torre; Arcadi Navarro; Justin Paschall; Paul Flicek
Journal: Nat Genet Date: 2015-07 Impact factor: 38.330

4. A holistic approach to marine eco-systems biology.

Authors: Eric Karsenti; Silvia G Acinas; Peer Bork; Chris Bowler; Colomban De Vargas; Jeroen Raes; Matthew Sullivan; Detlev Arendt; Francesca Benzoni; Jean-Michel Claverie; Mick Follows; Gaby Gorsky; Pascal Hingamp; Daniele Iudicone; Olivier Jaillon; Stefanie Kandels-Lewis; Uros Krzic; Fabrice Not; Hiroyuki Ogata; Stéphane Pesant; Emmanuel Georges Reynaud; Christian Sardet; Michael E Sieracki; Sabrina Speich; Didier Velayoudon; Jean Weissenbach; Patrick Wincker
Journal: PLoS Biol Date: 2011-10-18 Impact factor: 8.029

5. The NCBI Taxonomy database.

Authors: Scott Federhen
Journal: Nucleic Acids Res Date: 2011-12-01 Impact factor: 16.971

6. Type material in the NCBI Taxonomy Database.

Authors: Scott Federhen
Journal: Nucleic Acids Res Date: 2014-11-14 Impact factor: 19.160

7. The International Nucleotide Sequence Database Collaboration.

Authors: Guy Cochrane; Ilene Karsch-Mizrachi; Toshihisa Takagi
Journal: Nucleic Acids Res Date: 2015-12-10 Impact factor: 16.971

8. GenBank.

Authors: Dennis A Benson; Mark Cavanaugh; Karen Clark; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; Eric W Sayers
Journal: Nucleic Acids Res Date: 2016-11-28 Impact factor: 16.971

9. NCBI's Database of Genotypes and Phenotypes: dbGaP.

Authors: Kimberly A Tryka; Luning Hao; Anne Sturcke; Yumi Jin; Zhen Y Wang; Lora Ziyabari; Moira Lee; Natalia Popova; Nataliya Sharopova; Masato Kimura; Michael Feolo
Journal: Nucleic Acids Res Date: 2013-12-01 Impact factor: 16.971

10. The FAIR Guiding Principles for scientific data management and stewardship.

Authors: Mark D Wilkinson; Michel Dumontier; I Jsbrand Jan Aalbersberg; Gabrielle Appleton; Myles Axton; Arie Baak; Niklas Blomberg; Jan-Willem Boiten; Luiz Bonino da Silva Santos; Philip E Bourne; Jildau Bouwman; Anthony J Brookes; Tim Clark; Mercè Crosas; Ingrid Dillo; Olivier Dumon; Scott Edmunds; Chris T Evelo; Richard Finkers; Alejandra Gonzalez-Beltran; Alasdair J G Gray; Paul Groth; Carole Goble; Jeffrey S Grethe; Jaap Heringa; Peter A C 't Hoen; Rob Hooft; Tobias Kuhn; Ruben Kok; Joost Kok; Scott J Lusher; Maryann E Martone; Albert Mons; Abel L Packer; Bengt Persson; Philippe Rocca-Serra; Marco Roos; Rene van Schaik; Susanna-Assunta Sansone; Erik Schultes; Thierry Sengstag; Ted Slater; George Strawn; Morris A Swertz; Mark Thompson; Johan van der Lei; Erik van Mulligen; Jan Velterop; Andra Waagmeester; Peter Wittenburg; Katherine Wolstencroft; Jun Zhao; Barend Mons
Journal: Sci Data Date: 2016-03-15 Impact factor: 6.444
