
The International Nucleotide Sequence Database Collaboration.

Yasukazu Nakamura, Guy Cochrane, Ilene Karsch-Mizrachi.

Abstract

The International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org), one of the longest-standing global alliances of biological data archives, captures, preserves and provides comprehensive public domain nucleotide sequence information. The three partners of the INSDC work in cooperation to establish formats for data and metadata, and protocols that facilitate reliable data submission to their databases and support continual data exchange around the world. This article presents the current status of the INSDC and its updates for 2012. Of the items discussed at the 2012 international collaboration meeting, the BioSample database and changes in submission are described here.


Year: 2012 PMID: 23180798 PMCID: PMC3531182 DOI: 10.1093/nar/gks1084

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

For over 30 years, the International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org) has maintained the primary nucleotide sequence database. INSDC has collected nucleotide sequence data and metadata from researchers and has issued internationally authorized accession numbers for data submitters and scientific journals. The INSDC consists of three partners: the DNA Data Bank of Japan (DDBJ; http://www.ddbj.nig.ac.jp/) at the National Institute of Genetics in Mishima, Japan; the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI; http://www.ebi.ac.uk/ena) in Hinxton, UK; and the National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov/genbank/) in Bethesda, MD, USA. The INSDC has a uniform policy of free and unrestricted access to the data (1). Under this policy, the INSDC captures, preserves, provides and exchanges comprehensive nucleotide sequence and associated information on a daily basis. As new sequencing technologies have emerged and been deployed, the scope of sequencing activity has grown enormously, and INSDC has launched new services that deal with the richness of the domain, including repositories for raw data [the Trace Archives for the Sanger method and the Sequence Read Archive (SRA) for next-generation platforms] (2), assembly data, experimental design details, taxonomic information, functional annotation, project information and sample information. As the traditional data set, assembled sequences and annotations are available from DDBJ (3), the EMBL-Bank component of the European Nucleotide Archive (4) and GenBank at NCBI (5). Routine data exchange, standard formats and the sharing of technology provide global synchrony across the collaboration. In this article, we outline the current status of, and changes to, INSDC, including the creation of the BioSample databases (6,7) and some modifications that allow INSDC partners to respond to the demands of the research domain.

CONTENT IN 2012

In total, the whole INSDC data set grew ∼2-fold in terms of the number of bases in 2012. The latest DDBJ release, 90.0, contains data prepared as of August 24, 2012, from DDBJ, EMBL-Bank/EBI and GenBank, that is, the traditional part of the INSDC data set. The release consists of 156 952 755 sequence entries and 144 754 534 372 nucleotides. From August 2011 to August 2012, the traditional INSDC data set grew 1.3-fold in the number of bases and 1.2-fold in the number of entries (Figure 1), whereas SRA, which holds raw next-generation sequencing (NGS) data, grew 2.4-fold in the number of bases (Figure 2).
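The release figures above lend themselves to a quick arithmetic check. The following is a minimal sketch in Python, not part of the original article: it takes the entry and nucleotide counts quoted for release 90.0 and the rounded fold changes, and derives the mean entry length and the approximate August 2011 totals those fold changes imply. The function names are illustrative, and because the fold changes are rounded in the text, the back-calculated 2011 values are rough estimates only.

```python
# Counts quoted in the text for DDBJ release 90.0 (data as of August 24, 2012).
ENTRIES_2012 = 156_952_755
BASES_2012 = 144_754_534_372

# Year-on-year fold changes quoted in the text (rounded, so only approximate).
FOLD_BASES = 1.3
FOLD_ENTRIES = 1.2


def mean_entry_length(bases: int, entries: int) -> float:
    """Average number of nucleotides per sequence entry."""
    return bases / entries


def implied_previous(value: int, fold: float) -> float:
    """Back-calculate the earlier value implied by a fold change."""
    return value / fold


print(f"Mean entry length: {mean_entry_length(BASES_2012, ENTRIES_2012):.1f} nt")
print(f"Implied bases, Aug 2011: {implied_previous(BASES_2012, FOLD_BASES):.3e}")
print(f"Implied entries, Aug 2011: {implied_previous(ENTRIES_2012, FOLD_ENTRIES):.3e}")
```

Under these inputs the mean entry length comes out at roughly 922 nucleotides, which reflects the large share of short bulk entries in the traditional archive.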

Figure 1. Cumulative growth (a) in the number of sequence entries and (b) in the number of nucleotides included in the traditional INSDC sequence archives over time. Bulk sequence data includes non-WGS bulk submission types, that is, Expressed Sequence Tag (EST), Genome Survey Sequence (GSS), Patent and Transcriptome Shotgun Assembly (TSA). Whole Genome Shotgun (WGS) includes the number of sequence overlap contigs. Non-bulk data are the remainder.

Figure 2. Cumulative base pairs in INSDC over time since 1980, broken down into selected data components. Data volume in base pairs of assembled sequence (whole genome shotgun methods and others) and raw next-generation-sequence data, excluding the Trace Archive (raw data from capillary sequencing platforms).

COLLABORATION FOR THE YEAR 2013

Members of the INSDC meet annually to discuss practical matters in maintaining and developing the nucleotide sequence archives. Issues range from the addition of feature or qualifier elements to the feature tables in the flat file report format of the traditional archive records, to policy issues and strategies for dealing with the increasing volume of sequence data to be archived. In 2012, the annual meeting was held at NCBI, Bethesda, MD, USA, on 11–13 June. At the meeting, we discussed and came to agreement on many issues. The outcomes of the meeting are summarized later in the text.

BIOSAMPLE DATABASE

With the emergence of high-performance nucleotide sequencing devices, the same biological sample can be analysed several times in the same project or in different projects. For the convenience of data submitters, EBI (6) and NCBI (7) independently launched BioSample databases to store and provide sample information. The BioSample databases contain descriptions of biological source materials used in experimental assays. The purpose of the BioSample database is to provide unified storage of, and access to, information about biological samples, which may have assay data stored in multiple databases. In 2012, DDBJ started to prepare to join this BioSample framework; as a consequence, all INSDC members will collect and exchange sequence-related BioSample data as part of this new collaborative activity. All organism names that are represented in the sequence data of INSDC are registered in the NCBI taxonomy database (http://www.ncbi.nlm.nih.gov/taxonomy). Since 2009, the taxonomy database has considered terminating the assignment of strain-level taxonomy IDs for microorganism genomes. However, the taxonomy database agreed not to stop assigning strain-level taxonomy IDs for prokaryotic strains with sequenced genomes until at least 2013. The change to the current practice will not be made until BioSample has reached maturity and sample records representing these strains can be exchanged; hence, the change may not take place until later in 2013 or beyond.
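The idea of one sample description shared by assay data held in several databases can be pictured with a small data structure. The sketch below is purely illustrative and assumes nothing about the actual BioSample schemas at EBI, NCBI or DDBJ: the class name, field names, database labels and identifiers are all hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class BioSample:
    """Hypothetical sample record; all field names are illustrative only."""

    sample_id: str
    organism: str
    attributes: dict[str, str] = field(default_factory=dict)
    # (database label, record identifier) pairs for assay data derived from
    # this sample; both parts are placeholders, not real accessions.
    linked_records: list[tuple[str, str]] = field(default_factory=list)

    def link(self, database: str, record_id: str) -> None:
        """Attach an assay record to the sample, ignoring exact duplicates."""
        entry = (database, record_id)
        if entry not in self.linked_records:
            self.linked_records.append(entry)

    def databases(self) -> set[str]:
        """Names of all databases holding data derived from this sample."""
        return {database for database, _ in self.linked_records}


# One sample description, referenced by records in two different archives.
sample = BioSample("SAMPLE_0001", "Escherichia coli", {"strain": "example"})
sample.link("raw reads", "RUN_0001")
sample.link("raw reads", "RUN_0001")  # the same sample analysed again: no duplicate
sample.link("assembled sequence", "ENTRY_0001")
print(sorted(sample.databases()))
```

Keeping strain information as a sample attribute, as in this sketch, is one way to picture how sample records could eventually carry the strain-level detail that the taxonomy database currently encodes in taxonomy IDs.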

TERMINATION OF MASS SEQUENCE FOR GENOME ANNOTATION SUBMISSION

Since 2004, INSDC has accepted submissions of Mass sequence for Genome Annotation (MGA) as one means of supporting large-scale sequence data that provide information for the annotation of genome assemblies/sequences. However, with the popularization of new sequencing platforms, the MGA method has become out of date. Therefore, the INSDC decided to stop accepting new submissions of MGA data.

GENOME COLLECTION

Both submitters and users require INSDC to collect genome data from varied samples and with varied study goals. Especially for bulk-sequenced and re-sequenced genomes, INSDC requests that data providers submit at least one set of assembled/annotated reference genome data and, for the other genomes, submit raw reads to SRA with associated Binary Alignment/Map (BAM), Variant Call Format (VCF) and General Feature Format (GFF) files as ‘analysis’ objects; that is, without draft assemblies of Whole Genome Shotgun (WGS) and scaffold Contig/Constructed (CON) data. Even where genomes are sequenced/assembled to finished level, that is, where they could be treated as a reference genome, INSDC should not apply the ‘complete genome’ label in the KEYWORDS section to genome data without feature annotation. The INSDC encourages submitters to annotate their sequences by providing tools and help documents describing minimal standards and requirements. NCBI introduced the Assembly database (http://www.ncbi.nlm.nih.gov/assembly), which holds information about the structure of assembled genomes as represented in an AGP (A Golden Path) format file or as a collection of completely sequenced chromosomes. The INSDC members agreed to collaborate on this activity.
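To make the AGP representation mentioned above more concrete, the following is a minimal Python sketch of reading such a file. It assumes the tab-separated, nine-column layout of the AGP specification, in which component types N and U denote gaps and the sixth column then holds the gap length rather than a component identifier. The object and contig names in the example are invented, and the parser is an illustration rather than a validator or any tool used by the INSDC partners.

```python
from typing import NamedTuple, Optional

GAP_TYPES = {"N", "U"}  # component types that denote gaps in AGP


class AgpRow(NamedTuple):
    obj: str                     # assembled object, e.g. a scaffold or chromosome
    obj_beg: int                 # 1-based start of this part on the object
    obj_end: int                 # 1-based inclusive end of this part
    part_number: int
    component_type: str
    component_id: Optional[str]  # set for sequence components only
    gap_length: Optional[int]    # set for gap lines only


def parse_agp(text: str) -> list[AgpRow]:
    """Parse AGP text, skipping blank lines and '#' comment lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            raise ValueError(f"too few columns in AGP line: {line!r}")
        is_gap = cols[4] in GAP_TYPES
        rows.append(AgpRow(
            cols[0], int(cols[1]), int(cols[2]), int(cols[3]), cols[4],
            None if is_gap else cols[5],
            int(cols[5]) if is_gap else None,
        ))
    return rows


# Hypothetical three-part object: two contigs joined across a 100 bp gap.
EXAMPLE = "\n".join([
    "# invented example, not a real assembly",
    "chr_example\t1\t1000\t1\tW\tcontig_A\t1\t1000\t+",
    "chr_example\t1001\t1100\t2\tN\t100\tscaffold\tyes\tpaired-ends",
    "chr_example\t1101\t1600\t3\tW\tcontig_B\t1\t500\t+",
])

rows = parse_agp(EXAMPLE)
total_length = max(row.obj_end for row in rows)
gap_bases = sum(row.gap_length or 0 for row in rows)
print(total_length, gap_bases)
```

The sketch deliberately ignores the orientation and linkage columns; a real consumer would also check that parts are contiguous and that part numbers increase.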

FUNDING

DDBJ by the Ministry of Education, Culture, Sports, Science and Technology of Japan (MEXT); European Nucleotide Archive by the European Molecular Biology Laboratory, the Wellcome Trust, the FP7 Programme of the European Commission and the Biotechnology and Biological Sciences Research Council; NCBI by the Intramural Research Program of the National Institutes of Health; National Library of Medicine. Funding for open access charge: DDBJ management expense grant from MEXT, Japan. Conflict of interest statement. None declared.

REFERENCES

1. The International Nucleotide Sequence Database Collaboration.

Authors: Ilene Karsch-Mizrachi; Yasukazu Nakamura; Guy Cochrane
Journal: Nucleic Acids Res Date: 2011-11-12

2. The Sequence Read Archive: explosive growth of sequencing data.

Authors: Yuichi Kodama; Martin Shumway; Rasko Leinonen
Journal: Nucleic Acids Res Date: 2011-10-18

3. The DNA Data Bank of Japan launches a new resource, the DDBJ Omics Archive of functional genomics experiments.

Authors: Yuichi Kodama; Jun Mashima; Eli Kaminuma; Takashi Gojobori; Osamu Ogasawara; Toshihisa Takagi; Kousaku Okubo; Yasukazu Nakamura
Journal: Nucleic Acids Res Date: 2011-11-22

4. Major submissions tool developments at the European Nucleotide Archive.

Authors: Clara Amid; Ewan Birney; Lawrence Bower; Ana Cerdeño-Tárraga; Ying Cheng; Iain Cleland; Nadeem Faruque; Richard Gibson; Neil Goodgame; Christopher Hunter; Mikyung Jang; Rasko Leinonen; Xin Liu; Arnaud Oisel; Nima Pakseresht; Sheila Plaister; Rajesh Radhakrishnan; Kethi Reddy; Stephane Rivière; Marc Rossello; Alexander Senf; Dimitriy Smirnov; Petra Ten Hoopen; Daniel Vaughan; Robert Vaughan; Vadim Zalunin; Guy Cochrane
Journal: Nucleic Acids Res Date: 2011-11-12

5. GenBank.

Authors: Dennis A Benson; Ilene Karsch-Mizrachi; Karen Clark; David J Lipman; James Ostell; Eric W Sayers
Journal: Nucleic Acids Res Date: 2011-12-05

6. The BioSample Database (BioSD) at the European Bioinformatics Institute.

Authors: Mikhail Gostev; Adam Faulconbridge; Marco Brandizi; Julio Fernandez-Banet; Ugis Sarkans; Alvis Brazma; Helen Parkinson
Journal: Nucleic Acids Res Date: 2011-11-16

7. BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata.

Authors: Tanya Barrett; Karen Clark; Robert Gevorgyan; Vyacheslav Gorelenkov; Eugene Gribov; Ilene Karsch-Mizrachi; Michael Kimelman; Kim D Pruitt; Sergei Resenchuk; Tatiana Tatusova; Eugene Yaschenko; James Ostell
Journal: Nucleic Acids Res Date: 2011-12-01
