
Toward richer metadata for microbial sequences: replacing strain-level NCBI taxonomy taxids with BioProject, BioSample and Assembly records.

Scott Federhen¹, Karen Clark¹, Tanya Barrett¹, Helen Parkinson², James Ostell¹, Yuichi Kodama³, Jun Mashima³, Yasukazu Nakamura³, Guy Cochrane², Ilene Karsch-Mizrachi¹.

Abstract

Microbial genome sequence submissions to the International Nucleotide Sequence Database Collaboration (INSDC) have been annotated with organism names that include the strain identifier. Each of these strain-level names has been assigned a unique 'taxid' in the NCBI Taxonomy Database. With the significant growth in genome sequencing, it is not possible to continue with the curation of strain-level taxids. In January 2014, NCBI will cease assigning strain-level taxids. Instead, submitters are encouraged to provide strain information and rich metadata with their submission to the sequence database, BioProject and BioSample.


Year: 2014 PMID: 25197497 PMCID: PMC4149001 DOI: 10.4056/sigs.4851102

Source DB: PubMed Journal: Stand Genomic Sci ISSN: 1944-3277

Toward richer metadata for microbial sequences

The NCBI taxonomy database provides the organism nomenclature and classification that is used in sequence entries by the International Nucleotide Sequence Database Collaboration (INSDC [1]; comprising GenBank, ENA and the DDBJ) [2]. The NCBI Taxonomy Group is responsible for curating names for taxa that are regulated by the relevant codes of nomenclature [3-5], for providing informal names for specimens that are not identified with Linnaean species binomials, and for maintaining the ‘taxid’ namespace. This is a labor-intensive and largely manual effort undertaken by a small group of diligent and dedicated taxonomists at the NCBI [6]. It has been almost twenty years since the first bacterial genomes started to appear in the sequence databases, beginning with Haemophilus influenzae in 1995, followed within a year by . In those days each new genome sequence was of significant scientific interest and represented a considerable technical achievement. At that time, for the convenience of those at INSDC institutes and their users, the taxonomy group started assigning strain-level taxids for prokaryotes with complete genome sequences, e.g.: “Haemophilus influenzae Rd” [7] and “Escherichia coli K12” [8]. (That genome is currently indexed as “Escherichia coli str. K-12 substr. MG1655”, since there are now many genomes sequenced from ‘strain’ K-12.) Since that time, the policy of assigning strain-level taxids for genome sequences has been extended to cover eukaryotic microbes as well (unicellular fungi, algae and protists), but it has never been applied to the multicellular eukaryotes. In particular, strain-level taxids have never been assigned for breeds of dogs, for inbred strains of mice, or for individual human genomes. Sequencing technology has undergone remarkable development over the past twenty years, and it has become increasingly cheap and easy to sequence genomes, a trend that promises to continue for the foreseeable future. 
We are already seeing the submission of hundreds of genomes at a time that are simply time points of micro-evolutionary studies in , or Saccharomyces cerevisiae. Another growing source of genome submissions is the effort to track epidemics, food-borne illnesses and hospital infection pathways. More will appear as this technology finds applications in other fields. Our recognition that the curation of strain-level taxids will not remain possible under such growth, and that alternative data resources relating to biological samples are maturing at the INSDC partner institutes, has led us to review our practices in this area. We intend to discontinue the curation of strain-level taxids for microbial genomes submitted after January 2014. Importantly, this change in practice will not be applied retrospectively; we will not remove any of the thousands of strain-level nodes that we have added in the past, and we will continue to add informal strain-specific names for genomes from specimens that have not been identified to the species level, e.g.: “Rhizobium sp. CCGE 510” and “Salpingoeca sp. ATCC 50818”. We strongly encourage submitters to annotate their genome submissions with the relevant source metadata, including strain, culture collection and isolation information as appropriate, plus the appropriate species (or subspecies) name. The Genomic Standards Consortium maintains checklists of Minimal Information about any (x) Sequence (MIxS) [9] that contain mandatory and optional descriptive metadata fields for a variety of organism types. These MIxS checklists can be included in the genome submission. Our alternative system for recording and presenting strain-level annotation will be provided by the respective BioSample databases of the INSDC partner institutes [10-12]. BioSample records provide a single accessioned unit of information relating to a sample that has been assayed using sequencing or other platforms. 
This information serves to gather together taxonomic information, informal infraspecies information (such as strain), descriptors relating to the sampling process, accession information for the physical sample itself, etc. For genome submissions, INSDC databases guide submitters through a series of logical steps in which the required information is requested and transferred. An early step is the registration of the initiative (BioProject) or an indication that the genome data are connected to an existing initiative (this registration is applied within the INSDC host institutes’ respective BioProject and study databases). Following this, users are prompted to provide rich descriptive information about the sequenced sample(s) (BioSample) or an indication that samples already registered have been sequenced. Descriptions of new samples, and updates and enhancements to existing samples, take advantage of defined checklists or ‘packages’ of attributes appropriate for the initiative. In later steps of the genome submission process, users provide sequence data and functional annotation that connect to the samples described or selected. BioSample records are one tool that can be used as an organizing and retrieval key to the genome datasets, as the strain-level taxid was in the past. BioSample accessions can be used to aggregate submitted data deposited in various archives, such as those that cover sequence (i.e. INSDC) and those that cover array-based studies (such as GEO, ArrayExpress and the DDBJ Omics Archive) [12-14]. The BioSample record will enable users to retrieve data across databases from samples with particular attributes. For instance, one may wish to retrieve submitted data for all strains isolated from a particular agricultural plant. INSDC assembly records are another powerful tool in this area, as these hold the information about a particular genome assembly and are supported with unique assembly-level identifiers. 
In these records all of the pieces of a genome are collected together in ways that are much more flexible and powerful for indexing and retrieval purposes than were strain-level taxids. For example, genomes representing independent assemblies of the same sequence data share a BioSample accession, while those representing alternative sequencing studies of the same strain may have independent BioSample accessions. The Streptococcus pneumoniae TIGR4 (taxid 170187) genome initiative is described in PRJNA76613. This record contains two genome assemblies that were built from sequence reads from a single BioSample, SAMN00103527. Two different assembly algorithms were used to create the assemblies, which are detailed in GCA_000269665 and GCA_000273445. In an era when microbial genome sequencing was not as commonplace as it is now, using a taxid as a key to retrieve the genome and associated project metadata was a reasonable approach. However, with next-generation sequencing technology, one can sequence the genomes of hundreds of closely related microbes in a few hours [15]. Therefore, data consumers are better served by the new resources described above, which enable them to retrieve sets of genomes based on common attributes or initiatives. The INSDC is prepared to stop, by January 2014, assigning strain-level taxids for microbial strains whose genomes are sequenced, and encourages users to exploit other resources that allow them to explore sequence data by initiative, specimen or genome assembly.
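The retrieval pattern described above (find samples by attribute, then collect the assemblies linked to them) can be illustrated with a minimal in-memory sketch. It does not query any INSDC service. The record classes, the attribute keys (`strain`, `isolation_source`) and every accession-like string below are hypothetical placeholders chosen for illustration, not real database fields or identifiers.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Sample:
    """Toy stand-in for a BioSample record (all values are made up)."""
    accession: str
    species: str  # species-level name; the strain lives in the attributes
    attributes: dict = field(default_factory=dict)


@dataclass
class Assembly:
    """Toy stand-in for an assembly record linked to a sample and a project."""
    accession: str
    biosample: str
    bioproject: str
    assembler: str


# Hypothetical records: SAMPLE-0001 has two independent assemblies of the
# same reads, so both assemblies share one sample accession.
SAMPLES = [
    Sample("SAMPLE-0001", "Escherichia coli", {"strain": "A", "isolation_source": "spinach"}),
    Sample("SAMPLE-0002", "Escherichia coli", {"strain": "B", "isolation_source": "spinach"}),
    Sample("SAMPLE-0003", "Escherichia coli", {"strain": "C", "isolation_source": "hospital surface"}),
]
ASSEMBLIES = [
    Assembly("ASM-0001", "SAMPLE-0001", "PROJECT-01", "assembler-x"),
    Assembly("ASM-0002", "SAMPLE-0001", "PROJECT-01", "assembler-y"),
    Assembly("ASM-0003", "SAMPLE-0002", "PROJECT-01", "assembler-x"),
    Assembly("ASM-0004", "SAMPLE-0003", "PROJECT-02", "assembler-x"),
]


def samples_with(samples, **wanted):
    """Return the samples whose attributes match every requested key/value."""
    return [
        s for s in samples
        if all(s.attributes.get(key) == value for key, value in wanted.items())
    ]


def assemblies_by_sample(assemblies):
    """Group assembly accessions by the sample accession they derive from."""
    grouped = defaultdict(list)
    for asm in assemblies:
        grouped[asm.biosample].append(asm.accession)
    return dict(grouped)


def assemblies_for(samples, assemblies, **wanted):
    """Return sorted assembly accessions for samples matching the attributes."""
    hits = {s.accession for s in samples_with(samples, **wanted)}
    return sorted(a.accession for a in assemblies if a.biosample in hits)


if __name__ == "__main__":
    print(assemblies_for(SAMPLES, ASSEMBLIES, isolation_source="spinach"))
    print(assemblies_by_sample(ASSEMBLIES))
```

In this sketch the species name stays at the species level and the strain is just another sample attribute, so a query such as "all assemblies from samples isolated from spinach" needs no strain-level taxid.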

11 in total

1. The complete genome sequence of Escherichia coli K-12.

Authors: F R Blattner; G Plunkett; C A Bloch; N T Perna; V Burland; M Riley; J Collado-Vides; J D Glasner; C K Rode; G F Mayhew; J Gregor; N W Davis; H A Kirkpatrick; M A Goeden; D J Rose; B Mau; Y Shao
Journal: Science Date: 1997-09-05 Impact factor: 47.728

2. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications.

Authors: Pelin Yilmaz; Renzo Kottmann; Dawn Field; Rob Knight; James R Cole; Linda Amaral-Zettler; Jack A Gilbert; Ilene Karsch-Mizrachi; Anjanette Johnston; Guy Cochrane; Robert Vaughan; Christopher Hunter; Joonhong Park; Norman Morrison; Philippe Rocca-Serra; Peter Sterk; Manimozhiyan Arumugam; Mark Bailey; Laura Baumgartner; Bruce W Birren; Martin J Blaser; Vivien Bonazzi; Tim Booth; Peer Bork; Frederic D Bushman; Pier Luigi Buttigieg; Patrick S G Chain; Emily Charlson; Elizabeth K Costello; Heather Huot-Creasy; Peter Dawyndt; Todd DeSantis; Noah Fierer; Jed A Fuhrman; Rachel E Gallery; Dirk Gevers; Richard A Gibbs; Inigo San Gil; Antonio Gonzalez; Jeffrey I Gordon; Robert Guralnick; Wolfgang Hankeln; Sarah Highlander; Philip Hugenholtz; Janet Jansson; Andrew L Kau; Scott T Kelley; Jerry Kennedy; Dan Knights; Omry Koren; Justin Kuczynski; Nikos Kyrpides; Robert Larsen; Christian L Lauber; Teresa Legg; Ruth E Ley; Catherine A Lozupone; Wolfgang Ludwig; Donna Lyons; Eamonn Maguire; Barbara A Methé; Folker Meyer; Brian Muegge; Sara Nakielny; Karen E Nelson; Diana Nemergut; Josh D Neufeld; Lindsay K Newbold; Anna E Oliver; Norman R Pace; Giriprakash Palanisamy; Jörg Peplies; Joseph Petrosino; Lita Proctor; Elmar Pruesse; Christian Quast; Jeroen Raes; Sujeevan Ratnasingham; Jacques Ravel; David A Relman; Susanna Assunta-Sansone; Patrick D Schloss; Lynn Schriml; Rohini Sinha; Michelle I Smith; Erica Sodergren; Aymé Spo; Jesse Stombaugh; James M Tiedje; Doyle V Ward; George M Weinstock; Doug Wendel; Owen White; Andrew Whiteley; Andreas Wilke; Jennifer R Wortman; Tanya Yatsunenko; Frank Oliver Glöckner
Journal: Nat Biotechnol Date: 2011-05 Impact factor: 54.908

3. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd.

Authors: R D Fleischmann; M D Adams; O White; R A Clayton; E F Kirkness; A R Kerlavage; C J Bult; J F Tomb; B A Dougherty; J M Merrick
Journal: Science Date: 1995-07-28 Impact factor: 47.728

4. NCBI GEO: archive for functional genomics data sets--update.

Authors: Tanya Barrett; Stephen E Wilhite; Pierre Ledoux; Carlos Evangelista; Irene F Kim; Maxim Tomashevsky; Kimberly A Marshall; Katherine H Phillippy; Patti M Sherman; Michelle Holko; Andrey Yefanov; Hyeseung Lee; Naigong Zhang; Cynthia L Robertson; Nadezhda Serova; Sean Davis; Alexandra Soboleva
Journal: Nucleic Acids Res Date: 2012-11-27 Impact factor: 16.971

5. The DNA Data Bank of Japan launches a new resource, the DDBJ Omics Archive of functional genomics experiments.

Authors: Yuichi Kodama; Jun Mashima; Eli Kaminuma; Takashi Gojobori; Osamu Ogasawara; Toshihisa Takagi; Kousaku Okubo; Yasukazu Nakamura
Journal: Nucleic Acids Res Date: 2011-11-22 Impact factor: 16.971

6. The NCBI Taxonomy database.

Authors: Scott Federhen
Journal: Nucleic Acids Res Date: 2011-12-01 Impact factor: 16.971

7. The BioSample Database (BioSD) at the European Bioinformatics Institute.

Authors: Mikhail Gostev; Adam Faulconbridge; Marco Brandizi; Julio Fernandez-Banet; Ugis Sarkans; Alvis Brazma; Helen Parkinson
Journal: Nucleic Acids Res Date: 2011-11-16 Impact factor: 16.971

8. BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata.

Authors: Tanya Barrett; Karen Clark; Robert Gevorgyan; Vyacheslav Gorelenkov; Eugene Gribov; Ilene Karsch-Mizrachi; Michael Kimelman; Kim D Pruitt; Sergei Resenchuk; Tatiana Tatusova; Eugene Yaschenko; James Ostell
Journal: Nucleic Acids Res Date: 2011-12-01 Impact factor: 16.971

9. The International Nucleotide Sequence Database Collaboration.

Authors: Yasukazu Nakamura; Guy Cochrane; Ilene Karsch-Mizrachi
Journal: Nucleic Acids Res Date: 2012-11-24 Impact factor: 16.971

10. ArrayExpress update--trends in database growth and links to data analysis tools.

Authors: Gabriella Rustici; Nikolay Kolesnikov; Marco Brandizi; Tony Burdett; Miroslaw Dylag; Ibrahim Emam; Anna Farne; Emma Hastings; Jon Ison; Maria Keays; Natalja Kurbatova; James Malone; Roby Mani; Annalisa Mupo; Rui Pedro Pereira; Ekaterina Pilicheva; Johan Rung; Anjan Sharma; Y Amy Tang; Tobias Ternent; Andrew Tikhonov; Danielle Welter; Eleanor Williams; Alvis Brazma; Helen Parkinson; Ugis Sarkans
Journal: Nucleic Acids Res Date: 2012-11-27 Impact factor: 16.971
