
Metagenomic ventures into outer sequence space.

Bas E Dutilh.

Abstract

Sequencing DNA or RNA directly from the environment often yields many sequencing reads that have no homologs in the database. These are referred to as "unknowns," and reflect the vast unexplored microbial sequence space of our biosphere, also known as "biological dark matter." However, unknowns also exist because metagenomic datasets are not optimally mined. Researchers are under pressure to publish and move on, so the unknown sequences are often left as they are, and conclusions are drawn from reads with annotated homologs. This can cause abundant and widespread genomes to be overlooked, such as the recently discovered human gut bacteriophage crAssphage. The unknowns may be enriched for bacteriophage sequences, the most abundant and genetically diverse component of the biosphere and of sequence space. However, the actual size of biological sequence space remains an open question. The de novo assembly of shotgun metagenomes is the most powerful tool to address this question.


Keywords:  biological dark matter; crAssphage; human gut; human virome; metagenome assembly; metagenomics; unknowns

Year:  2014        PMID: 26458273      PMCID: PMC4588555          DOI: 10.4161/21597081.2014.979664

Source DB:  PubMed          Journal:  Bacteriophage        ISSN: 2159-7073


Metagenomics is the untargeted sequencing of genetic material isolated from communities of micro-organisms and viruses. These communities may be derived from bioreactors or from environmental, clinical, or industrial samples; in short, from anywhere in our unsterile biosphere. The classical questions in metagenomics that are asked about the sampled microbial community are “Who is there?” and “What are they doing?” Originally an approach to answer these classical questions, metagenomics as a field has made great progress in the past decade. Applications include the use of metagenomics for the discovery of novel genetic functionality, for describing microbial ecosystems and tracking their variation, in untargeted medical diagnostics and forensics, and as a powerful tool to determine the genome sequences of rare, uncultivable microbes. Powered by advances in next-generation sequencing technology, metagenomics has the potential to venture beyond the limits of currently explored sequence space by sampling environmental microbes and viruses at an unprecedented scale and resolution.

Quite literally, sequence space is defined as the multi-dimensional space of all possible nucleotide (or protein) sequences. For sequences of length n, sequence space has n dimensions, one per residue, each of which can take one of 4 states (or 20, for proteins). Its total volume is Σ 4^n sequences (or Σ 20^n for proteins) when summed over all possible sequence lengths n. Evolution may have largely explored this space, but it remains an open question how large the current biological sequence space is, i.e., the fraction occupied by extant life. Figuratively, and within the context of this paper, “outer sequence space” is the remainder of this biological sequence space waiting to be explored by science.

Metagenomics has traditionally addressed the two classical questions listed above by aligning the sequencing reads in metagenomic data sets to a reference database containing known, annotated sequences.
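To give a sense of scale for the definition of sequence space above, the following minimal Python sketch sums 4^n (or 20^n) over sequence lengths. Because the sum over all lengths is unbounded, the sketch assumes an arbitrary maximum length; the function names and the `max_len` cutoff are illustrative choices and do not come from the article.

```python
def sequence_space_size(max_len, alphabet_size=4):
    """Count all distinct sequences of length 1..max_len.

    alphabet_size is 4 for nucleotides and 20 for proteins.
    max_len is an assumed cutoff: without it the sum has no upper bound.
    """
    return sum(alphabet_size ** n for n in range(1, max_len + 1))


def sequence_space_size_closed_form(max_len, alphabet_size=4):
    """Same count via the geometric series (a^(N+1) - a) / (a - 1)."""
    a = alphabet_size
    return (a ** (max_len + 1) - a) // (a - 1)


if __name__ == "__main__":
    # Lengths chosen only for illustration.
    for length in (1, 2, 3, 10, 100):
        print(length, sequence_space_size(length))
```

Even for short sequences the count grows exponentially with length, which is why only the fraction of this space occupied by extant life is a meaningful target for sampling.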
Aligning reads to annotated references describes the taxonomic and functional diversity of the sampled microbes in terms of existing knowledge, which makes the results straightforward to interpret. However, a persistent concern in the analysis of metagenomes has been the unknown fraction, consisting of the reads that cannot be annotated by using database searches. The level of unknowns can range up to 99% of the metagenomic reads, depending on the sampled environment, the protocols used for nucleotide isolation and sequencing, the homology search algorithm, and the reference database.

Unknowns exist for four interrelated reasons. The first reason is technical. Due to limitations of some next-generation sequencing platforms and library preparation protocols, spurious sequences may be generated that do not reflect true biological molecules. These artificial sequences include artifacts due to the sequencing technology and chimeras, i.e., sequences generated from separate genetic molecules derived from different organisms. Since chimeras frequently arise during PCR amplification, they are expected to be more abundant in environmental amplicon sequencing than in shotgun metagenomics, and can be detected using bioinformatic tools.

The second reason that unknowns exist is biological, as they reflect the enormous natural diversity of microorganisms that we are only beginning to unveil with metagenomics. This is both overwhelming and exciting, highlighting how much remains to be discovered in biology. This genetic diversity has been referred to as biological “dark matter,” and is especially pronounced in viral metagenomes. This issue can only be resolved by expanding reference databases, as exemplified by recent studies of one of the most studied microbial ecosystems: the human gut. The first metagenomic snapshots of the microbiota in the human gut were taken from two healthy adults, and revealed a high inter-individual diversity and many unknowns.
To a large extent, these unknowns were resolved when a reference catalog was created based on the sequences in the gut metagenomes themselves, decreasing the percentage of unknowns from ∼85% to ∼20%. Moreover, subsequent large-scale sequencing efforts revealed that, in fact, many people share a similar intestinal flora, regardless of whether these similarities are viewed as discrete enterotypes or as gradients. These results illustrate how unknowns can be depleted by expanding the databases with appropriate reference sequences. This requires not only increased sequencing effort of phylogenetically diverse isolates or single cells, but also mining of draft genomes from metagenomes sampled from microbial environments around the globe. Thus, by mapping the global sequence space, we can provide reassurance that at least some level of sampling saturation can be achieved. For viruses, and particularly for bacteriophages, efforts to provide a denser sampling of sequence space are still lacking.

The third reason that unknowns exist is methodological. Because the advances in DNA sequencing technology have greatly outpaced improvements in computer power, bioinformatic approaches to analyze metagenomes often cut corners. For example, reference databases may be reduced to include only those references that are expected in the sample a priori. Moreover, read annotation may be limited to identifying almost exact sequence matches, as this can be computed much faster than if sequence variation needs to be taken into consideration in a permissive homology search. These issues lead to an inherent blind spot for discovering true novelty, such as sequences that are not expected in the sample, or organisms that have not been observed before. One way to at least partially resolve this issue is de novo assembly of the metagenome. Depending on the diversity of the sample, assembly can combine many short sequences (individual reads) into fewer, longer ones (assembled contigs).
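As an illustration of how assembly turns many short reads into fewer, longer contigs, here is a toy greedy overlap merger in Python. It is only a sketch under strong assumptions: the reads are error-free, come from one strand, and overlap exactly. Production metagenome assemblers have to deal with sequencing errors, repeats, and both strands, none of which is modeled here. The reads, the function names, and the `min_overlap` threshold are all made up for the example.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = overlap(a, b, min_overlap)
                if k > best_k:
                    best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads: three overlap, one is unrelated.
example_reads = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC", "TTTTCCCC"]

if __name__ == "__main__":
    print(greedy_assemble(example_reads))
```

In this toy input, four reads collapse into two longer sequences, so any downstream homology search has fewer and longer queries to process.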
Reducing the number and increasing the length of the sequences allows homology searches to be performed with more sensitive, computationally more expensive algorithms such as translated homology searches or profile searches, leading to more specific annotation and improved biological interpretation. Moreover, larger and more comprehensive reference databases can be used, allowing unexpected hits to be found.

The fourth reason that unknowns exist is logistical. Most research projects that generate metagenomic sequencing datasets deposit the read files in large repositories, provide an accession number in the associated publication, and move on. It is not unlikely that many of these data sets, consisting of files sometimes gigabytes in size, are never looked at again. Thus, while a certain sequence may have been “seen” in a metagenome and is thus strictly no longer “dark matter,” it will still not be recognized when it is observed again. Re-identification of this sequence would only be possible if the publishing researcher identified it as an interesting sequence in their (assembled) metagenome, and submitted it to a searchable database like GenBank. Because GenBank maintains very high standards for the sequences it accepts, submission can be a tedious process that is rarely worthwhile for unknown metagenomic contigs. An in-depth investigation of the unknowns is rarely within the scope of a research project, and those sequences are thus first ignored and later forgotten. This is a waste of valuable resources: time, money, and work. The metagenomes available in public databases should be better exploited and mined for common sequences. To facilitate this, it is critical that metadata annotations of the metagenomes include a detailed description of the samples and sequencing protocol. Exploiting these datasets will allow us to create more comprehensive maps of sequence space, and greatly improve our understanding and interpretation of metagenomes.
In the short term, ignoring the unknowns can facilitate the interpretation of a metagenome. Because a taxonomic or functional description cannot be provided, the classical questions in metagenomics are left unanswered for the unknown fraction of the metagenome, and concentrating on the annotated sequences leads to a more straightforward answer. However, unexpected or novel sequences are easily overlooked, even if they represent highly abundant or widespread organisms. Thus, in the long term, stockpiling the unknown sequencing reads in poorly accessible bulk sequence repositories can severely slow down research, the discovery of novel species, and the charting of biological sequence space.

One striking example of a novel genome discovered among the unknown sequences is crAssphage, a bacteriophage whose genome uniquely aligned sequencing reads from 73% of the 466 analyzed human gut metagenomes, and accounted for a total of 1.68% of those metagenomic reads. Like many bacteriophages, its genome sequence is highly divergent from everything that was present in the annotated part of the GenBank database, which is why it was not observed before. It has been suggested that the unknown fraction of metagenomes is enriched for viral sequences, because viral genomes are thought to evolve more rapidly than the genomes of cellular organisms, allowing them to explore a larger region of sequence space in the same amount of time.

To summarize, unknowns are genetic sequences that are difficult to identify using standard methods, such as by alignment to an annotated reference database. Unknowns remain a persistent elephant in the room in most metagenomics research projects, and exist for technical, biological, methodological, and logistical reasons. The most promising option to resolve the unknowns is to create improved reference databases that chart biological sequence space, including the outer realms that remain unexplored by science (also known as dark matter).
Besides sequencing reference strains or single cells, it may be expected that metagenomic sequencing, assembly, and binning will greatly add to improving these reference databases, for example by identifying common sequences in many metagenomes, and prioritizing them for targeted characterization. Characterizing unknowns will be vital to fully exploit the increasingly available metagenomic data sets from all ecosystems, toward understanding the roles of microbes and viruses in the biosphere. The actual size of biological sequence space remains an open question, but the untargeted, shotgun nature of metagenomics makes it the most powerful tool to address this question.
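The idea of identifying common sequences in many metagenomes and prioritizing them can be sketched as a simple prevalence ranking. The Python example below assumes that unknown contigs from different samples have already been grouped under shared identifiers (for instance by clustering), which is a substantial step not shown here. The sample names, contig identifiers, function name, and the `min_fraction` threshold are hypothetical.

```python
from collections import Counter


def rank_by_prevalence(samples, min_fraction=0.5):
    """Rank unknown sequences by the fraction of samples that contain them.

    samples maps a sample name to the set of unknown-sequence identifiers
    observed in that sample. Returns (identifier, fraction) pairs for
    identifiers found in at least min_fraction of the samples, most
    prevalent first, with ties broken alphabetically.
    """
    counts = Counter()
    for ids in samples.values():
        counts.update(set(ids))  # count each identifier once per sample
    total = len(samples)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(seq_id, n / total) for seq_id, n in ranked
            if n / total >= min_fraction]


# Hypothetical input: four metagenomes and the unknown contigs seen in each.
example_samples = {
    "gut_A": {"unknown_1", "unknown_2"},
    "gut_B": {"unknown_1", "unknown_3"},
    "gut_C": {"unknown_1", "unknown_2"},
    "gut_D": {"unknown_4"},
}

if __name__ == "__main__":
    for seq_id, fraction in rank_by_prevalence(example_samples):
        print(seq_id, fraction)
```

Sequences at the top of such a ranking are the ones present in the most samples, and would be natural candidates for targeted characterization.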
