
Environmental genes and genomes: understanding the differences and challenges in the approaches and software for their analyses.

Marie Lisandra Zepeda Mendoza, Thomas Sicheritz-Pontén, M Thomas P Gilbert.

Abstract

DNA-based taxonomic and functional profiling is widely used for the characterization of organismal communities across a rapidly increasing array of research areas that include the role of microbiomes in health and disease, biomonitoring, and estimation of both microbial and metazoan species richness. Two principal approaches are currently used to assign taxonomy to DNA sequences: DNA metabarcoding and metagenomics. When initially developed, each of these approaches mandated their own particular methods for data analysis; however, with the development of high-throughput sequencing (HTS) techniques they have begun to share many aspects in data set generation and processing. In this review we aim to define the current characteristics, goals and boundaries of each field, and describe the different software used for their analysis. We argue that an appreciation of the potential and limitations of each method can help underscore the improvements required by each field so as to better exploit the richness of current HTS-based data sets.


Keywords: DNA metabarcoding; environment; genome; metagenomics; software development


Year: 2015 PMID: 25673291 PMCID: PMC4570204 DOI: 10.1093/bib/bbv001

Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622

Introduction

The wide range of ‘–omics’ data sets that can now be generated, thanks to rapid developments in high-throughput sequencing (HTS) technologies, have had a major impact on a particular pair of fields that work at the ‘meta’ scale—metagenomics and DNA metabarcoding. As indicated by the Greek preposition ‘meta’, the aim of these disciplines is to move beyond the identification of single species to the identification of the total biological entities within a complex sample. In this regard, a wide range of studies has attempted to biologically characterize particular environments through extraction and sequencing of DNA taken from subsamples of the environment of interest. In brief, metagenomics could be defined as the characterization of the vast number of genomes present in an environmental sample, using both a taxonomical and a functional analytical approach. DNA metabarcoding, on the other hand, principally focuses on taxonomically describing the species present within a sample. Given the increasing ease and reduced costs with which HTS data can be generated, these environments represent virtually any space from which a sample can be obtained, including Antarctic lakes [1, 2], hot springs [3-5], the human gut [6-10] or indeed any other part of the body [11] of any species [12-16]. Regardless of their origin, a fundamental characteristic of these samples is the complexity of the microbial communities that inhabit them and the difficulty this complexity poses for any downstream analyses [17]. Current interests focus around two kinds of analyses: (1) who is there?—the taxonomic identification of all the species present (bacteria, fungi, viruses, protozoa, mammals and plants) and (2) what are they doing?—the identification of the biological functions that those species present undertake within those particular environmental characteristics (e.g. high/low pH, extreme temperatures, salinity gradients, humidity, pressure, oxygen abundance, etc.) [18-21]. 
The information yielded from these two avenues can subsequently empower detailed comparisons of different microbial communities [22]. At the dawn of the fields of DNA metabarcoding and metagenomics, the respective techniques used to fulfill their goals were clearly distinguishable. DNA metabarcoding aimed to identify which species were present in a DNA extract by targeting and sequencing nucleotide barcodes, which are DNA marker sequences that have been argued to provide a unique genetic identity for each taxon of study [23-25]. Barcodes were originally detected by inspecting sequence alignments and locating a pair of conserved regions flanking a variable one. The most commonly used markers are 16S for bacteria, mt16S for mammals, CO1 for insects, ITS1 for fungi and rbcL, trnL and matK for plants, although non-conventional markers can also be used [24, 26, 27]. Metagenomics, on the other hand, is based on direct shotgun sequencing of the DNA within an extract, and thus required the implementation of HTS technologies [28, 29] for its power to be fully exploited. A fundamental difference from DNA metabarcoding is that the data generated using metagenomics provide additional genomic-scale information, thus enabling not only taxonomic identification, but also functional characterization of the environment [30] (Figure 1).

Figure 1

Environmental sample analysis framework. (A) A sample can come from any environment that contains DNA; e.g. one of the most studied environments to date is the human gut microbiome. (B) DNA is extracted from the sample and sequenced according to the intended analyses. Shotgun sequencing produces genomic reads from the species present in the sample, while targeted sequencing produces amplicons with the aim of identifying a specific group of organisms. (C) Depending on the initial aim, whether functional and taxonomic characterization or only taxonomic characterization, the appropriate data set needs to be generated to be analyzed with the appropriate software.

Although the two fields were initially distinct, recent developments in sequencing technology have rapidly diminished the difference between them, principally because DNA metabarcoding now produces data sets with the same HTS techniques used for metagenomics [29, 31–33]. While conferring many benefits, a side effect has been a degree of confusion emerging within the research community, in particular through the labeling of some metabarcoding studies with the terms ‘metagenomics’ or ‘targeted metagenomics’ [34-39], simply because HTS platforms are used either to decrease the cost of amplicon sequencing [40, 41] or to generate barcode markers in a polymerase chain reaction (PCR)-free manner [32, 42–45]. This labeling is incorrect because the focus of the resulting analyses remains on barcode loci, and not on the genome as a whole. In this regard, a distinction should be made between ‘metagenomics’ and ‘DNA metabarcoding’ as research fields, and ‘metagenomic sequencing’ as a laboratory technique. Strictly speaking, shotgun HTS of an environmental sample is the sequencing of the metagenome of the sample, regardless of the research field and the computational approach used to analyze the data set. 
For this reason we prefer to name metabarcoding studies that use metagenomics sequencing data as ‘PCR-free single/multiple loci metabarcoding’. Given this spreading confusion, and that we can expect an increasing number of researchers to favor PCR-free shotgun approaches [29], thanks to the continual reductions in price per base pair of HTS, we advocate that it is timely to re-examine the pros and cons of the different approaches, and re-state the specific goals of each kind of study (Table 1).

Table 1

Methods comparison

Type of study and aimed characterization	Metagenomics: taxonomic and functional	Metabarcoding: taxonomic	Metabarcoding: taxonomic	Metabarcoding: taxonomic	Metabarcoding: taxonomic
Laboratory method	Shotgun sequencing	Shotgun sequencing	Shotgun sequencing	PCR based	PCR based
Target region	Genome-wide	Multi-loci	Single locus [32, 33]	Customized barcodes	Conventional barcodes, including 16S, COI, etc.
DNA quantity	Care should be taken for samples coming from a body part of a macro organism so that the shotgun sequencing is not mostly host DNA	The percentage of marker genes in shotgun data sets is small [46, 47]	Only a small fraction of the reads come from a specific marker gene	Lots of customized targeted genes can be obtained	Lots of amplicons from universally targeted genes can be obtained
Reference database	Databases of the entire genomes can be customized	The source of the reads is largely unknown and difficult to characterize with the currently existing databases, thus many reads will not be assigned a taxonomy [48]	Single marker genes can be extracted from the data set using a reference database	There are good databases for standard barcodes, however if another region is targeted there are few and mostly not curated reported sequences.	There are several large 16S and COI databases, some of them are well curated, such as Greengenes
Laboratory bias	May present library build biases due to e.g. genomic nucleotide composition	May present library build biases	May present library build biases	May present primer bias if primers target wide taxonomic distributions	May present primer bias if using ‘universal’ primers for marker gene
Taxonomic resolution	The identification of multiple loci (marker or not) can even recover almost entire genomes of species	The phylogenies of more than one gene can provide a better consensus of the species present in the sample	It can provide good taxonomic resolution up to the species level. The taxonomic accuracy increases [33]	Sequences other than marker genes may not provide satisfactory taxonomic resolution because one sequence can be assigned to more than one species	The completeness of the well-characterized marker gene databases can provide good taxonomic resolution up to the species level
Cost	Deals with various challenges due to the complexity of the mixture of DNA in the sample	It may be unattainable due to the computational requirements	The ratio of used and discarded sequences that do not come from the single mined marker gene is cost inefficient	Low cost when generated on HTS platforms	Generally low cost—especially when generated on HTS platforms

Note. Comparison of the advantages and disadvantages of various methods that are used to achieve the goals of the DNA metabarcoding and metagenomic fields.

Having discussed the differences of each laboratory method to study species diversity and biological functions in an environment, it is now important to describe the characteristics of the computational methods used to analyze the data sets produced by these different approaches. Through a reminder of the technical definitions of the goals and computational methods used in metagenomics and metabarcoding, and by stating the similarities and differences of the current approaches used to analyze data sets of environmental samples, it will be possible to attain a sound understanding of these two fields. In turn, this will enable oriented software development efforts that specifically target the key questions that researchers wish to ask. Furthermore, it will facilitate the development of studies that integrate the different types of data sets, while being aware of their differences, and exploiting the full information they can provide [49-52]. While the recently produced metagenomic data sets have provided a wealth of new insights into the intricacies of microbial communities, their descriptive power remains remarkably limited by the now widely acknowledged fact that culture-based descriptions of microbial diversity underestimate the true levels of biodiversity by orders of magnitude [53, 54]. Specifically, a major problem with many current metagenomics and metabarcoding studies is that their taxonomic identification processes rely heavily on the information available for previously described species. In light of this problem, other methods have been developed to characterize the diversity in a sample without reference data sets [55]. Both approaches offer different possibilities and limitations that need to be considered before undertaking analyses (Figure 2). 
We discuss these in the following text, with the intention of inspiring future analytical developments. Because taxonomic profiling is the only goal shared by metabarcoding and metagenomics, we principally focus on methods regarding this aspect, although functional characterization will also be superficially explored.

Figure 2

Considerations and challenges for metagenomics and DNA metabarcoding. Both fields face a variety of challenges that are ideal candidates for future software development. While some of such problems are specific to one of the fields (right and left boxes), others are common to both (middle boxes).

Metabarcoding reference-based characterization

At the dawn of amplicon sequencing, reads were predominantly produced by Sanger sequencing and the data sets were small. However, data set sizes have increased by orders of magnitude, thanks to sequencing technology platforms such as the Roche GS, Ion Torrent and the Illumina series [56], each with particular commercial characteristics (e.g. cost and sequencing data yield) [57]. Thus, sequence similarity searches can today only be effectively handled through computational toolkits [58-60] that perform the necessary basic processing of the raw data. Basic processing steps in such toolkits include trimming, screening and aligning sequences against a database, clustering of sequences into operational taxonomic units (OTUs), and comparison of the sequence composition between different samples. The alignment of the reads to the reference database is probably the most important step of the analysis workflow. Different programs can be chosen for this task, such as UCLUST [61], CD-HIT [62] and BLAST [63]. After the alignment, instead of simply parsing a BLAST output, taxonomy is assigned using a predefined taxonomy map in which each reference sequence is linked to its corresponding taxonomy (Figure 3A). Other methods such as obiclean from OBITools [59] and SUMATRA+SUMACLUST [64] also include steps to model and detect PCR and sequencing errors, using clustering algorithms such as UCLUST [61] and CD-HIT [62] together with the sequence record counts, so as to avoid incorrect taxonomic assignations.

Figure 3

Metabarcoding approaches. (A) Although PCR-free data sets are typically large, usually only a small percentage of the sequence reads map to a reference database. In such a database, each entry has an assigned taxonomy so that phylogenetic placing approaches can be used for the taxonomic assignation. (B) PCR-based data sets consist of amplicon sequences that can be analyzed with the use of a reference database or without the need of it. If no database is used, the sequences are compared among themselves and are clustered by a similarity threshold; a representative sequence can be drawn from each cluster to then be compared with a reference database. On the other hand, if a database is used, the sequences are compared against the database and are assigned the taxonomy of the sequence they match under a given similarity threshold. A colour version of this figure is available online at BIB online: http://bib.oxfordjournals.org.

Based on the results of the reference database comparison, taxonomy assignation can be performed by alignment-based methods such as MEGAN [65] and MetaPhyler [66]. In this context, taxonomy is assigned against specific barcode loci databases, whether single loci such as 16S or CO1, or a set of a few phylogenetic marker loci drawn from across the genome. For example, a method called mOTU identifies what the authors call ‘metagenomics Operational Taxonomic Units’ [67], analogous to the molecular OTUs, by using 40 universal single copy phylogenetic marker genes. Some authors have referred to this approach of using multiple barcodes as metagenomics, due to the fact that the loci are drawn from the organism’s genome (nuclear plus organelle genes for eukaryotes) [66-70]. However, the total amount of sequence of the used loci is so small in comparison to the DNA content in an entire genome, that these databases cannot formally be considered genome-wide, especially for prokaryotes in which the exome is only a percentage of the genome. 
Generally, alignment-based methods for shotgun data sets use an approach that can be broadly described as follows. First, the HTS reads are aligned to a backbone alignment [71, 72]; subsequently, each query sequence is placed into a backbone tree [73] using an extended alignment; and finally, taxonomy is assigned to each read using a phylogenetic placement approach, such as the Lowest Common Ancestor (LCA) [74]. Phylogenetic placement approaches use a database and a reference tree associated with the database. LCA is one of the most commonly used algorithms in phylogenetic placement; it implements steps to address the issue of taxonomies coming from different database sources. In the LCA algorithm, if a read has hits to a single taxon it is assigned to that taxon, but if it has hits to different taxa it is placed higher up in the taxonomy, and reads that hit ubiquitously may even be assigned to the root node of the tree.
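
The LCA rule described above can be illustrated with a minimal Python sketch. The toy taxonomy, the PARENT table and the function names are hypothetical and are used only for illustration; programs such as MEGAN additionally filter hits by alignment score before applying the rule, which is not modelled here.

```python
# Hypothetical toy taxonomy (child -> parent); the root has no parent.
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Archaea": "root",
    "Proteobacteria": "Bacteria",
    "Firmicutes": "Bacteria",
    "Escherichia": "Proteobacteria",
    "Salmonella": "Proteobacteria",
    "Bacillus": "Firmicutes",
}


def lineage(taxon, parent=PARENT):
    """Return the path from the root down to `taxon`."""
    path = []
    node = taxon
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def lowest_common_ancestor(hits, parent=PARENT):
    """Deepest taxon shared by the lineages of all hits (None if no hits)."""
    if not hits:
        return None
    lineages = [lineage(hit, parent) for hit in hits]
    lca = None
    # Walk down from the root while every lineage agrees on the taxon.
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        lca = level[0]
    return lca


# A read hitting one taxon keeps it; conflicting hits move up the tree.
print(lowest_common_ancestor(["Escherichia"]))
print(lowest_common_ancestor(["Escherichia", "Salmonella"]))
print(lowest_common_ancestor(["Escherichia", "Archaea"]))
```

In this sketch a read with hits in both Bacteria and Archaea is assigned to the root, which mirrors the behaviour described for reads that hit ubiquitously.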

Considerations on metabarcoding reference-based methods

Having briefly outlined current reference-based metabarcoding methods, we turn to their pros and cons. A major attraction of amplicon data sets is that they are a relatively economic way to monitor diversity, thus enabling comparison of the taxonomic composition between various environmental communities. Although marker gene databases have expanded to include genes other than 16S rDNA [26, 75, 76], the most comprehensive are those for 16S rDNA [77], and to some degree for CO1. If using a relatively error-free database such as BOLD [78], SILVA [79] or Greengenes [80], the taxonomic identification can be reliable, especially if using long reads such as those from Roche GS sequencing. In contrast to metagenomic analyses, comparison of taxonomic composition can be automated for many different metabarcoded samples with the use of software such as UniFrac [22]. Despite the benefits of PCR-based methods, they face a number of challenges. Firstly, they must account for sequence errors derived from PCR and sequencing, which otherwise risk overestimation of the biodiversity within samples [81]. Secondly, although the primers used are often referred to as ‘universal’ or ‘generic’ for predetermined clades, their performance is difficult to predict on samples composed of largely unknown species, thus amplification biases may occur [82]. Other major limitations of reference-based approaches are the incompleteness of reference databases and the fact that different results can be obtained depending on the database size. This aspect has not received much attention in the majority of taxonomic profiling studies. However, some software has been developed to deal with this issue, for example TANGO [83]. Despite these problems, considerable efforts have been made to develop phylogenetic placement methods for taxonomy profiling in metabarcoding, and various programs with different statistical bases are available [84]. 
PCR-free multi-locus methods represent a valuable first step through which the metabarcoding community can exploit more of the information present in shotgun data sets than is otherwise used by the mining of a single gene. However, the fact that they still largely ignore the majority of the sequence data raises the obvious challenge of better exploiting this extra information. In this regard, it would be interesting to use composition or counts approaches to provide extra information for a more refined taxonomic assignation or to provide supporting information for the identified taxonomies. Another major challenge for the labeling of sequences with the traditional species concept is the identification of chimeric sequences. To this end, programs such as UCHIME [85] have been developed for chimeric amplicon identification, and chimera detection has already established itself as a de facto standard step. Furthermore, the presence of nuclear mitochondrial insertion (numt) sequences is a problem that should also be taken into account. As shown by Song et al. [86], DNA metabarcoding can overestimate the number of species when nuclear mitochondrial pseudogenes are co-amplified. Several steps have been suggested to deal with numts, such as BLAST searches, translation of the sequences to look for indels and stop codons, comparison of the marker gene to closely related published mitochondrial genomes and examination of nucleotide usage. However, these suggestions are not straightforward to implement, and no metabarcoding toolkit yet includes a program for the identification of numts. On a separate matter, although it is clear that data sets of barcode amplicons do not provide functional information, it is interesting to note the development of programs such as PICRUSt [87], which predicts the functional composition of a metagenome using marker gene data and a database of reference genomes.

Metabarcoding reference-free characterization

In classical sequence characterization approaches, where a label name is assigned to sequences, and the level of attained taxonomic resolution is the most important aspect to consider [88], reference databases are the cornerstone of the analyses. However, given that the overwhelming majority of microbial diversity remains to be characterized at the genetic level [53, 54], the concept of a molecular OTU has been applied for enabling improved descriptions of the taxonomic diversity present within a sample. In this reference-free approach, reads are first clustered by a similarity threshold and a representative sequence is obtained from each cluster (Figure 3B). These clusters are not assigned a taxonomic label, but sequences within the same cluster are expected to come from the same species. Because the methodological basis of this approach is the same as that used by some reference-based programs, most of such programs offer a reference-free mode. An example of such programs is a recently developed method called UPARSE [89]. The representative sequences of the clusters resulting from the reference-free modes can be used to assess the microbial diversity of the sample or to serve as input for other reference-based methods for their taxonomy assignment.
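
The reference-free clustering step described above can be sketched as a greedy, centroid-based procedure. This is only a minimal illustration and not the UPARSE algorithm: the identity function assumes equal-length, pre-aligned sequences, and the function names and the 97% default threshold are assumptions made for the example.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions (assumes equal-length, pre-aligned reads)."""
    if len(a) != len(b):
        raise ValueError("this toy example needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(reads, threshold=0.97):
    """Cluster unique sequences around centroids, most abundant first.

    Returns a dict mapping each representative (centroid) sequence to the
    list of unique sequences assigned to its cluster.
    """
    counts = Counter(reads)
    clusters = {}
    for seq, _ in counts.most_common():
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append(seq)
                break
        else:
            # No existing centroid is similar enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters


reads = ["ACGTACGTAC"] * 3 + ["ACGTACGTAT"] + ["TTTTGGGGCC"] * 2
otus = greedy_cluster(reads, threshold=0.85)
print(len(otus), "clusters; representatives:", list(otus))
```

The centroids play the role of the representative sequences: they can be counted to compare richness between samples or passed on to a reference-based method for taxonomy assignment.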

Considerations on metabarcoding reference-free methods

Metabarcoding was born when the identification of known species was enough to characterize an environment, and it is still the best option for studies such as biodiversity monitoring [90-93] of microorganisms, as well as macro organisms like mammals and plants. In particular, the monitoring of macro organisms [94, 95] can benefit from improvements to both reference-free and reference-based methods because the generic markers currently used for their identification are often unable to provide high taxonomic resolution [96-98]. Reference-free methods possess the clear advantage of not needing any reference database for the taxonomy assignment. However, characterizing a sample without a database also faces the challenge that the molecular OTU concept is not yet widely accepted by the community, because so far OTUs without a taxonomic classification can only be used for comparisons of environmental richness. A major challenge for determining the microbial species present in a sample without a reference database is to use algorithms other than those already used by the methods that depend on a reference database. There is an as yet underexplored alignment-free metabarcoding approach that works under a completely different methodological basis: the compression-based approach. This approach implements methods such as the Universal Similarity Metric (USM) [99], an approximation of USM called the Normalized Compression Distance [100], and the Information-Based Distance [101], which can produce phylogenetic trees with good accuracy [99]. These kinds of analyses represent an interesting parameter-free and reference-database-free means of clustering sequences that should be further explored.
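
As an illustration of the compression-based idea, the Normalized Compression Distance between two sequences x and y is commonly written as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The sketch below uses zlib from the Python standard library as the compressor; this choice and the function names are assumptions for the example, and the cited studies may rely on other compressors.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (compression level 9)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two sequences.

    Values near 0 suggest that one sequence is largely predictable from
    the other; values near 1 suggest little shared information.
    """
    bx, by = x.encode(), y.encode()
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

A matrix of pairwise NCD values could then be passed to any distance-based clustering or tree-building routine, without alignment or a reference database. Note that general-purpose compressors add a fixed overhead, so the distance is only informative for sequences that are long relative to that overhead.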

Metagenomic reference-based characterization

BLAST [63] is perhaps the most basic and widely employed method for identifying the best hit of the shotgun data set reads against databases containing taxonomically identified reference sequences. Once the BLAST output is generated, subsequent taxonomic assignation is performed using different strategies, depending on the software. BLAST is implemented in a variety of methods [65, 102, 103] that are able to undertake taxonomic and functional identification, as well as perform comparative analyses of different samples in a straightforward and interactive manner that can be clearly visualized. For example, MEGAN [65] applies the LCA algorithm on a BLAST output for taxonomic assignation. This approach sits on the fuzzy line that separates metabarcoding and metagenomics: when it uses reference databases consisting of complete genomic sequences and uses all the reads for comparison, instead of initially fishing for marker barcodes, it can be classified as a metagenomic method; otherwise it is classified as metabarcoding (Figure 4A). Shotgun UniFrac [104] as implemented in QIIME [58] is an alternative phylogenetic placement method that has proven useful for taxon identification of entities such as viruses through the use of a reference database of full genomes.

Figure 4

Metagenomic approaches. (A) Metagenomic reference-based approaches start by mapping the reads to a genome database and then apply various algorithms to assign taxonomy, such as phylogenetic placement, or the use of unique mapping reads to the genome of a species in the database. (B) Alternatively, the reads can be de novo assembled and the scaffolds, or the open reading frames predicted on the scaffolds, can be searched against the database, thus reducing the search time. (C) Metagenomic reference-free methods usually start by de novo assembling the reads, then the number of reads mapping back to the assembled sequences (the scaffolds or the open reading frames predicted from the scaffolds) can be used to create a count matrix that can be further clustered, with each cluster representing a metagenomic species. A colour version of this figure is available online at BIB online: http://bib.oxfordjournals.org.

‘Composition-based’ approaches are an alternate strategy that exploits nucleotide usage information extracted from the reads to detect which taxonomic entities are present in the sample. Nucleotide usage is an interesting piece of information that can only be exploited if used in a metagenomic approach, because it requires information from many complete genomes. In general, composition-based methods can be considered as sitting on the interface between reference-free and reference-based methods, because they use statistical approaches such as Markov models [105], support vector machines [106], non-negative least squares [107] or mixture modeling [108]. Here however, we consider them as reference-based, simply because the database is the suite of Markov models or the required training sequence set. The methodological basis of composition-based methods can be generally explained as follows. First, models such as interpolated Markov models are generated to characterize variable-length oligonucleotides typical of a phylogenetic grouping. 
The models can be generated, for example, by training on chromosomes and plasmids from organisms collected from a database such as NCBI RefSeq [109]. Subsequently the model gives a score reflecting the probability of a query sequence to belong to the class of sequences on which the model was trained. There are also hybrid methods for taxonomic classification that combine the result of alignment-based and composition-based approaches in a complementary way. Other metagenomic methods also perform taxonomic assignation based on DNA sequences from nuclear as well as mitochondrial genomes by first performing a de novo assembly and then comparing the assembled sequences to a genome-wide genes database [110-112] (Figure 4B). These methods should be considered metagenomic methods because they apply genomic algorithms such as de novo assembly, and the database can have many sequences from coding DNA sequences (CDS), markers or not, as well as non-CDS, or can consist of entire genomes, compared with those used by metabarcoding alignment-based methods using short reads as input, which are restricted to few marker genes. If the de novo assembly derives from high depth data sets, assembling them can produce almost complete genomes of high quality even from rare species [113, 114]. Other methods such as CARMA3 [115] use hidden Markov models with the Pfam database [116] to match the short reads to protein domains. The approach also uses a small percentage of the total data set, but it is to be considered metagenomic because it uses CDS from every reported gene instead of limiting itself to marker genes only. Furthermore this classification allows for functional characterization. Functional characterization is usually performed by alignment of the sequences to already annotated proteins to find their homologous sequences [62, 117–119]. 
This relies on the assumption that sequence homology suggests shared function [120, 121], and it is also considered that there are different levels of functional similarity, such as pathways or protein families [122-124]. Function-oriented databases such as COG [125], Pfam [116] and TIGRFAM [126] are meant to be used for gene-level analyses, while others such as KEGG [127], MetaCyc [128] and SEED [129] are used for analyses at system or pathway level.
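
The composition-based scoring described in this section (train a model per phylogenetic group, then score a query by the probability that it belongs to that group) can be sketched with a fixed-order Markov model. This is a deliberate simplification: the methods cited above use interpolated, variable-length models trained on full genomes, whereas the order, the add-one smoothing, the toy training strings and the function names below are all assumptions made for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, order=2):
    """Count context -> next-base transitions for a fixed-order Markov model."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts


def log_likelihood(query, counts, order=2):
    """Log-probability of `query` under the model, with add-one smoothing."""
    total = 0.0
    for i in range(order, len(query)):
        context = counts.get(query[i - order:i], {})
        numerator = context.get(query[i], 0) + 1
        denominator = sum(context.values()) + len(ALPHABET)
        total += math.log(numerator / denominator)
    return total


def classify(query, models, order=2):
    """Return the name of the model that gives `query` the highest score."""
    return max(models, key=lambda name: log_likelihood(query, models[name], order))


# Toy training sets standing in for genomes of two phylogenetic groups.
models = {
    "AT_rich": train_markov(["ATATATTAATATAATTATAT" * 5]),
    "GC_rich": train_markov(["GCGCGGCCGCGCCGGCGCGC" * 5]),
}
print(classify("ATTATAATAT", models))
print(classify("GCCGCGGCGC", models))
```

Because each base contributes one term to the score, short reads yield few terms and hence noisy scores, which is consistent with composition information being more reliable for longer sequences.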

Considerations on metagenomics reference-based methods

Perhaps the most significant advantage of metagenomics methods that exploit reference databases is that (depending on the completeness of their reference database) they can provide reliable species identification. Furthermore, improvements to the class of methods that use a reference database for the training of the program while also allowing the discovery of new species are a significant first step into a more complete exploitation of all the information in a metagenomic data set. This aspect is of considerable importance, as it has recently been proven that the majority of microbes in the human gut (currently the best studied environment with metagenomics) are not represented by current genomic resources [9]. Although methods based on reference sequences can identify new relatives of characterized organisms, they are extremely limited when it comes to the discovery of species that remain largely uncharacterized. Furthermore, genome-based methods do not allow for genomic-scale grouping of sequences with certain characteristics most likely coming from closely related individuals, which can be used for reconstruction of species genomes. The computational time needed during the database comparison can be high, especially if using BLAST as the primary alignment tool. Another aspect to take into account with regard to the database comparison results, is that the decision of the threshold that should be used for reliably assigning a taxonomic level is somewhat arbitrary because it strongly varies for each read, and often the most reliable level is high (superkingdom or phylum). Furthermore, methods based on genome alignment require normalization by genome size in order to estimate taxonomic abundance without bias [7], something that is not possible to estimate for uncharacterized species. Lastly, metagenomic methods that start by assembling the reads into longer sequences require paired-end sequence data to perform a good-quality de novo assembly. 
Paired-end libraries are more expensive to generate than shotgun data sets, thus if used in this way there can be a relatively low efficiency to economic cost ratio. Although the reliability of reference-based species identification can be better than that of the reference-free methods from a conventional point of view (that in which an already described species name can be assigned to a group of reads), the best taxonomic identification methods have high precision but low sensitivity. This means that they make accurate assignments but fail to classify a large portion of the input sequences, even at high taxonomic levels [130]. The development of strategies focusing on working with the unclassified reads is of paramount importance; for example, more relaxed search parameters can be used to allow for matches to more distant relatives that would not be identified with the current stringent methods, which prioritize the specificity of the taxonomic assignation. Composition-based classifiers face more problems with regard to taxonomic assignment than alignment-based methods do, given that composition information is more reliable when obtained from longer reads [131], which most of today’s shotgun data sets do not provide. Thus, there is room for improvement of metagenomic reference-based taxonomic assignation. Improving de novo assembly algorithms would yield sequences long enough to extract reliable composition information, but chimeric assemblies need to be identified first so that they do not provide mixed-up information that confounds the taxonomic assignation. So far, a number of marker genes have proven useful for prokaryotic species delineation; however, other organisms such as fungi and viruses have not been genetically characterized in terms of taxonomy as deeply as prokaryotes [132, 133]. 
Thus, more metagenomic methods not based on genome alignment need to be developed, focusing on uncharacterized species and taking into account eukaryotic [134] and viral species [135-138]. This is especially important for phages, the most abundant biological entities on the planet [133]. Regarding functional characterization of a metagenome, homology annotation is widely used for functional profiling, but other methods for annotating the proteome of metagenomes should be explored. This becomes important given the incompleteness of the databases and the huge number of reported proteins that are either uncharacterized or only assigned putative functions. To this end, context-based methods, which integrate information from genomes and pathways [87, 139–141], represent an area that can be further explored to refine or enhance functional annotation. A method called pseudo amino acid composition (PseAAC) [142-145] has been extensively used in computational proteomics for predicting protein structures and functions, but remains unexplored in metagenomics. PseAAC is a machine-learning approach that combines the composition of the 20 conventional amino acids with a set of discrete sequence correlation factors, obtained with a correlation function that reflects the sequence-order correlation between the most contiguous residues along a protein chain [142]. PseAAC has mainly been used to predict protein cellular attributes, such as the cell compartment a protein belongs to and how it is associated with the lipid bilayer of an organelle [142], as well as protein structural classes [146], enzyme families [147] and protein–protein interactions [148], among others [149]. These attributes are closely related to the biological function of the protein.
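
To make the PseAAC description above more concrete, the following Python sketch builds a feature vector from the 20 amino acid frequencies plus a small number of sequence-order correlation factors. It is a simplified illustration under stated assumptions, not the implementation of [142]: it uses a single physicochemical property (the Kyte-Doolittle hydropathy scale) where the published method combines several, and the function name `pseaac`, the number of tiers `lam` and the weight `weight` are hypothetical choices made for this example.

```python
import math

# Kyte-Doolittle hydropathy values, used here as the single physicochemical
# property (an assumption of this sketch, not the property set of [142]).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(HYDROPATHY)


def standardize(prop):
    """Rescale property values to zero mean and unit standard deviation."""
    values = list(prop.values())
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {aa: (v - mean) / sd for aa, v in prop.items()}


def pseaac(sequence, lam=3, weight=0.05):
    """Return a (20 + lam)-dimensional pseudo amino acid composition vector."""
    # Residues outside the 20 conventional amino acids are silently dropped.
    seq = [aa for aa in sequence.upper() if aa in HYDROPATHY]
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    prop = standardize(HYDROPATHY)

    # Conventional composition: relative frequency of each amino acid.
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

    # Sequence-order correlation factor for each tier k = 1..lam:
    # mean squared property difference between residues k positions apart.
    thetas = []
    for k in range(1, lam + 1):
        total = sum((prop[a] - prop[b]) ** 2 for a, b in zip(seq, seq[k:]))
        thetas.append(total / (len(seq) - k))

    # Normalize so that the whole vector sums to one.
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]
```

Calling `pseaac` on a predicted protein sequence returns a fixed-length vector whatever the protein length, which is what allows such representations to be passed to standard classifiers.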

Metagenomic reference-free characterization

Metagenomic data sets differ principally from the classical PCR-based amplicon data sets of DNA metabarcoding in that they are able to exploit a key piece of information: the number of times a sequence is present [150] (Figure 4C). Although this kind of approach has yet to be widely adopted, and thus could benefit from considerable improvement, it represents a promising route to finding novel species and genes [151-154]. Currently, only a handful of published methods are based on the notion that abundance is constant across genetic entities such as genes in a chromosome [155-157]. A recent example is the method proposed by Nielsen et al. [55], which exploits the co-abundance profiles across metagenomic data sets from a number of samples of the same type. This method extracts groups of genes whose abundance correlates with that of randomly picked seed genes, and calls these clusters co-abundance gene groups (CAGs). Segregating a metagenome into groups of genes that have similar abundance allows the identification of biological entities like prokaryotes and phages, as well as small genetic entities representing co-inherited clonal heterogeneity. The ability of the method to discriminate between strains of the same species, even within complex metagenomic samples, indicates the power of co-abundance to segregate closely related biological entities. Another method that uses count information is that proposed by Albertsen et al. [114]. In this method, data sets from a given sample are produced using two different DNA extraction methods. The first steps are similar to those of the CAGs method: the reads are de novo assembled, and a primary binning approach of clustering by similarity is then used to make a non-redundant gene catalogue. Subsequently, each data set’s reads are mapped to the assembled set of non-redundant scaffolds, and a normalized coverage for each scaffold is recorded. 
The steps particular to this method then include binning the scaffolds into population genomes by plotting the two coverage estimates of all scaffolds against each other. Scaffolds that cluster together in the plot represent putative population genome bins.
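
The differential coverage idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the published approach [114] relies on inspecting the coverage plot, whereas this sketch substitutes a simple automatic grouping (single-linkage grouping of points in log-coverage space). The function name `bin_by_differential_coverage` and the threshold `max_dist` are hypothetical.

```python
import math


def bin_by_differential_coverage(coverages, max_dist=0.2):
    """Group scaffolds whose two coverage estimates lie close together.

    coverages: dict mapping scaffold name -> (coverage_a, coverage_b),
               both assumed to be positive normalized coverages.
    max_dist:  hypothetical linkage threshold in log10 coverage units.
    """
    # Work in log space so that fold changes, not absolute differences, matter.
    points = {
        name: (math.log10(a), math.log10(b))
        for name, (a, b) in coverages.items()
    }
    names = list(points)
    parent = {name: name for name in names}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair of scaffolds that lie within max_dist of each other.
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            if math.dist(points[u], points[v]) <= max_dist:
                parent[find(u)] = find(v)

    # Collect connected scaffolds into putative population genome bins.
    bins = {}
    for name in names:
        bins.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in bins.values())
```

Scaffolds from the same population genome are expected to show similar coverage in both data sets, so they end up linked, while scaffolds from organisms that respond differently to the two extraction methods fall into separate bins. The pairwise loop is quadratic in the number of scaffolds and is meant for illustration only.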

Considerations on metagenomics reference-free methods

These methods are powerful in that they allow the recovery of genomes from unknown and rare taxonomic variants, opening the possibility of finding novel enzymes and allowing a functional characterization that is much broader and more complete than one based on a reference-based approach. Such characterization in turn enables the use of a looser species definition, such as that presented by the CAGs, which is more useful than the standard rigorous species definition when dealing with the widespread transfer of DNA across species boundaries. Similarly, the use of a functional assembly instead of a species taxonomic assignment to characterize an ecosystem and compare it with others represents an underexplored avenue containing considerable opportunities [9, 87]. This metagenomic species concept is useful for a complete metagenomic characterization of a sample; however, it faces the huge challenge of being understood and accepted by a scientific community that is slow to adopt new concepts. Both reference-based and reference-free metagenomic methods face the problem that comparing many different samples requires manual inspection. Interesting methods such as differential depth binning cannot be applied automatically to many samples because they require manual examination of the plot [158], and others like MEGAN [65] and MG-RAST [103] are based on visual and interactive analyses. However, the canopy clustering-based method [55] represents a first step towards comparing multiple samples in an automated way. On a separate subject, there is an approach similar to PseAAC but applied to DNA/RNA sequences, called pseudo K-tuple nucleotide composition (PseKNC) [159, 160]. PseKNC has been used, for example, to infer recombination spots [161, 162], promoters [163] and nucleosome positioning [164], as well as cDNA-related features such as splicing sites [165] and translation initiation sites [166]. 
These kinds of methods could be used to annotate genomic features on genomes recovered with reference-free metagenomic methods. Finally, both the complexity of the data set and the sequencing depth have a significant impact on the accuracy of the results [113, 167]. Further exploration of the impact of sequencing depth, and of the proper integration of data sets coming from different sequencing technologies, each with its own nucleotide miscalling problems, would provide more information on how to exploit the data optimally and on the limits of what can be drawn from them.
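
As a rough illustration of the co-abundance principle behind the canopy clustering-based method [55] discussed in this section, the Python sketch below groups genes whose abundance profiles across samples correlate with that of a seed gene. It is a deliberately simplified stand-in, not the published algorithm: seeds are taken in input order rather than picked at random, no canopy centroid is refined, and the function names and the Pearson correlation threshold `min_corr` are assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-abundance signal
    return cov / math.sqrt(var_x * var_y)


def co_abundance_groups(profiles, min_corr=0.9):
    """Greedily group genes by correlation to a seed gene.

    profiles: dict mapping gene name -> list of abundances, one per sample.
    min_corr: hypothetical correlation threshold for joining a group.
    """
    remaining = dict(profiles)
    groups = []
    while remaining:
        # Take the next unassigned gene as the seed (input order, not random).
        seed = next(iter(remaining))
        seed_profile = remaining[seed]
        members = [
            gene for gene, profile in remaining.items()
            if gene == seed or pearson(seed_profile, profile) >= min_corr
        ]
        for gene in members:
            del remaining[gene]
        groups.append(members)
    return groups
```

Because the grouping needs only an abundance table, it can run unattended over many samples, which is the property that distinguishes this family of methods from plot-based binning.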

Final remarks

Although it is difficult to predict accurately how many of the methodological challenges will be addressed in the future and which new tools will be developed, it is expected that reference-based methods will rapidly benefit from the accumulation of information in the data sets, while reference-free methods would be helped by a wider acceptance of a broader species definition. Advances in computational technology are also expected to have an impact on the kind of programs developed. For example, wider usage of cloud computing [168] or easier access to much more powerful computational resources would enable the processing of even larger data sets with more computationally demanding algorithms. Another important issue that remains to be discussed is the development of computational tools in concert with laboratory method development. For example, the laboratory method called Hi-C, which was developed for generating chromatin-level contact probability maps [169], has been successfully applied to reconstruct individual genomes of microbial species present within a synthetic metagenome sample [170]. Although this method still needs to be modified further before it can be applied more widely to complex real metagenomic samples, computational method developers can influence its refinement. The metabarcoding field has also benefited from the development of refined laboratory methods that include the use of double-tagged amplicons coupled to multiple PCR replicates, which are sequenced by HTS in a multiplexed manner [92]. Although this protocol provides more information for distinguishing PCR/sequencing errors and chimeras, there is currently only one method developed specifically to analyze that kind of data set (Zepeda Mendoza ML, Carmona Baez A, Bohmann K, Gilbert MTP, submitted for publication). Other techniques such as HITChip [171] have benefited large-scale taxonomic profiling in reference-based metabarcoding studies. 
However, the development of programs to customize the design of chip probes based on non-standard resources, such as in-house databases, is still needed. In summary, the primary message that we hope to have conveyed with the description and definition of the method boundaries presented here is the need for future software development for metagenomics and DNA metabarcoding data analyses. Secondly, we seek to clarify the confusion caused by the mislabeling of some metabarcoding studies with the terms ‘metagenomics’ or ‘targeted metagenomics’. Thirdly, we intend to draw researchers’ attention to the challenges that the current methods face, and to suggest avenues of method development for further exploration. For example, we believe that metagenomics would greatly benefit from closer integration with the algorithms used in computational proteomics, such as PseAAC and PseKNC. Finally, we believe that consideration of the pros and cons of the different approaches, and of the specific goals of the two ‘meta-scale’ research fields, will help researchers choose the appropriate methods to address the specific questions of their studies (Figure 5).

Figure 5

Method classification placement map. As the placement of the methods shows, there is a lack of software in some areas and a wealth in others, especially at the borderlines, where methods might at first seem difficult to classify. (A) Metagenomic reference based. (B) Metagenomic reference free. (C) DNA metabarcoding reference based. (D) DNA metabarcoding reference free. Metabarcoding and metagenomics share many aspects of their software, and this has led to a misunderstanding of their meaning and goals. To distinguish ‘metagenomics’ and ‘DNA metabarcoding’ as research fields from ‘metagenomic sequencing’ as a laboratory technique, DNA metabarcoding can be subdivided by how many barcodes are used (single or multiple loci) and which sequencing technique is used (PCR-based or PCR-free). In general, metagenomics and DNA metabarcoding software can be divided according to whether or not it uses a reference database, and each type poses different challenges. Re-examining the pros and cons of the different approaches in metagenomics and metabarcoding is important when deciding which method to use for a study. Method development in metagenomics and metabarcoding would benefit from considering recently emerging techniques in other disciplines.

Funding

This work was supported by the Lundbeck Foundation grant number R52-A5062.

