Daniela Becker, Denny Popp, Hauke Harms, Florian Centler.
Abstract
Metagenomics analysis, which reveals the composition and functional repertoire of complex microbial communities, typically relies on large amounts of sequence data. Numerous analysis strategies and computational tools are available for this purpose. Fully integrated automated analysis pipelines such as MG-RAST or MEGAN6 are user-friendly but not designed for integrating specific knowledge on the biological system under study. In order to facilitate the consideration of such knowledge, we introduce a modular, adaptable analysis pipeline combining existing tools. We applied the novel pipeline to simulated mock data sets focusing on anaerobic digestion microbiomes and compared the results to those obtained with established automated analysis pipelines. We find that the analysis strategy and the choice of tools and parameters have a strong effect on the inferred taxonomic community composition, but not on the inferred functional profile. By including prior knowledge, computational costs can be decreased while improving result accuracy. While automated off-the-shelf analysis pipelines are easy to apply and require no knowledge on the microbial system under study, custom-made pipelines require more preparation time and bioinformatics expertise. This extra effort is minimized by our modular, flexible, custom-made pipeline, which can be adapted to different scenarios and can take available knowledge on the microbial system under study into account.
Keywords: anaerobic digestion; compositional analysis; functional profiling; metagenomics; microbial communities
Year: 2020 PMID: 32380653 PMCID: PMC7284732 DOI: 10.3390/microorganisms8050669
Source DB: PubMed Journal: Microorganisms ISSN: 2076-2607
Composition of metagenomic mock data sets.
| Species | NCBI Reference Sequence | Genome Size [bp] | Abundance in MDS1 [%] | Abundance in MDS2 [%] | Abundance in MDS3 [%] |
|---|---|---|---|---|---|
| Methanoculleus marisnigri | NC_009051.1 | 2,478,101 | 5.03 | 5.08 | 16.22 |
| Methanosaeta concilii | NC_015416.1, NC_015430.1 ¹ | 3,008,626 | 46.08 | 68.45 | 68.44 |
| Aminobacterium colombiense | NC_014011.1 | 1,980,592 | 1.35 | 1.35 | 1.34 |
| Syntrophomonas wolfei | NC_008346.1 | 2,936,195 | 2.80 | 2.80 | 2.80 |
| Escherichia coli | NC_000913.3 | 4,641,652 | 44.74 | 22.37 | 11.19 |

¹ Combination of complete genome (NC_015416.1) and a selected chromosome sequence (NC_015430.1).
Criteria to select tools for metagenomics data analysis [37].
| Criteria | Description |
|---|---|
| Availability | Open source or commercial version as download or web service. |
| Support | Availability of: manual/readme, version update, help functions, and/or developer contact. |
| Flexibility | Flexible usage (variety of input/output data formats). |
| Run time | Time required to run the tool (operating time). |
| Usability | Ease of tool usage (installation, parameter settings, algorithm applicability). |
Quality characteristics to compare tool output.
| Step | Feature | Description |
|---|---|---|
| Assembly | Number of contigs | Number of contigs generated |
| | Largest contig | Length of the longest contig |
| | N50 | Length-weighted median statistic: 50% of the entire assembly (in terms of number of bases) is contained in contigs longer than or equal to the N50 value. Assemblies can only be compared if their assembly sizes are similar. |
| | Total assembly length | Sum over all bases of all contigs generated |
| Binning | Number of bins | Number of bins (clusters) |
| | Completeness/contamination | CheckM estimates completeness and contamination per bin as a quality measure, based on a defined set of marker genes for bacterial and archaeal genomes. |
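The assembly features listed above can be computed directly from a list of contig lengths. The following is a minimal sketch, not the implementation used in the study; the function names and the example contig lengths are hypothetical and chosen only for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    Contigs are sorted from longest to shortest; the N50 is the length of
    the contig at which the running sum first reaches half of the total
    assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running_sum = 0
    for length in lengths:
        running_sum += length
        if running_sum >= half_total:
            return length


def assembly_summary(contig_lengths):
    """Collect the four assembly features from the table above."""
    lengths = list(contig_lengths)
    return {
        "number_of_contigs": len(lengths),
        "largest_contig": max(lengths),
        "total_assembly_length": sum(lengths),
        "n50": n50(lengths),
    }


# Hypothetical contig lengths in bp (not taken from the study).
example_lengths = [80, 70, 50, 40, 30, 20, 10]
summary = assembly_summary(example_lengths)
print(summary)
```

Because the N50 depends on the total assembly length, two assemblies with very different total lengths can have similar N50 values for unrelated reasons, which is why the comparison is only meaningful for assemblies of similar size.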
Figure 1. Metagenomic analysis using MEGAN6 stand-alone or a custom-made pipeline approach. The custom-made pipeline analyzes metagenomic data using prior information on expected microbial species (black arrows). Alternatively, the pre-mapping step can be skipped, and de novo assembly (red arrow) or annotation with MEGAN6 (purple arrow) can be performed directly on the cleaned reads.
Comparing assembly and reassembly results.
| Criteria | MDS1 | MDS2 | MDS3 |
|---|---|---|---|
| Assembly | | | |
| Number of contigs | 66 | 167 | 966 |
| Largest contig [bp] | 327,353 | 145,780 | 16,676 |
| Total assembly length [bp] | 4,564,668 | 4,520,232 | 2,975,790 |
| N50 [bp] | 133,300 | 47,255 | 3,611 |
| Reassembly | | | |
| Number of contigs | 72 | 75 | 672 |
| Largest contig [bp] | 327,353 | 317,670 | 27,558 |
| Total assembly length [bp] | 4,567,420 | 4,378,346 | 3,381,373 |
| N50 [bp] | 132,848 | 113,839 | 5,862 |
Figure 2. Comparative community analysis. (a) Comparison of community compositions at the genus level predicted by different analysis strategies for the three mock communities, MDS1–3. The custom-made pipeline (CMP) integrates prior knowledge on the community and includes an assembly step, while MEGAN6 and MG-RAST represent two automated analysis pipelines without assembly; (b) Deviation of predictions from actual species fractions in the total community. The section “other” refers to identified genera that were not part of the initial community composition and to unassigned reads.
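A deviation of the kind shown in panel (b) can be sketched as follows. This is an illustration under stated assumptions, not the calculation used in the study: the deviation is taken here as the signed difference in percentage points between predicted and actual abundance, the read counts are hypothetical, and only the expected MDS1 abundances come from the mock data set composition table.

```python
# Expected genus-level abundances for MDS1 in percent (from the mock data
# set composition table).
EXPECTED_MDS1 = {
    "Methanoculleus": 5.03,
    "Methanosaeta": 46.08,
    "Aminobacterium": 1.35,
    "Syntrophomonas": 2.80,
    "Escherichia": 44.74,
}


def relative_abundance(read_counts):
    """Convert absolute read counts per label into percentages."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads were counted")
    return {label: 100.0 * n / total for label, n in read_counts.items()}


def deviation_from_expected(predicted_pct, expected_pct):
    """Signed deviation (percentage points) per expected genus.

    Everything predicted outside the expected community, including
    unassigned reads, is pooled into "other" (expected fraction: 0).
    """
    deviation = {
        genus: predicted_pct.get(genus, 0.0) - expected
        for genus, expected in expected_pct.items()
    }
    deviation["other"] = sum(
        pct for label, pct in predicted_pct.items() if label not in expected_pct
    )
    return deviation


# Hypothetical read counts, for illustration only (not results of the study).
hypothetical_counts = {
    "Methanoculleus": 500,
    "Methanosaeta": 4500,
    "Aminobacterium": 100,
    "Syntrophomonas": 300,
    "Escherichia": 4400,
    "Shigella": 150,
    "unassigned": 50,
}
predicted = relative_abundance(hypothetical_counts)
deviation = deviation_from_expected(predicted, EXPECTED_MDS1)
for label, value in deviation.items():
    print(f"{label}: {value:+.2f}")
```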
Figure 3. Functional potential of mock communities compared between analysis strategies. The relative abundances were calculated from absolute read counts per sample and analysis approach. Annotation was based on the EggNOG (CMP and MEGAN6) and COG (MG-RAST) databases.
Figure 4. Functional potential combined with taxonomic information using the custom-made pipeline. Counted reads per sample were used to calculate the relative abundance. The functional potential analysis uses the EggNOG database, level 1 classification. The row “other” refers to identified species that are not part of the initial community composition and to unassigned reads (for detailed relative abundances see Supplementary File 2: worksheet 7).
Composition of bins compared between the CMP with and without pre-mapping for all three mock data sets (MDS) in relative abundances (%). Pre-mapping successfully removes reads associated with expected species, only leaving E. coli reads for assembly and binning.
Columns marked “with” refer to the CMP with pre-mapping (only unexpected reads); columns marked “without” refer to the CMP without pre-mapping (all reads).

| MDS | Bin | MM (with) | MC (with) | AC (with) | SW (with) | EC (with) | other (with) | MM (without) | MC (without) | AC (without) | SW (without) | EC (without) | other (without) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | nc | 0.00 | 0.00 | 0.00 | 0.00 | **90.30** | 9.70 | 16.25 | 3.48 | 19.33 | **59.35** | 0.20 | 1.39 |
| 1 | 1 | -- | -- | -- | -- | -- | -- | 0.00 | 2.22 | 0.00 | 0.00 | **88.25** | 9.53 |
| 1 | 2 | -- | -- | -- | -- | -- | -- | 0.00 | **99.60** | 0.00 | 0.27 | 0.00 | 0.13 |
| 2 | nc | 0.00 | 0.00 | 0.00 | 0.00 | **89.80** | 10.20 | 13.73 | 14.89 | 15.81 | **49.17** | 4.76 | 1.64 |
| 2 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | **96.63** | 3.37 | 0.65 | **81.42** | 0.79 | 2.55 | 13.33 | 1.26 |
| 2 | 2 | -- | -- | -- | -- | -- | -- | 0.79 | **50.57** | 0.95 | 3.10 | 40.27 | 4.32 |
| 3 | nc | 0.00 | 0.00 | 0.00 | 0.00 | **89.21** | 10.79 | 0.96 | 6.52 | 13.19 | **39.82** | 35.36 | 4.16 |
| 3 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | **94.38** | 5.62 | 3.26 | **82.57** | 1.05 | 3.47 | 8.85 | 0.80 |
| 3 | 2 | 0.00 | 0.00 | 0.00 | 0.00 | **88.12** | 11.88 | 18.55 | **59.36** | 1.50 | 4.93 | 12.67 | 2.99 |
| 3 | 3 | -- | -- | -- | -- | -- | -- | 3.46 | **45.54** | 1.15 | 3.77 | 41.53 | 4.55 |
Unit = % | nc = noclass (contigs that could not be assigned to a bin) | MM = Methanoculleus marisnigri | MC = Methanosaeta concilii | AC = Aminobacterium colombiense | SW = Syntrophomonas wolfei | EC = Escherichia coli | other = sum of all species that are not part of the initial community | Bold numbers highlight the highest relative abundance per bin. | -- = no bin
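The highest relative abundance per bin can be read off programmatically. The sketch below uses the MDS1 values for the CMP without pre-mapping from the table above; the data layout and the function name are assumptions made for illustration and are not part of the published pipeline.

```python
# Column order follows the table legend.
SPECIES = ("MM", "MC", "AC", "SW", "EC", "other")

# MDS1, CMP without pre-mapping (all reads), relative abundances in percent.
mds1_without_premapping = {
    "nc": (16.25, 3.48, 19.33, 59.35, 0.20, 1.39),
    "1": (0.00, 2.22, 0.00, 0.00, 88.25, 9.53),
    "2": (0.00, 99.60, 0.00, 0.27, 0.00, 0.13),
}


def dominant_species(bin_composition):
    """Return, per bin, the label and value of the largest fraction."""
    result = {}
    for bin_id, fractions in bin_composition.items():
        # Index of the largest relative abundance in this bin.
        top = max(range(len(fractions)), key=fractions.__getitem__)
        result[bin_id] = (SPECIES[top], fractions[top])
    return result


dominant = dominant_species(mds1_without_premapping)
for bin_id, (label, pct) in dominant.items():
    print(f"bin {bin_id}: {label} ({pct:.2f}%)")
```

For MDS1 without pre-mapping, this identifies one bin dominated by E. coli and one dominated by M. concilii, while the unbinned contigs are mostly assigned to S. wolfei.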