Jiarong Guo, John F Quensen, Yanni Sun, Qiong Wang, C Titus Brown, James R Cole, James M Tiedje.
Abstract
Shotgun metagenomics has greatly advanced our understanding of microbial communities over the last decade. Metagenomic analyses often include assembly and genome binning, which are computationally daunting tasks, especially for big data from complex environments such as soil and sediments. In many studies, however, only a subset of genes and pathways involved in specific functions is of interest; thus, it is not necessary to attempt global assembly. In addition, methods that target genes can be computationally more efficient and produce more accurate assemblies by leveraging rich databases, especially for genes of broad interest such as those involved in biogeochemical cycles, biodegradation, and antibiotic resistance, or those used as phylogenetic markers. Here, we review six gene-targeted assemblers with unique algorithms for extracting and/or assembling targeted genes: Xander, MegaGTA, SAT-Assembler, HMM-GRASPx, GenSeed-HMM, and MEGAN. We tested these tools using two datasets with known genomes (a synthetic community of artificial reads derived from the genomes of 17 bacteria, and shotgun sequence data from a mock community with 48 bacterial and 16 archaeal genomes) and a large soil shotgun metagenomic dataset. We compared assemblies of a universal single-copy gene (rplB) and two N cycle genes (nifH and nirK). We measured computational efficiency, sensitivity, specificity, and chimera rate and found that Xander and MegaGTA, which both use a probabilistic graph structure to model the genes, had the best overall performance with all three datasets, although MEGAN, a reference-matching assembler, had better sensitivity with synthetic and mock community members chosen from its reference collection. Xander and MegaGTA are also the only tools that include post-assembly scripts tuned for common molecular ecology and diversity analyses. Additionally, we provide a mathematical model that estimates the probability of assembling targeted genes in a metagenome, which can be used to estimate the required sequencing depth.
Keywords: MegaGTA; Xander; gene-centric assembly; gene-targeted assembly; microbial ecology
Year: 2019 PMID: 31749830 PMCID: PMC6843070 DOI: 10.3389/fgene.2019.00957
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Time and memory requirements for processing the synthetic data for rplB. Except for MEGAN BLAST/DIAMOND, which was performed on MSU’s cluster, all times are for running on an HP ProBook 450 G5 with an Intel i7-8550U CPU and 32 GB RAM running Ubuntu 18.04 LTS.
| Program | Stage | Threads | Wall time (hh:mm:ss) | CPU time (hh:mm:ss) | Peak memory (KB) |
|---|---|---|---|---|---|
| Xander | Build | 1 | 00:03:52 | 00:03:57 | 736,860 |
| | Find | 4 | 00:00:57 | 00:04:48 | 1,512,728 |
| | Search | 4 | 00:02:42 | 00:04:28 | 867,776 |
| MegaGTA | Main | 8 | 00:10:06 | 01:15:02 | 1,133,248 |
| | Post-processing | 4 | 00:00:47 | 00:02:16 | 729,624 |
| FragGeneScan | | 4 | 00:24:20 | 01:29:15 | 65,356 |
| HMM-GRASPx | | 4 | 00:05:28 | 00:05:28 | 8,159,504 |
| SAT-Assembler | | NA | 00:05:55 | 00:06:38 | 77,620 |
| MEGAN | DIAMOND | 8 | 14:38:57 | 95:11:48 | 19,810,188 |
| | Meganize | NA | 00:05:46 | 00:15:57 | 21,659,968 |
| | Assembly | NA | 00:00:03 | NA | NA |
| GenSeed-HMM | | 4 | 00:07:46 | 00:16:57 | 1,425,368 |
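The wall and CPU times in the table above can be compared directly once the hh:mm:ss strings are converted to seconds. The following is a minimal sketch, not part of the original analysis: the helper names `hms_to_seconds` and `cpu_to_wall_ratio` are hypothetical, and the ratio is only a rough indicator of how many cores a stage kept busy.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'hh:mm:ss' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_to_wall_ratio(wall: str, cpu: str) -> float:
    """CPU seconds consumed per second of wall-clock time."""
    return hms_to_seconds(cpu) / hms_to_seconds(wall)


# (threads, wall time, CPU time) copied from the table for three stages.
runs = {
    "Xander Build": (1, "00:03:52", "00:03:57"),
    "MegaGTA Main": (8, "00:10:06", "01:15:02"),
    "MEGAN DIAMOND": (8, "14:38:57", "95:11:48"),
}

for name, (threads, wall, cpu) in runs.items():
    ratio = cpu_to_wall_ratio(wall, cpu)
    print(f"{name}: {ratio:.2f} CPU s per wall s ({threads} threads)")
```

A ratio close to the thread count suggests the stage parallelizes well; a ratio near 1 suggests it is effectively single-threaded.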
BLAST summary for rplB assembled from the synthetic data. There were 17 rplB sequences in the synthetic data. Entries in the % ID columns give the number of taxa matched over the number of contigs that match rplB by BLAST identity at the specified percentage.
| Method | Contigs | Length | Non-target | <97% | 97% | 98% | 99% | 100% |
|---|---|---|---|---|---|---|---|---|
| Xander | 28 | 807–828 | 0 | 1 | 17/27 | 15/23 | 12/16 | 12/12 |
| MegaGTA | 28 | 807–828 | 0 | 1 | 17/27 | 15/23 | 12/16 | 12/12 |
| HMM-GRASPx | 63 | 102–261 | 0 | 3 | 16/60 | 16/60 | 16/59 | 16/59 |
| HMM-GRASPx | 0 | ≥450 | – | – | – | – | – | – |
| MEGAN1 | 55 | 204–3,822 | 32 | 0 | 16/23 | 16/23 | 16/23 | 16/23 |
| MEGAN2 | 20 | 453–3,822 | 11 | 0 | 9/9 | 9/9 | 9/9 | 9/9 |
| SAT-Assembler3 | 176 | 150–997 | 49 | 60 | 17/67 | 17/50 | 16/28 | 16/23 |
| SAT-Assembler4 | 106 | 465–997 | 0 | 58 | 16/48 | 15/33 | 13/14 | 11/11 |
| GenSeed-HMM5 | 97 | 32–1,340 | 4 | 0 | 17/93 | 17/93 | 17/93 | 17/93 |
| GenSeed-HMM6 | 9 | 724–1,340 | 1 | 0 | 8/8 | 8/8 | 8/8 | 8/8 |
MEGAN1: all contigs assembled. MEGAN2: contigs filtered to a minimum length of 450 bp. SAT-Assembler3: all contigs assembled with an overlap length of 40 bp. SAT-Assembler4: contigs were de-replicated, duplicates removed, and filtered to a minimum length of 450 bp. GenSeed-HMM5: all contigs assembled; GenSeed-HMM6: contigs were filtered to a minimum length of 450 bp.
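Each % ID cell above has the form "taxa matched/contigs matching". As an illustration only, the sketch below parses such a cell and derives two summary numbers: the fraction of the 17 known rplB sequences recovered and the number of contigs per recovered taxon. The function names are hypothetical, and these summaries are not necessarily the sensitivity and specificity measures used in the paper.

```python
def parse_cell(cell: str) -> tuple[int, int]:
    """Split a 'taxa/contigs' table cell into two integers."""
    taxa, contigs = cell.split("/")
    return int(taxa.replace(",", "")), int(contigs.replace(",", ""))


def fraction_taxa_recovered(cell: str, n_known: int) -> float:
    """Share of the known taxa matched by at least one contig."""
    taxa, _ = parse_cell(cell)
    return taxa / n_known


def contigs_per_taxon(cell: str) -> float:
    """Average number of matching contigs per matched taxon."""
    taxa, contigs = parse_cell(cell)
    if taxa == 0:
        raise ValueError("no taxa matched; ratio is undefined")
    return contigs / taxa


# Xander row of the synthetic-data table, at 97% and 100% identity.
N_KNOWN = 17  # rplB sequences in the synthetic data (from the caption)
for label, cell in (("97%", "17/27"), ("100%", "12/12")):
    print(label,
          round(fraction_taxa_recovered(cell, N_KNOWN), 3),
          round(contigs_per_taxon(cell), 3))
```

A contigs-per-taxon value well above 1 means several contigs map to the same reference sequence, which is one way redundancy or fragmentation shows up in these tables.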
BLAST summary for rplB contigs assembled from the mock data. There were 48 bacterial rplB sequences in the mock data set. Entries in the % ID columns give the number of taxa matched over the number of contigs that match rplB by BLAST identity at the specified percentage.
| Method | Contigs | Length | Non-target | <97% | 97% | 98% | 99% | 100% |
|---|---|---|---|---|---|---|---|---|
| Xander | 95 | 459–849 | 2 | 5 | 44/88 | 43/85 | 40/80 | 30/30 |
| MegaGTA | 94 | 453–849 | 2 | 6 | 46/86 | 44/83 | 42/80 | 32/32 |
| MEGAN1 | 93 | 201–1,611 | 45 | 1 | 39/47 | 39/47 | 38/46 | 35/39 |
| MEGAN2 | 50 | 450–1,611 | 16 | 1 | 33/33 | 33/33 | 32/32 | 28/28 |
| SAT-Assembler3 | 2,765 | 50–750 | 751 | 107 | 48/1,907 | 48/1,865 | 48/1,689 | 47/1,318 |
| SAT-Assembler4 | 61 | 458–750 | 1 | 18 | 29/42 | 27/37 | 25/31 | 13/13 |
| GenSeed-HMM5 | 408 | 31–1,360 | 60 | 7/9 | 47/339 | 47/330 | 46/187 | 43/183 |
| GenSeed-HMM6 | 44 | 450–1,360 | 11 | 1/1 | 28/32 | 28/32 | 27/31 | 23/27 |
MEGAN1: all MEGAN contigs assembled from reads mapping to IPR005880 using default parameters. MEGAN2: MEGAN contigs filtered to a minimum length of 450 bp. SAT-Assembler3: all SAT-Assembler rplB contigs assembled from the mock data with an overlap length of 40 bp; note that the minimum length is one-half of the read length. SAT-Assembler4: contigs assembled with an overlap length of 40 bp, de-replicated, duplicates removed, and filtered to a minimum length of 450 bp. HMM-GRASPx failed to complete with this data set. GenSeed-HMM5: all contigs assembled; GenSeed-HMM6: contigs were filtered to a minimum length of 450 bp.
BLAST summary for bacterial rplB contigs assembled from C1-50M aligned against NCBI-nr. Entries in the % ID columns give the number of taxa matched over the number of contigs that match rplB by BLAST identity at the specified percentage.
| Method | Contigs | Length | Non-target | <97% | 97% | 98% | 99% | 100% |
|---|---|---|---|---|---|---|---|---|
| Xander | 269 | 453–825 | 0 | 56/250 | 11/19 | 8/16 | 4/8 | 3/3 |
| MegaGTA | 316 | 450–825 | 0 | 82/290 | 13/26 | 12/19 | 8/11 | 4/4 |
| MEGAN1 | 30 | 207–705 | 11 | 2/2 | 14/17 | 11/14 | 9/12 | 9/12 |
| MEGAN2 | 3 | 462–705 | 1 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 |
| SAT-Assembler3 | 705 | 51–436 | 9 | 125/207 | 179/469 | 154/381 | 132/316 | 131/312 |
| SAT-Assembler4 | 0 | – | – | – | – | – | – | – |
| GenSeed-HMM5 | 4,340 | 31–1,058 | 3,109 | 334/596 | 311/635 | 284/562 | 277/535 | 273/535 |
| GenSeed-HMM6 | 4 | 458–1,058 | 0 | 2/2 | 2/2 | 2/2 | 1/1 | 1/1 |
MEGAN1: all contigs assembled. MEGAN2: contigs filtered to a minimum length of 450 bp. SAT-Assembler3: contigs assembled with an overlap length of 40 bp and de-replicated. SAT-Assembler4: contigs assembled with an overlap length of 40 bp were de-replicated and filtered to a minimum length of 450 bp. GenSeed-HMM5: all contigs assembled; GenSeed-HMM6: contigs were filtered to a minimum length of 450 bp.
Figure 1. Relationship between the probability that a target gene from a species is assembled and the relative abundance of that species at different sequencing depths. The x-axis is on a log10 scale, the target gene length is set to 800 bp, and the minimum contig length is set to 550 bp.
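The paper's mathematical model is not reproduced on this page. As a rough stand-in, the Monte Carlo sketch below estimates the probability that reads cover a contiguous stretch of at least 550 bp within an 800 bp gene, the two values given in the caption. Everything else is an assumption made for illustration: read starts follow a Poisson process, "assembled" means every base in the stretch is covered by at least one read (no minimum overlap or k-mer requirement), the read length is 100 bp, and relative abundance is taken as the fraction of sequenced bases from a species with a hypothetical 5 Mbp genome. It should not be read as the authors' model.

```python
import math
import random


def longest_covered_run(starts, read_len, gene_len):
    """Longest stretch of the gene covered by at least one read.

    `starts` must be in ascending order; reads may overhang either end.
    """
    best = 0
    cur_start = cur_end = None
    for s in starts:
        lo, hi = max(s, 0), min(s + read_len, gene_len)
        if hi <= lo:
            continue
        if cur_end is not None and lo <= cur_end:
            cur_end = max(cur_end, hi)   # read extends the current block
        else:
            cur_start, cur_end = lo, hi  # coverage gap: start a new block
        best = max(best, cur_end - cur_start)
    return best


def prob_assembled(coverage, gene_len=800, min_contig=550,
                   read_len=100, trials=500, seed=0):
    """Estimate P(longest covered run >= min_contig) by simulation."""
    if coverage <= 0:
        return 0.0
    rng = random.Random(seed)
    rate = coverage / read_len  # expected read starts per base
    hits = 0
    for _ in range(trials):
        starts = []
        pos = -(read_len - 1)  # earliest start that still overlaps the gene
        while True:
            pos += rng.expovariate(rate)  # Poisson-process spacing
            if pos >= gene_len:
                break
            starts.append(math.floor(pos))
        if longest_covered_run(starts, read_len, gene_len) >= min_contig:
            hits += 1
    return hits / trials


def gene_coverage(total_bases, rel_abundance, genome_size):
    """Expected fold coverage of a single-copy gene (assumed model)."""
    return total_bases * rel_abundance / genome_size


# Hypothetical depth (5 Gbp) and genome size (5 Mbp), for illustration only.
for abundance in (1e-4, 1e-3, 1e-2):
    cov = gene_coverage(5e9, abundance, 5e6)
    print(f"abundance={abundance:g} coverage={cov:g} "
          f"P(assembled)~{prob_assembled(cov):.2f}")
```

Under these assumptions the estimate rises steeply with coverage, which is the qualitative shape a plot of assembly probability against log-scaled abundance would be expected to show.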
Figure 2. Effect of sequencing depth on the fold coverage of assembled rplB or rpsC. The x-axis is the number of even subsamples into which C1 is divided. The y-axis is the rplB or rpsC fold coverage of a subsample divided by the fold coverage expected if coverage decreased linearly with sequencing depth (the fold coverage of the original sample divided by the number of even subsamples).
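The y-axis quantity described in the caption is a simple ratio, sketched below. The function name and the example numbers are hypothetical and are not taken from the paper's data.

```python
def relative_fold_coverage(subsample_cov: float,
                           full_cov: float,
                           n_subsamples: int) -> float:
    """Observed subsample coverage over the linearly expected coverage.

    The expected value is the fold coverage of the full sample divided
    by the number of even subsamples.
    """
    expected = full_cov / n_subsamples
    return subsample_cov / expected


# Hypothetical values: full sample at 40x, split into 4 even subsamples,
# with one subsample assembling the gene at 8x.
print(relative_fold_coverage(8.0, 40.0, 4))
```

A value of 1 means coverage scaled linearly with depth; values below 1 mean the subsample recovered less coverage than a linear scaling would predict.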