Kengo Sato, Yasubumi Sakakibara.
Abstract
The assembly of multiple genomes from mixed sequence reads is a bottleneck in metagenomic analysis. A single-genome assembly program (assembler) is not capable of resolving metagenome sequences, so assemblers designed specifically for metagenomics have been developed. MetaVelvet is an extension of the single-genome assembler Velvet. When applied to metagenomic sequence reads, it has been shown to generate assemblies with higher N50 scores and higher quality than single-genome assemblers such as Velvet and SOAPdenovo, and it is frequently used in this research community. One important open problem for MetaVelvet is its low accuracy and sensitivity in detecting chimeric nodes in the assembly (de Bruijn) graph, which prevents the generation of longer contigs and scaffolds. We tackled this problem of classifying chimeric nodes with supervised machine learning to significantly improve the performance of MetaVelvet, and developed a new tool called MetaVelvet-SL. A Support Vector Machine is used to learn the classification model from 94 features extracted from candidate nodes. In extensive experiments, MetaVelvet-SL outperformed the original MetaVelvet and other state-of-the-art metagenomic assemblers (IDBA-UD, Ray Meta and Omega), reconstructing accurate, longer assemblies with higher N50 scores for both simulated data sets and real data sets of human gut microbial sequences.
Keywords: de novo assembler; metagenomic; microbial community; short read; supervised learning
Year: 2014 PMID: 25431440 PMCID: PMC4379979 DOI: 10.1093/dnares/dsu041
Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458
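The abstract compares assemblers by their N50 scores. As a reminder of what that statistic measures, the following is a minimal sketch of the standard N50 definition (the length L such that sequences of length at least L cover half of the total assembled length). The function name `n50` and the example lengths are hypothetical and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembled length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input: no sequences, so no N50


# Hypothetical scaffold lengths in bp (illustrative only).
example_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(example_lengths))
```

A larger N50 means that more of the assembly sits in long sequences, which is why the statistic is used here to compare assemblers.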
Figure 1. Chimeric nodes need to be split to obtain independent sub-graphs in a metagenomic assembly.
Figure 2. MetaVelvet-SL system consists of three major procedures: (i) construction of a de Bruijn graph; (ii) classification of chimeric nodes and (iii) final assembly tasks.
Figure 3. Chimeric nodes fall into two classes. Nodes of the same colour represent the same species. The number in each node represents the coverage value of the node. A contig sequence is also attached to each node.
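To make the figure captions concrete, here is a toy sketch of a textbook de Bruijn graph built from reads, together with a helper that lists nodes with more than one predecessor and more than one successor. This is an assumption-laden illustration: it is not Velvet's or MetaVelvet-SL's data structure, it ignores reverse complements and sequencing errors, and the branching test is only a stand-in for the paper's candidate selection, whose exact criteria and 94 features are not reproduced here. The names `build_de_bruijn` and `branching_nodes` and the reads are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph.

    Nodes are (k-1)-mers. Each k-mer in a read adds one edge from its
    prefix to its suffix; the edge count acts as a crude coverage value.
    """
    edges = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return dict(edges)


def branching_nodes(edges):
    """Return nodes with more than one predecessor and more than one successor."""
    incoming, outgoing = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        outgoing[src].add(dst)
        incoming[dst].add(src)
    return sorted(
        node for node in incoming
        if len(incoming[node]) > 1 and len(outgoing[node]) > 1
    )


# Two hypothetical "genomes" that share only the 3-mer GAT; the first
# one is sampled twice, so its edges get a higher count.
reads = ["ACGATCC", "ACGATCC", "TTGATGG"]
graph = build_de_bruijn(reads, k=4)
print(branching_nodes(graph))
```

In this toy input the shared node is the only place where paths from the two sequences meet, which is the situation Figure 1 describes: splitting such a node separates the graph into one sub-graph per source sequence.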
Statistics of assembly results for simulated data sets
| | MetaVelvet-SL (+MetaPhlAn) | MetaVelvet | IDBA-UD | SOAPdenovo2 | Ray Meta | Omega |
|---|---|---|---|---|---|---|
| Order | ||||||
| Nm50 (bp) | 222,972 | 243,336 | 10,529 | 154,899 | 29,668 | |
| Maximum length (bp) | 1,312,990 | 1,700,200 | 141,628 | 927,783 | 405,783 | |
| Total scaffold length (bp) | 69,383,955 | 70,711,008 | 70,743,471 | 72,886,145 | 72,067,809 | 71,331,763 |
| Number of scaffolds | 4,755 | 2,889 | 1,395 | 37,391 | 1,184 | 17,075 |
| Required CPU time (s) | 26,735 | 12,606 | 283,378 | 29,980 | 401,873 | 59,275 |
| Family | ||||||
| Nm50 (bp) | 227,243 | 251,915 | 6,751 | 167,523 | 42,500 | |
| Maximum length (bp) | 1,570,565 | 1,247,435 | 167,532 | 1,181,121 | 521,402 | |
| Total scaffold length (bp) | 83,890,270 | 76,369,071 | 81,721,590 | 86,524,823 | 84,549,043 | 84,756,027 |
| Number of scaffolds | 8,809 | 5,456 | 1,884 | 58,952 | 1,655 | 18,704 |
| Required CPU time (s) | 35,685 | 14,855 | 379,757 | 17,155 | 544,821 | 27,233 |
| Genus | ||||||
| Nm50 (bp) | 100,132 | 121,196 | 4,642 | 91,637 | 16,533 | |
| Maximum length (bp) | 2,099,603 | 1,246,124 | 85,991 | 1,212,747 | 212,138 | |
| Total scaffold length (bp) | 83,281,358 | 84,636,187 | 79,218,358 | 81,965,701 | 83,171,453 | 73,537,116 |
| Number of scaffolds | 10,555 | 14,802 | 10,362 | 97,463 | 6,822 | 24,244 |
| Required CPU time (s) | 188,170 | 19,514 | 306,073 | 35,648 | 1,259,371 | 97,573 |
| Species | ||||||
| Nm50 (bp) | 91,159 | 74,670 | 4,469 | 80,592 | 13,053 | |
| Maximum length (bp) | 1,878,401 | 2,107,202 | 103,314 | 702,714 | 193,065 | |
| Total scaffold length (bp) | 81,524,460 | 82,381,332 | 65,980,631 | 85,892,445 | 74,075,828 | 67,422,938 |
| Number of scaffolds | 22,440 | 29,472 | 18,864 | 132,284 | 17,077 | 24,096 |
| Required CPU time (s) | 195,454 | 17,610 | 353,353 | 20,082 | 352,521 | 208,417 |
All computations were executed using Intel(R) Xeon(R) E5540 processors (2.53 GHz), with 96-GB physical memory, except for a few cases. Top performances are shown in bold.
The number of chimeric node candidates in de Bruijn graph constructed from each assembly data set
| Data set | Class 1 (positive) | Class 2 (positive) | Class 3 (negative) | Total no. of chimeric node candidates |
|---|---|---|---|---|
| Order | 82 | 0 | 1,515 | 1,597 |
| Family | 146 | 1 | 2,456 | 2,603 |
| Genus | 2,918 | 731 | 8,505 | 12,154 |
| Species | 3,074 | 246 | 14,589 | 17,909 |
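The counts in this table can be checked and summarised with a few lines of arithmetic. The sketch below copies the four rows from the table, verifies that the three classes add up to the stated total, and reports the share of positive candidates (Class 1 plus Class 2, following the table header). The names `CANDIDATES` and `positive_fraction` are hypothetical.

```python
# Counts copied from the table above: (Class 1, Class 2, Class 3, Total).
CANDIDATES = {
    "Order": (82, 0, 1515, 1597),
    "Family": (146, 1, 2456, 2603),
    "Genus": (2918, 731, 8505, 12154),
    "Species": (3074, 246, 14589, 17909),
}


def positive_fraction(class1, class2, class3):
    """Share of candidates in the positive classes (Class 1 and Class 2)."""
    positives = class1 + class2
    return positives / (positives + class3)


for level, (c1, c2, c3, total) in CANDIDATES.items():
    # Each row should be internally consistent with its stated total.
    assert c1 + c2 + c3 == total
    print(f"{level}: {positive_fraction(c1, c2, c3):.1%} positive")
```

Positive candidates make up between about 5% and 30% of each data set, so the classes are imbalanced in every case.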
Statistics of assembly results of MetaVelvet-SL using different training data sets
| | MetaVelvet-SL (+MetaPhlAn) | MetaVelvet-SL (genus-level training data set) | MetaVelvet-SL (family-level training data set) | MetaVelvet-SL (order-level training data set) |
|---|---|---|---|---|
| Order | ||||
| Nm50 (bp) | 695,261 | 672,952 | 686,074 | 695,557 |
| Maximum length (bp) | 3,546,677 | 3,415,875 | 3,547,025 | 3,818,061 |
| Total scaffold length (bp) | 69,383,955 | 69,881,185 | 69,288,924 | 69,387,743 |
| Number of scaffolds | 4,755 | 4,379 | 4,747 | 4,829 |
| Required CPU time (s) | 26,735 | 26,773 | 26,612 | 26,660 |
| Family | ||||
| Nm50 (bp) | 375,942 | 377,604 | 384,795 | 384,795 |
| Maximum length (bp) | 1,875,576 | 2,326,125 | 2,326,197 | 1,927,551 |
| Total scaffold length (bp) | 83,890,270 | 83,888,560 | 83,877,454 | 83,877,321 |
| Number of scaffolds | 8,809 | 8,777 | 8,687 | 8,679 |
| Required CPU time (s) | 35,685 | 35,477 | 35,577 | 35,469 |
| Genus | ||||
| Nm50 (bp) | 226,033 | 233,924 | 266,018 | 257,292 |
| Maximum length (bp) | 2,259,591 | 2,888,749 | 2,974,950 | 2,843,963 |
| Total scaffold length (bp) | 83,281,358 | 82,525,756 | 83,385,233 | 83,634,261 |
| Number of scaffolds | 10,555 | 11,450 | 9,376 | 8,512 |
| Required CPU time (s) | 188,170 | 187,963 | 188,285 | 188,325 |
| Species | ||||
| Nm50 (bp) | 174,495 | 166,528 | 158,509 | 167,722 |
| Maximum length (bp) | 3,808,921 | 3,292,179 | 3,292,250 | 3,292,179 |
| Total scaffold length (bp) | 81,524,460 | 81,141,481 | 81,447,218 | 81,414,639 |
| Number of scaffolds | 22,440 | 19,114 | 21,073 | 22,735 |
| Required CPU time (s) | 195,454 | 195,512 | 195,534 | 195,460 |
Classification results for chimeric nodes
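| | MetaVelvet-SL (Training: Genus): Sen | MetaVelvet-SL (Training: Genus): Acc | MetaVelvet-SL (Training: Genus): BA | MetaVelvet-SL (Training: Family): Sen | MetaVelvet-SL (Training: Family): Acc | MetaVelvet-SL (Training: Family): BA | MetaVelvet-SL (Training: Order): Sen | MetaVelvet-SL (Training: Order): Acc | MetaVelvet-SL (Training: Order): BA | MetaVelvet-SL (+MetaPhlAn): Sen | MetaVelvet-SL (+MetaPhlAn): Acc | MetaVelvet-SL (+MetaPhlAn): BA | MetaVelvet: Sen | MetaVelvet: Acc | MetaVelvet: BA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|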
| Order | 84.15 | 94.49 | 89.60 | 54.88 | 96.12 | 76.61 | 57.32 | 95.49 | 77.44 | 42.68 | 96.62 | 71.11 | 42.68 | 92.42 | 67.74 |
| Family | 80.27 | 97.54 | 89.42 | 63.27 | 97.08 | 81.80 | 55.10 | 96.85 | 77.23 | 65.99 | 97.35 | 82.61 | 44.90 | 93.32 | 69.21 |
| Genus | 62.76 | 76.92 | 72.88 | 50.07 | 65.03 | 60.76 | 54.29 | 67.88 | 64.00 | 58.98 | 81.90 | 75.35 | 33.13 | 62.37 | 46.91 |
| Species | 52.80 | 54.84 | 54.05 | 36.83 | 70.60 | 57.32 | 33.10 | 82.19 | 63.23 | 40.24 | 83.15 | 66.58 | 15.21 | 71.61 | 48.09 |
Sen (%) means the percentage of sensitivity; Acc (%) means the percentage of accuracy and BA (%) means the percentage of balanced accuracy.
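For readers who want the three measures spelled out, the sketch below computes them from confusion-matrix counts using the standard definitions: sensitivity is TP / (TP + FN), accuracy is (TP + TN) divided by all cases, and balanced accuracy is the mean of sensitivity and specificity. It is assumed, not stated in this record, that the paper uses these standard definitions. The function name and the example counts are hypothetical; the counts were chosen only so that the class sizes match the 82 positive and 1,515 negative order-level candidates, and are not reported in the source.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return sensitivity, accuracy and balanced accuracy as percentages."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    return {
        "Sen": round(100 * sensitivity, 2),
        "Acc": round(100 * accuracy, 2),
        "BA": round(100 * balanced_accuracy, 2),
    }


# Hypothetical confusion counts (82 positives, 1,515 negatives in total).
print(classification_metrics(tp=69, fn=13, tn=1440, fp=75))
```

With these illustrative counts the function returns Sen 84.15, Acc 94.49 and BA 89.6. Because negatives far outnumber positives, accuracy alone can stay high even when sensitivity is low, which is why balanced accuracy is worth reading alongside it.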
Assembly results for the real human gut microbial data sets
| | MetaVelvet-SL (+MetaPhlAn) | MetaVelvet | IDBA-UD | SOAPdenovo2 | Ray Meta | Omega |
|---|---|---|---|---|---|---|
| MH0006 (ERS006497) | ||||||
| Maximum length (bp) | 82,400 | 424,786 | 248,752 | 245,285 | 293,858 | |
| Total scaffold length (bp) | 228,356,028 | 293,629,444 | 314,842,356 | 211,199,449 | 134,644,249 | |
| Number of scaffolds | 927,151 | 387,193 | 197,401 | 521,577 | 609,062 | 150,907 |
| AUC of N-len(x) | 909,250 | 6,002,739 | 3,042,215 | 2,260,838 | 2,527,198 | |
| MH0012 (ERS006494) | ||||||
| Maximum length (bp) | 119,936 | 594,225 | 792,429 | 512,973 | 1,144,479 | |
| Total scaffold length (bp) | 255,566,175 | 290,340,811 | 325,057,612 | 272,663,103 | 170,102,775 | |
| Number of scaffolds | 718,438 | 327,103 | 198,771 | 482,983 | 635,814 | 125,383 |
| AUC of N-len(x) | 2,129,027 | 10,344,620 | 8,856,698 | 6,977,480 | 10,229,304 | |
| MH0047 (ERS006592) | ||||||
| Maximum length (bp) | 69,475 | 185,593 | 44,319 | 137,473 | 52,084 | |
| Total scaffold length (bp) | 75,290,864 | 75,032,143 | 88,092,865 | 50,174,724 | 29,134,928 | |
| Number of scaffolds | 374,148 | 210,477 | 89,786 | 263,713 | 141,466 | 31,961 |
| AUC of N-len(x) | 237,568 | 802,594 | 201,366 | 544,742 | 208,223 | |
| SRS017227 | ||||||
| Maximum length (bp) | 108,476 | 372,927 | 227,256 | 199,208 | 217,259 | |
| Total scaffold length (bp) | 370,496,571 | 250,969,598 | 349,934,212 | 273,595,801 | 206,705,202 | |
| Number of scaffolds | 602,463 | 485,307 | 282,097 | 802,952 | 536,708 | 217,259 |
| AUC of N-len(x) | 1,064,102 | 4,010,530 | 2,227,194 | 2,501,039 | 1,617,896 | |
| SRS018661 | ||||||
| Maximum length (bp) | 111,404 | 511,735 | 426,297 | 274,042 | 180,946 | |
| Total scaffold length (bp) | 71,339,406 | 109,507,232 | 107,557,997 | 75,351,327 | 47,619,933 | |
| Number of scaffolds | 284,036 | 195,950 | 109,860 | 274,896 | 212,267 | 34,244 |
| AUC of N-len(x) | 253,138 | 1,296,606 | 848,798 | 1,004,371 | 611,636 | |
Top performances are shown in bold. MetaVelvet-SL, MetaVelvet and SOAPdenovo2 set the k-mer size at 37 for the MH0006 and MH0047 data sets, 43 for the MH0012 data set and 51 for SRS017227 and SRS018661.
Figure 4. The N-len(x) plots for the MH0006 data set of human gut microbial data.
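This record does not define N-len(x). The sketch below assumes the definition commonly given for it as a generalisation of N50: sort scaffolds from longest to shortest, add their lengths until the running sum exceeds x, and report the length of the scaffold reached at that point. Under that assumption the curve is a step function, and its unscaled area is the sum of squared scaffold lengths. The scaling behind the "AUC of N-len(x)" values in the table is not given here, so the numbers from this sketch are not expected to match them. The function names and example lengths are hypothetical.

```python
def n_len(lengths, x):
    """Assumed N-len(x): length of the scaffold at which the running sum
    of lengths (longest first) first exceeds x."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running > x:
            return length
    return 0  # x is at or beyond the total assembled length


def n_len_area(lengths):
    """Unscaled area under the assumed N-len(x) step curve.

    Each scaffold of length L contributes a step of height L and width L.
    """
    return sum(length * length for length in lengths)


# Hypothetical scaffold lengths in bp (illustrative only).
scaffolds = [50, 30, 20]
print([n_len(scaffolds, x) for x in (0, 49, 50, 79, 80, 100)])
print(n_len_area(scaffolds))
```

Summarising the whole curve by its area rewards assemblies that keep long scaffolds across the full range of x, rather than at the single midpoint that N50 inspects.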
The number of species in the taxonomic profile predicted by MetaPhlAn and the taxonomic profile based on assembly results of MetaVelvet-SL using BLAST
| | Number of species predicted by both | Number of species predicted only by MetaPhlAn | Number of species predicted only by assembly |
|---|---|---|---|
| MH0006 (ERS006497) | 99 | 5 | 2,932 |
| MH0012 (ERS006494) | 124 | 9 | 2,872 |
| MH0047 (ERS006592) | 65 | 2 | 2,137 |
| SRS017227 | 83 | 3 | 2,992 |
| SRS018661 | 81 | 8 | 1,529 |
The first column represents the number of species predicted by MetaPhlAn and predicted from assembly results by BLAST (intersection). The second column represents the number of species only predicted by MetaPhlAn and not predicted from assembly results by BLAST. The third column represents the number of species only predicted from assembly results by BLAST and not predicted by MetaPhlAn.
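The three columns described above correspond to a set intersection and two set differences. The following sketch shows that bookkeeping on made-up inputs; the function name `compare_profiles` and the placeholder species labels are hypothetical, and the sketch does not reproduce how MetaPhlAn or BLAST hits were turned into species lists in the paper.

```python
def compare_profiles(metaphlan_species, assembly_species):
    """Count species shared by, and unique to, two taxonomic profiles."""
    metaphlan = set(metaphlan_species)
    assembly = set(assembly_species)
    return {
        "both": len(metaphlan & assembly),
        "only_metaphlan": len(metaphlan - assembly),
        "only_assembly": len(assembly - metaphlan),
    }


# Placeholder species labels (illustrative only).
predicted_by_metaphlan = ["species_A", "species_B", "species_C"]
predicted_from_assembly = ["species_B", "species_C", "species_D", "species_E"]
print(compare_profiles(predicted_by_metaphlan, predicted_from_assembly))
```

Using sets also removes duplicate labels, so a species reported by several hits in one profile is counted once.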