Abstract
Most current approaches to metagenomic classification employ short next-generation sequencing (NGS) reads that are present in metagenomic samples to identify unique genomic regions. NGS reads, however, might not be long enough to differentiate similar genomes. This suggests a potential for using longer reads to improve classification performance. Presently, longer reads tend to have a higher rate of sequencing errors. Thus, given the pros and cons, it remains unclear which type of read is better for metagenomic classification. We compared two taxonomic classification protocols: a traditional assembly-free protocol and a novel assembly-based protocol. The novel assembly-based protocol consists of assembling short reads into longer reads, which are subsequently classified by a traditional taxonomic classifier. We discovered that most classifiers made fewer predictions with longer reads and that they achieved higher classification performance on synthetic metagenomic data. Generally, we observed a significant increase in precision with similar recall rates. On real data, we observed similar characteristics, which suggests that the classifiers might likewise achieve higher precision with similar recall on longer reads. We have shown a noticeable difference in performance between assembly-based and assembly-free taxonomic classification. This finding strongly suggests that classifying species in metagenomic environments can be achieved with higher overall performance simply by assembling short reads. Further, it also suggests that long-read technologies might be better for species classification.
Keywords: metagenomic assembly; metagenomic classification; short-read sequencing
Year: 2020 PMID: 32824429 PMCID: PMC7465921 DOI: 10.3390/genes11080946
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Workflow of metagenomic classification: (A) original workflow, which uses short reads; (B) modified workflow, which uses assembled reads. The metagenomic classifiers are Kaiju, CLARK, Kraken, MetaCache, MetaPhlAn2, DUDes, and GOTTCHA. The metagenomic assemblers are MEGAHIT, metaSPAdes, and Ray.
Precision (Pre), recall (Rec), and F1 score of species-level classification for seven metagenomic classifiers on three synthetic short-read datasets, each either left unassembled (n/a) or assembled by one of three assemblers: MEGAHIT (MH), metaSPAdes (MS), and Ray.
| Dataset | Assembler | Kaiju Pre | Kaiju Rec | Kaiju F1 | CLARK Pre | CLARK Rec | CLARK F1 | Kraken Pre | Kraken Rec | Kraken F1 | MetaCache Pre | MetaCache Rec | MetaCache F1 | MetaPhlAn2 Pre | MetaPhlAn2 Rec | MetaPhlAn2 F1 | DUDes Pre | DUDes Rec | DUDes F1 | GOTTCHA Pre | GOTTCHA Rec | GOTTCHA F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 s | n/a | 0.02 | 1.0 | 0.04 | 0.02 | 1.0 | 0.05 | 0.03 | 1.0 | 0.06 | 0.20 | 1.0 | 0.33 | 1.0 | 1.0 | 1.0 | 1.0 | 0.90 | 0.94 | 1.0 | 1.0 | 1.0 |
| 10 s | MH | 0.50 | 0.90 | 0.64 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.90 | 1.0 | 0.95 | 1.0 | 0.40 | 0.57 | 0.90 | 1.0 | 0.95 | 1.0 | 1.0 | 1.0 |
| 10 s | MS | 0.50 | 0.90 | 0.64 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.66 | 1.0 | 0.80 | 1.0 | 0.20 | 0.33 | 0.76 | 1.0 | 0.87 | 1.0 | 1.0 | 1.0 |
| 10 s | Ray | 0.39 | 0.90 | 0.54 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.83 | 1.0 | 0.91 | 1.0 | 0.70 | 0.82 | 0.83 | 1.0 | 0.90 | 1.0 | 1.0 | 1.0 |
| 100 s | n/a | 0.18 | 0.87 | 0.29 | 0.21 | 0.98 | 0.35 | 0.21 | 0.84 | 0.34 | 0.47 | 0.97 | 0.63 | 0.92 | 0.87 | 0.89 | 0.98 | 0.84 | 0.91 | 0.97 | 0.89 | 0.93 |
| 100 s | MH | 0.35 | 0.87 | 0.50 | 0.88 | 0.99 | 0.93 | 0.67 | 0.86 | 0.75 | 0.78 | 0.99 | 0.87 | 0.93 | 0.79 | 0.85 | 0.99 | 0.82 | 0.90 | 0.97 | 0.89 | 0.93 |
| 100 s | MS | 0.35 | 0.87 | 0.50 | 0.69 | 0.99 | 0.81 | 0.63 | 0.86 | 0.73 | 0.73 | 0.99 | 0.84 | 0.93 | 0.80 | 0.86 | 0.99 | 0.84 | 0.90 | 0.97 | 0.89 | 0.93 |
| 100 s | Ray | 0.25 | 0.87 | 0.38 | 0.98 | 0.99 | 0.98 | 0.75 | 0.86 | 0.80 | 0.83 | 0.99 | 0.90 | 0.94 | 0.86 | 0.90 | 0.97 | 0.85 | 0.91 | 0.97 | 0.89 | 0.93 |
| 400 s | n/a | 0.84 | 0.88 | 0.86 | 0.95 | 0.99 | 0.97 | 0.95 | 0.83 | 0.88 | 0.91 | 0.97 | 0.94 | 0.97 | 0.88 | 0.93 | 0.98 | 0.69 | 0.81 | 0.99 | 0.88 | 0.93 |
| 400 s | MH | 0.88 | 0.88 | 0.88 | 0.99 | 0.99 | 0.99 | 0.98 | 0.84 | 0.90 | 0.97 | 0.99 | 0.98 | 0.98 | 0.83 | 0.90 | 0.99 | 0.70 | 0.82 | 0.99 | 0.89 | 0.94 |
| 400 s | MS | 0.87 | 0.88 | 0.87 | 0.98 | 0.99 | 0.99 | 0.98 | 0.84 | 0.90 | 0.96 | 0.99 | 0.98 | 0.98 | 0.84 | 0.91 | 0.99 | 0.68 | 0.81 | 0.99 | 0.89 | 0.94 |
| 400 s | Ray | 0.95 | 0.85 | 0.90 | 0.99 | 0.99 | 0.99 | 0.99 | 0.84 | 0.91 | 0.99 | 0.98 | 0.99 | 0.90 | 0.02 | 0.04 | 0.98 | 0.46 | 0.63 | 1.0 | 0.22 | 0.36 |
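For readers who want to see how the three reported metrics relate, the following is a minimal sketch of set-based precision, recall, and F1 at the species level. It assumes the metrics are computed by comparing the set of predicted species against the set of species truly present; the record does not state the exact procedure (for example, any abundance threshold), so the function name `precision_recall_f1` and the toy species labels are illustrative only.

```python
def precision_recall_f1(predicted, truth):
    """Set-based precision, recall, and F1 for species-level predictions.

    Assumption: a prediction counts as a true positive when the predicted
    species is in the ground-truth set. This is a generic definition, not
    necessarily the exact procedure used for the table above.
    """
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example (hypothetical labels): 10 true species, 50 predicted species,
# all 10 true species recovered plus 40 false positives.
truth = {f"species_{i}" for i in range(10)}
predicted = truth | {f"other_{i}" for i in range(40)}

p, r, f1 = precision_recall_f1(predicted, truth)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

The toy case shows why a classifier that reports many extra species can keep perfect recall while its precision, and therefore its F1, drops sharply.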
Assembly statistics for all assemblers on simulated (10 s, 100 s, 400 s) and real (ERR2017411, ERR2017412) data. Contig lengths and N50 values are in base pairs.
| Statistics | Dataset | MEGAHIT | metaSPAdes | Ray |
|---|---|---|---|---|
| number of contigs | 10 s | 1069 | 1156 | 3256 |
| largest contig | 10 s | 835,563 | 1,436,250 | 294,361 |
| avg contig | 10 s | 31,529.53 | 29,211.97 | 10,307.24 |
| n50 | 10 s | 131,416 | 234,206 | 31,735 |
| number of contigs | 100 s | 156,074 | 210,765 | 717,512 |
| largest contig | 100 s | 573,139 | 190,202 | 14,995 |
| avg contig | 100 s | 1936.78 | 1448.98 | 189.24 |
| n50 | 100 s | 3051 | 2732 | 177 |
| number of contigs | 400 s | 488,142 | 901,182 | 59,663 |
| largest contig | 400 s | 21,914 | 13,618 | 3367 |
| avg contig | 400 s | 377.24 | 323.58 | 149.72 |
| n50 | 400 s | 361 | 319 | 138 |
| number of contigs | ERR2017411 | 85,426 | 165,252 | 252,974 |
| largest contig | ERR2017411 | 516,770 | 394,993 | 278,191 |
| avg contig | ERR2017411 | 1606.59 | 981.96 | 443.97 |
| n50 | ERR2017411 | 4063 | 2820 | 1620 |
| number of contigs | ERR2017412 | 67,750 | 141,689 | 201,038 |
| largest contig | ERR2017412 | 212,503 | 264,186 | 192,118 |
| avg contig | ERR2017412 | 1360.63 | 807.96 | 340.48 |
| n50 | ERR2017412 | 2720 | 1816 | 432 |
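The contig count, largest contig, average contig length, and N50 in the table above can all be derived from a list of contig lengths. The sketch below uses the standard definition of N50 (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length). The record does not say which tool produced its statistics, so the helper names `n50` and `assembly_stats` are hypothetical and the contig lengths are made up.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (standard definition)."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def assembly_stats(lengths):
    """Summarize contig lengths with the four statistics used in the table."""
    return {
        "number of contigs": len(lengths),
        "largest contig": max(lengths),
        "avg contig": sum(lengths) / len(lengths),
        "n50": n50(lengths),
    }


# Made-up contig lengths for illustration only.
contig_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(assembly_stats(contig_lengths))
```

Because N50 weights contigs by length, an assembly made up of many very short contigs has a low N50 even when its contig count is high.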
Figure 2. Contig length distribution compared to PacBio and ONT long-read length distributions. Contigs were assembled across datasets (from left to right): 10 s, 100 s, 400 s, ERR2017411, ERR2017412, and by different assemblers (from top to bottom): MEGAHIT (MH), metaSPAdes (MS), Ray. The bottom subfigures show the PacBio (left) and ONT (right) read length distributions.
Number of species predicted by each classifier.
| Assembler | Kaiju | CLARK | Kraken | MetaCache | MetaPhlAn2 | DUDes | GOTTCHA |
|---|---|---|---|---|---|---|---|
| 26,666,674 paired-end reads (10 s) length of 75 bp | |||||||
| n/a | 3553 | 372 | 346 | 50 | 10 | 9 | 10 |
| MEGAHIT | 25 | 10 | 10 | 11 | 5 | 11 | 10 |
| MetaSPAdes | 31 | 10 | 10 | 15 | 3 | 13 | 10 |
| Ray | 36 | 10 | 10 | 12 | 8 | 12 | 10 |
| 26,667,004 paired-end reads (100 s) length of 75 bp | |||||||
| n/a | 3659 | 394 | 380 | 176 | 87 | 73 | 84 |
| MEGAHIT | 1258 | 95 | 125 | 108 | 80 | 71 | 84 |
| MetaSPAdes | 1328 | 122 | 131 | 115 | 81 | 72 | 84 |
| Ray | 2109 | 86 | 107 | 101 | 86 | 74 | 84 |
| 26,665,698 paired-end reads (400 s) length of 75 bp | |||||||
| n/a | 3707 | 416 | 405 | 426 | 402 | 282 | 390 |
| MEGAHIT | 2024 | 403 | 394 | 411 | 370 | 284 | 388 |
| MetaSPAdes | 2522 | 405 | 396 | 416 | 375 | 277 | 389 |
| Ray | 754 | 398 | 392 | 394 | 10 | 188 | 99 |
| 17,853,919 paired-end reads (ERR2017411) length of 90 bp | |||||||
| n/a | 3654 | 3140 | 3638 | 1071 | 79 | 29 | 37 |
| MEGAHIT | 2071 | 1477 | 1537 | 718 | 29 | 47 | 25 |
| MetaSPAdes | 2618 | 1782 | 1867 | 797 | 32 | 33 | 25 |
| Ray | 2679 | 1630 | 1731 | 515 | 31 | 40 | 23 |
| 17,793,507 paired-end reads (ERR2017412) length of 90 bp | |||||||
| n/a | 3647 | 3075 | 3651 | 1044 | 82 | 48 | 45 |
| MEGAHIT | 1653 | 1035 | 1058 | 611 | 23 | 33 | 26 |
| MetaSPAdes | 2312 | 1387 | 1423 | 679 | 39 | 42 | 29 |
| Ray | 2192 | 1203 | 1297 | 448 | 21 | 21 | 22 |
Pairwise similarity of a method to other methods.
| Assembler | Kaiju | CLARK | Kraken | MetaCache | MetaPhlAn2 | DUDes | GOTTCHA |
|---|---|---|---|---|---|---|---|
| 17,853,919 paired-end reads (ERR2017411) length of 90 bp | |||||||
| n/a | 0.66 | 0.69 | 0.66 | 0.68 | 0.65 | 0.82 | 0.80 |
| MEGAHIT | 0.51 | 0.63 | 0.62 | 0.60 | 0.76 | 0.81 | 0.80 |
| MetaSPAdes | 0.53 | 0.65 | 0.64 | 0.63 | 0.73 | 0.81 | 0.81 |
| Ray | 0.50 | 0.64 | 0.62 | 0.63 | 0.74 | 0.81 | 0.80 |
| 17,793,507 paired-end reads (ERR2017412) length of 90 bp | |||||||
| n/a | 0.66 | 0.69 | 0.65 | 0.68 | 0.71 | 0.82 | 0.82 |
| MEGAHIT | 0.51 | 0.63 | 0.62 | 0.60 | 0.76 | 0.81 | 0.80 |
| MetaSPAdes | 0.53 | 0.65 | 0.64 | 0.63 | 0.73 | 0.81 | 0.81 |
| Ray | 0.50 | 0.64 | 0.62 | 0.63 | 0.74 | 0.81 | 0.80 |