Yi Wang, Henry Leung, Siu Yiu, Francis Chin.
Abstract
BACKGROUND: Taxonomic annotation of reads is an important problem in metagenomic analysis. Existing annotation tools, which rely on aligning each read to the taxonomic structure, cannot annotate many reads efficiently and accurately because reads (~100 bp) are short and most of them come from unknown genomes. Previous work has suggested assembling the reads into longer contigs before annotation. More reads/contigs can then be annotated, because a longer contig (on the order of kbp) can be aligned to a taxon even if it comes from an unknown species, as long as it contains a conserved region of that taxon. Unfortunately, existing metagenomic assembly tools are not mature enough to produce sufficiently long contigs. Binning tries to group reads/contigs of similar species together. Intuitively, reads in the same group (cluster) should be annotated to the same taxon, and together these reads should cover a significant portion of the genome, which alleviates the problem of short contigs if the quality of binning is high. However, no existing work has tried to use binning results to help solve the annotation problem. This work explores this direction.
Year: 2014 PMID: 24564377 PMCID: PMC4046714 DOI: 10.1186/1471-2164-15-S1-S12
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Annotation result on high-coverage dataset A1
| Methods | Species-reference (~16.7 million reads) | | | | Order-reference (~20.0 million reads) | | | |
|---|---|---|---|---|---|---|---|---|
| | Accurate | Higher | Incorrect | Unassigned | Accurate | Higher | Incorrect | Unassigned |
| MetaCluster-TA | | | | | | | 22.6% | |
| MEGAN4 (contigs) | 60.7% | 12.9% | 22.3% | 4.1% | 12.3% | 13.1% | 65.3% | 9.3% |
| MEGAN4 (reads) | 57.7% | 14.8% | 4.6% | 22.8% | 0.7% | 0.7% | | 95.2% |
* "Accurate" corresponds to the percentage of species-reference/order-reference reads annotated correctly, i.e., their correct species/order names of the target genomes; "Higher" corresponds to the percentage of species-reference/order-reference reads that are correctly annotated, but to taxonomy of higher levels than species/order of the target genomes (e.g. reads of E. coli-reference annotated with family name Enterobacteriaceae); "Incorrect" corresponds to the percentage of reads which are annotated incorrectly; "Unassigned" corresponds to the percentage of reads that cannot be annotated to any taxonomy.
* Running time of MetaCluster-TA is about 8 hours; running time of MEGAN4 (reads) is about 4 days; running time of MEGAN4 (contigs) is about 1 day.
* About 80% of reads can be aligned to contigs of length > 500 bp with <5% mismatches.
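The four categories defined in the first footnote can be read as a simple decision rule. The following is a minimal sketch of that rule, not the authors' evaluation code: the function name `categorize`, the representation of a read's true lineage as a list ordered from the target level upward, and the three-entry E. coli lineage are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def categorize(predicted: Optional[str], lineage: Sequence[str]) -> str:
    """Classify one annotation using the footnote's four categories.

    `lineage` (hypothetical representation) lists the read's true taxa from
    the target level upward, e.g. species, then genus, then family.
    `predicted` is the taxon a tool assigned, or None if it assigned nothing.
    """
    if predicted is None:
        return "Unassigned"   # the read could not be annotated to any taxon
    if predicted == lineage[0]:
        return "Accurate"     # exactly the target species/order
    if predicted in lineage[1:]:
        return "Higher"       # correct, but at a level above the target
    return "Incorrect"        # any other taxon


# Illustrative lineage for a read from an E. coli reference (species level).
ecoli = ["Escherichia coli", "Escherichia", "Enterobacteriaceae"]

labels = [
    categorize("Escherichia coli", ecoli),
    categorize("Enterobacteriaceae", ecoli),
    categorize("Bacillus subtilis", ecoli),
    categorize(None, ecoli),
]
print(labels)
```

The second call mirrors the footnote's own example: a read from an E. coli reference annotated with the family name Enterobacteriaceae counts as "Higher".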
Further comparison between MEGAN4 (contigs) and MetaCluster-TA on species-reference of A1
| | | MetaCluster-TA | | | |
|---|---|---|---|---|---|
| | | Accurate | Higher | Incorrect | Unassigned |
| MEGAN4 | Accurate | 56.3% | 3.1% | 1.3% | 0.0% |
| | Higher | 3.2% | 8.4% | 1.3% | 0.0% |
| | Incorrect | 1.3% | 19.6% | 1.4% | 0.0% |
| | Unassigned | 0.1% | 1.8% | 0.0% | 2.2% |
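One way to read this comparison is as a joint distribution over the two tools' outcomes for the same reads. Under that reading, each row total should reproduce the MEGAN4 (contigs) species-reference percentages reported for A1 (60.7%, 12.9%, 22.3%, 4.1%), and each column total would give the corresponding MetaCluster-TA percentage. The sketch below checks this. The column order (Accurate, Higher, Incorrect, Unassigned) is an assumption taken from the footnote order, and the helper name `marginals` is hypothetical.

```python
from typing import List, Tuple

CATEGORIES = ["Accurate", "Higher", "Incorrect", "Unassigned"]

# Rows: MEGAN4 (contigs) outcome; columns: MetaCluster-TA outcome.
# Values are percentages of species-reference reads of A1, copied from the table.
joint = [
    [56.3, 3.1, 1.3, 0.0],   # MEGAN4 Accurate
    [3.2, 8.4, 1.3, 0.0],    # MEGAN4 Higher
    [1.3, 19.6, 1.4, 0.0],   # MEGAN4 Incorrect
    [0.1, 1.8, 0.0, 2.2],    # MEGAN4 Unassigned
]

# Percentages reported for MEGAN4 (contigs) in the A1 annotation table.
megan4_reported = [60.7, 12.9, 22.3, 4.1]


def marginals(table: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Return (row totals, column totals), rounded to one decimal place."""
    rows = [round(sum(r), 1) for r in table]
    cols = [round(sum(c), 1) for c in zip(*table)]
    return rows, cols


row_totals, col_totals = marginals(joint)

# Share of reads for which both tools land in the same category (the diagonal).
agreement = round(sum(joint[i][i] for i in range(len(joint))), 1)

for name, reported, row, col in zip(CATEGORIES, megan4_reported, row_totals, col_totals):
    print(f"{name:<10} MEGAN4 reported={reported:5.1f} row sum={row:5.1f} column sum={col:5.1f}")
print(f"Same category in both tools: {agreement}%")
```

Because the published cells are rounded to one decimal, totals computed this way can differ from reported figures by about 0.1 percentage points, so the column totals are best treated as estimates rather than as the published MetaCluster-TA values.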
Further comparison between MEGAN4 (contigs) and MetaCluster-TA on order-reference of A1
| | | MetaCluster-TA | | | |
|---|---|---|---|---|---|
| | | Accurate | Higher | Incorrect | Unassigned |
| MEGAN4 | Accurate | 9.3% | 1.3% | 1.7% | 0.0% |
| | Higher | 1.3% | 8.7% | 3.1% | 0.0% |
| | Incorrect | 20.5% | 27.2% | 17.6% | 0.0% |
| | Unassigned | 0.7% | 0.9% | 0.2% | 7.5% |
Annotation result on dataset A2
| Methods | Species-reference (~16.7 million reads) | | | | Order-reference (~20.0 million reads) | | | |
|---|---|---|---|---|---|---|---|---|
| | Accurate | Higher | Incorrect | Unassigned | Accurate | Higher | Incorrect | Unassigned |
| MetaCluster-TA | | | | | | | 14.4% | |
| MEGAN4 (contigs) | 54.2% | 1.8% | 3.8% | 40.1% | 13.5% | 4.3% | 42.0% | 40.3% |
| MEGAN4 (reads) | 57.3% | 5.9% | 4.3% | 32.6% | 1.1% | 0.5% | | 94.5% |
* Running time of MetaCluster-TA is about 5 hours; running time of MEGAN4 (reads) is about 1.5 days; running time of MEGAN4 (contigs) is about 10 hours.
* About 40% of reads can be aligned to contigs of length > 500 bp with <5% mismatches.
Further comparison between MEGAN4 (contigs) and MetaCluster-TA on species-reference of A2
| | | MetaCluster-TA | | | |
|---|---|---|---|---|---|
| | | Accurate | Higher | Incorrect | Unassigned |
| MEGAN4 | Accurate | 51.1% | 0.7% | 2.4% | 0.0% |
| | Higher | 0.8% | 1.0% | 0.0% | 0.0% |
| | Incorrect | 2.5% | 0.9% | 0.4% | 0.0% |
| | Unassigned | 4.3% | 6.3% | 1.0% | 28.6% |
Further comparison between MEGAN4 (contigs) and MetaCluster-TA on order-reference of A2
| | | MetaCluster-TA | | | |
|---|---|---|---|---|---|
| | | Accurate | Higher | Incorrect | Unassigned |
| MEGAN4 | Accurate | 12.2% | 1.0% | 0.3% | 0.0% |
| | Higher | 1.6% | 0.6% | 2.1% | 0.0% |
| | Incorrect | 21.3% | 11.9% | 8.7% | 0.0% |
| | Unassigned | 0.6% | 1.6% | 3.3% | 34.8% |
Clustering performance on dataset B
| Species discovered | Sensitivity | Overall performance | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| MetaCluster 5.0 | 47 | 11 | 0 | 0.84 | 0.78 | 0.80 | 0.79 | 35 GB | 101 min |
| MetaCluster-TA | 47 | 14 | 0 | 0.89 | 0.84 | 0.86 | 0.87 | 30 GB | 250 min |
Figure 1. Workflow of MetaCluster-TA.