Vineet K Sharma, Naveen Kumar, Tulika Prakash, Todd D Taylor.
Abstract
Taxonomic assignment of sequence reads is a challenging task in metagenomic data analysis, for which present methods mainly use either composition-based or homology-based approaches. Although homology-based methods are more sensitive and accurate, their main drawback is the time needed to generate the Blast alignments. We developed the MetaBin program and web server for improved homology-based taxonomic assignment using an ORF-based approach. Implementing Blat as a faster alignment method in place of Blastx reduces the analysis time severalfold. MetaBin was benchmarked using both simulated and real metagenomic datasets, and it can be used for both single and paired-end sequence reads of varying lengths (≥45 bp). To our knowledge, MetaBin is the only available program that can perform taxonomic binning of short reads (<100 bp) with high accuracy and high sensitivity using a homology-based approach. The MetaBin web server can carry out the taxonomic analysis from either submitted reads or Blastx output. It provides several options, including construction of taxonomic trees, creation of a composition chart, functional analysis using COGs, and comparative analysis of multiple metagenomic datasets. The MetaBin web server and a standalone version for high-throughput analysis are freely available at http://metabin.riken.jp/.
Year: 2012 PMID: 22496776 PMCID: PMC3319535 DOI: 10.1371/journal.pone.0034030
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. ORF-based approach for the taxonomic assignment of reads of different lengths derived from different regions of the genomic DNA.
A read derived from an intergenic region (A); a read containing the small 5′ region of an ORF (B); a read containing two partial ORFs at the 5′ and 3′ ends and a complete ORF in the middle (C); a read containing only a single complete ORF (D); a read containing a long partial ORF at one end (E); a read obtained from within an ORF (F); and a read with a sequencing error that splits a single ORF into two smaller ORFs (G). X, Y, Z, K, L, and M are the genomes to which the ORFs showed matches. The taxonomic IDs of the species of these genomes are used for making the taxonomic assignments and for creating the taxonomic bins.
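The read types in Figure 1 can be illustrated with a minimal sketch that scans a read in all six frames and keeps stop-free stretches, including stretches that run to the edge of the read (partial ORFs). This is an illustration under stated assumptions, not MetaBin's implementation: the function names and the 45-nt cutoff (chosen only to mirror the minimum read length given in the abstract) are hypothetical.

```python
# Hypothetical sketch: extract ORF-like, stop-free stretches from one read.
# MIN_ORF_NT is an assumed cutoff, not a documented MetaBin parameter.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
MIN_ORF_NT = 45


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orf_segments(read, min_len=MIN_ORF_NT):
    """Return (strand, frame, start, end, nt_seq) for each stop-free stretch.

    Coordinates are 0-based, end-exclusive, on the strand being scanned.
    A stretch that reaches the end of the read is kept, because a read
    often contains only part of an ORF.
    """
    read = read.upper()
    segments = []
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for frame in range(3):
            start = frame
            pos = frame
            while pos + 3 <= len(seq):
                if seq[pos:pos + 3] in STOP_CODONS:
                    # Close the current stretch at the stop codon.
                    if pos - start >= min_len:
                        segments.append((strand, frame, start, pos, seq[start:pos]))
                    start = pos + 3
                pos += 3
            # Trailing stretch up to the last complete codon.
            if pos - start >= min_len:
                segments.append((strand, frame, start, pos, seq[start:pos]))
    return segments


if __name__ == "__main__":
    example = "ATG" + "GCT" * 20 + "TAA" + "GCT" * 5
    for strand, frame, start, end, _ in orf_segments(example):
        print(strand, frame, start, end)
```

A single sequencing error that introduces a stop codon, as in case (G), would show up here as two shorter segments in the same frame.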
Figure 2. Flowchart of the MetaBin algorithm.
ID and POS refer to %Identity and %Positives, respectively, as provided in the Blastx or Blat output. COV refers to the % coverage of the query with the hit (reference protein).
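The three quantities named in the Figure 2 caption can be sketched as a simple hit filter. The `Hit` record, the field names, and the threshold values below are assumptions for illustration; the actual cutoffs used in the MetaBin flowchart are not given in this text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment of a query against a reference protein (hypothetical record)."""
    subject_id: str
    identity: float   # ID: %Identity from the Blastx or Blat output
    positives: float  # POS: %Positives from the Blastx or Blat output
    q_start: int      # 1-based, inclusive alignment start on the query
    q_end: int        # 1-based, inclusive alignment end on the query
    query_len: int    # query length in the same units as q_start/q_end


def coverage(hit):
    """COV: percentage of the query covered by the alignment with the hit."""
    aligned = abs(hit.q_end - hit.q_start) + 1
    return 100.0 * aligned / hit.query_len


def passes_filters(hit, min_id=50.0, min_pos=60.0, min_cov=70.0):
    """Keep a hit only if ID, POS and COV all meet the (assumed) cutoffs."""
    return (
        hit.identity >= min_id
        and hit.positives >= min_pos
        and coverage(hit) >= min_cov
    )


if __name__ == "__main__":
    hits = [
        Hit("protein_1", 80.0, 90.0, 1, 90, 100),
        Hit("protein_2", 80.0, 90.0, 1, 30, 100),
    ]
    kept = [h.subject_id for h in hits if passes_filters(h)]
    print(kept)
```

Passing the cutoffs as keyword arguments keeps them easy to adjust once the real values are known.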
Summary of results using MetaBin, MEGAN and SOrt-ITEMS on simulated bacterial read datasets for different sequencing technologies.
Column groups: NR, complete NR database; NRminusGenus, NR with genus deleted; NRminusFamily, NR with family deleted.

| Read Length (bp) | Method | NR: Genus | NR: Family | NR: Phylum | NR: Sens | NR: PPV | NRminusGenus: Family | NRminusGenus: Phylum | NRminusGenus: Sens | NRminusGenus: PPV | NRminusFamily: Phylum | NRminusFamily: Sens | NRminusFamily: PPV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | 93.49 | 97.97 | 99.15 | 99.18 | 99.95 | 33.68 | 61.57 | 64.18 | 85.84 | 49.35 | 51.91 | 80.13 |
| | | 93.46 | 97.92 | 99.1 | 99.19 | 99.96 | 28.68 | 52.89 | 66.04 | 80.91 | 42.07 | 56.47 | 75.19 |
| | | 92.53 | 97.46 | 98.61 | 98.67 | 99.92 | 33.16 | 60.46 | 63.04 | 85.73 | 49.31 | 51.88 | 79.94 |
| | | 52.62 | 68.29 | 94.61 | 96.01 | 97.7 | 6.25 | 48.01 | 49.38 | 84.59 | 35.68 | 36.89 | 78.42 |
| | | | | | | | | | | | | | |
| | | 88.03 | 92.87 | 94.71 | 94.47 | 99.92 | 24.84 | 45.89 | 47.71 | 83.03 | 33.88 | 35.65 | 75.05 |
| | | 83.14 | 87.97 | 90.46 | 93.83 | 99.85 | 15.78 | 28.96 | 49.69 | 78.91 | 20.44 | 41.81 | 72.77 |
| | | 87.73 | 92.72 | 94.49 | 94.28 | 99.88 | 24.32 | 45.16 | 46.89 | 82.92 | 34.69 | 36.34 | 75.81 |
| | | 34.41 | 67.62 | 91.53 | 91.7 | 98.1 | 8.35 | 42.84 | 45.19 | 78.68 | 32.86 | 35.29 | 68.89 |
| | | | | | | | | | | | | | |
| | | 86.94 | 92.14 | 94.73 | 94.79 | 99.88 | 21.63 | 39.39 | 40.86 | 81.19 | 27.89 | 29.29 | 72.71 |
| | | 85.71 | 90.75 | 93.53 | 93.57 | 99.89 | 14.63 | 26.45 | 58.41 | 78.36 | 18.22 | 50.25 | 71.76 |
| | | 86.33 | 91.73 | 94.28 | 94.43 | 99.77 | 21.24 | 38.84 | 40.48 | 81.17 | 28.05 | 29.29 | 73.11 |
| | | 41.63 | 59.8 | 78.06 | 78.31 | 97.5 | 8.12 | 30.18 | 31.41 | 79 | 22.46 | 23.48 | 69.16 |

Summary of results using MetaBin, MEGAN and SOrt-ITEMS on simulated archaeal read datasets for different sequencing technologies.
Column groups: NR, complete NR database; NRminusGenus, NR with genus deleted; NRminusFamily, NR with family deleted.

| Read Length (bp) | Method | NR: Genus | NR: Family | NR: Phylum | NR: Sens | NR: PPV | NRminusGenus: Family | NRminusGenus: Phylum | NRminusGenus: Sens | NRminusGenus: PPV | NRminusFamily: Phylum | NRminusFamily: Sens | NRminusFamily: PPV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | 97.81 | 98.44 | 99.69 | 99.69 | 100 | 28.35 | 77.96 | 80.33 | 95.95 | 56.85 | 59.8 | 81.18 |
| | | 97.86 | 98.65 | 99.58 | 99.63 | 99.95 | 21.63 | 64.72 | 80.49 | 93.99 | 48.15 | 63.58 | 79.14 |
| | | 97.86 | 98.54 | 99.64 | 99.63 | 100 | 28.09 | 77.75 | 80.61 | 95.16 | 56.54 | 59.97 | 78.75 |
| | | 60.24 | 75.09 | 99.11 | 99.11 | 100 | 2.24 | 60.03 | 61.23 | 96.65 | 42.78 | 44 | 85.21 |
| | | | | | | | | | | | | | |
| | | 92.09 | 93.19 | 94.95 | 95.07 | 99.87 | 19.23 | 58.66 | 59.85 | 95.92 | 42.05 | 42.98 | 79.26 |
| | | 87.19 | 88.39 | 90.94 | 94.9 | 99.92 | 9.69 | 33.17 | 65.59 | 93.13 | 24.64 | 51.51 | 77.27 |
| | | 92.34 | 93.65 | 95.48 | 95.75 | 99.81 | 19.85 | 59.38 | 61.38 | 95.37 | 42.25 | 43.91 | 78.81 |
| | | 38.86 | 73.82 | 95.23 | 95.79 | 99.39 | 5.76 | 53.22 | 55.67 | 91.84 | 37.14 | 39.21 | 73.57 |
| | | | | | | | | | | | | | |
| | | 91.54 | 92.75 | 95.16 | 95.19 | 99.96 | 15.65 | 47.87 | 49.12 | 94.32 | 34.32 | 35.33 | 78.19 |
| | | 90.74 | 91.97 | 94.48 | 98.39 | 99.94 | 8.1 | 26.78 | 73.57 | 92.15 | 19.69 | 57.81 | 75.69 |
| | | 91.86 | 93.31 | 95.71 | 95.9 | 99.85 | 16.47 | 49.96 | 51.91 | 93.1 | 35.43 | 36.88 | 75.49 |
| | | 52.54 | 73.85 | 95.3 | 95.47 | 99.81 | 5.09 | 34.84 | 36.34 | 88.95 | 23.8 | 24.81 | 72.19 |

The above tables show the percentage of total reads correctly assigned at different taxonomic levels such as Genus, Family or Phylum. ‘Sens’ refers to %average sensitivity and ‘PPV’ refers to %average positive predictive value.
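As a rough illustration of the two summary metrics, the sketch below computes per-taxon sensitivity, TP / (TP + FN), and positive predictive value, TP / (TP + FP), then averages each across taxa. The averaging scheme, the function name, and the input format are assumptions; the exact procedure behind the tabulated values may differ.

```python
from collections import Counter


def average_sensitivity_ppv(pairs):
    """Return (%average sensitivity, %average PPV) over taxa.

    `pairs` is a list of (true_taxon, predicted_taxon) tuples, with
    predicted_taxon set to None for an unassigned read (assumed format).
    """
    tp, fn, fp = Counter(), Counter(), Counter()
    for truth, predicted in pairs:
        if predicted == truth:
            tp[truth] += 1
        else:
            fn[truth] += 1              # read missed for its true taxon
            if predicted is not None:
                fp[predicted] += 1      # read wrongly placed in another taxon

    taxa = set(tp) | set(fn) | set(fp)
    sens = [100.0 * tp[t] / (tp[t] + fn[t]) for t in taxa if tp[t] + fn[t] > 0]
    ppv = [100.0 * tp[t] / (tp[t] + fp[t]) for t in taxa if tp[t] + fp[t] > 0]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(sens), mean(ppv)


if __name__ == "__main__":
    toy = [("A", "A"), ("A", "A"), ("A", "B"), ("A", None), ("B", "B"), ("B", "B")]
    sensitivity, ppv = average_sensitivity_ppv(toy)
    print(round(sensitivity, 2), round(ppv, 2))
```

In the toy input, unassigned reads lower sensitivity without affecting PPV, which is one way a method can show high PPV alongside lower sensitivity.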