Adam L Bazinet, Michael P Cummings.
Abstract
BACKGROUND: A fundamental problem in modern genomics is to taxonomically or functionally classify DNA sequence fragments derived from environmental sampling (i.e., metagenomics). Several different methods have been proposed for doing this effectively and efficiently, and many have been implemented in software. In addition to varying their basic algorithmic approach to classification, some methods screen sequence reads for 'barcoding genes' like 16S rRNA, or various types of protein-coding genes. Due to the sheer number and complexity of methods, it can be difficult for a researcher to choose one that is well-suited for a particular analysis.
Year: 2012 PMID: 22574964 PMCID: PMC3428669 DOI: 10.1186/1471-2105-13-92
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Program attributes and characteristics
| CARMA | BLAST, HMM | | | | command line, web-based |
| FACS | other | | | | command line |
| jMOTU/Taxonerator | BLAST, other | | | multiple alignment | command line |
| MARTA | BLAST | LCA-like | | command line | |
| MEGAN | BLAST | LCA-like | | GUI | |
| MetaPhyler | BLAST | | | marker genes | command line |
| MG-RAST | BLAST | | | marker genes | web-based |
| MTR | BLAST | LCA-like | | command line | |
| SOrt-ITEMS | BLAST | LCA-like | | command line | |
| Naive Bayes Classifier | NBC | supervised | other | | command line, web-based |
| PhyloPythiaS | other | supervised | | | command line, web-based |
| PhymmBL | IMM | supervised | other | | command line |
| RAIphy | other | semi-supervised | | | GUI |
| RDP | k-means/kNN, NBC | supervised | bootstrap | 16S rRNA | command line, web-based |
| Scimm | IMM | semi-supervised | | | command line |
| TACOA | k-means/kNN | supervised | | | command line |
| EPA | ML | bootstrap, other | multiple alignment | command line, web-based | |
| FastTree | other | bootstrap | multiple alignment | command line | |
| green genes (NAST, Simrank) | other | | | 16S rRNA | web-based |
| pplacer | ML, Bayesian | posterior probability, other | multiple alignment | command line | |
| SPHINX | BLAST | k-means/kNN | supervised | | web-based |
| AMPHORA | HMM | other | bootstrap | marker genes | command line |
| MLTreeMap | BLAST, HMM | ML | bootstrap, other | marker genes | command line, web-based |
| SAP | BLAST | Bayesian, other | posterior probability, other | | command line |
| Treephyler | HMM | other | bootstrap | marker genes | command line |
Figure 1. Program clustering. A neighbor-joining tree that clusters the classification programs based on their similar attributes.
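A neighbor-joining tree is built from a matrix of pairwise distances. The source does not state which distance measure was applied to the program attributes, so the following is only a minimal sketch that assumes a Jaccard distance over attribute sets. The attribute sets are copied from the table rows above for three programs; the names `jaccard_distance` and `distance_matrix` are illustrative.

```python
from itertools import combinations

# Attribute sets copied from the table above for three of the programs.
ATTRIBUTES = {
    "MARTA": {"BLAST", "LCA-like", "command line"},
    "MEGAN": {"BLAST", "LCA-like", "GUI"},
    "TACOA": {"k-means/kNN", "supervised", "command line"},
}


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|; 0.0 means identical attribute sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(attributes):
    """Build a symmetric pairwise distance lookup keyed by (name, name)."""
    names = sorted(attributes)
    dist = {(name, name): 0.0 for name in names}
    for x, y in combinations(names, 2):
        d = jaccard_distance(attributes[x], attributes[y])
        dist[(x, y)] = dist[(y, x)] = d
    return dist


# Assumed input to a neighbor-joining routine (not implemented here).
DIST = distance_matrix(ATTRIBUTES)
```

Under this assumed measure, programs that share most attributes (such as MARTA and MEGAN) end up close together, while programs with no attributes in common are at the maximum distance of 1.0.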
Performance of alignment-based programs
| CARMA | 29.0 | 93.6 | 68.7 | 61.3 | 63.2 |
| MEGAN | 48.4 | 88.2 | 90.5 | 62.2 | 72.3 |
| MetaPhyler | 0.2 | 80.9 | 0.5 | 0.6 | 20.6 |
| MG-RAST | 27.1 | 29.8 | 80.2 | 70.5 | 51.9 |
| CARMA | 26.7 | 93.4 | 68.5 | 59.8 | 62.1 |
| MEGAN | 42.5 | 87.9 | 90.3 | 61.0 | 70.4 |
| MetaPhyler | 0.1 | 80.7 | 0.5 | 0.5 | 20.5 |
| MG-RAST | 25.0 | 29.7 | 80.1 | 67.2 | 50.5 |
| CARMA | 92.0 | 99.7 | 99.7 | 97.4 | 97.2 |
| MEGAN | 78.1 | 99.7 | 99.8 | 98.1 | 93.9 |
| MetaPhyler | 84.0 | 99.7 | 100.0 | 83.8 | 91.9 |
| MG-RAST | 92.4 | 99.8 | 99.9 | 95.3 | 96.9 |
| CARMA1,2 | 290880 | 77340 | 74950 | 360107 | 200819 |
| MEGAN1,2 | 288020 | 72060 | 72010 | 351060 | 195788 |
| MetaPhyler3 | 10 | 20 | 2 | 28 | 15 |
| MG-RAST4 | 60 | 10080 | 20160 | 12960 | 10815 |
| CARMA | 100 | 100 | 100 | 120 | 105 |
| MEGAN | 1024 | 1024 | 1024 | 1410 | 1121 |
| MetaPhyler | 5734 | 5734 | 5734 | 5734 | 5734 |
| MG-RAST5 | - | - | - | - | - |
Measurements of sensitivity, precision, and resource consumption on four simulated data sets.
1. Analysis performed on a 2.66 GHz Intel Core i7 MacBook Pro running Mac OS X 10.7.1 with 8 GB 1067 MHz DDR3 RAM.
2. BLAST v2.2.18 analysis performed using ∼200 Opteron 2425 HE (2.1 GHz) cores; each node has 48 GB RAM.
3. Analysis performed on an AMD Opteron 250 (2.4 GHz) Sun Fire V40z with 32 GB RAM.
4. Used web service; recorded value is the number of minutes to receive results, not actual CPU runtime.
5. Used web service; memory usage could not be determined.
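The two accuracy measures named in the caption can be sketched as follows. The definitions used here are assumptions, since this text does not spell them out: sensitivity is taken as correctly classified sequences over all input sequences, and precision as correctly classified sequences over sequences that received any classification. The counts in the usage note are hypothetical and chosen only for illustration.

```python
def sensitivity(correct, total):
    """Percentage of all input sequences that were classified correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def precision(correct, classified):
    """Percentage of classified sequences that were classified correctly."""
    if classified <= 0:
        raise ValueError("classified must be positive")
    return 100.0 * correct / classified


def percent_classified(classified, total):
    """Percentage of input sequences that received any classification."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * classified / total
```

With hypothetical counts of 100,000 input sequences, 29,000 classified, and 26,680 classified correctly, `sensitivity(26680, 100000)` is about 26.7 and `precision(26680, 29000)` is 92.0. Under these definitions sensitivity equals the fraction classified multiplied by precision, which is why a program can be very precise and still have low sensitivity when it classifies few sequences.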
Results for the FACS simHC metagenomic data set (10^5 sequences, 269 bp)
| | actual | CARMA | MEGAN | MetaPhyler | MG-RAST |
| percentage of sequences classified | | 29.0 | 54.4 | 0.2 | 27.1 |
| Eukaryota | 73.0 | 30.3 | 42.0 | 0.0 | 21.0 |
| Bacteria | 25.6 | 62.8 | 52.0 | 84.0 | 71.5 |
| Viruses | 1.5 | 0.0 | 0.3 | 0.0 | 0.1 |
| Archaea | 0.0 | 6.9 | 5.7 | 16.0 | 7.3 |
| percentage of sequences misclassified | | 8.0 | 12.2 | 16.0 | 7.6 |
| correlation coefficient | | 0.45 | 0.72 | -0.09 | 0.26 |
The actual distribution of sequences compared to the distribution inferred by the alignment-based programs.
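The correlation coefficient row summarizes how closely each inferred distribution tracks the actual one. The text does not name the coefficient, so the sketch below assumes the Pearson correlation computed over the four domain rows (Eukaryota, Bacteria, Viruses, Archaea). The variable names are illustrative, and the two program columns are referred to only by position.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Percentages from the table, in row order:
# Eukaryota, Bacteria, Viruses, Archaea.
ACTUAL = [73.0, 25.6, 1.5, 0.0]
FIRST_PROGRAM = [30.3, 62.8, 0.0, 6.9]    # first inferred column
SECOND_PROGRAM = [42.0, 52.0, 0.3, 5.7]   # second inferred column

R_FIRST = pearson(ACTUAL, FIRST_PROGRAM)
R_SECOND = pearson(ACTUAL, SECOND_PROGRAM)
```

Rounded to two decimals, `R_FIRST` and `R_SECOND` come out at 0.45 and 0.72, which agrees with the first two entries of the correlation coefficient row and supports the Pearson assumption.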
Results for the MetaPhyler simulated metagenomic data set (73,086 sequences, 300 bp)
| | actual | CARMA | MEGAN | MetaPhyler | MG-RAST |
| percentage of sequences classified | | 93.6 | 88.2 | 80.9 | 29.8 |
| Proteobacteria | 47.0 | 47.6 | 44.5 | 48.3 | 46.7 |
| Firmicutes | 21.9 | 22.2 | 24.0 | 21.8 | 23.1 |
| Actinobacteria | 9.7 | 8.7 | 8.8 | 9.1 | 9.3 |
| Bacteroidetes | 4.8 | 4.5 | 4.8 | 4.3 | 4.4 |
| Cyanobacteria | 3.9 | 3.6 | 3.8 | 3.9 | 3.7 |
| Tenericutes | 2.2 | 2.5 | 2.7 | 2.4 | 2.3 |
| Spirochaetes | 1.9 | 2.4 | 2.6 | 2.3 | 2.2 |
| Chlamydiae | 1.3 | 1.9 | 2.0 | 1.8 | 1.8 |
| Thermotogae | 0.9 | 1.2 | 1.2 | 1.1 | 1.2 |
| Chlorobi | 0.9 | 1.4 | 1.5 | 1.3 | 1.4 |
| percentage of sequences misclassified | | 0.3 | 0.3 | 0.3 | 0.2 |
| correlation coefficient | | ≈ 1.0 | ≈ 1.0 | ≈ 1.0 | ≈ 1.0 |
The actual distribution of sequences compared to the distribution inferred by the alignment-based programs.
Results for the CARMA 454 simulated metagenomic data set (25,000 sequences, 265 bp)
| | actual | CARMA | MEGAN | MetaPhyler | MG-RAST |
| percentage of sequences classified | | 68.7 | 90.5 | 0.5 | 80.2 |
| Proteobacteria | 73.0 | 73.2 | 73.0 | 69.2 | 73.2 |
| Firmicutes | 12.9 | 13.2 | 12.8 | 17.3 | 12.9 |
| Cyanobacteria | 7.8 | 7.3 | 7.8 | 6.8 | 7.6 |
| Actinobacteria | 5.2 | 5.0 | 5.3 | 2.3 | 5.4 |
| Chlamydiae | 1.0 | 1.2 | 1.1 | 4.5 | 0.9 |
| percentage of sequences misclassified | | 0.3 | 0.2 | 0.0 | 0.1 |
| correlation coefficient | | ≈ 1.0 | ≈ 1.0 | ≈ 1.0 | ≈ 1.0 |
The actual distribution of sequences compared to the distribution inferred by the alignment-based programs.
Performance of composition-based programs
| NBC | 100 | 100 | 100 | 100 | |
| PhyloPythiaS | 3.5 | 3.1 | 3.3 | 3.3 | |
| PhymmBL | 100 | 99.7 | 100 | 99.9 | |
| RAIphy | 100 | 100 | 100 | 100 | |
| NBC | 95.4 | 97.5 | 99.4 | 97.4 | |
| PhyloPythiaS | 3.1 | 1.8 | 2.2 | 2.4 | |
| PhymmBL | 48.4 | 96.8 | 81.9 | 75.7 | |
| RAIphy | 54.8 | 31.8 | 48.0 | 44.9 | |
| NBC | 95.4 | 97.5 | 99.4 | 97.4 | |
| PhyloPythiaS | 88.1 | 58.5 | 66.1 | 70.9 | |
| PhymmBL | 48.4 | 97.0 | 81.9 | 75.8 | |
| RAIphy | 54.8 | 31.8 | 48.0 | 44.9 | |
| NBC1 | 13496 | 3595 | 17573 | 11555 | 1217 |
| PhyloPythiaS2 | 297 | 180 | 506 | 328 | 4320 |
| PhymmBL1 | 15600 | 1035 | 23508 | 13381 | 2880 |
| RAIphy3 | 105 | 25 | 122 | 84 | 30 |
| NBC | 200 | 200 | 200 | 200 | |
| PhyloPythiaS4 | 100 | 100 | 100 | 100 | |
| PhymmBL4 | 100 | 100 | 100 | 100 | |
| RAIphy | 500 | 335 | 400 | 412 |
Measurements of sensitivity, precision, and resource consumption on three simulated data sets.
1. Analysis performed on an AMD Opteron 250 (2.4 GHz) Sun Fire V40z with 32 GB RAM.
2. Analysis performed on an AMD Opteron 248 (2.2 GHz) workstation with 8 GB RAM.
3. Analysis performed on a 2.66 GHz Intel Core i7 MacBook Pro running Mac OS X 10.7.1 with 8 GB 1067 MHz DDR3 RAM.
4. Input sequences were split into smaller files.
Performance of phylogenetic-based programs
| MLTreeMap1 | 0.9 | 0.8 | 81.4 | 3344 |
| Treephyler1 | 6.6 | 6.3 | 95.7 | 7444 |
Measurements of sensitivity, precision, and resource consumption on the PhyloPythia 961 bp data set.
1. Analysis performed on an AMD Opteron 250 (2.4 GHz) Sun Fire V40z with 32 GB RAM.