Alexander Statnikov, Mikael Henaff, Varun Narendra, Kranti Konganti, Zhiguo Li, Liying Yang, Zhiheng Pei, Martin J Blaser, Constantin F Aliferis, Alexander V Alekseyenko.
Abstract
BACKGROUND: Recent advances in next-generation DNA sequencing enable rapid high-throughput quantitation of microbial community composition in human samples, opening up a new field of microbiomics. One of the promises of this field is linking abundances of microbial taxa to phenotypic and physiological states, which can inform development of new diagnostic, personalized medicine, and forensic modalities. Prior research has demonstrated the feasibility of applying machine learning methods to perform body site and subject classification with microbiomic data. However, it is currently unknown which classifiers perform best among the many available alternatives for classification with microbiomic data.
Year: 2013 PMID: 24456583 PMCID: PMC3960509 DOI: 10.1186/2049-2618-1-11
Source DB: PubMed Journal: Microbiome ISSN: 2049-2618 Impact factor: 14.650
Characteristics of microbiomic datasets used in this study
| Samples | OTUs | Classes | Classification task | Largest class (%) |
|---|---|---|---|---|
| 552 | 6,979 | 6 | Classify body habitats: Skin (357), Oral Cavity (46), External Auditory Canal (44), Hair (14), Nostril (46), Feces (45) | 64.7 |
| 140 | 2,543 | 7 | Classify 7 subjects by microbiota (20/20/20/20/20/20/20) | 14.3 |
| 357 | 4,793 | 12 | Classify skin sites: external nose (14), forehead (32), glans penis (8), labia minora (6), axilla (28), pinna (27), palm (64), palmar index finger (28), plantar foot (64), popliteal fossa (46), volar forearm (28), umbilicus (12) | 17.9 |
| 104 | 1,217 | 3 | Classify 3 subjects by microbiota (40/33/31) | 38.5 |
| 98 | 1,217 | 6 | Classify by subject and left/right hand (20/18/17/14/16/13) | 20.4 |
| 151 | 13,503 | 3 | Classify as Control (49), Psoriasis Normal (51), Psoriasis Lesion (51) | 33.8 |
| 200 | 74,018 | 4 | Classify as Normal (28), Reflux Esophagitis (36), Barrett's Esophagus (84), Esophageal Adenocarcinoma (52) | 42.0 |
| 200 | 74,018 | 4 | Classify body site: Oral Cavity (51), Esophagus (51), Stomach (48), Stool (50) | 25.5 |
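The final column of the table above appears to be the share of samples in the largest class, that is, the accuracy a classifier would reach by always predicting the most common label. This reading is an inference from the arithmetic, not a label preserved in the source. A minimal Python sketch that reproduces the column from the class counts listed in each task description (the dictionary keys and the function name are illustrative):

```python
# Class counts copied from the task descriptions in the dataset table.
# The short keys are illustrative labels, not dataset names from the source.
CLASS_COUNTS = {
    "body habitats": [357, 46, 44, 14, 46, 45],
    "7 subjects": [20, 20, 20, 20, 20, 20, 20],
    "skin sites": [14, 32, 8, 6, 28, 27, 64, 28, 64, 46, 28, 12],
    "3 subjects": [40, 33, 31],
    "subject and hand": [20, 18, 17, 14, 16, 13],
    "psoriasis": [49, 51, 51],
    "esophageal disease": [28, 36, 84, 52],
    "GI body site": [51, 51, 48, 50],
}


def majority_class_percent(counts):
    """Percentage of samples in the largest class, rounded to one decimal."""
    return round(100 * max(counts) / sum(counts), 1)


for task, counts in CLASS_COUNTS.items():
    print(f"{task}: n={sum(counts)}, largest class={majority_class_percent(counts)}%")
```

The sample totals computed this way (552, 140, 357, 104, 98, 151, 200, 200) also agree with the first column of the table.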
Values of parameters of the preprocessing methods [2,11,12]
| Value | Description |
|---|---|
| uclust | uclust, creates ‘seeds’ of sequences which generate clusters based on percent identity. |
| furthest | Clustering algorithm for mothur otu picking method. Valid choices are: furthest, nearest, average. |
| 400 | Maximum available memory to cd-hit-est (via the program’s -M option) for cdhit OTU picking method (units of Mbyte) |
| None | Path to reference sequences to search against when using -m blast, -m uclust_ref, or -m usearch_ref |
| None | Pre-existing database to blast against when using -m blast |
| 0.97 | Sequence similarity threshold (for cdhit, uclust, uclust_ref, or usearch) |
| 1.00E-10 | Max E-value when clustering with BLAST |
| None | Prefilter data so seqs with identical first prefix_prefilter_length are automatically grouped into a single OTU |
| FALSE | Prefilter data so seqs which are identical prefixes of a longer seq are automatically grouped into a single OTU |
| 50 | Prefix length when using the prefix_suffix otu picker |
| 50 | Suffix length when using the prefix_suffix otu picker |
| FALSE | Pass the -optimal flag to uclust for uclust otu picking. |
| FALSE | Pass the -exact flag to uclust for uclust otu picking. |
| FALSE | Do not assume input is sorted by length |
| FALSE | Suppress presorting of sequences by abundance when picking OTUs with uclust or uclust_ref |
| FALSE | Suppress creation of new clusters using seqs that don’t match reference when using -m uclust_ref or -m usearch_ref |
| FALSE | Do not pass -stable-sort to uclust |
| 20 | Max_accepts value to uclust and uclust_ref |
| 500 | Max_rejects value to uclust and uclust_ref |
| 12 | W value to usearch, uclust, and uclust_ref. Set to 64 for usearch. |
| 20 | Stepwords value to uclust and uclust_ref |
| FALSE | Do not collapse exact matches before calling uclust |
Parameters and software implementations of the classification algorithms
| 1 | libsvm [ | ||
| optimized over (0.01, 0.1, 1, 10, 100) | |||
| optimized over (0.01, 0.1, 1, 10, 100) | |||
| optimized over (1, 2, 3) | |||
| optimized over (0.01, 0.1, 1, 10, 100) | |||
| γ (determines RBF width) | optimized over (0.01, 0.1, 1, 10, 100)/number of variables | ||
| optimized over (10^-10, 10^-9, …, 1) | clop [ | ||
| optimized over (1, 2, 3) | |||
| optimized over (10^-10, 10^-9, …, 1) | |||
| γ (determines RBF width) | optimized over (0.01, 0.1, 1, 10, 100)/number of variables | ||
| 1 | Matlab Statistics Toolbox ( | ||
| 5 | |||
| optimized over (1, …, 50) | |||
| optimized over (0.01, 0.02, …, 1) | Matlab Neural Network Toolbox ( | ||
| 1 | liblinear [ | ||
| optimized over (0.01, 0.1, 1, 10, 100) | |||
| 1 | |||
| optimized over (0.01, 0.1, 1, 10, 100) | |||
| automatically determined in the software by cross-validation | bbr ( | ||
| automatically determined in the software by cross-validation | |||
| 500 | R package randomForest (cran.r-project.org/) | ||
| optimized over (500, 1000, 2000) | |||
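Several rows of the table above give a grid of candidate values, and the γ rows divide the grid (0.01, 0.1, 1, 10, 100) by the number of variables in the dataset. The sketch below shows how such candidate sets could be enumerated in Python. Because the row labels are missing from this record, pairing the γ grid with a second grid over the same five values is an assumption made for illustration, and the function names are hypothetical.

```python
from itertools import product

# Base grid that appears repeatedly in the parameter table.
BASE_GRID = (0.01, 0.1, 1, 10, 100)


def rbf_gamma_grid(n_variables, base=BASE_GRID):
    """Candidate gamma values: each base value divided by the number of variables."""
    return [b / n_variables for b in base]


def rbf_candidates(n_variables, other_grid=BASE_GRID):
    """All (other_parameter, gamma) pairs.

    Assumption: the second tuned parameter (for example a penalty or
    regularization constant) uses the same five-value grid.
    """
    return list(product(other_grid, rbf_gamma_grid(n_variables)))


# Example: the first dataset in this record has 6,979 OTUs (variables).
candidates = rbf_candidates(n_variables=6979)
print(len(candidates))   # number of parameter combinations to evaluate
print(candidates[0])     # smallest pair in the grid
```

Scaling γ by the number of variables keeps the kernel width comparable across datasets whose dimensionality ranges from about 1,200 to about 74,000 OTUs.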
Figure 1. Accuracies of all classification algorithms averaged over eight datasets. Panels: (a) Proportion of correct classifications (PCC) without feature selection, (b) Relative classifier information (RCI) without feature selection, (c) PCC with feature selection, and (d) RCI with feature selection. The nominally best performing method and methods whose performance cannot be deemed statistically worse than the nominally best performing method are shown as shaded bars; all other methods are shown as empty bars. See text for definition of the PCC and RCI metrics and details of statistical comparison.
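The definitions of PCC and RCI are not included in this record. As a hedged illustration, the Python sketch below computes both from a confusion matrix, using one common formulation of RCI: the reduction in entropy of the true class given the classifier output, divided by the entropy of the true class distribution. The formulation, the function names, and the toy matrices are assumptions and may differ in detail from the authors' implementation.

```python
import math


def pcc(confusion):
    """Proportion of correct classifications: diagonal sum over total count."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def _entropy(counts):
    """Shannon entropy (bits) of the distribution implied by non-negative counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def rci(confusion):
    """Relative classifier information (assumed formulation).

    confusion[i][j] = number of samples of true class i predicted as class j.
    Returns (H(true) - H(true | predicted)) / H(true).
    """
    total = sum(sum(row) for row in confusion)
    prior_entropy = _entropy([sum(row) for row in confusion])
    if prior_entropy == 0:
        return 0.0  # only one true class present: nothing to explain
    conditional_entropy = 0.0
    for j in range(len(confusion[0])):
        column = [row[j] for row in confusion]
        column_total = sum(column)
        if column_total:
            conditional_entropy += (column_total / total) * _entropy(column)
    return (prior_entropy - conditional_entropy) / prior_entropy


# Toy example: always predicting the majority class on a 9:1 dataset.
majority_guess = [[9, 0], [1, 0]]
print(pcc(majority_guess), rci(majority_guess))

# Toy example: a perfect classifier on a balanced dataset.
perfect = [[5, 0], [0, 5]]
print(pcc(perfect), rci(perfect))
```

In the first toy matrix the classifier reaches a high PCC purely from class imbalance while conveying no information about the true class, which is the situation an entropy-based metric of this kind is designed to expose.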
Classification accuracy without feature/operational taxonomic unit (OTU) selection, measured by proportion of correct classifications (PCC)
| 0.920 | 0.911 | 0.583 | 0.940 | 0.598 | 0.354 | 0.468 | 0.695 | 0.684 | 0.022 | |
| 0.920 | 0.911 | 0.622 | 0.980 | 0.585 | 0.383 | 0.485 | 0.709 | 0.699 | 0.038 | |
| 0.920 | 0.911 | 0.622 | 0.980 | 0.585 | 0.383 | 0.484 | 0.709 | 0.699 | 0.036 | |
| 0.909 | 0.904 | 0.575 | 0.973 | 0.575 | 0.379 | 0.451 | 0.700 | 0.683 | 0.021 | |
| 0.913 | 0.918 | 0.581 | 0.954 | 0.598 | 0.377 | 0.482 | 0.709 | 0.692 | 0.027 | |
| 0.923 | 0.904 | 0.618 | 0.967 | 0.632 | 0.366 | 0.467 | 0.709 | 0.698 | 0.030 | |
| 0.496 | 0.360 | 0.195 | 0.451 | 0.305 | 0.249 | 0.419 | 0.291 | 0.346 | 0.002 | |
| 0.713 | 0.339 | 0.188 | 0.397 | 0.281 | 0.331 | 0.393 | 0.300 | 0.368 | 0.001 | |
| 0.714 | 0.377 | 0.192 | 0.325 | 0.273 | 0.340 | 0.409 | 0.379 | 0.376 | 0.001 | |
| 0.743 | 0.321 | 0.216 | 0.522 | 0.332 | 0.325 | 0.167 | 0.247 | 0.359 | 0.000 | |
| 0.934 | 0.939 | 0.628 | 0.982 | 0.628 | 0.380 | 0.725 | 0.716 | 0.084* | ||
| 0.933 | 0.938 | 0.623 | 0.978 | 0.618 | 0.383 | 0.502 | 0.725 | 0.712 | 0.067* | |
| 0.929 | 0.801 | 0.559 | 0.975 | 0.700 | 0.422 | 0.384 | 0.673 | 0.680 | 0.018* | |
| 0.928 | 0.903 | 0.561 | 0.981 | 0.690 | 0.445 | 0.412 | 0.692 | 0.702 | 0.039 | |
| 0.932 | 0.955 | 0.673 | 0.744 | 0.508 | 0.424 | 0.730 | 0.746 | 0.270* | ||
| 0.927 | 0.927 | 0.634 | 0.962 | 0.622 | 0.387 | 0.452 | 0.727 | 0.705 | 0.042 | |
| 0.921 | 0.736 | 0.480 | 0.966 | 0.631 | 0.354 | 0.410 | 0.635 | 0.642 | 0.008 |
The nominally best performing classifier on average over all datasets is marked with bold, and P values of methods whose performance cannot be deemed statistically worse than the nominally best performing method are marked with “*”. The accuracy of the nominally best performing method for each dataset is underlined.
Classification accuracy without feature/operational taxonomic unit (OTU) selection, measured by relative classifier information (RCI)
| 0.769 | 0.918 | 0.674 | 0.882 | 0.749 | 0.158 | 0.228 | 0.602 | 0.623 | 0.165* | |
| 0.771 | 0.915 | 0.674 | 0.958 | 0.751 | 0.157 | 0.241 | 0.607 | 0.634 | 0.294* | |
| 0.771 | 0.915 | 0.674 | 0.958 | 0.751 | 0.162 | 0.241 | 0.607 | 0.635 | 0.299* | |
| 0.689 | 0.907 | 0.631 | 0.942 | 0.731 | 0.156 | 0.202 | 0.561 | 0.602 | 0.059* | |
| 0.765 | 0.927 | 0.671 | 0.911 | 0.758 | 0.157 | 0.230 | 0.612 | 0.629 | 0.206* | |
| 0.774 | 0.913 | 0.675 | 0.935 | 0.759 | 0.163 | 0.598 | 0.632 | 0.265* | ||
| 0.344 | 0.329 | 0.377 | 0.163 | 0.355 | 0.167 | 0.074 | 0.078 | 0.236 | 0.003 | |
| 0.178 | 0.359 | 0.277 | 0.102 | 0.203 | 0.056 | 0.092 | 0.062 | 0.166 | 0.002 | |
| 0.337 | 0.402 | 0.354 | 0.028 | 0.207 | 0.089 | 0.122 | 0.196 | 0.217 | 0.003 | |
| 0.325 | 0.292 | 0.411 | 0.236 | 0.342 | 0.041 | 0.070 | 0.089 | 0.226 | 0.002 | |
| 0.772 | 0.941 | 0.670 | 0.964 | 0.778 | 0.161 | 0.236 | 0.644 | 0.575* | ||
| 0.782 | 0.939 | 0.680 | 0.958 | 0.778 | 0.163 | 0.228 | 0.624 | 0.644 | 0.626* | |
| 0.769 | 0.825 | 0.635 | 0.949 | 0.779 | 0.163 | 0.191 | 0.565 | 0.610 | 0.089* | |
| 0.910 | 0.664 | 0.960 | 0.790 | 0.174 | 0.209 | 0.599 | 0.638 | 0.439* | ||
| 0.767 | 0.957 | 0.671 | 0.803 | 0.173 | 0.087 | 0.618 | 0.634 | 0.253* | ||
| - | ||||||||||
| 0.759 | 0.932 | 0.679 | 0.922 | 0.770 | 0.166 | 0.090 | 0.619 | 0.617 | 0.085* | |
| 0.744 | 0.759 | 0.496 | 0.930 | 0.736 | 0.077 | 0.014 | 0.529 | 0.536 | 0.008 |
The nominally best performing classifier on average over all datasets is marked with bold, and P values of methods whose performance cannot be deemed statistically worse than the nominally best performing method are marked with “*”. The accuracy of the nominally best performing method for each dataset is underlined.
Classification accuracy with feature/operational taxonomic unit (OTU) selection, measured by proportion of correct classifications (PCC)
| 0.900 | 0.941 | 0.610 | 0.965 | 0.719 | 0.524 | 0.759 | 0.747 | 0.319* | |||
| 0.952 | 0.935 | 0.631 | 0.985 | 0.754 | 0.534 | 0.553 | 0.763 | 0.535* | |||
| 0.950 | 0.929 | 0.633 | 0.987 | 0.742 | 0.528 | 0.551 | 0.754 | 0.759 | 0.460* | ||
| 0.941 | 0.918 | 0.617 | 0.987 | 0.693 | 0.518 | 0.523 | 0.727 | 0.741 | 0.179* | ||
| 0.909 | 0.933 | 0.623 | 0.949 | 0.749 | 0.547 | 0.514 | 0.713 | 0.742 | 0.199* | ||
| 0.929 | 0.939 | 0.634 | 0.970 | 0.737 | 0.537 | 0.504 | 0.714 | 0.745 | 0.248* | ||
| 0.930 | 0.760 | 0.563 | 0.971 | 0.623 | 0.421 | 0.443 | 0.596 | 0.663 | 0.011 | ||
| 0.930 | 0.724 | 0.529 | 0.943 | 0.656 | 0.434 | 0.434 | 0.609 | 0.657 | 0.009 | ||
| 0.935 | 0.754 | 0.552 | 0.963 | 0.648 | 0.422 | 0.432 | 0.620 | 0.666 | 0.011 | ||
| 0.906 | 0.781 | 0.560 | 0.956 | 0.623 | 0.130 | 0.449 | 0.604 | 0.626 | 0.006 | ||
| 0.934 | 0.939 | 0.628 | 0.982 | 0.628 | 0.380 | 0.515 | 0.725 | 0.716 | 0.047 | ||
| 0.921 | 0.948 | 0.650 | 0.836 | 0.739 | 0.499 | 0.464 | 0.711 | 0.721 | 0.089* | ||
| 0.922 | 0.818 | 0.589 | 0.968 | 0.706 | 0.449 | 0.395 | 0.687 | 0.692 | 0.020 | ||
| 0.934 | 0.909 | 0.611 | 0.993 | 0.710 | 0.442 | 0.418 | 0.697 | 0.714 | 0.048 | ||
| - | |||||||||||
| 0.695 | 0.746 | 0.548 | 0.479 | 0.741 | 0.764 | 0.498* | |||||
| 0.946 | 0.929 | 0.639 | 0.991 | 0.521 | 0.537 | 0.739 | 0.758 | 0.465* | |||
| 0.926 | 0.856 | 0.557 | 0.980 | 0.728 | 0.525 | 0.426 | 0.701 | 0.713 | 0.043 |
The nominally best performing classifier on average over all datasets is marked with bold, and P values of methods whose performance cannot be deemed statistically worse than the nominally best performing method are marked with “*”. The accuracy of the nominally best performing method for each dataset is underlined.
Classification accuracy with feature/operational taxonomic unit (OTU) selection, measured by relative classifier information (RCI)
| 0.719 | 0.952 | 0.691 | 0.929 | 0.813 | 0.681 | 0.191* | |||||
| - | |||||||||||
| 0.845 | 0.941 | 0.716 | 0.969 | 0.316 | 0.323 | 0.644 | 0.699 | 0.369* | |||
| 0.813 | 0.925 | 0.683 | 0.972 | 0.813 | 0.286 | 0.290 | 0.611 | 0.674 | 0.089* | ||
| 0.759 | 0.939 | 0.683 | 0.931 | 0.800 | 0.297 | 0.290 | 0.626 | 0.666 | 0.061* | ||
| 0.807 | 0.935 | 0.687 | 0.944 | 0.801 | 0.297 | 0.316 | 0.633 | 0.677 | 0.097* | ||
| 0.830 | 0.779 | 0.657 | 0.939 | 0.736 | 0.168 | 0.251 | 0.510 | 0.609 | 0.015 | ||
| 0.774 | 0.744 | 0.625 | 0.884 | 0.736 | 0.153 | 0.224 | 0.522 | 0.583 | 0.008 | ||
| 0.829 | 0.773 | 0.652 | 0.914 | 0.736 | 0.179 | 0.221 | 0.531 | 0.604 | 0.014 | ||
| 0.726 | 0.798 | 0.629 | 0.907 | 0.730 | 0.167 | 0.227 | 0.516 | 0.587 | 0.012 | ||
| 0.772 | 0.941 | 0.670 | 0.964 | 0.778 | 0.161 | 0.236 | 0.628 | 0.644 | 0.027 | ||
| 0.780 | 0.940 | 0.692 | 0.837 | 0.811 | 0.234 | 0.257 | 0.612 | 0.645 | 0.034 | ||
| 0.742 | 0.836 | 0.642 | 0.934 | 0.771 | 0.183 | 0.213 | 0.584 | 0.613 | 0.011 | ||
| 0.786 | 0.914 | 0.696 | 0.985 | 0.784 | 0.166 | 0.238 | 0.598 | 0.646 | 0.033 | ||
| 0.840 | 0.952 | 0.712 | 0.982 | 0.819 | 0.266 | 0.213 | 0.648 | 0.679 | 0.179* | ||
| 0.842 | 0.714 | 0.810 | 0.264 | 0.216 | 0.649 | 0.681 | 0.196* | ||||
| 0.822 | 0.932 | 0.692 | 0.982 | 0.824 | 0.317 | 0.318 | 0.640 | 0.691 | 0.313* | ||
| 0.761 | 0.855 | 0.625 | 0.968 | 0.770 | 0.208 | 0.202 | 0.570 | 0.620 | 0.018 |
The nominally best performing classifier on average over all datasets is marked with bold, and P values of methods whose performance cannot be deemed statistically worse than the nominally best performing method are marked with “*”. The accuracy of the nominally best performing method for each dataset is underlined.
Number of features/operational taxonomic units (OTUs) selected on average across ten data splits and ten cross-validation training sets
| 6979 | 259 | 1191 | 20 | 50 | 285 | 1359 | |
| 2543 | 469 | 370 | 215 | 805 | 474 | 370 | |
| 4793 | 896 | 935 | 51 | 211 | 1166 | 1262 | |
| 1217 | 9 | 8 | 8 | 8 | 9 | 8 | |
| 1217 | 101 | 128 | 38 | 108 | 126 | 142 | |
| 13503 | 453 | 416 | 37 | 204 | 1223 | 1276 | |
| 74018 | 5127 | 8308 | 136 | 492 | 8164 | 7400 | |
| 74018 | 2633 | 4347 | 89 | 687 | 4568 | 12188 | |