Nicholas A Bokulich1, Benjamin D Kaehler2, Jai Ram Rideout3, Matthew Dillon3, Evan Bolyen3, Rob Knight4, Gavin A Huttley5, J Gregory Caporaso6,7.
Abstract
BACKGROUND: Taxonomic classification of marker-gene sequences is an important step in microbiome analysis.
Year: 2018 PMID: 29773078 PMCID: PMC5956843 DOI: 10.1186/s40168-018-0470-z
Source DB: PubMed Journal: Microbiome ISSN: 2049-2618 Impact factor: 14.650
Mock communities currently integrated in tax-credit
| Study ID (a) | Target gene (b) | Platform | Species | Strains | Citation |
|---|---|---|---|---|---|
| mock-1 | 16S | HiSeq | 46 | 48 | [ |
| mock-2 | 16S | MiSeq | 46 | 48 | [ |
| mock-3 | 16S | MiSeq | 21 | 21 | [ |
| mock-4 | 16S | MiSeq | 21 | 21 | [ |
| mock-5 | 16S | MiSeq | 21 | 21 | [ |
| mock-7 | 16S | HiSeq | 67 | 67 | [ |
| mock-8 | 16S | HiSeq | 67 | 67 | [ |
| mock-9 | ITS | HiSeq | 13 | 16 | [ |
| mock-10 | ITS | HiSeq | 13 | 16 | [ |
| mock-12 | 16S | MiSeq | 26 | 27 | [ |
| mock-16 | 16S | MiSeq | 56 | 59 | [ |
| mock-18 | 16S | MiSeq | 15 | 15 | [ |
| mock-19 | 16S | MiSeq | 15 | 27 | [ |
| mock-20 | 16S | MiSeq | 20 | 20 | [ |
| mock-21 | 16S | MiSeq | 20 | 20 | [ |
| mock-22 | 16S | MiSeq | 20 | 20 | [ |
| mock-23 | 16S | MiSeq | 20 | 20 | [ |
| mock-24 | ITS | MiSeq | 8 | 8 | [ |
| mock-26 | ITS | FLX Titanium | 11 | 11 | [ |
(a) All studies are available on mockrobiota [14] at https://github.com/caporaso-lab/mockrobiota/tree/master/data/[studyID]
(b) Abbreviations: 16S, 16S rRNA gene; HiSeq, Illumina HiSeq; MiSeq, Illumina MiSeq
Fig. 1 Classifier performance on mock community datasets for 16S rRNA gene sequences (left column) and fungal ITS sequences (right column). a Average F-measure for each taxonomy classification method (averaged across all configurations and all mock community datasets) from class to species level. Error bars = 95% confidence intervals. b Average F-measure for each optimized classifier (averaged across all mock communities) at species level. c Average taxon accuracy rate for each optimized classifier (averaged across all mock communities) at species level. d Average Bray-Curtis distance between the expected mock community composition and its composition as predicted by each optimized classifier (averaged across all mock communities) at species level. Violin plots show median (white point), quartiles (black bars), and kernel density estimation (violin) for each score distribution. Violins with different lower-case letters have significantly different means (paired t test, false discovery rate-corrected P < 0.05)
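The Bray-Curtis distance used in panel d can be illustrated with a minimal sketch. It assumes each composition is a mapping from taxon label to abundance; the function name and the toy profiles are hypothetical and are not data or code from the study.

```python
def bray_curtis(expected, observed):
    """Bray-Curtis dissimilarity between two abundance profiles.

    Both arguments map a taxon label to a non-negative abundance.
    A taxon missing from one profile is treated as abundance 0.
    """
    taxa = set(expected) | set(observed)
    numerator = sum(abs(expected.get(t, 0.0) - observed.get(t, 0.0)) for t in taxa)
    denominator = sum(expected.get(t, 0.0) + observed.get(t, 0.0) for t in taxa)
    if denominator == 0:
        raise ValueError("both profiles are empty")
    return numerator / denominator


# Hypothetical toy profiles (relative abundances), not taken from any mock community.
expected = {"g__Bacillus": 0.5, "g__Escherichia": 0.5}
observed = {"g__Bacillus": 0.5, "g__Escherichia": 0.25, "Unassigned": 0.25}
distance = bray_curtis(expected, observed)
print(distance)
```

A distance of 0 means the predicted composition equals the expected one, and 1 means the two profiles share no taxa, so lower values indicate a better classifier in panel d.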
Fig. 2 Classifier performance on cross-validated sequence datasets. Classification accuracy of 16S rRNA gene V4 subdomain (first row), V1–3 subdomain (second row), full-length 16S rRNA gene (third row), and fungal ITS sequences (fourth row). a Average F-measure for each taxonomy classification method (averaged across all configurations and all cross-validated sequence datasets) from class to species level. Error bars = 95% confidence intervals. b Average F-measure for each optimized classifier (averaged across all cross-validated sequence datasets) at species level. Violins with different lower-case letters have significantly different means (paired t test, false discovery rate-corrected P < 0.05). c Correlation between F-measure performance for each method/configuration classification of V4 subdomain (x axis), V1–3 subdomain (y axis), and full-length 16S rRNA gene sequences (z axis). Inset lists the Pearson R² value for each pairwise correlation; each correlation is significant (P < 0.001)
Fig. 3 Classifier performance on novel-taxa simulated sequence datasets for 16S rRNA gene sequences (left column) and fungal ITS sequences (right column). a–f, Average F-measure (a), precision (b), recall (c), overclassification (d), underclassification (e), and misclassification (f) for each taxonomy classification method (averaged across all configurations and all novel taxa sequence datasets) from phylum to species level. Error bars = 95% confidence intervals. b Average F-measure for each optimized classifier (averaged across all novel taxa sequence datasets) at species level. Violins with different lower-case letters have significantly different means (paired t test, false discovery rate-corrected P < 0.05)
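The over-, under-, and misclassification categories in panels d–f can be illustrated on semicolon-delimited lineages. The sketch below assumes a simplified reading of these terms: a prediction that follows the expected lineage but continues to a deeper rank is an overclassification, one that follows it but stops early is an underclassification, and one that departs from it at any shared rank is a misclassification. The function name, the lineage format, and these exact rules are assumptions for illustration, not the benchmark's implementation.

```python
def classify_outcome(expected, predicted):
    """Label a predicted lineage relative to the expected lineage.

    Lineages are semicolon-delimited strings ordered from shallow to deep rank.
    """
    exp = [rank.strip() for rank in expected.split(";") if rank.strip()]
    pred = [rank.strip() for rank in predicted.split(";") if rank.strip()]
    shared = min(len(exp), len(pred))
    # Any disagreement within the ranks both lineages cover is a wrong lineage.
    if exp[:shared] != pred[:shared]:
        return "misclassification"
    if len(pred) > len(exp):
        return "overclassification"
    if len(pred) < len(exp):
        return "underclassification"
    return "match"


# Hypothetical expected lineage, truncated at class level.
expected = "k__Bacteria;p__Firmicutes;c__Bacilli"
outcomes = {
    "same": classify_outcome(expected, "k__Bacteria;p__Firmicutes;c__Bacilli"),
    "deeper": classify_outcome(expected, "k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales"),
    "shallower": classify_outcome(expected, "k__Bacteria;p__Firmicutes"),
    "different": classify_outcome(expected, "k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria"),
}
print(outcomes)
```

Under these rules an empty prediction counts as an underclassification, which is one possible convention for unassigned sequences.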
Fig. 4 Classification accuracy comparison between mock community, cross-validated, and novel taxa evaluations. Scatterplots show mean F-measure scores for each method configuration, averaged across all samples, for classification of 16S rRNA genes at genus level (a) and species level (b), and fungal ITS sequences at genus level (c) and species level (d)
Optimized method configurations for standard operating conditions
| Target | Condition | Method | Parameters | Mock F | Mock P | Mock R | Cross-validated F | Cross-validated P | Cross-validated R | Novel taxa F | Novel taxa P | Novel taxa R | Threshold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16S rRNA gene | Balanced | NB-bespoke | [6,6]:0.9 | 0.705 | 0.98 | 0.582 | 0.827 | 0.931 | 0.744 | 0.165 | 0.243 | 0.125 | |
| | | | [6,6]:0.92 | 0.705 | 0.98 | 0.581 | 0.825 | 0.936 | 0.737 | 0.165 | 0.251 | 0.123 | |
| | | | [6,6]:0.94 | 0.703 | 0.98 | 0.579 | 0.822 | 0.942 | 0.729 | 0.162 | 0.259 | 0.118 | |
| | | | [7,7]:0.92 | 0.712 | 0.978 | 0.592 | 0.831 | 0.931 | 0.751 | 0.151 | 0.221 | 0.115 | |
| | | | [7,7]:0.94 | 0.708 | 0.978 | 0.586 | 0.829 | 0.936 | 0.743 | 0.157 | 0.239 | 0.117 | |
| | | Naive-Bayes | [7,7]:0.7 | 0.495 | 0.797 | 0.38 | 0.819 | 0.886 | 0.761 | 0.115 | 0.138 | 0.099 | |
| | | rdp | 0.6 | 0.564 | 0.798 | 0.457 | 0.815 | 0.868 | 0.768 | 0.102 | 0.128 | 0.084 | |
| | | | 0.7 | 0.55 | 0.799 | 0.438 | 0.812 | 0.892 | 0.746 | 0.124 | 0.173 | 0.096 | |
| | | Uclust | 0.51:0.9:3 | 0.498 | 0.746 | 0.392 | 0.846 | 0.876 | 0.817 | 0.154 | 0.201 | 0.126 | |
| | Precision | NB-bespoke | [6,6]:0.98 | 0.676 | 0.987 | 0.537 | 0.803 | 0.956 | 0.692 | 0.163 | 0.303 | 0.111 | |
| | | | [7,7]:0.98 | 0.687 | 0.98 | 0.551 | 0.815 | 0.951 | 0.713 | 0.164 | 0.283 | 0.115 | |
| | | rdp | 1 | 0.239 | 0.941 | 0.16 | 0.632 | 0.968 | 0.469 | 0.12 | 0.457 | 0.069 | |
| | Recall | NB-bespoke | [12,12]:0.5 | 0.754 | 0.8 | 0.721 | 0.815 | 0.83 | 0.801 | 0.053 | 0.058 | 0.049 | |
| | | | [14,14]:0.5 | 0.758 | 0.802 | 0.726 | 0.811 | 0.826 | 0.797 | 0.052 | 0.057 | 0.048 | |
| | | | [16,16]:0.5 | 0.755 | 0.785 | 0.732 | 0.808 | 0.825 | 0.792 | 0.052 | 0.058 | 0.047 | |
| | | | [18,18]:0.5 | 0.772 | 0.803 | 0.748 | 0.805 | 0.823 | 0.789 | 0.055 | 0.061 | 0.05 | |
| | | | [32,32]:0.5 | 0.937 | 0.966 | 0.913 | 0.788 | 0.818 | 0.76 | 0.054 | 0.067 | 0.045 | |
| | | Naive-Bayes | [11,11]:0.5 | 0.567 | 0.77 | 0.479 | 0.793 | 0.82 | 0.768 | 0.059 | 0.065 | 0.055 | |
| | | | [12,12]:0.5 | 0.567 | 0.769 | 0.479 | 0.79 | 0.816 | 0.765 | 0.059 | 0.064 | 0.055 | |
| | | | [18,18]:0.5 | 0.564 | 0.764 | 0.477 | 0.779 | 0.807 | 0.753 | 0.057 | 0.063 | 0.051 | |
| | | rdp | 0.5 | 0.577 | 0.791 | 0.48 | 0.816 | 0.848 | 0.787 | 0.068 | 0.079 | 0.06 | |
| | Novel | Blast+ | 10:0.51:0.8 | 0.436 | 0.723 | 0.325 | 0.816 | 0.896 | 0.749 | 0.225 | 0.332 | 0.171 | |
| | | Uclust | 0.76:0.9:5 | 0.467 | 0.775 | 0.348 | 0.84 | 0.938 | 0.76 | 0.219 | 0.358 | 0.158 | |
| | | VSEARCH | 10:0.51:0.8 | 0.45 | 0.74 | 0.342 | 0.814 | 0.891 | 0.75 | 0.226 | 0.333 | 0.171 | |
| | | | 10:0.51:0.9 | 0.45 | 0.74 | 0.342 | 0.82 | 0.896 | 0.755 | 0.219 | 0.338 | 0.162 | |
| Fungi | Balanced | Naive-Bayes | [6,6]:0.94 | 0.874 | 0.935 | 0.827 | 0.481 | 0.57 | 0.416 | 0.374 | 0.438 | 0.327 | |
| | | | [6,6]:0.96 | 0.874 | 0.935 | 0.827 | 0.495 | 0.597 | 0.423 | 0.399 | 0.473 | 0.344 | |
| | | | [6,6]:0.98 | 0.874 | 0.935 | 0.827 | 0.505 | 0.629 | 0.423 | 0.426 | 0.52 | 0.361 | |
| | | | [7,7]:0.98 | 0.874 | 0.935 | 0.827 | 0.485 | 0.596 | 0.409 | 0.388 | 0.47 | 0.33 | |
| | | NB-bespoke | [6,6]:0.94 | 0.928 | 0.968 | 0.915 | 0.48 | 0.567 | 0.416 | 0.371 | 0.433 | 0.325 | |
| | | | [6,6]:0.96 | 0.928 | 0.968 | 0.915 | 0.491 | 0.59 | 0.42 | 0.393 | 0.466 | 0.34 | |
| | | | [6,6]:0.98 | 0.927 | 0.97 | 0.913 | 0.504 | 0.624 | 0.422 | 0.421 | 0.512 | 0.358 | |
| | | | [7,7]:0.98 | 0.935 | 0.97 | 0.921 | 0.487 | 0.596 | 0.412 | 0.386 | 0.466 | 0.329 | |
| | | rdp | 0.7 | 0.929 | 0.939 | 0.922 | 0.479 | 0.572 | 0.413 | 0.382 | 0.451 | 0.332 | |
| | | | 0.8 | 0.924 | 0.939 | 0.915 | 0.507 | 0.633 | 0.422 | 0.434 | 0.534 | 0.366 | |
| | | | 0.9 | 0.922 | 0.937 | 0.913 | 0.517 | 0.698 | 0.411 | 0.47 | 0.617 | 0.379 | |
| | Precision | Naive-Bayes | [6,6]:0.98 | 0.874 | 0.935 | 0.827 | 0.505 | 0.629 | 0.423 | 0.426 | 0.52 | 0.361 | |
| | | NB-bespoke | [6,6]:0.98 | 0.927 | 0.97 | 0.913 | 0.504 | 0.624 | 0.422 | 0.421 | 0.512 | 0.358 | |
| | | rdp | 0.8 | 0.924 | 0.939 | 0.915 | 0.507 | 0.633 | 0.422 | 0.434 | 0.534 | 0.366 | |
| | | | 0.9 | 0.922 | 0.937 | 0.913 | 0.517 | 0.698 | 0.411 | 0.47 | 0.617 | 0.379 | |
| | | | 1 | 0.821 | 0.943 | 0.742 | 0.461 | 0.81 | 0.322 | 0.459 | 0.774 | 0.327 | |
| | Recall | NB-bespoke | [6,6]:0.92 | 0.938 | 0.971 | 0.924 | 0.467 | 0.544 | 0.409 | 0.353 | 0.407 | 0.312 | |
| | | | [6,6]:0.94 | 0.928 | 0.968 | 0.915 | 0.48 | 0.567 | 0.416 | 0.371 | 0.433 | 0.325 | |
| | | | [6,6]:0.96 | 0.928 | 0.968 | 0.915 | 0.491 | 0.59 | 0.42 | 0.393 | 0.466 | 0.34 | |
| | | | [6,6]:0.98 | 0.927 | 0.97 | 0.913 | 0.504 | 0.624 | 0.422 | 0.421 | 0.512 | 0.358 | |
| | | | [7,7]:0.96 | 0.935 | 0.969 | 0.921 | 0.47 | 0.56 | 0.404 | 0.357 | 0.422 | 0.31 | |
| | | | [7,7]:0.98 | 0.935 | 0.97 | 0.921 | 0.487 | 0.596 | 0.412 | 0.386 | 0.466 | 0.329 | |
| | | rdp | 0.7 | 0.929 | 0.939 | 0.922 | 0.479 | 0.572 | 0.413 | 0.382 | 0.451 | 0.332 | |
| | | | 0.8 | 0.924 | 0.939 | 0.915 | 0.507 | 0.633 | 0.422 | 0.434 | 0.534 | 0.366 | |
| | | | 0.9 | 0.922 | 0.937 | 0.913 | 0.517 | 0.698 | 0.411 | 0.47 | 0.617 | 0.379 | |
| | Novel | Naive-Bayes | [6,6]:0.98 | 0.874 | 0.935 | 0.827 | 0.505 | 0.629 | 0.423 | 0.426 | 0.52 | 0.361 | |
| | | NB-bespoke | [6,6]:0.98 | 0.927 | 0.97 | 0.913 | 0.504 | 0.624 | 0.422 | 0.421 | 0.512 | 0.358 | |
| | | rdp | 0.8 | 0.923 | 0.939 | 0.915 | 0.507 | 0.633 | 0.422 | 0.434 | 0.534 | 0.366 | |
| | | | 0.9 | 0.921 | 0.937 | 0.913 | 0.517 | 0.698 | 0.411 | 0.47 | 0.617 | 0.379 | |
(a) F, F-measure; P, precision; R, recall
(b) Naive Bayes parameters: k-mer range, confidence
(c) RDP parameters: confidence
(d) BLAST+/VSEARCH parameters: max accepts, minimum consensus, minimum percent identity
(e) UCLUST parameters: minimum consensus, similarity, max accepts
(f) Threshold describes the score cut-offs used to define optimal method ranges, in the following format: [metric = (mock score, cross-validated score, novel-taxa score)]. If two cut-offs are given, the second indicates a higher cut-off used to select parameters for the developmental NB-bespoke method, and the configurations listed are the union of the two cut-offs: the second cut-off for selecting NB-bespoke, the first for selecting all other methods
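The three scores abbreviated in the first footnote are related by the standard definition of F-measure as the harmonic mean of precision and recall, F = 2PR / (P + R). A minimal sketch follows; the function name is illustrative, and the inputs are the cross-validated and mock P and R values from the first 16S rRNA gene row of the table.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Cross-validated P and R from the first 16S rRNA gene row (NB-bespoke, [6,6]:0.9).
cross_validated_f = f_measure(0.931, 0.744)
# Mock community P and R from the same row.
mock_f = f_measure(0.98, 0.582)

print(round(cross_validated_f, 3))
print(round(mock_f, 3))
```

The cross-validated value rounds to the tabulated 0.827. The mock value rounds to 0.73 rather than the tabulated 0.705, presumably because the tabulated scores are averages over datasets, and the F-measure of averaged P and R need not equal the average of per-dataset F-measures.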
Fig. 5 Runtime performance comparison of taxonomy classifiers. Runtime (s) for each taxonomy classifier when varying the number of query sequences against a constant 10,000 reference sequences (a) or varying the number of reference sequences with a single query sequence (b)
Naive Bayes broad grid search parameters
| Step | Parameter | Values |
|---|---|---|
| sklearn.feature_extraction.text.HashingVectorizer | n_features | 1024, 8192, 65536 |
| | ngram_range | [4,4], [8,8], [16,16], [4,16] |
| sklearn.feature_extraction.text.TfidfTransformer | norm | l1, l2, None |
| | use_idf | True, False |
| sklearn.naive_bayes.MultinomialNB | alpha | 0.001, 0.01, 0.1 |
| | class_prior | None, array of class weights |
| post processing | confidence | 0, 0.2, 0.4, 0.6, 0.8 |
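The grid in this table can be enumerated as a Cartesian product of the listed values. The sketch below uses only the Python standard library and assumes that every combination is evaluated, which the table itself does not state. The `step__parameter` key style mirrors the scikit-learn pipeline convention, but the step names `vectorizer`, `tfidf`, and `classifier`, and the `"weights"` placeholder standing in for an array of class weights, are hypothetical.

```python
from itertools import product

# Values copied from the grid search table; key names are illustrative.
GRID = {
    "vectorizer__n_features": [1024, 8192, 65536],
    "vectorizer__ngram_range": [(4, 4), (8, 8), (16, 16), (4, 16)],
    "tfidf__norm": ["l1", "l2", None],
    "tfidf__use_idf": [True, False],
    "classifier__alpha": [0.001, 0.01, 0.1],
    # Placeholder: "weights" stands in for an array of class weights.
    "classifier__class_prior": [None, "weights"],
    # Confidence is applied after prediction, not passed to an estimator.
    "confidence": [0, 0.2, 0.4, 0.6, 0.8],
}


def iter_configurations(grid):
    """Yield one dict per combination of parameter values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


configurations = list(iter_configurations(GRID))
print(len(configurations))
```

Under the full-product assumption the grid contains 3 × 4 × 3 × 2 × 3 × 2 × 5 = 2160 configurations, which gives a sense of why a broad search like this is run once and then narrowed.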