Isabel F Escapa1,2,3, Yanmei Huang1,2, Tsute Chen1,2, Maoxuan Lin1, Alexis Kokaras1, Floyd E Dewhirst1,2, Katherine P Lemon4,5,6,7.
Abstract
BACKGROUND: The low cost of 16S rRNA gene sequencing facilitates population-scale molecular epidemiological studies. Existing computational algorithms can resolve 16S rRNA gene sequences into high-resolution amplicon sequence variants (ASVs), which represent consistent labels comparable across studies. Assigning these ASVs to species-level taxonomy strengthens the ecological and/or clinical relevance of 16S rRNA gene-based microbiota studies and further facilitates data comparison across studies.
Keywords: 16S rRNA gene; Aerodigestive tract; Habitat-specific database; Microbiome; Nasal; Naïve Bayesian RDP Classifier; Species-level taxonomy; Training set; V1–V3; eHOMD
Year: 2020 PMID: 32414415 PMCID: PMC7291764 DOI: 10.1186/s40168-020-00841-w
Source DB: PubMed Journal: Microbiome ISSN: 2049-2618 Impact factor: 14.650
Fig. 1 Relationships between the datasets, databases, and training sets in constructing training sets for a specific habitat: the human aerodigestive tract. a Datasets gathered from public repositories or obtained by sequencing of new samples are used to explore the 16S rRNA gene diversity of the habitat of interest. These include both 16S rRNA full-length sequences and region-specific short-read sequences used for method validation or benchmarking. b A curated habitat-specific full-length 16S rRNA gene reference database is assembled and expanded in an iterative way by selecting from those datasets representative sequences for both named and as-yet unnamed or uncultivated species (i.e., HMTs in eHOMD), and placing them in a phylogenetic tree (see Figure 1 in [20]). c Training sets are derived from the taxonomic hierarchy of the habitat-specific database and enhanced by the following steps: compiling multiple 16S rRNA gene sequences to represent the natural sequence variability of each taxon, trimming the training set to match the sequenced region(s), and placing species sharing closely related sequences into a supraspecies taxonomic level. Datasets in gray are the specific examples used for the construction of the eHOMD-derived training sets described here. Solid arrows indicate where the sequences described come from and dotted arrows indicate when datasets were used for validation or benchmarking.
Fig. 2 Schematic representation of the steps to generate sequential habitat-specific training sets. a The FL_eHOMDrefs_TS training set contains all full-length eHOMDrefs (thick lines) from eHOMDv15.1 together with their respective taxonomic assignment. When only one read represents each taxon (M = 1), a given distinguishing k-mer (green fragment) can only be either present (1) or absent (0). b A higher number of sequences per taxon (M) allows for better resolution in the assignment, with the presence of a given distinguishing k-mer (w) across each cluster of reads (green fragments) being represented as a proportion (m) out of the total number of reads in that taxon (M). Therefore, to better represent the known sequence diversity of the 16S rRNA gene(s) for each taxon, the training set FL_Compilation_TS includes clusters of sequences (thin lines) recovered from the NCBI nonredundant nucleotide (nr/nt) database that matched with 99% identity and ≥ 98% coverage (see methods) to each eHOMDref (thick line). c The training set V1V3_Raw_TS is a V1–V3 trimmed version of the FL_Compilation_TS training set. The schematic illustrates how trimming to this region leads to identical reads (purple lines) having two different taxonomic designations. Here, G is genus and species are labeled as A or B. d To construct the V1V3_Curated_TS training set, identical V1–V3 sequences in the V1V3_Raw_TS training set were collapsed into one. If identical sequences came from more than one taxon (purple), the species-level names of all taxa involved were concatenated (AB). e The V1V3_Supraspecies_TS training set includes the same sequences as the V1V3_Raw_TS training set; however, the headers in the fasta file include the supraspecies taxon (AB) as an extra level between the genus (G) and species taxonomic levels (A, B, or AB), as illustrated here.
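The collapsing and supraspecies steps in panels d and e can be sketched in Python. This is a minimal illustration, not the authors' pipeline: the function names, the `_` separator used to concatenate species names, the `genus;supraspecies;species` lineage format, and the toy sequences are all assumptions made for the example. It also assumes that identical sequences always come from a single genus and that species outside any shared group use their own name at the supraspecies level.

```python
from collections import defaultdict


def collapse_identical(records):
    """Collapse identical trimmed sequences into one entry each.

    records: iterable of (genus, species, sequence) tuples.
    Returns {sequence: (genus, tuple_of_sorted_species_names)}.
    """
    species_by_seq = defaultdict(set)
    genus_by_seq = {}
    for genus, species, seq in records:
        species_by_seq[seq].add(species)
        # Simplifying assumption: sequences shared across taxa stay within one genus.
        genus_by_seq[seq] = genus
    return {
        seq: (genus_by_seq[seq], tuple(sorted(names)))
        for seq, names in species_by_seq.items()
    }


def build_groups(collapsed, sep="_"):
    """Map each species (and each concatenated label) to its supraspecies label."""
    group_of = {}
    for _genus, names in collapsed.values():
        if len(names) > 1:
            label = sep.join(names)
            for name in names + (label,):
                group_of[name] = label
    return group_of


def supraspecies_lineage(genus, species, group_of):
    """Insert the supraspecies level between genus and species."""
    supra = group_of.get(species, species)  # assumption: ungrouped species reuse their name
    return f"{genus};{supra};{species}"


# Toy records: species A and B share one trimmed sequence, as in panel c.
records = [
    ("G", "A", "ACGT"),
    ("G", "B", "ACGT"),
    ("G", "A", "ACGA"),
    ("G", "C", "TTTT"),
]
collapsed = collapse_identical(records)
group_of = build_groups(collapsed)
lineage_a = supraspecies_lineage("G", "A", group_of)
lineage_c = supraspecies_lineage("G", "C", group_of)
```

The sketch does not merge overlapping groups (for example, A shares a sequence with B, and B with C); handling that case would need an explicit clustering step.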
Fig. 3 The FL_Compilation_TS training set provides higher classification percentages with a lower error rate. a The percentage of eHOMD-derived simulated reads classified using the FL_eHOMDrefs_TS training set (purple) versus the FL_Compilation_TS training set (orange). b The percentage of classified reads that were misclassified (i.e., reads for which the assigned taxonomic identity was different from the known identity of the original sequence from which the simulated read was derived). The naïve Bayesian RDP Classifier was used with bootstrap values ranging from 50 to 100.
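The two quantities plotted in panels a and b can be computed as in the following Python sketch. It assumes each simulated read is summarized as a hypothetical `(true_taxon, assigned_taxon, bootstrap)` tuple and that a read counts as classified when its bootstrap value is at or above the chosen threshold; the function name and the toy values are illustrative only and are not data from the study.

```python
def classification_rates(results, threshold):
    """Return (% of reads classified, % of classified reads misclassified).

    results: list of (true_taxon, assigned_taxon, bootstrap) tuples.
    threshold: minimum bootstrap value for a read to count as classified.
    """
    if not results:
        return 0.0, 0.0
    classified = [(true, assigned) for true, assigned, boot in results if boot >= threshold]
    pct_classified = 100.0 * len(classified) / len(results)
    if not classified:
        return pct_classified, 0.0
    wrong = sum(1 for true, assigned in classified if true != assigned)
    # Misclassification is expressed relative to classified reads, not all reads.
    return pct_classified, 100.0 * wrong / len(classified)


# Toy results (invented for illustration).
toy_results = [
    ("A", "A", 100),
    ("A", "B", 60),
    ("B", "B", 80),
    ("C", "C", 40),
]
rates_at_50 = classification_rates(toy_results, 50)
rates_at_70 = classification_rates(toy_results, 70)
```

In this toy case, raising the threshold from 50 to 70 removes the one wrong assignment but also leaves more reads unclassified, which is the trade-off the figure explores.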
Fig. 4 Trimming the training set to the specific sequenced region further reduces the error rate. a The percentage of eHOMD-derived simulated reads classified at species level using the FL_Compilation_TS (orange) training set compared to subsequent trimmed versions V1V3_Raw_TS (green) and V1V3_Curated_TS (red). b The percentage of classified reads that were misclassified with each of these three training sets. c This graph, which is specific to the eHOMD training set construction (V1V3_eHOMDSim_250N100 dataset), indicates how researchers can choose the bootstrap value to use with the naïve Bayesian RDP Classifier by deciding on an acceptable percentage of misclassified reads (blue line; e.g., 0.5%) and/or of unclassified reads (red line). The naïve Bayesian RDP Classifier was used with bootstrap values ranging from 50 to 100.
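The selection described in panel c can be sketched in Python as follows. The curve values below are invented placeholders, not numbers read from the figure, and the function name `pick_bootstrap` is hypothetical. The sketch assumes that the percentage of unclassified reads generally rises with the bootstrap value, so the lowest threshold that meets the misclassification tolerance is preferred.

```python
def pick_bootstrap(curve, max_misclassified_pct):
    """Return the lowest bootstrap value whose misclassification is within tolerance.

    curve: {bootstrap: (pct_unclassified, pct_misclassified)}.
    Returns None when no bootstrap value meets the tolerance.
    """
    for bootstrap in sorted(curve):
        _pct_unclassified, pct_misclassified = curve[bootstrap]
        if pct_misclassified <= max_misclassified_pct:
            return bootstrap
    return None


# Placeholder benchmark curve (invented values for illustration only).
toy_curve = {
    50: (2.0, 1.2),
    60: (3.0, 0.8),
    70: (4.5, 0.5),
    80: (7.0, 0.3),
}
chosen = pick_bootstrap(toy_curve, 0.5)
too_strict = pick_bootstrap(toy_curve, 0.1)
```

A tolerance on unclassified reads could be added as a second condition in the same loop if both criteria matter.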
Fig. 5 Addition of a supraspecies level to the training set increases the percentage of classified reads. a The percentage of eHOMD-derived simulated reads classified at species/supraspecies level using the V1V3_Curated_TS training set (red) versus the V1V3_Supraspecies_TS training set (blue). b The percentage of classified reads that were misclassified with each of these training sets. The naïve Bayesian RDP Classifier was used with bootstrap values ranging from 50 to 100.
The eHOMD training sets performed well for species/supraspecies taxonomy assignment of an independent long-read 16S rRNA gene sinonasal dataset
| Approach | # ASVs | # Reads | % ASVs | % Reads |
|---|---|---|---|---|
| NCBI 16S Microbial blastn | 178 | 140,455 | 87.3 | 97.3 |
| eHOMD blastn | 188 | 143,274 | 92.2 | 99.3 |
| eHOMD FL_Compilation_TS | 194 | 142,843 | 95.1 | 99.0 |
| eHOMD V1V3_Supraspecies_TS | 201 | 144,279 | 98.5 | 99.9 |
The % ASVs and % Reads columns indicate the percentage of ASVs and reads, respectively, assigned subgenus-level taxonomy using each of the four approaches tested.
The eHOMD training set is superior for assigning species/supraspecies-level taxonomy to short- and long-read human aerodigestive tract datasets
| Training set | Level | V1V3_hADT_CL (% reads) | V1V3_HMPnares_ASV^b (% ASVs) | V1V3_HMPnares_ASV^b (% reads) | FL_sinonasal_SMRT_ASV (% ASVs) | FL_sinonasal_SMRT_ASV (% reads) |
|---|---|---|---|---|---|---|
| eHOMD | Genus | 100.0 | 95.5 | 98.9 | 99.5 | 100.0 |
| eHOMD | Species | 100.0 | 93.9 | 98.5 | 95.1 | 99.0 |
| SILVA | Genus | 96.1 | 94.7 | 97.6 | 96.6 | 98.9 |
| SILVA | Species | 44.7^a | 4.1^a | 29.9^a | 18.6^a | 71.9^a |
| RDP | Genus | 93.2 | 90.2 | 92.2 | 94.1 | 98.5 |
| RDP | Species | 38.5^a | 3.1^a | 27.5^a | 13.2^a | 60.6^a |
^a Exact match algorithm.
^b ASVs derived from the HMP nares V1–V3 dataset, as described in [20], constitute the V1V3_HMPnares_ASV dataset (Additional file 10).