Adélaïde Roguet, A Murat Eren, Ryan J Newton, Sandra L McLellan.
Abstract
BACKGROUND: Clostridiales and Bacteroidales are uniquely adapted to the gut environment and have co-evolved with their hosts, resulting in convergent microbiome patterns within mammalian species. As a result, members of Clostridiales and Bacteroidales are particularly suitable for identifying sources of fecal contamination in environmental samples. However, a comprehensive evaluation of their predictive power and the development of computational approaches are lacking. Given the global public health concern for waterborne disease, accurate identification of fecal pollution sources is essential for effective risk assessment and management. Here, we use a random forest algorithm and 16S rRNA gene amplicon sequences assigned to Clostridiales and Bacteroidales to identify common fecal pollution sources. We benchmarked the accuracy, consistency, and sensitivity of our classification approach using fecal, environmental, and artificial in silico generated samples.
Keywords: 16S rRNA gene; Bacteroidales; Clostridiales; High-throughput sequencing; Microbial source tracking; Random forest classification
Year: 2018 PMID: 30336775 PMCID: PMC6194674 DOI: 10.1186/s40168-018-0568-3
Source DB: PubMed Journal: Microbiome ISSN: 2049-2618 Impact factor: 14.650
Fig. 1 Design of the classifiers. (a) Determination of the bacterial group profiles for each of the source samples. (b) Analysis of the bacterial profile discrepancies between the samples that belong to the targeted source and all other samples using random forest (one classifier per source). (c) Selection of the relevant representative sequences based on mean decrease Gini values to form the classifier. Classifiers are then trained using the selected representative sequences. Abbreviations: MDG, mean decrease Gini
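Panel (c) ranks sequences by mean decrease Gini. As a rough illustration of the quantity being averaged, the sketch below computes the Gini impurity decrease for one split on a single sequence's relative abundance, in a one-vs-rest setting (targeted source versus all other samples). The function names, thresholds, and toy abundances are assumptions made for illustration and are not the authors' implementation; a random forest averages such decreases over many trees and splits.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(abundances, labels, threshold):
    """Impurity decrease when samples are split at `abundance <= threshold`."""
    left = [y for x, y in zip(abundances, labels) if x <= threshold]
    right = [y for x, y in zip(abundances, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted

# Toy data (hypothetical): relative abundance of one sequence in six samples,
# labelled one-vs-rest for a "cow" classifier.
labels = ["cow", "cow", "cow", "other", "other", "other"]
informative = [0.20, 0.15, 0.30, 0.00, 0.01, 0.00]    # abundant only in cow
uninformative = [0.05, 0.00, 0.05, 0.05, 0.00, 0.05]  # same pattern in both

dec_informative = gini_decrease(informative, labels, threshold=0.05)
dec_uninformative = gini_decrease(uninformative, labels, threshold=0.02)
```

In this toy case the informative sequence separates the two groups perfectly and removes all of the impurity, while the uninformative one removes none, which is the kind of contrast the selection step relies on.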
Fig. 2 Schematic flow chart to characterize source-specific fecal contamination within an unknown environmental sample. (a) Determination of the bacterial profile of an environmental sample. (b) Selection of sequences in common with the classifier. (c) Comparison of the relative abundance of the selected sequences with the classifier. An unknown sample is considered contaminated by a source if the relative abundance of the majority of the selected sequences is similar to that in the classifier
Fig. 3 Distribution of the sequences within the bacterial groups. Relative abundance per source of the sequences belonging to the respective source classifier (dark red), to another classifier (light red) or not belonging to any of the classifiers (blue) for the (a) Clostridiales and (b) Bacteroidales groups
Fig. 4 Distribution of amplicon sequence variants (ASVs) selected among the Clostridiales classifiers. (a) Mean and distribution of the number of ASVs belonging to the different classifiers for each source of fecal samples. (b) Heatmap representing the relative abundance of the ASVs selected within the eight classifiers (represented on the horizontal axis) within the samples (listed on the right) used to build the classifiers. Samples were clustered using the UPGMA algorithm based on the Bray–Curtis dissimilarity matrix
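The Bray–Curtis dissimilarity underlying the clustering in panel (b) can be computed as in the minimal sketch below. The sample names and ASV counts are invented for illustration, and the subsequent UPGMA step (average-linkage hierarchical clustering on the resulting matrix) is not shown.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors.

    BC = sum(|u_i - v_i|) / sum(u_i + v_i); 0 means identical profiles,
    1 means the two samples share no ASV.
    """
    if len(u) != len(v):
        raise ValueError("profiles must have the same length")
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0  # convention chosen in this sketch for two empty profiles
    return sum(abs(a - b) for a, b in zip(u, v)) / total

# Hypothetical ASV counts for three samples (same ASV order in each vector).
cow_1 = [10, 0, 5, 0]
cow_2 = [8, 0, 7, 0]
dog_1 = [0, 12, 0, 3]

d_same_source = bray_curtis(cow_1, cow_2)  # small: similar profiles
d_diff_source = bray_curtis(cow_1, dog_1)  # maximal: no ASV in common
```

Samples from the same source yield a small dissimilarity and therefore merge early in an average-linkage tree, whereas samples with no shared ASVs sit at the maximum distance.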
Prediction of the fecal source contamination for animal fecal and sewage samples
| Unknown sample ID | Clostridiales | Bacteroidales |
|---|---|---|
| Cat_PU15 | Cat<sup>a</sup> (88)–Pet<sup>a</sup> (100) | Pet<sup>a</sup> (90) |
| Cow_PU75 | Cow<sup>a</sup> (68)–Ruminant<sup>a</sup> (12) | Cow<sup>a</sup> (93)–Ruminant<sup>a</sup> (97) |
| Deer_PU11 | Deer<sup>a</sup> (93)–Ruminant<sup>a</sup> (16) | Deer<sup>a</sup> (100)–Ruminant<sup>a</sup> (100) |
| Dog_PU12 | Dog<sup>a</sup> (22)–Pet<sup>a</sup> (100) | Dog<sup>a</sup> (64)–Pet<sup>a</sup> (100) |
| Dog_PU17 | Dog<sup>a</sup> (86)–Pet<sup>a</sup> (100) | Cat<sup>a</sup> (7)–Pet<sup>a</sup> (99) |
| Pig_PU156 | Pig<sup>a</sup> (48) | Pig<sup>a</sup> (99) |
| Pig_PU159 | Pig<sup>a</sup> (46) | Pig<sup>a</sup> (99) |
| Cow_PU70&Deer_PU91 | Cow<sup>a</sup> (38)–Deer<sup>a</sup> (29)–Ruminant<sup>a</sup> (10) | Cow<sup>a</sup> (88)–Deer<sup>a</sup> (12)–Ruminant<sup>a</sup> (100) |
| Sewage_Duncansville_161 | Sewage<sup>a</sup> (83) | Sewage<sup>a</sup> (93) |
| Sewage_Duncansville_52 | Sewage<sup>a</sup> (86) | Sewage<sup>a</sup> (97) |
| Sewage_Milwaukee_JI199 | Sewage<sup>a</sup> (85) | Sewage<sup>a</sup> (94) |
| Sewage_Milwaukee_SS200 | Sewage<sup>a</sup> (87) | Sewage<sup>a</sup> (96) |
| Sewage_ReusSpain_224 | Sewage<sup>a</sup> (88) | Sewage<sup>a</sup> (99) |
| Sewage_ReusSpain_80 | Sewage<sup>a</sup> (95) | Sewage<sup>a</sup> (99) |
| OtherSource_Goose_PU126 | – | – |
| OtherSource_Goose_PU97 | Pet<sup>b</sup> (64) | – |
| OtherSource_Rabbit_PU26 | – | – |
| OtherSource_Rabbit_PU27 | – | – |
| OtherSource_Rabbit_PU9 | – | – |
| OtherSource_Raccoon_PU100 | – | Dog<sup>a</sup> (85)–Pet<sup>a</sup> (72) |
| OtherSource_Raccoon_PU101 | Pet<sup>b</sup> (98) | – |
| OtherSource_Raccoon_PU102 | Dog<sup>c</sup> (59)–Pet<sup>a</sup> (71) | – |
| OtherSource_Raccoon_PU52 | – | Dog<sup>c</sup> (91) |
Values in parentheses represent the proportion of sequences that belong to a given classifier among the total number of sequences from all classifiers
<sup>a</sup>Index indicating that the percentage of votes by the trees was above the majority threshold (50%)
<sup>b</sup>Index indicating that the percentage of votes by the trees was between 45 and 50%
<sup>c</sup>Index indicating that the percentage of votes by the trees was between 40 and 45%
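The three footnote indices amount to binning the percentage of trees that voted for a source. A minimal sketch of that binning follows; the function name is hypothetical, and both the handling of exact boundaries (for example, exactly 45%) and the treatment of votes below 40% as "no call" are assumptions of this sketch rather than details stated in the table.

```python
def vote_index(vote_percent):
    """Map the percentage of trees voting for a source to a footnote index.

    Returns "a" (above 50%), "b" (45 to 50%), "c" (40 to 45%), or None when
    the vote is below 40%. Boundary handling is an assumption of this sketch.
    """
    if not 0 <= vote_percent <= 100:
        raise ValueError("vote_percent must be between 0 and 100")
    if vote_percent > 50:
        return "a"
    if vote_percent >= 45:
        return "b"
    if vote_percent >= 40:
        return "c"
    return None

# Hypothetical vote percentages for one sample across four source classifiers.
votes = {"Sewage": 72.0, "Pet": 47.5, "Dog": 41.0, "Cow": 12.0}
calls = {source: vote_index(pct) for source, pct in votes.items()}
```

Here `calls` maps each source to its index, with `None` standing in for the dash shown in the table when no signature is reported.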
Random forest classification of 25 freshwater samples with different levels of fecal contamination
| Environmental sample ID | Type of sample | Major type of contamination | Level of fecal indicator bacteria‡ | Level of qPCR human marker‡‡ | Random forest classification†: Clostridiales | Random forest classification†: Bacteroidales |
|---|---|---|---|---|---|---|
| FMRMN73_092 | Stormwater | HC | High | High | Sewage<sup>a</sup> (98) | Sewage<sup>a</sup> (99) |
| FMRMN73_29 | Stormwater | HC | High | High | Sewage<sup>a</sup> (84) | Sewage<sup>a</sup> (89) |
| FMRHC33_42 | Stormwater | HC | High | Medium | – | Sewage<sup>c</sup> (100) |
| FMRMN60_100 | Stormwater | HC | High | High | – | – |
| FMRMN29_108 | Stormwater | HC | High | High | Sewage<sup>a</sup> (91) | Sewage<sup>a</sup> (95) |
| MKE_162 | River | HC | High | Medium | Sewage<sup>b</sup> (85) | Sewage<sup>a</sup> (98) |
| MNE_163 | River | HC | Medium | Medium | Sewage<sup>b</sup> (57) | Sewage<sup>b</sup> (86) |
| KK_160 | River | HC | Medium | High | Sewage<sup>c</sup> (77) | Sewage<sup>a</sup> (99) |
| MNE_159 | River | HC | Medium | Medium | Sewage<sup>c</sup> (68) | Sewage<sup>b</sup> (98) |
| MKE_158 | River | HC | Medium | Medium | – | Sewage<sup>b</sup> (98) |
| Gap_51 | Harbor | HC | Medium | High | Sewage<sup>a</sup> (82) | Sewage<sup>a</sup> (97) |
| Junction_54 | Harbor | HC | Low | Medium | – | Sewage<sup>b</sup> (78) |
| Gap_55 | Harbor | HC | Low | Medium | Sewage<sup>a</sup> (55) | Sewage<sup>c</sup> (94) |
| Junction_52 | Harbor | HC | Low | Medium | Sewage<sup>c</sup> (64) | – |
| FMRMN53_26 | Stormwater | NHC | High | Inconclusive | – | – |
| SHC12A_10 | Stormwater | NHC | High | Inconclusive | Sewage<sup>c</sup> (90) | – |
| SMN17A_20 | Stormwater | NHC | High | Inconclusive | – | Sewage<sup>c</sup> (100) |
| FMRHC43_43 | Stormwater | NHC | High | Not detected | – | – |
| FMRHAC22_38 | Stormwater | NHC | Medium | Not detected | – | – |
| Gap_53 | Harbor | NHC | Low | Not detected | – | Sewage<sup>b</sup> (99) |
| 1_mile | Lake | NC | Not detected | Not tested | – | – |
| 2_miles | Lake | NC | Not detected | Not tested | – | – |
| DocIn_155 | Lake | NC | Not detected | Not tested | – | – |
| DocMid_156 | Lake | NC | Not detected | Not tested | – | – |
| DocOut_157 | Lake | NC | Not detected | Not tested | – | – |
HC, human contamination (fecal indicator bacteria and human marker detected); NHC, non-human contamination (fecal indicator detected and human markers not detected or inconclusive, reflecting potential for low levels of human contamination); NC, not fecal contaminated (fecal indicator not detected)
†Values in parentheses represent the proportion of sequences that belong to a given classifier among the total number of sequences from all classifiers
‡Density levels of the fecal indicators E. coli and enterococci: not detected, 0; low, > 0–250; medium, 250–1000; high, > 1000 CFU/100 mL
‡‡Quantification levels of the markers human Bacteroides, Lachno2, and Lachno3 when tested: Not detected, 0; not quantifiable, > 0–15; low, > 15–100; medium, 100–10,000; high, > 10,000 gene copies/100 mL. In case of divergence between the human Bacteroides, Lachno2, and/or Lachno3 human markers, results were considered to be inconclusive. See Additional file 1 for details
<sup>a</sup>Index indicating that the percentage of votes by the trees was above the majority threshold (50%)
<sup>b</sup>Index indicating that the percentage of votes by the trees was between 45 and 50%
<sup>c</sup>Index indicating that the percentage of votes by the trees was between 40 and 45%
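The level categories in the ‡ and ‡‡ footnotes can be expressed as simple threshold functions, sketched below. The function names are hypothetical, and because the footnotes leave shared boundaries ambiguous (for example, 250 CFU/100 mL appears in both "low" and "medium"), this sketch assumes each upper bound belongs to the lower category.

```python
def fib_level(cfu_per_100ml):
    """Categorize fecal indicator bacteria density (CFU/100 mL)."""
    if cfu_per_100ml < 0:
        raise ValueError("density cannot be negative")
    if cfu_per_100ml == 0:
        return "Not detected"
    if cfu_per_100ml <= 250:    # assumption: 250 counts as low
        return "Low"
    if cfu_per_100ml <= 1000:   # assumption: 1000 counts as medium
        return "Medium"
    return "High"

def human_marker_level(copies_per_100ml):
    """Categorize a qPCR human marker quantity (gene copies/100 mL)."""
    if copies_per_100ml < 0:
        raise ValueError("quantity cannot be negative")
    if copies_per_100ml == 0:
        return "Not detected"
    if copies_per_100ml <= 15:
        return "Not quantifiable"
    if copies_per_100ml <= 100:     # assumption: 100 counts as low
        return "Low"
    if copies_per_100ml <= 10_000:  # assumption: 10,000 counts as medium
        return "Medium"
    return "High"

# Hypothetical measurements for one water sample.
sample_fib = fib_level(1800)
sample_marker = human_marker_level(4200)
```

The rule that divergent markers are reported as inconclusive is not modeled here, since the footnote does not specify how divergence was assessed.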
Fig. 5 Random forest classifications performed on 16 artificial bacterial assemblages generated in silico. Red dots show the proportion of sequences within the total assemblage belonging to a source (1 dot = 1%). The percentage listed in the freshwater column corresponds to the proportion of sequences from the non-contaminated freshwater sample within the total assemblages (per sample, red dots + freshwater percentages = 100%). Bold values associated with red dots indicate the proportion of contamination expected for the different sources. Predictions of the Clostridiales and Bacteroidales classifications are indicated in the white rows. Green circles indicate the classifier detected the source signature in the sample. Orange circles indicate the classifier did not detect the source signature when it was expected. Blue circles indicate the classifier did not detect a signature when a source not included in the classifiers was included in the assemblage. The proportion of sequences matching the source classifier is associated with the green circles. See Additional file 2 for more details
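One way to picture such an artificial assemblage is as a weighted mixture of a freshwater profile and one or more source profiles, with the weights summing to 100%. The sketch below illustrates only that idea: the function name, ASV identifiers, and profiles are hypothetical, and the authors' actual generation procedure (described in Additional file 2) may differ, for example by subsampling reads.

```python
def mix_profiles(profiles, proportions):
    """Combine relative-abundance profiles into one artificial assemblage.

    profiles:    {sample_name: {sequence_id: relative_abundance}}, each
                 profile summing to 1.
    proportions: {sample_name: fraction of the assemblage}, summing to 1.
    """
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    mixed = {}
    for name, fraction in proportions.items():
        for seq_id, abundance in profiles[name].items():
            mixed[seq_id] = mixed.get(seq_id, 0.0) + fraction * abundance
    return mixed

# Hypothetical profiles: one uncontaminated freshwater sample, two sources.
profiles = {
    "freshwater": {"asv_fw1": 0.7, "asv_fw2": 0.3},
    "cow": {"asv_cow1": 0.6, "asv_shared": 0.4},
    "dog": {"asv_dog1": 0.9, "asv_shared": 0.1},
}
# 95% freshwater, 3% cow, 2% dog.
assemblage = mix_profiles(
    profiles, {"freshwater": 0.95, "cow": 0.03, "dog": 0.02}
)
```

The resulting profile still sums to 1, and a sequence present in several inputs (here `asv_shared`) accumulates contributions from each of them.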
V4V5 and V6 classifier predictions for animal fecal, sewage, and freshwater samples
| Unknown sample ID | V4V5 region: Clostridiales | V4V5 region: Bacteroidales | V6 region: Clostridiales | V6 region: Bacteroidales |
|---|---|---|---|---|
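| Cat_PU15 | Pet<sup>a</sup> (9) | Cat<sup>c</sup> (86)–Pet<sup>a</sup> (84) | Cat<sup>a</sup> (70)–Pet<sup>a</sup> (72) | Pet<sup>a</sup> (1) |
| Cow_PU75 | Cow<sup>a</sup> (29)–Ruminant<sup>a</sup> (99) | Cow<sup>a</sup> (88)–Ruminant<sup>a</sup> (97) | Cow<sup>a</sup> (21)–Ruminant<sup>a</sup> (53) | Cow<sup>a</sup> (76)–Ruminant<sup>a</sup> (97) |
| Deer_PU11 | Deer<sup>a</sup> (59)–Ruminant<sup>a</sup> (92) | Deer<sup>a</sup> (100)–Ruminant<sup>a</sup> (100) | Deer<sup>a</sup> (36)–Ruminant<sup>a</sup> (52) | Deer<sup>a</sup> (93)–Ruminant<sup>a</sup> (100) |
| Dog_PU17 | Dog<sup>a</sup> (65)–Pet<sup>a</sup> (46) | – | Dog<sup>a</sup> (94)–Pet<sup>a</sup> (91) | Cat<sup>a</sup> (5)–Pet<sup>a</sup> (56) |
| Pig_PU159 | Pig<sup>a</sup> (89) | Pig<sup>a</sup> (97) | Pig<sup>a</sup> (27) | Pig<sup>a</sup> (96) |
| Cow_PU70&Deer_PU91 | Ruminant<sup>a</sup> (99) | Cow<sup>a</sup> (77)–Deer<sup>c</sup> (20)–Ruminant<sup>a</sup> (96) | Cow<sup>a</sup> (10)–Deer<sup>c</sup> (5)–Ruminant<sup>a</sup> (44) | Cow<sup>a</sup> (67)–Ruminant<sup>a</sup> (100) |
| Sewage_Duncansville_52 | Sewage<sup>a</sup> (94) | Sewage<sup>a</sup> (90) | Sewage<sup>a</sup> (89) | Sewage<sup>a</sup> (86) |
| Sewage_MilwaukeeJI_199 | Sewage<sup>a</sup> (93) | Sewage<sup>a</sup> (76) | Sewage<sup>a</sup> (88) | Sewage<sup>a</sup> (68) |
| Sewage_MilwaukeeSS_200 | Sewage<sup>a</sup> (94) | Sewage<sup>a</sup> (81) | Sewage<sup>a</sup> (87) | Sewage<sup>a</sup> (80) |
| Sewage_ReusSpain_224 | Sewage<sup>b</sup> (96) | Sewage<sup>b</sup> (99) | Sewage<sup>a</sup> (86) | Sewage<sup>a</sup> (97) |
| Sewage_ReusSpain_80 | Sewage<sup>a</sup> (97) | Sewage<sup>b</sup> (99) | Sewage<sup>a</sup> (91) | Sewage<sup>a</sup> (95) |
| OtherSource_Rabbit_PU26 | Pet<sup>b</sup> (75) | – | – | – |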
Values in parentheses represent the proportion of sequences that belong to a given classifier among the total number of sequences from all classifiers
<sup>a</sup>Index indicating that the percentage of votes by the trees was above the majority threshold (50%)
<sup>b</sup>Index indicating that the percentage of votes by the trees was between 45 and 50%
<sup>c</sup>Index indicating that the percentage of votes by the trees was between 40 and 45%