Maxime Galan, Maria Razzauti, Emilie Bard, Maria Bernard, Carine Brouat, Nathalie Charbonnel, Alexandre Dehne-Garcia, Anne Loiseau, Caroline Tatard, Lucie Tamisier, Muriel Vayssier-Taussat, Helene Vignes, Jean-François Cosson.
Abstract
The human impact on natural habitats is increasing the complexity of human-wildlife interactions and leading to the emergence of infectious diseases worldwide. Highly successful synanthropic wildlife species, such as rodents, will undoubtedly play an increasingly important role in transmitting zoonotic diseases. We investigated the potential for recent developments in 16S rRNA amplicon sequencing to facilitate the multiplexing of the large numbers of samples needed to improve our understanding of the risk of zoonotic disease transmission posed by urban rodents in West Africa. In addition to listing pathogenic bacteria in wild populations, as in other high-throughput sequencing (HTS) studies, our approach can estimate essential parameters for studies of zoonotic risk, such as prevalence and patterns of coinfection within individual hosts. However, the estimation of these parameters requires cleaning of the raw data to mitigate the biases generated by HTS methods. We present here an extensive review of these biases and of their consequences, and we propose a comprehensive trimming strategy for managing these biases. We demonstrated the application of this strategy using 711 commensal rodents, including 208 Mus musculus domesticus, 189 Rattus rattus, 93 Mastomys natalensis, and 221 Mastomys erythroleucus, collected from 24 villages in Senegal. Seven major genera of pathogenic bacteria were detected in their spleens: Borrelia, Bartonella, Mycoplasma, Ehrlichia, Rickettsia, Streptobacillus, and Orientia. Mycoplasma, Ehrlichia, Rickettsia, Streptobacillus, and Orientia have never before been detected in West African rodents. Bacterial prevalence ranged from 0% to 90% of individuals per site, depending on the bacterial taxon, rodent species, and site considered, and 26% of rodents displayed coinfection. 
The 16S rRNA amplicon sequencing strategy presented here has the advantage over other molecular surveillance tools of dealing with a large spectrum of bacterial pathogens without requiring assumptions about their presence in the samples. This approach is therefore particularly suitable to continuous pathogen surveillance in the context of disease-monitoring programs.
IMPORTANCE Several recent public health crises have shown that the surveillance of zoonotic agents in wildlife is important to prevent pandemic risks. High-throughput sequencing (HTS) technologies are potentially useful for this surveillance, but rigorous experimental processes are required for the use of these effective tools in such epidemiological contexts. In particular, HTS introduces biases into the raw data set that might lead to incorrect interpretations. We describe here a procedure for cleaning data before estimating reliable biological parameters, such as positivity, prevalence, and coinfection, using 16S rRNA amplicon sequencing on an Illumina MiSeq platform. This procedure, applied to 711 rodents collected in West Africa, detected several zoonotic bacterial species, including some at high prevalence, despite their never before having been reported for West Africa. In the future, this approach could be adapted for the monitoring of other microbes such as protists, fungi, and even viruses.
Keywords: West Africa; bacteria; emerging infectious diseases; high-throughput sequencing; metabarcoding; molecular epidemiology; next-generation sequencing; rodents; zoonoses
Year: 2016 PMID: 27822541 PMCID: PMC5069956 DOI: 10.1128/mSystems.00032-16
Source DB: PubMed Journal: mSystems ISSN: 2379-5077 Impact factor: 6.496
Sources of bias during the experimental and bioinformatic steps of 16S rRNA amplicon sequencing: consequences for data interpretation and solutions for mitigating these biases
| Experimental step(s) | Source(s) of errors | Consequence(s) | Solution(s) |
|---|---|---|---|
| Sample collection | Cross-contamination between individuals | False-positive samples | Rigorous processing (decontamination of the instruments, cleaning of the autopsy table, use of sterile bacterium-free consumables, gloves, masks) |
| | | | Negative controls during sampling (e.g., organs of healthy mice during dissection) |
| | Collection and storage conditions | False-positive and false-negative samples | Use of appropriate storage conditions/buffers; use of unambiguously identified samples; double-checking of tube labeling during sample collection |
| DNA extraction | Cross-contamination between samples | False-positive samples | Rigorous processing (separation of pre- and post-PCR steps, use of sterile hood and filter tips and sterile bacterium-free consumables) |
| | Reagent contamination with bacterial DNA | False-positive samples | Negative controls for extraction (extraction without sample) |
| | Small amounts of DNA | False-negative samples | Use of an appropriate DNA extraction protocol; discarding of samples with a low DNA concentration |
| Target DNA region and primer design | Target DNA region efficacy | False-negative samples due to poor taxonomic identification | Selection of an appropriate target region and design of effective primers for the desired taxonomic resolution |
| | Primer design | False-negative samples due to biases in PCR amplification for some taxa | Checking of the universality of the primers with reference sequences |
| Tag/index design and preparation | False assignments of sequences due to cross-contamination between tags/indices | False-positive samples | Rigorous processing (use of sterile hood and filter tips and sterile bacterium-free consumables, brief centrifugation before the opening of index storage tubes, separation of pre- and post-PCR steps) |
| | | | Negative controls for tags/indices (empty wells without PCR reagents for particular tags or index combinations) |
| | | | Positive controls for alien DNA, i.e., a bacterial strain highly unlikely to infect the samples studied (e.g., a host-specific bacterium unable to persist in the environment) to estimate false-assignment rate |
| | False assignments of sequences due to inappropriate tag/index design | False-positive samples | Fixing of a minimum number of substitutions between tags or indices; each nucleotide position in the sets of tags or indices should display about 25% occupation by each base for Illumina sequencing |
| PCR amplification | Cross-contamination between PCRs | False-positive samples | Rigorous processing (brief centrifugation before opening the index storage tubes, separation of pre- and post-PCR steps) |
| | | | Negative controls for PCR (PCR without template), with microtubes left open during sample processing |
| | Reagent contamination with bacterial DNA | False-positive samples | Rigorous processing (use of sterile hood and filter tips and sterile bacterium-free consumables) |
| | | | Negative controls for PCRs (PCR without template), with microtubes closed during sample processing |
| | Chimeric recombinations by jumping PCR | False-positive samples due to artifactual chimeric sequences | Increasing the elongation time and decreasing the number of cycles; use of a bioinformatic strategy to remove the chimeric sequences (e.g., Uchime program) |
| | Poor or biased amplification | False-negative samples | Increasing the amount of template DNA; optimizing the PCR conditions (reagents and program) |
| | | | Use of technical replicates to validate sample positivity |
| | | | Positive controls for PCR (extraction from infected tissue and/or bacterial isolates) |
| Library preparation | Cross-contamination between PCRs/libraries | False-positive samples | Rigorous processing (use of sterile hood and filter tips and sterile bacterium-free consumables, electrophoresis and gel excision with clean consumables, separation of pre- and post-PCR steps) |
| | | | Use of a protocol with an indexing step during target amplification |
| | | | Negative controls for indices (changing well positions between library preparation sessions) |
| | Chimeric recombinations by jumping PCR | False-positive samples due to interindividual recombinations | Avoiding PCR library enrichment of pooled samples |
| | | | Positive controls for alien DNA, i.e., DNA from a bacterial strain that should not be identified in the sample (e.g., a host-specific bacterium unable to persist in the environment) |
| MiSeq sequencing (Illumina) | Sample sheet errors | False-positive and false-negative samples | Negative controls (wells without PCR reagents for a particular index combination) |
| | Run-to-run carryover (Illumina technical support note no. 770-2013-046) | False-positive samples | Washing of the MiSeq with dilute sodium hypochlorite solution |
| | Poor quality of reads due to flow cell overloading | False-negative samples due to low quality of sequences | qPCR quantification of the library before sequencing |
| | Poor quality of reads due to low-diversity libraries (Illumina technical support note no. 770-2013-013) | | Decreasing cluster density; creation of artificial sequence diversity at the flow cell surface (e.g., by adding 5%–10% PhiX DNA control library) |
| | Small number of reads per sample | False-negative samples due to low depth of sequencing | Decreasing the level of multiplexing |
| | | | Discarding the sample with a low number of reads |
| | Too-short overlapping read pairs | False-negative samples due to low quality of sequences | Increasing paired-end sequence length or decreasing the length of the target sequence |
| | Mixed clusters on the flow cell | False-positive samples due to false index pairing | Use of a single barcode sequence for both the i5 and i7 indices for each sample (when possible, e.g., with a small number of samples) |
| | | | Positive controls for alien DNA, i.e., DNA from a bacterial strain highly unlikely to be found in the rodents studied (e.g., a host-specific bacterium unable to persist in the environment) |
| Bioinformatics and taxonomic classification | Poor quality of reads | False-negative samples due to poor taxonomic resolution | Removal of low-quality reads |
| | Errors during processing (sequence trimming, alignment) | False-positive and false-negative samples | Use of standardized protocols and reproducible workflows |
| | Incomplete reference sequence databases | False-negative samples | Selection of an appropriate database for the selected target region and testing of the database for bacteria of particular interest |
| | Error of taxonomic classification | False-positive samples | Positive controls for PCRs (extraction from infected tissue and/or bacterial isolates and/or mock communities) |
| | | | Checking of taxonomic assignments by other methods (e.g., blast analyses using different databases) |
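The tag/index design recommendation in the table (a minimum number of substitutions between indices, and roughly 25% occupation by each base at every position) can be checked programmatically. The following is a minimal sketch, not the authors' tooling: the function names and the four-nucleotide example indices are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of substitutions between two equal-length index sequences."""
    if len(a) != len(b):
        raise ValueError("index sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(indices: list[str]) -> int:
    """Smallest Hamming distance over all pairs in the index set."""
    return min(hamming(a, b) for a, b in combinations(indices, 2))


def base_occupancy(indices: list[str]) -> list[dict[str, float]]:
    """Fraction of A, C, G, and T observed at each position of the set."""
    occupancy = []
    for column in zip(*indices):
        counts = Counter(column)
        occupancy.append({base: counts[base] / len(column) for base in "ACGT"})
    return occupancy


# Hypothetical 4-nucleotide index set, for illustration only.
example_indices = ["ACGT", "CGTA", "GTAC", "TACG"]
print(min_pairwise_distance(example_indices))
print(base_occupancy(example_indices)[0])
```

A real index set would be longer, and the acceptable minimum distance depends on the sequencing error rate one wants to tolerate; both are design choices outside the scope of this sketch.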
FIG 1 Workflow of the wet laboratory, bioinformatics, and data filtering procedures for 16S rRNA amplicon sequencing. Reagent contaminants were detected by analyzing the sequences in the NCext and NCPCR controls. The sequence number thresholds for correcting for cross-contamination (TCC) are OTU and run dependent and were estimated by analyzing the sequences in the NCmus, NCext, NCPCR, and PCindex controls. The sequence number thresholds for correcting for false index pairing (TFA) are also OTU and run dependent and were estimated by analyzing the sequences in the NCindex and PCalien controls. A result was considered positive if the number of sequences was >TCC and >TFA. A sample was considered positive if a positive result was obtained for both PCR replicates. *, see Kozich et al. (18) for details on the sequencing.
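The positivity rule in the caption of Fig. 1 can be written as a short decision function. This is a minimal sketch under stated assumptions: the function names are hypothetical, the thresholds are passed in as plain integers for one OTU in one run, and the replicate counts in the usage lines are invented for illustration (only the thresholds 6 and 282 are taken from the first Run 1 row of the threshold table).

```python
def is_positive_result(n_sequences: int, tcc: int, tfa: int) -> bool:
    """A single PCR result is positive only if it exceeds both thresholds."""
    return n_sequences > tcc and n_sequences > tfa


def is_positive_sample(pcr1: int, pcr2: int, tcc: int, tfa: int) -> bool:
    """A sample is positive only if both PCR replicates are positive."""
    return is_positive_result(pcr1, tcc, tfa) and is_positive_result(pcr2, tcc, tfa)


# Hypothetical replicate counts for one rodent and one OTU (TCC = 6, TFA = 282).
print(is_positive_sample(500, 300, tcc=6, tfa=282))  # both replicates exceed both thresholds
print(is_positive_sample(500, 100, tcc=6, tfa=282))  # second replicate is below TFA
```

The comparisons are strict, following the caption's ">TCC and >TFA": a count equal to a threshold is treated as negative.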
Numbers of sequences for 12 pathogenic OTUs observed in wild rodents, negative controls, and positive controls, together with TCC and TFA threshold values
| OTU | Total no. of sequences | Wild rodents (total) | Wild rodents (max in one PCR) | NCPCR (total) | NCPCR (max in one PCR) | NCext (total) | NCext (max in one PCR) | NCmus (total) | NCmus (max in one PCR) | PCBartonella_t (total) | PCBartonella_t (max in one PCR) | PCBorrelia_b (total) | PCBorrelia_b (max in one PCR) | PCMycoplasma_m (total) | PCMycoplasma_m (max in one PCR) | TCC | TFA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Run 1 | |||||||||||||||||
| Whole data set | 7,960,533 | 7,149,444 | 64,722 | 45,900 | 8,002 | 39,308 | 8,741 | 68,350 | 26,211 | 137,424 | 73,134 | 239,465 | 120,552 | 280,642 | 82,933 | / | / |
| | 1,410,218 | 1,410,189 | 61,807 | 2 | 1 | 3 | 2 | 9 | 5 | 3 | 3 | 8 | 6 | 4 | 3 | 6 | 282 |
| | 507,376 | 507,369 | 36,335 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 2 | 1 | 1 | 2 | 2 | 2 | 101 |
| | 649,451 | 649,423 | 63,137 | 4 | 2 | 3 | 2 | 7 | 4 | 1 | 1 | 1 | 1 | 12 | 6 | 6 | 130 |
| | 345,873 | 345,845 | 28,528 | 4 | 4 | 7 | 4 | 9 | 4 | 1 | 1 | 0 | 0 | 7 | 3 | 4 | 69 |
| | 279,965 | 279,957 | 29,503 | 1 | 1 | 4 | 1 | 0 | 0 | 2 | 2 | 0 | 0 | 1 | 1 | 2 | 56 |
| | 202,127 | 67,973 | 16,145 | 1 | 1 | 1 | 1 | 1 | 1 | 134,124 | 71,163 | 7 | 4 | 20 | 9 | 9 | 40 |
| PC | 280,151 | 338 | 28 | 0 | 0 | 0 | 0 | 2 | 2 | 34 | 20 | 24 | 18 | 279,753 | 82,767 | / | / |
| PC | 238,772 | 420 | 43 | 0 | 0 | 0 | 0 | 0 | 0 | 38 | 21 | 238,238 | 119,586 | 76 | 23 | / | / |
| Run 2 | |||||||||||||||||
| Whole data set | 6,687,060 | 6,525,107 | 42,326 | 61,231 | 9,145 | 53,334 | 7,669 | / | / | 12,142 | 7,518 | 13,378 | 7,164 | 21,868 | 6,520 | / | / |
| | 155,486 | 155,486 | 7,703 | 0 | 0 | 0 | 0 | / | / | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 31 |
| | 1,036,084 | 1,035,890 | 23,588 | 1 | 1 | 192 | 115 | / | / | 0 | 0 | 0 | 0 | 1 | 1 | 115 | 207 |
| | 127,591 | 127,590 | 5,072 | 1 | 1 | 0 | 0 | / | / | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 26 |
| | 85,596 | 85,583 | 20,146 | 0 | 0 | 13 | 13 | / | / | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 17 |
| | 56,324 | 56,324 | 10,760 | 0 | 0 | 0 | 0 | / | / | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 |
| | 13,356 | 13,356 | 1,482 | 0 | 0 | 0 | 0 | / | / | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| | 74,017 | 74,017 | 19,651 | 0 | 0 | 0 | 0 | / | / | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15 |
| | 21,636 | 21,636 | 3,085 | 0 | 0 | 0 | 0 | / | / | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| | 307 | 307 | 181 | 0 | 0 | 0 | 0 | / | / | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | 1,559,028 | 1,547,652 | 14,515 | 1 | 1 | 2 | 2 | / | / | 11,297 | 6,714 | 2 | 2 | 74 | 59 | 59 | 312 |
| | 32,399 | 32,399 | 6,245 | 0 | 0 | 0 | 0 | / | / | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 |
| | 589 | 589 | 329 | 0 | 0 | 0 | 0 | / | / | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| PC | 16,854 | 2 | 1 | 0 | 0 | 0 | 0 | / | / | 0 | 0 | 0 | 0 | 16,852 | 5,766 | / | / |
| PC | 12,197 | 0 | 0 | 0 | 0 | 0 | 0 | / | / | 0 | 0 | 12,197 | 6,426 | 0 | 0 | / | / |
Threshold TCC data are based on the maximum number of sequences observed in a negative or positive control for a particular OTU in each run.
Threshold TFA data are based on the false-assignment rate (0.02%) weighted by the total number of sequences of each OTU in each run.
Mycoplasma mycoides and Borrelia burgdorferi bacterial isolates were added as positive controls for PCR and indexing (i.e., PCalien) (see Fig. 1).
Data are given for the two MiSeq runs separately. NCPCR, negative controls for PCR; NCext, negative controls for extraction; NCmus, negative controls for dissection; PCBartonella_t, positive controls for PCR; PCBorrelia_b and PCMycoplasma_m, positive controls for PCR and positive controls for indexing; TCC and TFA, thresholds for positivity for a particular bacterium according to bacterial OTU and MiSeq run (see also Fig. 1).
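The two threshold definitions in the footnotes can be reproduced from the table values. The sketch below is illustrative only: the function names are hypothetical, and rounding TFA to the nearest integer is an assumption (it matches the tabulated values, but the footnote does not state the rounding rule). Where an OTU corresponds to one of the positive controls, the caller is assumed to leave that control's own counts out of the list passed to `tcc_threshold`.

```python
FALSE_ASSIGNMENT_RATE = 0.0002  # 0.02%, as stated in the table footnote


def tcc_threshold(control_maxima: list[int]) -> int:
    """Maximum number of sequences seen in any single control PCR for one OTU."""
    return max(control_maxima)


def tfa_threshold(total_sequences_for_otu: int,
                  rate: float = FALSE_ASSIGNMENT_RATE) -> int:
    """False-assignment rate weighted by the OTU's total sequences in the run.

    Rounding to the nearest integer is an assumption, not a documented rule.
    """
    return round(total_sequences_for_otu * rate)


# First OTU row of Run 1: per-control maxima and total sequence count.
print(tcc_threshold([1, 2, 5, 3, 6, 3]))
print(tfa_threshold(1_410_218))
```

For that row the sketch gives TCC = 6 and TFA = 282, the values listed in the table.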
FIG 2 Taxonomic assignment of the V4 16S rRNA sequences in wild rodents and in negative controls for extraction and PCR. The histograms show the percentages of sequences for the most abundant bacterial genera in the two MiSeq runs combined. Note the presence in the controls of several bacterial genera, which was likely due to the inherent contamination of laboratory reagents by bacterial DNA (termed “contaminant genera”). These contaminant genera are also present (to a lesser extent) in the rodent samples. The insets represent the proportion of sequences from rodent samples which were incorrectly assigned to the controls. See Fig. S1 for separate histograms for the two MiSeq runs.
FIG 3 Numbers of rodents yielding positive results (positive rodents), and of sequences from positive rodents, removed for each OTU at each step in data filtering. These findings demonstrate that the positive rodents filtered out corresponded to only a very small number of sequences. (Left panel) The histogram shows the number of positive rodents discarded because of likely cross-contamination, false index pairing, and failure to replicate in both PCRs, as well as the positive results retained at the end of data filtering (in green). (Right panel) The histogram shows the number of sequences corresponding to the same class of positive rodents. Note that several positive results may be recorded for the same rodent in cases of coinfection.
FIG 4 Plots of the number of sequences [log (x + 1) scale] from bacterial OTUs in both PCR replicates (PCR1 and PCR2) of the 348 wild rodents analyzed in the first MiSeq run. Note that each rodent was tested with two replicate PCRs. Green points correspond to rodents with two positive results after filtering; red points correspond to rodents with one positive result and one negative result; and blue points correspond to rodents with two negative results. The light blue area and lines correspond to threshold values used for the data filtering: samples below the lines were filtered out. See Fig. S3 for plots corresponding to the second MiSeq run.
Detection of 12 bacterial OTUs in the four wild-rodent species sampled in Senegal: biology and pathogenicity of the corresponding bacterial genus
| OTU of interest (genus level) | Closest species | No. of positive wild rodents | | | | Biology and epidemiology |
|---|---|---|---|---|---|---|
| | Undetermined | 60 | 73 | 1 | 6 | |
| | | 21 | 0 | 8 | 6 | |
| | “ | 40 | 0 | 12 | 8 | The genus |
| | | 28 | 42 | 30 | 1 | |
| | | 0 | 0 | 0 | 90 | |
| | | 93 | 40 | 1 | 1 | |
| | | 0 | 0 | 0 | 18 | |
| | | 3 | 8 | 0 | 0 | |
| | | 3 | 13 | 0 | 0 | |
| | | 0 | 2 | 46 | 0 | |
| | | 1 | 0 | 0 | 1 | |
| | | 10 | 1 | 0 | 5 | |
Based on phylogenetic analysis; see Fig. S4 in the supplemental material.
n, number of rodents screened and analyzed.
FIG 5 Prevalence of Mycoplasma lineages in Senegalese rodents, by site, and phylogenetic associations between Mycoplasma lineages and rodent species. (A) Comparison of phylogenetic trees based on the 16S rRNA V4 sequences of Mycoplasma and on the mitochondrial cytochrome b gene and the two nuclear gene fragments (IRBP exon 1 and GHR) for rodents (the tree was drawn based on data from reference 92). Lines link the Mycoplasma lineages detected in the various rodent species (for a minimum site prevalence exceeding 10%). The numbers next to the branches are bootstrap values (shown only if >70%). (B) Plots of OTU prevalences, with 95% confidence intervals calculated by Sterne’s exact method (93) according to rodent species and site (see reference 69 for more information about site codes and their geographic locations). The gray bars on the x axis indicate sites from which the rodent species concerned is absent.