Petra Pjevac1,2, Bela Hausmann1,3, Jasmin Schwarz1,3, Gudrun Kohl1,2, Craig W Herbold2, Alexander Loy1,2, David Berry1,2. 1. Joint Microbiome Facility of the Medical University of Vienna and the University of Vienna, Vienna, Austria. 2. Division of Microbial Ecology, Department of Microbiology and Ecosystem Science, Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria. 3. Department of Laboratory Medicine, Medical University of Vienna, Vienna, Austria.
Abstract
In microbiome research, phylogenetic and functional marker gene amplicon sequencing is the most commonly-used community profiling approach. Consequently, a plethora of protocols for the preparation and multiplexing of samples for amplicon sequencing have been developed. Here, we present two economical high-throughput gene amplification and sequencing workflows that are implemented as standard operating procedures at the Joint Microbiome Facility of the Medical University of Vienna and the University of Vienna. These workflows are based on a previously-published two-step PCR approach, but have been updated to either increase the accuracy of results, or alternatively to achieve orders of magnitude higher numbers of samples to be multiplexed in a single sequencing run. The high-accuracy workflow relies on unique dual sample barcoding. It allows the same level of sample multiplexing as the previously-published two-step PCR approach, but effectively eliminates residual read missasignments between samples (crosstalk) which are inherent to single barcoding approaches. The high-multiplexing workflow is based on combinatorial dual sample barcoding, which theoretically allows for multiplexing up to 299,756 amplicon libraries of the same target gene in a single massively-parallelized amplicon sequencing run. Both workflows presented here are highly economical, easy to implement, and can, without significant modifications or cost, be applied to any target gene of interest.
In microbiome research, phylogenetic and functional marker gene amplicon sequencing is the most commonly-used community profiling approach. Consequently, a plethora of protocols for the preparation and multiplexing of samples for amplicon sequencing have been developed. Here, we present two economical high-throughput gene amplification and sequencing workflows that are implemented as standard operating procedures at the Joint Microbiome Facility of the Medical University of Vienna and the University of Vienna. These workflows are based on a previously-published two-step PCR approach, but have been updated to either increase the accuracy of results, or alternatively to achieve orders of magnitude higher numbers of samples to be multiplexed in a single sequencing run. The high-accuracy workflow relies on unique dual sample barcoding. It allows the same level of sample multiplexing as the previously-published two-step PCR approach, but effectively eliminates residual read missasignments between samples (crosstalk) which are inherent to single barcoding approaches. The high-multiplexing workflow is based on combinatorial dual sample barcoding, which theoretically allows for multiplexing up to 299,756 amplicon libraries of the same target gene in a single massively-parallelized amplicon sequencing run. Both workflows presented here are highly economical, easy to implement, and can, without significant modifications or cost, be applied to any target gene of interest.
Sequencing of phylogenetic and functional marker gene amplicons is a widely-used and cost-effective approach for characterizing the composition and dynamics of microbial communities from virtually any type of clinical or environmental sample (Hamady and Knight, 2009; Gilbert et al., 2010). The rapid advances in both quality and yield of short-read sequencing technologies has made amplicon sequencing accessible and affordable to the broader scientific community, and has led to the ubiquitous implementation of amplicon-based microbial community profiling approaches in microbiome research. Simultaneously, the increased accessibility and the ever-growing application scope of amplicon sequencing has led to a surge in attempts to develop standardized procedures and best practices for the generation and analysis of sequencing data in microbiome studies (reviewed in Knight et al., 2018). However, standardization of such workflows is not trivial. The main reason for this is the number of individual steps in an amplicon sequencing workflow: sample collection, preservation and storage, nucleic acids extraction, target gene selection and amplification, sequencing, and finally in silico data processing, analysis and interpretation. For each step of the workflow several options are available, resulting in a multitude of possible workflow variations and combinations. How different approaches for many of these steps (e.g., nucleic acids extraction method or primer selection for target gene amplification) affect amplicon sequencing results has been critically reviewed (e.g., Sinha et al., 2017; Pollock et al., 2018; Allaband et al., 2019). Notably, a concern in many microbiome studies is so-called batch effects, a term commonly used to refer to artifacts and confounding factors originating from biological, technical, and computational sources of variation in handling and processing of sample or sample groups that are ultimately compared to each other (Wang and LêCao, 2020). It is thus essential for any workflow to minimize possibilities for the introduction of such batch effects.While a universal “best practice guideline” to amplicon sequencing based microbiome analysis is not feasible or even desirable—in part due to varying sample- and study-specific requirements, but also due to conflicting experimental data (reviewed in Pollock et al., 2018)—one consensus does exist: methodological consistency and minimized variance in sample processing within a single study is of utmost importance to the validity of any dataset, and is essential in order to avoid or minimize erroneous interpretations of microbiome data.As amplicon sequencing has become the most widely-used method for microbial community profiling, economical workflows that allow for very high throughput are in demand. Simultaneously, the decreasing costs of sequencing and rapid advances in sequence quality achieved in recent years has put into focus the need for improvements in amplicon sequencing workflows that ensure maximal data quality. Considering sequence data processing, the improved quality of data has led to the transition to sequence data analysis on the level of biologically more-meaningful amplicon sequence variants (ASVs) (Callahan et al., 2017), which are gradually replacing the widely-used but largely arbitrarily-defined operational taxonomic unit (OTU) approach. On the side of data generation, the issue of index or barcode crosstalk, which can lead to misassignment of sequences in massively parallel sequencing approaches, has received considerable attention (summarized in MacConaill et al., 2018). As millions of sequence reads can be generated in a single sequencing run, large numbers of samples are typically pooled prior to sequencing to assure cost-efficient and scalable workflows. Prior to pooling, individual samples are tagged with sample specific DNA fragments, referred to as sample indexes or barcodes. These indices, or barcodes, are used to computationally separate (demultiplex) pooled samples after sequencing. However, a risk of sequence read misassignment during de-multiplexing is inherent to these approaches. The exact source of crosstalk can in most cases not be precisely identified, as crosstalk is usually caused by a combination of barcode or index oligonucleotide contamination at synthesis or during handling (Quail et al., 2014), sequencing errors, and to a lesser extent barcode or index hopping (i.e., switching of barcodes or indexes between different sequencing libraries due to unknown causes). The risk of crosstalk can be further decreased by applying sample barcoding strategies which incorporate multiple differences that discriminate between samples. As a consequence, dual indexing approaches—where each sample is tagged with two unique indexes or barcodes—are now considered the gold standard for applications in which tens to hundreds of sequencing libraries are multiplexed in a single sequencing run (Esling et al., 2015). While challenging to quantify, the magnitude of sample crosstalk in single barcoding approaches (based on erroneous pairing for barcodes and indexes in dual indexing approaches) has been reported to reach up to 0.3% (Kircher et al., 2012). This frequency of sequence misassignment in amplicon sequencing workflows employing single barcoding is not a major concern if the primary aim of the amplicon analysis is community profiling. However, such missasignments do become a considerable risk if low abundance amplicon sequence variants (i.e., the rare biosphere) are of scientific interest to a study.In most amplicon sequencing workflows, sample-specific barcodes are introduced at the step of sequencing library preparation, meaning that each amplicon sample is an individual sequencing library in a multiplexed run. However, as the number of DNA sequences that can be used as indices is limited (Faircloth and Glenn, 2012), so is the number of samples that can be multiplexed by this approach. Furthermore, sequencing library preparation is one of the more costly and time-consuming steps of an amplicon sequencing workflow, and thus is a limiting step for economical sample indexing or barcoding approaches. A more economical and flexible approach of sample barcoding is a second PCR step, performed after the target gene amplification PCR and before library preparation, that has been developed and successfully applied to a variety of samples (Herbold et al., 2015). Here, we further optimized this two-step PCR approach to enable either highly-accurate assignment of sequences to source samples via unique dual barcoding (UDB), or to allow for magnitudes higher levels of sample multiplexing via combinatorial dual barcoding (CDB).
Materials and Methods
Mock and Environmental Reproducibility Controls
A commercially-available mock community (ZymoMock, ZymoBIOMICS Microbial Community DNA Standard II, D6311; Supplementary Table 1) and a complex environmental sample (Soil) was used to evaluate the different sample barcoding setups with respect to extent of barcode crosstalk, and the effects of variability in amplicon preparation, library preparation, and sequencing on the quality and reproducibility of amplicon sequencing results (i.e., batch effects). The mock community was ordered as a DNA standard, while environmental DNA was extracted from a peatland soil using a phenol-chloroform based method after mechanical cell lysis (bead-beating) (Hausmann et al., 2016).
Target Gene (First-Step) PCR
In the first amplification step, the V4 or V3–V4 regions of the 16S rRNA genes were amplified with the 515F/806R or 341F/785R primers (Table 1), modified to contain either the same 16 nt head sequence (H1: 5′-GCT ATG CGC GAG CTG C-3′; Herbold et al., 2015) at the 5′ end of both the forward and reverse primer, or the H1 sequence at the 5′ end of forward primer, and a newly designed 16 nt head sequence (H2: 5′-TAG CGC ACA CCT GGT A-3′) at the 5′ end of the reverse primer (Figure 1). PCR reactions for each sample were set up as triplicate 25 μl reactions using the DreamTaq Green PCR Master Mix (ThermoFisher), with 0.25 μmol L–1 of forward and reverse primer each and 2 μl DNA template. Four PCR negative controls (PCR with nuclease free water as template) were routinely performed per 90 samples. Thermal cycling was performed as shown in Table 1. PCR products were subsequently analyzed by gel electrophoresis either on in-house cast 2% low electroendosmosis (LE)-agarose gels or using an E-Gel system (ThermoFisher) with precast 2% E-Gel 96 Gels with SYBR Safe DNA Gel Stain (ThermoFisher). Thereafter, triplicate first-step PCR reactions were pooled and purified and normalized with the SequalPrep Normalization Plate Kit (Invitrogen) following the manufacturers’ instructions. This clean-up and normalization approach removes DNA fragments shorter than 100 bp (i.e., primer dimers) and adjusts the amount of recovered DNA per reaction to a maximum of 25 ng prior to use as a template for barcoding (second-step) PCRs. The PCR product is combined with a kit-specific binding buffer and transferred into a SequelPrep normalization plate, the walls of which are coated with a DNA binding solid phase capable of retaining a maximum of 25 ng DNA. Unbound PCR products (including short fragments, which cannot bind to the solid phase) and buffer are removed, and the now normalized and cleaned PCR product is detached from the DNA-binding solid phase with a kit-specific elution buffer.
TABLE 1
16S rRNA gene-targeted oligonucleotide primers used in this study.
Primer name
Primer sequence (5′–3′)
Cycling conditions
References
V4 region of 16S rRNA gene
515F Parada
GTG YCA GCM GCC GCG GTA A
94°C for 3 min 30 cycles of:• 94°C for 45 s• 52°C for 60 s• 72°C for 90 s 72°C for 10 min
Parada et al., 2016
806R Apprill
GGA CTA CNV GGG TWT CTA AT
Apprill et al., 2015
V3–4 region of 16S rRNA gene
341F
CCT ACG GGN GGC WGC AG
94°C for 3 min 30 cycles of:• 95°C for 30 s• 55°C for 30 s• 72°C for 60 s 72°C for 10 min
Herlemann et al., 2011
785R
GAC TAC HVG GGT ATC TAA TCC
Herlemann et al., 2011
FIGURE 1
Overview of barcoding systems evaluated in this study. bc8 = 8 nt long barcodes (520 available). bc/bc12 = 12 nt long barcodes (548 available). Fwd/Rev, forward/reverse primer. SB, single barcoding. UDB, unique dual barcoding. CDB, combinatorial dual barcoding.
16S rRNA gene-targeted oligonucleotide primers used in this study.Overview of barcoding systems evaluated in this study. bc8 = 8 nt long barcodes (520 available). bc/bc12 = 12 nt long barcodes (548 available). Fwd/Rev, forward/reverse primer. SB, single barcoding. UDB, unique dual barcoding. CDB, combinatorial dual barcoding.
Barcoding (Second-Step) PCR
In the second amplification step, either a single barcoding primer, consisting of a 8 nt barcode (bc8) and H1 (5′-BC8_1-H1-3′; see also Herbold et al., 2015); a single barcoding primer, consisting of an 12 nt barcode (bc12) and H1 (5′-BC12_1-H1-3′); or two barcoding primers each consisting of a distinct 12 nt barcode and one of the two used 16 nt head sequences (5′-BC12_1-H1-3′ and 5′-BC12_2-H2-3′) were used for amplification (Figure 1). The 8 nt barcodes (n = 520; Supplementary Table 2) used were the same ones as used by Herbold and colleagues (Herbold et al., 2015, previously published in Hamady et al., 2008). The 12 nt barcodes (n = 549, Supplementary Table 3) were selected from a list of 959 12 nt barcodes used in the Earth Microbiome Project (Walters et al., 2016), based on following criteria: Hamming distance ≥ 4 from one another [as determined by the R-package DNABarcodes (Buschmann and Bystrykh, 2013)], and together having a uniform base distribution at each barcode base position. All oligonucleotide barcodes used for this study were synthesized and HPLC purified by and purchased from Microsynth (Balgach, Switzerland). Due to the nature of the barcoding PCR setup, 548 of these 549 12 nt barcodes are currently in use at the JMF (Supplementary Table 3).Barcoding PCR reactions were set up as a single 50 μl reaction using the DreamTaq PCR Master Mix (ThermoFisher), with 0.8 μmol L–1 of each used barcoding primer and 10 μL of purified and normalized first-step PCR product as template. The barcoding PCR was performed with following thermal cycle conditions: initial denaturation at 94°C for 4 min; 7 cycles of denaturation at 94°C for 30 s, annealing at 52°C for 30 s, and elongation 72°C for 60 s; followed by a final elongation step at 72°C for 7 min. Thereafter, samples were again purified and normalized with the SequalPrep Normalization Plate Kit as described above.
Sequencing Library Preparation and Sequencing
Barcoded, purified, and normalized PCR amplicons were multiplexed and concentrated with the innuPREP PCRpure Kit (Analytik Jena). In the here analyzed datasets, up to 546 distinct amplicon libraries, i.e., samples amplified with the same first-step primers, were multiplexed in a single amplicon pool (Table 2). Thereafter, the amplicon pool fragment size was verified on a D1000 ScreenTape Assay using a TapeStation 4200 (Agilent), and the DNA concentration in the amplicon pool was quantified with a Quant-iT dsDNA Assay (ThermoFisher) on a Qubit 4 Fluorometer (ThermoFisher). Sequencing libraries were prepared and indexed by adapter ligation and PCR (8 cycles) using the TruSeq Nano DNA Library Prep Kit (Illumina) and TruSeq DNA Single Indexes Set A or B (Illumina), excluding the DNA fragmentation and clean-up of fragmented DNA steps, and otherwise following the manufacturer’s instructions. Sequencing library size and the efficiency of unligated adaptor removal were verified on a D1000 ScreenTape Assay using a TapeStation 4200, and the sequencing library concentration was determined with a Quant-iT dsDNA Assay on a Qubit 4 Fluorometer. For each sequencing run, 2–5 amplicon pool libraries (in total up to 1,092 distinct amplicon libraries) were combined and sequenced in 2 × 300 cycle paired-end mode on an Illumina MiSeq using the MiSeq Reagent kit v3 (Illumina) with 6 nt library indexes (DNA Single Indexes Set A or B, Illumina). To achieve sufficient variability during the first five sequencing cycles, which is necessary for efficient sequence cluster identification and phasing/prephasing calibration during Illumina sequencing (Fadrosh et al., 2014), we spiked in 1–7 random shotgun genomic or metagenomic sequencing libraries (to an abundance of 9–21%) and 1% PhiX control to each sequencing run.
TABLE 2
Overview of all analyzed runs separated by amplicon pools. Yield is given in read pairs and corresponds to Figure 2.
Run
Pool
Barcoding
Primers
Libraries
Total read pairs
Barcode-sorted read pairs
Final read pairs
NC
Mock
Soil
Note
C64CL
1
SB
V3–4
184
1 922 856
973 290
865 536
CGPPL
2
CDB-H12
V4
478
17 073 818
10 267 424
7 641 530
20
5
CBPFC
3
UDB-H11
V3–4
264
5 583 678
2 825 700
2 118 627
10
3
4
UDB-H11
V3–4
264
5 352 498
2 683 767
2 263 560
3
5
SB
V3–4
64
1 841 019
1 062 035
970 704
CGTHM
6
CDB-H12
V3–4 (and V4)
293
6 650 379
3 389 426
581 522
12
2
Contains nine V4 amplicons. All controls are from the V3–4 primer set.
7
CDB-H12
V4
480
11 813 225
7 275 537
1 367 788
20
5
3
CBM45
8
CDB-H12
V4
200
5 527 705
2 724 523
1 952 019
9
2
2
9
CDB-H12
V4
288
11 337 538
5 656 339
4 042 682
12
3
2
10
SB
V3–4
25
239 984
142 451
93 834
11
SB
V3–4
25
346 018
184 313
131 040
CL24D
12
UDB-H11
V3–4
268
NA
NA
NA
10
3
56 non-standard amplicons were included in this pool, therefore yield data was calculated.
13
UDB-H11
V3–4
264
3 228 166
1 563 877
1 382 073
11
3
14
UDB-H11
V3–4
144
4 275 834
2 142 075
1 357 170
7
2
CBJCM
15
UDB-H11
V3–4
268
NA
NA
NA
10
3
56 non-standard amplicons were included in this pool, therefore yield data was calculated.
16
UDB-H11
V3–4
264
1 811 989
810 445
630 204
11
3
17
UDB-H11
V3–4
144
1 834 298
854 991
468 849
7
2
CL5WG
18
UDB-H12
V4
288
7 767 824
4 221 708
3 508 840
12
2
2
19
UDB-H11
V4
173
3 991 281
2 023 746
1 458 385
7
20
UDB-H12
V4
296
5 934 497
3 413 834
2 853 596
13
3
3
CL87D
21
UDB-H12
V4
181
6 785 009
3 720 004
2 792 137
8
2
2
CRRJM
22
UDB-H12
V3–4
268
5 531 591
3 345 618
2 499 843
11
3
23
UDB-H11
V4
235
3 933 425
1 901 242
1 531 626
9
3
CRRCC
24
UDB-H12
V4
476
6 042 832
3 200 840
2 484 142
20
5
25
UDB-H12
V4
441
NA
NA
NA
20
5
3
20 non-standard amplicons were included in this pool, therefore yield data was calculated.
CWL3Y
26
UDB-H12
V3–4
86
10 009 360
5 472 947
4 885 729
4
1
27
SB
V3–4
36
447 184
272 896
241 612
28
SB
V3–4
16
381 340
217 329
173 503
29
SB
V3–4
14
279 219
155 500
124 336
CRRK3
30
UDB-H12
V3–4
546
6 510 999
4 341 343
3 412 042
23
6
31
UDB-H12
V3–4
546
7 659 918
5 040 085
4 010 843
23
6
Overview of all analyzed runs separated by amplicon pools. Yield is given in read pairs and corresponds to Figure 2.
FIGURE 2
Comparisons of yield between barcoding systems. Yield describes the fraction of reads which matched to the expected barcode, linker, and primer by the total number of raw reads in each pool (A) before any quality filtering or denoising/clustering, and (B) after ASV inference and chimera filtering with DADA2. p-values > 0.1 not shown.
Sequence Processing and Analysis
Individual amplicon pools were extracted from the raw sequencing data using the FASTQ workflow in BaseSpace (Illumina) with default parameters, allowing one mismatch for the 6 nt library indexes, followed by filtering for PhiX contamination with BBDuk (BBTools, Bushnell B[1]). Further demultiplexing of each amplicon pool library into single amplicon libraries was performed with the python package demultiplex (Laros JFJ[2]) allowing one mismatch for the 12 nt barcodes, and two mismatches for head sequence (H1 and H2) and primers (Table 1). Barcodes, linkers, and primers were trimmed off using BBDuk. SB datasets were analyzed as described in 14.ASVs were inferred using the DADA2 R package version 1.14.1 Callahan et al. (2016a) with R version 3.6.1 (R Core Team, 2021) applying the recommended workflow (Callahan et al., 2016b). Therefore, 16S rRNA region V4/V3–4 amplicon FASTQ reads one and two were trimmed at 220/230 nt and 150/220 nt with allowed expected errors of 2/4 and 2/6, respectively. If not noted otherwise, ASV inference was performed in pooled mode using all amplicon libraries per sequencing run. Since error modeling in DADA2 can only be performed on sequencing data from the same run, results from separate sequencing runs were merged in R during post processing. All downstream analyses were performed in R version 4.0.2 (R Core Team, 2021).The theoretical distribution of 16S rRNA genes for the ZymoMock, as specified by the provider (ZymoBIOMICS), was used to simulate count data. To this end, provided relative abundances of each 16S rRNA gene per species were used as probabilities in the R function rmultinom with a size of 2,438 simulated read pairs (100 replicates). All genomes of the bacterial mock community members contain multiple rRNA operons. In cases where the 16S rRNA gene copies within a genome were not identical over the amplified regions, they were treated as separate simulated ASVs (Supplementary Table 1). Inferred ASVs from amplicon libraries generated form the ZymoMock were matched to mock community members using the 16S rRNA gene sequences provided by the manufacturer (ZymoBIOMICS) as database for blastn (Nucleotide-Nucleotide BLAST version 2.10.1+) allowing for one mismatch.Alpha diversity metrics, Bray–Curtis dissimilarity, and PCoAs were calculated using the R package ampvis2 version 2.6.5 (Andersen et al., 2018) with the R package vegan version 2.5–6 (Oksanen et al., 2020). All statistical comparisons between groups were performed using an ANOVA with post-hoc test TukeyHSD in R. Adjusted P-value significance codes used in all figures are ∗<0.05, ∗∗<0.01, ∗∗∗<0.001, and ****<0.0001.
Data Availability
Resulting datasets are deposited under the BioSample accession number SAMN16987472 (negative controls), SAMN13555796 (ZymoMock), and SAMN13555843 (environmental sample). Individual SRA accession numbers used in analyses are listed in Supplementary Table 4.
Results and Discussion
Combinatorial Dual and Unique Dual Barcoding Setups for Optimizing Read Assignment Accuracy and Amplicon Multiplexing Throughput
In this study, we extended upon a previously-published approach that utilizes single barcoding (SB) in a two-step PCR-based amplicon barcoding and sequencing workflow (Herbold et al., 2015). We evaluated two dual-barcoding strategies: unique dual barcoding (UDB), to minimize crosstalk between samples multiplexed in a single amplicon pool library, and combinatorial dual barcoding (CDB), to increase the number of amplicon libraries (i.e., samples) that can be multiplexed into a single amplicon pool library to hundreds of thousands.The SB setup relies on the use of a single 16 nt head sequence (H1) modification of both the forward and reverse primers, as well as barcoding of the generated amplicon with one unique 8 nt barcode (bc8)-H1 fusion primer (Figure 1). With the SB approach, 520 samples can be targeted with the same primer pair, and the resulting 520 amplicon libraries can be multiplexed into one amplicon pool library (520 × n × m samples, where n is the number of distinct target-gene primer pairs used and m is the number of individual amplicon pool libraries combined on a sequencing run). As previously outlined (Kircher et al., 2012; Esling et al., 2015), crosstalk risks associated with all SB protocols also apply to this approach.The UDB and CDB approaches we implemented rely on a different set of barcode sequences than the bc8 used in the SB setup. For dual barcoding, we used a subset of 548 unique 12 nt barcodes (bc12) that were selected from a list of 959 bc12 used in the Earth Microbiome Project (Walters et al., 2016), as further detailed in the methods. The advantage of bc12 over bc8 is the increased dissimilarity between individual barcodes, which increases the median Hamming distance from 6 to 9 and the minimum Hamming distance from 2 to 5 (Supplementary Figure 1), consequently decreasing the risk of crosstalk due to sequencing errors in the barcode (Wright and Vetsigian, 2016).The first UDB approach implemented and evaluated, hereafter referred to as UDB-H11, is based on both the forward and reverse primers being modified to include the H1 sequence (analogous to the SB setup), and barcoding being performed with two distinct, unique bc12-H1 fusion primers (Figure 1). This barcoding approach reduces the crosstalk risk associated with single barcoding (Kircher et al., 2012; Esling et al., 2015), but only allows for the multiplexing of 274 samples amplified with the same primer pair into a single amplicon pool library (274 × n × m samples, if n distinct target-gene primer pairs are used and m amplicon pool libraries were combined on a sequencing run).To increase the number of samples that can be amplified with the same primer pair and be multiplexed into a single amplicon pool library, while adhering to the unique dual barcoding principle with the stringent set of bc12 selected, we introduced a second 16 nt head sequence variant (H2). In this UBD setup, referred to as UDB-H12, the forward and reverse primers are modified to contain the H1 and H2 sequence, respectively. In the barcoding (second step) PCR, a unique bc12-H1 and a second, unique bc12-H2 fusion primer are used for sample barcoding (Figure 1). With the UDB-H12 setup, 548 samples amplified with a single primer pair can be multiplexed in one amplicon pool library (548 × n × m samples, if n distinct target-gene primer pairs are used and m amplicon pool libraries were combined on a sequencing run).Finally, we implemented and evaluated a CDB setup, for which the forward and reverse primers are modified to contain the H1 and H2 sequence - like in UDB-H12, and a unique combination of bc12-H1 and bc12-H2 fusion primers is used for barcoding. However, the same bc12-H1 fusion primer is used for barcoding multiple samples, always in combination with a distinct bc12-H2 fusion primer, resulting in not unique, but combinatorially-unique dual-barcoded amplicons (CDB-H12; Figure 1). This approach allows for multiplexing up to 299,756 samples amplified with the same primer pair into a single amplicon pool library (299,756 × n × m samples, if n distinct target-gene primer pairs are used and m amplicon pool libraries were combined on a sequencing run). For this evaluation, we used a subset of barcode combinations available for the CDB approach to multiplex a maximum of 480 samples in a single amplicon pool library.
The Effect of Barcoding Approach on Assignable Sequence Yield per Amplicon Pool Library
When comparing the yield per amplicon pool library, defined here as the fraction of reads that could be assigned to a sample via barcode matching, both CDB-H12 and UDB-H12 barcoding protocols performed comparably to previous results obtained with the SB approach (mean 54, 58, 56%, respectively), while UDB-H11 performed slightly worse (49%) (Figure 2A). The yield after ASV inference, i.e., after removal of low quality sequence reads, denoising and removal of chimeric sequences from each amplicon pool library, was comparable for the SB (46%, 38–54%), UDB-H11 (36%, 26–43%) and UDB-H12 (47%, 41–52%) setup, but significantly lower (Tukey multiple comparisons of means: p = 0.00441, 0.236, and 0.00178, respectively; −18.3, −9.1, and −19.6% difference in means, respectively; ANOVA: degrees of freedom = 3) and highly variable for the CDB-H12 setup (mean 27%, range 9–45%) (Figure 2B).Comparisons of yield between barcoding systems. Yield describes the fraction of reads which matched to the expected barcode, linker, and primer by the total number of raw reads in each pool (A) before any quality filtering or denoising/clustering, and (B) after ASV inference and chimera filtering with DADA2. p-values > 0.1 not shown.
Unique Dual Barcoding Minimizes Sample Crosstalk in Amplicon Pool Libraries
A noteworthy level of crosstalk between samples, defined as (i) the number of ASVs detectable in PCR negative control reactions, (ii) the number and fraction of spurious ASVs detectable in a commercially available mock community standard (ZymoMock), and (iii) the fraction of amplicon samples in which ASVs known to be specific to the mock community sample, was observed in the CDB-H12 setup, while close to no crosstalk was detectable in both UDB setups (Figure 3 and Supplementary Figure 2). Consequently, the species richness, expressed as Chao 1 species richness estimator, of amplicon libraries generated from the mock community standard (ZymoMock) barcoded with either of the two UDB setups (UD-H11 and UD-H12) matched well with the simulated species richness of the mock community standard (Figure 4A), while the species richness metric was inflated through the presence of spurious contaminants in samples barcoded with the CDB-H12 setup (Figure 4B). No significant difference in performance between the UDB-H11 and UDB-H12 setup was detectable.
FIGURE 3
Evaluation of between-library contamination (crosstalk) based on (A) the number of ASVs detected in systematic PCR negative controls; (B) the number and (C) the fraction of spurious ASVs detected in mock community (ZymoBIOMICS Microbial Community DNA Standard II, D6311) amplicon libraries generated in two-step PCR procedure applying UDB-H11, CDB-H12 and UDB-H12 setups. Mock community members were identified using BLAST allowing for one mismatch. All other ASVs were defined as spurious (contaminants). All ASVs detected in systematic PCR negative controls are defined as spurious (contaminants). p-values > 0.1 not shown. P-value significance codes are *< 0.05, **< 0.01, ***< 0.001, and ****< 0.0001.
FIGURE 4
Evaluation of between-library contamination based on the alpha diversity of mock community (ZymoBIOMICS Microbial Community DNA Standard II, D6311) amplicon libraries generated in two-step PCR procedure applying UDB-H11, CDB-H12, and UDB-H12 setups and simulated count data based on the theoretical composition of 16S rRNA genes. (A) Chao 1 species richness estimator, (B) Observed ASVS, (C) Expected vs measured abundance of detected mock community members. All ASVs with one or zero mismatches to any of the reference sequences 16S rRNA genes are included and added up into the same species. Pearson’s r was calculated with log-transformed relative abundances per amplicon pool and then averaged. Enterococcus faecalis (expected 0.00067%) and Staphylococcus aureus (0.0001%) ASVs were sporadically detected between 0.006–0.05% abundance and were excluded due to their high likelihood of being cross-contaminations from gut and skin samples, respectively.
Evaluation of between-library contamination (crosstalk) based on (A) the number of ASVs detected in systematic PCR negative controls; (B) the number and (C) the fraction of spurious ASVs detected in mock community (ZymoBIOMICS Microbial Community DNA Standard II, D6311) amplicon libraries generated in two-step PCR procedure applying UDB-H11, CDB-H12 and UDB-H12 setups. Mock community members were identified using BLAST allowing for one mismatch. All other ASVs were defined as spurious (contaminants). All ASVs detected in systematic PCR negative controls are defined as spurious (contaminants). p-values > 0.1 not shown. P-value significance codes are *< 0.05, **< 0.01, ***< 0.001, and ****< 0.0001.Evaluation of between-library contamination based on the alpha diversity of mock community (ZymoBIOMICS Microbial Community DNA Standard II, D6311) amplicon libraries generated in two-step PCR procedure applying UDB-H11, CDB-H12, and UDB-H12 setups and simulated count data based on the theoretical composition of 16S rRNA genes. (A) Chao 1 species richness estimator, (B) Observed ASVS, (C) Expected vs measured abundance of detected mock community members. All ASVs with one or zero mismatches to any of the reference sequences 16S rRNA genes are included and added up into the same species. Pearson’s r was calculated with log-transformed relative abundances per amplicon pool and then averaged. Enterococcus faecalis (expected 0.00067%) and Staphylococcus aureus (0.0001%) ASVs were sporadically detected between 0.006–0.05% abundance and were excluded due to their high likelihood of being cross-contaminations from gut and skin samples, respectively.To evaluate how well amplicon pool libraries produced with the different barcoding approaches reflect the relative abundances of the starting DNA pool, we correlated expected and measured relative abundances of the six most abundant mock community members (Figure 4C). This revealed similar patterns across all barcoding setups and primers, with little difference in correlation coefficient. Notably, P. aeruginosa was consistently overestimated in abundance, which may be due to primer bias (Schloss et al., 2011).As no significant difference in data quality with regards to crosstalk between samples was observed between the UDB-H11 and UDB-H12 setup and UDB-H12 allows for a higher level of sample multiplexing, only the UDB-H12 and CDB-H12 approaches were further evaluated.
Finally, we evaluated the occurrence of batch effects in the UDB-H12 and CDB-H12 two-step PCR workflows. To this end, we compared the observed microbial community composition retrieved from replicate samples of a repeatedly sequenced (n = 22) complex environmental sample, barcoded with the CDB-H12 or the UDB-H12 setups and sequenced (i) within the same amplicon sequencing run, or (ii) in separate amplicon sequencing runs. Moreover, we evaluated the effect of sequence data processing on the occurrence and significance of batch effects. To this end, DADA2 error modeling and ASV inference was performed (i) using sequence data from all amplicon libraries generated on a sequencing run (i.e., the environmental control sample, mock community standard, and real sample data from other analysis projects), and (ii) using only sequencing data from the environmental control sample amplicon libraries. Finally, we evaluated the effect of spurious ASV filtering on between sample dissimilarity by removing ASVs that never exceeded the relative abundance threshold of 0.25% in the dataset (Reitmeier et al., 2020) prior to calculation of pairwise Bray-Curtis dissimilarity between samples.Regardless of the applied barcoding setup (UDB-H12 or CDB-H12) or data processing pipeline, the pairwise Bray-Curtis dissimilarity between samples barcoded with the same protocol and sequenced within the same run was always significantly lower than the dissimilarity between samples barcoded with the same protocol but sequenced on separate sequencing runs (Figure 5, Tukey multiple comparisons of means: p = 7.52e–14, 1.54e–13, 6.95e–12, 6.89e–13, 7.78e–14, 1.36e–13, 4.83e–08, and 5.64e–09, respectively; difference in means: −0.096, −0.10, −0.070, −0.088, −0.075, −0.081, −0.057, and −0.072, respectively; ANOVA: degrees of freedom = 1). Furthermore, the dissimilarity between samples barcoded with the UDB-H12 setup was significantly lower than the dissimilarity of samples barcoded with the CDB-H12 setup (Figure 5). While the amplicon library input mode for DADA2 error modeling and ASV inference (all amplicon libraries from one run vs. only amplicon libraries of the target sample) did not have an effect on the observed differences between barcoding setups (Figure 5A/E, Tukey multiple comparisons of means: p = 2.78e–05, 2.16e–03, 6.84e–06, 6.02e–04, respectively; difference in means: −0.040, −0.046, −0.034, and −0.040, respectively; ANOVA: degrees of freedom = 1), the removal of spurious ASV by abundance filtering, as expected, strongly reduced the difference in observed sample dissimilarity between the two barcoding setups (Figure 5C/G, Tukey multiple comparisons of means: p = 4.76e–01, 6.12e–02, 9.98e–02, and 2.82e–02, respectively; difference in means: −0.011, −0.029, −0.018, and −0.033, respectively; ANOVA: degrees of freedom = 1). These results indicate that abundance filtering of spurious ASVs is a suitable approach to significantly reduce effects of crosstalk between samples in combinatorial dual (and presumably single) barcoding approaches.
FIGURE 5
Pairwise distances boxplots (A/C/E/G) and oridations (B/D/F/H) between technical replicates of the same soil DNA. Libraries from the same run are of the same color. Amplicons were generated using the V4 primer set. Analyses were performed using all libraries of one run (A–D) or only the control samples (E–H) as input to DADA2 error modeling and pooled ASV inference. Abundance filtering of all ASVs never exceeding 0.25% was performed for (C/D) and (G/H). Three runs per barcoding setup. Libraries were rarefied to the smallest read depth, i.e., (A/B), 2792 reads; (C/D), 2460 reads; (E/F), 2861 reads; (G/H), 2679 reads.
Pairwise distances boxplots (A/C/E/G) and oridations (B/D/F/H) between technical replicates of the same soil DNA. Libraries from the same run are of the same color. Amplicons were generated using the V4 primer set. Analyses were performed using all libraries of one run (A–D) or only the control samples (E–H) as input to DADA2 error modeling and pooled ASV inference. Abundance filtering of all ASVs never exceeding 0.25% was performed for (C/D) and (G/H). Three runs per barcoding setup. Libraries were rarefied to the smallest read depth, i.e., (A/B), 2792 reads; (C/D), 2460 reads; (E/F), 2861 reads; (G/H), 2679 reads.Notably, regardless of the barcoding protocol applied, batch effects associated either with amplicon PCR, sequencing library preparation, or sequencing were nonetheless still observed. While application of the UDB protocol as well as spurious ASV removal do reduce these effects, they do not completely eliminate them. Consequently, the occurrence and magnitude of batch effects must ultimately be taken into consideration and accounted for during amplicon pool library design, barcoding protocol selection, and data analysis.
Conclusion
We established and benchmarked dual barcoding protocols based on a two-step PCR amplicon sequencing workflow (Herbold et al., 2015), and evaluated and compared their performance with regards to accuracy and reproducibility. Combinatorial dual barcoding with 12 nt barcodes allows for multiplexing orders of magnitude higher numbers of amplicon libraries of the same gene target, as long as low levels of crosstalk between samples are of no concern for the research goal. CDB is a promising approach for large-scale amplicon sequencing projects allowing for the cost-effective and time-efficient generation of datasets minimally affected by sequencing-associated batch effects, as hundreds of thousands of samples can be sequenced in a single run. However, for mid-scale amplicon sequencing projects, UDB barcoding strategies are the better choice, as lower levels of sample crosstalk and within-run and between-run dissimilarity between the inferred microbial community composition of replicate samples were observed. Overall, the here described workflow results in an improved consistency and reproducibility between amplicon sequencing runs, via an optimized sample barcoding approach, a standardized amplicon pool design and a new, superior amplicon sequence processing pipeline. The here presented dual barcoding two-step PCR workflows are currently implemented as standard operating procedures (SOPs) for amplicon sequence generation and analysis at the Joint Microbiome Facility (JMF) of the University of Vienna and the Medical University of Vienna[3].
Data Availability Statement
Datasets used in this study are deposited in the NCBI SRA (https://www.ncbi.nlm.nih.gov/sra/) under the BioSample accession numbers: SAMN16987472 (negative controls), SAMN13555796 (ZymoMock), and SAMN13555843 (environmental sample).
Author Contributions
PP, BH, AL, and DB concieved the study. PP, JS, and GK performed experiments. PP, BH, and CH performed data analysis. PP and BH wrote the manuscript with support of all other authors. All authors contributed to the article and approved the submitted version.
Conflict of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Authors: Daniel Pr Herlemann; Matthias Labrenz; Klaus Jürgens; Stefan Bertilsson; Joanna J Waniek; Anders F Andersson Journal: ISME J Date: 2011-04-07 Impact factor: 10.302
Authors: Benjamin J Callahan; Paul J McMurdie; Michael J Rosen; Andrew W Han; Amy Jo A Johnson; Susan P Holmes Journal: Nat Methods Date: 2016-05-23 Impact factor: 28.547
Authors: Jack A Gilbert; Folker Meyer; Dion Antonopoulos; Pavan Balaji; C Titus Brown; Christopher T Brown; Narayan Desai; Jonathan A Eisen; Dirk Evers; Dawn Field; Wu Feng; Daniel Huson; Janet Jansson; Rob Knight; James Knight; Eugene Kolker; Kostas Konstantindis; Joel Kostka; Nikos Kyrpides; Rachel Mackelprang; Alice McHardy; Christopher Quince; Jeroen Raes; Alexander Sczyrba; Ashley Shade; Rick Stevens Journal: Stand Genomic Sci Date: 2010-12-25
Authors: Craig W Herbold; Claus Pelikan; Orest Kuzyk; Bela Hausmann; Roey Angel; David Berry; Alexander Loy Journal: Front Microbiol Date: 2015-07-16 Impact factor: 5.640
Authors: Nadine Bayer; Bela Hausman; Ram Vinay Pandey; Florian Deckert; Laura-Marie Gail; Johanna Strobl; Petra Pjevac; Christoph Krall; Luisa Unterluggauer; Anna Redl; Victoria Bachmayr; Lisa Kleissl; Marion Nehr; Rasmus Kirkegaard; Athanasios Makristathis; Martin L Watzenboeck; Robert Nica; Clement Staud; Lukas Hammerl; Philipp Wohlfarth; Rupert C Ecker; Sylvia Knapp; Werner Rabitsch; David Berry; Georg Stary Journal: Leukemia Date: 2022-10-12 Impact factor: 12.883
Authors: Corinna M Karl; Ana Vidakovic; Petra Pjevac; Bela Hausmann; Gerhard Schleining; Jakob P Ley; David Berry; Joachim Hans; Martin Wendelin; Jürgen König; Veronika Somoza; Barbara Lieder Journal: Front Nutr Date: 2022-05-25
Authors: K Amorim; N Loick-Wilde; B Yuen; J T Osvatic; J Wäge-Recchioni; B Hausmann; J M Petersen; J Fabian; D Wodarg; M L Zettler Journal: Sci Rep Date: 2022-06-13 Impact factor: 4.996
Authors: David Seki; Margareta Mayer; Bela Hausmann; Petra Pjevac; Vito Giordano; Katharina Goeral; Lukas Unterasinger; Katrin Klebermaß-Schrehof; Kim De Paepe; Tom Van de Wiele; Andreas Spittler; Gregor Kasprian; Benedikt Warth; Angelika Berger; David Berry; Lukas Wisgrill Journal: Cell Host Microbe Date: 2021-09-03 Impact factor: 21.023
Authors: Fábio Rangel; Paula Enes; Laura Gasco; Francesco Gai; Bela Hausmann; David Berry; Aires Oliva-Teles; Claudia R Serra; Fátima C Pereira Journal: Front Microbiol Date: 2022-04-12 Impact factor: 6.064