Maura Costello, Mark Fleharty, Justin Abreu, Yossi Farjoun, Steven Ferriera, Laurie Holmes, Brian Granger, Lisa Green, Tom Howd, Tamara Mason, Gina Vicente, Michael Dasilva, Wendy Brodeur, Timothy DeSmet, Sheila Dodge, Niall J Lennon, Stacey Gabriel.
Abstract
BACKGROUND: Here we present an in-depth characterization of the mechanism of sequencer-induced sample contamination due to index swapping, a phenomenon that affects Illumina sequencers employing patterned flow cells with Exclusion Amplification (ExAmp) chemistry (HiSeqX, HiSeq4000, and NovaSeq). We also present a remediation method that minimizes the impact of such swaps.
Keywords: Barcodes; Exclusion amplification; ILLUMINA sequencing; Index; Index hopping; Index swapping; Indexes; Massively parallel sequencing; Multiplexing; Next generation sequencing
Year: 2018 PMID: 29739332 PMCID: PMC5941783 DOI: 10.1186/s12864-018-4703-0
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Percent contamination over time for whole genomes sequenced on HiSeqX. Panel at left: Single indexed PCR-free library contamination by month. Contamination significantly increased when we began 8-plex pooling and worsened as we introduced 24-plex pooling. Panel at right: Single indexed PCR-plus library contamination by month. Although overall contamination was lower for PCR-plus, rates also increased significantly when we began pooling.
Fig. 2 Contamination for single versus dual indexed pooled PCR-free libraries on HiSeqX. Continuous month-by-month run chart of percent contamination, as measured by VerifyBamID [3], for 24-plexed PCR-free genomes, demonstrating the drop in mean contamination after implementation of unique dual indexing. The red reference line is the mean; the green reference lines are the upper and lower control limits of the data, generated by JMP statistical software.
Mean Index swapping rates by library prep method and machine type
| Library type | Method description | Multiplex PCR prior to sequencing? | Mean library yield | Index swap rate (%)ᵃ, MiSeq | Index swap rate (%)ᵃ, HiSeqX or 4000 | Index swap rate (%)ᵃ, NovaSeqᵇ |
|---|---|---|---|---|---|---|
| PCR-free genomes | DNA shearing + adapter ligation | - | 2.8 nM | 0.13 ± 0.08 | 3.01 ± 0.91 | 4.85 ± 0.88 |
| PCR-plus genomes | DNA shearing + adapter ligation | - | 141.1 nM | 0.03 ± 0.04 | 0.24 ± 0.06 | |
| Somatic exome | DNA shearing + adapter ligation | + | 354.2 nM | 0.67 ± 0.08 | 0.83 ± 0.23 | 0.52 |
| Germline exome | Nextera transposase | + | 286.2 nM | 0.49 ± 0.12 | 0.68 ± 0.19 | |
| Stranded mRNA | cDNA prep + adapter ligation | - | 39.8 nM | 0.01 ± 0.001 | 0.32 ± 0.02 | |
ᵃ All swap rate values for each library type are means of 8 different pool and flow cell observations.
ᵇ As of submission, NovaSeq data had only been generated for PCR-free genomes and one flow cell of exomes.
Fig. 3 Index swapping leads to incorrect assignment of reads from fusion transcripts in cell line RNA-seq data. Counts of reads spanning fusion transcripts for 5 different gene fusions in 3 different cell lines using STAR-Fusion software. Four RNA-seq libraries were pooled for each cell line, for a total of 12 libraries, and sequenced on a HiSeq 4000 lane. Only the K562 cell line should contain the BCR-ABL1 translocation; however, reads containing BCR-ABL1 (blue and black striped) were also found in data files for the other two cell lines due to index swapping.
Fig. 4 Variability of index swap rates from pool to pool and flow cell to flow cell. Index swapping rates plotted for seven 24-plex pools, each sequenced on at least two HiSeqX flow cells and prepared using identical automated methods on a Hamilton MiniStar. Each data point represents a flow cell lane. The data show variability between different pools, and also for the same pool run on different flow cells, indicating that the flow cell and/or ExAmp reagents also influence swap rate variability.
Fig. 5 Characterization of index swapping mechanism. a Diagram of a HiSeqX flow cell lane colored by number of index swaps detected at each surface tile, showing relatively uniform distribution of swapping across the entire lane and both surfaces. b Read counts for all 36 index combinations in a 6-plex pool of uniquely dual indexed libraries. The combinations in heavy bordered cells with blue text along the diagonal are the correct index combinations; read counts for all other combinations are due to index swapping. Note that all indexes participate in swapping relatively equally. c Mean insert size (bp) and percent chimerism calculated by Picard for both swapped and non-swapped reads. Swapped reads have shorter inserts and higher rates of chimeric read pairs. d Normalized human coverage across GC content bins, indicating that there are fewer high-GC reads in the swapped population (blue) than in the non-swapped (red) and all other non-demultiplexed (green) populations.
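The logic of panel b can be sketched in a few lines of Python: with unique dual indexing, every library has its own i7/i5 pair, so any read carrying two valid indexes in an unexpected combination can be counted as a swap. This is a minimal illustration, not the authors' pipeline. The function name `swap_rate`, the index labels, and all read counts below are hypothetical toy values and are not taken from the paper.

```python
from typing import Dict, Set, Tuple

IndexPair = Tuple[str, str]  # (i7 index, i5 index)


def swap_rate(counts: Dict[IndexPair, int], expected: Set[IndexPair]) -> float:
    """Return the fraction of reads whose (i7, i5) pair is not an expected one."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads to evaluate")
    correct = sum(n for pair, n in counts.items() if pair in expected)
    return (total - correct) / total


# Hypothetical 3-plex pool (toy numbers, NOT data from the paper).
# The "diagonal" pairs are the combinations assigned to real libraries.
expected_pairs = {("i7_A", "i5_A"), ("i7_B", "i5_B"), ("i7_C", "i5_C")}

toy_counts = {
    ("i7_A", "i5_A"): 9_700, ("i7_A", "i5_B"): 50, ("i7_A", "i5_C"): 50,
    ("i7_B", "i5_A"): 50, ("i7_B", "i5_B"): 9_700, ("i7_B", "i5_C"): 50,
    ("i7_C", "i5_A"): 50, ("i7_C", "i5_B"): 50, ("i7_C", "i5_C"): 9_700,
}

rate = swap_rate(toy_counts, expected_pairs)
print(f"swap rate: {rate:.2%}")
```

Only swaps between indexes present in the pool are visible to this kind of count; reads whose index sequences match nothing in the pool would need to be handled separately.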
Swap probability calculations for Human & E. coli library mixture experiment
| Read count | |
|---|---|
| Total PF indexed readsᵃ | 842,853,260 |
| Total Non-swapped reads | 807,029,454 |
| Total swapped reads | 34,219,842 |
| p(Total Swap) = (Total swapped reads) / (Total PF indexed reads) = 0.0406 | |
| Undetermined i7 or i5 swaps | 17,136,498 |
| Known i7 swaps | 5,300,327 |
| Known i5 swaps | 12,697,618 |
| Known double i7 and i5 swaps | 689,363 |
| Estimated total i7 swaps | 11,036,324 |
| Estimated total i5 swaps | 25,476,845 |
| p(i7 Swap) = (Estimated total i7) / (Non-swap + Undet. + Known i5 + Known i7 + Double) = 0.0131 | |
| p(i5 Swap) = (Estimated total i5) / (Non-swap + Undet. + Known i5 + Known i7 + Double) = 0.0302 | |
ᵃ All passing-filter (PF) reads with high-quality index reads matching any index used within the pool.
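The arithmetic in the table above can be checked with a short script. The following is a minimal Python sketch that uses only the read counts listed in the table; the variable names are illustrative. Two points are assumptions rather than statements from the source: the estimated i7 total is read as 11,036,324 (the value that reproduces the reported 0.0131), and the proportional split of the undetermined swaps shown at the end is just one allocation that happens to reproduce the estimated totals.

```python
# Read counts copied from the swap probability table.
total_pf_indexed = 842_853_260
non_swapped = 807_029_454
total_swapped = 34_219_842
undetermined = 17_136_498
known_i7 = 5_300_327
known_i5 = 12_697_618
known_double = 689_363
est_total_i7 = 11_036_324  # ASSUMPTION: reading of the garbled "110,36,324"
est_total_i5 = 25_476_845

# Denominator exactly as written in the p(i7 Swap) and p(i5 Swap) rows.
denominator = non_swapped + undetermined + known_i5 + known_i7 + known_double

p_total = total_swapped / total_pf_indexed
p_i7 = est_total_i7 / denominator
p_i5 = est_total_i5 / denominator

print(f"p(Total Swap) = {p_total:.4f}")
print(f"p(i7 Swap)    = {p_i7:.4f}")
print(f"p(i5 Swap)    = {p_i5:.4f}")

# ASSUMPTION (not stated in the source): split the undetermined swaps
# between i7 and i5 in proportion to the known single-index swaps, and
# count each double swap once on each side.
i7_share = known_i7 / (known_i7 + known_i5)
derived_i7 = known_i7 + known_double + undetermined * i7_share
derived_i5 = known_i5 + known_double + undetermined * (1 - i7_share)

print(f"derived i7 total = {derived_i7:,.0f}")
print(f"derived i5 total = {derived_i5:,.0f}")
```

With these counts, the five terms of the denominator add up to the total PF indexed reads, so both per-index probabilities are expressed relative to all indexed reads in the pool.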