Simon Haile1, Richard D Corbett1, Steve Bilobram1, Morgan H Bye1, Heather Kirk1, Pawan Pandoh1, Eva Trinh1, Tina MacLeod1, Helen McDonald1, Miruna Bala1, Diane Miller1, Karen Novik1, Robin J Coope1, Richard A Moore1, Yongjun Zhao1, Andrew J Mungall1, Yussanne Ma1, Rob A Holt1, Steven J Jones1, Marco A Marra1,2.
Abstract
Tissues used in pathology laboratories are typically stored as formalin-fixed, paraffin-embedded (FFPE) samples. One important consideration in repurposing FFPE material for next-generation sequencing (NGS) analysis is the sequencing artifacts that can arise from the significant damage to nucleic acids caused by formalin treatment, storage at room temperature and extraction. One such class of artifacts consists of chimeric reads that appear to be derived from non-contiguous portions of the genome. Here, we show that a major proportion of such chimeric reads align to both the 'Watson' and 'Crick' strands of the reference genome. We refer to these as strand-split artifact reads (SSARs). This study provides a conceptual framework for the mechanistic basis of the genesis of SSARs and other chimeric artifacts, along with supporting experimental evidence, which together have led to approaches to reduce the levels of such artifacts. We demonstrate that one of these approaches, involving S1 nuclease-mediated removal of single-stranded fragments and overhangs, also reduces sequence bias, base error rates, and false positive detection of copy number and single nucleotide variants. Finally, we describe an analytical approach for quantifying SSARs from NGS data.
Year: 2019 PMID: 30418619 PMCID: PMC6344851 DOI: 10.1093/nar/gky1142
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Differences in library quality between matched FFPE and FF tissue samples and possible underlying mechanism. (A) Comparison of the frequency of properly paired (PP) reads between matched FFPE (n = 38) and FF (n = 38) tissue samples. (B) Diagrammatic depiction of strand-split artifact reads (SSARs). Typical PP reads (I) and an example of PP reads with SSAR in Read 2 (II) are shown. For III, Read 1 is contiguously aligned to the reference genome while Read 2 is split into two parts. Part ‘a’ of Read 2 aligns in the expected paired-end orientation while the distal end of Read 2 (part ‘b’) does not match the reference sequence at that position. Part ‘b’ of Read 2 does match the reference genome near part ‘a’, but aligns in the opposite orientation (b’). Sequence homologies in the reference genome between the regions are marked with red asterisks ‘*’. (C) Comparison of SSAR levels between matched FFPE (n = 38) and FF (n = 38) tissue samples.
Figure 2. SSAR mapping and diagrammatic depiction of the proposed mechanism. The SSAR example shown is a screen shot of an actual IGV image. At the top (I) we depict a ds-DNA region of intact gDNA. In the process of FFPE preparation, storage and extraction (II), gDNA is fragmented and denatured. In the absence of S1 nuclease (III left), ss-DNA fragments from non-contiguous regions of the genome anneal via short complementary repetitive sequences (red asterisks). In contrast, ss-DNA fragments and overhangs are removed upon treatment with S1 nuclease (III right). During the end-repair step of library construction, T4 DNA polymerase removes overhangs (IV) and fills ends (V), resulting in the formation of double-stranded chimeric fragments (‘A’ in V). One class of such chimeric fragments yields SSARs (‘A’ in VI). R1 = Read 1; R2 = Read 2. For SSARs, part of Read 2 aligns in the expected paired-end orientation while the distal end of Read 2 does not match the reference at that position and instead aligns to a nearby region of the reference genome in the opposite orientation (denoted as R2′).
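The alignment signature described in this caption (part of a read aligned as expected, the rest aligned nearby on the opposite strand) can be approximated from split-read records. The following is a minimal sketch and not the authors' analytical approach, which is not detailed in this record. It assumes reads were aligned by a tool that reports split alignments in the standard SAM `SA:Z` tag, and the function names and the `max_distance` threshold are hypothetical choices made for illustration.

```python
from typing import Iterable, List, Tuple


def parse_sa_tag(sa_value: str) -> List[Tuple[str, int, str]]:
    """Parse a SAM SA:Z value ('rname,pos,strand,CIGAR,mapQ,NM;...')
    into (rname, pos, strand) tuples."""
    segments = []
    for entry in sa_value.strip(";").split(";"):
        if not entry:
            continue
        rname, pos, strand, _cigar, _mapq, _nm = entry.split(",")
        segments.append((rname, int(pos), strand))
    return segments


def is_primary(flag: int) -> bool:
    """True for mapped records that are neither secondary nor supplementary."""
    return not (flag & 0x4 or flag & 0x100 or flag & 0x800)


def is_ssar_candidate(sam_line: str, max_distance: int = 1000) -> bool:
    """Flag a primary alignment whose supplementary segment lies on the same
    reference sequence, on the opposite strand, within max_distance bases.
    max_distance is an illustrative assumption, not a published threshold."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if not is_primary(flag):
        return False
    rname, pos = fields[2], int(fields[3])
    strand = "-" if flag & 0x10 else "+"
    sa_value = next((f[5:] for f in fields[11:] if f.startswith("SA:Z:")), None)
    if sa_value is None:
        return False
    return any(
        seg_rname == rname
        and seg_strand != strand
        and abs(seg_pos - pos) <= max_distance
        for seg_rname, seg_pos, seg_strand in parse_sa_tag(sa_value)
    )


def ssar_fraction(sam_lines: Iterable[str], max_distance: int = 1000) -> float:
    """Fraction of primary alignments that look like SSAR candidates."""
    total = hits = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        if not is_primary(int(line.split("\t")[1])):
            continue
        total += 1
        hits += is_ssar_candidate(line, max_distance)
    return hits / total if total else 0.0
```

In practice the SAM text could be streamed from a BAM file (for example via `samtools view`), and the distance threshold would need tuning against known FFPE and fresh-frozen libraries.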
Figure 3. Measures to improve FFPE library quality and their effects on the levels of properly paired (PP) reads, chimeric reads and base error rates. (A) Effects of reverse cross-link timing. A time course experiment was performed using mouse liver FFPE tissue as the starting material. 100 ng of FormaPure extracted total nucleic acid (TNA) was used. N = 3 (technical replicates). Error bars = standard deviations. P < 0.05 (relative to 2 h). (B) Effects of S1 nuclease treatment. 100 and/or 300 ng TNA extracted using the FormaPure protocol was used with (F+S1) or without (F-S1) S1 nuclease treatment. 100 ng DNA extracted using the Qiagen/HiPure protocol was used with (A–H+S1) or without (A–H-S1) S1 nuclease treatment. N = 5 (FFPE samples from five patients). Of note, these samples were not patient-matched between the extraction protocols (A–H and F). Error bars = standard deviations. *P < 0.05 (relative to A–H-S1); #P < 0.05 (relative to F-S1). (C) Comparisons of ligation-based and tagmentation-based library construction protocols. 20 ng of FormaPure extracted TNA was used for library construction using the ligation-based protocol (F+Lig) or the tagmentation-based protocol (F+Tag). N = 3 (FFPE samples from three patients). Error bars = standard deviations. *P < 0.05 (relative to F+Lig).
Figure 4. Effects of S1 nuclease treatment on genome coverage and sequence bias. (A) Genome coverage. A screen shot of an IGV image of a representative chromosomal region is shown for libraries that were prepared from fresh normal and tumor samples, and matching FormaPure FFPE samples with (F+S1) or without (F-S1) S1 nuclease treatment. The lower panel is an enlarged portion of the region shown in the upper panel. Colored vertical lines within coverage histograms designate consensus SNVs that are also shown as colored vertical lines within reads. Colored arrow boxes within reads represent SSARs and improperly paired artifacts (red = insert size too large relative to the consensus insert size range; blue = insert size too small; green = SSARs; other colors depict paired reads that aligned to regions from different chromosomes). (B) Effects of S1 nuclease treatment on GC-bias. Samples are the same as in (A). The upper panel shows normalized coverage data at various levels of GC-content and the lower panel shows read distribution as a function of GC-content.
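For readers unfamiliar with the quantity plotted in a GC-bias panel, the sketch below computes one common definition of normalized coverage: the mean depth of genomic windows in a GC-content bin divided by the mean depth over all windows. Whether the figure uses exactly this definition is not stated in this record. The function name, the `(sequence, mean_depth)` input shape and the 10-percentage-point bin width are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def normalized_coverage_by_gc(
    windows: Iterable[Tuple[str, float]], bin_width: int = 10
) -> Dict[int, float]:
    """Return {GC-percent bin start: mean depth in bin / overall mean depth}.

    windows holds (reference sequence of the window, mean read depth).
    Windows with no A/C/G/T bases (for example all-N windows) are skipped.
    """
    depth_sum = defaultdict(float)
    window_count = defaultdict(int)
    total_depth = 0.0
    n_windows = 0
    for seq, depth in windows:
        s = seq.upper()
        gc = s.count("G") + s.count("C")
        acgt = gc + s.count("A") + s.count("T")
        if acgt == 0:
            continue
        gc_percent = round(100 * gc / acgt)
        bin_start = (gc_percent // bin_width) * bin_width
        depth_sum[bin_start] += depth
        window_count[bin_start] += 1
        total_depth += depth
        n_windows += 1
    if n_windows == 0 or total_depth == 0:
        return {}
    overall_mean = total_depth / n_windows
    return {
        b: (depth_sum[b] / window_count[b]) / overall_mean
        for b in sorted(depth_sum)
    }
```

Under this definition, a library without GC-bias gives values close to 1.0 in every bin, while under-represented GC ranges fall below 1.0.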
Figure 5. Effects of S1 nuclease treatment on FFPE-associated somatic SNV noise. Libraries were prepared from fresh normal and tumor samples from the same patient, and matching FormaPure FFPE samples with (F+S1) and without (F-S1) S1 nuclease treatment. For each of the three latter libraries, SNVs were identified relative to the library from the normal blood sample. Upset plots indicating data overlaps are shown. (A) Data obtained using a QSS score cutoff of ≥15. (B) Data obtained using a QSS score cutoff of ≥35.
Figure 6. Effects of S1 nuclease treatment on FFPE-associated CNV noise. (A) Example illustrating CNV noise. Samples are the same as in Figure 5. Using a bin size of 200 reads, CNV segments were calculated in the tumor samples relative to the normal blood sample and the resulting profile is shown for chromosome 14. (B) CNV counts at the gene level. Gains are shown in the left panel and losses in the middle panel. Jaccard's intersection index (Materials and Methods) is shown in the right panel as a measure of overlap of gene-level CNVs between the three samples.
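The overlap measure named in this caption is defined in the paper's Materials and Methods, which are not reproduced in this record. As a hedged illustration, the sketch below uses the standard Jaccard index, the size of the intersection divided by the size of the union, applied to two sets of genes called as gained or lost. The function name and the example gene sets are hypothetical.

```python
from typing import Iterable


def jaccard_index(genes_a: Iterable[str], genes_b: Iterable[str]) -> float:
    """Standard Jaccard index |A ∩ B| / |A ∪ B| for two gene collections.

    Returns 0.0 when both collections are empty (a convention chosen here).
    """
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical gene-level gain calls from two libraries of the same tumor.
fresh_gains = {"TP53", "MYC", "EGFR"}
ffpe_gains = {"MYC", "EGFR", "KRAS"}
overlap = jaccard_index(fresh_gains, ffpe_gains)  # 2 shared / 4 total = 0.5
```

A value of 1.0 means the two libraries yield identical gene-level calls, and lower values indicate more library-specific calls, which in this context may reflect artifact-driven noise.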