Suet-Feung Chin1, Angela Santonja2, Marta Grzelak3, Soomin Ahn4, Stephen-John Sammut5, Harry Clifford3, Oscar M Rueda6, Michelle Pugh7, Mae A Goldgraben8, Helen A Bardwell3, Eun Yoon Cho9, Elena Provenzano10, Federico Rojo11, Emilio Alba12, Carlos Caldas13. 1. Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, Robinson Way, Cambridge CB2 0RE, UK; Department of Oncology, University of Cambridge, Cambridge CB2 2QQ, UK; Cancer Research UK Cambridge Cancer Centre, Cambridge CB2 0QQ, UK. Electronic address: suet-feung.chin@cruk.cam.ac.uk. 2. Medical Oncology Service, Hospital Universitario Regional y Virgen de la Victoria, Instituto de Investigación Biomédica de Málaga (IBIMA), Málaga, Spain; Laboratorio de Biología Molecular del Cáncer, Centro de Investigaciones Médico-Sanitarias (CIMES), Universidad de Málaga, Málaga, Spain. 3. Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, Robinson Way, Cambridge CB2 0RE, UK. 4. Department of Pathology, Seoul National University Bundang Hospital, 82, Gumi-ro 173 Beon-gil, Bundang-gu, Seongnam, Gyeonggi 13620, Republic of Korea; Inivata, Li Ka Shing Centre, Robinson Way, Cambridge CB2 0RE, UK. 5. Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, Robinson Way, Cambridge CB2 0RE, UK; Department of Oncology, University of Cambridge, Cambridge CB2 2QQ, UK; Cancer Research UK Cambridge Cancer Centre, Cambridge CB2 0QQ, UK. 6. Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, Robinson Way, Cambridge CB2 0RE, UK; Cancer Research UK Cambridge Cancer Centre, Cambridge CB2 0QQ, UK. 7. Inivata UK, The Portway Building, Granta Park, Cambridge CB21 6GS, UK. 8. Department of Medical Genetics, University of Cambridge, Cambridge CB2 0QQ, UK. 9. Department of Pathology and Translational Genomics, Samsung Medical Center, Sungkyunkwan University School of Medicine, 50 Irwon-dong, Gangnam-gu, Seoul 135-710, Republic of Korea. 10. Cambridge Breast Unit, Addenbrooke's Hospital, Cambridge University Hospital NHS Foundation Trust, NIHR Cambridge Biomedical Research Centre, Cambridge CB2 2QQ, UK; Cancer Research UK Cambridge Cancer Centre, Cambridge CB2 0QQ, UK. 11. Pathology Department, Instituto de Investigación Sanitaria Fundación Jiménez Díaz (IIS-FJD), Madrid, Spain; GEICAM-Spanish Breast Cancer Research Group, Madrid, Spain. 12. Medical Oncology Service, Hospital Universitario Regional y Virgen de la Victoria, Instituto de Investigación Biomédica de Málaga (IBIMA), Málaga, Spain; GEICAM-Spanish Breast Cancer Research Group, Madrid, Spain; Laboratorio de Biología Molecular del Cáncer, Centro de Investigaciones Médico-Sanitarias (CIMES), Universidad de Málaga, Málaga, Spain. 13. Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, Robinson Way, Cambridge CB2 0RE, UK; Department of Oncology, University of Cambridge, Cambridge CB2 2QQ, UK; Cambridge Breast Unit, Addenbrooke's Hospital, Cambridge University Hospital NHS Foundation Trust, NIHR Cambridge Biomedical Research Centre, Cambridge CB2 2QQ, UK; Cancer Research UK Cambridge Cancer Centre, Cambridge CB2 0QQ, UK. Electronic address: carlos.caldas@cruk.cam.ac.uk.
Abstract
Pathology archives with linked clinical data are an invaluable resource for translational research, with the limitation that most cancer samples are formalin-fixed paraffin-embedded (FFPE) tissues. Therefore, FFPE tissues are an important resource for genomic profiling studies but are under-utilised due to the low amount and quality of extracted nucleic acids. We profiled the copy number landscape of 356 breast cancer patients using DNA extracted FFPE tissues by shallow whole genome sequencing. We generated a total of 491 sequencing libraries from 2 kits and obtained data from 98.4% of libraries with 86.4% being of good quality. We generated libraries from as low as 3.8 ng of input DNA and found that the success was independent of input DNA amount and quality, processing site and age of the fixed tissues. Since copy number alterations (CNA) play a major role in breast cancer, it is imperative that we are able to use FFPE archives and we have shown in this study that sWGS is a robust method to do such profiling.
Pathology archives with linked clinical data are an invaluable resource for translational research, with the limitation that most cancer samples are formalin-fixed paraffin-embedded (FFPE) tissues. Therefore, FFPE tissues are an important resource for genomic profiling studies but are under-utilised due to the low amount and quality of extracted nucleic acids. We profiled the copy number landscape of 356 breast cancerpatients using DNA extracted FFPE tissues by shallow whole genome sequencing. We generated a total of 491 sequencing libraries from 2 kits and obtained data from 98.4% of libraries with 86.4% being of good quality. We generated libraries from as low as 3.8 ng of input DNA and found that the success was independent of input DNA amount and quality, processing site and age of the fixed tissues. Since copy number alterations (CNA) play a major role in breast cancer, it is imperative that we are able to use FFPE archives and we have shown in this study that sWGS is a robust method to do such profiling.
Comparative Genomic Hybridisation (CGH) (Kallioniemi et al., 1992) has had a significant impact in the study of cancer genomes. Chromosomal regions gained or lost in the tumor could be easily visualised by hybridization onto normal human metaphase spreads, allowing characterisation of genome-wide copy number alterations (CNA) in tumours (Kallioniemi et al., 1992). Microarrays with DNA probes (cloned DNA or oligonucleotides) spotted onto glass slides representing the entire genome soon replaced normal chromosomes (Pinkel et al., 1998) making it faster and easier to profile. The importance of characterizing somatic CNAs in cancer is now well established, with a recent TCGA pan-cancer analysis showing that humantumours can be classified into mutation driven (M-class) or copy-number driven (C-class) subtypes. Breast cancer is a C-class cancer type (Ciriello et al., 2013) and we have previously shown that CNAs are the main determinants of the expression architecture of breast cancers. Using gene expression driven in cis by CNAs, we have generated a new molecular taxonomy of breast cancer with 10 genomic driver-based subtypes termed Integrative Clusters. The samples used in this analysis were derived from the METABRIC cohort, which encompassed a large biobank of fresh frozen tumor samples collected across five major teaching hospitals in the UK and Canada (Curtis et al., 2012).Formalin-fixed paraffin-embedded (FFPE) tissue samples are more routinely collected and hence more representative of cancer in the general population. These FFPE archives are a valuable resource for molecular profiling in cancer research. Whilst the fixation process is essential to protect cellular morphology and protein expression, it is detrimental to nucleic acids and results in their chemical modification and degradation. As a result, extraction of DNA from FFPE tissues results in lower yields when compared to extraction from fresh frozen tissues. DNA extracted from FFPE works well for downstream applications using polymerase chain reaction (PCR), particularly for small size amplicons (<300 base pairs), but for other applications, including microarray based CGH, where efficient labelling of the DNA is dependent on its integrity, its use is more challenging. There have been several studies describing different methods for DNA extraction (Janecka et al., 2015), quality control (van Beers et al., 2006; Dang et al., 2016), labelling (Salawu et al., 2012) and other optimisation protocols (Hosein et al., 2013) to improve the performance of FFPE DNA on microarrays. In the past, we have tried to profile CNAs using FFPE DNA on microarrays with limited success. Only Illumina Infinium and Molecular Inversion Probe (MIP, Affymetrix) arrays yielded good results but these required good quality and at least 200 ng of DNA (Iddawela et al., 2017).Next generation sequencing has revolutionised cancer genomics. It is now relatively easy and inexpensive to sequence an entire genome. However, as with microarrays, the robustness of the results obtained are dependent on the quality of the input DNA. Two recent studies have demonstrated the feasibility of doing shallow whole-genome sequencing (sWGS) for CNA profiling using DNA extracted from FFPE tissue material (Scheinin et al., 2014; Kader et al., 2016). The first report used 250 ng of DNA from FFPE tissues and a breast cancer cell line to produce libraries and developed an analytical method for sWGS. The second study compared several sequencing library production kits and reported generating successful sequencing libraries with low input DNA in a small number of FFPE samples.Here we present extensive sWGS data generated from DNA extracted from FFPE breast cancer samples to describe steps to ensure successful libraries.
Materials
Specimen collection
FFPE tissue samples from invasive breast cancerpatients diagnosed between 1997 and 2014 were obtained from several tumor repositories: Addenbrooke's Hospital in Cambridge (n = 62), a consortium of hospitals participating in clinical trials (GEICAM) in Spain (n = 172), and Samsung Medical Center in South Korea (n = 122). In some cases, we extracted DNA from adjacent normal (n = 15) and DCIS (n = 115) samples. Some of the clinical trials samples were biopsies taken at diagnosis (n = 107) and/or surgery (n = 106) where 41 are paired. All tumor samples were collected with informed patient consent and their use for genomics profiling had ethics approval from the institutional review board for each of the biobanks (Cambridge: REC ref. 07/H0308/161; South Korea: 2014-10-041; Spain: NCT00432172 & NCT00841828). Detailed information on the sample cohort is collated in Table 1.
Table 1
Features of input DNA and libraries generated from FFPE blocks collected at three different sites. Data provided in minimum-maximum range and median in brackets.
Cohort
Cambridge
Korea
Spain
Patients
62
122
172
Age (years)
2–19 (12.4)
3–8 (6.4)
9–10(9)
DNA quality (ΔCt)
−0.3–8.1 (3.5)
1–17.1 ⁎ (5.7)
−3.2-7.8 (3.5)
Fragment size (bp)
217–324 (247.5)
197–288⁎ (236)
180–251⁎ (225)
Library yield (nM)
0–77 (17.3)
0–925 (12.3)
0.18–278 (13.8)
Good quality libraries (%)
48 (87.3)
48 (97.1)
163 (74.8)
Age = years since blocks were generated. ΔCt = difference between the cycle threshold of test to the control template ACD1 provided in the kit. ng = nanogram, PCR = polymerase chain reaction, bp = base pairs, nM = nanomoles.
Denotes the site where there is a significant difference to the index group (ie Cambridge).
Features of input DNA and libraries generated from FFPE blocks collected at three different sites. Data provided in minimum-maximum range and median in brackets.Age = years since blocks were generated. ΔCt = difference between the cycle threshold of test to the control template ACD1 provided in the kit. ng = nanogram, PCR = polymerase chain reaction, bp = base pairs, nM = nanomoles.Denotes the site where there is a significant difference to the index group (ie Cambridge).
DNA extraction and quality control
DNA was extracted from either one mm cores punched from tissue blocks or from 10 × 30 μm sections (Cambridge and Korea), or 4-6 × 10 μm sections (Spain) from FFPE blocks, using Qiagen QIAmp DNeasy Kits (Qiagen, Germany) according to the manufacturer's instructions. All DNA samples were quantified fluorometrically using the Qubit dsDNA High Sensitivity Quantification Reagent (ThermoFisher, USA). The DNA quality was assessed using Illumina's FFPE QC kit, a quantitative PCR (q-PCR) assay. All test DNAs and the template control provided in the kit (ACD1) were diluted to 0.25 ng/μl and PCR reactions set up in triplicate as per manufacturer's instructions. DNA quality was quantified as the difference between the Ct (cycle threshold) value of the test FFPE-extracted DNA against the Ct value of the control DNA template.
DNA fragmentation
DNA samples of different concentrations (4-500 ng) were diluted in water to a final volume of 15 μl in Covaris microTUBE-15 8 strip tubes (Covaris, USA) and fragmented to an average size distribution of 150-180 bp with Covaris LE220 Focused Ultrasonicator with Adaptive Focused Acoustics technology. The following parameters were used for shearing: Peak Incident Power: 180 W; Duty Factor: 30%; Cycles per Burst: 50; with the fragmentation time: 250 s for DNA with ΔCt <10, and 200 s for DNA with ΔCt ≥10.
Sequencing library generation
Sequencing libraries were generated using either the beta testing version of the Illumina FFPE TruSEQ kit (ILMN, libraries = 45) or the Rubicon Genomics Thruplex DNASeq (RGT, libraries = 446), as per manufacturer's instructions. For four samples, we generated sequencing libraries using both kits to compare their performance (Supplementary Fig. 1a–b). The sample metrics for both kits are presented in Supplementary Table 1.The ILMN libraries were generated manually whilst RGT libraries were generated either on the Agilent Bravo (n = 228) or manually (Spain, n = 218). Final libraries were purified using magnetic beads (Agencourt SPRI beads, Becton Dickinson, USA) and eluted libraries were quantified using Kapa Library Quantification kit (Roche Life Technologies, USA). Fragment size distributions were analysed utilising a 2100 Bioanalyzer with a DNA High Sensitivity kit (Agilent Technologies, USA). Two nanomoles (nM) of each library were prepared and 48 samples were pooled in one lane for sequencing on a HiSeq4000 (Illumina, USA). The pools were re-quantified and normalised to 10 nM. Single end sequencing was conducted for 50 cycles, generating on average 4.3 × 108 reads per lane.
Bioinformatics
Alignment against the GRCh 37 assembly of the human genome was performed using BWA ver. 0.7.9 (Li and Durbin, 2009) or NovoAlign ver. 3.2.13 (NovoCraft, Malaysia). PCR and optical duplicates were identified using Picard tools (https://broadinstitute.github.io/picard) or Novosort (NovoCraft, Malaysia). Circular binary segmentation on the aligned files was performed in 100 kb windows using the QDNAseq R package available on Bioconductor, which corrects for mappability and GC content (Scheinin et al., 2014). All statistical analyses were performed in R using the functions lm() for fitting linear models and t-test() for Welch two-sample t-test.
Results
The majority of the FFPE samples available were core biopsies collected as part of a neoadjuvant clinical trial (GEICAM/2006-03, n = 107) yielding low amounts of DNA (range = 4–61 ng, median 30 ng). Therefore, to successfully generate libraries for CNA profiling using limited input DNA, we needed to understand how different variables could influence the quality of libraries and steps that can be taken to ensure good sequencing results (Fig. 1).
Fig. 1
Overall Design: Schematic showing the workflow to ensure successful shallow whole genome sequencing (sWGS) libraries.
Overall Design: Schematic showing the workflow to ensure successful shallow whole genome sequencing (sWGS) libraries.
Assessment of the copy number plots
We examined the copy number plots by manual inspection and categorised them based on the variance in the CN data for each case into categories: “Very Good”, “Good”, “Intermediate” and “Poor” (Fig. 2a). We also used QDNAseq (Scheinin et al., 2014) which calculates the expected (estimated from read depth) and measured (using read depth and influenced by DNA quality) standard deviation of the summarised reads, as a measure of variance. Both measures increased as the quality of library decreased and validated our categorisation of library quality (measured standard deviation shown in Fig. 2b).
Fig. 2
Categorisation of copy number profiles.
A. Examples of QDNASEQ copy number plots scored as Very Good, Good, Intermediate and Poor. Failed libraries had very few reads and are not shown. Green dots represent regions of gains/amplifications and red dots represent regions of loss/deletion. B. Boxplots showing increasing measured standard deviations with decreasing libraries'qualities. Dots represent individual samples within each category.
VG = very good, G = good, I = intermediate, P = poor, F = fail.
Categorisation of copy number profiles.A. Examples of QDNASEQ copy number plots scored as Very Good, Good, Intermediate and Poor. Failed libraries had very few reads and are not shown. Green dots represent regions of gains/amplifications and red dots represent regions of loss/deletion. B. Boxplots showing increasing measured standard deviations with decreasing libraries'qualities. Dots represent individual samples within each category.VG = very good, G = good, I = intermediate, P = poor, F = fail.
Assessment of different sequencing kits
We tested two kits (Illumina FFPE TruSEQ kit and Rubicon Genomics Thruplex DNASeq) using four FFPE samples to generate sequencing libraries and found comparable results (Supplementary Fig. 1a–b). The CNA profiles obtained using DNA processed with the ILMN kit had less variance (noise) than those processed using the RGT kit however the ILMN libraries were generated using more input DNA (200-500 ng (ILMN) versus 50 ng (RGT)) and were sequenced deeper (average coverage 0.9× (ILMN) versus 0.08× (RGT)). For a more comparable evaluation, we down-sampled ILMN sequencing data to a similar read depth as RGT; this showed comparable copy number profile qualities between the two library preparation technologies.In theory, increasing the sequencing depth should improve the copy number results by reducing the variance. We examined this by increasing the sequencing depth of 23 RGT kit libraries which had less reads (from 0.08× up to 0.15×) and found improvement in the data quality in 20 out of 23 libraries (examples shown in Supplementary Fig. 2a). To examine the association between sequencing depth and variance, we down-sampled the number of reads (in steps of 1 × 106 reads) for six libraries with high read counts (up to 24 × 106 reads). We found a significant improvement in the quality of copy number plots with increasing number of reads (p < 2.2e-16; Supplementary Fig. 2b). It is interesting to note that the noise reduction levels off at approximately 7 × 106 reads suggesting that increasing the read depth >7 × 106 reads provides little benefit to variance reduction.
Performance of sWGS for copy number profiling using the RGT kit
Due to the limited amount of DNA available for most samples, we chose the RGT kit as it required less input DNA due to fewer processing steps, in particular purifications. Sequencing libraries were generated from as little as 3.8 ng of DNA, and out of 16 libraries prepared from <10 ng of DNA, only one failed, 13 generated good quality CNA plots, and 2 generated intermediate quality CNA plots. Information for all the libraries generated are summarised in Supplementary Table 3.
Recovery of under-performing RGT libraries
Eight (1.8%) libraries failed and 12 (2.7%) generated poor quality libraries out of 446 libraries. To recover some of these failed/poor samples, we prepared fresh libraries from samples with sufficient DNA (n = 6) or repeated the sequencing using three-fold more library material for samples with insufficient DNA to generate new libraries (n = 8). Thirteen of these new/re-sequenced libraries generated good quality data. The one repeat sample that failed was from the re-sequencing group. Consequently, only two out of 446 RGT libraries (taking into consideration the repeated libraries and re-sequencing) failed, resulting in a 99.5% success rate. Good sWGS data produced from 379/446 (84.9%) samples.
Association between FFPE storage time, site, and sequencing quality
The FFPE samples were collected from three different tissue banks, spanning 20 years (Table 1). The effect of storage time on the DNA extracted was analysed (Fig. 3). DNA from older FFPE blocks (>5 years) was generally of poorer quality: higher ΔCt values, shorter fragment size, generating lower yield sequencing libraries. We compared the quality metrics for each banking site and found that overall FFPE samples from different sites were comparable (Table 1 and Supplementary Fig. 3).
Fig. 3
Features of input DNA and libraries generated from blocks less and more than five years.
Dot plots represent the range (minimum-maximum) of observed values for each of the following categories and the red dot represents the median.
A. The quality of input DNA inferred by ΔCt. B. Fragment sizes of the libraries in base pair. C. The library yield in nanomoles.
Association between input DNA characteristics and sequencing library yield
We used the Illumina FFPE QC kit, a quantitative-PCR assay to estimate the quality of FFPE-extracted DNA. This assay measures the difference in Ct (cycle threshold) value of the test FFPE-extracted DNA against the Ct value of the control DNA template provided in the kit. Increasing ΔCt values indicate decreasing DNA quality with Illumina quality thresholds set at: ΔCt < 1.5 denotes high quality (HQ), ΔCt < 3.0 denotes medium quality (MQ), and ΔCt > 3 denotes low quality (LQ) DNA. The Illumina DNA-input recommendations for sWGS are 50 ng DNA with HQ DNA, 200 ng with MQ DNA, and exclusion of LQ DNA. Using the ILMN kit, we could generate good quality sWGS using 50 ng HQ and MQ DNA, and 200-500 ng of LQ DNA. Unsurprisingly, for eight samples with paired libraries generated from 50 ng and 200 or 500 ng of input DNA using the ILMN kit, we found that the sequencing library yields generated with more DNA was significantly higher than when using only 50 ng (p-value: 0.000265; Supplementary Fig. 3a). This is an important consideration if these libraries were destined for downstream target enrichment assays for mutation detection that require 500 ng of library material. Data from all the generated ILMN libraries (n = 45) showed a library yield that averaged 5.6 nM using 50 ng FFPE-extracted DNA, which was significantly less than with libraries made with more input DNA (200 ng, 23.6 nM, Welch Two Sample t-test, p = 5.24e−06; 500 ng: 23.0 nM, Welch Two Sample t-test, p = 0.0121). There was no difference in library yield when using either 200 or 500 ng of DNA (Welch Two Sample t-test, p = 0.2401). This is probably due to the quality of the input DNA as libraries produced from 200 ng of DNA had lower ΔCt values (better quality) than those using 500 ng (Welch Two Sample t-test, p = 0.0179, Supplementary Fig. 4a–c).Using the RGT kit, we found no correlation between amount of input DNA and sequencing library yield (r2 = −0.002, p = 0.81). This is probably due to the fewer library-washing steps using the RGT kit (six washing steps in the ILMN protocol versus one in RGT).Features of input DNA and libraries generated from blocks less and more than five years.Dot plots represent the range (minimum-maximum) of observed values for each of the following categories and the red dot represents the median.A. The quality of input DNA inferred by ΔCt. B. Fragment sizes of the libraries in base pair. C. The library yield in nanomoles.
Association between input DNA characteristics and sequencing library quality
Next we sought to determine if sequencing quality was influenced by the nature of the input DNA by looking at the proportion of samples from all quality (ΔCt) groups (Fig. 4a), fragment sizes (Fig. 4b) and different input groups (Fig. 4c) in each of the sequencing quality categories. Reassuringly, we found no biases in sampling that contributed to the sequencing quality. In other words, each copy number plot quality group had samples from all DNA quality (ΔCt) groups, fragment sizes and input quantity groups, suggesting that we could generate good quality libraries from most of our FFPE DNA regardless of these features.
Fig. 4
Measured standard deviations from the QDNASEQ copy number plots and associations with the quality of sequencing libraries.
A. Bar charts showing proportion of samples with different input DNA quality (based on ΔCt) in each sequencing quality group. B. Bar charts showing proportion of samples from FFPE blocks of different fragment sizes in each sequencing quality group. C. Bar charts showing proportion of samples with different amount of input DNA in each sequencing quality group.
VG = very good, G = good, I = intermediate, P = poor, F = fail.
Measured standard deviations from the QDNASEQ copy number plots and associations with the quality of sequencing libraries.A. Bar charts showing proportion of samples with different input DNA quality (based on ΔCt) in each sequencing quality group. B. Bar charts showing proportion of samples from FFPE blocks of different fragment sizes in each sequencing quality group. C. Bar charts showing proportion of samples with different amount of input DNA in each sequencing quality group.VG = very good, G = good, I = intermediate, P = poor, F = fail.Using our copy number output categorisation scoring, we examined if quality of the libraries (analysed as “all sequencing quality groups” versus “very good”) can be attributed to the different features of the input DNA and library yields (Table 2, Fig. 5). We found that the quantity of template was only significantly different in the good quality libraries. Meanwhile, the quality of input DNA was significantly different in the intermediate libraries only when compared to the “very good” libraries. Therefore, the lesser quality sequencing libraries (I, P and F) cannot be attributed simply to either quantity or quality of the template DNA. The DNA fragment sizes, which should reflect the length of the template as the DNA was sheared under similar conditions, were found to be significantly different in all groups (progressively becoming shorter) except the failures. We found that low quality DNA was associated with shorter DNA fragments, lower library yield and higher number of unmapped reads but no association with the total number of unique reads aligned (Fig. 6a–d). The recovery of most of the poor/failed libraries described previously, was achieved by either repeating the library generation or re-sequencing to generate more reads. Consequently, we suspect the poor/failed libraries could be due to a loss of DNA during the purification steps or that the Q-PCR quantification of the libraries prior to normalisation, over-estimated the library concentration resulting in inadequate amount of library being used for sequencing. This would explain why by simply increasing the quantity of libraries for sequencing and reducing the number of samples in a single pool, ensured adequate read counts and successful sequencing.
Table 2
Features of input DNA and libraries for the different categories of copy number data. Data provided in minimum-maximum range and median values in brackets.
Quality
DNA input (ng)
DNA quality (ΔCt)
Library yield (nM)
Fragment size (bp)
Observed SD
VG
3.8–100 (30.3)
−0.3-12.7 (4.4)
0.28–245.1 (16.1)
180–288 (235)
0.01–0.132 (0.097)
G
4.3–107 (35.9)⁎
1–17.1 (4.2)
0.18–139.7 (11.3)
195–324 (229)⁎
0.0789–0.144 (0.1105)⁎
I
4–59 (33.1)
−3.2-6.3 (3.5)⁎
1.06–924.8 (16.0)⁎
191–258 (231)⁎
0.0739–0.24 (0.127)⁎
P
14.7–50 (28.7)
2–11.3 (4.9)
0–55.63 (8.7)
187–243 (219)⁎
0.119–0.329 (0.171)⁎
F
4.8–51.1 (25.2)
1.9–8.2 (4.9)
0–179.6 (8.2)
204–244 (225)
0.388–0.826 (0.528)⁎
ΔCt = difference between the cycle threshold of test to the control template ACD1 provided in the kit, ng = nanogram, bp = base pairs, nM = nanomoles, SD = standard deviation.
Denotes the site where there is a significant difference to the index group (ie Cambridge).
Fig. 5
Features of the sequencing libraries.
Boxplots showing different features of input DNA and library yield relative to the different library qualities.
A. Input DNA. B. Quality of input DNA inferred from ΔCt. C. Fragment size of libraries. D. Library yield.
VG = very good, G = good, I = intermediate, P = poor, F = fail.
Fig. 6
Effect of input DNA quality.
Scatterplots showing the association between quality of input DNA with different features of the sequencing libraries.
A. Fragment size of libraries. B Library yield. C. Unmapped Reads. D. Unique aligned reads.
Features of the sequencing libraries.Boxplots showing different features of input DNA and library yield relative to the different library qualities.A. Input DNA. B. Quality of input DNA inferred from ΔCt. C. Fragment size of libraries. D. Library yield.VG = very good, G = good, I = intermediate, P = poor, F = fail.Effect of input DNA quality.Scatterplots showing the association between quality of input DNA with different features of the sequencing libraries.A. Fragment size of libraries. B Library yield. C. Unmapped Reads. D. Unique aligned reads.Features of input DNA and libraries for the different categories of copy number data. Data provided in minimum-maximum range and median values in brackets.ΔCt = difference between the cycle threshold of test to the control template ACD1 provided in the kit, ng = nanogram, bp = base pairs, nM = nanomoles, SD = standard deviation.Denotes the site where there is a significant difference to the index group (ie Cambridge).
Discussion
In this study, we have looked at the effect that quantity and quality of DNA from FFPE tissues has on successful sWGS library preparation for CN profiling of humanbreast cancers. Both the quantity and quality of DNA have always been an important consideration for sample selection and in deciding which genomic application to use. For example, microarrays require 100 ng–2.5 μg of DNA depending on the resolution of the arrays whereas PCR based methods require only 10 ng of DNA. In our hands, we have not had much success in obtaining CN data with DNA extracted from FFPE DNA using microarrays, especially when the extracted DNAs are more fragmented and of lower quality (judging from absorbance ratios of 260 nm to 280 nm and multiplex PCR for quality control).Here we have robustly shown that we can generate CN data from virtually all archived FFPE samples using sWGS. We show good CN profiling data irrespective of the quality of input DNA, as inferred by whether it can be amplified with Q-PCR (ΔCt). Previous work has extensively tested the utility of FFPE DNA for mutation analysis (Astolfi et al., 2015; De Paoli-Iseppi et al., 2016; Dumur et al., 2015; Holley et al., 2012; Kerick et al., 2011) but to date no comprehensive study has shown its use for CN profiling. Since many humancancer types, including breast and ovarian cancers, are driven mostly by CNA (C-class) rather than point mutations or indels (M-class), we believe more effort should be focussed on characterizing the copy number landscapes of these cancers (Ciriello et al., 2013). We found sWGS to be very robust in generating these CN profiles, independently of the kits used, quantity and quality of DNA. sWGS is also significantly cheaper (~50%) than microarray-based methods (Supplementary Table 2).Another advantage of generating sWGS libraries is the ability to use the same library for targeted sequence enrichment to identify mutations. There have been other methods reported for CN profiling using DNA extracted from FFPE samples but these methods do not generate sequencing libraries that can then be used for target enrichment and sequencing (Hughesman et al., 2016) or if they do, are expensive (Singh et al., 2016). In addition, sWGS will also serve as a quality control for the libraries, given its relative low cost when compared to that of generating targeted sequencing libraries. Only libraries that generate good CN profiles should be used for target enrichment and mutation detection (Kinde et al., 2012). Whilst we haven't performed target enrichment on our FFPE libraries, we expect the performance of these FFPE libraries for mutation analysis to be similar to that of published data, including known artefacts caused by formalin-based fixation effects on the DNA template (De Paoli-Iseppi et al., 2016; Holley et al., 2012; Carrick et al., 2015; Fassunke et al., 2015; Hadd et al., 2013).
Conclusions
We have shown that sWGS is a robust and cost-effective method for obtaining good quality CN data from FFPE cancer samples, irrespective of the DNA quality and quantity used. In the case of breast cancer, CN profiles can be used to stratify breast cancers into one of the 10 Integrative Clusters (Ali et al., 2014), reiterating the importance of FFPE tumor archives. The methods described here are also of relevance to other cancers, e.g. ovarian cancers where CN profiling is essential to characterise their genomic landscapes.The following are the supplementary data related to this article.
Supplementary Fig. 1
a–b Four separate libraries made with different kits - QDNASEQ copy number plots from four samples made with Rubicon Genomics Thruplex (A, B) or Illumina TruSEQ (C–F). We downsampled the Illumina TruSEQ (C, D) to achieve a more comparable results to the Rubicon Genomics Thruplex (A, B). The actual sequencing depth for Illumina TruSEQ is presented in plot E-F.
Supplementary Fig. 2a
Increasing sequencing depth improves the resolution of the copy number plots - QDNASEQ copy number plots from two samples made with Rubicon Genomics Thruplex with increasing read depth. Sample on the left has a ΔCt of 4.79 and on the right has a ΔCt of 4.31.
Supplementary Fig. 2b
Increasing sequencing depth improves the resolution of the copy number plots – For six libraries, the number of reads were down-sampled stepwise at 1 × 106 reads and plotted against the measured standard deviation (from QDNASEQ plots).
Supplementary Fig. 3
Comparing the input DNA and libraries generated from different sites.Dot plots represent the range (minimum-maximum) observed values for each category and the red dot (•) represents the median. Lines represent comparison between sites with associated statistical p-value.A. The quality of input DNA inferred by ΔCt.B. Fragment sizes of the libraries in base pair. C. The library yield in nanomoles.
Supplementary Fig. 4
Library yield and its association with amount of input DNA using the Illumina TruSEQ kit.A. For eight samples, libraries were generated with 50 ng and either 200 or 500 ng of input DNA. Each coloured line represents paired libraries drawn connecting the lower and higher input DNA for each sample. B. Dot plots represent the range of library yield when generated with different amount of input DNA and the red dot (•) represent median. Red line represents statistical test between the groups spanning the line with p-value. C. Dot plot represent range of input DNA quality inferred by ΔCt for the samples from the different groups of input DNA and the red dot (•) represent median. Red line represents statistical test between the groups spanning the line with p-value.
Supplementary Table 1
Features of libraries generated using Illumina TruSEQ or Rubicon Genomics Thruplex kits. ΔCt = difference between the cycle threshold of test to the control template ACD1 provided in the kit. ng = nanogram, PCR = polymerase chain reaction, bp = base pairs, nM = nanomoles. Supplementary Table 2 Comparing the DNA input requirement and cost for different commercial platforms available for copy number profiling. sWGS – shallow whole genome sequencing. Supplementary Table 3 Description of the libraries generated using the Rubicon Genomics Thruplex kit. Data given in range and median in brackets. ΔCt = difference between the cycle threshold of test to the control template ACD1 provided in the kit. ng = nanogram, bp = base pairs, nM = nanomoles.
Authors: Ricardo De Paoli-Iseppi; Peter A Johansson; Alexander M Menzies; Kerith-Rae Dias; Gulietta M Pupo; Hojabr Kakavand; James S Wilmott; Graham J Mann; Nicholas K Hayward; Marcel E Dinger; Georgina V Long; Richard A Scolyer Journal: Pathology Date: 2016-03-09 Impact factor: 5.306
Authors: D Pinkel; R Segraves; D Sudar; S Clark; I Poole; D Kowbel; C Collins; W L Kuo; C Chen; Y Zhai; S H Dairkee; B M Ljung; J W Gray; D G Albertson Journal: Nat Genet Date: 1998-10 Impact factor: 38.330
Authors: Martin Kerick; Melanie Isau; Bernd Timmermann; Holger Sültmann; Ralf Herwig; Sylvia Krobitsch; Georg Schaefer; Irmgard Verdorfer; Georg Bartsch; Helmut Klocker; Hans Lehrach; Michal R Schweiger Journal: BMC Med Genomics Date: 2011-09-29 Impact factor: 3.063
Authors: E H van Beers; S A Joosse; M J Ligtenberg; R Fles; F B L Hogervorst; S Verhoef; P M Nederlof Journal: Br J Cancer Date: 2006-01-30 Impact factor: 7.640
Authors: Ilari Scheinin; Daoud Sie; Henrik Bengtsson; Mark A van de Wiel; Adam B Olshen; Hinke F van Thuijl; Hendrik F van Essen; Paul P Eijk; François Rustenburg; Gerrit A Meijer; Jaap C Reijneveld; Pieter Wesseling; Daniel Pinkel; Donna G Albertson; Bauke Ylstra Journal: Genome Res Date: 2014-09-18 Impact factor: 9.043
Authors: Curtis B Hughesman; X J David Lu; Kelly Y P Liu; Yuqi Zhu; Catherine F Poh; Charles Haynes Journal: PLoS One Date: 2016-08-18 Impact factor: 3.240
Authors: H Raza Ali; Oscar M Rueda; Suet-Feung Chin; Christina Curtis; Mark J Dunning; Samuel Ajr Aparicio; Carlos Caldas Journal: Genome Biol Date: 2014-08-28 Impact factor: 13.583
Authors: Lennart Raman; Annelies Dheedene; Matthias De Smet; Jo Van Dorpe; Björn Menten Journal: Nucleic Acids Res Date: 2019-02-28 Impact factor: 16.971
Authors: Johannes Smolander; Sofia Khan; Kalaimathy Singaravelu; Leni Kauko; Riikka J Lund; Asta Laiho; Laura L Elo Journal: BMC Genomics Date: 2021-05-17 Impact factor: 3.969
Authors: Joan Frigola; Caterina Carbonell; Patricia Irazno; Nuria Pardo; Ana Callejo; Susana Cedres; Alex Martinez-Marti; Alejandro Navarro; Mireia Soleda; Jose Jimenez; Javier Hernandez-Losa; Ana Vivancos; Enriqueta Felip; Ramon Amat Journal: J Immunother Cancer Date: 2022-04 Impact factor: 12.469
Authors: Maxime Tarabichi; Adriana Salcedo; Amit G Deshwar; Máire Ni Leathlobhair; Jeff Wintersinger; David C Wedge; Peter Van Loo; Quaid D Morris; Paul C Boutros Journal: Nat Methods Date: 2021-01-04 Impact factor: 28.547