Ilari Scheinin, Daoud Sie, Henrik Bengtsson, Mark A van de Wiel, Adam B Olshen, Hinke F van Thuijl, Hendrik F van Essen, Paul P Eijk, François Rustenburg, Gerrit A Meijer, Jaap C Reijneveld, Pieter Wesseling, Daniel Pinkel, Donna G Albertson, Bauke Ylstra.
Abstract
Detection of DNA copy number aberrations by shallow whole-genome sequencing (WGS) faces many challenges, including lack of completion and errors in the human reference genome, repetitive sequences, polymorphisms, variable sample quality, and biases in the sequencing procedures. Formalin-fixed paraffin-embedded (FFPE) archival material, the analysis of which is important for studies of cancer, presents particular analytical difficulties due to degradation of the DNA and frequent lack of matched reference samples. We present a robust, cost-effective WGS method for DNA copy number analysis that addresses these challenges more successfully than currently available procedures. In practice, very useful profiles can be obtained with ∼0.1× genome coverage. We improve on previous methods by first implementing a combined correction for sequence mappability and GC content, and second, by applying this procedure to sequence data from the 1000 Genomes Project in order to develop a blacklist of problematic genome regions. A small subset of these blacklisted regions was previously identified by ENCODE, but the vast majority are novel unappreciated problematic regions. Our procedures are implemented in a pipeline called QDNAseq. We have analyzed over 1000 samples, most of which were obtained from the fixed tissue archives of more than 25 institutions. We demonstrate that for most samples our sequencing and analysis procedures yield genome profiles with noise levels near the statistical limit imposed by read counting. The described procedures also provide better correction of artifacts introduced by low DNA quality than prior approaches and better copy number data than high-resolution microarrays at a substantially lower cost.
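The combined correction for GC content and mappability can be illustrated with a deliberately simplified sketch. It groups bins by their (GC, mappability) combination, takes the median read count of each group, and divides every bin by the median of its group. This is not the QDNAseq implementation: the published method fits a LOESS surface to these medians, whereas the sketch uses the raw medians directly, and the function names, the rounding to whole percentages, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def stratum_medians(counts, gc, mappability):
    """Median read count for each (GC, mappability) combination.

    GC and mappability are rounded to whole percentages to define strata
    (an assumption of this sketch).
    """
    strata = defaultdict(list)
    for c, g, m in zip(counts, gc, mappability):
        strata[(round(g), round(m))].append(c)
    return {key: median(values) for key, values in strata.items()}


def correct_counts(counts, gc, mappability):
    """Divide each bin's count by the median count of its stratum.

    A LOESS-smoothed surface would replace the raw medians in a fuller
    implementation; the smoothing step is omitted here.
    """
    medians = stratum_medians(counts, gc, mappability)
    corrected = []
    for c, g, m in zip(counts, gc, mappability):
        expected = medians[(round(g), round(m))]
        corrected.append(c / expected if expected > 0 else float("nan"))
    return corrected


# Toy data: two strata whose raw counts differ twofold purely because of
# their GC/mappability combination.
counts = [100, 110, 90, 50, 55, 45]
gc = [40, 40, 40, 60, 60, 60]
mappability = [100, 100, 100, 80, 80, 80]
corrected = correct_counts(counts, gc, mappability)
```

After correction, both strata in the toy data are centered on 1.0, so the twofold difference that came only from sequence context no longer looks like a copy number change.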
Year: 2014 PMID: 25236618 PMCID: PMC4248318 DOI: 10.1101/gr.175141.114
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Figure 1. Correction to read counts. Copy number profiles from (A) uncorrected and (C) corrected read counts; (B) median read counts per bin as a function of GC content and mappability; and (D) the corresponding LOESS fit for sample LGG150. Regions of the isobar plots that are white contain no bins with that combination of GC and mappability. In the copy number profiles, bins are ordered along the x-axis by their genomic positions, and the y-axis shows median-normalized log2-transformed data. Small triangles at the top and bottom edges represent data points that fall outside the plot area. Upper left corners show the number and size of bins. The upper right corner of the median read counts plot shows the total number of sequence reads, and the upper right corners of the copy number profiles show the expected and measured standard deviations. The expected standard deviation (E σ) is defined as 1/√N, where N is the average number of reads per bin. The measured standard deviation is calculated from the data with a mean-scaled and 0.1%-trimmed first-order estimate, prior to log2 transforming the data for plotting (see text).
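The two noise measures in this caption can be sketched as follows. The expected value is the counting limit, the reciprocal of the square root of the mean reads per bin. For the measured value, the sketch reads "first-order estimate" as an estimate based on differences between adjacent bins, which makes it insensitive to true copy number segments. That reading, the choice to trim 0.1% from each tail, and the function names are assumptions rather than details confirmed by the caption.

```python
import math
import random
from statistics import mean, stdev


def expected_sd(reads_per_bin):
    """Counting-noise limit: 1/sqrt(N), with N the mean reads per bin."""
    return 1.0 / math.sqrt(mean(reads_per_bin))


def measured_sd(reads_per_bin, trim=0.001):
    """Mean-scaled, trimmed noise estimate from adjacent-bin differences.

    Assumption: `trim` is the fraction removed from EACH tail of the
    sorted differences.
    """
    avg = mean(reads_per_bin)
    scaled = [x / avg for x in reads_per_bin]
    # Differences between neighbors cancel the local copy number level.
    diffs = sorted(b - a for a, b in zip(scaled, scaled[1:]))
    k = int(len(diffs) * trim)
    kept = diffs[k:len(diffs) - k]
    # A difference of two independent values has twice the variance.
    return stdev(kept) / math.sqrt(2)


# Toy data: a Gaussian approximation to pure counting noise at about
# 100 reads per bin (mean 100, standard deviation sqrt(100) = 10).
random.seed(0)
reads = [random.gauss(100, 10) for _ in range(5000)]
e_sigma = expected_sd(reads)
m_sigma = measured_sd(reads)
```

For data containing only counting noise, the two values should be close to each other; a measured value well above the expected one points to additional noise sources such as poor DNA quality.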
Figure 2. Blacklisting problematic regions. (A) Copy number profile for sample LGG150 with bins overlapping with the ENCODE blacklist highlighted in red, bins with mappabilities below 50 highlighted in blue, and the overlap between the two in yellow. (B) Distribution of median residuals per bin from the 1000 Genomes Project across the 38 samples. Residuals are defined as the distance between observed read counts and the fitted LOESS surface, divided by the LOESS value. The outer plot shows the entire range of values with two discrete peaks. The minor peak around −1.0 results from repetitive sequences. Reads that align equally well to multiple locations in the genome are filtered out. Repetitive sequences therefore have a lower than expected number of reads mapped. The major peak around zero contains most of the bins, and the inset shows a magnification of the peak, with the dotted vertical bars and the shaded area showing the cutoff of 4.0 standard deviations (as estimated with a robust first-order estimator) for blacklisting. (C) Copy number profile of sample LGG150 with bins in the novel blacklist based on residuals of the 1000 Genomes samples highlighted in red. (D) The final copy number profile of sample LGG150 after filtering out bins in the ENCODE and 1000G blacklists.
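The residual-based blacklisting described in this caption can be sketched in a few lines. The relative residual follows the caption's definition (observed minus fitted, divided by fitted), the per-bin summary is the median across samples, and bins are flagged when that median lies more than 4.0 standard deviations from zero. The sketch substitutes a MAD-based estimator for the caption's "robust first-order estimator", and the function names and toy values are hypothetical.

```python
from statistics import median


def relative_residuals(observed, fitted):
    """(observed - fitted) / fitted for each bin of one sample."""
    return [(o - f) / f for o, f in zip(observed, fitted)]


def median_residual_per_bin(residuals_by_sample):
    """Median across samples for each bin; all samples share one bin order."""
    return [median(values) for values in zip(*residuals_by_sample)]


def robust_sd(values):
    """MAD-based standard deviation (a stand-in for the paper's estimator)."""
    center = median(values)
    return 1.4826 * median([abs(v - center) for v in values])


def blacklist_flags(median_residuals, cutoff=4.0):
    """True for bins whose median residual exceeds cutoff * robust SD."""
    threshold = cutoff * robust_sd(median_residuals)
    return [abs(r) > threshold for r in median_residuals]


# Toy data: three samples, six bins, fitted value 100 everywhere.
# Bin 4 behaves like a repeat (almost no reads); bin 5 is inflated.
fitted = [100] * 6
samples = [
    [100, 102, 98, 101, 0, 150],
    [99, 101, 100, 103, 2, 160],
    [101, 100, 97, 102, 1, 140],
]
residuals = [relative_residuals(obs, fitted) for obs in samples]
medians = median_residual_per_bin(residuals)
flags = blacklist_flags(medians)
```

In the toy data, only the last two bins are flagged: the repeat-like bin, whose residual sits near −1.0, and the consistently inflated bin. Using the median across samples is what makes the flag reflect a property of the genomic region rather than of any one sample.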
Figure 3. Dependence of variance on sequence depth. (A) The relationship between sequence depth and variance for 15 LGGs (black), cell line BT474 (blue), 10 independent library preparations of SCC sample AB052 (yellow), and subsamplings of AB042 data (red). All individual samples are within the left half of the graph, with the subsamplings extending to the right half as well. The black line shows the linear expectation of the variance as 1/N, where N is the average number of reads per bin. Lines fitted through the AB042 subsamplings and AB052 repeats have slopes of 1.026 and 1.003, and intercepts of 0.00107 and 0.000781, respectively. (B) The relationship between sequence depth and variance for more than a thousand samples sequenced at our institute.
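The line fits reported in this caption can be illustrated with a small sketch, assuming they are ordinary least-squares fits of variance against 1/N (the caption does not state the fitting method). Under that reading, a slope near 1 means the depth-dependent part of the variance matches counting noise, and the intercept is the variance that would remain at unlimited depth. The data below are synthetic, generated from an exact line, and are not values from the paper.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y against x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic example: variance = 1.0 * (1/N) + 0.001, i.e. pure counting
# noise plus a small depth-independent term.
reads_per_bin = [25, 50, 100, 200, 400]
inv_n = [1 / n for n in reads_per_bin]
variance = [1.0 * v + 0.001 for v in inv_n]

slope, intercept = fit_line(inv_n, variance)
```

Applied to real subsampling data, the same fit would show how far a sample departs from the 1/N expectation drawn as the black line in panel A.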
Figure 4. Comparison to other methods. (A) Final copy number profile of sample LGG150 obtained with QDNAseq after removing blacklisted bins and correcting read counts for GC content and mappability. This procedure results in 166,909 bins, and highlighted in red are those 750 bins that are not contained in the output of FREEC. (B) Copy number profile of sample LGG150 obtained with an Agilent 180K microarray with 164,378 unique array elements. (C) Copy number profile of sample LGG150 obtained with FREEC with 170,474 bins. Highlighted in red are those 4,315 bins that are not contained in the output of QDNAseq. Note that many of the red bins are in focal peaks that have the potential of being called aberrations but which are probably spurious since they are contained in the QDNAseq blacklists. (D) Noise for QDNAseq versus FREEC calculated from the thousand samples in Figure 3B. Only the 166,159 bins present in the output of both algorithms were used in order to eliminate differences caused by blacklisting spurious bins.
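The bin counts in this caption are mutually consistent, which a short check makes explicit: removing the method-specific bins from either output leaves the same 166,159 shared bins. The sketch also shows one way to restrict two outputs to their common bins before comparing noise. The `common_bins` helper and the (chromosome, start) bin identifiers are hypothetical and do not reflect the output format of either tool.

```python
# Counts taken from the caption.
qdnaseq_total, qdnaseq_only = 166_909, 750
freec_total, freec_only = 170_474, 4_315

common_from_qdnaseq = qdnaseq_total - qdnaseq_only
common_from_freec = freec_total - freec_only


def common_bins(bins_a, bins_b):
    """Bins present in both outputs, in sorted order."""
    return sorted(set(bins_a) & set(bins_b))


# Toy example with hypothetical (chromosome, start) identifiers.
toy_qdnaseq = [("1", 0), ("1", 15000), ("1", 30000)]
toy_freec = [("1", 15000), ("1", 30000), ("1", 45000)]
shared = common_bins(toy_qdnaseq, toy_freec)
```

Comparing noise only on the shared bins, as in panel D, means any difference reflects the correction procedures and not which bins each method discarded.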