Claudia Pommerenke1, Robert Geffers2, Boyke Bunk3, Sabin Bhuju2, Sonja Eberth3, Hans G Drexler3, Hilmar Quentmeier3.
Abstract
BACKGROUND: Whole exome sequencing (WES) has been proven to serve as a valuable basis for various applications such as variant calling and copy number variation (CNV) analyses. For those analyses the read coverage should be optimally balanced throughout protein coding regions at sufficient read depth. Unfortunately, WES is known for its uneven coverage within coding regions due to GC-rich regions or off-target enrichment.
Keywords: DNA insert size; Evenness score; Read coverage; Variant calling; Whole exome sequencing
Year: 2016 PMID: 27225215 PMCID: PMC4880973 DOI: 10.1186/s12864-016-2698-y
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 DNA shearing to 130 and 170 bp fractions before Illumina adapter ligation; sequencing base quality. a DNA insert length distribution per sample. b Peak insert lengths for the two different sample groups. c Alignment histograms for 130 bp insert samples (red) exhibited higher amplitudes of coverage within the exon than those for 170 bp samples (blue), as exemplified by the gene BMP4 in IGV. Target regions of Agilent v5 and v5+UTR are given in the last two lines. Please note the 3-fold higher maximum coverage of 130 bp samples. d High Phred score quality values for mapped paired-end reads. Base calling quality was high after trimming and mapping to the human genome. As expected, for reads in both forward and reverse direction (1–100 and 101–200 bases), read quality increased during the first 10 cycles and dropped gradually due to de-phasing errors of Illumina's sequencing pipeline. After joining paired-end reads, quality scores improved between cycles 75 and 125, as the best scores were kept while merging. Quality scores were ≥30 throughout nearly all cycles and similar between 130 and 170 bp samples
Portfolio of the samples in this study
| Sample | Cell line* | Agilent SureSelectXT | Insert length (bp) | Reads (millions) |
|---|---|---|---|---|
| HG3CD5n_cl1 | HG3 | v5+UTR | 134 | 32.4 |
| HG3CD5p_cl7 | HG3 | v5+UTR | 132 | 44.4 |
| U2932R1 | U-2932 | v5+UTR | 131 | 44.1 |
| U2932R2 | U-2932 | v5+UTR | 133 | 53.8 |
| WAC3CD5n | WA-C3CD5+ | v5+UTR | 134 | 43.3 |
| WAC3CD5p | WA-C3CD5+ | v5+UTR | 130 | 35.9 |
| HG3CD5n_cl41 | HG3 | v5 | 176 | 25.9 |
| HG3CD5n_cl48 | HG3 | v5 | 167 | 27.9 |
| HG3CD5n_mix | HG3 | v5 | 162 | 19.8 |
| HG3CD5p_mix | HG3 | v5 | 163 | 21.8 |
| NCNC | NC-NC | v5 | 171 | 22.2 |
| WAOSEL | WA-OSEL | v5 | 174 | 18.0 |
*All cell lines are held at the DSMZ
Preprocessing statistics
| Sample | Trimming R1 reads | Trimming R1 bases | Trimming R2 reads | Trimming R2 bases | Mapped reads | Joined reads |
|---|---|---|---|---|---|---|
| HG3CD5n_cl1 | 19.9 % | 9.0 | 22.4 % | 22.2 | 93.7 % | 80.5 % |
| HG3CD5p_cl7 | 20.0 % | 8.8 | 22.4 % | 22.1 | 93.6 % | 82.4 % |
| U2932R1 | 20.1 % | 8.9 | 22.1 % | 21.8 | 93.5 % | 81.9 % |
| U2932R2 | 20.0 % | 8.9 | 22.6 % | 22.8 | 93.7 % | 80.1 % |
| WAC3CD5n | 22.6 % | 8.9 | 20.0 % | 22.4 | 93.5 % | 80.6 % |
| WAC3CD5p | 20.0 % | 9.0 | 22.4 % | 21.8 | 93.7 % | 81.6 % |
| HG3CD5n_cl41 | 14.6 % | 12.8 | 12.7 % | 43.1 | 90.5 % | 38.0 % |
| HG3CD5n_cl48 | 14.5 % | 12.6 | 12.5 % | 42.7 | 90.3 % | 47.1 % |
| HG3CD5n_mix | 14.3 % | 12.5 | 12.1 % | 41.9 | 90.6 % | 50.8 % |
| HG3CD5p_mix | 13.7 % | 12.3 | 11.9 % | 42.4 | 91.0 % | 46.8 % |
| NCNC | 14.6 % | 12.7 | 12.4 % | 43.7 | 89.2 % | 43.8 % |
| WAOSEL | 13.3 % | 15.9 | 10.3 % | 42.4 | 88.0 % | 44.4 % |
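
The gap between the two sample groups in the joined-reads column is consistent with simple overlap arithmetic. The sketch below is an illustration only, not the joining procedure used in the study: it assumes paired-end reads of 100 bases each (suggested by the 1–100 and 101–200 cycle ranges in Fig. 1d) and treats every fragment as having exactly the peak insert length. The function name `expected_overlap` is hypothetical.

```python
def expected_overlap(insert_length: int, read_length: int = 100) -> int:
    """Number of bases sequenced by both mates of a read pair.

    Assumption: both mates have the same length and start at opposite
    ends of the insert. If the insert is shorter than one read, the
    overlap cannot exceed the insert itself.
    """
    overlap = 2 * read_length - insert_length
    return max(0, min(insert_length, overlap))


# Peak insert lengths of the two sample groups in this study.
for insert in (130, 170):
    print(f"{insert} bp insert: {expected_overlap(insert)} bp overlap")
```

Under these assumptions a 130 bp insert leaves a 70 bp overlap and a 170 bp insert only 30 bp. Real libraries contain a distribution of insert lengths around the peak, so a larger share of fragments in the 170 bp group would be expected to fall below whatever minimum overlap a joining tool requires.
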
Fig. 2 Target regions and relative read coverages. a Agilent SureSelectXT v5+UTR target regions (75 Mb) consisted of 68 % bases overlapping with v5 and a unique fraction of 32 %. The target region of v5 (50 Mb) was nearly fully contained in v5+UTR. b Mean coverage of unmerged and merged paired-end reads, taking into account the size of the respective target regions (75 Mb for 130 bp and 50 Mb for 170 bp samples). The average coverage was higher in 130 bp inserts than in 170 bp inserts. This difference declined substantially when overlapping paired-end reads were merged. c Recalculation of coverage of unmerged and merged paired-end reads on the respective specific target regions and on common target regions only. The mean coverage on the respective target regions was higher in 130 bp insert samples. d Portion of respective target regions covered at ≥10× depth. Despite higher coverage means for 130 bp, a smaller fraction of target regions was covered at ≥10× depth in 130 bp samples. For overlapping target regions of v5 and v5+UTR, the fraction of covered regions was still not higher in 130 bp reads, as the higher coverage means would imply
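
The distinction between mean coverage (panels b and c) and the fraction of the target covered at ≥10× (panel d) can be made concrete with a small sketch. The per-base depth lists below are toy values chosen for illustration, not data from the study, and the function names are hypothetical.

```python
from typing import Sequence


def mean_coverage(depths: Sequence[int]) -> float:
    """Average read depth over all target bases."""
    return sum(depths) / len(depths)


def fraction_covered(depths: Sequence[int], min_depth: int = 10) -> float:
    """Share of target bases with read depth of at least min_depth."""
    return sum(1 for d in depths if d >= min_depth) / len(depths)


# Toy per-base depths (assumed values, for illustration only).
peaky = [100, 100, 0, 0]   # high peaks, but half of the target is uncovered
even = [20, 20, 20, 20]    # lower peaks, uniform coverage

for name, depths in (("peaky", peaky), ("even", even)):
    print(name, mean_coverage(depths), fraction_covered(depths))
```

In this toy case the peaky profile has the higher mean (50 versus 20) yet only half of its bases reach 10×, whereas the even profile reaches 10× everywhere. This is the same pattern the caption describes for 130 bp versus 170 bp samples.
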
Fig. 3 Evenness between different insert groups and unmerged/merged sequences. Before (a) and after (b) normalisation of coverage to the fraction of respective target regions for unmerged sequences. The complete integral of normalised coverage over the target region sums to 1. c The evenness score computed from the area under the curve of unmerged (Fig. 3b) and merged sequences between 0 and 1 normalised coverage. The closer the evenness score is to 1, the better the uniformity of base coverage. The impact of higher insert length was evident; merged reads achieved the highest evenness scores, whether computed on the specific corresponding target region or on the overlapping target regions
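
One plausible reading of the evenness score described in this caption can be sketched as follows. The sketch assumes that coverage is normalised by dividing each per-base depth by the mean depth, and that the curve is the fraction of target bases reaching each normalised coverage threshold between 0 and 1. The exact normalisation used in the study may differ, and the name `evenness_score` and the `steps` parameter are hypothetical.

```python
from typing import Sequence


def evenness_score(depths: Sequence[float], steps: int = 100) -> float:
    """Area under the curve 'fraction of bases with normalised coverage >= t'
    for thresholds t from 0 to 1, integrated with the trapezoidal rule.

    Assumption: normalised coverage = per-base depth / mean depth.
    """
    mean = sum(depths) / len(depths)
    if mean == 0:
        return 0.0
    normalised = [d / mean for d in depths]

    # Fraction of target bases at or above each threshold t = 0, 1/steps, ..., 1.
    fractions = [
        sum(1 for x in normalised if x >= i / steps) / len(normalised)
        for i in range(steps + 1)
    ]

    # Trapezoidal integration over [0, 1].
    width = 1 / steps
    return sum(
        (fractions[i] + fractions[i + 1]) / 2 * width for i in range(steps)
    )


# Toy depth profiles (assumed values, for illustration only).
uniform = [30] * 10
uneven = [0, 0, 0, 0, 40]
print(evenness_score(uniform), evenness_score(uneven))
```

Under this definition a perfectly uniform profile scores 1 (up to floating-point rounding), and profiles in which many bases fall well below the mean depth score closer to 0.
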
Fig. 4 Missed mutations exemplified on four isogenic subclones. a The DNA of four isogenic subclones (human HG-3 cell line) was fragmented to 130 bp peak insert size for HG3CD5n_cl1 and HG3CD5p_cl7 and to 170 bp peak insert size for HG3CD5n_cl41 and HG3CD5n_cl48. Several mutations were missed by variant calling for samples fragmented to 130 bp, but clearly fewer for 170 bp. b Example mutation in gene OR5H15 (red arrow) with coverage depths of 6× and 2× for the 130 bp samples and 40× and 35× for the 170 bp samples. OR5H15 consists of a single 900 bp exon and does not contain any UTRs. Reads were sorted in IGV by base at the mutation site, hence all detected Ts for 130 bp samples (3 and 1, respectively) are indicated. c The specific target regions of v5+UTR and v5 in gene OR5H15, to which the coverage histogram peaks map, were identical. Target regions are given in the last two lines. Both the amplitudes and the maximum read depth in the visible region were higher for the 130 bp samples than for the 170 bp samples. At the same time, pronounced amplitudes were also evident for 170 bp within the gene region of OR5H15, implying that sequencing longer DNA fragments would have yielded even smaller amplitudes and a higher coverage across the gene region. The four subclones carried four further mutations (grey arrows) besides the mutation missed in the 130 bp samples (red arrow), indicating sequence similarity of the subclones
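
The OR5H15 example in panel b can be worked through with a minimal sketch. The depths and T counts are the ones given in the caption. The minimum depth of 10 is a hypothetical threshold borrowed from the ≥10× cut-off in Fig. 2d, not the setting of the variant caller used in the study, and both function names are illustrative.

```python
def alt_allele_fraction(alt_reads: int, depth: int) -> float:
    """Share of reads at a site that carry the alternative base."""
    return alt_reads / depth if depth else 0.0


def passes_depth_filter(depth: int, min_depth: int = 10) -> bool:
    """True if a site has enough reads to be considered for calling.

    min_depth=10 is an assumed threshold, for illustration only.
    """
    return depth >= min_depth


# Depths at the OR5H15 mutation site as reported in the caption.
depth_130bp = [6, 2]     # HG3CD5n_cl1, HG3CD5p_cl7
t_reads_130bp = [3, 1]   # detected Ts in the 130 bp samples
depth_170bp = [40, 35]   # HG3CD5n_cl41, HG3CD5n_cl48

for depth, alt in zip(depth_130bp, t_reads_130bp):
    print(depth, alt_allele_fraction(alt, depth), passes_depth_filter(depth))

for depth in depth_170bp:
    print(depth, passes_depth_filter(depth))
```

In both 130 bp samples half of the reads carry the T, so the variant signal is present, but a depth of 6 or 2 falls below the assumed threshold. The 170 bp samples pass it comfortably.
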