Janine Meienberg, Katja Zerjavic, Irene Keller, Michal Okoniewski, Andrea Patrignani, Katja Ludin, Zhenyu Xu, Beat Steinmann, Thierry Carrel, Benno Röthlisberger, Ralph Schlapbach, Rémy Bruggmann, Gabor Matyas.
Abstract
Whole exome sequencing (WES) is increasingly used in research and diagnostics. WES users expect coverage of the entire coding region of known genes as well as sufficient read depth for the covered regions. It is, however, unknown which recent WES platform is most suitable to meet these expectations. We present insights into the performance of the most recent standard exome enrichment platforms from Agilent, NimbleGen and Illumina applied to six different DNA samples by two sequencing vendors per platform. Our results suggest that both Agilent and NimbleGen overall perform better than Illumina and that the high enrichment performance of Agilent is stable among samples and between vendors, whereas NimbleGen is only able to achieve vendor- and sample-specific best exome coverage. Moreover, the recent Agilent platform overall captures more coding exons with sufficient read depth than NimbleGen and Illumina. Due to considerable gaps in effective exome coverage, however, the three platforms cannot capture all known coding exons alone or in combination, requiring improvement. Our data emphasize the importance of evaluation of updated platform versions and suggest that enrichment-free whole genome sequencing can overcome the limitations of WES in sufficiently covering coding exons, especially GC-rich regions, and in characterizing structural variants.
Year: 2015 PMID: 25820422 PMCID: PMC4477645 DOI: 10.1093/nar/gkv216
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Experimental design and characteristics of DNA samples used in this study
| Sample # | Gender | DNA source | Extraction method | Purified | WES/Vendor (a) | WGS/Vendor (coverage) |
|---|---|---|---|---|---|---|
| 44 | female | blood | Qiagen column | no | Agilent/V1, V2; NimbleGen/V1, V3; Illumina/V1, V4 | no WGS |
| 280 | female | blood | Chemagen | no | Agilent/V1, V2; NimbleGen/V1, V3; Illumina/V1, V4 | no WGS |
| 326 | female | fibroblasts | Chemagen | no | Agilent/V1, V2; NimbleGen/V1, V3; Illumina/V1, V4 | no WGS |
| 2905 | male | blood | Chemagen | yes | Agilent/V1, V2; NimbleGen/V1, V3; Illumina/V1, V4 | no WGS |
| 7344 | female | blood | Chemagen | yes | Agilent/V1, V2; NimbleGen/V1, V3; Illumina/V1, V4 | HiSeq/V3 (30×) |
| 7739 | female | saliva | Chemagen | yes | Agilent/V1, V2; NimbleGen/V1, V3; Illumina/V1, V4 | HiSeq/V1 (60×), V3 (30×), V4 (30×) and XTen/V4 (60×) |
Qiagen column, DNA extraction using Qiagen QIAamp DNA Mini Kit; Chemagen, DNA extraction using PerkinElmer Chemagic Magnetic Separation Module I; Purified, purification of the extracted DNA by re-extraction using Qiagen QIAamp DNA Mini Kit; WES Agilent, SureSelect Human All Exon kit v5+UTR; WES NimbleGen, SeqCap EZ Exome (v3) +UTR; WES Illumina, Nextera Rapid Capture Expanded Exome; V1–V4, vendors 1–4; WGS HiSeq, TruSeq DNA PCR-Free Sample Preparation Kit on a HiSeq2000/2500 system; WGS XTen, TruSeq Nano DNA Sample Preparation Kit on a HiSeq X Ten system.
(a) For all six samples.
Overview of studies evaluating exome enrichment platforms as well as summary of which of the platforms performed best for the assessed aspects
| | This study | Clark | Asan | Parla | Sulonen | Chilamakuri |
|---|---|---|---|---|---|---|
| Enrichment platforms | Agilent v5+UTR, NimbleGen v3+UTR and Illumina Nextera Expanded Exome | Agilent v3, NimbleGen v2 and Illumina TruSeq Exome | Agilent v1, NimbleGen v1 (in-solution), 2.1M array | Agilent v1 and NimbleGen v1 | Agilent v1, v3 and NimbleGen v1, v2 | Agilent v4, NimbleGen v3, Illumina TruSeq Exome and Illumina Nextera Expanded Exome |
| Sequencing platform | Illumina HiSeq 2000/2500 paired-end 100-bp reads | Illumina HiSeq 2000 paired-end 100-bp reads | Illumina HiSeq 2000 paired-end 90-bp reads | Illumina GAIIx, paired-end 76-bp reads | Illumina GAIIx, paired-end 82-bp reads | Illumina HiSeq 2000 paired-end 100-bp reads |
| DNA samples | Six samples performed by different vendors, 24 samples performed by one vendor using Agilent | One sample | One sample | Six HapMap samples (two for both platforms and four only for NimbleGen) | One sample for all platforms, 25 samples for one platform | One sample with two technical replicates per platform |
| Region for sequence variant calling | Common designed target region in RefSeq coding exons 100% covered at 20× by all platform-vendor combinations | Genome-wide | Designed target region with 200-bp flanking sequences | CCDS | Genome-wide, designed target region (individual and common), and CCDS | Designed target region (individual and common), CCDS, RefSeq (coding and UTR) and Ensembl |
| Largest designed target region | NimbleGen | Illumina | Agilent | Agilent | Agilent v2 | NimbleGen |
| Largest coding region (reference) | NimbleGen (RefSeq) | Agilent (RefSeq, Ensembl CDS) | Agilent (CCDS) | Agilent (CCDS) | Agilent v2 (CCDS) | Illumina (CCDS, RefSeq, Ensembl) |
| Best designed target enrichment efficiency | Agilent | NimbleGen | NimbleGen (array and in-solution) | NimbleGen | NimbleGen v2 | Agilent |
| Lowest off-target enrichment | Agilent and NimbleGen | NimbleGen | NimbleGen (array and in-solution) | NimbleGen | NimbleGen v1 | Agilent and NimbleGen (a) |
| Best GC-rich region enrichment | Agilent | Agilent | NimbleGen array | No data | NimbleGen v2 | Illumina Nextera |
| Highest accuracy of SNV detection (benchmark) | Agilent (Sanger sequencing, MLPA and SNP array) | Agilent (SNP array) | No clear difference among platforms (SNP array and WGS) | Agilent (HapMap and 1000 Genomes Project data) | NimbleGen v2 (SNP array) | No determination of accuracy by comparison to a benchmark (only calling of SNVs) |
(a) Estimated from provided figures, as off-target reads were reported as relative proportion of filtered reads rather than total mapped reads. CCDS, Consensus Coding Sequences.
Figure 1. Enrichment efficiency of the three updated exome enrichment platforms (Agilent, NimbleGen and Illumina) performed by four vendors (V1, V2, V3 and V4). (A) Mean number of aligned reads (as million reads), mean read depth and percentage of coverage at 20× for each designed target region as well as mean percentage of on-target reads (i.e. within designed target regions) and mean percentage of off-target reads (i.e. within regions more than ±500 bp outside the designed target regions). Note that values for aligned reads indicate the total number of mapped reads without duplicates for V1 and V2 and only uniquely mapped reads without duplicates for V3 and V4 (Supplementary Table S3). (B) Mean read depth and percentage of coverage at 20× for all and only coding exons of the RefSeq database as well as uniformity of the coverage of RefSeq coding exons calculated as the fraction of exons reaching an average read depth within ±70% of mean read depth over all coding exons (uniformity coding). (C) Mean read depth and percentage of coverage at 15 and 20× for RefSeq coding exons as well as for −50-bp and +20-bp flanking intronic regions. Given are means of all six DNA samples (n = 6); error bars indicate 95% confidence intervals. Values were calculated using the SeqMonk program (http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/) and are presented in Supplementary Tables S4–S8 and S12–S13. For complete coverage of RefSeq coding exons see Figure 3.
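The two summary metrics named in this caption, percentage of coverage at 20× and uniformity, can be illustrated with a minimal Python sketch. This is not the SeqMonk procedure used in the study. The function names and the toy depths are hypothetical, and the sketch assumes that "mean read depth over all coding exons" is the unweighted mean of the per-exon mean depths, which the caption does not specify.

```python
from statistics import mean

def percent_covered(depths, min_depth=20):
    """Percentage of bases in one region with read depth >= min_depth."""
    if not depths:
        return 0.0
    return 100.0 * sum(d >= min_depth for d in depths) / len(depths)

def uniformity(exon_mean_depths, tolerance=0.70):
    """Fraction of exons whose mean depth lies within +/- tolerance
    (70% by default, as in the caption) of the overall mean depth.
    Assumption: the overall mean is the unweighted mean of exon means."""
    overall = mean(exon_mean_depths)
    lower, upper = overall * (1 - tolerance), overall * (1 + tolerance)
    return sum(lower <= d <= upper for d in exon_mean_depths) / len(exon_mean_depths)

# Hypothetical per-base read depths for three exons (not data from the study).
exons = {
    "exon_a": [25, 30, 22, 18],
    "exon_b": [40, 42, 45, 41],
    "exon_c": [2, 3, 5, 4],
}

coverage_20x = {name: percent_covered(d) for name, d in exons.items()}
exon_means = [mean(d) for d in exons.values()]
uniformity_coding = uniformity(exon_means)
```

With these toy values, only `exon_a` has a mean depth inside the ±70% window, so one of the three exons counts towards uniformity.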
Figure 3. Complete (i.e. 100%) coverage of RefSeq coding exons. (A) Proportion of RefSeq coding exons 100% covered by each designed target region (design) and by ≥20 reads effectively produced by each vendor (vendors V1–V4). (B) Proportions of RefSeq coding exons not 100% covered at 20× (missed exons). If not otherwise indicated, data of all corresponding vendors are included. Given are means of all six DNA samples (n = 6); error bars indicate 95% confidence intervals. Values were calculated using the SeqMonk program (http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/) and are presented in Supplementary Tables S2, S8 and S10.
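The distinction between exons that are completely covered and "missed" exons can be sketched as follows. This is an illustration only: the helper names and depths are hypothetical, and the sketch assumes an exon counts as 100% covered at 20× when every one of its bases has at least 20 reads.

```python
def is_fully_covered(depths, min_depth=20):
    """True if every base of the exon has at least min_depth reads."""
    return bool(depths) and min(depths) >= min_depth

def complete_coverage(exon_depths, min_depth=20):
    """Return (percentage of exons 100% covered, sorted names of missed exons)."""
    missed = sorted(
        name for name, depths in exon_depths.items()
        if not is_fully_covered(depths, min_depth)
    )
    covered_pct = 100.0 * (len(exon_depths) - len(missed)) / len(exon_depths)
    return covered_pct, missed

# Hypothetical per-base read depths (not data from the study).
exon_depths = {
    "exon_1": [35, 28, 22, 20],  # every base >= 20: completely covered
    "exon_2": [40, 19, 33, 31],  # a single base at 19 reads: missed
    "exon_3": [60, 58, 61, 59],  # completely covered
    "exon_4": [0, 0, 4, 7],      # missed
}

covered_pct, missed = complete_coverage(exon_depths)
```

Under this definition a single under-covered base is enough to classify an exon as missed, which is why complete coverage is a stricter criterion than mean coverage at 20×.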
Figure 2. Differences among DNA samples. (A) Mean coverage of RefSeq exons (n = 233 644) at 20× (expressed in percentage of the entire exon length) for all six platform-vendor combinations and DNA samples (44, 280, 326, 2905, 7344 and 7739) derived from blood, fibroblasts or saliva. Values were obtained by using the SeqMonk program (www.bioinformatics.babraham.ac.uk/projects/seqmonk) and are presented in Supplementary Tables S6 and S12. Error bars indicate 95% confidence intervals for the arithmetic means of all corresponding exons. (B and C) Mean coverage at ≥20 reads (B) and mean read depth (C) of RefSeq exons per GC content for each DNA sample exemplified by the WES data of V2 using Agilent, demonstrating its high performance stability across samples.
Figure 4. Differences in sensitivity to GC content among all platform-vendor combinations (average of all six DNA samples). (A and B) Scatter plot showing GC content and achieved read depth of RefSeq exons (coding and UTR) for the three updated exome enrichment platforms performed by the same vendor (V1, A) and different vendors (V2–V4, B), exemplified for sample 7344 (plots of all six samples are shown in Supplementary Figures S15 and S16). (C) Mean read depth of RefSeq exons per GC content shown as means of all samples. (D) Mean 20× coverage of RefSeq exons per GC content shown as means of all samples.
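A "mean read depth per GC content" summary of the kind shown in panel C can be sketched by grouping exons into GC bins and averaging their depths. The bin width of 10 percentage points, the function names and the toy sequences below are assumptions made for illustration; the caption does not state how the study binned GC content.

```python
from collections import defaultdict
from statistics import mean

def gc_content(seq):
    """GC content of a sequence as a percentage."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)

def mean_depth_per_gc_bin(exons, bin_width=10):
    """Group (sequence, mean_depth) pairs into GC bins and average the depths.
    Keys are the lower bound of each bin; 100% GC falls into the top bin."""
    bins = defaultdict(list)
    for seq, depth in exons:
        start = int(gc_content(seq) // bin_width) * bin_width
        start = min(start, 100 - bin_width)
        bins[start].append(depth)
    return {start: mean(depths) for start, depths in sorted(bins.items())}

# Hypothetical exon sequences and mean read depths (not data from the study).
exons = [
    ("ATATATATAT", 50),
    ("ATGCATGCAT", 60),
    ("ATGCATGCTA", 40),
    ("GGCCGGCCAT", 20),
    ("GCGCGCGCGC", 10),
]

depth_by_gc = mean_depth_per_gc_bin(exons)
```

In this toy input the depth drops in the highest GC bins, which mimics the kind of GC sensitivity the figure is meant to expose.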
Figure 5. Influence of GC content on mean read depth in WGS. (A) GC content and achieved read depth of RefSeq exons (coding and UTR) exemplified by WGS of sample 7344 performed by V3 (plots of all WGS datasets are shown in Supplementary Figure S17). (B) Means of read depths of RefSeq exons per GC content. X Ten, HiSeq X Ten system.
Figure 6. Relative proportions of non-reference (mutant) alleles called in the VCF files provided by vendors (V1–V4). The analysis was restricted to shared heterozygous variants within the designed target regions of the three platforms (Agilent, NimbleGen and Illumina) located in exons completely (100%) covered at 20× by all six platform-vendor combinations. (A and B) Heterozygous SNVs (A) and indels (B) listed according to GC content of 30-bp flanking sequences (for indel lengths see Supplementary Figure S18). Shown are values of all six DNA samples. Dashed lines indicate an interval within which 95% of the relative proportions of non-reference alleles lie (calculated according to the Student's t distribution as the mean of n percentage values ± critical t-value ($t_{\mathrm{crit},\,n-1}$) × SD, using n = 8 687, $t_{\mathrm{crit}}$ = 1.960 for A and n = 51, $t_{\mathrm{crit}}$ = 2.009 for B).
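The interval described in this caption (mean ± critical t-value × SD) can be reproduced with a short Python sketch. The function names and the read counts are hypothetical, and the sketch assumes SD is the sample standard deviation (n − 1 denominator). The critical t-value is passed in as a parameter, as the caption reports it, rather than computed.

```python
from statistics import mean, stdev

def non_reference_percent(ref_reads, alt_reads):
    """Relative proportion (%) of reads supporting the non-reference allele."""
    return 100.0 * alt_reads / (ref_reads + alt_reads)

def t_interval(values, t_crit):
    """Interval mean +/- t_crit * SD.
    Assumption: SD is the sample standard deviation."""
    centre = mean(values)
    half_width = t_crit * stdev(values)
    return centre - half_width, centre + half_width

# Hypothetical (ref, alt) read counts for five heterozygous variants
# (not data from the study).
read_counts = [(55, 45), (50, 50), (45, 55), (52, 48), (48, 52)]
percents = [non_reference_percent(ref, alt) for ref, alt in read_counts]

# Two-sided 95% critical t-value for n = 5 (4 degrees of freedom) is 2.776.
low, high = t_interval(percents, t_crit=2.776)
```

For the sample sizes given in the caption, the same function would be called with `t_crit=1.960` (n = 8 687) or `t_crit=2.009` (n = 51).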