Abdou ElSharawy, Jason Warner, Jeff Olson, Michael Forster, Markus B Schilhabel, Darren R Link, Stefan Rose-John, Stefan Schreiber, Philip Rosenstiel, James Brayer, Andre Franke.
Abstract
BACKGROUND: Many hypothesis-driven genetic studies require the ability to comprehensively and efficiently target specific regions of the genome to detect sequence variations. Often, sample availability is limited, requiring the use of whole genome amplification (WGA). We evaluated a high-throughput microdroplet-based PCR approach in combination with next-generation sequencing (NGS) to target 384 discrete exons from 373 genes involved in cancer. In our evaluation, we compared the performance of six non-amplified gDNA samples from two HapMap family trios. Three of these samples were also preamplified by WGA and evaluated. We tested sample pooling or multiplexing strategies at different stages of the targeted NGS (T-NGS) workflow.
Year: 2012 PMID: 22994565 PMCID: PMC3534403 DOI: 10.1186/1471-2164-13-500
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Experimental Workflow and Study Design. This figure shows the tested samples (individual, pooled, non-amplified gDNA, and matched WGA) using the established T-NGS workflow (for more details see Additional file 3: Table S1). Each enriched individual sample was barcoded and sequenced under the following conditions: 1) one sample per octet, 2) samples were pooled after SOLiD library construction, pre-emulsion PCR (emPCR), and 3) samples were pooled post-emPCR. To assess the reproducibility of the established T-NGS method, equimolar amounts of the six enriched HapMap samples were also pooled before SOLiD library construction and tested in duplicate (library IDs 768_1L and 768_2L are technical replicates).
Sample throughput and sequence capacity: general sequencing and enrichment metrics
| | Reads | Mapped | On-Target | ADoC | C1 | C20 | Coverage 0.2× Mean |
|---|---|---|---|---|---|---|---|
| | 36,453,208 | 33.50% | 65.30% | 2,121 | 99.50% | 98.00% | 86.10% |
| | 42,752,716 | 36.30% | 63.20% | 2,588 | 99.60% | 98.10% | 86.40% |
| | 10,031,809 | 36.40% | 64.10% | 621 | 99.00% | 96.50% | 86.50% |
| | 5,221,822 | 31.60% | 64.20% | 280 | 98.40% | 93.50% | 85.70% |
| | 39,005,646 | 31.00% | 73.30% | 2,356 | 99.40% | 98.00% | 86.30% |
| | 43,079,257 | 17.50% | 64.40% | 1,273 | 99.40% | 97.20% | 84.80% |
| | 3,272,424 | 39.30% | 72.70% | 249 | 98.30% | 93.60% | 87.10% |
| | 6,023,113 | 32.50% | 72.90% | 380 | 98.50% | 94.60% | 85.70% |
| | 35,573,703 | 21.10% | 66.00% | 1,309 | 99.40% | 97.50% | 85.90% |
| | 43,140,968 | 30.60% | 57.80% | 2,014 | 99.60% | 98.20% | 86.50% |
| | 7,114,637 | 35.80% | 67.50% | 456 | 98.60% | 95.30% | 85.80% |
| | 6,587,601 | 34.00% | 67.60% | 401 | 98.60% | 94.90% | 85.80% |
| | 39,864,620 | 29.20% | 63.90% | 1,961 | 99.60% | 97.90% | 85.50% |
| | 638,985 | 36.40% | 66.00% | 41 | 95.40% | 63.40% | 84.60% |
| | 677,767 | 28.50% | 66.40% | 34 | 94.50% | 55.40% | 84.90% |
| | 38,654,830 | 34.10% | 68.50% | 2,404 | 99.70% | 98.10% | 85.80% |
| | Insufficient amount of material to run sample | | | | | | |
| | 5,403,560 | 32.80% | 68.10% | 320 | 98.40% | 94.10% | 85.60% |
| | 43,185,707 | 34.60% | 61.10% | 2,418 | 99.50% | 98.00% | 85.80% |
| | 9,255,333 | 36.30% | 61.30% | 545 | 98.90% | 96.00% | 85.90% |
| | 6,435,690 | 33.90% | 62.00% | 357 | 98.50% | 94.40% | 85.00% |
| | 41,913,860 | 32.40% | 67.00% | 2,406 | 99.80% | 98.30% | 84.90% |
| | 41,958,628 | 33.40% | 66.70% | 2,474 | 99.60% | 98.30% | 85.30% |
- Reads: Total sequencing reads per sample.
- Mapped: Percentage of total reads that could be aligned to the human genome (hg18/NCBI).
- On-Target: Percentage of mapped reads that align to the target regions.
- ADoC: Average depth of coverage of target base.
- C1: Percentage of target bases that are covered by at least one sequencing read.
- C20: Percentage of target bases that are covered by at least 20 sequencing reads.
- Coverage 0.2× Mean: Percentage of target bases that are covered by at least 0.2× of ADoC. Note that one barcode (BC4) was underrepresented (marked with “*” in this table; see also the Results and Discussion sections).
- gDNA: genomic DNA; WGA: whole-genome amplification; emPCR: emulsion PCR; BC: barcode.
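The coverage metrics defined in this legend can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `coverage_metrics` and the per-base depth list are hypothetical and are not taken from the study's pipeline, which is not described here.

```python
def coverage_metrics(depths, min_depth=20, mean_fraction=0.2):
    """Summarise per-base read depth over the target bases.

    depths: one read-depth value per target base (hypothetical input).
    """
    if not depths:
        raise ValueError("depths must not be empty")
    n = len(depths)
    adoc = sum(depths) / n  # ADoC: average depth of coverage

    def pct_at_least(threshold):
        # Percentage of target bases with depth >= threshold
        return 100.0 * sum(1 for d in depths if d >= threshold) / n

    return {
        "ADoC": adoc,
        "C1": pct_at_least(1),
        "C20": pct_at_least(min_depth),
        "Coverage 0.2x Mean": pct_at_least(mean_fraction * adoc),
    }


# Toy depths for ten target bases (made-up values, not study data)
depths = [0, 5, 20, 40, 60, 100, 120, 150, 200, 305]
metrics = coverage_metrics(depths)
print(metrics)
```

Because the last metric is relative to ADoC, it describes how evenly the target is covered rather than how deeply, which may explain why it stays near 85% in the table even for rows with low read counts.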
Variant detection and concordance: SNP concordance with HapMap genotypes
| 98.20% | 100.00% | 100.00% | 100.00% | 268 | 0 | 0.0% |
| | 97.80% | 100.00% | 100.00% | 100.00% | 268 | 0 | 0.0% |
| | 97.40% | 100.00% | 100.00% | 100.00% | 268 | 2 | 0.7% |
| | 96.50% | 100.00% | 100.00% | 100.00% | 268 | 5 | 1.9% |
| 99.10% | 99.10% | 99.10% | 100.00% | 138 | 0 | 0.0% |
| | 99.10% | 99.10% | 99.10% | 100.00% | 138 | 1 | 0.7% |
| | 98.30% | 99.10% | 99.10% | 100.00% | 138 | 3 | 2.2% |
| | 97.40% | 99.10% | 99.10% | 100.00% | 138 | 3 | 2.2% |
| 99.10% | 99.10% | 98.90% | 100.00% | 270 | 2 | 0.7% |
| | 98.70% | 99.60% | 99.50% | 100.00% | 270 | 3 | 1.1% |
| | 98.30% | 99.60% | 99.50% | 100.00% | 270 | 2 | 0.7% |
| 97.90% | 100.00% | 100.00% | 100.00% | 274 | 2 | 0.7% |
| | 91.0%* | 99.50% | 99.50% | 100.00% | 274 | 43 | 15.7% |
| | 86.3%* | 100.00% | 100.00% | 100.00% | 274 | 55 | 20.1% |
| 99.10% | 99.10% | 100.00% | 93.90% | 272 | 0 | 0.0% |
| | Insufficient amount of material to run sample |
| | 98.70% | 99.10% | 100.00% | 94.10% | 272 | 9 | 3.3% |
| 99.60% | 99.60% | 100.00% | 97.40% | 273 | 1 | 0.4% |
| | 98.70% | 99.10% | 99.50% | 97.30% | 273 | 3 | 1.1% |
| 99.10% | 99.60% | 100.00% | 97.40% | 273 | 4 | 1.5% |
- Detection: Percentage of SNPs that are covered by at least five sequencing reads.
- Concordance: Percentage of detected SNPs that matched the HapMap (HapMap Public Release #27; merged II + III) genotype. Because barcode 4 (BC4) was underrepresented in pools 770 and 792, the detection and concordance rates for these pools were lower (marked with * in this table).
- gDNA: genomic DNA; WGA: whole-genome amplification; emPCR: emulsion PCR; BC: barcode.
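As an illustration of the two definitions in this legend, the sketch below computes detection (SNPs covered by at least five reads) and concordance (detected SNPs whose genotype matches the reference genotype). The function name, the data layout, and the SNP identifiers are assumptions made for the example and are not part of the original analysis.

```python
def detection_and_concordance(calls, reference, min_reads=5):
    """Return (detection %, concordance %) for one sample.

    calls:     {snp_id: (read_depth, called_genotype)}  (hypothetical layout)
    reference: {snp_id: reference_genotype}, e.g. HapMap genotypes
    """
    if not reference:
        raise ValueError("reference must not be empty")
    # A SNP counts as detected if it is covered by at least min_reads reads
    detected = [s for s in reference if s in calls and calls[s][0] >= min_reads]
    # Genotypes are compared without regard to allele order ("AG" == "GA")
    concordant = [
        s for s in detected if sorted(calls[s][1]) == sorted(reference[s])
    ]
    detection = 100.0 * len(detected) / len(reference)
    concordance = (
        100.0 * len(concordant) / len(detected) if detected else float("nan")
    )
    return detection, concordance


# Made-up example with four SNPs (identifiers are placeholders)
reference = {"snp1": "GA", "snp2": "CC", "snp3": "TT", "snp4": "AG"}
calls = {
    "snp1": (30, "AG"),  # detected, concordant
    "snp2": (12, "CC"),  # detected, concordant
    "snp3": (3, "TT"),   # below the five-read threshold: not detected
    "snp4": (8, "AA"),   # detected, discordant
}
detection, concordance = detection_and_concordance(calls, reference)
print(f"detection={detection:.1f}%, concordance={concordance:.1f}%")
```

Note that concordance is computed over detected SNPs only, so a sample with low coverage can show reduced detection while its concordance remains high.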
Efficiency and SNP detection rates of non-barcoded and pooled samples
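| | Positive control SNPs | Positive control SNPs found | Total SNPs detected | Sensitivity | False Discovery Rate |
|---|---|---|---|---|---|
| | 244 | 226 | 376 | 92.6% | 39.9% |
| | 244 | 230 | 371 | 94.3% | 38.0% |
| | 244 | 212 | 277 | 86.9% | 23.5% |
| | 244 | 212 | 267 | 86.9% | 20.6% |
| | 244 | 193 | 214 | 79.1% | 9.8% |
| | 244 | 198 | 221 | 81.1% | 10.4% |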
- Minimum read count for SNP call: Minimum number of non-reference allele counts required for a SNP to be considered detected.
- Positive control SNPs: Positive control SNPs generated from the non-pooled, non-barcoded data (759–764). Because the HapMap genotyping data were incomplete, even for known SNPs, we attempted to create a positive control set of SNPs within the targeted regions. If a SNP was detected within samples 759–764, a combined genotype was determined for that SNP position. For example, if position X had a “CG” genotype in sample 759 and the reference genotype “CC” in samples 760–764, the predicted allele frequency would be 8.3% (1 in 12). In the non-pooled samples, a SNP with a non-reference allele frequency of 10–90% was considered a heterozygote. A homozygous SNP in non-pooled samples was defined as having >90% non-reference allele frequency. The number in this column represents the total number of SNPs that have a non-reference allele within a given pooled sample. Note that these positive control SNPs include HapMap samples with rs IDs, non-HapMap samples with rs IDs, and potentially novel SNPs.
- Positive control SNPs found: This number represents the number of positive control SNPs that were detected in a given pool with a given set of parameters.
- Total SNPs detected: This number represents the total number of SNPs found in a given pool with a given set of parameters. It comprises the “positive control SNPs found” plus other SNPs. Most of these other SNPs are assumed to be false positives, since their number decreases significantly as the stringency of the SNP detection parameters increases. However, some novel SNPs could exist in this set.
- Sensitivity: In this case, this is simply the percentage of positive control SNPs found in a given pool with a given set of parameters. Sensitivity decreases as SNP detection stringency increases.
- False Discovery Rate: This was defined as (Total SNPs detected − Positive control SNPs found) / Total SNPs detected × 100.
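The sensitivity and false discovery rate definitions in this legend can be checked against the table with a short calculation. The following is a minimal sketch: the function names are illustrative only, and the counts (244 positive control SNPs, 226 found, 376 total SNPs detected) are taken from the first row of the table.

```python
def sensitivity(positive_controls_found, positive_controls):
    """Percentage of positive control SNPs found in a pool."""
    return 100.0 * positive_controls_found / positive_controls


def false_discovery_rate(total_snps_detected, positive_controls_found):
    """(total SNPs detected - positive control SNPs found) / total * 100."""
    return (
        100.0
        * (total_snps_detected - positive_controls_found)
        / total_snps_detected
    )


# Counts from the first table row
sens = sensitivity(226, 244)
fdr = false_discovery_rate(376, 226)
print(f"sensitivity={sens:.1f}%, FDR={fdr:.1f}%")  # 92.6% and 39.9%
```

Because any detected SNP outside the positive control set counts as a false discovery, a truly novel SNP would inflate this rate; the value is therefore best read as an upper estimate.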
Figure 2 (a and b). Actual Versus Predicted Allele Frequencies in Pooled Samples. These two figures show the correlation between predicted (X-axis) and observed (Y-axis) non-reference allele frequencies in the two gDNA samples pooled before SOLiD library preparation, namely sample/library 768_1L (replicate 1; Figure 2a) and 768_2L (replicate 2; Figure 2b). For each positive control SNP, a genotype was inferred from the non-pooled sequencing samples. Composite SNP allele frequencies were calculated for each pool and compared to the actual SNP allele frequencies. R² values represent the square of the Pearson coefficient and reveal good correlation between predicted and observed frequencies (0.9045 and 0.9261 for samples 768_1L and 768_2L, respectively).
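The comparison described in this caption can be sketched as follows: a predicted (composite) non-reference allele frequency is derived from the individual genotypes in a pool, and R² is taken as the square of the Pearson coefficient between predicted and observed frequencies. The function names and the observed values below are hypothetical and serve only to illustrate the calculation, assuming diploid genotypes and equimolar pooling.

```python
def predicted_pool_frequency(genotypes, reference_allele):
    """Predicted non-reference allele frequency (%) in an equimolar pool.

    genotypes: one two-letter diploid genotype per pooled sample, e.g. "CG".
    """
    alleles = "".join(genotypes)
    non_ref = sum(1 for allele in alleles if allele != reference_allele)
    return 100.0 * non_ref / len(alleles)


def pearson_r_squared(xs, ys):
    """Square of the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return (sxy * sxy) / (sxx * syy)


# One heterozygote among six samples: 1 non-reference allele in 12
predicted = predicted_pool_frequency(["CG", "CC", "CC", "CC", "CC", "CC"], "C")
print(f"predicted frequency: {predicted:.1f}%")  # 8.3%

# Hypothetical predicted vs. observed frequencies (%) for four SNPs
predicted_freqs = [8.3, 25.0, 50.0, 75.0]
observed_freqs = [10.0, 22.0, 55.0, 70.0]
print(f"R^2 = {pearson_r_squared(predicted_freqs, observed_freqs):.4f}")
```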
Figure 3. Reproducibility of target enrichment: Library 768_1L vs. Library 768_2L. Figure 3 shows the comparison of non-reference allele frequency within the two technical replicates (768_1L and 768_2L). R² represents the square of the Pearson coefficient and reveals high consistency (R² = 0.9632), and hence reproducibility, between the two replicates.
Figure 4. Pooling of non-barcoded samples to detect rare variants. Figure 4 shows the ability to detect minor alleles represented in the two technical replicates (samples pooled before library preparation; sample IDs 768_1L and 768_2L), based on the known reference genotypes for each of the individual samples within each pool. In each case, the predicted frequency (blue boxplots) and the actual frequency (red boxplots) were in agreement.