Dale J Hedges, Dan Burges, Eric Powell, Cherylyn Almonte, Jia Huang, Stuart Young, Benjamin Boese, Mike Schmidt, Margaret A Pericak-Vance, Eden Martin, Xinmin Zhang, Timothy T Harkins, Stephan Züchner.
Abstract
Over the next few years, the efficient use of next-generation sequencing (NGS) in human genetics research will depend heavily on effective mechanisms for the selective enrichment of genomic regions of interest. Recently, comprehensive exome capture arrays have become available for targeting approximately 33 Mb, or approximately 180,000 coding exons, across the human genome. Selective genomic enrichment of the human exome offers an attractive option for new experimental designs aiming to quickly identify potential disease-associated genetic variants, especially in family-based studies. We have evaluated a 2.1 M feature human exome capture array on eight individuals from a three-generation family pedigree. We were able to cover up to 98% of the targeted bases at a long-read sequence read depth of ≥3 and 86% at a read depth of ≥10, and over 50% of all targets were covered with ≥20 reads. We identified up to 14,284 SNPs and small indels per individual exome, with up to 1,679 of these representing putative novel polymorphisms. Applying the conservative genotype calling approach HCDiff, the average rate of detection of a variant allele based on Illumina 1 M BeadChips genotypes was 95.2% at ≥10x sequence coverage. Further, we propose an advantageous genotype calling strategy for low-coverage targets that empirically determines cut-off thresholds at a given coverage depth based on existing genotype data. Application of this method was able to detect >99% of SNPs covered ≥8x. Our results offer guidance for "real-world" applications in human genetics and provide further evidence that microarray-based exome capture is an efficient and reliable method to enrich for chromosomal regions of interest in next-generation sequencing experiments.
Year: 2009 PMID: 20011588 PMCID: PMC2788131 DOI: 10.1371/journal.pone.0008232
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Studied three-generational pedigree.
Pedigree of eight individuals of European descent that was studied with exome capture arrays.
NGS run statistics for eight exomes aligning high-quality sequencing reads.
| Individual | Mapped bases, bp (% of total bases) | # mapped unique reads | Unique reads | Target base coverage | Average read length (bp) | Max read length (bp) |
|---|---|---|---|---|---|---|
| 10032 | 926,438,032 (99.79%) | 2,459,464 (99.35%) | 1,854,613 (78.02%) | 92.50% | 369 | 677 |
| 10033 | 814,175,547 (99.73%) | 2,275,083 (99.26%) | 1,570,217 (71.60%) | 91.20% | 345 | 635 |
| 10034 | 750,594,870 (99.76%) | 2,169,892 (99.33%) | 1,532,537 (73.18%) | 90.90% | 335 | 732 |
| 10035 | 1,146,776,462 (99.69%) | 3,293,074 (99.28%) | 2,457,890 (77.45%) | 93.60% | 339 | 755 |
| 10036 | 1,333,018,529 (99.71%) | 3,995,447 (99.21%) | 3,099,809 (80.26%) | 93.20% | 328 | 728 |
| 10037 | 892,370,696 (99.75%) | 2,421,459 (99.30%) | 1,875,210 (80.15%) | 92.80% | 360 | 736 |
| 10039 | 912,259,209 (99.78%) | 2,583,714 (99.29%) | 1,984,028 (79.68%) | 94.20% | 347 | 755 |
| 10082 | 670,270,644 (99.63%) | 2,197,618 (98.80%) | 1,753,660 (82.92%) | 92.20% | 299 | 835 |
# Number of.
*The minimum read length required was 50 bp.
Before alignment, all raw reads were screened for duplicate reads, which are introduced by amplification steps during library preparation on next-generation platforms.
Only the first two runs are shown for better comparison.
Figure 2. Sequence coverage of targeted exons.
The graph illustrates the cumulative coverage of targeted bases after sequencing 0.5 Gbp (red), 1 Gbp (blue), 1.5 Gbp (green), and 2 Gbp (purple). Sequencing 1 Gbp resulted in nearly 10x coverage of 50% of all targets; 2 Gbp of data increased this number to 88%. Depending on a study's goal, maximum coverage might not always be required.
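The cumulative coverage summarized in this figure can be illustrated with a minimal sketch. It assumes per-base read depths over the targeted regions are already available as a list of integers; the function name `cumulative_coverage` and the toy depths are hypothetical and are not taken from the study.

```python
from typing import Dict, Sequence


def cumulative_coverage(
    depths: Sequence[int], thresholds: Sequence[int] = (3, 10, 20)
) -> Dict[int, float]:
    """Return, for each threshold t, the fraction of targeted bases with depth >= t."""
    if not depths:
        raise ValueError("depths must not be empty")
    total = len(depths)
    return {t: sum(1 for d in depths if d >= t) / total for t in thresholds}


# Hypothetical toy input: read depth at ten targeted bases (not study data).
toy_depths = [0, 2, 3, 5, 9, 10, 12, 20, 25, 40]
coverage = cumulative_coverage(toy_depths)
for threshold, fraction in coverage.items():
    print(f">= {threshold}x: {fraction:.0%} of targeted bases")
```

The default thresholds mirror the read depths quoted in the abstract (≥3, ≥10, ≥20); any other set can be passed in.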
Genomic variants detected in eight exomes based on two 454 GS FLX runs of aligned data.
| Individual | 10032 | 10033 | 10034 | 10035 | 10036 | 10037 | 10039 | 10082 | Avg. | Range |
|---|---|---|---|---|---|---|---|---|---|---|
| Non-Synonymous | 3467 | 2687 | 2749 | 4059 | 4257 | 3363 | 3952 | 3108 | 3455 | 2687–4257 |
| indel | 49 | 38 | 34 | 69 | 73 | 41 | 65 | 56 | 53 | 34–73 |
| SNP | 3418 | 2649 | 2715 | 3990 | 4184 | 3322 | 3887 | 3052 | 3402 | 2649–4184 |
| Synonymous | 4495 | 3655 | 3597 | 5421 | 5618 | 4561 | 5446 | 4057 | 4606 | 3597–5618 |
| indel | 19 | 20 | 19 | 38 | 35 | 29 | 28 | 30 | 27 | 19–38 |
| SNP | 4476 | 3635 | 3578 | 5383 | 5583 | 4532 | 5418 | 4027 | 4579 | 3578–5583 |

| Individual | 10032 | 10033 | 10034 | 10035 | 10036 | 10037 | 10039 | 10082 | Avg. | Range |
|---|---|---|---|---|---|---|---|---|---|---|
| Non-Synonymous | 344 | 254 | 244 | 486 | 723 | 347 | 402 | 337 | 392 | 244–723 |
| indel | 49 | 44 | 31 | 118 | 296 | 48 | 58 | 115 | 95 | 31–296 |
| SNP | 295 | 210 | 213 | 368 | 427 | 299 | 344 | 222 | 297 | 210–427 |
| Synonymous | 263 | 202 | 200 | 358 | 440 | 263 | 346 | 254 | 291 | 200–440 |
| indel | 21 | 16 | 19 | 44 | 76 | 21 | 29 | 31 | 32 | 16–76 |
| SNP | 242 | 186 | 181 | 314 | 364 | 242 | 317 | 223 | 259 | 181–364 |
Figure 3. Estimated error rates.
Sensitivity of genotype calling based on HCDiff SNPs, AllDiff SNPs, and the proposed coverage-dependent genotype calling approach. A) False negative rates are based on concordance with a subset of 44,513 SNPs that overlapped with genotypes obtained with Illumina 1 M Duo BeadChips. The coverage-dependent variant calling approach, which calibrates cut-off rates according to array-based genotypes, is the most sensitive method, detecting >96% of SNPs at 5x coverage and >99% of all SNPs at ≥8x coverage. B) False positive rates. HCDiff is the most conservative algorithm and yields the smallest false positive rate, while the more relaxed dynamic genotype calling algorithm yields error rates twice as high at lower coverage.
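The false negative rate in panel A is a concordance measure against array genotypes. The sketch below shows one plausible way to compute it and is not the authors' pipeline. It assumes each overlapping site records its read coverage, whether the array genotype carries a variant allele, and whether the sequencing-based caller reported a variant. The `Site` type, its field names, and the toy values are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Site:
    coverage: int        # number of reads overlapping the site
    array_variant: bool  # array genotype carries a non-reference allele
    ngs_variant: bool    # sequencing-based caller reported a variant


def false_negative_rate(sites: Iterable[Site], min_coverage: int) -> float:
    """Fraction of array-confirmed variant sites at >= min_coverage that were not called."""
    eligible = [s for s in sites if s.array_variant and s.coverage >= min_coverage]
    if not eligible:
        raise ValueError("no array-confirmed variant sites at this coverage")
    missed = sum(1 for s in eligible if not s.ngs_variant)
    return missed / len(eligible)


# Hypothetical toy sites (not study data).
toy_sites = [
    Site(12, True, True),
    Site(10, True, True),
    Site(15, True, False),   # missed by the sequencing caller
    Site(30, True, True),
    Site(4, True, False),    # excluded when min_coverage is 10
    Site(22, False, False),  # array reference genotype: not counted here
]
fnr = false_negative_rate(toy_sites, min_coverage=10)
print(f"False negative rate at >=10x: {fnr:.2f}")
```

The detection rate (sensitivity) at a given coverage is simply one minus this value.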
Figure 4. Variant read distribution across eight exomes.
Illustration of the dynamic nature of optimal cut-off rates for calling heterozygous/homozygous variants. At lower coverage (<10x) the ideal cut-off is 88% variant reads in our data, while it is 78% at coverage ≥20x. Optimal usage of data should take advantage even of low-coverage targets. Data are based on comparison to Illumina genotyped SNPs. Green triangles: Illumina heterozygous genotypes. Blue diamonds: Illumina homozygous genotypes. NGS genotypes are placed according to their percent variant reads (y axis).
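The idea of calibrating a heterozygous/homozygous cut-off against existing array genotypes can be sketched as follows. This is an illustrative assumption rather than the published procedure: the selection criterion (fewest misclassified sites within one coverage bin), the candidate grid, the function names, and the toy observations are all hypothetical.

```python
from typing import Iterable, List, Tuple

# (percent variant reads at the site, array genotype: "het" or "hom")
Observation = Tuple[float, str]


def count_errors(observations: Iterable[Observation], cutoff: int) -> int:
    """Count sites whose call disagrees with the array genotype.

    A site is called homozygous if its percent variant reads is >= cutoff,
    otherwise heterozygous.
    """
    errors = 0
    for percent_variant, array_genotype in observations:
        call = "hom" if percent_variant >= cutoff else "het"
        if call != array_genotype:
            errors += 1
    return errors


def best_cutoff(observations: List[Observation], candidates: Iterable[int]) -> int:
    """Pick the candidate cut-off with the fewest errors (ties go to the first candidate)."""
    return min(candidates, key=lambda c: count_errors(observations, c))


# Hypothetical observations for a single low-coverage bin (not study data).
toy_low_coverage = [
    (45.0, "het"), (60.0, "het"), (80.0, "het"),
    (90.0, "hom"), (95.0, "hom"), (100.0, "hom"),
]
cutoff_low = best_cutoff(toy_low_coverage, candidates=range(50, 100))
print(f"Selected cut-off: {cutoff_low}% variant reads")
```

Running the same selection separately per coverage bin would give a coverage-dependent set of cut-offs, in the spirit of the 88% and 78% values reported in the caption.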