Alison J Coffey, Felix Kokocinski, Maria S Calafato, Carol E Scott, Priit Palta, Eleanor Drury, Christopher J Joyce, Emily M Leproust, Jen Harrow, Sarah Hunt, Anna-Elina Lehesjoki, Daniel J Turner, Tim J Hubbard, Aarno Palotie.
Abstract
Sequencing the coding regions, the exome, of the human genome is one of the major current strategies to identify low frequency and rare variants associated with human disease traits. So far, the most widely used commercial exome capture reagents have mainly targeted the consensus coding sequence (CCDS) database. We report the design of an extended set of targets for capturing the complete human exome, based on annotation from the GENCODE consortium. The extended set covers an additional 5594 genes and 10.3 Mb compared with the current CCDS-based sets. The additional regions include potential disease genes previously inaccessible to exome resequencing studies, such as 43 genes linked to ion channel activity and 70 genes linked to protein kinase activity. In total, the new GENCODE exome set developed here covers 47.9 Mb and performed well in sequence capture experiments. In the sample set used in this study, we identified over 5000 more SNP variants (24%) in the GENCODE exome target than in CCDS-based exome sequencing.
Year: 2011 PMID: 21364695 PMCID: PMC3137498 DOI: 10.1038/ejhg.2011.28
Source DB: PubMed Journal: Eur J Hum Genet ISSN: 1018-4813 Impact factor: 4.246
Comparison of the coverage of the design target between the three exome sets
| | NimbleGen Sequence Capture 2.1M Human Exome Array | Agilent SureSelect Human All Exon Kit | GENCODE exome |
| --- | --- | --- | --- |
| No. of bait regions | 197 218 | 316 000 | 406 539 |
| Genome coverage (Mb) | 34.1 | 37.6 | 47.9 |
| ECRs covered (%) | 150529 (72.7) | 164225 (79.3) | 205031 (99.0) |
| Transcripts covered (%) | 66828 (81.0) | 71279 (86.4) | 81204 (98.4) |
| Genes covered (%) | 28203 (76.5) | 30030 (81.5) | 35989 (97.7) |
Total length of bait regions including flanking regions.
Theoretical length without flanking regions.
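The 10.3 Mb of additional target quoted in the abstract can be reproduced from the genome coverage row. The short Python sketch below assumes that the 47.9 Mb column is the GENCODE exome and that the 37.6 Mb column is the CCDS-based baseline; the dictionary keys and the helper `extra_coverage` are illustrative names, not taken from the paper.

```python
# Genome coverage (Mb) copied from the table above.
# Assumption: column order is CCDS-based set A, CCDS-based set B, GENCODE exome.
genome_coverage_mb = {
    "ccds_set_a": 34.1,
    "ccds_set_b": 37.6,
    "gencode": 47.9,
}


def extra_coverage(target_mb: float, baseline_mb: float) -> tuple[float, float]:
    """Return (additional Mb, additional coverage as % of the baseline)."""
    extra = target_mb - baseline_mb
    return extra, 100.0 * extra / baseline_mb


extra_mb, extra_pct = extra_coverage(
    genome_coverage_mb["gencode"], genome_coverage_mb["ccds_set_b"]
)
print(f"Additional target: {extra_mb:.1f} Mb ({extra_pct:.1f}% over the baseline)")
```

Note that the two CCDS-based lengths are defined differently (with and without flanking regions), so the difference is only an approximate measure of new sequence.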
Figure 1. Comparison of exon and transcript coverage between oligonucleotide locations of the available exome kits and current reference gene sets (CCDS database March 2010, RefSeq genes March 2010 and GENCODE version 3c). The histogram shows the near-complete coverage by the GENCODE exome of all reference sets. Full data are given in Supplementary Table 1.
Figure 2. Coverage achieved by the GENCODE exome. (a) Cumulative fold coverage plot for HapMap samples captured with Agilent SureSelect Human All Exon Kit (CCDS), the GENCODE exome, and the regions covered by the GENCODE exome only. Similar data are presented for the clinical samples and the GENCODE exome only in Supplementary Figure 1. In all cases, the thin red vertical line marks eightfold coverage, the preferred coverage for variant calling. (b and c) Detailed view of the average sequence depth of the seven clinical samples across the entire gene region, after sequence capture, for two example genes that are unique to the GENCODE exome: (b) ABCB11 and (c) XPC. In the upper part of each panel, the positions of the baits from the GENCODE exome are given as dark grey boxes above the exon structure of each gene, shown in red (adapted from the Ensembl genome browser). The increased sequencing depth of the eighth exon of XPC results from this larger exon being covered by eight different bait sequences, whereas the other, smaller exons are covered by one or two baits.
SNP-calling results from clinical and HapMap samples using GENCODE and Agilent CCDS exome captures
| SNPs (polymorphic sites only) | 21 170 | 21 529 | 21 052 | 21 445 | 21 124 | 23 612 | 23 276 | 20 780 | 21 513 | 24 520 | 16 732 | 17 014 | 21 915 |
| dbSNP, % (version 130) | 93.7 | 93.5 | 93.7 | 93.8 | 92.1 | 93.5 | 93.5 | 96.5 | 94.1 | 94.8 | 98.0 | 95.2 | 95.5 |
| dbSNP/1000G, % (26/03/10 pilot 1) | 96.3 | 95.1 | 96.3 | 96.1 | 94.7 | 96.1 | 96.1 | 97.7 | 97.2 | 97.3 | 99.0 | 97.9 | 98.1 |
| Hets | 12 604 | 13 241 | 12 938 | 13 153 | 13 297 | 14 476 | 14 321 | 12 675 | 13 588 | 16 121 | 10 094 | 10 583 | 14 414 |
| Ti/Tv | 3.029 | 2.996 | 3.036 | 3.025 | 2.930 | 3.021 | 3.120 | 3.069 | 3.112 | 3.138 | 3.235 | 3.258 | 3.322 |
| Concordant | 4638/4648 | 4706/4716 | 4649/4654 | 4569/4582 | 4639/4652 | 5204/5213 | 10 491/10 564 | 10 862/10 904 | 10 720/10 810 | 8850/8909 | 9057/9088 | 9795/9877 | |
| Concordant % | 99.78 | 99.79 | 99.89 | 99.72 | 99.72 | 99.83 | 99.31 | 99.61 | 99.16 | 99.34 | 99.66 | 99.17 | |
| Synonymous | 9196 | 9249 | 9072 | 9191 | 8948 | 10 220 | 10 111 | 8480 | 9207 | 10 568 | 7979 | 8133 | 10 528 |
| Synonymous/Mb (35.2 Mb) | 261.25 | 262.76 | 257.73 | 261.11 | 254.21 | 290.35 | 287.25 | 240.91 | 261.57 | 300.23 | 226.68 | 231.05 | 299.10 |
| Non-synonymous | 8608 | 8804 | 8692 | 8828 | 8696 | 9634 | 9385 | 8758 | 8703 | 9958 | 6863 | 6976 | 8918 |
| Non-synonymous/Mb (35.2 Mb) | 244.55 | 250.12 | 246.94 | 250.80 | 247.05 | 273.70 | 266.62 | 248.81 | 247.25 | 282.90 | 194.97 | 198.18 | 253.36 |
| Stop gained | 86 | 85 | 80 | 89 | 128 | 87 | 95 | 80 | 83 | 99 | 44 | 40 | 51 |
| Stop gained/Mb (35.2 Mb) | 2.44 | 2.41 | 2.27 | 2.53 | 3.64 | 2.47 | 2.70 | 2.27 | 2.36 | 2.81 | 1.25 | 1.14 | 1.45 |
| SNPs (polymorphic sites only) – within GENCODE-only ECRs | 5179 | 5212 | 5117 | 5162 | 5162 | 5414 | 5424 | 5017 | 5319 | 5887 | |||
| SNPs (polymorphic sites only) – within GENCODE-only ECRs/Mb | 691.77 | 696.17 | 683.48 | 689.50 | 689.50 | 723.16 | 724.49 | 670.13 | 710.47 | 786.33 | |||
Concordant SNPs were compared with Illumina 660K chip GenCall genotypes for clinical samples or HapMap3 genotypes for HapMap samples NA12878, NA07000 and NA19240.
Missing data.
Excluding flanking regions.
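The derived rows in the SNP table (counts per Mb and concordance percentages) follow from simple ratios. The Python sketch below recomputes them for the first sample column as a consistency check; the helper names `per_mb` and `concordance_pct` are illustrative, and the 35.2 Mb denominator is the one given in the row labels.

```python
TARGET_MB = 35.2  # denominator stated in the "/Mb (35.2 Mb)" row labels


def per_mb(count: int, size_mb: float = TARGET_MB) -> float:
    """Variant count normalised to the target size, rounded as in the table."""
    return round(count / size_mb, 2)


def concordance_pct(matching: int, compared: int) -> float:
    """Percentage of compared genotype calls that agree."""
    return round(100.0 * matching / compared, 2)


# First sample column of the table above.
first_column = {"synonymous": 9196, "non_synonymous": 8608, "stop_gained": 86}
for label, count in first_column.items():
    print(f"{label}: {per_mb(count)} per Mb")

print("concordance %:", concordance_pct(4638, 4648))

# Share of this sample's SNPs that fall within GENCODE-only ECRs.
gencode_only_share = 100.0 * 5179 / 21170
print(f"SNPs in GENCODE-only ECRs: {gencode_only_share:.1f}% of all SNPs")
```

For this column the recomputed values agree with the tabulated 261.25, 244.55, 2.44 and 99.78. The last figure, roughly a quarter of the sample's SNPs, is in line with the 24% quoted in the abstract, although the source does not state how that percentage was derived.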