Megan L Grove, Bing Yu, Barbara J Cochran, Talin Haritunians, Joshua C Bis, Kent D Taylor, Mark Hansen, Ingrid B Borecki, L Adrienne Cupples, Myriam Fornage, Vilmundur Gudnason, Tamara B Harris, Sekar Kathiresan, Robert Kraaij, Lenore J Launer, Daniel Levy, Yongmei Liu, Thomas Mosley, Gina M Peloso, Bruce M Psaty, Stephen S Rich, Fernando Rivadeneira, David S Siscovick, Albert V Smith, Andre Uitterlinden, Cornelia M van Duijn, James G Wilson, Christopher J O'Donnell, Jerome I Rotter, Eric Boerwinkle.
Abstract
Genotyping arrays are a cost-effective approach for typing previously identified genetic polymorphisms in large numbers of samples. One limitation of genotyping arrays with rare variants (e.g., minor allele frequency [MAF] <0.01) is the difficulty that automated clustering algorithms have in accurately detecting these variants and assigning genotype calls. Combining intensity data from large numbers of samples may increase the ability to accurately call the genotypes of rare variants. Approximately 62,000 ethnically diverse samples from eleven Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium cohorts were genotyped with the Illumina HumanExome BeadChip across seven genotyping centers. The raw data files for the samples were assembled into a single project for joint calling. To assess the quality of the joint calling, concordance of genotypes was analyzed in a subset of individuals with both exome chip and exome sequence data. After excluding low-performing SNPs on the exome chip and SNPs not present in the sequence data, genotypes of 185,119 variants (of which 11,356 were monomorphic) were compared in 530 individuals who had whole exome sequence data. A total of 98,113,070 pairs of genotypes were tested: 99.77% were concordant, 0.14% had missing data, and 0.09% were discordant. We report that joint calling enables accurate genotyping of rare variation using array technology when large sample sizes are available and best practices are followed. The cluster file from this experiment is available at www.chargeconsortium.com/main/exomechip.
Year: 2013 PMID: 23874508 PMCID: PMC3709915 DOI: 10.1371/journal.pone.0068095
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Exome chip minor allele frequency distribution by race.
| MAF Interval | African Americans (n = 13,375) (%) | Caucasians (n = 40,102) (%) | Hispanics (n = 2,128) (%) | Asians (n = 776) (%) | All (n = 56,407) (%) |
| 0 | 23.6 | 16.5 | 43.6 | 77.5 | 4.5 |
| (0, 0.001] | 36.8 | 58.1 | 22.3 | 3.9 | 58.8 |
| (0.001, 0.005] | 14.6 | 8.6 | 13.9 | 3.9 | 15.3 |
| (0.005, 0.01] | 4.4 | 2.3 | 3.7 | 1.6 | 4.3 |
| (0.01, 0.05] | 7.9 | 3.8 | 5.1 | 3.0 | 5.7 |
| (0.05, 0.1] | 2.7 | 1.7 | 1.7 | 1.4 | 1.8 |
| (0.1, 0.2] | 3.0 | 2.4 | 2.4 | 2.1 | 2.4 |
| (0.2, 0.5] | 7.1 | 6.7 | 7.3 | 6.6 | 7.3 |
The following samples were excluded: all AGES individuals, individuals whose race was unknown or not reported, known replicates, HapMap controls, individuals with p10GC <0.38, and individuals with call rate <0.97. Individuals with race designated as other (n = 26) were included in the overall MAF calculation, but their data are not shown separately. A total of 238,065 variants were used for calculating minor allele frequencies after excluding those that failed laboratory quality control (n = 8,994) and duplicates (n = 811).
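The binning used in the table above can be illustrated with a minimal sketch. It assumes that MAF is computed from called genotype counts only and that the intervals are right-closed, as the `(lower, upper]` notation suggests; the function names `minor_allele_frequency` and `maf_interval` are hypothetical and are not taken from the study's pipeline.

```python
from bisect import bisect_left

# Upper edges of the right-closed intervals (lower, upper] listed in the table.
UPPER_EDGES = [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.5]
LABELS = [
    "(0, 0.001]", "(0.001, 0.005]", "(0.005, 0.01]", "(0.01, 0.05]",
    "(0.05, 0.1]", "(0.1, 0.2]", "(0.2, 0.5]",
]


def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Return the MAF from called genotype counts (missing calls are not counted)."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    if total_alleles == 0:
        raise ValueError("no called genotypes")
    freq_b = (n_ab + 2 * n_bb) / total_alleles
    # The minor allele is whichever allele is less frequent.
    return min(freq_b, 1.0 - freq_b)


def maf_interval(maf):
    """Map a MAF in [0, 0.5] to the interval label used in the table."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie in [0, 0.5]")
    if maf == 0.0:
        return "0"  # monomorphic in this sample set
    # bisect_left places a value equal to an edge in the interval it closes.
    return LABELS[bisect_left(UPPER_EDGES, maf)]
```

For example, a variant with a single heterozygote among 500 called individuals has MAF 0.001 and falls in the `(0, 0.001]` interval, the largest class in the combined sample.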
Results of missing data, genotype discordance, uncertainty coefficients and frequencies of exome chip data ascertained by three calling methods and compared to exome sequence genotypes.
| Dataset | Exome Sequence Genotype | Exome Chip AA | Exome Chip AB | Exome Chip BB | Exome Chip XX | Total | Missing (%) | Discordance (%) | Uncertainty Coefficient |
| I | AA | 94,878,501 | 33,679 | 4,467 | 183,952 | 95,100,599 | | | |
| I | AB | 41,395 | 2,350,644 | 4,777 | 15,967 | 2,412,783 | | | |
| I | BB | 3,658 | 4,642 | 495,626 | 1,809 | 505,735 | | | |
| I | XX | 89,104 | 2,905 | 711 | 1,233 | 93,953 | | | |
| I | Total | 95,012,658 | 2,391,870 | 505,581 | 202,961 | 98,113,070 | 0.30 | 0.09 | 0.864 |
| Z | AA | 94,964,611 | 115,849 | 4,394 | 15,745 | 95,100,599 | | | |
| Z | AB | 41,864 | 2,365,462 | 5,137 | 320 | 2,412,783 | | | |
| Z | BB | 3,557 | 5,635 | 496,351 | 192 | 505,735 | | | |
| Z | XX | 89,480 | 2,996 | 714 | 763 | 93,953 | | | |
| Z | Total | 95,099,512 | 2,489,942 | 506,596 | 17,020 | 98,113,070 | 0.11 | 0.18 | 0.912 |
| C | AA | 95,023,653 | 33,442 | 4,430 | 39,074 | 95,100,599 | | | |
| C | AB | 41,850 | 2,363,664 | 3,969 | 3,300 | 2,412,783 | | | |
| C | BB | 3,606 | 4,646 | 496,897 | 586 | 505,735 | | | |
| C | XX | 89,391 | 2,930 | 706 | 926 | 93,953 | | | |
| C | Total | 95,158,500 | 2,404,682 | 506,002 | 43,886 | 98,113,070 | 0.14 | 0.09 | 0.934 |
A total of 185,119 variants were used for these analyses, excluding duplicated variants, short insertion/deletions, XY chromosome SNPs, Y chromosome SNPs, mitochondrial SNPs, sites not identified in the exome sequencing dataset, and failing SNPs as identified by the CHARGE best practices guidelines. Genotype classes are represented as AA = common variant homozygote, AB = heterozygote, BB = rare variant homozygote, and XX = missing data. Dataset I: exome chip genotypes called with Illumina cluster file. Dataset Z: zCall assigned genotypes to missing data in Dataset I. Dataset C: exome chip genotypes called with the CHARGE cluster file.
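The missing and discordance percentages can be recovered from the genotype counts. The sketch below uses the Dataset C counts, with exome sequence genotypes as rows and exome chip genotypes as columns. It assumes that a pair counts as missing when either platform has no call (XX) and as discordant when both calls are present but differ; this definition is inferred from the reported figures rather than stated in the source, and the function name `summarize` is hypothetical.

```python
# Rows: exome sequence genotype (AA, AB, BB, XX).
# Columns: exome chip genotype (AA, AB, BB, XX), called with the CHARGE cluster file.
DATASET_C = [
    [95_023_653, 33_442, 4_430, 39_074],
    [41_850, 2_363_664, 3_969, 3_300],
    [3_606, 4_646, 496_897, 586],
    [89_391, 2_930, 706, 926],
]


def summarize(matrix, missing_index=3):
    """Return percentages of concordant, missing, and discordant genotype pairs."""
    total = sum(sum(row) for row in matrix)
    missing = 0
    discordant = 0
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i == missing_index or j == missing_index:
                # Assumption: missing on either platform counts as missing.
                missing += count
            elif i != j:
                # Both platforms produced a call, but the calls differ.
                discordant += count
    concordant = total - missing - discordant
    return {
        "total_pairs": total,
        "concordant_pct": 100.0 * concordant / total,
        "missing_pct": 100.0 * missing / total,
        "discordant_pct": 100.0 * discordant / total,
    }


result = summarize(DATASET_C)
print({key: round(value, 2) for key, value in result.items()})
```

Under these assumptions the counts reproduce the 99.77% concordant, 0.14% missing, and 0.09% discordant figures reported in the abstract for the jointly called data.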
Figure 1. Results of CHARGE exome chip genotype calls compared to exome sequence data in 530 individuals.
Sample sizes of cohorts participating in joint calling effort by gender and self-reported race.
| Cohort | African Americans M | African Americans F | Caucasians M | Caucasians F | Hispanics M | Hispanics F | Asians M | Asians F | Other M | Other F | HapMaps M | HapMaps F | Replicates M | Replicates F | U | Total |
| AGES | 0 | 0 | 1,305 | 1,767 | 0 | 0 | 0 | 0 | 0 | 0 | 24 | 24 | 6 | 9 | 0 | 3,135 |
| ARIC | 1,121 | 1,832 | 5,198 | 5,873 | 0 | 0 | 0 | 0 | 0 | 0 | 77 | 76 | 62 | 90 | 200 | 14,529 |
| CABS | 283 | 172 | 3,701 | 1,174 | 0 | 0 | 0 | 0 | 0 | 0 | 57 | 59 | 93 | 29 | 0 | 5,568 |
| CHS | 318 | 526 | 2,008 | 2,603 | 0 | 0 | 1 | 3 | 14 | 13 | 48 | 46 | 25 | 34 | 0 | 5,639 |
| CARDIA | 900 | 1,185 | 1,063 | 1,189 | 0 | 0 | 0 | 0 | 0 | 0 | 48 | 48 | 18 | 30 | 0 | 4,481 |
| FamHS | 213 | 409 | 933 | 1,191 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 23 | 1 | 0 | 0 | 2,784 |
| FHS | 0 | 0 | 3,702 | 4,475 | 0 | 0 | 0 | 0 | 0 | 0 | 75 | 69 | 47 | 76 | 0 | 8,444 |
| HABC | 515 | 680 | 930 | 839 | 0 | 0 | 0 | 0 | 0 | 0 | 48 | 47 | 7 | 5 | 0 | 3,071 |
| JHS | 1,063 | 1,795 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 64 | 64 | 0 | 0 | 0 | 2,986 |
| MESA | 1,129 | 1,464 | 1,282 | 1,397 | 978 | 1,151 | 387 | 386 | 0 | 0 | 44 | 44 | 51 | 39 | 0 | 8,352 |
| RS | 0 | 0 | 1,459 | 1,720 | 0 | 0 | 0 | 0 | 0 | 0 | 47 | 47 | 0 | 0 | 4 | 3,277 |
Gender is unavailable for blinded replicates in the ARIC study and four RS samples.
Best practices criteria used to identify SNPs for visual inspection and manual reclustering.
| Best Practices Criteria |
| All X, Y, XY and MT variants |
| Call frequency between 0.95 and 0.99 |
| Cluster separation <0.4 |
| AB frequency >0.6 |
| AB R mean <0.2 |
| Het excess >0.1 |
| Het excess <−0.9 |
| AA theta mean between 0.2 and 0.3 |
| BB theta mean between 0.7 and 0.8 |
| AB theta mean between 0.2 and 0.3 |
| AB theta mean between 0.7 and 0.8 |
| AA theta deviation >0.025 |
| AB theta deviation ≥0.07 |
| BB theta deviation >0.025 |
| AB frequency = 0 and minor allele frequency >0 |
| AA frequency = 1 and call rate <1 |
| BB frequency = 1 and call rate <1 |
| MAF <0.0001 and call rate ≠ 1 |
| Rep error >2 |
| PPC error >1 |
| PC error >1 |
| Variants removed from v1.1 exome chip |
| Cautious sites |
AA: allele A homozygote; AB: heterozygote; BB: allele B homozygote; Het: heterozygote; MAF: minor allele frequency; MT: mitochondrial; PC: parent-child; PPC: parent-parent-child; R: normalized intensity; Rep: reproducibility.
Exome chip SNP exclusion criteria.
| Exclusion Criteria |
| Call frequency <0.95 (except Y chr) |
| Cluster separation <0.4 |
| AB frequency >0.6 |
| AB R mean <0.2 |
| Het excess >0.1 |
| Het excess <−0.9 |
| AA theta mean >0.3 |
| BB theta mean <0.7 |
| AB theta mean <0.2 or >0.8 |
| AA theta deviation >0.06 |
| AB theta deviation ≥0.07 |
| BB theta deviation >0.06 |
| Obvious batch effects |
AA: allele A homozygote; AB: heterozygote; BB: allele B homozygote; Het: heterozygote; R: normalized intensity.
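The threshold-based exclusion criteria above can be expressed as a simple rule list. The following is a minimal sketch, not the CHARGE pipeline: the dictionary keys (`call_freq`, `cluster_sep`, and so on) and the function name `exclusion_reasons` are hypothetical, and the "obvious batch effects" criterion is left out because it depends on visual inspection rather than a numeric threshold.

```python
# Each rule pairs a label from the exclusion table with a predicate on one SNP's metrics.
# The metric key names are illustrative assumptions, not GenomeStudio column names.
RULES = [
    ("Call frequency <0.95 (except Y chr)",
     lambda s: s["chrom"] != "Y" and s["call_freq"] < 0.95),
    ("Cluster separation <0.4", lambda s: s["cluster_sep"] < 0.4),
    ("AB frequency >0.6", lambda s: s["ab_freq"] > 0.6),
    ("AB R mean <0.2", lambda s: s["ab_r_mean"] < 0.2),
    ("Het excess >0.1", lambda s: s["het_excess"] > 0.1),
    ("Het excess <-0.9", lambda s: s["het_excess"] < -0.9),
    ("AA theta mean >0.3", lambda s: s["aa_theta_mean"] > 0.3),
    ("BB theta mean <0.7", lambda s: s["bb_theta_mean"] < 0.7),
    ("AB theta mean <0.2 or >0.8",
     lambda s: s["ab_theta_mean"] < 0.2 or s["ab_theta_mean"] > 0.8),
    ("AA theta deviation >0.06", lambda s: s["aa_theta_dev"] > 0.06),
    ("AB theta deviation >=0.07", lambda s: s["ab_theta_dev"] >= 0.07),
    ("BB theta deviation >0.06", lambda s: s["bb_theta_dev"] > 0.06),
]


def exclusion_reasons(snp):
    """Return the labels of every exclusion criterion that the SNP meets."""
    return [label for label, is_met in RULES if is_met(snp)]


# Hypothetical metrics for a well-clustered autosomal SNP.
example_snp = {
    "chrom": "1", "call_freq": 0.999, "cluster_sep": 0.9, "ab_freq": 0.1,
    "ab_r_mean": 0.8, "het_excess": 0.0, "aa_theta_mean": 0.05,
    "ab_theta_mean": 0.5, "bb_theta_mean": 0.95, "aa_theta_dev": 0.01,
    "ab_theta_dev": 0.03, "bb_theta_dev": 0.01,
}
print(exclusion_reasons(example_snp))
```

Returning the list of matched criteria, rather than a single pass/fail flag, keeps a record of why a variant was dropped, which is useful when a flagged SNP is later reviewed by eye.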
Exome chip content and CHARGE excluded variants by functional category.
| Category | Total Variants | Variants Excluded |
| exonic;stopgain | 5,193 | 145 |
| exonic;splicing;stopgain | 90 | 1 |
| exonic;stoploss | 239 | 2 |
| exonic;splicing;stoploss | 5 | 0 |
| splicing | 2,263 | 60 |
| exonic;splicing;synonymous | 3,363 | 74 |
| exonic;splicing | 70 | 1 |
| exonic;splicing;nonsynonymous | 5,237 | 105 |
| exonic;nonsynonymous | 208,779 | 7,369 |
| exonic;synonymous | 6,415 | 281 |
| UTR3 | 518 | 46 |
| UTR5 | 77 | 6 |
| ncRNA_splicing | 1 | 1 |
| ncRNA_exonic | 111 | 8 |
| ncRNA_UTR3 | 8 | 0 |
| ncRNA_UTR5 | 1 | 0 |
| intronic | 5,762 | 254 |
| ncRNA_intronic | 447 | 23 |
| downstream | 187 | 19 |
| upstream | 181 | 7 |
| upstream;downstream | 8 | 0 |
| intergenic | 8,549 | 528 |
| indel | 137 | 10 |
| mitochondrial | 226 | 54 |
| no annotation | 3 | 0 |
| Total | 247,870 | 8,994 |
dbNSFP was used for annotating variants [27] (see Methods).