
Evaluation of single-nucleotide polymorphism imputation using random forests.

Daniel F Schwarz, Silke Szymczak, Andreas Ziegler, Inke R König.

Abstract

Genome-wide association studies (GWAS) have helped to reveal genetic mechanisms of complex diseases. Although commonly used genotyping technology enables us to determine up to a million single-nucleotide polymorphisms (SNPs), causative variants are typically not genotyped directly. A favored approach to increase the power of genome-wide association studies is to impute the untyped SNPs using more complete genotype data of a reference population. Random forests (RF) provides an internal method for replacing missing genotypes. A forest of classification trees is used to determine similarities of probands regarding their genotypes. These proximities are then used to impute genotypes of untyped SNPs. We evaluated this approach using genotype data of the Framingham Heart Study provided as Problem 2 for Genetic Analysis Workshop 16 and the Caucasian HapMap samples as reference population. Our results indicate that RFs are faster but less accurate than alternative approaches for imputing untyped SNPs.


Year: 2009 PMID: 20018059 PMCID: PMC2795966 DOI: 10.1186/1753-6561-3-s7-s65

Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561

Background

Recently, genome-wide association studies (GWAS) have expanded our knowledge about genomic variants that influence susceptibility to complex diseases such as myocardial infarction [1,2]. One important reason for this success is the substantial technological progress enabling the genotyping of up to a million single-nucleotide polymorphisms (SNPs) simultaneously. However, with about 15 million known SNPs in the current Build 129 of dbSNP http://www.ncbi.nlm.nih.gov/SNP and almost four million of these available in release 23a from the HapMap project http://www.hapmap.org, the coverage achieved by direct genotyping is still far from perfect. Thus, the majority of all known SNPs in the genome are evaluated only indirectly with commonly used genotyping platforms. Consequently, today's GWAS are usually not able to genotype causal variants but will detect association with a nearby SNP in high linkage disequilibrium (LD). Although this approach has proved successful in many cases, it is still likely that a great number of causal variations are yet undetected and that the power of GWAS could be increased by performing statistical tests with disease influencing SNPs directly [3]. One preferred approach to increase the power of GWAS is to combine data from several studies [4], thus increasing the sample sizes from thousands to tens of thousands. However, these meta-analyses of GWAS pose special problems, such as a limited overlap in genotyped SNPs if different platforms were used across the studies. A promising solution is to impute the respective untyped SNPs using genotype data of the performed study and data of a similar reference population that has been genotyped at additional SNPs [5]. As a result, estimated genotypes may be used to fill in the gaps in the original GWAS and to increase the overlap between different GWAS. In our work, we applied the random forests (RF) imputation approach to untyped SNPs [6]. 
We evaluated this method on genotype data of probands of the Framingham Heart Study provided as Problem 2 for Genetic Analysis Workshop (GAW) 16.

Methods

Algorithm

RF is a data-mining method that is able to produce accurate classifiers even when many variables are observed in relatively few individuals. Furthermore, it provides estimates of variable importance, generates an unbiased generalization error estimate, and includes a technique for estimating missing data. Using its classifying power, we have recently shown that a screening of SNPs by RF is suitable to detect promising candidate SNPs in GWAS for complex diseases [7]. A specific feature of RF is the ability to replace missing values through an iterative process [8]. To use this for imputation, two essential prerequisites need to be satisfied. First, each variable must have at least one non-missing value. This is not fulfilled in the presence of untyped SNPs because all of their values are missing. Adding non-missing data of a reference population is a possible approach to overcome this, as described in Algorithm 1 Step 1a. Second, RF needs a variable to classify on because RF is a supervised learning method. If the data contain only genotypes, this precondition is not met. Algorithm 1 Step 1b describes a solution that makes genotype-only data usable by enriching it with synthetic data. The original method was modified accordingly. The procedure for imputing missing SNPs comprises several steps as shown in Figure 1 and proceeds as follows:

Figure 1

Flow chart of algorithm 1. Algorithm 1 proceeds as follows: 1) enrich data; 2) mark missing and undefined genotypes; 3) roughly replace missing values; 4) grow forest; 5) calculate sample proximities; 6) update former missing values using proximities; 7) repeat Steps 4-6 several times; 8) extract imputed original data.

Algorithm 1

1) Enrichment of original data

The data is enriched in two successive steps as follows:

a) The original data is merged with a subset of HapMap [9] genotype data. This subset contains exactly the SNPs that were typed in the original data. If one also aims at imputing SNPs that were not typed at all in the original data, these SNPs need to be contained in the HapMap subset as well. The selected HapMap probands have to be independent from each other and chosen from a population similar to the probands in the original data.

b) The data is subsequently modified as follows: to begin, the original data set that contains only genotype information is considered as Class 1. A new synthetic data set with the same number of probands and SNPs is created and labelled as Class 2. This synthetic data is created by sampling at random without replacement from the univariate distributions of the original data. The sampling is separately performed for each SNP. Thus, each SNP has the same univariate distribution as the corresponding SNP in the original data. The random SNPs of Class 2 are independently distributed and contain no dependency structure [8]. The original data and the synthetic data are merged into a single data set, resulting in an artificial two-class data set that can be used by supervised learning methods. The RF is thus able to perform unsupervised learning so that phenotype data is not mandatory [8,10].
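The construction of the synthetic Class 2 in Step 1b can be illustrated with a minimal Python sketch. It is not the authors' C++ implementation. The function name `add_synthetic_class`, the list-of-lists layout (one row per proband, one genotype code per SNP), and the `seed` parameter are assumptions made for illustration, and missing genotypes are ignored here. Sampling as many values as there are probands without replacement is realised as an independent shuffle of each SNP column.

```python
import random


def add_synthetic_class(genotypes, seed=0):
    """Return (rows, labels): observed rows as Class 1, permuted rows as Class 2.

    genotypes: list of probands, each a list of genotype codes (0, 1, 2) per SNP.
    """
    rng = random.Random(seed)
    n_probands = len(genotypes)
    n_snps = len(genotypes[0])

    # Shuffle every SNP column independently. Drawing n values without
    # replacement from n observed values keeps each SNP's marginal
    # distribution but removes the dependency between SNPs.
    permuted_columns = []
    for j in range(n_snps):
        column = [row[j] for row in genotypes]
        rng.shuffle(column)
        permuted_columns.append(column)

    synthetic = [[permuted_columns[j][i] for j in range(n_snps)]
                 for i in range(n_probands)]

    rows = [list(row) for row in genotypes] + synthetic
    labels = [1] * n_probands + [2] * n_probands
    return rows, labels
```

The returned `rows` and `labels` form the artificial two-class data set on which a classification forest could then be grown.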

2) Labelling

Missing and undefined genotypes are internally marked with the label MISSING.

3) Rough imputation

Missing values of each SNP are roughly imputed by its median value. This initial crude imputation is essential because RF cannot handle missing data [8].
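The rough imputation can be sketched as follows. This is an illustrative sketch only: the marker `MISSING = None`, the function name `rough_impute`, and the use of `statistics.median_low` (which always returns an observed genotype code rather than a value such as 0.5) are assumptions, not details given in the text.

```python
from statistics import median_low

MISSING = None  # hypothetical marker for a missing or undefined genotype


def rough_impute(genotypes):
    """Replace MISSING entries of each SNP by that SNP's (low) median.

    Returns the filled matrix and the set of (proband, snp) positions that
    were missing, so that later steps know which entries to re-estimate.
    """
    n_snps = len(genotypes[0])
    filled = [list(row) for row in genotypes]
    missing_positions = set()
    for j in range(n_snps):
        observed = [row[j] for row in genotypes if row[j] is not MISSING]
        # Raises StatisticsError if a SNP has no observed value at all,
        # which is why reference genotypes are merged in beforehand.
        fill_value = median_low(observed)
        for i, row in enumerate(genotypes):
            if row[j] is MISSING:
                filled[i][j] = fill_value
                missing_positions.add((i, j))
    return filled, missing_positions
```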

4) Forest growing

A classification forest is grown. The trees are built on the new data set created in Step 1.

5) Calculate probands' proximities

An important part of the RF imputation method is the proximity matrix, which contains the pair-wise similarities between all pairs of probands. Specifically, the proximities are determined as follows: first, they are set to zero. Then, each proband is classified by all trees in the forest. For each tree, if two probands are evaluated by exactly the same series of decision rules, their proximity is increased by one. A detailed description of decision tree rules and classification is given in Breiman et al. [11].
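A minimal sketch of the proximity count is given below. It assumes that the forest has already been grown and that, for every tree, the terminal node reached by each proband is available as `leaf_ids[t][i]`; two probands that follow exactly the same series of decision rules end in the same terminal node. The function name and the input layout are hypothetical.

```python
def proximity_matrix(leaf_ids):
    """Count, for each pair of probands, the trees in which they share a leaf.

    leaf_ids[t][i] is the terminal node reached by proband i in tree t.
    """
    n = len(leaf_ids[0])
    prox = [[0] * n for _ in range(n)]  # all proximities start at zero
    for tree_leaves in leaf_ids:
        for i in range(n):
            for k in range(i + 1, n):
                if tree_leaves[i] == tree_leaves[k]:
                    prox[i][k] += 1
                    prox[k][i] += 1
    return prox
```

The diagonal is left at zero in this sketch because a proband's proximity to itself is not needed when its own missing genotype is re-estimated.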

6) Updating MISSING genotypes

Genotypes internally marked as MISSING are re-estimated. The updating is separately performed for each SNP. The new value is calculated as follows: at a specific SNP, each proband holds exactly one genotype value of the set 0, 1, or 2. This also applies to former missing genotypes that were roughly imputed or updated during a previous iteration. For each proband with a genotype marked as MISSING, the new genotype is calculated using a weighted average of the genotypes of the remaining probands. Each weight is calculated based on the proximity between the two samples as determined in Step 5.
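The update can be sketched as a proximity-weighted average over the remaining probands. The text does not say how the average is mapped back to a genotype code, so rounding to the nearest of 0, 1, or 2 is an assumption of this sketch, as are the function name `update_missing` and its arguments (a filled genotype matrix, the set of formerly missing positions, and a proximity matrix).

```python
def update_missing(filled, missing_positions, prox):
    """Re-estimate formerly missing genotypes from proximity-weighted neighbours."""
    updated = [list(row) for row in filled]
    n = len(filled)
    for (i, j) in missing_positions:
        weight_sum = 0.0
        weighted_total = 0.0
        for k in range(n):
            if k == i:
                continue  # only the remaining probands contribute
            weight_sum += prox[i][k]
            weighted_total += prox[i][k] * filled[k][j]
        if weight_sum > 0:
            # Assumption: round the weighted average half-up to a genotype code.
            average = weighted_total / weight_sum
            updated[i][j] = min(2, max(0, int(average + 0.5)))
        # With zero total proximity the current value is kept unchanged.
    return updated
```

All averages are computed from the values in `filled`, so the result does not depend on the order in which the missing positions are visited.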

7) Iterate

It is recommended to perform Steps 4-6 at most five times [8].

8) Imputed data

The resulting data set of this iterative procedure consists of HapMap data, imputed original data, and synthetic data. The imputed original data set is extracted and the imputation is finished. Algorithm 1 was implemented in C++.

Evaluation of imputation

The assessment of the imputation quality was performed as follows: 1) A subset of the Framingham Heart Study [12] genotype data containing SNP genotypes of independent probands was chosen. 2) From all available SNPs in the original data set, 10% were drawn without replacement. All genotypes of these SNPs were deleted in this data set to mimic a situation with untyped SNPs. 3) Imputation of the deleted SNPs was performed as described in Algorithm 1. 4) For each SNP, the imputed genotypes were compared with the corresponding original genotypes. The imputation accuracy of a SNP equals the number of correctly imputed genotypes divided by the number of all imputed genotypes. The result reflects the quality of the SNP's imputation. Quality and computing time of the imputation depend primarily on two parameters, namely, the number of trees in each forest and the number of iterations in Algorithm 1. To obtain the optimal trade-off between computing time and imputation quality, several RF imputation runs were performed using different parameter settings. In addition, a potential correlation of SNP imputation quality with minor allele frequency (MAF) was subsequently investigated. As a standard approach, the untyped SNPs were also imputed with the computer program IMPUTE [5], run with default parameters and option pgs. IMPUTE calculates three probabilities for each SNP genotype of a sample, one each for the homozygous rare allele, homozygous common allele, and heterozygous genotype. The most likely genotype was chosen to replace the missing value. Results of IMPUTE and our RF method were subsequently compared with regard to accuracy and computing time.
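The per-SNP accuracy and the choice of the most likely genotype from three probabilities can be written down in a few lines. This is a sketch under assumptions: the function names are hypothetical, and the order of the three probabilities is whatever the caller supplies, since `best_guess` only returns the position of the largest one.

```python
def best_guess(probabilities):
    """Return the index of the most probable of the three genotypes."""
    return max(range(len(probabilities)), key=lambda g: probabilities[g])


def snp_accuracy(true_genotypes, imputed_genotypes):
    """Number of correctly imputed genotypes divided by all imputed genotypes."""
    correct = sum(t == g for t, g in zip(true_genotypes, imputed_genotypes))
    return correct / len(true_genotypes)
```

For example, `snp_accuracy([0, 1, 2, 1], [0, 1, 1, 1])` counts three matches among four imputed genotypes.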

Data

Data from 6,752 participants in the Framingham Heart Study [12] were provided as Problem 2 for GAW16. For our analysis, only the genotypes of 762 unrelated individuals from generation 3 were selected; neither haplotype data nor LD block data were used. Standard quality control was applied to genotype data of the Affymetrix GeneChip® Human Mapping 500k Array Set (488,146 SNPs) as recommended [1,2,4,13]. SNPs with a call rate < 0.98 per study group, a MAF < 0.01 in the cases and controls combined, or a p-value < 0.0001 for deviation from Hardy-Weinberg equilibrium in the control group were excluded, resulting in 336,206 SNPs. Finally, only SNPs of chromosome 22 were selected. The resulting data contained the genotypes of 3,775 SNPs. The mean distance between two adjacent SNPs amounted to 3,488 bp. About 10% (n = 376) of the SNPs were deleted to represent untyped SNPs in a real-world data set as previously described. The mean MAF of SNPs was 0.2286, with a standard deviation of 0.1348. The minimal MAF was 0.0134 and the maximal MAF was 0.4993. The mean distance between an untyped SNP and the nearest genotyped SNP amounted to 3,916 bp. For reference HapMap data [9] used in Algorithm 1 Step 1a, genotypes of the 3,775 SNPs in 60 unrelated CEU founders were downloaded.
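The quality-control thresholds stated above can be expressed as a simple filter. The sketch below is a simplification and not the pipeline that was used: it treats all probands as one group, assumes genotypes are coded as allele counts 0, 1, or 2 with `None` for a missing call, and takes the Hardy-Weinberg p-value as an input computed elsewhere. All function and parameter names are hypothetical.

```python
def call_rate_and_maf(column, missing=None):
    """Return (call rate, minor allele frequency) for one SNP column."""
    observed = [g for g in column if g is not missing]
    if not observed:
        return 0.0, 0.0
    call_rate = len(observed) / len(column)
    # Each genotype code counts copies of one allele among two chromosomes.
    freq = sum(observed) / (2 * len(observed))
    return call_rate, min(freq, 1 - freq)


def passes_qc(column, hwe_p_value,
              min_call_rate=0.98, min_maf=0.01, min_hwe_p=0.0001):
    """Keep a SNP only if it meets all three thresholds from the text."""
    call_rate, maf = call_rate_and_maf(column)
    return (call_rate >= min_call_rate
            and maf >= min_maf
            and hwe_p_value >= min_hwe_p)
```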

Results

The optimal trade-off between computing time and imputation quality was obtained by using 300 trees and five iterations. In this setting, RF imputation required 5 minutes on a quad-core computer with a 2.33 GHz processor. The mean accuracy amounted to 62.70% with a standard deviation of 17.88%. The minimal and maximal accuracies were 34.78% and 97.32%, respectively. Imputation accuracy and MAF of a SNP were found to be strongly correlated as shown in Figure 2. Only SNPs with a small MAF showed a high imputation quality. For SNPs with a higher MAF, the accuracy decreased drastically. The accuracy of imputed SNPs with a MAF between 0.15 and 0.3 was heterogeneous. Given the MAF of a SNP, imputation accuracy was similar to the maximum of the genotype frequencies of a SNP in Hardy-Weinberg equilibrium (Figure 2).
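The reference quantity mentioned here, the maximum of the three genotype frequencies under Hardy-Weinberg equilibrium, follows directly from the MAF. A minimal sketch with hypothetical function names is shown below. With minor allele frequency p and q = 1 - p, the frequencies are q², 2pq, and p².

```python
def hwe_genotype_frequencies(maf):
    """Genotype frequencies (common homozygote, heterozygote, rare homozygote)."""
    q = 1.0 - maf
    return (q * q, 2.0 * maf * q, maf * maf)


def max_genotype_frequency(maf):
    """Largest of the three Hardy-Weinberg genotype frequencies."""
    return max(hwe_genotype_frequencies(maf))
```

At the mean MAF of 0.2286 reported for these data, the largest frequency is that of the common homozygote, 0.7714² ≈ 0.595. This is also the accuracy that would be expected from always assigning the most frequent genotype to every proband.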

Figure 2

Correlation between imputation accuracy and MAF. Blue and gray dots denote 3,775 SNPs that were imputed by RF and IMPUTE, respectively. SNPs are plotted according to accuracy and MAF. Black lines denote the three genotype frequencies of a SNP in Hardy-Weinberg equilibrium given its MAF.

IMPUTE required 20 minutes computing time on a computer with a 2.33 GHz processor. The mean accuracy was 92.62%, with a standard deviation of 10.61%. The minimal and maximal accuracy was 52.49% and 100.00%, respectively. MAF and accuracy were not found to be correlated (Figure 2).

Discussion and conclusion

The RF imputation procedure consumes an acceptable amount of computing time and imputes considerably faster than the alternative standard approach. An imputation of a full GWAS SNP data set might be feasible even on slow computers. However, this advantage is accompanied by a lower quality of imputation compared to IMPUTE. Obviously, the imputation quality of a SNP strongly depends on its MAF. Further theoretical study is needed to investigate whether the expected imputation accuracy of a SNP can be roughly estimated by calculating the maximum of its genotype frequencies. To conclude, we presented an approach of imputing untyped SNPs using RF. The procedure is computationally feasible. However, for a highly accurate imputation of untyped SNPs, alternative methods may be more appropriate.

List of abbreviations used

GAW: Genetic Analysis Workshop; GWAS: Genome-wide association study; LD: Linkage disequilibrium; MAF: Minor allele frequency; RF: Random forests; SNP: Single-nucleotide polymorphism.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

DFS carried out the analysis, the programming of all C++/C/R program code, and drafted the manuscript. SS merged HapMap data and Framingham Heart Study data and drafted the manuscript. IRK participated in the design and coordination of the study. AZ conceived of the study and finalized the manuscript.

9 in total

1. A new multipoint method for genome-wide association studies by imputation of genotypes.

Authors: Jonathan Marchini; Bryan Howie; Simon Myers; Gil McVean; Peter Donnelly
Journal: Nat Genet Date: 2007-06-17 Impact factor: 38.330

2. Conjuring SNPs to detect associations.

Authors: Andrew G Clark; Jian Li
Journal: Nat Genet Date: 2007-07 Impact factor: 38.330

3. Biostatistical aspects of genome-wide association studies.

Authors: Andreas Ziegler; Inke R König; John R Thompson
Journal: Biom J Date: 2008-02 Impact factor: 2.207

4. The Third Generation Cohort of the National Heart, Lung, and Blood Institute's Framingham Heart Study: design, recruitment, and initial examination.

Authors: Greta Lee Splansky; Diane Corey; Qiong Yang; Larry D Atwood; L Adrienne Cupples; Emelia J Benjamin; Ralph B D'Agostino; Caroline S Fox; Martin G Larson; Joanne M Murabito; Christopher J O'Donnell; Ramachandran S Vasan; Philip A Wolf; Daniel Levy
Journal: Am J Epidemiol Date: 2007-03-19 Impact factor: 4.897

5. Repeated replication and a prospective meta-analysis of the association between chromosome 9p21.3 and coronary artery disease.

Authors: Heribert Schunkert; Anika Götz; Peter Braund; Ralph McGinnis; David-Alexandre Tregouet; Massimo Mangino; Patrick Linsel-Nitschke; Francois Cambien; Christian Hengstenberg; Klaus Stark; Stefan Blankenberg; Laurence Tiret; Pierre Ducimetiere; Andrew Keniry; Mohammed J R Ghori; Stefan Schreiber; Nour Eddine El Mokhtari; Alistair S Hall; Richard J Dixon; Alison H Goodall; Henrike Liptau; Helen Pollard; Daniel F Schwarz; Ludwig A Hothorn; H-Erich Wichmann; Inke R König; Marcus Fischer; Christa Meisinger; Willem Ouwehand; Panos Deloukas; John R Thompson; Jeanette Erdmann; Andreas Ziegler; Nilesh J Samani
Journal: Circulation Date: 2008-03-24 Impact factor: 29.690

6. A second generation human haplotype map of over 3.1 million SNPs.

Authors: Kelly A Frazer; Dennis G Ballinger; David R Cox; David A Hinds; Laura L Stuve; Richard A Gibbs; John W Belmont; Andrew Boudreau; Paul Hardenbol; Suzanne M Leal; Shiran Pasternak; David A Wheeler; Thomas D Willis; Fuli Yu; Huanming Yang; Changqing Zeng; Yang Gao; Haoran Hu; Weitao Hu; Chaohua Li; Wei Lin; Siqi Liu; Hao Pan; Xiaoli Tang; Jian Wang; Wei Wang; Jun Yu; Bo Zhang; Qingrun Zhang; Hongbin Zhao; Hui Zhao; Jun Zhou; Stacey B Gabriel; Rachel Barry; Brendan Blumenstiel; Amy Camargo; Matthew Defelice; Maura Faggart; Mary Goyette; Supriya Gupta; Jamie Moore; Huy Nguyen; Robert C Onofrio; Melissa Parkin; Jessica Roy; Erich Stahl; Ellen Winchester; Liuda Ziaugra; David Altshuler; Yan Shen; Zhijian Yao; Wei Huang; Xun Chu; Yungang He; Li Jin; Yangfan Liu; Yayun Shen; Weiwei Sun; Haifeng Wang; Yi Wang; Ying Wang; Xiaoyan Xiong; Liang Xu; Mary M Y Waye; Stephen K W Tsui; Hong Xue; J Tze-Fei Wong; Luana M Galver; Jian-Bing Fan; Kevin Gunderson; Sarah S Murray; Arnold R Oliphant; Mark S Chee; Alexandre Montpetit; Fanny Chagnon; Vincent Ferretti; Martin Leboeuf; Jean-François Olivier; Michael S Phillips; Stéphanie Roumy; Clémentine Sallée; Andrei Verner; Thomas J Hudson; Pui-Yan Kwok; Dongmei Cai; Daniel C Koboldt; Raymond D Miller; Ludmila Pawlikowska; Patricia Taillon-Miller; Ming Xiao; Lap-Chee Tsui; William Mak; You Qiang Song; Paul K H Tam; Yusuke Nakamura; Takahisa Kawaguchi; Takuya Kitamoto; Takashi Morizono; Atsushi Nagashima; Yozo Ohnishi; Akihiro Sekine; Toshihiro Tanaka; Tatsuhiko Tsunoda; Panos Deloukas; Christine P Bird; Marcos Delgado; Emmanouil T Dermitzakis; Rhian Gwilliam; Sarah Hunt; Jonathan Morrison; Don Powell; Barbara E Stranger; Pamela Whittaker; David R Bentley; Mark J Daly; Paul I W de Bakker; Jeff Barrett; Yves R Chretien; Julian Maller; Steve McCarroll; Nick Patterson; Itsik Pe'er; Alkes Price; Shaun Purcell; Daniel J Richter; Pardis Sabeti; Richa Saxena; Stephen F Schaffner; Pak C Sham; Patrick Varilly; David Altshuler; Lincoln D 
Stein; Lalitha Krishnan; Albert Vernon Smith; Marcela K Tello-Ruiz; Gudmundur A Thorisson; Aravinda Chakravarti; Peter E Chen; David J Cutler; Carl S Kashuk; Shin Lin; Gonçalo R Abecasis; Weihua Guan; Yun Li; Heather M Munro; Zhaohui Steve Qin; Daryl J Thomas; Gilean McVean; Adam Auton; Leonardo Bottolo; Niall Cardin; Susana Eyheramendy; Colin Freeman; Jonathan Marchini; Simon Myers; Chris Spencer; Matthew Stephens; Peter Donnelly; Lon R Cardon; Geraldine Clarke; David M Evans; Andrew P Morris; Bruce S Weir; Tatsuhiko Tsunoda; James C Mullikin; Stephen T Sherry; Michael Feolo; Andrew Skol; Houcan Zhang; Changqing Zeng; Hui Zhao; Ichiro Matsuda; Yoshimitsu Fukushima; Darryl R Macer; Eiko Suda; Charles N Rotimi; Clement A Adebamowo; Ike Ajayi; Toyin Aniagwu; Patricia A Marshall; Chibuzor Nkwodimmah; Charmaine D M Royal; Mark F Leppert; Missy Dixon; Andy Peiffer; Renzong Qiu; Alastair Kent; Kazuto Kato; Norio Niikawa; Isaac F Adewole; Bartha M Knoppers; Morris W Foster; Ellen Wright Clayton; Jessica Watkin; Richard A Gibbs; John W Belmont; Donna Muzny; Lynne Nazareth; Erica Sodergren; George M Weinstock; David A Wheeler; Imtaz Yakub; Stacey B Gabriel; Robert C Onofrio; Daniel J Richter; Liuda Ziaugra; Bruce W Birren; Mark J Daly; David Altshuler; Richard K Wilson; Lucinda L Fulton; Jane Rogers; John Burton; Nigel P Carter; Christopher M Clee; Mark Griffiths; Matthew C Jones; Kirsten McLay; Robert W Plumb; Mark T Ross; Sarah K Sims; David L Willey; Zhu Chen; Hua Han; Le Kang; Martin Godbout; John C Wallenburg; Paul L'Archevêque; Guy Bellemare; Koji Saeki; Hongguang Wang; Daochang An; Hongbo Fu; Qing Li; Zhen Wang; Renwu Wang; Arthur L Holden; Lisa D Brooks; Jean E McEwen; Mark S Guyer; Vivian Ota Wang; Jane L Peterson; Michael Shi; Jack Spiegel; Lawrence M Sung; Lynn F Zacharia; Francis S Collins; Karen Kennedy; Ruth Jamieson; John Stewart
Journal: Nature Date: 2007-10-18 Impact factor: 49.962

7. New susceptibility locus for coronary artery disease on chromosome 3q22.3.

Authors: Jeanette Erdmann; Anika Grosshennig; Peter S Braund; Inke R König; Christian Hengstenberg; Alistair S Hall; Patrick Linsel-Nitschke; Sekar Kathiresan; Ben Wright; David-Alexandre Trégouët; Francois Cambien; Petra Bruse; Zouhair Aherrahrou; Arnika K Wagner; Klaus Stark; Stephen M Schwartz; Veikko Salomaa; Roberto Elosua; Olle Melander; Benjamin F Voight; Christopher J O'Donnell; Leena Peltonen; David S Siscovick; David Altshuler; Piera Angelica Merlini; Flora Peyvandi; Luisa Bernardinelli; Diego Ardissino; Arne Schillert; Stefan Blankenberg; Tanja Zeller; Philipp Wild; Daniel F Schwarz; Laurence Tiret; Claire Perret; Stefan Schreiber; Nour Eddine El Mokhtari; Arne Schäfer; Winfried März; Wilfried Renner; Peter Bugert; Harald Klüter; Jürgen Schrezenmeir; Diana Rubin; Stephen G Ball; Anthony J Balmforth; H-Erich Wichmann; Thomas Meitinger; Marcus Fischer; Christa Meisinger; Jens Baumert; Annette Peters; Willem H Ouwehand; Panos Deloukas; John R Thompson; Andreas Ziegler; Nilesh J Samani; Heribert Schunkert
Journal: Nat Genet Date: 2009-02-08 Impact factor: 38.330

8. Genomewide association analysis of coronary artery disease.

Authors: Nilesh J Samani; Jeanette Erdmann; Alistair S Hall; Christian Hengstenberg; Massimo Mangino; Bjoern Mayer; Richard J Dixon; Thomas Meitinger; Peter Braund; H-Erich Wichmann; Jennifer H Barrett; Inke R König; Suzanne E Stevens; Silke Szymczak; David-Alexandre Tregouet; Mark M Iles; Friedrich Pahlke; Helen Pollard; Wolfgang Lieb; Francois Cambien; Marcus Fischer; Willem Ouwehand; Stefan Blankenberg; Anthony J Balmforth; Andrea Baessler; Stephen G Ball; Tim M Strom; Ingrid Braenne; Christian Gieger; Panos Deloukas; Martin D Tobin; Andreas Ziegler; John R Thompson; Heribert Schunkert
Journal: N Engl J Med Date: 2007-07-18 Impact factor: 91.245

9. Picking single-nucleotide polymorphisms in forests.

Authors: Daniel F Schwarz; Silke Szymczak; Andreas Ziegler; Inke R König
Journal: BMC Proc Date: 2007-12-18