Literature DB >> 18466556

Two-stage approach for identifying single-nucleotide polymorphisms associated with rheumatoid arthritis using random forests and Bayesian networks.

Yan Meng¹, Qiong Yang, Karen T Cuenco, L Adrienne Cupples, Anita L Destefano, Kathryn L Lunetta.

Abstract

We used the simulated data set from Genetic Analysis Workshop 15 Problem 3 to assess a two-stage approach for identifying single-nucleotide polymorphisms (SNPs) associated with rheumatoid arthritis (RA). In the first stage, we used random forests (RF) to screen large amounts of genetic data using the variable importance measure, which takes into account SNP interaction effects as well as main effects without requiring model specification. We used the simulated 9187 SNPs mimicking a 10 K SNP chip, along with covariates DR (the simulated DRB1 gentoype), smoking, and sex as input to the RF analyses with a training set consisting of 750 unrelated RA cases and 750 controls. We used an iterative RF screening procedure to identify a smaller set of variables for further analysis. In the second stage, we used the software program CaMML for producing Bayesian networks, and developed complex etiologic models for RA risk using the variables identified by our RF screening procedure. We evaluated the performance of this method using independent test data sets for up to 100 replicates.

Entities: Chemical Disease Gene Species

Year: 2007 PMID： 18466556 PMCID： PMC2367609 DOI： 10.1186/1753-6561-1-s1-s56

Source DB: PubMed Journal: BMC Proc ISSN： 1753-6561

Background

It is commonly believed that complex diseases are caused not by single genes acting alone, but by multiple genes and non-genetic factors interacting with one another. Due to the large number of single-nucleotide polymorphisms (SNPs) now available in genome-wide scans, the computational burden of testing each locus for main effects and all possible two-way, three-way, and higher-order interactions is overwhelming. One approach to reducing the number of interactions to examine is to perform a two-stage analysis. In the first stage, one identifies a subset of SNPs for further analysis of interaction in the second stage. Often, a univariate test (e.g., a chi-square test) is used to identify SNPs associated with outcome in the first stage. When risk-associated SNPs have small marginal effects but large interaction effects in the population, univariate methods will result in low power for detecting these SNPs. "Multi-locus" approaches consider interactions of multiple genes and environmental factors in identifying susceptibility loci for complex diseases [1]. Random Forests (RFs) [2] provide a powerful method for detecting interacting risk susceptibility SNPs (rSNPs) [3]. However, this method does not provide a model that delineates the interactions. Bayesian networks (or directed graphical models) are graphs in which the nodes represent random variables and the arrows represent dependence relationships [4]. These methods have been successfully applied to generate a model describing the relationship among SNPs in multiple candidate genes for a complex trait [5].

Methods

We used the 100 replicates of simulated data in Problem 3 provided by the Genetic Analysis Workshop 15 (GAW15). We performed analyses with knowledge of the "real" answers but screened all of the 9187 SNPs, distributed on the genome to mimic a 10 K SNP chip without regard to the generating model. We used disease status for rheumatoid arthritis (RA) as the outcome and smoking, sex and DR genotype (the simulated DRB1 genotype) as covariates.

Subjects

To obtain biologically independent cases, for each replicate we randomly selected one affected sibling from each of 1500 nuclear families. These 1500 cases were then divided at random into a training data set of 750 affected subjects and a test data set of 750 cases. The GAW data provided 2000 unrelated unaffected individuals for use as controls. Two independent sets of 750 controls were selected at random from the 2000 for use as training data set and test data set controls. Thus, for each replicate we had independent training and test data sets consisting of 750 cases and 750 controls.

Random Forests

RFs grow a large number of classification or regression trees with no trimming or pruning. The RF method produces an importance score for each variable that quantifies the relative contribution of that variable to the prediction accuracy. We used this score to prioritize the predictor variables. The RF also produces prediction errors for the individuals, which we used for evaluation of the method. We used Random Forests version 5 [6] to analyze the training data. We used an iterative process similar to a strategy previously proposed for gene expression analysis [7] in which, at each iteration, we built a random forest using the training data, and saved the 50% of variables with the highest importance scores to build the next forest. The random forests built at each iteration were named IT0, IT1,..., IT, and the prediction errors of the training data set were estimated for the forest built at each iteration. We call the forest with the best prediction error IT. The variables included in ITwere then used in second-stage analysis. We compared the performance of the iterative procedure that resulted in the forest IT, in terms of keeping the true risk variables and removing noise variables, to a simple procedure of selecting the top 50 ranked variables by importance from iteration 0, in the test data sets. Specifically, we compared the prediction error of the ITforest, the IT0 forest (all variables used; no selection), and a forest built using only the top 50 SNPs from the IT0 forest ("ITtop50"). Because the iterative procedure averaged 53 SNPs in the final forest, we chose 50 SNPs from IT0 to yield a forest with approximately the same number of SNPs. We computed prediction error using the test data ("test"), and using the out-of-bag observations of the training data set ("training").

Network inference

Bayesian networks (BN) are directed acyclic graphs for representing the joint probability distribution of all variables. A network for discrete variables, e.g., Figure 1, is specified by the graph structure (nodes and arcs) and the conditional probability table (CPT) at each node (node chr6_162 is shown). Each node is a variable, and each directed arc implies association and direction of dependency between the two variables. The origin node of an arc is usually called the parental node, and nodes that an arc points to are called child nodes. A child node is conditionally independent of other nodes given its parental nodes. Thus, the joint probability of n variables can be simplified to , where x, x,..., xin the condition are the parental nodes of xand a subset of x1, x2, x,..., x, x. BN models are useful for describing complex relationships among variables, as well as for making predictions for variables that are regarded as outcomes.

Figure 1

Bayesian network based on variables of ITbp for Replicate 1.

Bayesian network based on variables of ITbp for Replicate 1. CaMML [8], Causal Minimum Message Length (MML), is a program for generating Bayesian networks. The general goal is to find a model that maximizes the posterior probability of that model given the data. CaMML searches over all possible structures (models) using the Metropolis algorithm. It uses MML as a metric that includes a penalty on model complexity to control the resampling process. We evaluated the performance of CaMML on a set of variables used in the ITforest described above. We used the test data set to predict case status and estimate prediction error.

Results

We identified the best surrogates for all risk loci (A-G) as the SNPs with the highest linkage disequilibrium (LD) (r2) with risk loci from the answer files given with the GAW15 data (Table 1). For locus C, three SNPs had r2 ≥ 0.2; for locus D, two SNPs had r2 ≥ 0.2. When analyzing the results, we considered these SNPs true positives, in that they are the best proxies for the true risk loci that were not genotyped.

Table 1

Power estimate of ITbp and CaMML

Variables	Disease locus	R²with disease locus	IT_bp100 replicates	IT_bpReplicate 1–50	CaMML Replicate 1–50
DR genotype^a	NA	NA	100%	100%	100%
Sex	NA	NA	100%	100%	100%
Smoking	NA	NA	96%	96%	96%
chr 6_154	C	0.958	100%	100%	100%
chr 6_153	C	0.563	100%	100%	100%
chr 6_152	C	0.418	100%	100%	100%
chr 6_155	C	0.104	97%	98%	98%
chr 6_150	C	0.027	13%	8%	4%
chr 6_149	C	0.014	6%	8%	4%
chr 6_139	C	0.009	18%	20%	10%
chr 6_138	C	0.009	17%	18%	10%
chr 6_140	C	0.007	1%	2%	2%
chr 6_134	C	0.007	4%	4%	4%
chr 6_137	C	0.006	2%	2%	2%
chr 6_130	C	0.006	8%	6%	2%
chr 6_148	C	0.005	9%	8%	4%
chr 6_147	C	0.004	9%	10%	8%
chr 6_135	C	0.002	3%	6%	2%
chr 6_145	C	0.001	35%	32%	24%
chr 6_132	C	0.0	7%	6%	6%
chr 6_162	D	0.902	100%	100%	100%
chr 6_160	D	0.273	67%	68%	66%
chr 6_156	D	0.001	11%	14%	2%
chr 11_387	F	0.135	5%	6%	6%
chr 11_388	F	0.064	5%	4%	4%
chr 11_389	F	0.934	98%	100%	100%
chr 11_391	F	0.031	1%	2%	2%
chr 16_29	A	0.001	1%	0%	0%
chr 18_269	E	0.171	51%	48%	10%
chr 8_442	B	0.001	0%	0%	0%
chr 9_186	G	0.021	0%	0%	0%
chr 9_189	H	0.014	0%	0%	0%

aSurrogates and covariates are in bold.

Power estimate of ITbp and CaMML aSurrogates and covariates are in bold.

Risk variables identified by RF

We compared ITand IT0 top 50 for choosing a set of variables by comparing how often the best surrogates for loci A-G appeared in the variable set. DR and the best surrogates for C, D, and F were included in 94 and 98 out of 100 replicates for the ITforest and the top 50 variables for IT0 forest, respectively. The average number of variables included in the ITforest was 53 (range 8–287). The ITforest occurred, on average, at iteration 7.64 (range 5–10).

Estimate of prediction error

As seen in Table 2, the mean and median prediction error for the training data sets is smaller than that for the test data sets for the ITand ITtop50 methods (median differences -2.77, -0.93, p < 0.0001), which may indicate overfitting. The IT0 forest gives similar prediction error for test and training data.

Table 2

Prediction error for random forest analyses

	IT_bp		IT_top50		IT₀

Statistics	Training	Test	Training	Test	Training	Test	CaMML Test
Mean	11.28	14.05	12.73	13.60	14.60	14.73	12.42
SD	0.83	0.90	0.95	0.85	0.96	0.91	0.97
Min	9.80	12.20	10.93	11.60	12.27	12.20	10.35
Max	14.73	16.87	16.00	15.47	18.00	16.53	16.00
p-Value^a	5.26 × 10^-18		1.35 × 10^-9		0.25
Difference in median	-2.77		-0.93		-0.17

ap-Value of the paired Wilcoxon rank test comparing training and test data prediction error.

Prediction error for random forest analyses ap-Value of the paired Wilcoxon rank test comparing training and test data prediction error. For the training data sets, the mean prediction error for the ITforests is smaller than that for the IT0 forests; the ITtop50 forests fall in between (Table 3). For the test data sets, although both ITtop50 and IToutperform IT0, the IThas larger prediction error than ITtop50 (difference in median = 0.43, p < 0.0001), which might be due to overfitting for the iterative method.

Table 3

Paired Wilcoxon rank test of prediction errors from three RFs, using Replicates 1–100

	Training data		Test data

Comparison of prediction errors	p-Value	Difference in median	p-Value	Difference in median
IT_bpvs. IT_top50	3.94 × 10^-18	-1.40	9.09 × 10^-10	0.43
IT_bpvs. IT₀	3.95 × 10^-18	-3.33	2.57 × 10^-12	-0.73
IT_top50vs. IT₀	3.94 × 10^-18	-1.87	1.20 × 10^-17	-2.10

Paired Wilcoxon rank test of prediction errors from three RFs, using Replicates 1–100 We used CaMML to analyze the variables selected from ITfor Replicates 1 to 50. Due to computational limits, if more than 50 variables were selected by IT, only the top 50 variables were used for second-stage analysis. With the maximum number of variables restricted to 50, the average number of variables used in CaMML across the 50 replicates was 40. In the estimated BNs, an average of 11 variables were connected to RA status directly or indirectly through other variables in a path of a network that included RA status. The average prediction error using the test data was 12.4% (Table 2), which is smaller than that of IT(Table 4). An example BN with the conditional probability table (CPT) for node chr6_162, using Replicate 1 is displayed in Figure 1. In this BN, all SNPs included in the analysis with r2 > 0.3 with one of the disease loci (Table 1) were connected directly or indirectly to RA. Many SNPs on chromosome 6 were interconnected due to LD between these markers. The CPT for node chr6_162 showed 5.5-fold increased risk of RA for carrying allele 3 versus allele 2.

Table 4

Paired Wilcoxon rank test of prediction errors from three RFs and CaMML using test data and Replicates 1–50

	Test data

Comparison of prediction errors	p-Value	Difference in median
CaMML vs. IT_bp	1.10 × 10^-8	-1.52
CaMML vs. IT_top50	1.04 × 10^-6	-1.13
CaMML vs. IT₀	2.16 × 10^-9	-2.32

Paired Wilcoxon rank test of prediction errors from three RFs and CaMML using test data and Replicates 1–50 Table 1 displays the frequency of variables appearing in the network for Replicates 1–50. We have 100% power to detect SNP6_152, SNP6_153, SNP6_154 (surrogates for locus C), SNP6_162 (surrogate for locus D), and SNP11_389 (surrogate for locus F), all of which have strong LD (r2 ≥ 0.418) with disease loci. We have lower power (66%) to detect SNP6_160, a surrogate for D that is in lower LD (r2 = 0.273). Despite its low LD with locus C (r2 = 0.104), the power to detect SNP6_155 is 98%. This may be due to the very strong effect of locus C. Importantly, CaMML identified all covariates (DR, sex, and smoking) and almost all surrogates in LD with disease loci (with exception of SNP6_160, which was not detected by CaMML in one replicate) as part of the RA network from variables selected from IT.

Discussion

Using the simulated data from Problem 3, we assessed a two-stage approach for identifying SNPs associated with RA that employs random forests to identify important variables, and Bayesian networks to further filter out noise SNPs by reducing prediction error. The random forest analysis reduced the number of variables for further Bayesian network analyses from 9190 to about 53. This screening strategy successfully filtered out many SNPs unassociated with the disease loci, while keeping the surrogates for risk SNPs for four out of nine of these loci (DR, C, D, and F) in 94 of 100 replicates. Although ITseems to give lower prediction error than ITtop50 in training data sets, ITtop50 gives lower prediction error than ITin test data sets. Therefore, the strategy of building a second forest using the top 50 SNPs from a first forest may be a better variable selection method overall. However, the effects of these loci in this data set are very strong, and it is not clear that this result will generalize to data weaker association signals. Further, it is not clear how to choose the number of variables to select if one uses the simpler procedure. Additional simulation studies are needed to determine how to generalize our results to less ideal circumstances. The fact that the difference in the median of prediction errors for training and test data sets are large for ITsuggests overfitting; however, because we removed a large (50%) proportion of "noise" in this first stage, ITis not expected to be the optimal RF with the lowest prediction error. It is possible to remove one noise variable at a time; however, it is not practical in the context of thousands of variables. We expected the BN analysis to further reduce the number of noise SNPs and provide some guidance as to important interaction effects. Bayesian network analysis based on a subset of the variables (≤ 50) selected from ITcaptured most of the true loci and the correct dependencies among them and further decreased the test set prediction error. The network model provides a method for predicting case status and facilitates the understanding of complex relationships between the disease and genetic and environmental factors. The limitations of BN include the difficulty to discern the exact relationship between variables that are interconnected and the exponential increase in computation time with the number of variables. These make BN impractical for genome-wide scan of dense SNPs. However, BN results are at least useful to generate potentially biological meaningful hypotheses to be confirmed by further statistical analyses or/and biological experiments.

Competing interests

The author(s) declare that they have no competing interests.

4 in total

Review 1. Mathematical multi-locus approaches to localizing complex human trait genes.

Authors: Josephine Hoh; Jurg Ott
Journal: Nat Rev Genet Date: 2003-09 Impact factor: 53.242

2. Genetic dissection and prognostic modeling of overt stroke in sickle cell anemia.

Authors: Paola Sebastiani; Marco F Ramoni; Vikki Nolan; Clinton T Baldwin; Martin H Steinberg
Journal: Nat Genet Date: 2005-03-20 Impact factor: 38.330

3. Gene selection and classification of microarray data using random forest.

Authors: Ramón Díaz-Uriarte; Sara Alvarez de Andrés
Journal: BMC Bioinformatics Date: 2006-01-06 Impact factor: 3.169

4. Screening large-scale association study data: exploiting interactions using random forests.

Authors: Kathryn L Lunetta; L Brooke Hayward; Jonathan Segal; Paul Van Eerdewegh
Journal: BMC Genet Date: 2004-12-10 Impact factor: 2.797

4 in total

13 in total

1. TRM: a powerful two-stage machine learning approach for identifying SNP-SNP interactions.

Authors: Hui-Yi Lin; Y Ann Chen; Ya-Yu Tsai; Xiaotao Qu; Tung-Sung Tseng; Jong Y Park
Journal: Ann Hum Genet Date: 2011-12-11 Impact factor: 1.670

2. Evaluation of a two-stage framework for prediction using big genomic data.

Authors: Xia Jiang; Richard E Neapolitan
Journal: Brief Bioinform Date: 2015-03-18 Impact factor: 11.622

Review 3. Random forests for genetic association studies.

Authors: Benjamin A Goldstein; Eric C Polley; Farren B S Briggs
Journal: Stat Appl Genet Mol Biol Date: 2011-07-12

4. Cost-Effectiveness of Peer- Versus Venue-Based Approaches for Detecting Undiagnosed HIV Among Heterosexuals in High-Risk New York City Neighborhoods.

Authors: Elizabeth R Stevens; Kimberly A Nucifora; Qinlian Zhou; Ronald Scott Braithwaite; Charles M Cleland; Amanda S Ritchie; Alexandra H Kutnick; Marya V Gwadz
Journal: J Acquir Immune Defic Syndr Date: 2018-02-01 Impact factor: 3.731

5. A comparative analysis of methods for predicting clinical outcomes using high-dimensional genomic datasets.

Authors: Xia Jiang; Binghuang Cai; Diyang Xue; Xinghua Lu; Gregory F Cooper; Richard E Neapolitan
Journal: J Am Med Inform Assoc Date: 2014-04-15 Impact factor: 4.497

6. Immunologic profiles distinguish aviremic HIV-infected adults.

Authors: Christina M Ramirez; Elizabeth Sinclair; Lorrie Epling; Sulggi A Lee; Vivek Jain; Priscilla Y Hsue; Hiroyu Hatano; Daniel Conn; Frederick M Hecht; Jeffrey N Martin; Joseph M McCune; Steven G Deeks; Peter W Hunt
Journal: AIDS Date: 2016-06-19 Impact factor: 4.177