Willows: a memory efficient tree and forest construction package.

Heping Zhang, Minghui Wang, Xiang Chen.

Abstract

BACKGROUND: Existing tree and forest methods are powerful bioinformatics tools for exploring high-dimensional data, including high-throughput genomic data. However, they cannot handle the data generated by recent genotyping platforms for single nucleotide polymorphisms (SNPs), because the massive size of the data places excessive demands on memory.
RESULTS: Using the recursive partitioning technique, we developed a new software package, Willows, that makes efficient use of computer memory and makes it feasible to analyze massive genotype data. The package includes three tree-based methods (classification tree, random forest, and deterministic forest) and can efficiently handle massive amounts of SNP data. In addition, users can easily set different options (e.g., algorithms and specifications) and predict the class of test samples.
CONCLUSION: We developed Willows with a user-friendly interface and with the goal of maximizing the use of memory, which is critical for the analysis of genomic data. The Willows package is well documented and publicly available at http://c2s2.yale.edu/software/Willows.

Year: 2009 PMID: 19416535 PMCID: PMC2683818 DOI: 10.1186/1471-2105-10-130

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Background

The successes of genome-wide association (GWA) studies have demonstrated repeatedly that single nucleotide polymorphisms (SNPs) can be used to identify genetic variants underlying complex diseases [1-5]. Thanks to those successes, GWA studies have emerged as the most effective study design for identifying candidate genes. Classification trees and forest-based methods [6-9] are powerful tools for identifying complex relationships between a response and many predictors, particularly when the predictors have interactive effects on the response. These methods have been widely used, for example in the analysis of genomic data [10-13]. However, the sheer scale of GWA data presents a significant computational challenge to any data analysis. For example, the genotype data from the Framingham Heart Study (FHS; 9,300 subjects and 550,000 SNPs) require more than 38.1 GB of memory for input when each genotype at a SNP marker is stored as a double, or 4.8 GB when it is stored as a byte. For a typical GWA study, e.g., the Cancer Genetic Markers of Susceptibility (CGEMS) breast cancer project (2,434 subjects and 550,000 SNPs) [14], the genotype data occupy 10 GB in the double type and 1.2 GB in the byte type. None of the existing tree/forest tools is capable of analyzing data of this size on commonly available computing facilities. It is noteworthy that PLINK [15] and Chen et al. [16] already use memory-efficient algorithms similar to those we propose for trees and forests, and that the compressed data format designed by PLINK has been adopted by NCBI to distribute GWA data. Thus, incorporating a memory-efficient algorithm into other statistical methods, such as tree- and forest-based methods, is imperative if those well-established methods are to be applied to ultra-dense SNP data. To this end, we have developed a new software package, Willows. The statistical method is based on the classical recursive partitioning technique [17,18].
Compression/decompression algorithms have been implemented in Willows to reduce the memory used for the storage and analysis of SNP data. Three recursive partitioning-based methods (classification tree, random forest, and deterministic forest) are included in the package, which can efficiently handle massive amounts of SNP data. In addition, the package is equipped with a user-friendly graphical interface through which users can easily select different options (e.g., algorithms and specifications) and predict the class of a test sample.
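
The memory figures quoted above can be checked with a few lines of arithmetic. The following sketch is purely illustrative and is not part of Willows; it assumes that "GB" denotes 2^30 bytes and that a double occupies 8 bytes, and the function name is hypothetical.

```python
GIB = 1024 ** 3  # assumption: "GB" in the text means 2**30 bytes


def genotype_memory_gib(subjects: int, snps: int, bits_per_genotype: int) -> float:
    """Memory needed to hold a subjects x snps genotype matrix, in units of 2**30 bytes."""
    total_bits = subjects * snps * bits_per_genotype
    return total_bits / 8 / GIB


# Data set sizes as quoted in the text.
for name, n, m in [("FHS", 9300, 550000), ("CGEMS", 2434, 550000)]:
    as_double = genotype_memory_gib(n, m, 64)  # one 8-byte double per genotype
    as_byte = genotype_memory_gib(n, m, 8)     # one byte per genotype
    as_2bit = genotype_memory_gib(n, m, 2)     # two bits per genotype
    print(f"{name}: double {as_double:.1f}, byte {as_byte:.1f}, 2-bit {as_2bit:.2f}")
```

Under these assumptions the double and byte columns come out at 38.1 and 4.8 for FHS and at 10.0 and 1.2 for CGEMS, in line with the figures in the text.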

Implementation

Classification tree

A classification tree is based on the recursive partitioning method [6,18]. It extracts homogeneous strata from the sample and builds a classification rule to predict class membership. A splitting rule consists of two components: a predictor and its corresponding threshold. The quality of a splitting rule is measured by a node impurity such as the Gini index or entropy. Once the root node is split into two daughter nodes, the daughter nodes can be further split by repeating the splitting procedure. This partitioning process continues recursively until no further split is possible. To avoid overfitting, pruning procedures are used to eliminate redundant nodes [18-20].
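
As an illustration of how a splitting rule can be scored, the sketch below evaluates every threshold of a single SNP by the decrease in Gini impurity. It is a minimal example rather than the Willows implementation: the 0/1/2 genotype coding, the "genotype <= threshold" form of the rule, and the function names are assumptions made for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(genotypes, labels):
    """Return (threshold, impurity_decrease) for the best rule 'genotype <= threshold'."""
    n = len(labels)
    parent = gini(labels)
    best = (None, 0.0)
    # The largest observed value is skipped: it would send every subject left.
    for threshold in sorted(set(genotypes))[:-1]:
        left = [y for g, y in zip(genotypes, labels) if g <= threshold]
        right = [y for g, y in zip(genotypes, labels) if g > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        decrease = parent - weighted
        if decrease > best[1]:
            best = (threshold, decrease)
    return best


# Toy data: one SNP coded 0/1/2 (assumed coding) and a binary outcome.
snp = [0, 0, 0, 1, 1, 1, 2, 2]
outcome = [0, 0, 0, 0, 0, 1, 1, 1]
threshold, decrease = best_split(snp, outcome)
print(threshold, round(decrease, 4))
```

In a full tree, the same search would be repeated over all candidate predictors at each node, and then recursively within the two daughter nodes.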

Random forest

A random forest [7] grows many classification trees instead of one. Suppose that the sample size of a data set is N. First, we draw N observations at random, with replacement, from the original data. Then, we grow a tree using this bootstrap sample. Trees in a random forest are built differently from the classification tree described in the previous section in two ways: (a) the trees in the random forest are not pruned; and (b) we do not consider all predictors in selecting the optimal node split. Instead, if there are M predictors in the original data set, m of the M predictors are chosen at random as candidates for splitting each node, where m is a pre-specified number much smaller than M. A random forest ranks variables by a variable importance index [7], which reflects the "importance" of a variable on the basis of classification accuracy while taking the interactions among variables into account. Specifically, each tree in a random forest is constructed from a different bootstrap sample of the original cohort. About one-third of the samples are left out of each bootstrap sample and hence are not used in the construction of the corresponding tree. These left-out samples are referred to as the out-of-bag (oob) samples. To determine the importance of a variable (i.e., predictor), the values of the variable in the oob samples are first randomly permuted; then both the original oob samples and the permuted oob samples are classified by the corresponding tree. The difference in the correct classification rates between the original and permuted oob samples measures the importance of the variable for that tree, and the overall variable importance is obtained by averaging these differences over all trees in the random forest.
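
The bootstrap and permutation steps can be sketched for a single tree as follows. This is a hedged illustration, not the Willows code: the function names are hypothetical, and a fixed one-variable rule (`stump`) stands in for a fitted tree. The expected oob fraction is (1 - 1/N)^N, which approaches 1/e (about 0.37) for large N, consistent with the "about one-third" quoted above.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n indices with replacement; return (in-bag indices, out-of-bag indices)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    oob = [i for i in range(n) if i not in chosen]
    return in_bag, oob


def permutation_importance(predict, rows, labels, oob, variable, rng):
    """Drop in oob accuracy for one tree after permuting one variable among the oob samples."""
    def accuracy(data):
        return sum(predict(r) == labels[i] for r, i in zip(data, oob)) / len(oob)

    original = [list(rows[i]) for i in oob]
    values = [r[variable] for r in original]
    rng.shuffle(values)  # permute the chosen variable across the oob samples
    permuted = [r[:variable] + [v] + r[variable + 1:] for r, v in zip(original, values)]
    return accuracy(original) - accuracy(permuted)


rng = random.Random(0)
n = 1000
rows = [[rng.randrange(3), rng.randrange(3)] for _ in range(n)]  # two toy SNPs coded 0/1/2
labels = [int(r[0] == 2) for r in rows]                          # outcome depends on SNP 0 only
stump = lambda r: int(r[0] == 2)                                 # stand-in for a fitted tree

in_bag, oob = bootstrap_indices(n, rng)
oob_fraction = len(oob) / n
imp0 = permutation_importance(stump, rows, labels, oob, 0, rng)
imp1 = permutation_importance(stump, rows, labels, oob, 1, rng)
print(oob_fraction, imp0, imp1)
```

In a forest, the per-tree differences would be averaged over all trees to give the overall importance of each variable. Here the second SNP is never used by the stand-in tree, so permuting it leaves the oob accuracy unchanged.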

Deterministic forest

Like a random forest, a deterministic forest [8,11] is an ensemble of classification trees. Because of the large number of covariates, multiple splits may perform very similarly in terms of the quality of the split and the prediction accuracy of the outcome. Thus, it is useful to consider all competitive splits and to construct a forest consisting of these competitive trees. Specifically, a pre-specified number (for example, 20) of the top splits of the root node and a pre-specified number (for example, 3) of the top splits of each of the two daughter nodes of the root node are selected. These combinations generate a total of 180 (20 × 3 × 3) possible trees, which form a deterministic forest. The frequency with which each predictor is used to split a node is indicative of the importance of the predictor. A deterministic forest differs from a random forest in that it is constructed in a deterministic and reproducible manner and in that its trees tend to be very limited in size. A deterministic forest is not only computationally more efficient than a random forest; its reproducibility also makes it easier to interpret.
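
The count of 180 trees follows from combining each retained root split with each pair of retained daughter splits. The short sketch below enumerates these combinations; the function name and the labels for the splits are placeholders, not Willows identifiers.

```python
from itertools import product


def enumerate_forest(n_root_splits, n_daughter_splits):
    """List every (root split, left-daughter split, right-daughter split) combination.

    Splits are identified only by their rank (0 = best); in a real forest the
    candidate daughter splits would depend on the chosen root split.
    """
    return [
        (root, left, right)
        for root in range(n_root_splits)
        for left, right in product(range(n_daughter_splits), repeat=2)
    ]


forest = enumerate_forest(20, 3)
print(len(forest))  # 20 * 3 * 3 combinations
```

By the same arithmetic, keeping 2 root splits and 2 splits per daughter node would yield 2 × 2 × 2 = 8 trees; this is only one setting that produces a forest of that size and is not taken from the text.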

Missing value

Given the massive number of SNPs, we expect that some SNP genotypes will be missing because of either mishandling or poor quality. There are two simple approaches to dealing with missing SNPs. First, we can impute a missing SNP on the basis of the allele frequency in the data or the haplotype block covering the missing SNP. After this imputation, all of the missing SNPs are replaced by the imputed SNPs, and the "completed" data are then fed to Willows. Alternatively, the Missings Together Approach [18] can be adopted; namely, the subjects with missing SNPs are grouped together so that they can be easily tracked. In the tree framework, the first approach is expected to produce trees with a lower misclassification rate than the second approach. When forests are constructed, however, further comparison is warranted to determine which of the two approaches leads to better-performing forests.
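
The two approaches can be illustrated on a single SNP as follows. This is a rough sketch under stated assumptions, not the procedure used by Willows: genotypes are coded 0/1/2 as the number of B alleles, missing values are marked with None, the imputation draws genotypes under Hardy-Weinberg proportions, and the "missings together" step is reduced to recoding missing values with one shared code. All names are hypothetical.

```python
import random

MISSING = None  # hypothetical marker for a missing genotype


def impute_by_allele_frequency(genotypes, rng):
    """Replace missing genotypes by draws based on the observed B-allele frequency.

    Assumption: genotype probabilities follow Hardy-Weinberg proportions.
    """
    observed = [g for g in genotypes if g is not MISSING]
    if not observed:
        raise ValueError("cannot impute a SNP with no observed genotypes")
    p = sum(observed) / (2 * len(observed))  # frequency of allele B
    weights = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    return [
        g if g is not MISSING else rng.choices([0, 1, 2], weights)[0]
        for g in genotypes
    ]


def missings_together(genotypes, missing_code=3):
    """Give all missing genotypes one shared code so that they stay in a single group."""
    return [missing_code if g is MISSING else g for g in genotypes]


snp = [0, 1, MISSING, 2, 2, MISSING, 1, 0]
imputed = impute_by_allele_frequency(snp, random.Random(1))
grouped = missings_together(snp)
print(imputed)
print(grouped)
```

Imputation from a haplotype block would require the neighbouring SNPs and is not shown here.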

Compression Algorithms

In genetic studies, a SNP-based genotype has only four possible choices: AA, AB, BB or missing. Each choice can be represented by 2 bits. Thus, 16 genotypes can be packed into one integer data type (4 bytes) in Java or C++ using bit shift operators. The theoretical compression ratio is 4:1 compared to the byte storage scheme and 32:1 compared to the double storage scheme.
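
The packing scheme can be sketched as follows. The example is written in Python for brevity, whereas the text refers to Java and C++ integers; the particular 2-bit codes assigned to each genotype and the function names are assumptions, since the text does not specify them.

```python
# Assumed 2-bit codes; None stands for a missing genotype.
CODES = {"AA": 0b00, "AB": 0b01, "BB": 0b10, None: 0b11}
DECODE = {code: genotype for genotype, code in CODES.items()}
PER_WORD = 16  # 16 genotypes x 2 bits = one 32-bit integer


def pack(genotypes):
    """Pack a sequence of genotypes into a list of 32-bit words using bit shifts."""
    words = [0] * ((len(genotypes) + PER_WORD - 1) // PER_WORD)
    for i, g in enumerate(genotypes):
        words[i // PER_WORD] |= CODES[g] << (2 * (i % PER_WORD))
    return words


def unpack_one(words, i):
    """Recover genotype i in constant time: one shift and one mask."""
    code = (words[i // PER_WORD] >> (2 * (i % PER_WORD))) & 0b11
    return DECODE[code]


genotypes = ["AA", "AB", "BB", None] * 10  # 40 genotypes
words = pack(genotypes)
restored = [unpack_one(words, i) for i in range(len(genotypes))]
print(len(words), restored == genotypes)
```

Forty genotypes occupy three 32-bit words (12 bytes) instead of 40 bytes in the byte scheme; the ratio reaches exactly 4:1 when the number of genotypes is a multiple of 16.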

Implementation

Willows, implemented in C++ and Java, comes with a user-friendly graphical user interface (GUI) on Windows, Linux and Mac OS X. It can also be executed from the command line on the same platforms.

Results and discussion

The performance of Willows was analyzed on a computer equipped with a 2.33 GHz processor and 2 GB of physical memory, running Microsoft Windows XP Professional.

Simulated data

The compression and decompression operations for a specific genotype take constant time using bit operators. In fact, the time required for these operations is negligible compared to the overall running time. For example, we randomly generated two simulated data sets, which had 10,000 SNPs and 100,000 SNPs, respectively. Both data sets contained 1,000 subjects. For each data set, we built classification trees, a random forest of 100 trees, and a deterministic forest of 8 trees on the computer described above. The number of SNPs considered for the split at each node in the random forest was set to int(log2 M) + 1, where M is the number of SNPs. The running times with and without the compression operations are given in Table 1; they differ very little.

Table 1

Run time (in seconds) of the operations.

Operations	SNPs	Classification tree	Random forest	Deterministic forest
Compressed	10,000	70	7	480
Uncompressed	10,000	68	6	470
Compressed	100,000	257	96	673
Uncompressed	100,000	249	94	640
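
For reference, the number of SNPs sampled at each node under the rule int(log2 M) + 1 can be computed directly. The helper name `mtry` below is a common label for this quantity and is not taken from Willows.

```python
import math


def mtry(num_snps: int) -> int:
    """Number of candidate SNPs per node under the rule int(log2 M) + 1."""
    return int(math.log2(num_snps)) + 1


# M for the two simulated data sets and for the CGEMS data set.
for m in (10_000, 100_000, 550_000):
    print(m, mtry(m))
```

The rule gives 14, 17 and 20 candidate SNPs per node for 10,000, 100,000 and 550,000 SNPs, respectively.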

CGEMS

For a typical GWA study, e.g., the Cancer Genetic Markers of Susceptibility (CGEMS) breast cancer project [14], which contains 2,434 subjects and 550,000 SNPs, the genotype data occupy 10 GB in the double type and 1.2 GB in the byte type. As we did for the simulated data sets, we built classification trees, a random forest of 1000 trees, and a deterministic forest of 8 trees. The number of SNPs considered for the split at each node in the random forest was set to int(log2 550000) + 1, that is, 20. Table 2 displays the time Willows took to analyze the CGEMS data, and it demonstrates that, with the efficient use of memory, we can indeed construct classification trees and forests from typical GWA data.

Table 2

Computation time (in seconds) for analyzing the CGEMS data set

Memory	Loading data	Classification trees	Random forest	Deterministic forest
0.32 GB	1698	2562	485	12170

Input files

Willows reads input files in a text format in which the first line indicates the type of each variable (response, nominal or ordinal); the variables may appear in any order. Among its features is a prediction function that predicts the response class from the predictors; this feature requires additional input files. We refer the reader to the supplementary information on our website.

Output results

The main output of Willows is the tree structure. An example is provided in Figure 1. In this figure, internal and terminal nodes are represented by ellipses and rectangles, respectively. The frequency counts of the outcome are displayed inside each node, and the splitting variable and the corresponding threshold are given for each internal node.

Figure 1

Tree structure.

Depending on the user's needs, other outputs, including the importance score of each variable and the predicted classes of a test sample, can be viewed. For example, Figure 2 and Figure 3 show the importance scores and prediction results for the two simulated data sets. Furthermore, all of the results are saved in local files for future viewing. Detailed instructions are provided on our website.

Figure 2

Importance score results in the random forest.

Figure 3

Prediction results in a test sample.

Conclusion

GWA studies have produced landmark successes in identifying genetic variants for complex diseases. Because of the large size of the data generated by GWA studies, data management and analysis have been a major hurdle. One of the immediate challenges is memory management for GWA databases, especially on the prevailing 32-bit operating systems. Parallel supercomputers are useful for accelerating computation when the computational tasks are "parallel," but this may not be the case, or may be challenging to implement, in GWA studies. Furthermore, parallel supercomputers are not easily accessible, and even when they are available, data confidentiality and security restrictions may not allow genomic data, such as those released by dbGaP, to be transferred to a networked supercomputer. Thus, it is ideal to have more accessible and efficient computing software. In fact, some of the dbGaP data sets have been distributed in a compressed binary format that was designed in PLINK and is incompatible with other statistical software, including tree and forest software. To this end, Willows implements three classifiers in a user-friendly interface with the goal of maximizing the use of memory, which is necessary for the analysis of GWA SNP data.

Availability and requirements

• Project name: Willows
• Project home page: .
• Operating system(s): Multiple platforms (tested on Windows, Linux and Mac OS X).
• Programming language: C++ and Java.
• Other requirements: Java 1.6+.
• License: Free for non-commercial use.

Authors' contributions

All authors jointly developed the methods and wrote the article. They read and approved the final manuscript.

16 in total

1. Complement factor H polymorphism in age-related macular degeneration.

Authors: Robert J Klein; Caroline Zeiss; Emily Y Chew; Jen-Yue Tsai; Richard S Sackler; Chad Haynes; Alice K Henning; John Paul SanGiovanni; Shrikant M Mane; Susan T Mayne; Michael B Bracken; Frederick L Ferris; Jurg Ott; Colin Barnstable; Josephine Hoh
Journal: Science Date: 2005-03-10 Impact factor: 47.728

2. Identifying SNPs predictive of phenotype using random forests.

Authors: Alexandre Bureau; Josée Dupuis; Kathleen Falls; Kathryn L Lunetta; Brooke Hayward; Tim P Keith; Paul Van Eerdewegh
Journal: Genet Epidemiol Date: 2005-02 Impact factor: 2.135

3. A forest-based approach to identifying gene and gene gene interactions.

Authors: Xiang Chen; Ching-Ti Liu; Meizhuo Zhang; Heping Zhang
Journal: Proc Natl Acad Sci U S A Date: 2007-11-28 Impact factor: 11.205

4. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer.

Authors: David J Hunter; Peter Kraft; Kevin B Jacobs; David G Cox; Meredith Yeager; Susan E Hankinson; Sholom Wacholder; Zhaoming Wang; Robert Welch; Amy Hutchinson; Junwen Wang; Kai Yu; Nilanjan Chatterjee; Nick Orr; Walter C Willett; Graham A Colditz; Regina G Ziegler; Christine D Berg; Saundra S Buys; Catherine A McCarty; Heather Spencer Feigelson; Eugenia E Calle; Michael J Thun; Richard B Hayes; Margaret Tucker; Daniela S Gerhard; Joseph F Fraumeni; Robert N Hoover; Gilles Thomas; Stephen J Chanock
Journal: Nat Genet Date: 2007-05-27 Impact factor: 38.330

5. A common variant on chromosome 9p21 affects the risk of myocardial infarction.

Authors: Anna Helgadottir; Gudmar Thorleifsson; Andrei Manolescu; Solveig Gretarsdottir; Thorarinn Blondal; Aslaug Jonasdottir; Adalbjorg Jonasdottir; Asgeir Sigurdsson; Adam Baker; Arnar Palsson; Gisli Masson; Daniel F Gudbjartsson; Kristinn P Magnusson; Karl Andersen; Allan I Levey; Valgerdur M Backman; Sigurborg Matthiasdottir; Thorbjorg Jonsdottir; Stefan Palsson; Helga Einarsdottir; Steinunn Gunnarsdottir; Arnaldur Gylfason; Viola Vaccarino; W Craig Hooper; Muredach P Reilly; Christopher B Granger; Harland Austin; Daniel J Rader; Svati H Shah; Arshed A Quyyumi; Jeffrey R Gulcher; Gudmundur Thorgeirsson; Unnur Thorsteinsdottir; Augustine Kong; Kari Stefansson
Journal: Science Date: 2007-05-03 Impact factor: 47.728

6. A tree-based method for modeling a multivariate ordinal response.

Authors: Heping Zhang; Yuanqing Ye
Journal: Stat Interface Date: 2008 Impact factor: 0.582

7. Cell and tumor classification using gene expression data: construction of forests.

Authors: Heping Zhang; Chang-Yung Yu; Burton Singer
Journal: Proc Natl Acad Sci U S A Date: 2003-03-17 Impact factor: 11.205

8. A common allele on chromosome 9 associated with coronary heart disease.

Authors: Ruth McPherson; Alexander Pertsemlidis; Nihan Kavaslar; Alexandre Stewart; Robert Roberts; David R Cox; David A Hinds; Len A Pennacchio; Anne Tybjaerg-Hansen; Aaron R Folsom; Eric Boerwinkle; Helen H Hobbs; Jonathan C Cohen
Journal: Science Date: 2007-05-03 Impact factor: 47.728

9. Tree-based risk factor analysis of preterm delivery and small-for-gestational-age birth.

Authors: H Zhang; M B Bracken
Journal: Am J Epidemiol Date: 1995-01-01 Impact factor: 4.897

10. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls.

Authors:
Journal: Nature Date: 2007-06-07 Impact factor: 49.962
