
Design of Probabilistic Random Forests with Applications to Anticancer Drug Sensitivity Prediction.

Raziur Rahman, Saad Haider, Souparno Ghosh, Ranadip Pal

Abstract

Random forests consisting of an ensemble of regression trees with equal weights are frequently used for the design of predictive models. In this article, we consider an extension of the methodology that represents the regression trees as probabilistic trees and analyzes the nature of heteroscedasticity. The probabilistic tree representation allows for analytical computation of confidence intervals (CIs), and the tree weight optimization is expected to provide stricter CIs with comparable performance in mean error. We approach the prediction of the ensemble of probabilistic trees from two perspectives: as a mixture distribution and as a weighted sum of correlated random variables. We applied our methodology to the drug sensitivity prediction problem on synthetic datasets and the cancer cell line encyclopedia dataset and illustrated that tree weights can be selected to reduce the average length of the CI without an increase in mean error.


Keywords:  drug sensitivity prediction; heteroscedasticity; probabilistic random forests; variance analysis of random forests

Year:  2016        PMID: 27081304      PMCID: PMC4820080          DOI: 10.4137/CIN.S30794

Source DB:  PubMed          Journal:  Cancer Inform        ISSN: 1176-9351


Introduction

The prediction of an output response Y based on supervised training of a predictor X has been approached using numerous methodologies, such as elastic net,1 support vector regression, and random forests (RFs),2,3 where the majority of the techniques provide point prediction estimates of the output. In this article, we consider the generation of prediction confidence intervals (CIs) for RFs, a commonly used prediction model in diverse scenarios.3,4 The generation of an input-dependent prediction probability distribution provides an estimate of the heteroscedasticity, ie, the change in error variance across predictor samples. RF regression2 consists of an ensemble of regression trees where the prediction output of the forest is the average prediction of the individual regression trees. We utilize the concept of probabilistic regression trees5,6 to convert the point estimates of individual trees into probability distributions, and further consider the optimization of the weights of the ensemble of probabilistic regression trees to provide stricter CIs.

The ensemble of probabilistic regression trees is considered from two different perspectives. First, we consider the ensemble as a mixture distribution for each prediction sample X. Consider an ensemble of T trees where tree j produces the predicted output probability density function Pj(Y|X). The probability density function P(Y|X) of the ensemble of the T regression trees with weights α1,…,αT is then given by P(Y|X) = Σ_{j=1}^T αj Pj(Y|X). This approach considers that, based on the weights αj, a tree k will be selected and the prediction will be decided based on Pk(Y|X). For the second perspective, we consider the output of the ensemble to be the random variable Z = Σ_{j=1}^T αj Zj, a weighted sum of the T random variables Z1,…,ZT, with Zj denoting a random variable with probability density function Pj(Z|X) based on tree j. This scenario is equivalent to analyzing a weighted sum of random variables with different probability density functions, ie, we model the weighted sum of the realizations of the random variables rather than the weighted sum of their distributions as was considered in the first case. Note that the use of equal weights (ie, αj = 1/T for j = 1,…,T) for the regression trees is expected to work well in terms of reducing the variance of prediction when the generated trees are uncorrelated. However, some of the generated trees can often be correlated with each other, and in such a scenario, we can potentially optimize the weights of the trees to reduce the variance of the ensemble prediction. Based on this idea, we analyze the variance of the prediction under the weighted sum of random variables scenario for different forms of tree covariance matrices. For the mixture distribution scenario, we use maximum-likelihood estimation (MLE) to generate the weights of the regression trees and analyze the effect of the estimated weights on the mean and variance of the error distributions. We applied our methodology over synthetic and experimental cancer cell line encyclopedia (CCLE) datasets and illustrated a reduction in variance with comparable mean error following the application of MLE-optimized tree weights.
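For reference, the two perspectives can be summarized compactly in the notation introduced above (Pj denotes the predictive density of tree j):

```latex
% Perspective 1: mixture distribution over the T trees
P(Y \mid X) = \sum_{j=1}^{T} \alpha_j P_j(Y \mid X),
  \qquad \alpha_j \ge 0, \quad \sum_{j=1}^{T} \alpha_j = 1

% Perspective 2: weighted sum of per-tree realizations
Z = \sum_{j=1}^{T} \alpha_j Z_j, \qquad Z_j \sim P_j(Z \mid X)
```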

Background

A probabilistic theory for classification that can provide bounds on the probability of misclassification has been available for some time.7 For instance, the binary minimax probability machine classification algorithm8 computes a bound on the probability of misclassification using only estimates of the covariance matrix and mean for each class, as obtained from the training data. Probability estimation trees (PETs)9 are introduced in Ref. 10 as classification trees11 with a class probability distribution at each leaf instead of a single-class label. Similar to classification trees, PETs can be used for classifying examples, which is done by assigning the most probable class according to the PET. They can also be used for ranking examples, which is done by ordering the examples according to their likelihood of belonging to some particular class as estimated by the PET. Probabilistic RF for classification has been introduced in Ref. 12 with the perspective of providing an estimate of the probability of misclassification for each data point, without detailed probability distribution assumptions or resorting to density modeling. Probabilistic RF for classification is based on two existing algorithms: minimax probability machine classification8 and RFs.2 It has been noted in Ref. 13 that classification trees are not equally successful in labeling all instances. This simple observation led to the idea that the use of selected trees in classification can potentially increase accuracy. The selection of trees based on their performance on similar instances had limited success; further refinement of this idea led to the concept of weighted voting. Mishina et al.14 proposed a boosted RF model in which a boosting algorithm is integrated with a conventional RF approach. The boosted RF maintains high classification performance, even with fewer decision trees, by constructing complementary classifiers through sequential training with boosting. For our purpose of regression using ensemble approaches, there have been limited studies on the probabilistic behavior of ensembles of regression trees. Theoretical analysis of RF models has usually focused on the consistency and rate of convergence of the design procedure.15 Probabilistic decision and regression trees have been considered in Ref. 5, but the ensemble of probabilistic regression trees in the context of altering the variance of the prediction error has not been explored. A weighted random forest (wRF) approach for regression has been proposed,16 where the weight of each tree is calculated based on the prediction accuracy on the out-of-bag samples for that tree. wRF considers the empirical out-of-bag errors for estimating the regression tree weights, whereas this article considers an analytical approach in which parametric distributions are estimated to specify a probabilistic representation of each regression tree and the sample-dependent probability distributions are utilized to generate the tree weights.

Methods

RF regression

RF regression refers to an ensemble of regression trees,2 where a set of T unpruned regression trees is generated based on bootstrap sampling from the original training data. For each node, the optimal node-splitting feature is selected from a set of m features that are picked randomly from the total M features. For m < M, the selection of the node-splitting feature from a random set of features decreases the correlation between different trees, and thus, the average response of multiple regression trees is expected to have lower variance than individual regression trees. Larger m can improve the predictive capability of individual trees but can also increase the correlation between trees, negating the gains from averaging multiple predictions. The bootstrap resampling of the data for training each tree also increases the variation between the trees.
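As a concrete illustration, the following minimal sketch builds such an ensemble with scikit-learn (our choice for illustration; the authors' implementation is in Matlab), where n_estimators corresponds to T and max_features to m:

```python
# Minimal sketch of baseline RF regression: T bootstrap-trained trees, each
# split chosen among m randomly selected features, predictions averaged.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 10))              # synthetic predictors in [0, 1]
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

rf = RandomForestRegressor(n_estimators=50,  # T trees
                           max_features=5,   # m of M features tried per split
                           bootstrap=True,
                           random_state=0).fit(X, y)
y_hat = rf.predict(X)                        # forest = average of tree outputs
```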

Process of splitting a node

Let xtr(i, j) and y(i) (i = 1,…,n; j = 1,…,M) denote the training predictor features and output response samples, respectively. At any node η, we aim to select a feature j from a random set of m features and a threshold z to partition the node into two child nodes ηl (left node with samples satisfying xtr(i ∈ η, j) ≤ z) and ηr (right node with samples satisfying xtr(i ∈ η, j) > z). We consider the node cost as the sum of square differences D(η) = Σ_{i∈η} (y(i) − µ(η))², where µ(η) is the expected value of y(i) in node η. Thus, the reduction in cost for partition γ at node η is C(γ, η) = D(η) − D(ηl) − D(ηr). The partition γ* that maximizes C(γ, η) over all possible partitions is selected for node η.
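The split search over one candidate feature can be sketched as follows; the threshold z maximizing the cost reduction C(γ, η) is returned (illustrative code, not the authors' implementation):

```python
# Exhaustive threshold search for one feature: node cost D(eta) is the sum of
# squared deviations from the node mean; the split maximizes
# C(gamma, eta) = D(eta) - D(eta_l) - D(eta_r).
import numpy as np

def node_cost(y):
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def best_split(x_col, y):
    """Return (threshold, cost reduction) maximizing C over one feature."""
    parent = node_cost(y)
    best_z, best_gain = None, -np.inf
    for z in np.unique(x_col)[:-1]:          # candidate thresholds
        left, right = y[x_col <= z], y[x_col > z]
        gain = parent - node_cost(left) - node_cost(right)
        if gain > best_gain:
            best_z, best_gain = float(z), gain
    return best_z, best_gain
```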

Forest prediction

Using the randomized feature selection process, we fit each tree based on a bootstrap sample {(X1, Y1),…,(Xn, Yn)} generated from the training data. Let ŷi(x) denote the regression tree prediction for input x corresponding to tree i. The prediction of the RF consisting of T trees is then given by ŷ(x) = (1/T) Σ_{i=1}^T ŷi(x).

Weighted RF

For comparison purposes, we also consider the wRF methodology proposed in Ref. 16, which uses empirical values to calculate the weights of the trees. The prediction error of tree i, denoted by tPEi, is calculated based on the out-of-bag samples for that tree, and the weight of that tree is estimated as αi = xi / Σ_{j=1}^T xj, where xi = 1 − tPEi.
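A minimal sketch of this weighting, assuming the per-tree out-of-bag prediction errors tPEi have already been computed:

```python
# wRF-style tree weights from out-of-bag errors (sketch): accuracy
# x_i = 1 - tPE_i, normalized so the weights sum to 1.
import numpy as np

def wrf_weights(tpe):
    x = 1.0 - np.asarray(tpe, dtype=float)
    return x / x.sum()

print(wrf_weights([0.10, 0.25, 0.15]))  # better trees receive larger weight
```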

Probabilistic regression trees

Let us consider the generation of regression trees from a probabilistic perspective, which will allow us to utilize well-known concepts of parameter estimation for statistical models. Estimation of regression trees using probability models has been explored in Refs. 5 and 6. For a regression tree, our goal is to generate a conditional density of the form P(y|x, ϕ), where y and x refer to the output and input responses, respectively, and ϕ denotes the collection of parameters for the tree. The tree splits can be modeled by probabilistic decisions that are conditional on the input x and previous node decisions. As an example, consider the two-level tree shown in Figure 1.
Figure 1

Example of probabilistic decision tree.

The first decision is based on the probability P(ω1|x, η), where ω1 is the event signifying partition toward the left of the root node and η denotes a parameter vector η = [η1 τ1]. If we consider a soft split of the form P(ω1|x, η) = 1/(1 + exp(−η1(τ1 − x))), then for large η1 the split will be close to a sharp linear decision boundary similar to a regression tree. Similarly, we will have P(ω2|x, η) = 1 − P(ω1|x, η) for the right branch. If we consider all the branches of the tree as shown in Figure 1, the corresponding distribution of y conditional on x and the tree parameters ϕ = {η, λ1, λ2, ϑ11, ϑ12, ϑ21, ϑ22} will be given by Eq. 5, a sum over the root-to-leaf paths of the product of the branch probabilities along each path and the leaf distribution P(y|ϑij). For a larger number of branches in the tree, the above technique can be extended to obtain P(y|x, ϕ) for a tree with parameter set ϕ. In this article, we consider that the tree parameters ϕ are generated based on the standard RF node generation criteria given in Eq. 2. The probability distribution at any leaf node is approximated by a Gaussian distribution with mean and variance equal to the mean and variance of the samples at the leaf node. Some examples of empirical distributions fitted to normal approximations are shown in Figure 1. Consequently, an ensemble of T trees generated by RF regression can be represented by the T tree parameter sets ϕ1, ϕ2,…,ϕT, with tree i producing the conditional distribution P(y|x, ϕi) for i = 1,…,T.
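The following sketch evaluates P(y|x, ϕ) for the two-level tree of Figure 1 under the assumptions above; the logistic form of the gate and the parameter packing are our illustrative choices:

```python
# Depth-2 probabilistic tree (sketch): sigmoid gates route x left/right
# softly, each leaf carries a Gaussian fitted to its training samples.
import numpy as np
from scipy.stats import norm

def gate(x, sharpness, threshold):
    """P(go left | x): approaches a hard split x <= threshold as sharpness grows."""
    return 1.0 / (1.0 + np.exp(-sharpness * (threshold - x)))

def tree_pdf(y, x, eta, lams, leaves):
    """P(y | x) for a depth-2 tree; leaves[i][j] = (mean, std) of leaf ij."""
    p_left = gate(x, *eta)
    dens = 0.0
    for i, p_top in enumerate([p_left, 1.0 - p_left]):
        q_left = gate(x, *lams[i])
        for j, p_bot in enumerate([q_left, 1.0 - q_left]):
            mu, sd = leaves[i][j]
            dens += p_top * p_bot * norm.pdf(y, mu, sd)
    return dens

leaves = [[(0.2, 0.05), (0.4, 0.05)], [(0.6, 0.05), (0.8, 0.05)]]
print(tree_pdf(0.35, 0.5, eta=(50.0, 0.5),
               lams=[(50.0, 0.25), (50.0, 0.75)], leaves=leaves))
```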

Probabilistic RFs

Mixture distribution

As discussed earlier, we consider the prediction ensemble as a mixture distribution for each sample X. Consider an ensemble of T trees where tree j produces the predicted output probability density function Pj(Y|X). The predicted distribution for each tree is based on the estimated probabilistic regression tree model described in the previous section. The probability density function (pdf) P(Y|X) of the forest of T regression trees with weights α1,…,αT, with αj ≥ 0 and Σ_{j=1}^T αj = 1, is given by P(Y|X) = Σ_{j=1}^T αj Pj(Y|X). This approach considers that, based on the weights αj, a tree k will be selected and the prediction will be decided based on Pk(Y|X). The mean µ of the mixture distribution equals the weighted sum of the tree distribution means µj, as shown in Eq. 6: µ = Σ_{j=1}^T αj µj. The variance σ² of the mixture distribution is given by Eq. 7: σ² = Σ_{j=1}^T αj (σj² + µj²) − µ².
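A short sketch of Eqs. 6 and 7, computing the mixture mean and variance from the per-tree leaf moments and weights:

```python
# Mixture moments: mu = sum a_j mu_j, sigma^2 = sum a_j (s_j^2 + mu_j^2) - mu^2.
import numpy as np

def mixture_moments(alpha, mu, var):
    alpha, mu, var = map(np.asarray, (alpha, mu, var))
    m = float(np.sum(alpha * mu))
    v = float(np.sum(alpha * (var + mu ** 2)) - m ** 2)
    return m, v

print(mixture_moments([0.5, 0.5], [0.0, 2.0], [1.0, 1.0]))  # -> (1.0, 2.0)
```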

Weighted sum of random variables

The mixture distribution approach selects a tree based on the tree weights and then selects a sample output according to the pdf of that tree. Another possibility is to consider the weighted sum of realizations from each tree. As discussed earlier, this is equivalent to considering the output of the forest to be a random variable Z = Σ_{j=1}^T αj Zj, a weighted sum of the T random variables Z1,…,ZT, where Zj denotes a random variable with pdf Pj(Z|X) based on tree j. The distribution of a sum of independent random variables can be computed as the convolution of the individual distributions. This scenario is equivalent to analyzing a weighted sum of random variables with different probability density functions, ie, we model the weighted sum of the realizations of the random variables rather than the weighted sum of their distributions as was considered in the first case.

Example for weighted sum of two uncorrelated Gaussian distributions

Consider two independent random variables X1 and X2 that are normally distributed with pdfs N(µ1, σ1²) and N(µ2, σ2²), respectively. The distributions of the scaled variables X̃1 = α1X1 and X̃2 = α2X2 are given by Eq. 8: X̃1 ~ N(α1µ1, α1²σ1²) and X̃2 ~ N(α2µ2, α2²σ2²). Based on the idea of derived distributions, the pdf of the random variable Z = α1X1 + α2X2 = X̃1 + X̃2 is given by the convolution of the pdfs of X̃1 and X̃2 (Eq. 9): fZ(z) = ∫ fX̃1(x) fX̃2(z − x) dx. Substituting Eq. 8 in Eq. 9, we arrive at Z ~ N(α1µ1 + α2µ2, α1²σ1² + α2²σ2²) (Eq. 11), the sum of two independent Gaussian random variables. For T independent Gaussian random variables X1,…,XT with pdfs N(µ1, σ1²),…,N(µT, σT²) representing the distributions at the T leaf nodes of the forest, the distribution of the random variable Z representing their weighted sum with weights α1,…,αT is given by Eq. 12 (derived based on multiple convolutions): Z ~ N(Σ_{i=1}^T αiµi, Σ_{i=1}^T αi²σi²). Thus, Z has a normal distribution with µZ = Σ_{i=1}^T αiµi and σZ² = Σ_{i=1}^T αi²σi². However, if the random variables are correlated, ie, the covariances between different tree outputs are nonzero, the mean and variance of Z are given by Eqs. 13 and 14: µZ = Σ_{i=1}^T αiµi and σZ² = Σ_{i=1}^T Σ_{j=1}^T αiαj Cov(Xi, Xj). If C = [α1,…,αT]′ represents the weight vector and Σ represents the T × T covariance matrix, the variance of Z can be represented concisely as σZ² = C′ΣC (Eq. 15), where C′ denotes the transpose of C. Note that the mean of Z is the weighted sum of the means of the individual trees in the forest, and the prediction is the same as that of a regular RF when the tree weights are equal. The mean of Z remains the same irrespective of whether the trees are correlated, whereas the variance of Z is directly related to the covariance of the trees through Eq. 15. In the following sections, we estimate the covariance among the trees in a forest and analyze the effect of changes in C on the variance of Z.
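A quick Monte Carlo check of Eq. 15 with illustrative numbers; the empirical variance of the weighted sum should match C′ΣC:

```python
# Verify var(Z) = C' Sigma C for correlated Gaussian tree outputs.
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, 1.2, 0.9])
Sigma = np.array([[1.0, 0.8, 0.1],
                  [0.8, 1.0, 0.1],
                  [0.1, 0.1, 1.0]])
C = np.array([0.25, 0.25, 0.50])

samples = rng.multivariate_normal(mu, Sigma, size=200_000) @ C
print(samples.var(), C @ Sigma @ C)   # the two values should nearly match
```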

Empirical measure of correlation between probabilistic trees

The covariance between the trees is estimated using empirical approaches to arrive at the covariance matrix Σ. The (i, j) element of Σ denotes the covariance between the predictions of the ith and jth trees, represented by random variables Yi and Yj, respectively, ie, Σ(i, j) = E[(Yi − E(Yi))(Yj − E(Yj))]. For each input sample X, tree j produces a pdf Pj(Y|X), which is used to select an output prediction realization yj. We perform this for all the other trees to arrive at a joint realization of the trees for sample X. This is repeated for N input training samples to produce N joint realizations of the random variables Y1,…,YT, which are used to calculate the sample covariance matrix shown in Eq. 34.
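A sketch of this empirical procedure, assuming Gaussian leaf distributions with the per-sample, per-tree moments already extracted:

```python
# Empirical tree covariance (sketch): draw one realization from every tree's
# predictive pdf per training sample, then take the sample covariance.
import numpy as np

def tree_covariance(means, stds, rng=None):
    """means, stds: (N samples x T trees) leaf moments per sample and tree."""
    if rng is None:
        rng = np.random.default_rng(0)
    draws = rng.normal(means, stds)        # one joint realization per sample
    return np.cov(draws, rowvar=False)     # T x T covariance matrix

N, T = 300, 5
rng = np.random.default_rng(2)
Sigma = tree_covariance(rng.uniform(0, 1, (N, T)), np.full((N, T), 0.1))
```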

Effect of tree weight on variance

In this section, we generate lower and upper bounds on f(C) = C′ΣC, where C = [α1,…,αT]′ represents the tree weight vector. Assuming Σ is a Hermitian positive definite matrix (note that a covariance matrix is always positive semidefinite17), we can generate the Cholesky decomposition18 Σ = LL′, where L is a lower triangular matrix with real and positive diagonal entries. Let the variance of the prediction of a specific forest be given by the function f(C). We have f(C) = C′ΣC = C′LL′C = A′A = ||A||₂², where A = L′C. Let us analyze the minimum and maximum values of f(C). Since Σ is positive semidefinite with eigenvalues λmin ≤ … ≤ λmax, the quadratic form is bounded as λmin||C||₂² ≤ f(C) ≤ λmax||C||₂² (Eq. 22). The minimum of f(C) under the constraint C′e = 1, where e = [1, 1,…,1]′, is given by f(Cmin) = 1/(e′Σ⁻¹e). Note that this does not preclude solutions with entries of C being less than zero. The details of the derivation using Lagrange multipliers are included in the Appendix. The weight vector achieving the minimum is Cmin = Σ⁻¹e/(e′Σ⁻¹e). The computational complexity of estimating the weight vector is governed by the complexity of matrix inversion, eg, O(T^2.376) using the Coppersmith–Winograd algorithm.
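A sketch of the closed-form minimum-variance weights; for a diagonal Σ this reduces to the inverse-variance weighting derived below:

```python
# Minimum-variance weights under C'e = 1 (Lagrange solution quoted above):
# C_min = Sigma^{-1} e / (e' Sigma^{-1} e), f(C_min) = 1 / (e' Sigma^{-1} e).
# Entries of C_min may be negative; the text notes this is not precluded.
import numpy as np

def min_variance_weights(Sigma):
    e = np.ones(Sigma.shape[0])
    w = np.linalg.solve(Sigma, e)          # Sigma^{-1} e without explicit inverse
    return w / w.sum()

Sigma = np.diag([1.0, 2.0, 4.0])           # unequal tree variances
C = min_variance_weights(Sigma)
print(C, C @ Sigma @ C)                    # inverse-variance weights, min f(C)
```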

Diagonal elements of covariance matrix equal

In the conventional RF model, it is assumed that the trees are uncorrelated. Thus, the nondiagonal elements of the covariance matrix (the covariances between pairs of distinct trees) are infinitesimal compared with the diagonal elements (the variances of the individual trees). If we ignore the small nondiagonal values and replace them with zeroes, the covariance matrix Σ is a diagonal matrix. If the variances of the trees are equal (Var[X1] = Var[X2] = … = Var[XT] = σ²), then the covariance matrix (Eq. 34) is the diagonal matrix σ²I. Since the covariance matrix is diagonal with each diagonal entry equal to σ², all T eigenvalues are also equal to σ². From Eq. 22, f(C) = σ²||C||₂² (Eq. 23). Since ||C||₁ = 1 and each entry of C is nonnegative, ||C||₂² ≥ ||C||₁²/T = 1/T by the Cauchy–Schwarz inequality (Eq. 24). Thus, from Eq. 23, f(C) ≥ σ²/T (Eq. 25). When C′ = [1/T, 1/T,…,1/T], as in a conventional RF scenario, the variance is given by f(C) = σ² · T · (1/T²) = σ²/T (Eq. 26). Comparing Eqs. 25 and 26, we observe that C′ = [1/T, 1/T,…,1/T] achieves the minimum variance for uncorrelated trees with equal variance.

Diagonal elements of covariance matrix unequal

Consider the case where the covariance matrix is a diagonal matrix and the variances of the trees are not equal, as shown in Eq. 27: Σ = diag(σ1², σ2²,…,σT²), where σi² is the variance of the ith tree. Without loss of generality, assume σ1² ≤ σ2² ≤ … ≤ σT². Based on Eq. 22, we have f(C) ≥ σ1²||C||₂². Since ||C||₁ = 1 and each entry of C ≥ 0, f(C) ≥ σ1²/T. When the weights of the trees are equal (ie, C′ = [1/T, 1/T,…,1/T]), we have f(C) = (1/T²) Σ_{i=1}^T σi². Thus, there is always a possibility that for some C, f(C) < (1/T²) Σ_{i=1}^T σi². We can show that the minimum in such a scenario is f(Cmin) = (Σ_{i=1}^T 1/σi²)⁻¹, where αi = (1/σi²)/(Σ_{j=1}^T 1/σj²), ie, inverse-variance weighting. The derivation is included in the Appendix.

Forest with correlated trees

If we consider scenarios where trees are correlated (ie, the covariance matrix is not diagonal), placing higher weights on the relatively uncorrelated trees will result in lower variance. We illustrate this idea intuitively for a forest consisting of three trees. Consider the covariance matrix for a three-tree forest with diagonal entries a1 ≅ a2 ≅ a3 ≅ a and off-diagonal entries b1 ≅ b2 ≅ b (trees 1 and 2), c1 ≅ c2 ≅ c (trees 1 and 3), and d1 ≅ d2 ≅ d (trees 2 and 3). Consider that the first two trees have high correlation between themselves (b ≅ a), while the third tree has little correlation with the other two, ie, c1 ≅ c2 ≅ c ≅ 0 and d1 ≅ d2 ≅ d ≅ 0. The minimum variance is then achieved for C ≅ [1/4, 1/4, 1/2]′. The details of the derivation are included in the Appendix. Based on numerical weights, we next illustrate the effect of placing a higher weight on the third tree on the variance. Consider C = [1/4, 1/4, 1/2]′ (note that the third tree, which is uncorrelated with the other trees, has the higher weight); then the variance of the forest is given by f(C) = C′ΣC = (3a + b)/8. For a regular RF scenario with equal weights C = [1/3, 1/3, 1/3]′, the variance of the forest is given by f(C) = (3a + 2b)/9. We note that (3a + b)/8 < (3a + 2b)/9 when b > 3a/7. Weights achieving minimum variance for a few more scenarios are derived in the Appendix.
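A numeric check of this three-tree illustration with the assumed values a = 1 and b = 0.9:

```python
# Three-tree example: trees 1 and 2 strongly correlated (b close to a),
# tree 3 nearly uncorrelated (c = d = 0). Extra weight on tree 3 lowers f(C).
import numpy as np

a, b = 1.0, 0.9
Sigma = np.array([[a, b, 0.0],
                  [b, a, 0.0],
                  [0.0, 0.0, a]])
C_uneq = np.array([0.25, 0.25, 0.50])   # higher weight on the third tree
C_eq = np.full(3, 1 / 3)                # conventional RF weights

print(C_uneq @ Sigma @ C_uneq)  # (3a + b)/8  = 0.4875
print(C_eq @ Sigma @ C_eq)      # (3a + 2b)/9 ~ 0.5333, larger when b > 3a/7
```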

Regression Forest Weight Optimization

In this section, we discuss two approaches to select the weights for the ensemble of trees based on MLE and incorporation of tree correlations.

MLE for mixture model

Consider N independent and identically distributed samples (xi, yi) for i = 1,…,N used for the generation of the T trees. Let α1,…,αT denote the weights of the trees; the (conditional) likelihood is then given by L(α) = Π_{i=1}^N Σ_{j=1}^T αj P(yi|xi, ϕj). To ensure that the mixture represents a valid probability density function, the weights have to satisfy the constraints αj ≥ 0 for j = 1,…,T and Σ_{j=1}^T αj = 1. If we denote P(yi|xi, ϕj) by ξij for i = 1,…,N samples and j = 1,…,T trees, a compact representation of the likelihoods of the samples as an N-length vector f = [f1,…,fN]′ is f = Ξα, where Ξ is the N × T matrix of the ξij and α = [α1,…,αT]′. The goal is to maximize the product Π_{i=1}^N fi under the constraints αj ≥ 0 and Σ_{j=1}^T αj = 1. We solve this optimization problem using the Matlab fmincon function, which utilizes an interior-point approach to find the minimum of a constrained nonlinear function.
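A sketch of the same constrained MLE in Python, with scipy's SLSQP standing in for Matlab's fmincon; Ξ is the N × T matrix of ξij values computed beforehand:

```python
# Maximize sum_i log(sum_j alpha_j xi_ij) subject to alpha_j >= 0 and
# sum_j alpha_j = 1 (sketch; small epsilon guards log(0)).
import numpy as np
from scipy.optimize import minimize

def mle_tree_weights(Xi):
    N, T = Xi.shape
    neg_loglik = lambda a: -np.sum(np.log(Xi @ a + 1e-12))
    res = minimize(neg_loglik,
                   x0=np.full(T, 1.0 / T),   # start from equal weights
                   bounds=[(0.0, 1.0)] * T,
                   constraints={'type': 'eq', 'fun': lambda a: a.sum() - 1.0},
                   method='SLSQP')
    return res.x
```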

Weight distribution based on correlation of trees for weighted sum of random variables model

Among the T trees in the forest, some trees can have higher correlation among themselves and can be grouped into clusters with high correlations among the trees within a cluster but limited correlation between trees in different clusters. The purpose is to provide higher weight to the uncorrelated trees as compared with the correlated trees. The algorithmic pseudocode is shown as Algorithm 1.
Algorithm 1

Algorithmic representation of weight selection.

STEP 1: Cluster the trees based on their correlations.
STEP 2: Let the k clusters be [α1,…,αρ1],…,[αT−ρk+1,…,αT], where cluster r contains ρr trees.
STEP 3: Assign equal weight 1/k to each cluster, ie, assign weight 1/(ρr k) to each tree in cluster r.

To achieve the clustering of the trees, we have applied hierarchical clustering with the inverse of the covariance between trees as the distance criterion and linkages between clusters decided based on the minimum distance among pairs belonging to the two clusters (single-linkage clustering). The pair of trees with the smallest distance among all pairs is linked first, followed by the next pair, and so on. An example of hierarchical ordering with six trees is shown in Figure 2. To generate the final clusters, we have applied a threshold on the inverse covariance, and all links below the threshold are considered separate clusters. The threshold has been taken to be 60% of the average variance of the trees, or in other words 0.6 · tr(Σ)/T, where tr(Σ) denotes the trace of the covariance matrix. As an example, consider the hierarchical ordering in Figure 2 with a threshold of 3.8, which results in four separate clusters of trees: [2,3], [4,5], [1], and [6].
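A sketch of Algorithm 1 using scipy's hierarchical clustering utilities; the inverse-covariance distance and the 0.6 · tr(Σ)/T cutoff follow the description above (the epsilon guard against division by zero is our addition):

```python
# Single-linkage clustering on inverse-covariance distances, cut at 60% of the
# average tree variance; equal weight per cluster, split equally within it.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_weights(Sigma):
    T = Sigma.shape[0]
    dist = 1.0 / (np.abs(Sigma) + 1e-12)      # inverse covariance as distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method='single')
    labels = fcluster(Z, t=0.6 * np.trace(Sigma) / T, criterion='distance')
    k = labels.max()
    w = np.empty(T)
    for r in range(1, k + 1):
        members = labels == r
        w[members] = 1.0 / (k * members.sum())  # weight 1/(rho_r * k) per tree
    return w
```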
Figure 2

Example of hierarchical clustering.

Results

Synthetic dataset

ML estimate for mixture model

To evaluate the performance of our algorithm against competing methodologies, we created a synthetic dataset consisting of a 100 × 10 predictor matrix X and a 100 × 1 response vector Y. The predictor variables are randomly generated from a uniform distribution on [0, 1]. The response variable is generated from the predictor variables using Eq. 38, where N1 denotes random noise with pdf ℕ(0, 1). One of our objectives is to check whether we are able to reduce the mean square error (MSE) in prediction along with lowering the width of the CI by using MLE of the tree weights. Henceforth, the probabilistic RF with tree weights generated by MLE is termed PRF, and the probabilistic RF with equal tree weights is denoted RF. The weighted random forest approach16 is denoted wRF. To report our results, we have used 75 samples (75%) for training and the remaining 25 samples (25%) for testing (holdout validation) and compared the Pearson correlation coefficients, mean absolute error (MAE), MSE, normalized root mean square error (NRMSE), and width of the CI between predicted and experimental responses for the RF, wRF, and PRF models. The NRMSE of the output response can be calculated as NRMSE = sqrt(Σi (yi − ŷi)² / Σi (yi − E(y))²), where y and ŷ denote the vectors of actual and predicted drug sensitivities, respectively, and E(y) denotes the expectation of the vector y (a short sketch of this computation appears after Table 1). We have considered different numbers of trees to build the models (RF, wRF, and PRF), and the changes in MAE, NRMSE, and correlation coefficient between actual and predicted values for these models are shown in Table 1. Table 1 shows that the prediction errors measured in terms of MAE and NRMSE are smaller for PRF than for wRF and RF across different numbers of trees in the forest. We observe analogous behavior for the similarity measure, with the PRF correlation coefficient between predicted and experimental values higher than those of wRF and RF.
Table 1

MAE, NRMSE, and correlation between actual and predicted responses for 100 samples for different number of trees in the forest.

TREES   MAE                       NRMSE                     CORRELATION
        RF      wRF     PRF       RF      wRF     PRF       RF      wRF     PRF
5       0.1437  0.1431  0.1357    0.7785  0.7759  0.7565    0.6560  0.6600  0.6681
10      0.1126  0.1129  0.1044    0.7623  0.7645  0.7326    0.6535  0.6507  0.6848
20      0.0937  0.0943  0.0876    0.7103  0.7127  0.6747    0.7435  0.7402  0.7627
30      0.1319  0.1318  0.1195    0.6383  0.6371  0.6051    0.8169  0.8188  0.8217
50      0.1240  0.1239  0.1109    0.6737  0.6738  0.6394    0.8151  0.8169  0.8499

Note: Minimum leaf size is 3 and 5 features considered for each split.
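The NRMSE defined above can be computed as follows (sketch):

```python
# NRMSE: RMS prediction error normalized by the RMS deviation of the actual
# responses from their mean.
import numpy as np

def nrmse(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sqrt(np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2))
```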

The previous measures are based on the mean of the predicted pdf and the actual observation. We also consider a probabilistic measure to capture where an actual observation lies in comparison with the predicted pdf. Similar to the P-value for a double-tailed event, we consider the probability η(y) of observing results more extreme than y when our prediction probability density function is given by the pdf of the predicted response ŷ. A higher value of η(y) denotes that we cannot reject the hypothesis that the observed responses are from the predicted distributions. A higher value of E(η(y)), where the expectation is over the testing samples, is therefore preferable for model comparisons. Table 2 shows E(η(y)) for different numbers of trees for RF, wRF, and PRF (a sketch of the computation of η(y) appears after Table 2). Based on Table 2, we observe that E(η(y)) for PRF is higher than for both wRF and RF.
Table 2

E(η(yi)) for different number of trees in the forest.

TREES   E(η(yi))
        RF      wRF     PRF
5       0.5596  0.5586  0.5617
10      0.6467  0.6467  0.6655
20      0.7049  0.7034  0.7226
30      0.6328  0.6335  0.6281
50      0.6043  0.6042  0.6303

Note: Minimum leaf size is 3 and 5 features considered for each split.
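One way to compute the double-tailed measure η(y) described above, under our reading that "more extreme than y" covers both tails of the predicted mixture:

```python
# eta(y) = 2 * min(F(y), 1 - F(y)), with F the cdf of the predicted Gaussian
# mixture (our interpretation of the double-tailed event).
import numpy as np
from scipy.stats import norm

def eta(y, alpha, mu, sd):
    F = float(np.sum(np.asarray(alpha) * norm.cdf(y, mu, sd)))  # mixture cdf
    return 2.0 * min(F, 1.0 - F)

print(eta(0.5, [0.5, 0.5], np.array([0.0, 1.0]), np.array([0.2, 0.2])))
```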

We report the percentage difference in the width of the CI at different confidence levels (CLs) for PRF as compared with RF and wRF in Table 3. An M% change in the width of the CI denotes that, on average, the PRF-generated CI is M% narrower than the RF-generated CI. We observe that the average width of the CI for PRF is lower than those of RF and wRF for all CLs and different numbers of trees.
Table 3

Change in CI width for different CLs between RF and PRF and wRF and PRF model for 100 samples for different number of trees in the forest.

TREES   % DECREASE IN MEAN CI COMPARED TO RF        % DECREASE IN MEAN CI COMPARED TO wRF
        50% CL  70% CL  80% CL  95% CL  99% CL      50% CL  70% CL  80% CL  95% CL  99% CL
5       7.52    9.29    10.13   8.99    6.31        6.49    8.06    8.57    8.20    5.79
10      0.14    0.63    0.78    1.55    1.98        0.72    0.81    1.00    1.55    1.92
20      3.16    2.43    1.93    0.73    0.50        3.16    2.34    1.78    0.64    0.34
30      12.78   13.45   12.53   9.65    7.54        12.92   13.36   13.01   9.65    7.56
50      3.94    2.61    2.55    1.41    1.65        3.79    2.43    2.48    1.32    1.74

Note: Minimum leaf size is 3 and 5 features considered for each split.

Table 4 shows the NRMSE and correlation coefficient between actual and predicted responses for RF, wRF, and PRF, along with the percentage change in the width of the CI with PRF as compared with RF, for a simulation with 250 samples. Similar to the previous results with 100 samples, we observe improvement with PRF as compared with both RF and wRF with respect to NRMSE and correlation coefficient between actual and predicted responses. The results also show that the NRMSE decreased for all the approaches when the sample size was increased from 100 to 250 samples. The absolute difference in performance for PRF as compared with RF and wRF is larger for 100 samples, but the percentage improvement in performance is similar for both sample sizes.
Table 4

NRMSE, correlation between actual and predicted output, and change in CI width for different CLs for 250 samples for different number of trees in the forest.

TREES   NRMSE                     CORRELATION               % DECREASE IN MEAN CI WITH PRF COMPARED TO RF
        RF      wRF     PRF       RF      wRF     PRF       50% CL  70% CL  80% CL  95% CL  99% CL
5       0.5880  0.5871  0.5710    0.8177  0.8182  0.8240    4.66    4.36    4.63    4.61    4.02
10      0.5503  0.5498  0.5299    0.8396  0.8400  0.8529    4.19    4.45    3.92    3.26    2.87
20      0.5746  0.5750  0.5681    0.8243  0.8241  0.8302    3.86    3.07    2.77    2.60    2.79
30      0.5830  0.5832  0.5809    0.8180  0.8178  0.8185    1.36    1.34    1.51    1.35    1.26
50      0.5821  0.5826  0.5626    0.8259  0.8258  0.8380    4.19    3.15    2.90    2.73    2.89

Note: Minimum leaf size is 3 and 5 features considered for each split.

Weighted sum of random variables

In this section, we consider the effect of tree weights on the MSE and prediction variance for the weighted sum of random variables scenario. We generated a synthetic feature matrix of 500 samples and 1000 features based on a uniform probability distribution on [0, 1]. The output response has been generated based on Eq. 38, where the output response depends on nine of the input features. A random set of 300 of these 500 samples has been used for training, while the remaining 200 samples have been used for testing. We have used the filter feature selection approach RRelieff19 to reduce the initial set of 1000 features to 100 (10 among these 100 are randomly considered for each node split) for training the regression trees. We have considered five trees for the generation of the RF model, and the covariance matrix for the five trees based on the training samples is given by Eq. 41. We have used a small number of trees for easy visualization of the covariance matrix along with concise analysis of the inferred weights. In Eq. 41, the diagonal elements are the variances of the individual trees, while the nondiagonal elements are the covariances between different trees. We note that the covariance between trees 2 and 5 is high compared with the other covariances. By applying hierarchical clustering with inverse covariance as the distance measure and 60% of the average variance as the threshold, we arrive at four clusters: [2, 5], [1], [3], [4]. We assign equal weight (0.25) to each cluster, and where there is more than one tree in a cluster, the weight is equally divided among the trees in the cluster. Thus, we arrive at the following weight vector for the PRF model: C = [0.25 0.125 0.25 0.25 0.125]′. Since we considered holdout validation for the variance comparison, we generated the covariance among the trees for the testing samples (denoted by Σtest), which is shown in Eq. 42. Consequently, the variance of the forest with equal tree weights is given by Ceq′ΣtestCeq, whereas the variance of the forest based on our weight selection is given by C′ΣtestC, which was smaller in this experiment. These results illustrate that, for the weighted sum of random variables scenario, the variance of the forest prediction can be reduced by generating the weights of the trees based on tree clusters as compared with using equal weights for all trees.

ML estimate of mixture model applied to CCLE dataset

The CCLE dataset has been downloaded from http://www.broadinstitute.org/ccle/home. The CCLE dataset has two types of genetic characterization information: (i) gene expression and (ii) single-nucleotide polymorphism (SNP6). Gene expression has been extracted from CCLE_Expression_Entrez_2012-09-29.gct; in this dataset, there are 18,988 gene features with no missing values for 1037 cell lines. The SNP6 dataset has been extracted from CCLE_copynumber_byGene_2013-12-03.txt; for 1043 cell lines, there are 23,316 features. For our experiments, we have selected the 1012 cell lines that are common to both the gene expression and SNP6 datasets. The drug sensitivity data have been downloaded from the addendum published by Barretina et al.20 The data provide 24 drug responses for 504 cell lines. Drug sensitivity in the form of the area under the dose-response curve (Act Area) has been collected and normalized to [0, 1]. The integrated SNP6 and gene expression model was constructed based on individual RF models combined with a linear regression stacking approach.3

CI and variance

For the calculation of the CI, we have considered 15 drugs in the CCLE database and the samples with drug sensitivity higher than 0.1, so as to have noticeable variance among the output responses. The number of samples used for the experiments for the 15 drugs varies from 70 to 395. We have used fivefold cross-validation for all our computations: the data samples are randomly partitioned into five equal parts; four parts are used for training and the remaining part for testing; and the process is repeated five times, corresponding to the five different testing partitions. Based on the model inferred from the training samples, the mean and variance of the leaf node outputs for the testing set have been calculated. Thus, for a testing set of 20 samples and 10 trees, we have matrices of means and variances of size 20 × 10. Based on the calculated means and variances, a Gaussian mixture distribution has been derived, and its cumulative distribution function has been used to calculate the CIs for different CLs (a sketch of this computation appears after Table 5). To analyze the estimated CIs, we have considered the ratio of the number of experimental testing responses contained in the predicted CI to the total number of testing samples. We term this ratio the coverage probability of the CI. Note that we are calculating the coverage probability from cross-validation data as compared with resubstitution data, and thus, there can be significant differences from the CL for limited samples. The coverage probabilities for different CLs for all 15 drugs are shown in Table 5. We observe that the RF and PRF coverage probabilities are quite similar, with the PRF coverage probability closer to the actual CL than the RF coverage probability. As expected, the coverage probability increases with the CL for both the RF and PRF models.
Table 5

Coverage probabilities for four CIs (CL) for PRF and RF predictions for different drugs.

DRUG          COVERAGE PROBABILITY
              50% CL           70% CL           80% CL           95% CL
              RF      PRF      RF      PRF      RF      PRF      RF      PRF
17-AAG        0.686   0.650    0.911   0.883    0.977   0.959    1       1
AZD0530       0.829   0.764    0.934   0.929    0.949   0.959    0.994   0.989
AZD6244       0.743   0.712    0.920   0.893    0.951   0.938    0.991   0.986
Erlotinib     0.844   0.836    0.931   0.939    0.982   0.991    0.991   1
Lapatinib     0.838   0.788    0.932   0.898    0.974   0.949    1       1
Nilotinib     0.795   0.742    0.913   0.881    0.956   0.913    0.986   0.989
Nutlin-3      0.872   0.825    0.941   0.953    0.953   0.965    1       1
Paclitaxel    0.707   0.671    0.909   0.886    0.969   0.959    1       0.997
PD-0325901    0.686   0.665    0.893   0.872    0.965   0.944    0.996   0.996
PD-0332991    0.849   0.831    0.973   0.929    0.991   0.964    1       0.991
PF2341066     0.842   0.808    0.931   0.938    0.972   0.965    0.993   1
PHA-665752    0.855   0.842    0.973   0.960    1       1        1       1
PLX4720       0.900   0.785    0.971   0.942    0.985   0.957    0.985   0.985
Sorafenib     0.901   0.862    0.950   0.950    0.980   0.970    0.990   0.990
TAE684        0.816   0.771    0.955   0.948    0.982   0.965    0.996   0.996

Note: We have used T = 10 trees and the constraints αi ≥ 0 and Σ_{i=1}^T αi = 1 on the weights of the trees for the PRF model.
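A sketch of the CI construction described above: the Gaussian mixture cdf is assembled from the per-tree leaf moments and inverted numerically for the desired CL (brentq is our choice of root finder):

```python
# CI for a given confidence level from the predicted Gaussian mixture: invert
# the mixture cdf at the two tail probabilities by bracketing root finding.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def mixture_ci(alpha, mu, sd, level=0.95, lo=-10.0, hi=10.0):
    alpha, mu, sd = map(np.asarray, (alpha, mu, sd))
    cdf = lambda y: float(np.sum(alpha * norm.cdf(y, mu, sd)))
    tail = (1.0 - level) / 2.0
    return (brentq(lambda y: cdf(y) - tail, lo, hi),
            brentq(lambda y: cdf(y) - (1.0 - tail), lo, hi))

print(mixture_ci([0.5, 0.5], [0.2, 0.4], [0.05, 0.05], level=0.95))
```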

For the results shown in Table 5, we also calculated the P-values of paired t-tests between the PRF and RF predictions and the actual responses. The P-value of the paired t-test between (a) the PRF predictions and actual responses was 0.6172 and between (b) the RF predictions and actual responses was 0.6052. The higher value for the PRF scenario suggests that the PRF predictions are closer to the actual responses than the RF predictions. The change in coverage probability with the number of trees (T) for drug 17-AAG is shown in Table 6. We observe that the coverage probabilities are closer to the actual CLs for smaller numbers of trees. However, increasing the number of trees in the forest produces lower variance and higher prediction accuracy.
Table 6

Coverage probabilities for four CIs for different number of trees (from 2 to 100) for drug 17-AAG with 395 samples.

NO. OF TREES   COVERAGE PROBABILITY
               50% CL            70% CL            80% CL            95% CL
               RF      PRF       RF      PRF       RF      PRF       RF      PRF
T = 2          0.6532  0.6228    0.8228  0.8101    0.8911  0.8886    0.9848  0.9873
T = 5          0.6937  0.6658    0.9038  0.8861    0.9570  0.9418    0.9975  0.9949
T = 10         0.686   0.650     0.911   0.883     0.977   0.959     1       1
T = 20         0.7089  0.6886    0.9139  0.9089    0.9747  0.9722    1       1
T = 100        0.7291  0.7241    0.9342  0.9266    0.9823  0.9772    1       1

Note: The RF and PRF models show a similar type of behavior.

From Tables 5 and 6, we observe that both RF and PRF provide similar coverage probabilities for the generated CIs. We next analyzed the error in prediction using different error metrics (MSE, MAE, and NRMSE) and the lengths of the CIs for PRF in comparison with RF and wRF.16 We first explored whether PRF, in comparison with RF and wRF, can maintain similar prediction error (as measured by different metrics) while decreasing the CI in the majority of cases. The ratio of the number of testing samples for which the PRF model-generated CI is lower than the RF model-generated CI to the total number of samples is defined as the PRF CI ratio. For example, a PRF CI ratio of 0.60 denotes that for 60% of the testing samples, the PRF model-generated CI is lower than the RF model-generated CI. The MSE, MAE, and NRMSE for the different drugs are shown in Table 7, while the percentage decrease in CI for PRF in comparison with RF and wRF at different CLs is shown in Table 9. The average errors for PRF in comparison with RF and wRF are similar based on multiple error metrics, whereas the PRF CI ratio is >0.5 (between 0.54 and 0.6) for all CLs. Thus, the results support the idea that, as compared with using equal weights for all trees, weight optimization using MLE can potentially predict drug sensitivity with higher confidence while maintaining similar error. Figure 3 shows two example pdfs generated by PRF and RF; the PRF-predicted distribution has lower variance than the RF-predicted distribution while maintaining a similar mean.
Table 7

Performance of all the drugs in terms of MSE, MAE, and NRMSE.

DRUG         MSE                       MAE                       NRMSE
             RF      wRF     PRF       RF      wRF     PRF       RF      wRF     PRF
17-AAG       0.0175  0.0167  0.0185    0.1075  0.1051  0.1108    1.0055  0.9828  1.0363
AZD0530      0.0071  0.0066  0.0078    0.0628  0.0605  0.0650    1.0023  0.9637  1.0502
AZD6244      0.0157  0.0160  0.0169    0.0983  0.1018  0.1011    0.9567  0.9642  0.9946
Erlotinib    0.0047  0.0049  0.0057    0.0513  0.0528  0.0573    0.9956  1.0021  1.0894
Lapatinib    0.0070  0.0073  0.0079    0.0629  0.0616  0.0649    0.9799  0.9992  1.0418
Nilotinib    0.0241  0.0226  0.0230    0.1015  0.0931  0.0974    1.0326  1.0021  1.0116
Nutlin-3     0.0034  0.0038  0.0037    0.0435  0.0449  0.0438    0.9762  1.0415  1.0296
Paclitaxel   0.0237  0.0236  0.0243    0.1226  0.1240  0.1257    0.9229  0.9205  0.9354
PD-0325901   0.0259  0.0254  0.0279    0.1312  0.1306  0.1364    0.9534  0.9446  0.9873
PD-0332991   0.0053  0.0045  0.0058    0.0573  0.0524  0.0609    0.9825  0.9139  1.0379
PF2341066    0.0075  0.0074  0.0060    0.0646  0.0603  0.0564    1.0578  1.0458  0.9519
PHA-665752   0.0039  0.0039  0.0039    0.0509  0.0497  0.0488    1.0667  1.0700  1.0616
PLX4720      0.0100  0.0107  0.0106    0.0730  0.0775  0.0720    1.0011  1.0270  1.0283
Sorafenib    0.0072  0.0069  0.0063    0.0568  0.0533  0.0505    1.0471  1.0309  0.9923
TAE684       0.0087  0.0075  0.0089    0.0696  0.0644  0.0707    0.9682  0.8957  0.9759
Average      0.0114  0.0112  0.0118    0.0769  0.0755  0.0774    0.9966  0.9869  1.0149

Note: We have used T = 10 and the constraints αi ≥ 0 and Σ_{i=1}^T αi = 1 on the weights of the trees for the PRF model.


Figure 3

The RF-generated pdf is more spread out than the PRF-generated pdf, implying that the CI of the RF-generated pdf is wider than that of the PRF-generated pdf.

The percentage decreases in mean CI with PRF as compared with RF and wRF are shown in Table 9. We note that the average CI for PRF is lower than those of RF and wRF in the overwhelming majority of cases.
Table 9

Percentage decrease in mean CI with PRF as compared with RF and wRF for 15 drugs of CCLE dataset.

DRUG         % DECREASE IN CI COMPARED TO RF            % DECREASE IN CI COMPARED TO wRF
             50% CL  70% CL  80% CL  95% CL  99% CL     50% CL  70% CL  80% CL  95% CL  99% CL
17-AAG       2.43    2.39    2.11    2.03    2.28       2.41    2.34    2.11    2.00    2.29
AZD0530      3.04    2.20    2.18    2.56    3.15       3.09    2.20    2.17    2.56    3.15
AZD6244      4.60    4.57    4.48    4.76    5.10       4.60    4.61    4.50    4.77    5.12
Erlotinib    −1.92   −2.13   −2.20   −1.34   1.20       −1.84   −2.26   −1.98   −1.27   1.24
Lapatinib    4.25    4.77    4.70    4.04    4.62       4.21    4.89    4.64    3.94    4.65
Nilotinib    7.87    9.07    11.76   13.12   9.60       7.39    9.20    11.87   13.29   9.84
Nutlin-3     8.76    6.76    5.80    5.66    7.73       8.94    6.87    5.92    5.81    7.76
Paclitaxel   3.02    2.97    3.07    3.26    3.35       2.96    2.93    3.12    3.25    3.37
PD-0325901   1.10    1.84    2.07    2.11    2.12       1.10    1.75    2.09    2.05    2.08
PD-0332991   2.23    1.22    0.36    0.74    1.52       2.24    0.93    0.35    0.64    1.46
PF2341066    3.69    4.63    4.26    3.45    3.75       3.56    4.55    4.21    3.43    3.69
PHA-665752   9.65    9.11    9.06    8.79    8.35       9.47    9.01    8.98    8.70    8.37
PLX4720      13.12   11.08   11.85   12.85   11.41      13.13   11.55   12.03   13.12   11.51
Sorafenib    −2.46   −3.15   −3.65   −1.50   2.34       −2.39   −3.23   −3.48   −1.47   2.31
TAE684       4.08    3.90    3.64    3.59    3.62       4.17    3.91    3.65    3.52    3.63

Notes: Number of features used for each split is 10, minimum number of samples in a leaf node is 5, T = 10, and the PRF constraints are αi ≥ 0 and Σ_{i=1}^T αi = 1.

We also compared our approach with quantile regression forests (QRFs),21 which use nonparametric empirical distributions to model the distributions at the leaf nodes. We observed (results not included) that QRFs can produce smaller CIs than RF and PRF, but the coverage probability of QRF is significantly lower. It appears that empirical distributions based on a few samples can provide smaller variance but have limited coverage, which defeats the purpose of designing the CIs.

Prior feature selection

In this experiment, we have used the filter feature selection algorithm RRelieff19 to reduce the initial set of features used for training the RF, PRF, and wRF models. We have considered the CCLE cell lines that are common to all 15 drugs, resulting in 396 samples. Feature selection has been used to reduce the number of features to 50 for each dataset. Table 10 shows the average errors in terms of MSE and MAE for the 15 drugs with 50 selected features for RF, wRF, and PRF. We observe that PRF performs better than RF and wRF in terms of both average MSE and average MAE.
Table 10

Performance of all the drugs in terms of MSE and MAE for PRF compared to RF and wRF.

DRUG         MSE                       MAE
             RF      wRF     PRF       RF      wRF     PRF
17-AAG       0.0179  0.0186  0.0171    0.1088  0.1138  0.1044
AZD0530      0.0103  0.0102  0.0079    0.0839  0.0739  0.0737
AZD6244      0.0162  0.0128  0.0133    0.1037  0.0920  0.0927
Erlotinib    0.0042  0.0043  0.0045    0.0502  0.0524  0.0503
Lapatinib    0.0052  0.0052  0.0058    0.0513  0.0512  0.0565
Nilotinib    0.0056  0.0069  0.0042    0.0527  0.0520  0.0490
Nutlin-3     0.0042  0.0035  0.0030    0.0436  0.0441  0.0450
Paclitaxel   0.0192  0.0219  0.0182    0.1122  0.1180  0.1098
PD-0325901   0.0244  0.0231  0.0230    0.1313  0.1233  0.1218
PD-0332991   0.0050  0.0039  0.0046    0.0535  0.0532  0.0501
PF2341066    0.0046  0.0041  0.0068    0.0445  0.0440  0.0536
PHA-665752   0.0043  0.0046  0.0030    0.0453  0.0458  0.0427
PLX4720      0.0042  0.0030  0.0068    0.0445  0.0413  0.0512
Sorafenib    0.0055  0.0034  0.0031    0.0460  0.0439  0.0443
TAE684       0.0120  0.0093  0.0089    0.0850  0.0758  0.0764
Average      0.0095  0.0090  0.0087    0.0704  0.0683  0.0681

Notes: Number of features used for building the model is 50, and the number of trees considered is 40. The PRF constraints are αi ≥ 0 and Σ_{i=1}^T αi = 1.

The prediction performance can also be measured in terms of the bias and variance of the error distributions produced by different predictive models. The bias will be an inverse measure of accuracy, and variance will be an inverse measure of precision. Table 11 shows the bias and variance for RF, wRF, and PRF for different drugs. We note that the average absolute bias (measure of inaccuracy) is lower for PRF (0.0014) as compared with RF (0.0020) and wRF (0.0025). Similarly, the variance (measure of imprecision) for PRF (0.0087) is smaller than variances for RF (0.0096) and wRF (0.0090).
Table 11

Performance of all the drugs in terms of bias and variance for PRF compared with RF and wRF.

DRUG         BIAS                           VARIANCE
             RF       wRF      PRF          RF      wRF     PRF
17-AAG       0.0023   0.0087   0.0020       0.0182  0.0187  0.0173
AZD0530      −0.0055  −0.0150  −0.0005      0.0104  0.0101  0.0080
AZD6244      −0.0133  0.0190   0.0037       0.0162  0.0126  0.0135
Erlotinib    0.0054   0.0110   −0.0054      0.0042  0.0042  0.0045
Lapatinib    −0.0116  −0.0080  −0.0142      0.0052  0.0052  0.0057
Nilotinib    −0.0062  0.0002   0.0012       0.0057  0.0070  0.0043
Nutlin-3     −0.0102  0.0020   −0.0057      0.0042  0.0036  0.0031
Paclitaxel   0.0040   0.0192   0.0351       0.0194  0.0217  0.0171
PD-0325901   0.0086   0.0025   −0.0048      0.0246  0.0234  0.0232
PD-0332991   0.0118   0.0064   0.0083       0.0050  0.0039  0.0047
PF2341066    0.0012   −0.0053  −0.0003      0.0047  0.0041  0.0069
PHA-665752   −0.0045  −0.0108  0.0058       0.0044  0.0046  0.0030
PLX4720      −0.0027  0.0030   −0.0038      0.0042  0.0030  0.0069
Sorafenib    −0.0081  0.0005   −0.0028      0.0055  0.0034  0.0032
TAE684       −0.0009  0.0048   0.0030       0.0122  0.0094  0.0090
Average      −0.0020  0.0025   0.0014       0.0096  0.0090  0.0087

Notes: Number of features used for building the model is 50, and the number of trees considered is 40. The PRF constraints are αi ≥ 0 and Σ_{i=1}^T αi = 1.

The results given in Tables 10 and 11 show that PRF provides improvement in terms of average error, accuracy, and precision. We next consider the length of the CI with PRF as compared with RF and wRF. Table 12 shows that the percentage decrease in average CI for PRF when compared with RF and wRF is positive for the majority of the drugs.
Table 12

Performance of all the drugs in terms of % decrease in mean CI with PRF as compared with RF and wRF.

DRUG         % DECREASE IN CI COMPARED TO RF            % DECREASE IN CI COMPARED TO wRF
             50% CL  70% CL  80% CL  95% CL  99% CL     50% CL  70% CL  80% CL  95% CL  99% CL
17-AAG       2.79    2.37    2.48    2.59    2.54       2.87    2.38    2.52    2.61    2.54
AZD0530      3.10    2.47    2.86    2.52    3.40       3.11    2.35    2.98    2.58    3.37
AZD6244      1.52    1.68    1.88    0.87    0.70       1.53    1.83    1.79    0.93    0.79
Erlotinib    3.02    2.13    1.75    1.37    1.63       3.03    2.17    1.76    1.39    1.64
Lapatinib    1.49    1.34    1.29    1.02    1.97       1.37    1.31    1.29    0.96    1.95
Nilotinib    1.62    0.64    0.49    2.12    4.50       1.44    0.57    0.44    2.13    4.49
Nutlin-3     0.01    −0.61   −1.01   −1.18   0.06       0.07    −0.65   −1.01   −1.14   0.08
Paclitaxel   3.92    2.91    2.38    1.92    2.03       3.90    2.79    2.48    1.91    2.01
PD-0325901   −0.09   −0.07   −0.03   −0.49   −0.40      0.00    −0.01   −0.06   −0.43   −0.41
PD-0332991   3.38    3.13    2.58    1.65    2.35       3.39    3.13    2.55    1.57    2.29
PF2341066    5.76    6.09    5.08    5.06    5.88       5.82    6.02    5.15    5.05    5.80
PHA-665752   1.83    0.97    0.37    −0.59   −0.23      1.76    0.93    0.37    −0.66   −0.23
PLX4720      0.80    0.59    0.97    −0.23   0.04       0.80    0.64    0.94    −0.20   0.01
Sorafenib    3.95    4.16    3.57    2.81    4.33       3.95    4.29    3.55    2.87    4.32
TAE684       0.61    0.40    0.02    −0.68   −0.33      0.70    0.44    0.02    −0.62   −0.33

Notes: Number of features used for building the model is 50, and the number of trees considered is T = 40. The PRF constraints are αi ≥ 0 and Σ_{i=1}^T αi = 1.

Conclusions

In this article, we considered the probabilistic analysis of RFs by representing an RF as an ensemble of probabilistic regression trees. The two perspectives presented in this manuscript differ in how we treat a probabilistic ensemble of regression trees. In the first, we select one tree from the available trees conditional on the weights and predict the output response based on that tree's distribution, resulting in the mixture distribution scenario. In the second, the output response is the weighted average of the realizations of all the trees, similar to the averaging of responses from different trees in a conventional RF. Thus, if individual trees have large biases (a measure of inaccuracy) that are both positive and negative, the weighted sum of random variables can provide a better representation; if we consider the mixture distribution approach in this case, selecting an individual tree for each prediction might be unable to remove the bias. However, the mixture distribution approach is effective in selecting tree weights to reduce the CIs while maintaining coverage and MSE, as shown in the results presented in this article. The probabilistic representation presented in this article allowed us to generate and analyze the CIs of individual predictions. We explored various structures of covariance matrices representing the relationships between the generated probabilistic regression trees and the corresponding tree weights that optimally reduce the variance of prediction. We studied the effect of tree weights generated using MLE on different error measures and prediction CIs. The application of the maximum-likelihood estimates of tree weights to the CCLE drug sensitivity prediction problem illustrated an average reduction in CI while maintaining or lowering the MSE. Future research will consider the generation of a probabilistic framework for multivariate RFs, along with the generation of sufficiency conditions for reduction in CI by optimizing tree weights.

Software Availability

Matlab implementation can be downloaded from https://github.com/razrahman/PRF_codes.git.
References (4 in total)

1.  Predicting in vitro drug sensitivity using Random Forests.

Authors:  Gregory Riddick; Hua Song; Susie Ahn; Jennifer Walling; Diego Borges-Rivera; Wei Zhang; Howard A Fine
Journal:  Bioinformatics       Date:  2010-12-05       Impact factor: 6.937

2.  A Weighted Random Forests Approach to Improve Predictive Performance.

Authors:  Stacey J Winham; Robert R Freimuth; Joanna M Biernacka
Journal:  Stat Anal Data Min       Date:  2013-12-01       Impact factor: 1.051

3.  The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity.

Authors:  Jordi Barretina; Giordano Caponigro; Nicolas Stransky; Kavitha Venkatesan; Adam A Margolin; Sungjoon Kim; Christopher J Wilson; Joseph Lehár; Gregory V Kryukov; Dmitriy Sonkin; Anupama Reddy; Manway Liu; Lauren Murray; Michael F Berger; John E Monahan; Paula Morais; Jodi Meltzer; Adam Korejwa; Judit Jané-Valbuena; Felipa A Mapa; Joseph Thibault; Eva Bric-Furlong; Pichai Raman; Aaron Shipway; Ingo H Engels; Jill Cheng; Guoying K Yu; Jianjun Yu; Peter Aspesi; Melanie de Silva; Kalpana Jagtap; Michael D Jones; Li Wang; Charles Hatton; Emanuele Palescandolo; Supriya Gupta; Scott Mahan; Carrie Sougnez; Robert C Onofrio; Ted Liefeld; Laura MacConaill; Wendy Winckler; Michael Reich; Nanxin Li; Jill P Mesirov; Stacey B Gabriel; Gad Getz; Kristin Ardlie; Vivien Chan; Vic E Myer; Barbara L Weber; Jeff Porter; Markus Warmuth; Peter Finan; Jennifer L Harris; Matthew Meyerson; Todd R Golub; Michael P Morrissey; William R Sellers; Robert Schlegel; Levi A Garraway
Journal:  Nature       Date:  2012-03-28       Impact factor: 49.962

4.  An ensemble based top performing approach for NCI-DREAM drug sensitivity prediction challenge.

Authors:  Qian Wan; Ranadip Pal
Journal:  PLoS One       Date:  2014-06-30       Impact factor: 3.240

