Frédéric Bertrand, Ismaïl Aouadi, Nicolas Jung, Raphael Carapito, Laurent Vallat, Seiamak Bahram, Myriam Maumy-Bertrand.
Abstract
MOTIVATION: With the growth of big data, variable selection has become one of the critical challenges in statistics. Although many methods have been proposed in the literature, their performance in terms of recall (sensitivity) and precision (positive predictive value) is limited in contexts where the number of variables far exceeds the number of observations, or where the variables are highly correlated.
Year: 2021 PMID: 33016991 PMCID: PMC8097688 DOI: 10.1093/bioinformatics/btaa855
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
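Throughout the figures below, methods are benchmarked by recall (sensitivity), precision (PPV) and their combination into an F-score. As a rough, illustrative sketch of how these selection metrics are computed for a variable-selection task (the helper name and the example numbers are ours, not taken from the paper):

```python
# Illustrative only: recall (sensitivity), precision (positive predictive value)
# and F-score for a selected variable set vs. the true support.
import numpy as np

def selection_scores(true_support, selected_support, p):
    """Recall, PPV and F-score of a selected variable set against the true support."""
    truth = np.zeros(p, dtype=bool)
    truth[list(true_support)] = True
    chosen = np.zeros(p, dtype=bool)
    chosen[list(selected_support)] = True

    tp = np.sum(truth & chosen)   # correctly selected variables
    fp = np.sum(~truth & chosen)  # spurious selections
    fn = np.sum(truth & ~chosen)  # missed variables

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    f_score = 2 * recall * ppv / (recall + ppv) if (recall + ppv) else 0.0
    return recall, ppv, f_score

# Example: 1000 candidate variables, 10 truly active, 12 selected, 7 of them correct.
print(selection_scores(range(10), list(range(7)) + [100, 200, 300, 400, 500], 1000))
```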
Fig. 1. Top: evolution of the recall, PPV and F-score as a function of c0 for LASSO-based SelectBoost with the AICc model selection criterion, for Type1 simulated data with a non-increasing post-processing step and a fixed threshold; for some values of c0 the models are empty. Bottom: distribution of the PPV at a 0.25 threshold for SPLS-based SelectBoost on Type1 data, with raw SelectBoost (left) or SelectBoost with a non-increasing post-processing step (right)
Summary of the types of datasets used to benchmark the SelectBoost algorithm
| Name | Data | Individuals | Variables |
|---|---|---|---|
| Type1 | Simulated | 100 | 1000 |
| Type2 | Simulated | 100 | 1000 |
| Type3 | Simulated | 400 | 203 |
| Type4 | Simulated | 750 | 102 |
| Leukemia | Observed | 72 | 3571 |
| Huntington | Observed | 69 | 17 717 |
| Melanoma | Observed | 28 | 25 268 |
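For orientation, here is a minimal sketch of an n << p, correlated-design benchmark in the spirit of the Type1/Type2 rows above (100 observations, 1000 variables). The block-correlation scheme and effect sizes are assumptions made for illustration; the paper's actual simulation design is not reproduced here.

```python
# Minimal sketch (assumptions ours) of an n << p, correlated benchmark:
# 100 observations, 1000 variables, block-correlated design, sparse true support.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, block = 100, 1000, 10

# Block-equicorrelated design: variables within a block share a latent factor.
latent = rng.normal(size=(n, p // block))
X = 0.7 * np.repeat(latent, block, axis=1) + 0.3 * rng.normal(size=(n, p))

# Sparse truth: 10 active variables with moderate effect sizes.
beta = np.zeros(p)
true_support = rng.choice(p, size=10, replace=False)
beta[true_support] = rng.uniform(1.0, 2.0, size=10)
y = X @ beta + rng.normal(size=n)

# A plain cross-validated lasso as baseline selector.
selected = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)
print(sorted(true_support), sorted(selected))
```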
Fig. 2. Recall-precision curve for all models and criteria with non-increasing SelectBoost, on Type1 data with direct grouping, over 100 different datasets.
Fig. 3. Top: the average number of identified variables plotted as a function of the proportion of correctly identified variables, for Type1 simulated data and all models. Middle and bottom: effect of the SelectBoost algorithm with respect to c0 for the adaptive elastic net and the AICc model selection criterion, with c0 varying over the considered range, on 100 different (middle, reproducibility) or 100 identical (bottom, repeatability) Type3 simulated datasets with a non-increasing post-processing step and a fixed threshold. Only results for non-empty models are shown
Fig. 4. Percentage of non-zero coefficients with respect to c0 for SGPLS-based SelectBoost models of the leukemia dataset at a fixed threshold
Fig. 5. Colors: green marks the most reliable variables selected by the SelectBoost algorithm (confidence index of 0.3), orange marks intermediate confidence (0.25) and red low confidence (0.15). Left: evolution of the coefficients in the lasso regression as the regularization parameter λ varies; over the λ range shown, the red, orange and green lines stay at zero. Right: evolution of the probability of being in the support of the regression as the confidence index varies. The dotted line marks the 0.95 threshold
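Figure 5 (right) grades variables by how reliably they remain in the lasso support as the confidence requirement grows. A simplified, hedged stand-in for that idea is a resampling-based selection frequency, sketched below; this is not the SelectBoost confidence index (which perturbs groups of correlated variables rather than resampling rows), only an illustration of grading selection reliability.

```python
# Hedged sketch: half-sample selection frequency per variable under the lasso.
# NOT the SelectBoost confidence index; shown only to illustrate the idea of
# grading how reliably a variable enters the support.
import numpy as np
from sklearn.linear_model import Lasso

def selection_frequency(X, y, alpha=0.1, n_resamples=100, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_resamples):
        idx = rng.choice(n, size=n // 2, replace=False)          # half-sample of rows
        coef = Lasso(alpha=alpha, max_iter=10000).fit(X[idx], y[idx]).coef_
        counts += coef != 0                                      # count selections
    return counts / n_resamples                                  # frequency in [0, 1]

# Toy usage: 3 truly active variables out of 40; frequencies above a high
# threshold (e.g. 0.95) would be treated as reliably selected.
X = np.random.default_rng(1).normal(size=(60, 40))
y = X[:, :3].sum(axis=1) + np.random.default_rng(2).normal(size=60)
print(np.round(selection_frequency(X, y), 2)[:10])
```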
Fig. 6. Post-inference analysis of an inferred cascade network. Dark values correspond to low confidence and bright values to high confidence; confidence ranges from 0 (lowest) to 1 (highest). The lower triangular part of the matrix is the area of highest confidence (1), since for cascade networks those links are known to be 0, and the model assumes so
Fig. 7. F-score as a function of the thresholding value: if an inferred coefficient of the network is smaller than the thresholding value, it is set to 0. The SelectBoost algorithm is compared with both stability selection and the regular lasso. The upper row displays results for the unweighted versions of the algorithms, whereas the lower row displays results for their weighted counterparts
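To make the thresholding procedure of Fig. 7 concrete, here is a small sketch (function name and toy data are ours) that zeroes inferred coefficients whose magnitude falls below the thresholding value and scores the surviving edge set against the true network with an F-score.

```python
# Illustrative sketch: sweep a thresholding value over an inferred coefficient
# matrix, keep only entries at or above it (in absolute value, our assumption),
# and compute the F-score of the kept edges against the true network.
import numpy as np

def f_score_vs_threshold(inferred, truth, thresholds):
    """F-score of the thresholded inferred network for each threshold value."""
    true_edges = truth != 0
    scores = []
    for t in thresholds:
        kept = np.abs(inferred) >= t          # coefficients below t are set to 0
        tp = np.sum(kept & true_edges)
        fp = np.sum(kept & ~true_edges)
        fn = np.sum(~kept & true_edges)
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        scores.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return np.array(scores)

# Toy example: a 20-gene network with up to 30 true links and noisy inferred weights.
rng = np.random.default_rng(3)
truth = np.zeros((20, 20))
truth[rng.integers(0, 20, 30), rng.integers(0, 20, 30)] = 1.0
inferred = truth * rng.uniform(0.5, 1.5, truth.shape) + rng.normal(0, 0.2, truth.shape)
print(f_score_vs_threshold(inferred, truth, thresholds=np.linspace(0, 1, 11)).round(2))
```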