Frédéric Bertrand, Ismaïl Aouadi, Nicolas Jung, Raphael Carapito, Laurent Vallat, Seiamak Bahram, Myriam Maumy-Bertrand.
Abstract
MOTIVATION: With the growth of big data, variable selection has become one of the critical challenges in statistics. Although many methods have been proposed in the literature, their performance in terms of recall (sensitivity) and precision (positive predictive value) is limited in contexts where the number of variables far exceeds the number of observations, or where the variables are highly correlated.
Year: 2021 PMID: 33016991 PMCID: PMC8097688 DOI: 10.1093/bioinformatics/btaa855
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
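Throughout the figures below, methods are benchmarked by recall (sensitivity), precision (PPV) and their combination into an F-score. As a rough, illustrative sketch of how these selection metrics are computed for a variable-selection task (the helper name and the example numbers are ours, not taken from the paper):

```python
# Illustrative only: recall (sensitivity), precision (positive predictive value)
# and F-score for a selected variable set vs. the true support.
import numpy as np

def selection_scores(true_support, selected_support, p):
    """Recall, PPV and F-score of a selected variable set against the true support."""
    truth = np.zeros(p, dtype=bool)
    truth[list(true_support)] = True
    chosen = np.zeros(p, dtype=bool)
    chosen[list(selected_support)] = True

    tp = np.sum(truth & chosen)   # correctly selected variables
    fp = np.sum(~truth & chosen)  # spurious selections
    fn = np.sum(truth & ~chosen)  # missed variables

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    f_score = 2 * recall * ppv / (recall + ppv) if (recall + ppv) else 0.0
    return recall, ppv, f_score

# Example: 1000 candidate variables, 10 truly active, 12 selected, 7 of them correct.
print(selection_scores(range(10), list(range(7)) + [100, 200, 300, 400, 500], 1000))
```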
Fig. 1. Top: evolution of the recall, PPV and F-score as a function of c0 for LASSO-based SelectBoost with the AICc model selection criterion, for Type1 simulated data with a non-increasing post-processing step and a fixed threshold; for some values of c0 the models are empty. Bottom: distribution of the PPV at a 0.25 threshold for SPLS-based SelectBoost on Type1 data, with raw SelectBoost (left) or SelectBoost with a non-increasing post-processing step (right)
Summary of the types of datasets used to benchmark the SelectBoost algorithm
| Name | Data | Individuals | Variables |
|---|---|---|---|
| Type1 | Simulated | 100 | 1000 |
| Type2 | Simulated | 100 | 1000 |
| Type3 | Simulated | 400 | 203 |
| Type4 | Simulated | 750 | 102 |
| Leukemia | Observed | 72 | 3571 |
| Huntington | Observed | 69 | 17 717 |
| Melanoma | Observed | 28 | 25 268 |
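For orientation, here is a minimal sketch of an n << p, correlated-design benchmark in the spirit of the Type1/Type2 rows above (100 observations, 1000 variables). The block-correlation scheme and effect sizes are assumptions made for illustration; the paper's actual simulation design is not reproduced here.

```python
# Minimal sketch (assumptions ours) of an n << p, correlated benchmark:
# 100 observations, 1000 variables, block-correlated design, sparse true support.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, block = 100, 1000, 10

# Block-equicorrelated design: variables within a block share a latent factor.
latent = rng.normal(size=(n, p // block))
X = 0.7 * np.repeat(latent, block, axis=1) + 0.3 * rng.normal(size=(n, p))

# Sparse truth: 10 active variables with moderate effect sizes.
beta = np.zeros(p)
true_support = rng.choice(p, size=10, replace=False)
beta[true_support] = rng.uniform(1.0, 2.0, size=10)
y = X @ beta + rng.normal(size=n)

# A plain cross-validated lasso as baseline selector.
selected = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)
print(sorted(true_support), sorted(selected))
```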
Fig. 2. Recall-precision curve for all models and criteria with non-increasing SelectBoost, on Type1 data with direct grouping, over 100 different datasets.
Fig. 3. Top: the average number of identified variables plotted as a function of the proportion of correctly identified variables, for Type1 simulated data and all models. Middle and bottom: effect of the SelectBoost algorithm with respect to c0 for the adaptive elastic net and the AICc model selection criterion, with c0 varying over the considered range, on 100 different (middle, reproducibility) or 100 identical (bottom, repeatability) Type3 simulated datasets with a non-increasing post-processing step and a fixed threshold. Only results for non-empty models are shown
Fig. 4. Percentage of non-zero coefficients with respect to c0 for SGPLS-based SelectBoost models of the leukemia dataset at a fixed threshold
Fig. 5. Colors: green marks the most reliable variables selected by the SelectBoost algorithm (confidence index of 0.3), orange marks intermediate confidence (0.25) and red low confidence (0.15). Left: evolution of the coefficients in the lasso regression as the regularization parameter λ varies; over the λ range shown, the red, orange and green lines stay at zero. Right: evolution of the probability of being in the support of the regression as the confidence index varies. The dotted line marks the 0.95 threshold
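Figure 5 (right) grades variables by how reliably they remain in the lasso support as the confidence requirement grows. A simplified, hedged stand-in for that idea is a resampling-based selection frequency, sketched below; this is not the SelectBoost confidence index (which perturbs groups of correlated variables rather than resampling rows), only an illustration of grading selection reliability.

```python
# Hedged sketch: half-sample selection frequency per variable under the lasso.
# NOT the SelectBoost confidence index; shown only to illustrate the idea of
# grading how reliably a variable enters the support.
import numpy as np
from sklearn.linear_model import Lasso

def selection_frequency(X, y, alpha=0.1, n_resamples=100, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_resamples):
        idx = rng.choice(n, size=n // 2, replace=False)          # half-sample of rows
        coef = Lasso(alpha=alpha, max_iter=10000).fit(X[idx], y[idx]).coef_
        counts += coef != 0                                      # count selections
    return counts / n_resamples                                  # frequency in [0, 1]

# Toy usage: 3 truly active variables out of 40; frequencies above a high
# threshold (e.g. 0.95) would be treated as reliably selected.
X = np.random.default_rng(1).normal(size=(60, 40))
y = X[:, :3].sum(axis=1) + np.random.default_rng(2).normal(size=60)
print(np.round(selection_frequency(X, y), 2)[:10])
```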
Fig. 6. Post-inference analysis of an inferred cascade network. Dark values correspond to low confidence and bright values to high confidence; confidence ranges from 0 (lowest) to 1 (highest). The lower triangular part of the matrix is the area of highest confidence (1), since for cascade networks those links are known to be 0, and the model assumes so
Fig. 7. F-score as a function of the thresholding value: if an inferred coefficient of the network is smaller than the thresholding value, it is set to 0. The SelectBoost algorithm is compared with both stability selection and the regular lasso. The upper row displays results for the unweighted versions of the algorithms, whereas the lower row displays results for their weighted counterparts
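To make the thresholding procedure of Fig. 7 concrete, here is a small sketch (function name and toy data are ours) that zeroes inferred coefficients whose magnitude falls below the thresholding value and scores the surviving edge set against the true network with an F-score.

```python
# Illustrative sketch: sweep a thresholding value over an inferred coefficient
# matrix, keep only entries at or above it (in absolute value, our assumption),
# and compute the F-score of the kept edges against the true network.
import numpy as np

def f_score_vs_threshold(inferred, truth, thresholds):
    """F-score of the thresholded inferred network for each threshold value."""
    true_edges = truth != 0
    scores = []
    for t in thresholds:
        kept = np.abs(inferred) >= t          # coefficients below t are set to 0
        tp = np.sum(kept & true_edges)
        fp = np.sum(kept & ~true_edges)
        fn = np.sum(~kept & true_edges)
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        scores.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return np.array(scores)

# Toy example: a 20-gene network with up to 30 true links and noisy inferred weights.
rng = np.random.default_rng(3)
truth = np.zeros((20, 20))
truth[rng.integers(0, 20, 30), rng.integers(0, 20, 30)] = 1.0
inferred = truth * rng.uniform(0.5, 1.5, truth.shape) + rng.normal(0, 0.2, truth.shape)
print(f_score_vs_threshold(inferred, truth, thresholds=np.linspace(0, 1, 11)).round(2))
```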