Jan Klosa, Noah Simon, Pål Olof Westermark, Volkmar Liebscher, Dörte Wittenburg.
Abstract
BACKGROUND: Statistical analyses of biological problems in the life sciences often lead to high-dimensional linear models. Penalization approaches are often the methods of choice for solving the corresponding system of equations. They are especially useful in the case of multicollinearity, which arises when the number of explanatory variables exceeds the number of observations or for biological reasons. In these approaches, the goodness of fit of the model is penalized by a suitable function of interest. Prominent examples are the lasso, group lasso and sparse-group lasso. Here, we offer a fast and numerically cheap implementation of these operators via proximal gradient descent. The grid search for the penalty parameter is realized by warm starts. The step size between consecutive iterations is determined with backtracking line search. Finally, seagull, the R package presented here, produces complete regularization paths.
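The three ingredients named in the abstract (proximal gradient descent, backtracking line search, and warm starts along a grid of penalty parameters) can be illustrated with a minimal Python sketch. This is not the seagull implementation. The function names, the loss scaling 1/(2n), the mixing parameter `alpha` (1 gives the lasso, 0 the group lasso) and the group weights sqrt(group size) are assumptions chosen for illustration, and the data are toy values.

```python
import math

def soft_threshold(v, t):
    """Element-wise soft-thresholding, the proximal operator of t * ||v||_1."""
    return [x - t if x > t else x + t if x < -t else 0.0 for x in v]

def prox_sgl(v, groups, t, lam, alpha):
    """Sparse-group lasso prox: soft-threshold, then shrink each group.
    Assumed penalty: lam * (alpha*||b||_1 + (1-alpha)*sum_g sqrt(p_g)*||b_g||_2)."""
    out = soft_threshold(v, t * lam * alpha)
    for g in groups:
        norm = math.sqrt(sum(out[j] ** 2 for j in g))
        thresh = t * lam * (1.0 - alpha) * math.sqrt(len(g))
        scale = max(1.0 - thresh / norm, 0.0) if norm > 0.0 else 0.0
        for j in g:
            out[j] *= scale
    return out

def residuals(X, y, b):
    return [sum(x * bj for x, bj in zip(row, b)) - yi for row, yi in zip(X, y)]

def loss(X, y, b):
    """Smooth part of the objective: ||y - Xb||^2 / (2n)."""
    return sum(r * r for r in residuals(X, y, b)) / (2.0 * len(y))

def gradient(X, y, b):
    r, n = residuals(X, y, b), len(y)
    return [sum(X[i][j] * r[i] for i in range(n)) / n for j in range(len(b))]

def fit(X, y, groups, lam, alpha, b_start, t=1.0, shrink=0.5,
        tol=1e-10, max_iter=10000):
    """Proximal gradient descent for one penalty value lam."""
    b = list(b_start)
    for _ in range(max_iter):
        f, g = loss(X, y, b), gradient(X, y, b)
        while True:  # backtracking: shrink step size t until the bound holds
            step = [bj - t * gj for bj, gj in zip(b, g)]
            new = prox_sgl(step, groups, t, lam, alpha)
            d = [nj - bj for nj, bj in zip(new, b)]
            bound = (f + sum(gj * dj for gj, dj in zip(g, d))
                     + sum(dj * dj for dj in d) / (2.0 * t))
            if loss(X, y, new) <= bound + 1e-12:
                break
            t *= shrink
        b = new
        if max(abs(dj) for dj in d) < tol:
            break
    return b

def regularization_path(X, y, groups, lambdas, alpha):
    """Warm starts: each fit begins at the solution for the previous lam."""
    b, path = [0.0] * len(X[0]), []
    for lam in lambdas:  # typically ordered from large to small
        b = fit(X, y, groups, lam, alpha, b)
        path.append(b)
    return path

# Toy data (made up): 4 observations, 4 features in 2 groups of 2.
X = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 0.0], [0.0, 0.0, 0.0, 2.0]]
y = [4.0, -2.0, 0.4, 0.0]
groups = [[0, 1], [2, 3]]
lasso_path = regularization_path(X, y, groups, [3.0, 0.5], alpha=1.0)
```

Because each fit starts from the previous solution, later penalty values usually need only a few iterations, which is the practical point of warm starts on a decreasing grid.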
Keywords: High-dimensional data; Machine learning; Optimization; R package
Year: 2020 PMID: 32933477 PMCID: PMC7493359 DOI: 10.1186/s12859-020-03725-w
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 a Relationship between observed (chronological) and predicted (methylation) age. Each blue dot represents a sample in each class of observed chronological age (3mos, 4mos, etc.). Mean methylation age and error bars are displayed in black for each age class. b Methylation age obtained with seagull vs. SGL. Blue dots represent samples; the dashed line is a regression line with slope 1.
Performance evaluation
| R package | Accuracy parameter | R² | MSE | Non-zero | Time |
|---|---|---|---|---|---|
| SGL | 10⁻⁴ | 0.91 | 12.78 | 8,822 | 45 h 20 min |
| seagull | 10⁻⁴ | 0.92 | 12.38 | 65,463 | 20 min |
| seagull | 10⁻⁵ | 0.92 | 11.57 | 11,823 | 40 min |
| seagull | 10⁻⁶ | 0.92 | 11.79 | 5,095 | 2 h 13 min |
| seagull | 10⁻⁸ | 0.92 | 11.84 | 5,072 | 4 h 50 min |
Accuracy parameter refers to a package-dependent convergence parameter; R² is the squared correlation coefficient and MSE is the mean squared error of chronological and predicted age; Non-zero denotes the number of CpG sites with non-zero effect estimate; Time is the computational time needed to calculate the full regularization path.
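The evaluation measures defined above can be computed as in the following minimal Python sketch. The function names and the toy age values are assumptions for illustration and are not taken from the study.

```python
def mse(observed, predicted):
    """Mean squared error between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def r_squared(observed, predicted):
    """Squared Pearson correlation coefficient of observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)

def count_nonzero(effects):
    """Number of features with a non-zero effect estimate."""
    return sum(1 for e in effects if e != 0.0)

# Toy values (made up): chronological vs. predicted age in months.
observed_age = [3.0, 4.0, 5.0, 6.0]
predicted_age = [3.5, 4.5, 5.5, 6.5]
toy_mse = mse(observed_age, predicted_age)
toy_r2 = r_squared(observed_age, predicted_age)
toy_nonzero = count_nonzero([0.0, 1.2, 0.0, -0.4])
```

In this toy case every prediction is off by the same constant, so the squared correlation is perfect while the MSE is not zero. The two measures capture different aspects of prediction quality, which is why the table reports both.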
Fig. 2 Path of the mean squared error (MSE) of predicted age for each λ. Results of seagull and SGL are shown in blue and violet, respectively. The vertical lines, in the corresponding colors, mark the index in the sequence of λ values with the lowest MSE. The lowest MSEs of seagull and SGL were 11.79 and 12.78, respectively.