Veronica Tozzo, Chloé-Agathe Azencott, Samuele Fiorini, Emanuele Fava, Andrea Trucco, Annalisa Barla.
Abstract
More and more biologists and bioinformaticians turn to machine learning to analyze large amounts of data. In this context, it is crucial to understand which data analysis pipeline is most suitable for achieving reliable results. This choice can be challenging because of a variety of factors, the most important being the data type and the general goal of the analysis (e.g., explorative or predictive). Life science data sets require further consideration, as they often contain measurements with a low signal-to-noise ratio, high-dimensional observations, and relatively few samples. In this complex setting, regularization, which can be defined as the introduction of additional information to solve an ill-posed problem, is the tool of choice to obtain robust models. Different regularization practices may be used depending on both the characteristics of the data and the question asked, and different choices may lead to different results. In this article, we provide a comprehensive description of the impact and importance of regularization techniques in life science studies. In particular, we provide an intuition for what regularization is and for the different ways it can be implemented and exploited. We propose four general life sciences problems in which regularization is fundamental and should be exploited for robustness. For each of these large families of problems, we enumerate different techniques as well as examples and case studies. Lastly, we provide a unified view of how to approach each data type with various regularization techniques.
Keywords: life sciences; regularization; supervised learning; unsupervised learning
Year: 2021 PMID: 33926217 PMCID: PMC8968832 DOI: 10.1089/cmb.2019.0371
Source DB: PubMed Journal: J Comput Biol ISSN: 1066-5277 Impact factor: 1.479
Definition of the Loss Function for Regression and Classification Problems
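As a minimal sketch of what such definitions typically look like, the block below writes out two standard textbook losses (squared loss for regression, logistic loss for classification) and a Tikhonov-penalized least squares fit for a single feature. These are common choices assumed for illustration, not the entries of the table itself; the function names and the one-feature, no-intercept setting are hypothetical simplifications.

```python
import math


def squared_loss(y_true, y_pred):
    """Squared error, a common loss for regression."""
    return (y_true - y_pred) ** 2


def logistic_loss(y_true, score):
    """Logistic loss log(1 + exp(-y * f(x))) for labels in {-1, +1}."""
    margin = y_true * score
    # Two algebraically equivalent branches, chosen to avoid overflow in exp.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def ridge_fit_1d(xs, ys, lam):
    """Minimize sum_i (y_i - w * x_i)^2 + lam * w^2 over a single weight w.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + lam).
    Assumption for this sketch: one feature, no intercept, lam >= 0.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


if __name__ == "__main__":
    xs, ys = [1.0, 2.0], [2.0, 4.0]
    print(ridge_fit_1d(xs, ys, lam=0.0))  # unpenalized least squares
    print(ridge_fit_1d(xs, ys, lam=5.0))  # the penalty shrinks w toward zero
```

The penalty term `lam * w^2` is the "additional information" in this toy case: it keeps the denominator positive even when the data alone do not determine `w`, at the cost of shrinking the estimate.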
FIG. 1. Flowchart for arriving at a specific question. In practice, we first need to determine whether the data are labeled: if they are, we are in a supervised learning setting; if not, we are in an unsupervised setting. In the supervised setting, we want to predict the labels, and we can either aim for the best possible prediction or ask which variables are the best predictors. In the unsupervised setting, we can look for patterns in the samples or for relationships among the features.
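The two decisions described in the caption can be sketched as a small lookup. This is only an illustration of the flowchart's logic; the function name and the `goal` labels (`"predict"`, `"select"`, `"samples"`, `"features"`) are hypothetical and chosen for this sketch.

```python
def analysis_question(has_labels: bool, goal: str) -> str:
    """Map the two flowchart decisions to one of the four question types.

    has_labels: True for labeled data (supervised), False otherwise.
    goal: hypothetical label for the second decision in the flowchart.
    """
    questions = {
        (True, "predict"): "supervised: predict the labels as well as possible",
        (True, "select"): "supervised: find the variables that best predict the labels",
        (False, "samples"): "unsupervised: look for patterns in the samples",
        (False, "features"): "unsupervised: look for relationships among the features",
    }
    try:
        return questions[(has_labels, goal)]
    except KeyError:
        # Combinations outside the flowchart (e.g., labeled data with a
        # "samples" goal) are rejected rather than guessed.
        raise ValueError(
            f"unsupported combination: has_labels={has_labels}, goal={goal!r}"
        ) from None


if __name__ == "__main__":
    print(analysis_question(True, "select"))
    print(analysis_question(False, "features"))
```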
Applications Related to the Analysis of Omic-Data of Various Nature
| Data type | Citation | Method | Regularization type |
|---|---|---|---|
| Gene expression (microarrays) | Guyon et al. | Support vector machines | Recursive feature elimination |
| | Bøvelstad et al. | Ridge regression | Tikhonov |
| | Kursa | RF | Tree regularization |
| | Deng and Runger | RF | Gini index regularization |
| | Chen et al. | DL | Dropout |
| | Mascelli et al. | RLS | Elastic-Net |
| | Ma and Huang | RLS | Lasso |
| | De Mol et al. (2009b) | RLS | Elastic-Net |
| | Hughey and Butte | RLS | Elastic-Net |
| | Krämer et al. | Network inference | Lasso |
| Gene expression (RNA-Seq) | Yu et al. (2013) | Negative binomial distribution | Tikhonov |
| | Leung et al. | DL | Lasso, dropout |
| | Tang et al. | Cox model | Lasso |
| | Cheng et al. | Network inference | Group Lasso |
| | Yang et al. (2012a) | Network inference | Lasso |
| Gene expression, CNV | Žitnik and Zupan (2015) | Network inference | Network integration |
| ncRNA-mRNA | Soulé et al. | RF | Ensemble |
| SNPs | Yuan et al. | DL | Tikhonov (weight decay) |
| | Kratsch and McHardy | RLS | Tikhonov |
| | Silver et al. | RLS | Group Lasso |
| | He et al. | Gradient boosting | Boosting and Lasso |
| | Alexandrov et al. | Dictionary learning | Lasso |
| | Piaggio et al. | Dictionary learning | Lasso |
| SNPs, copy number, methylation | Aben et al. | RLS | Elastic-Net |
| Methylation | Johann et al. | RF | Bagging |
| Methylation, gene expression | Csala et al. | RLS | Elastic-Net |
| DNA sequence | Liu et al. | Gradient boosting | Bagging |
| DNA sequence | Libbrecht et al. | Network inference | Graph-based regularization |
| Proteomic | Chen et al. | DL | Tikhonov (weight decay) |
| Microbiome | Ramanan et al. | Network inference | Lasso |
| Protein, tissue, and function information | Zitnik and Leskovec | Multilayer network inference | Tikhonov |
| CNV | Nowak et al. | Dictionary learning | Fused Lasso |
For each type of datum, we provide the specific type of analyzed data, the citation, the machine learning method, and the type of regularization. Note that recursive feature elimination is never explicitly mentioned in the text, but it belongs to the sparsity-inducing regularization techniques; details can be found in Guyon and Elisseeff (2003).
CNV, copy number variation; DL, deep learning; RF, random forest; RLS, regularized least squares.
Applications Related to the Analysis of Biomedical Images and Textual/Clinical Data
| Data category | Data type | Citation | Method | Regularization type |
|---|---|---|---|---|
| Texts | Clinical records | Garg et al. | AdaBoost | Bootstrap |
| Structured text | Insurance claims | Fiorini et al. | DL | Early stopping and dropout |
| Clinical | Patient-centered outcomes | Fiorini et al. | RLS | Nuclear norm, Elastic-Net |
| Images | Brain electron microscopy | Fakhry et al. | DL | Tikhonov (weight decay) |
| | MRI | Schlemper et al. | DL | Data augmentation |
| | MRI | Li et al. | DL | Transfer learning |
| | MRI | Tong et al. | DL | Graph spectral regularization |
| | fMRI | Jenatton et al. | Generalized linear model | Hierarchical group Lasso |
| | MRI | Xin et al. | RLS | Generalized fused Lasso |
| | fMRI | Monti et al. | Network inference | Joint Lasso |
| | Retinal images | Javidi et al. | Dictionary learning | Lasso |
| | sMRI | Li et al. | Dictionary learning | Lasso |
| | Retinal images | Zhou et al. | Dictionary learning | Group Lasso |
| | MR | Ravishankar and Bresler | Dictionary learning | |
For each type of datum, we provide the specific type of analyzed data, the citation, the machine learning method, and the type of regularization.
MR, magnetic resonance; MRI, magnetic resonance imaging; fMRI, functional MRI; sMRI, structural MRI.