Giovanna Cilluffo, Salvatore Fasola, Giuliana Ferrante, Velia Malizia, Laura Montalbano, Stefania La Grutta.
Abstract
This narrative review aims to provide an overview of the main Machine Learning (ML) techniques and their applications in pharmacogenetics (such as antidepressants, anti-cancer drugs and warfarin) over the past 10 years. ML deals with the study, design and development of algorithms that give computers the capability to learn without being explicitly programmed. ML is a sub-field of artificial intelligence and, to date, it has demonstrated satisfactory performance on a wide range of tasks in biomedicine. Depending on the final goal, ML can be classified as Supervised (SML) or Unsupervised (UML). SML techniques are applied when prediction is the focus of the research. UML techniques, on the other hand, are used when the outcome is not known and the goal of the research is to unveil the underlying structure of the data. The increasing use of sophisticated ML algorithms will likely be instrumental in improving knowledge in pharmacogenetics.
Keywords: pharmacogenetics; supervised machine learning; unsupervised machine learning
Year: 2021 PMID: 34680905 PMCID: PMC8535911 DOI: 10.3390/genes12101511
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Word-cloud analysis using the titles of articles obtained with the following search strategy (PubMed): machine learning AND pharmacogenetics. The pre-processing procedures applied were: (1) removing non-English words or common words that do not provide information; (2) changing words to lower case; and (3) removing punctuation and white spaces. The size of each word is proportional to its observed frequency.
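The pre-processing steps listed in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the stop-word list, the function name `title_word_frequencies` and the example titles are assumptions made here, and the removal of non-English words is not implemented. The regular expression also drops digits and non-ASCII letters.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the review does not publish the one it used.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "on", "with", "using"}

def title_word_frequencies(titles):
    """Count informative words across a collection of article titles."""
    counts = Counter()
    for title in titles:
        # Step (2): lower-case; step (3): replace punctuation with spaces.
        cleaned = re.sub(r"[^a-z\s]", " ", title.lower())
        # split() with no arguments also collapses repeated white space.
        words = cleaned.split()
        # Step (1), partially: drop common words that carry no information.
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts

# Made-up titles, for illustration only.
example_titles = [
    "Machine learning for warfarin dosing",
    "Prediction of antidepressant response using machine learning",
]
frequencies = title_word_frequencies(example_titles)
```

In a word cloud, each word would then be drawn with a size proportional to its count in `frequencies`.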
Figure 2. Summary representation of different SML algorithms: examples of regression and classification methods.
Supervised machine learning approaches: strengths and limitations.
| Methods | Strengths | Limitations |
|---|---|---|
| GLM | The response variable can follow any distribution in the exponential family | Affected by noisy data, missing values, multicollinearity and outliers |
| Ridge | Overcomes multicollinearity issues | Increased bias |
| LASSO | Avoids overfitting | Selects only one feature from a group of correlated features |
| EN | Selects more than n predictors when n (sample size) << p (number of variables) | Computationally more expensive than LASSO and Ridge |
| RT | Easy to implement | Computationally expensive |
| RF | High performance and accuracy | Less interpretability |
| SVR | Easy to implement | Unsuitable for large datasets |
| NB | Suitable for multi-class prediction problems | Independence assumption |
| SVM | Suitable for high dimensional settings | No probabilistic explanation for the classification |
| KNN | Easy to implement | Affected by noisy data, missing values and outliers |
| NN | Robust to outliers | Computationally expensive |
GLM: Generalized Linear Model; LASSO: Least Absolute Shrinkage and Selection Operator; EN: Elastic-net; RT: Regression Tree; RF: Random Forest; SVR: Support Vector Regression; NB: Naïve Bayes; SVM: Support Vector Machine; KNN: K-nearest Neighbor; NN: Neural Network; #: number; * overlapping can arise when samples from different classes share similar attribute values.
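The shrinkage methods in the table differ only in the penalty added to the least-squares criterion. As a sketch, in one common parameterization (the notation is an assumption introduced here and does not come from the review: $y$ is the response vector, $X$ the $n \times p$ predictor matrix, $\beta$ the coefficient vector, and $\lambda, \lambda_1, \lambda_2 \ge 0$ tuning parameters):

```latex
\begin{aligned}
\hat{\beta}^{\text{Ridge}} &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \\
\hat{\beta}^{\text{LASSO}} &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \\
\hat{\beta}^{\text{EN}}    &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2
\end{aligned}
```

The squared $\ell_2$ penalty shrinks all coefficients toward zero, which is consistent with the increased bias listed for Ridge. The $\ell_1$ penalty can set coefficients exactly to zero, so LASSO also performs feature selection. EN combines both penalties.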
Summary of the studies using SML approaches.
| Reference | Aim | Included Population | Methodologies | Results |
|---|---|---|---|---|
| Fabbri 2018 | To predict response to antidepressants | 671 patients | NN and RF | The best accuracy among the tested models was achieved by NN |
| Maciukiewicz 2018 | To predict response to antidepressants | 186 patients | RT and SVM | SVM showed the best performance in predicting antidepressant response. |
| Kim 2019 | To study the response to anti-cancer drugs | 1235 samples | EN, SVM and RF | Sophisticated machine |
| Cramer 2019 | To study the response to anti-cancer drugs | 1001 cancer cell lines and 265 drugs | linear regression models | The interaction-based approach contributes to a holistic view of the factors determining drug response. |
| Su 2019 | To study the response to anti-cancer drugs | 33,275 cancer cell lines and 24 drugs | Deep learning and RF | The proposed Deep-Resp-Forest demonstrated the promise of deep learning and deep forest approaches for drug response prediction tasks. |
| Ma 2018 | To study warfarin dosage prediction | 5743 patients | NN, Ridge, RF, SVR and LASSO | Novel regression models were obtained that combine the advantages of distinct machine learning algorithms and significantly improve prediction accuracy compared with linear regression. |
| Liu 2015 | To study warfarin dosage prediction | 3838 patients | NN, RT, SVR, RF and LASSO | Machine learning-based algorithms tended to perform better than multiple linear regression in the low- and high-dose ranges. |
| Sharabiani | To study warfarin dosage prediction | 4237 patients | SVM | A novel methodology for predicting the initial dose was proposed, which relies only on patients’ clinical and demographic data. |
| Truda 2021 | To study warfarin dosage prediction | 5741 patients | Ridge, NN and SVR | SVR was the best performing traditional algorithm, whilst neural networks performed poorly. |
| Li 2015 | To study warfarin dosage prediction | 1295 patients | Linear regression model, NN, RT, SVR and RF | Multiple linear regression was the best performing algorithm. |
LASSO: Least Absolute Shrinkage and Selection Operator; EN: Elastic-net; RT: Regression Tree; RF: Random Forest; SVR: Support Vector Regression; SVM: Support Vector Machine; KNN: K-nearest Neighbor; NN: Neural Network.
Figure 3. Summary representation of different UML algorithms: some examples.
Unsupervised machine learning approaches: strengths and limitations.
| Methods | Strengths | Limitations |
|---|---|---|
| K-means | Reallocation of entities is allowed | |
| K-medoids | Reallocation of entities is allowed | |
| Agglomerative/ Divisive Hierarchical | Easy to implement | Strict hierarchical structure |
| SOM | Reallocation of entities is allowed | |
SOM: self-organizing maps.
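The "reallocation of entities" listed as a strength of K-means can be seen in a small pure-Python sketch of the standard assignment/update iteration. This is an illustration under assumptions made here, not a method taken from the reviewed studies: the function name `k_means_1d`, the one-dimensional data and the starting centroids are all hypothetical.

```python
def k_means_1d(values, centroids, max_iter=100):
    """Lloyd-style k-means on scalar data, starting from the given centroids."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: (v - centroids[j]) ** 2)
            for v in values
        ]
        if new_assignment == assignment:
            break  # no entity changed cluster, so the procedure has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its current members.
        for j in range(len(centroids)):
            members = [v for v, a in zip(values, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = sum(members) / len(members)
    return assignment, centroids

# Made-up data with deliberately poor starting centroids.
labels, centers = k_means_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [1.0, 2.0])
```

With these starting centroids, the values 2.0 and 3.0 are first assigned to the second cluster and are moved to the first one on the next pass. Hierarchical methods do not revise earlier merges or splits in this way, which matches the "strict hierarchical structure" limitation in the table.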
Summary of the studies using UML approaches.
| Reference | Aim | Included population | Methodologies | Results |
|---|---|---|---|---|
| Tao 2020 | To balance the dataset of patients treated with warfarin and improve the predictive accuracy. | 592 patients | Cluster analysis | The algorithm detects the minority group, based on the association between the clinical features/genotypes and the warfarin dosage. |
| Kautzky 2015 | To combine the effects of genetic polymorphisms and clinical parameters on treatment outcome in treatment-resistant depression. | 225 patients | Cluster analysis | Cluster analysis identified 5 clusters of patients significantly associated with treatment response. |