| Literature DB >> 32477189 |
Yi Xu, Liyu Cao, Xinyi Zhao, Yinghao Yao, Qiang Liu, Bin Zhang, Yan Wang, Ying Mao, Yunlong Ma, Jennie Z Ma, Thomas J Payne, Ming D Li, Lanjuan Li.
Abstract
Smoking is a complex behavior with a heritability of up to 50%. Given such a large genetic contribution, predicting inherited predisposition from genomic profiles offers an opportunity to identify individuals susceptible to smoking dependence before they ever start smoking. Although previous studies have identified many susceptibility variants for smoking, these variants have limited power to predict smoking behavior. We applied support vector machine (SVM) and random forest (RF) methods to build prediction models for smoking behavior. We first used 1,431 smokers and 1,503 non-smokers of African origin for model building with 10-fold cross-validation and then tested the prediction models on an independent dataset of 213 smokers and 224 non-smokers. The SVM model with the top 500 single nucleotide polymorphisms (SNPs) selected by logistic regression (P < 0.01) achieved an area under the curve (AUC) of 0.691, 0.721, and 0.720 for the training, test, and independent test samples, respectively. The RF model with the top 500 SNPs selected in the same way achieved AUCs of 0.671, 0.665, and 0.667 for the training, test, and independent test samples, respectively. Finally, we combined logistic regression (P < 0.01) and LASSO regression (λ = 10−3) for feature selection and used the SVM algorithm for model building; the resulting SVM model with the top 500 SNPs achieved AUCs of 0.756, 0.776, and 0.897 for the training, test, and independent test samples, respectively. We conclude that machine learning methods are a promising means of building predictive models for smoking.
Keywords: feature selection; machine learning; prediction of smoking; single-nucleotide polymorphisms; smoking
Year: 2020 PMID: 32477189 PMCID: PMC7241440 DOI: 10.3389/fpsyt.2020.00416
Source DB: PubMed Journal: Front Psychiatry ISSN: 1664-0640 Impact factor: 4.157
Characteristics of datasets used for machine learning.
| Characteristic | Training and test samples (N = 2,934) | Independent test sample (N = 437) |
|---|---|---|
| Mean age, years | 42.8 ± 13.5 | 39.8 ± 13.3 |
| Females, n (%) | 1,590 (54.2) | 218 (49.9) |
| Smokers, n (%) | 1,431 (48.8) | 213 (48.7) |
Figure 1. Flowchart of the machine learning process used in the study.
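The workflow the flowchart describes (per-SNP filtering, then an SVM scored by AUC under 10-fold cross-validation) can be sketched on synthetic genotypes. This is not the authors' code: the data are random 0/1/2 genotype codes with a planted signal, and a univariate F-test stands in for the per-SNP logistic-regression p-values used in the paper.

```python
# Sketch of the filter-then-classify pipeline on synthetic data.
# Feature selection runs inside the pipeline, so each CV fold selects
# SNPs on its own training split only (no information leakage).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, p, k = 600, 2000, 50                              # samples, SNPs, SNPs kept
X = rng.integers(0, 3, size=(n, p)).astype(float)    # 0/1/2 genotype codes
y = rng.integers(0, 2, size=n)                       # smoker / non-smoker labels
X[y == 1, :20] += 0.4                                # weak planted signal in 20 SNPs

pipe = make_pipeline(SelectKBest(f_classif, k=k),    # univariate SNP filter
                     StandardScaler(),
                     SVC(kernel="linear"))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
aucs = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(round(float(aucs.mean()), 3))
```

Keeping the selection step inside the cross-validated pipeline matters: filtering SNPs on the full dataset before splitting would inflate the reported AUC.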
Number of single nucleotide polymorphisms (SNPs) selected from logistic regression, least absolute shrinkage and selection operator (LASSO) regression, and the combined logistic regression and LASSO regression under different significance thresholds.
| Method | Significance threshold | No. of selected SNPs |
|---|---|---|
| Logistic regression | P < 0.05 | 18,078 |
| | P < 0.01 | 3,808 |
| LASSO regression | λ = 10−3 | 3,518 |
| | λ = 10−5 | 9,034 |
| | λ = 10−7 | 46,321 |
| Logistic + LASSO | P < 0.01, λ = 10−3 | 1,149 |
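The LASSO rows above show that the penalty weight λ controls how many SNPs survive selection: a smaller λ penalizes coefficients less, so more SNPs keep non-zero weights. A minimal sketch on random data, assuming scikit-learn's L1-penalized logistic regression (which parameterizes the penalty as C = 1/λ) as a stand-in for the paper's LASSO step:

```python
# Sketch: L1-penalized (LASSO-style) logistic regression as a SNP filter.
# SNPs whose coefficients are shrunk exactly to zero are discarded.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(400, 500)).astype(float)  # 0/1/2 genotype codes
y = rng.integers(0, 2, size=400)

kept = {}
for lam in (1.0, 1e-3):
    model = LogisticRegression(penalty="l1", C=1.0 / lam,
                               solver="liblinear", max_iter=5000)
    model.fit(X, y)
    kept[lam] = int(np.count_nonzero(model.coef_))    # SNPs retained at this λ
print(kept)
```

As in the table, the weaker penalty (λ = 10−3) retains more SNPs than the stronger one.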
Figure 2. Predictive performance of support vector machine (SVM) models based on two feature selection methods with different parameters: logistic regression (Log R) and least absolute shrinkage and selection operator (LASSO) regression (Las R). (A) Performance on the training sample with 10-fold cross-validation; (B) performance on the test sample; (C) performance on the independent test sample.
Figure 3. Predictive performance of random forest (RF) models based on two feature selection methods with different parameters: logistic regression (Log R) and least absolute shrinkage and selection operator (LASSO) regression (Las R). (A) Performance on the training sample with 10-fold cross-validation; (B) performance on the test sample; (C) performance on the independent test sample.
Area under the curve (AUC) values of machine learning models using logistic regression (P < 0.01) for feature selection.
| No. of SNPs | SVM, test sample | SVM, independent test sample | RF, test sample | RF, independent test sample |
|---|---|---|---|---|
| 10 | 0.552 | 0.563 | 0.528 | 0.549 |
| 100 | 0.686 | 0.648 | 0.636 | 0.601 |
| 500 | 0.721 | 0.720 | 0.665 | 0.667 |
| 1,000 | 0.742 | 0.738 | 0.681 | 0.673 |
| 1,500 | 0.764 | 0.755 | 0.708 | 0.669 |
| 2,000 | 0.785 | 0.764 | 0.708 | 0.677 |
Comparison of area under the curve (AUC) values for support vector machine (SVM) models under three feature selection methods: logistic regression, least absolute shrinkage and selection operator (LASSO) regression, and combined logistic and LASSO regression.
| No. of SNPs | Logistic regression (P < 0.01), test sample | Logistic regression (P < 0.01), independent test sample | LASSO regression, test sample | LASSO regression, independent test sample | Logistic (P < 0.01) + LASSO (λ = 10−3), test sample | Logistic (P < 0.01) + LASSO (λ = 10−3), independent test sample |
|---|---|---|---|---|---|---|
| 10 | 0.552 | 0.563 | 0.502 | 0.503 | 0.586 | 0.608 |
| 100 | 0.686 | 0.648 | 0.611 | 0.550 | 0.684 | 0.764 |
| 500 | 0.721 | 0.720 | 0.773 | 0.546 | 0.776 | 0.897 |
| 1,000 | 0.742 | 0.738 | 0.877 | 0.517 | 0.812 | 0.911 |
Figure 4. Predictive performance of support vector machine (SVM) models based on the combination of logistic regression (P < 0.01) and least absolute shrinkage and selection operator (LASSO) regression (λ = 10−3) in the training, test, and independent test samples.
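The best-performing setup in the tables keeps only SNPs that pass both filters and then trains an SVM. A minimal sketch of that combined selection on synthetic data, again with a univariate F-test substituting for the per-SNP logistic-regression p-values and a held-out split substituting for the independent test sample:

```python
# Sketch of the combined filter: keep SNPs passing both a univariate
# p-value cut (P < 0.01) and the L1-logistic (LASSO) non-zero support,
# then score a linear SVM on a held-out split.
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.integers(0, 3, size=(800, 1000)).astype(float)  # 0/1/2 genotype codes
y = rng.integers(0, 2, size=800)
X[y == 1, :30] += 0.5                                   # planted signal in 30 SNPs

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)
_, pvals = f_classif(X_tr, y_tr)                        # univariate p-values
lasso = LogisticRegression(penalty="l1", C=1e3, solver="liblinear",
                           max_iter=5000).fit(X_tr, y_tr)
mask = (pvals < 0.01) & (lasso.coef_.ravel() != 0)      # intersection of filters

svm = SVC(kernel="linear").fit(X_tr[:, mask], y_tr)
auc = roc_auc_score(y_te, svm.decision_function(X_te[:, mask]))
print(int(mask.sum()), round(float(auc), 3))
```

Both filters are fit on the training split only, mirroring the leakage-free evaluation the paper's independent test sample provides.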