James W Baurley, Andrew W Bergen, Carolyn M Ervin, Sung-Shim Lani Park, Sharon E Murphy, Christopher S McMahan.
Abstract
BACKGROUND: There is a need to match characteristics of tobacco users with cessation treatments and risks of tobacco-attributable diseases such as lung cancer. The rate at which the body metabolizes nicotine has proven to be an important predictor of these outcomes. Nicotine metabolism is primarily catalyzed by the enzyme cytochrome P450 2A6 (CYP2A6), and CYP2A6 activity can be measured as the ratio of two nicotine metabolites, trans-3'-hydroxycotinine to cotinine (the nicotine metabolite ratio, NMR). Measurements of these metabolites are only possible in current tobacco users and vary by biofluid source, timing of collection, and protocols; unfortunately, this has limited their use in clinical practice. The NMR depends highly on genetic variation near CYP2A6 on chromosome 19 as well as ancestry, environmental, and other genetic factors. Thus, we aimed to develop prediction models of nicotine metabolism using genotypes and basic individual characteristics (age, gender, height, and weight).
Keywords: Machine learning; Nicotine biomarkers; Nicotine metabolism; Polygenic risk score; Smoking cessation; Statistical learning
Year: 2022 PMID: 36131240 PMCID: PMC9490935 DOI: 10.1186/s12864-022-08884-z
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 4.547
Fitting methods and tuning parameter configurations. Listed are the training parameters considered for partial least squares (PLS), projection pursuit regression (PPR), elastic net (ENet), support vector machine with a linear kernel (SVM-L), support vector machine with a radial basis function kernel (SVM-R), gradient boosting machine (GBM), and random forests (RF), together with the model fitting method used for each.
| Model | Tuning grid | Method |
|---|---|---|
| PLS | Number of components | pls |
| PPR | Number of terms | ppr |
| ENet | Mixing percentage | glmnet |
| | Penalty parameter | |
| SVM-L | Cost parameter | svmLinear |
| SVM-R | Cost parameter | svmRadialSigma |
| | RBF kernel parameter | |
| GBM | Interaction depth | gbm |
| | Number of trees | |
| | Shrinkage: 0.1 | |
| | Minimum number of observations in a node: 10 | |
| RF | Number of randomly selected predictors | rf |
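To make the idea of a tuning grid concrete, the following minimal sketch enumerates candidate configurations for the GBM row. The candidate values for interaction depth and number of trees are hypothetical (the table names the tuned parameters but not the values tried), and the helper `expand_grid` is an illustrative name, not part of any library. Shrinkage and minimum node size appear with single values in the table, which is read here as held fixed.

```python
from itertools import product

# Hypothetical candidate values: the table lists which parameters were
# tuned, not the values that were tried.
GBM_GRID = {
    "interaction_depth": [1, 2, 3],
    "n_trees": [50, 100, 150],
}

# Single values given in the table, assumed here to be held fixed.
GBM_FIXED = {"shrinkage": 0.1, "min_obs_in_node": 10}


def expand_grid(grid, fixed=None):
    """Return one dict per combination of tunable values, plus fixed settings."""
    names = list(grid)
    configs = []
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        config.update(fixed or {})
        configs.append(config)
    return configs


gbm_configs = expand_grid(GBM_GRID, GBM_FIXED)
```

Each resulting configuration would be fitted and scored by cross-validation, and the best-scoring one retained for that model.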
Fig. 1 Assessment of the seven models in the training data (MEC). Models were trained using projection pursuit regression (PPR), partial least squares (PLS), support vector machine with a linear kernel (SVM_lin), elastic net (GLMNET), random forests (RF), support vector machine with a radial basis function kernel (SVM_rad_sig), and gradient boosting machine (GBM). Model performance was assessed using mean absolute error (MAE), root mean squared error (RMSE), and R-squared. The boxplots summarize these metrics across 100 cross-validation datasets. Performance was similar across the models, justifying the use of an average of predictions in the ensemble model.
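The three metrics named in the caption can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' code: R-squared is taken here to be the squared Pearson correlation between observed and predicted values (other definitions exist), and the toy numbers are invented.

```python
import math


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean squared error."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_squared(observed, predicted):
    """R-squared, assumed here to be the squared Pearson correlation."""
    return pearson_r(observed, predicted) ** 2


# Invented toy values, for illustration only.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.0, 2.5, 4.0]
toy_metrics = (mae(obs, pred), rmse(obs, pred), r_squared(obs, pred))
```

In a cross-validation setting, these functions would be applied to each held-out fold, giving one value per metric per resample.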
Correlations between predicted and observed NMRs by study, ancestry, and model. The correlations between predicted and observed NMRs were summarized overall and by genomic ancestry (ancestry proportion). The MEC was the largest and most diverse sample and was used to train models with partial least squares (PLS), projection pursuit regression (PPR), elastic net (ENet), support vector machine with a linear kernel (SVM-L), support vector machine with a radial basis function kernel (SVM-R), gradient boosting machine (GBM), and random forests (RF). Predictions from these seven models were averaged in an ensemble model. The MEC-trained models were applied to CENIC, HSS, and METS for validation.
| Study | Ancestry | PLS | PPR | ENet | SVM-L | SVM-R | GBM | RF | Ensemble | N |
|---|---|---|---|---|---|---|---|---|---|---|
| MEC | African | 0.60 | 0.67 | 0.58 | 0.75 | 0.58 | 0.76 | 0.97 | 0.76 | 342 |
| | Asian | 0.71 | 0.76 | 0.70 | 0.79 | 0.69 | 0.81 | 0.97 | 0.82 | 995 |
| | European | 0.50 | 0.56 | 0.49 | 0.63 | 0.48 | 0.64 | 0.97 | 0.67 | 902 |
| | Overall | 0.71 | 0.75 | 0.70 | 0.78 | 0.69 | 0.79 | 0.97 | 0.81 | 2239 |
| CENIC | African | 0.35 | 0.33 | 0.32 | 0.33 | 0.32 | 0.41 | 0.41 | 0.37 | 111 |
| | Asian | 0.71 | 0.28 | 0.68 | 0.60 | 0.67 | 0.51 | 0.50 | 0.61 | 9 |
| | European | 0.42 | 0.38 | 0.43 | 0.41 | 0.37 | 0.52 | 0.47 | 0.46 | 395 |
| | Overall | 0.42 | 0.37 | 0.41 | 0.39 | 0.37 | 0.50 | 0.45 | 0.45 | 515 |
| HSS | African | | | | | | | | | 1 |
| | Asian | 0.51 | 0.29 | 0.55 | 0.59 | 0.53 | 0.55 | 0.58 | 0.56 | 308 |
| | European | 0.42 | 0.33 | 0.42 | 0.37 | 0.36 | 0.42 | 0.39 | 0.43 | 271 |
| | Overall | 0.53 | 0.37 | 0.56 | 0.56 | 0.53 | 0.55 | 0.56 | 0.56 | 580 |
| METS | African | 0.43 | 0.30 | 0.50 | 0.55 | 0.41 | 0.46 | 0.45 | 0.52 | 48 |
| | Asian | 0.37 | 0.03 | 0.39 | 0.47 | 0.41 | 0.47 | 0.47 | 0.43 | 51 |
| | European | 0.38 | 0.10 | 0.40 | 0.45 | 0.42 | 0.44 | 0.41 | 0.43 | 216 |
| | Overall | 0.36 | 0.09 | 0.42 | 0.42 | 0.39 | 0.45 | 0.45 | 0.40 | 315 |
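The ensemble column and the per-ancestry rows can be illustrated with a short sketch: predictions from several models are averaged participant by participant, and the Pearson correlation with the observed values is then computed within each group and overall. The function names, group labels, and toy numbers below are hypothetical, and skipping groups with fewer than two participants is an assumption made here, consistent with the empty row for a group with N = 1.

```python
import math
from collections import defaultdict


def ensemble_mean(predictions_by_model):
    """Average predictions across models, participant by participant."""
    columns = list(predictions_by_model.values())
    return [sum(values) / len(values) for values in zip(*columns)]


def pearson_r(x, y):
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def correlation_by_group(observed, predicted, groups):
    """Correlation within each group (two or more members) and overall."""
    buckets = defaultdict(lambda: ([], []))
    for o, p, g in zip(observed, predicted, groups):
        buckets[g][0].append(o)
        buckets[g][1].append(p)
    result = {
        g: pearson_r(obs, pred)
        for g, (obs, pred) in buckets.items()
        if len(obs) >= 2  # a correlation is undefined for a single participant
    }
    result["Overall"] = pearson_r(observed, predicted)
    return result


# Invented toy values, for illustration only.
toy_predictions = {
    "model_a": [1.0, 2.0, 3.0, 4.0],
    "model_b": [2.0, 2.0, 5.0, 6.0],
}
toy_observed = [1.0, 2.0, 3.0, 5.0]
toy_groups = ["g1", "g1", "g2", "g2"]

toy_ensemble = ensemble_mean(toy_predictions)
toy_correlations = correlation_by_group(toy_observed, toy_ensemble, toy_groups)
```

A simple unweighted mean gives every model equal influence; a weighted average would be a different design and is not what the caption describes.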
Fig. 2 Observed versus predicted NMR values for the training (MEC) and validation (CENIC, HSS, and METS) data. The predicted NMR is the average of the predictions from the seven models (i.e., the ensemble model). The correlation between these values is displayed at the upper left of each scatterplot. The distributions of NMRs differed across studies, yet the correlations were still strong in the validation datasets.
Participant characteristics for the training data (MEC) and the three validation datasets (CENIC, HSS, and METS)
| | MEC (N=2239) | CENIC (N=515) | HSS (N=580) | METS (N=315) |
|---|---|---|---|---|
| Mean (SD) | 63.9 (7.19) | 43.4 (13.2) | 60.6 (9.40) | 33.8 (10.9) |
| Median [Min, Max] | 63.0 [45.0, 86.0] | 44.0 [18.0, 75.0] | 60.6 [19.6, 83.2] | 30.0 [18.0, 69.0] |
| Mean (SD) | 26.3 (5.31) | 30.0 (6.70) | 27.1 (6.10) | 25.6 (4.75) |
| Median [Min, Max] | 25.6 [11.3, 62.8] | 29.2 [15.2, 56.0] | 26.1 [14.4, 54.8] | 24.9 [15.9, 49.1] |
| Missing | 0 (0%) | 3.00 (0.5%) | 0 (0%) | 0 (0%) |
| Male | 1040 (46.4%) | 306 (59.4%) | 284 (49.0%) | 141 (44.8%) |
| Female | 1199 (53.6%) | 209 (40.6%) | 296 (51.0%) | 174 (55.2%) |
| Yes | 2239 (100%) | 515 (100%) | 580 (100%) | 120 (38.1%) |
| No | 0 (0%) | 0 (0%) | 0 (0%) | 195 (61.9%) |
| African American | 364 (16.3%) | 107 (20.8%) | 0 (0%) | 49 (15.6%) |
| American Indian/Alaskan Native | 0 (0%) | 5 (1.0%) | 0 (0%) | 0 (0%) |
| Asian American | 0 (0%) | 6 (1.2%) | 0 (0%) | 51 (16.2%) |
| Multirace | 0 (0%) | 25 (4.9%) | 0 (0%) | 0 (0%) |
| White | 437 (19.5%) | 372 (72.2%) | 197 (34.0%) | 215 (68.3%) |
| Japanese American | 674 (30.1%) | 0 (0%) | 191 (32.9%) | 0 (0%) |
| Native Hawaiian/Pacific Islander | 311 (13.9%) | 0 (0%) | 192 (33.1%) | 0 (0%) |
| Latino | 453 (20.2%) | 0 (0%) | 0 (0%) | 0 (0%) |
| Mean (SD) | 1.11 (0.898) | 1.50 (0.720) | -0.324 (0.959) | -1.58 (0.548) |
| Median [Min, Max] | 1.20 [-3.91, 3.60] | 1.54 [-2.54, 3.35] | -0.267 [-3.30, 2.90] | -1.57 [-3.22, -0.240] |
Fig. 3 Summary of the most important candidate variables in the NMR models. Variable importance was ranked for each of the seven models trained on the MEC data, and the number of models in which each variable appeared in the top 20 was counted. Asian ancestry and rs56113850 were influential variables in all the models.
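The counting step described in the caption can be sketched as follows: rank variables by importance within each model, keep the top k, and tally how many models each variable appears in. The function name `top_k_counts`, the variable names, and the scores are all hypothetical, and the example uses k = 2 rather than 20 to keep it small.

```python
from collections import Counter


def top_k_counts(importance_by_model, k=20):
    """Count how many models rank each variable among their top k."""
    counts = Counter()
    for scores in importance_by_model.values():
        # Sort variable names by importance score, highest first.
        ranked = sorted(scores, key=scores.get, reverse=True)
        counts.update(ranked[:k])
    return counts


# Invented importance scores, for illustration only.
toy_importance = {
    "model_a": {"variable_1": 0.9, "variable_2": 0.8, "variable_3": 0.1},
    "model_b": {"variable_2": 0.7, "variable_1": 0.6, "variable_4": 0.2},
    "model_c": {"variable_1": 0.5, "variable_3": 0.4, "variable_2": 0.3},
}

toy_counts = top_k_counts(toy_importance, k=2)
```

A variable whose count equals the number of models is in the top k everywhere, which is the sense in which the caption calls a variable influential in all the models.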