Polina Mamoshina, Kirill Kochetov, Franco Cortese, Anna Kovalchuk, Alexander Aliper, Evgeny Putin, Morten Scheibye-Knudsen, Charles R Cantor, Neil M Skjodt, Olga Kovalchuk, Alex Zhavoronkov.
Abstract
There is an association between smoking and cancer, cardiovascular disease and all-cause mortality. However, currently, there are no affordable and informative tests for assessing the effects of smoking on the rate of biological aging. In this study we demonstrate for the first time that smoking status can be predicted using blood biochemistry and cell count results and the recent advances in artificial intelligence (AI). By employing age-prediction models developed using supervised deep learning techniques, we found that smokers exhibited higher aging rates than nonsmokers, regardless of their cholesterol ratios and fasting glucose levels. We further used those models to quantify the acceleration of biological aging due to tobacco use. Female smokers were predicted to be twice as old as their chronological age compared to nonsmokers, whereas male smokers were predicted to be one and a half times as old as their chronological age compared to nonsmokers. Our findings suggest that deep learning analysis of routine blood tests could complement or even replace the current error-prone method of self-reporting of smoking status and could be expanded to assess the effect of other lifestyle and environmental factors on aging.
Year: 2019 PMID: 30644411 PMCID: PMC6333803 DOI: 10.1038/s41598-018-35704-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Deep learning-based blood-biochemistry clocks accurately predict chronological age. (A) Prediction accuracy of the best-performing model. The model trained on 24 parameters achieved an R² of 0.57 and an MAE of 5.7 years. (B) The design of the deep learning study that used blood-biochemistry data to predict an individual’s age. Blood samples of nonsmokers were first preprocessed and normalized as previously described[8]. Next, arbitrage ranking based on 320 RF models was applied to facilitate the selection of the most appropriate feature space with maximum samples available. Afterward, missing values were reconstructed using an autoregressive model with a view towards increasing the training sets, and the resulting feature sets were used to train and test DNNs for predicting patient age and smoking status. (C) Feature importance plot. Fasting glucose, sex, and RDW exhibited higher relative importance scores than other features used in model training. Note: HDL, high-density lipoprotein; LDL, low-density lipoprotein; RDW, red blood cell distribution width; RBC, red blood cell count; MCV, mean corpuscular volume; ALT, alanine transaminase; MCHC, mean corpuscular hemoglobin concentration.
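The two regression metrics quoted in this caption, MAE and R², can be computed from paired chronological and predicted ages. The following is a minimal sketch using only the Python standard library; the function names and the sample ages are hypothetical and are not taken from the study.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted ages (years)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical ages, for illustration only (not data from the paper).
chronological = [25.0, 40.0, 55.0, 70.0]
predicted = [30.0, 38.0, 50.0, 73.0]

mae = mean_absolute_error(chronological, predicted)
r2 = r_squared(chronological, predicted)
print(f"MAE = {mae:.2f} years, R2 = {r2:.3f}")
```

A lower MAE and a higher R² both indicate that predicted ages track chronological ages more closely.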
Figure 2. Deep learning-based hematological clocks demonstrated accelerated aging rates in smokers and revealed patient smoking status. (A) The prediction accuracy of the best-performing model trained on a feature space extended with smoking status. The model, trained on 24 parameters, achieved an R² of 0.60 and an MAE of 5.42 years. (B) The log2 aging ratio of smokers to nonsmokers by age and sex groups for the best-performing model. Smokers demonstrated a higher aging rate regardless of sex. However, these differences plateaued after 55 years of age. A log2 aging ratio of 1 means the sample was predicted to be twice as old as its chronological age, and a log2 aging ratio of −1 means the sample was predicted to be half as old as its chronological age. (C) The most important features in the classification of smoking status selected by the PFI method. HDL cholesterol, sex, and hemoglobin exhibited higher relative importance scores than other features used in model training. (D) The model trained on 23 parameters achieved an F1 score of 0.67 and an accuracy of 0.84. Note: HDL, high-density lipoprotein; LDL, low-density lipoprotein; RDW, red blood cell distribution width; RBC, red blood cell count; MCV, mean corpuscular volume; ALT, alanine transaminase; MCHC, mean corpuscular hemoglobin concentration.
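The per-sample log2 aging ratio described in panel (B) can be written as the base-2 logarithm of predicted age over chronological age. This is a minimal sketch under that reading of the caption; the function name is hypothetical, and how the study aggregates ratios across smokers and nonsmokers is not shown here.

```python
import math


def log2_aging_ratio(predicted_age, chronological_age):
    """log2(predicted / chronological): 1 means twice as old, -1 means half as old."""
    return math.log2(predicted_age / chronological_age)


# Hypothetical values, for illustration only.
print(log2_aging_ratio(60.0, 30.0))  # predicted twice the chronological age
print(log2_aging_ratio(20.0, 40.0))  # predicted half the chronological age
print(log2_aging_ratio(45.0, 45.0))  # predicted equal to chronological age
```

The log scale makes the ratio symmetric around zero, so over- and under-prediction by the same factor have equal magnitude and opposite sign.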
Figure 3. Confusion matrices. (A) Confusion matrices for the best-performing smoking status classifier, trained on 23 features, in number of samples (left) and percentage (right). Row values show predicted smoking status, and columns show actual smoking status. Most of the erroneous smoking status predictions occurred in individuals older than 55 years. (B) Confusion matrices for age prediction by age groups for the best model, trained on 24 parameters, in number of samples (left) and percentage (right). Row values show actual chronological age group, and columns show predicted age group. Smokers of age groups < 30 and 30–40 were mostly predicted to be older.
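Accuracy, precision, recall and F1 score for a binary smoking status classifier all follow from the four cells of a confusion matrix like the one in panel (A). The sketch below assumes "smoker" is the positive class; the function name and the counts are hypothetical and are not the values from the figure.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive standard binary metrics from confusion-matrix counts.

    tp: smokers predicted as smokers      fp: nonsmokers predicted as smokers
    fn: smokers predicted as nonsmokers   tn: nonsmokers predicted as nonsmokers
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only (not taken from Figure 3).
metrics = classification_metrics(tp=60, fp=20, fn=40, tn=380)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because nonsmokers usually outnumber smokers, accuracy can look high even when recall for smokers is modest, which is why F1 is reported alongside it.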
Prediction accuracy of the three top-performing models after rounds of optimization.
| Model | No. of features | MAE (years) | | | R² |
|---|---|---|---|---|---|
| Age predictor trained on 23 features | 23 | 5.722 | 0.76 | 0.803 | 0.56 |
| Age predictor trained on 20 features | 20 | 5.777 | 0.75 | 0.801 | 0.5376 |
| Age predictor trained on 18 features | 18 | 5.898 | 0.75 | 0.802 | 0.55 |
| Age predictor trained on 24 features | 24 | 5.61 | 0.78 | 0.82 | 0.578 |
| Age predictor trained on 21 features | 21 | 5.401 | 0.77 | 0.815 | 0.58 |
| Age predictor trained on 19 features | 19 | 5.416 | 0.77 | 0.817 | 0.60 |

| Model | No. of features | Accuracy | | | F1 score |
|---|---|---|---|---|---|
| Smoking status classifier trained on 23 features | 23 | 0.829 | 0.754 | 0.606 | 0.673 |
| Smoking status classifier trained on 20 features | 20 | 0.822 | 0.726 | 0.61 | 0.664 |
| Smoking status classifier trained on 18 features | 18 | 0.82 | 0.708 | 0.603 | 0.638 |
Figure 4. Log2 aging ratios for the four groups: cholesterol ratio > 4 and fasting glucose > 5 mmol/L; cholesterol ratio > 4 and fasting glucose <= 5 mmol/L; cholesterol ratio <= 4 and fasting glucose > 5 mmol/L; and cholesterol ratio <= 4 and fasting glucose <= 5 mmol/L. Smokers in the < 30 and 31–40 age groups are predicted to be older regardless of their cholesterol ratio and fasting glucose level. A log2 aging ratio of 1 means that the sample is predicted to be twice as old as its chronological age, and a log2 aging ratio of −1 means that the sample is predicted to be half as old. Bars indicate standard deviation.
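The stratification in this caption crosses two thresholds, a cholesterol ratio of 4 and a fasting glucose of 5 mmol/L, which yields four groups. The following is a minimal sketch of that assignment; the function name, the label format and the sample values are hypothetical, and the cholesterol ratio is taken as a given input because its formula is not stated here.

```python
from collections import defaultdict


def metabolic_group(cholesterol_ratio, fasting_glucose_mmol_l):
    """Assign a sample to one of four groups using the thresholds in the caption."""
    chol = "> 4" if cholesterol_ratio > 4 else "<= 4"
    gluc = "> 5" if fasting_glucose_mmol_l > 5 else "<= 5"
    return f"cholesterol ratio {chol}, fasting glucose {gluc} mmol/L"


# Hypothetical (cholesterol ratio, fasting glucose in mmol/L) pairs.
samples = [(4.5, 5.6), (4.2, 4.8), (3.1, 5.3), (3.9, 4.7), (4.0, 5.0)]

groups = defaultdict(list)
for ratio, glucose in samples:
    groups[metabolic_group(ratio, glucose)].append((ratio, glucose))

for label, members in groups.items():
    print(label, len(members))
```

Values exactly at a threshold fall into the "<=" group, matching the inequalities in the caption.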