Jason G Fleischer, Roberta Schulte, Hsiao H Tsai, Swati Tyagi, Arkaitz Ibarra, Maxim N Shokhirev, Ling Huang, Martin W Hetzer, Saket Navlakha.
Abstract
Biomarkers of aging can be used to assess the health of individuals and to study aging and age-related diseases. We generate a large dataset of genome-wide RNA-seq profiles of human dermal fibroblasts from 133 people aged 1 to 94 years old to test whether signatures of aging are encoded within the transcriptome. We develop an ensemble machine learning method that predicts age to a median error of 4 years, outperforming previous methods used to predict age. The ensemble was further validated by testing it on ten progeria patients, and our method is the only one that predicts accelerated aging in these patients.
Keywords: Aging; Biological age; Biomarker; Ensemble classifiers; Machine learning; RNA-seq; Skin fibroblasts
Year: 2018 PMID: 30567591 PMCID: PMC6300908 DOI: 10.1186/s13059-018-1599-6
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1. Predicting age from gene expression data. Rows from top to bottom show age prediction results for LDA ensemble with 20-year age bins, elastic net, linear regression, and support vector regression. Model parameters are shown in Table 1. Column (A): Leave-one-out cross-validation predictions for 133 healthy individuals. Dots are plotted for each individual showing predicted age (y-axis) vs. true age (x-axis), with a line of best fit overlaid, and a shadow showing the 95% confidence interval of that line determined through bootstrap resampling of the dots. Text on the bottom of each panel shows performance metrics of mean absolute error (MAE), median absolute error (MED), and R² goodness-of-fit for the line of best fit. The dotted line is the ideal line, where true age equals predicted age. Column (B): The effect of training set size (x-axis) on the mean absolute error of the ensemble (y-axis). The slope of the best fit line indicates the rate at which age prediction error would decrease with additional samples. Dots indicate mean absolute error from each fold of 2 × 10 cross-validation (y-axis) for varying sizes of random subset of the data (x-axis). A line of best fit and 95% confidence interval is shown. Column (C): Box plots of age predictions of progeria patients (red) and leave-one-out cross-validation predictions of age-matched healthy controls (blue). Box limits denote 25th and 75th percentiles, line is median, whiskers are 1.5× interquartile range, and dots are predictions outside the whiskers' range. The ensemble method is the only method that predicts significantly higher ages for progeria patients. Progeria patients: n = 10, mean ± std. of true age 5.5 ± 2.4; age-matched controls: n = 12, mean ± std. of true age 5.0 ± 2.9.
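The three metrics named in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `age_prediction_metrics` and the toy ages are hypothetical. The caption describes R² as the goodness-of-fit of the line of best fit, but Table 1 also reports negative values, which a least-squares line with an intercept cannot produce, so the sketch assumes the coefficient of determination of predicted versus true age.

```python
from statistics import mean, median


def age_prediction_metrics(true_ages, predicted_ages):
    """Return (MAE, MED, R2) for paired true and predicted ages in years."""
    if len(true_ages) != len(predicted_ages) or len(true_ages) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")

    abs_errors = [abs(p - t) for t, p in zip(true_ages, predicted_ages)]
    mae = mean(abs_errors)    # mean absolute error
    med = median(abs_errors)  # median absolute error

    # Assumed definition: coefficient of determination, 1 - SS_res / SS_tot.
    # It is negative when predictions are worse than always guessing the mean.
    mean_true = mean(true_ages)
    ss_res = sum((t - p) ** 2 for t, p in zip(true_ages, predicted_ages))
    ss_tot = sum((t - mean_true) ** 2 for t in true_ages)
    if ss_tot == 0:
        raise ValueError("true ages must not all be identical")
    r2 = 1.0 - ss_res / ss_tot
    return mae, med, r2


# Hypothetical toy values, not data from the study.
true_ages = [10, 20, 30, 40]
predicted_ages = [12, 18, 33, 40]
mae, med, r2 = age_prediction_metrics(true_ages, predicted_ages)
print(f"MAE={mae:.2f}  MED={med:.2f}  R2={r2:.3f}")
```

Reporting the median alongside the mean is useful here because a few large misses inflate the mean absolute error while leaving the median unchanged.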
Table 1. Accuracy of age prediction from fibroblast transcriptomes, for various algorithms on two datasets. Cross-validation age prediction metrics are reported for our dataset of 133 individuals between 1 and 94 years old and for dataset E-MTAB-3037 with 22 individuals from newborn to 89 years old. Metrics: mean absolute error (MAE), median absolute error (MED), and R² goodness-of-fit for the line of best fit. Parameters shown for regression algorithms are the best ones found for reducing MAE from a grid search of the parameter space. LDA ensemble with 20-year bins (in italics) achieves a lower MAE and MED and a higher R² than competing methods. Other window sizes (15, 25, 35) did not improve performance above that of the 20-year bin size.
| Algorithm | Parameters | Mean absolute error (years) | Median absolute error (years) | R² |
|---|---|---|---|---|
| Our dataset (133 individuals) | | | | |
| LDA ensemble | Age bin width = 10 | 9.5 | 4.0 | 0.68 |
| *LDA ensemble* | *Age bin width = 20* | | | |
| LDA ensemble | Age bin width = 30 | 8.2 | 4.0 | 0.77 |
| Gaussian naive Bayes ensemble | Age bin width = 10, uninformative priors | 16.5 | 7.0 | 0.20 |
| Gaussian naive Bayes ensemble | Age bin width = 20, uninformative priors | 16.0 | 8.0 | 0.27 |
| Gaussian naive Bayes ensemble | Age bin width = 30, uninformative priors | 15.7 | 7.0 | 0.30 |
| k-nearest neighbors ensemble | Age bin width = 10, Euclidean distance metric | 22.3 | 14.0 | −0.19 |
| k-nearest neighbors ensemble | Age bin width = 20, Euclidean distance metric | 19.7 | 11.0 | 0.04 |
| k-nearest neighbors ensemble | Age bin width = 30, Euclidean distance metric | 19.7 | 14.0 | 0.09 |
| Random forest ensemble | Age bin width = 10, n_trees = 100, min_impurity_split = 2 | 14.2 | 5.0 | 0.38 |
| Random forest ensemble | Age bin width = 20, n_trees = 100, min_impurity_split = 2 | 11.8 | 5.0 | 0.57 |
| Random forest ensemble | Age bin width = 30, n_trees = 100, min_impurity_split = 2 | 11.8 | 5.0 | 0.55 |
| Linear regression | N/A | 12.1 | 10.0 | 0.73 |
| Elastic net regression | Alpha = 0.1 | 12.0 | 11.0 | 0.73 |
| Support vector regression | Kernel = second order polynomial | 11.9 | 10.2 | 0.72 |
| E-MTAB-3037 (22 individuals) | | | | |
| LDA ensemble | | | | |
| Gaussian naive Bayes ensemble | Age bin width = 20, uninformative prior | 36.4 | 39.5 | −1.47 |
| k-nearest neighbors ensemble | Age bin width = 20, Euclidean distance metric | 34.9 | 36 | −1.25 |
| Random forest ensemble | Age bin width = 20, n_trees = 100, min_impurity_split = 2 | 31.9 | 28 | −0.82 |
| Linear regression | N/A | 23.5 | 18.8 | 0.04 |
| Elastic net regression | Alpha = 1.0 | 20.0 | 18.8 | 0.36 |
| Support vector regression | Kernel = second order polynomial | 19.7 | 15.4 | 0.31 |
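The "age bin width" parameter in the ensemble rows suggests that each ensemble member classifies a sample into an age bin rather than regressing age directly. The record does not spell out how the bins are arranged or how the members are combined, so the sketch below is only one plausible reading, and every design choice in it is an assumption: bins are staggered by one year per member, each member votes for every year in its predicted bin, and the predicted age is the year with the most votes. A simple nearest-centroid classifier stands in for LDA so that the example runs with the standard library alone. All function names and the one-gene toy data are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean


def fit_centroids(samples, labels):
    """Nearest-centroid stand-in for LDA: mean expression vector per label."""
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        groups[y].append(x)
    return {y: [mean(col) for col in zip(*xs)] for y, xs in groups.items()}


def predict_label(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)),
    )


def ensemble_predict_age(samples, ages, query, bin_width=20):
    """Predict an age by voting over classifiers with staggered age bins."""
    votes = Counter()
    for offset in range(bin_width):  # one ensemble member per offset (assumed)
        # Member-specific bin index: bins start at -offset, 20 - offset, ...
        labels = [(age + offset) // bin_width for age in ages]
        centroids = fit_centroids(samples, labels)
        predicted_bin = predict_label(centroids, query)
        low = predicted_bin * bin_width - offset
        # The member votes for every non-negative year in its predicted bin.
        for year in range(max(low, 0), low + bin_width):
            votes[year] += 1
    top = max(votes.values())
    # Average the tied years if several share the highest vote count.
    return mean(year for year, count in votes.items() if count == top)


# Hypothetical toy data: one "gene" whose expression rises linearly with age.
ages = [5, 15, 25, 35, 45, 55, 65, 75, 85]
samples = [[age / 10] for age in ages]
predicted = ensemble_predict_age(samples, ages, query=[4.8], bin_width=20)
print(predicted)
```

Staggering the bins means that no single arbitrary bin boundary decides the outcome: years covered by many members' predicted bins accumulate votes, which turns coarse bin predictions into a year-level estimate.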