Xingyan Li1, Weidong Li2, Yan Xu3,4.
Abstract
All tissues of an organism age over time. In recent years, epigenetic studies have found a close correlation between DNA methylation and aging. As DNA methylation research has developed, a quantitative statistical relationship between DNA methylation and age has been established from the way methylation changes with age, which makes it possible to predict the age of individuals. All data in this work were obtained from the Illumina HumanMethylation BeadChip platform (27K or 450K). We analyzed 16 sets of healthy samples and 9 sets of diseased samples. The healthy samples comprised 1899 publicly available blood samples (0 to 103 years old), and the diseased samples comprised 2395 blood samples. Six age-related CpG sites were selected by calculating Pearson correlation coefficients between age and DNA methylation values, and a gradient boosting regressor model was built on these sites. In each dataset, 70% of the data was randomly selected as training data and the remaining 30% as independent data, for 25 runs in total. For the healthy samples, the correlation between predicted age and DNA methylation in the training dataset was 0.97 and the mean absolute deviation (MAD) was 2.72 years; in the independent dataset, the MAD was 4.06 years. The proposed model was further tested on the diseased samples, where the MAD was 5.44 years for the training dataset and 7.08 years for the independent dataset. Furthermore, the model worked well when applied to saliva samples. These results show that age prediction based on six DNA methylation markers with a gradient boosting regressor is effective.
Keywords: DNA methylation; age prediction; aging; epigenetics; gradient boosting regressor
Year: 2018 PMID: 30134623 PMCID: PMC6162650 DOI: 10.3390/genes9090424
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
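The workflow summarized in the abstract (a random 70/30 split followed by a gradient boosting regressor evaluated by MAD) can be illustrated with a minimal pure-Python sketch. Everything below is an assumption made for illustration: the data are synthetic, the function names (`split_70_30`, `fit_stump`, `fit_gbr`, `predict`, `mad`) are hypothetical, the base learners are depth-1 trees, and the number of rounds and learning rate are arbitrary. It is not the authors' implementation, whose settings are not given here.

```python
import random

def split_70_30(rows, seed=0):
    """Shuffle a copy of rows and return (train, test) in a 70/30 ratio."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def fit_stump(X, residuals):
    """Return (feature, threshold, left_mean, right_mean) minimizing squared error."""
    overall = sum(residuals) / len(residuals)
    # Fallback: a constant stump, used only if no valid split exists.
    best_sse, best = float("inf"), (0, float("inf"), overall, overall)
    for j in range(len(X[0])):
        values = sorted({x[j] for x in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for x, r in zip(X, residuals) if x[j] <= thr]
            right = [r for x, r in zip(X, residuals) if x[j] > thr]
            if not left or not right:
                continue
            l_mean, r_mean = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - l_mean) ** 2 for r in left)
                   + sum((r - r_mean) ** 2 for r in right))
            if sse < best_sse:
                best_sse, best = sse, (j, thr, l_mean, r_mean)
    return best

def fit_gbr(X, y, n_rounds=60, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # negative gradient
        j, thr, l_val, r_val = fit_stump(X, residuals)
        stumps.append((j, thr, l_val, r_val))
        pred = [p + learning_rate * (l_val if x[j] <= thr else r_val)
                for x, p in zip(X, pred)]
    return base, learning_rate, stumps

def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, lr, stumps = model
    return base + sum(lr * (l if x[j] <= thr else r) for j, thr, l, r in stumps)

def mad(y_true, y_pred):
    """Mean absolute deviation between real and predicted ages."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic data (assumption): two sites whose methylation drifts linearly with age.
rng = random.Random(42)
rows = []
for _ in range(120):
    age = rng.uniform(0, 100)
    betas = [0.8 - 0.005 * age + rng.gauss(0, 0.02),
             0.1 + 0.004 * age + rng.gauss(0, 0.02)]
    rows.append((betas, age))

train, test = split_70_30(rows)
X_train, y_train = [b for b, _ in train], [a for _, a in train]
X_test, y_test = [b for b, _ in test], [a for _, a in test]

model = fit_gbr(X_train, y_train)
train_mad = mad(y_train, [predict(model, x) for x in X_train])
test_mad = mad(y_test, [predict(model, x) for x in X_test])
baseline_mad = mad(y_train, [sum(y_train) / len(y_train)] * len(y_train))
print(f"train MAD={train_mad:.2f}  test MAD={test_mad:.2f}  baseline MAD={baseline_mad:.2f}")
```

Repeating the split with different seeds and averaging the MAD would mirror the repeated-run design mentioned in the abstract. A production analysis would more likely rely on an established library implementation than on hand-written stumps.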
Sixteen healthy DNA-methylation datasets.
| DNA Origin | Platform | No. | Age Range | Author and Publication Year | Availability |
|---|---|---|---|---|---|
| Whole Blood | 27K | 93 | (49, 74) | Rakyan (2010) | GSE20236 |
| Blood CD4+CD14 | 27K | 50 | (16, 69) | Rakyan (2010) | GSE20242 |
| Blood PBMC 1 | 27K | 398 | (3.6, 18) | Alisch (2012) | GSE27097 |
| Blood Cord | 27K | 168 | (0, 0) | Adkins (2011) | GSE27317 |
| Blood PBMC | 450K | 40 | (0, 103) | Heyn (2012) | GSE30870 |
| Blood PBMC | 450K | 71 | (3.5, 76) | Harretal (2012) | GSE32149 |
| Blood Cord | 27K | 84 | (0, 0) | Khulan (2012) | GSE34257 |
| Blood Cord | 27K | 24 | (0, 0) | Mallon (2012) | GSE34869 |
| Blood PBMC | 450K | 78 | (1, 16) | Alisch (2012) | GSE36064 |
| Blood Cord | 27K | 123 | (0, 0) | Gordon (2012) | GSE36642 |
| Blood Cord | 27K | 48 | (0, 0) | Turan (2012) | GSE36812 |
| Blood PBMC | 27K | 91 | (24, 45) | Lam (2012) | GSE37008 |
| Whole Blood | 450K | 500 | (26, 101) | Hannum (2012) | GSE40279 |
| Whole Blood | 450K | 95 | (18, 65) | Horvath (2012) | GSE41169 |
| Whole Blood | 450K | 43 | (47, 59) | Bell (2013) | GSE53128 |
| Blood | 450K | 16 | (21, 32) | Xu (2015) | GSE65638 |
1 Peripheral blood mononuclear cell.
Nine disease DNA-methylation datasets.
| DNA Origin | Platform | No. | Age Range | Author and Publication Year | Availability |
|---|---|---|---|---|---|
| Whole Blood | 27K | 203 | (50, 85) | Song (2010) | GSE19711 |
| Whole Blood | 27K | 194 | (1, 32) | Teschendorff (2010) | GSE20067 |
| Peripheral Blood | 450K | 46 | (3.5, 76) | Harris (2011) | GSE32148 |
| Blood | 450K | 24 | (52, 88) | Athanasios (2012) | GSE40005 |
| Whole Blood | 27K | 498 | (16, 86) | Horvath (2012) | GSE41037 |
| Whole Blood | 450K | 500 | (18, 70) | Liu (2013) | GSE42861 |
| Blood | 27K | 71 | (23, 85) | Day (2013) | GSE49904 |
| Blood | 450K | 499 | (34, 72) | Polidoro (2013) | GSE51032 |
| Peripheral Blood | 450K | 383 | (34, 93) | Lwe (2013) | GSE53740 |
Information on the six selected age-related CpG sites.
| CpG ID | Gene ID | Chromosome Location 1 | Gene Region 2 | Relation to CpG Island 3 | Correlation Status | Reference |
|---|---|---|---|---|---|---|
| cg09809672 | EDARADD | 1:236557682 | TSS1500 | N_Shore | Negative | [ |
| cg22736354 | NHLRC1 | 6:18122719 | 1stExon | Island | Positive | [ |
| cg02228185 | ASPA | 17:3379567 | 1stExon | -- | Negative | [ |
| cg01820374 | LAG3 | 12:6882083 | Body | N_Shore | Negative | [ |
| cg06493994 | SCGN | 6:25652602 | 1stExon | Island | Positive | [ |
| cg19761273 | CSNK1D | 17:80232096 | TSS1500 | S_Shore | Negative | [ |
1 Chromosome locations refer to the human genome reference GRCh37. 2 TSS: transcription start site. TSS1500: 1500 bp flanking region from the TSS. 3 The CpG island table was downloaded from the University of California Santa Cruz (UCSC) genome browser. Regions within 2 kb of CpG islands were defined as CpG island shores (N_Shore: downstream of the CpG island; S_Shore: upstream of the CpG island).
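The "Correlation Status" column reflects the sign of the Pearson correlation between age and methylation at each site. The selection step can be sketched as follows, assuming sites are ranked by the absolute value of their correlation. The site names (`site_neg`, `site_pos`, `site_flat`), the toy beta values, and the helper names `pearson` and `rank_sites` are all hypothetical, and the ranking rule is an assumption rather than the authors' stated procedure.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def rank_sites(ages, methylation, top_k):
    """Rank sites by |r| with age and label the direction of the correlation.

    methylation maps a site ID to its beta values, one per sample.
    """
    scored = [(site, pearson(ages, betas)) for site, betas in methylation.items()]
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return [(site, r, "Positive" if r > 0 else "Negative")
            for site, r in scored[:top_k]]

# Toy data (assumption): six samples, three hypothetical sites.
ages = [5, 20, 35, 50, 65, 80]
methylation = {
    "site_neg": [0.80, 0.72, 0.66, 0.58, 0.50, 0.43],   # falls with age
    "site_pos": [0.10, 0.18, 0.22, 0.31, 0.37, 0.45],   # rises with age
    "site_flat": [0.50, 0.52, 0.49, 0.51, 0.50, 0.51],  # roughly unrelated to age
}

selected = rank_sites(ages, methylation, top_k=2)
for site, r, status in selected:
    print(f"{site}: r={r:.3f} ({status})")
```

Ranking by absolute value keeps both hypomethylated and hypermethylated sites, which matches the mix of negative and positive entries in the table above.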
Figure 1. Comparison between the real age and the age predicted by the four models in the training dataset of healthy data. GBR: gradient boosting regression; MAD: mean absolute deviation; RMSE: root mean square error; SVR: support vector regression.
Figure 2. Comparison between the real age and the age predicted by the four models in the validation dataset of healthy data.
Comparison of the gradient boosting regressor (GBR) with the other three methods on healthy datasets.
| Method | R² | MAD | MSE | RMSE |
|---|---|---|---|---|
| Training | | | | |
| Gradient Boosting Regressor | 0.9747 | 2.7171 | 20.7243 | 4.5524 |
| BayesianRidge | 0.8055 | 10.2561 | 158.3044 | 12.5819 |
| Support Vector Regression | 0.9267 | 5.1338 | 60.0420 | 7.7487 |
| Multiple Linear Regression | 0.8055 | 10.2448 | 158.2800 | 12.5809 |
| Testing | | | | |
| Gradient Boosting Regressor | 0.9523 | 4.0593 | 39.8269 | 6.3109 |
| BayesianRidge | 0.8101 | 10.5654 | 157.8721 | 12.5647 |
| Support Vector Regression | 0.9151 | 5.9267 | 71.2060 | 8.4384 |
| Multiple Linear Regression | 0.8104 | 10.5510 | 157.6726 | 12.5568 |
MAD: mean absolute deviation; MSE: mean square error; RMSE: root mean square error.
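The four columns reported in these tables can be computed from paired real and predicted ages. The sketch below assumes that MAD is the mean of the absolute prediction errors and that R² is the coefficient of determination; the source does not spell out either formula, and the function name `regression_metrics` and the toy ages are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R2, MAD, MSE and RMSE for paired real and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mad = sum(abs(e) for e in errors) / n       # mean absolute deviation
    mse = sum(e * e for e in errors) / n        # mean square error
    rmse = math.sqrt(mse)                       # root mean square error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot                 # coefficient of determination (assumed)
    return {"R2": r2, "MAD": mad, "MSE": mse, "RMSE": rmse}

# Toy ages (assumption), not taken from the study.
real_age = [10, 20, 30, 40]
predicted_age = [12, 18, 33, 39]
metrics = regression_metrics(real_age, predicted_age)
print(metrics)
```

Because RMSE is the square root of MSE, the two columns can be cross-checked against each other: for the GBR training row above, `math.sqrt(20.7243)` rounds to 4.5524, the reported RMSE.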
Figure 3. Comparison between the real age and the age predicted by the four models in the training dataset of disease data.
Figure 4. Comparison between the real age and the age predicted by the four models in the validation dataset of disease data.
Comparison of GBR with the other three methods on disease datasets.
| Method | R² | MAD | MSE | RMSE |
|---|---|---|---|---|
| Training | | | | |
| Gradient Boosting Regressor | 0.8186 | 5.4401 | 63.0648 | 7.9413 |
| BayesianRidge | 0.6844 | 7.8944 | 109.6227 | 10.4701 |
| Support Vector Regression | 0.5333 | 9.8583 | 162.6949 | 12.7552 |
| Multiple Linear Regression | 0.6844 | 7.8946 | 109.6222 | 10.4701 |
| Testing | | | | |
| Gradient Boosting Regressor | 0.7374 | 7.0832 | 91.7887 | 9.5806 |
| BayesianRidge | 0.6812 | 8.0786 | 111.2896 | 10.5494 |
| Support Vector Regression | 0.5303 | 9.9573 | 164.6747 | 12.8326 |
| Multiple Linear Regression | 0.6812 | 8.0795 | 111.3016 | 10.5500 |
Comparison of GBR with the other three methods on saliva datasets.
| Method | R² | MAD | MSE | RMSE |
|---|---|---|---|---|
| Training | | | | |
| Gradient Boosting Regressor | 0.8539 | 2.1040 | 13.7795 | 3.7121 |
| BayesianRidge | 0.4310 | 5.7483 | 52.5169 | 7.2469 |
| Support Vector Regression | 0.0227 | 7.9369 | 99.5273 | 9.9763 |
| Multiple Linear Regression | 0.4333 | 5.6775 | 52.3045 | 7.2322 |
| Testing | | | | |
| Gradient Boosting Regressor | 0.4298 | 5.3478 | 56.1291 | 7.4919 |
| BayesianRidge | 0.5423 | 5.5389 | 43.8468 | 6.6217 |
| Support Vector Regression | 0.0308 | 8.4729 | 104.4403 | 10.2196 |
| Multiple Linear Regression | 0.5479 | 5.4662 | 43.3933 | 6.5874 |
Results of GBR and Multiple Linear Regression on saliva samples.
| Method | No. of CpG Sites | R² | MAD |
|---|---|---|---|
| Multiple Linear Regression | 88 | 0.73 | 5.2 |
| Gradient Boosting Regressor | 6 | 0.58 | 3.76 |
Figure 5. (a) Histogram of the age distribution for healthy individuals; (b) histogram of the age distribution for diseased individuals.
Figure 6. UCSC genome browser view of the genomic location of the CpG cg19761273.