Zahra Momeni, Mohammad Saniee Abadeh.
Abstract
Genomic biomarkers such as DNA methylation (DNAm) are employed for age prediction. In recent years, several studies have suggested an association between changes in DNAm and human age. The high-dimensional nature of this type of data significantly increases the execution time of modeling algorithms. To mitigate this problem, we propose a two-stage parallel algorithm for the selection of age-related CpG-sites. The algorithm first attempts to cluster the data into similar age ranges. In the next stage, a parallel genetic algorithm (PGA) based on the MapReduce paradigm (MR-based PGA) is used to select age-related features for each individual age range. In the proposed method, the execution of the algorithm for each age range (data parallel), the evaluation of chromosomes (task parallel), and the calculation of the fitness function (data parallel) are performed using a novel parallel framework. In this paper, we consider 16 different healthy DNAm datasets that are derived from human blood tissue and contain the relevant age information. These datasets are combined into a single set, which is then randomly divided into train and test sets with a ratio of 7:3. We build a Gradient Boosting Regressor (GBR) model on the CpG-sites selected from the train set. To evaluate the model accuracy, we compared our results with state-of-the-art approaches that used these datasets and observed that our method performs better on the unseen test set, with a Mean Absolute Deviation (MAD) of 3.62 years and a correlation (R²) of 95.96% between age and DNAm. On the train data, the MAD and R² are 1.27 years and 99.27%, respectively. Finally, we evaluate the effect of parallelization on computation time. The algorithm without parallelization requires 4123 min to complete, whereas the parallelized execution on 3 computing machines with 32 processing cores each takes a total of only 58 min. This shows that our proposed algorithm is both efficient and scalable.
Keywords: CpG-site selection; GBR Model; MapReduce; age prediction; parallel genetic algorithm
Year: 2019 PMID: 31775313 PMCID: PMC6947642 DOI: 10.3390/genes10120969
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
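The abstract describes the fitness calculation as data parallel and MapReduce-based. A minimal sketch of that idea follows, assuming the fitness is derived from the MAD between real and predicted ages (the abstract does not state the exact fitness function). The names `mapper`, `reducer`, and `mapreduce_mad` are hypothetical, and the built-in `map` stands in for a real cluster.

```python
from functools import reduce


def mapper(partition):
    """Map step: partial sum of absolute errors and sample count for one partition."""
    abs_err = sum(abs(real - pred) for real, pred in partition)
    return abs_err, len(partition)


def reducer(a, b):
    """Reduce step: merge two partial (error sum, count) pairs."""
    return a[0] + b[0], a[1] + b[1]


def mapreduce_mad(pairs, n_partitions=3):
    """MAD over (real_age, predicted_age) pairs, computed partition by partition."""
    # Assumption: round-robin partitioning; a real framework would shard the data itself.
    partitions = [pairs[i::n_partitions] for i in range(n_partitions)]
    # On a cluster each mapper call would run on its own worker; here it runs serially.
    partials = map(mapper, partitions)
    total_err, count = reduce(reducer, partials, (0.0, 0))
    return total_err / count


# Toy data, not taken from the paper.
pairs = [(10, 12), (30, 27), (55, 60), (70, 66)]
print(mapreduce_mad(pairs))  # 3.5
```

Because each mapper returns only a sum and a count, the reduce step stays cheap regardless of how many samples each partition holds.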
Sixteen healthy blood DNAm datasets.
| Availability (GEO accession) | DNA Origin | No. of Cases | Age Range (years) | Platform |
|---|---|---|---|---|
| GSE30870 | Blood PBMC ¹ | 40 | (0, 103) | 450 K |
| GSE32149 | Blood PBMC | 71 | (3.5, 76) | 450 K |
| GSE36064 | Blood PBMC | 78 | (1, 16) | 450 K |
| GSE40279 | Whole Blood | 500 | (26, 101) | 450 K |
| GSE41169 | Whole Blood | 95 | (18, 65) | 450 K |
| GSE53128 | Whole Blood | 43 | (47, 59) | 450 K |
| GSE65638 | Blood | 16 | (21, 32) | 450 K |
| GSE20236 | Whole Blood | 93 | (49, 74) | 27 K |
| GSE20242 | Blood CD4 + CD14 | 50 | (16, 69) | 27 K |
| GSE27097 | Blood PBMC ¹ | 398 | (3.6, 18) | 27 K |
| GSE27317 | Blood Cord | 168 | (0, 0) | 27 K |
| GSE34257 | Blood Cord | 84 | (0, 0) | 27 K |
| GSE34869 | Blood Cord | 24 | (0, 0) | 27 K |
| GSE36642 | Blood Cord | 123 | (0, 0) | 27 K |
| GSE36812 | Blood Cord | 48 | (0, 0) | 27 K |
| GSE37008 | Blood PBMC | 91 | (24, 45) | 27 K |

¹ Peripheral blood mononuclear cell.
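The union of the sixteen datasets and the random 7:3 split can be sketched as follows. The case counts are copied from the table above; the sample IDs, the seed, and the helper `split_7_3` are hypothetical, and the exact train and test sizes used by the authors are not stated in this record.

```python
import random

# Number of cases per GEO accession, copied from the dataset table.
cases = {
    "GSE30870": 40, "GSE32149": 71, "GSE36064": 78, "GSE40279": 500,
    "GSE41169": 95, "GSE53128": 43, "GSE65638": 16, "GSE20236": 93,
    "GSE20242": 50, "GSE27097": 398, "GSE27317": 168, "GSE34257": 84,
    "GSE34869": 24, "GSE36642": 123, "GSE36812": 48, "GSE37008": 91,
}


def split_7_3(sample_ids, seed=0):
    """Shuffle sample IDs and split them 7:3 into train and test lists."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    cut = len(ids) * 7 // 10  # integer arithmetic avoids float rounding
    return ids[:cut], ids[cut:]


# Hypothetical sample IDs of the form "<accession>_<index>".
all_ids = [f"{gse}_{i}" for gse, n in cases.items() for i in range(n)]
train_ids, test_ids = split_7_3(all_ids)
print(len(all_ids), len(train_ids), len(test_ids))  # 1922 1345 577
```

Fixing the seed makes the split reproducible, which matters when results are compared against another method on the same test set.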
Figure 1. Flowchart of general GA.
Statistical criteria calculated in GBR.
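| Name | Formula |
|---|---|
| Mean Absolute Deviation | $\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert$ |
| Mean Square Error | $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ |
| Root Mean Square Error | $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ |

Here $y_i$ is the real age of sample $i$, $\hat{y}_i$ is its predicted age, and $n$ is the number of samples.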
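The three named criteria can be computed directly from real and predicted ages. The sketch below uses their standard definitions (mean of absolute errors, mean of squared errors, and the square root of the latter); the function name `regression_metrics` and the toy ages are illustrative and not taken from the paper.

```python
import math


def regression_metrics(real, pred):
    """Return MAD, MSE and RMSE for paired real and predicted ages."""
    n = len(real)
    errors = [r - p for r, p in zip(real, pred)]
    mad = sum(abs(e) for e in errors) / n   # mean absolute deviation
    mse = sum(e * e for e in errors) / n    # mean square error
    rmse = math.sqrt(mse)                   # root mean square error
    return {"MAD": mad, "MSE": mse, "RMSE": rmse}


# Toy ages in years.
metrics = regression_metrics([10, 30, 55, 70], [12, 27, 60, 66])
print(metrics["MAD"], metrics["MSE"])  # 3.5 13.5
```

Since RMSE is the square root of MSE, reported values can be cross-checked: for example, an MSE of 6.3339 corresponds to an RMSE of about 2.5167.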
Figure 2. Flowchart of the proposed framework. Here n denotes the number of subgroups used for labeling the training and test sets.
Figure 3. (a) Confusion matrix derived from modeling on 10-labeled train data; (b) confusion matrix derived from modeling on 5-labeled train data; (c) confusion matrix derived from modeling on 3-labeled train data.
Figure 4. Flowchart of the proposed GA.
Figure 5. Proposed ParentDifference-based crossover operator.
Parameters of proposed MR-based PGA algorithm.
| Parameter | Value |
|---|---|
| Encoding | Binary |
| String length | 8000 selected CpG-sites using Pearson correlation |
| Generation | 100 |
| Population size | 100 |
| Selection Method | Roulette wheel |
| Crossover Method | ParentDifference-based crossover |
| Mutation method | Presented mutation operator in Step 10 of proposed parallel GA in |
| Elitist strategy | Preserving the top 10 of the best chromosomes in a generation |
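The selection and elitism settings in the table can be illustrated with one generation step of a binary-encoded GA. This is only a sketch: it reads "top 10" as ten chromosomes, assumes that higher fitness is better, and uses generic one-point crossover and bit-flip mutation as stand-ins, because the ParentDifference-based crossover and the authors' mutation operator are not specified in this record. The toy fitness and the shortened string length are also assumptions.

```python
import random


def roulette_select(population, fitness, rng):
    """Roulette wheel: pick one chromosome with probability proportional to fitness."""
    return rng.choices(population, weights=fitness, k=1)[0]


def one_point_crossover(p1, p2, rng):
    """Generic stand-in crossover (not the ParentDifference-based operator)."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]


def bit_flip(chrom, rng, rate=0.01):
    """Generic stand-in mutation: flip each bit with a small probability."""
    return [1 - g if rng.random() < rate else g for g in chrom]


def next_generation(population, fitness, rng, n_elite=10):
    """Keep the n_elite fittest chromosomes, then fill up with offspring."""
    ranked = sorted(zip(fitness, population), key=lambda fp: fp[0], reverse=True)
    new_pop = [list(chrom) for _, chrom in ranked[:n_elite]]
    while len(new_pop) < len(population):
        p1 = roulette_select(population, fitness, rng)
        p2 = roulette_select(population, fitness, rng)
        new_pop.append(bit_flip(one_point_crossover(p1, p2, rng), rng))
    return new_pop


rng = random.Random(0)
# Population size 100 as in the table; string length shortened from 8000 to 40 for the demo.
population = [[rng.randint(0, 1) for _ in range(40)] for _ in range(100)]
fitness = [1 + sum(c) for c in population]  # toy fitness, strictly positive
new_population = next_generation(population, fitness, rng)
```

Elitism guarantees that the best fitness never decreases from one generation to the next, which keeps the search from losing a good CpG-site subset to an unlucky crossover.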
Parameters of GBR model.
| Parameter | Value |
|---|---|
| N_estimators | 300 |
| Max_depth | 4 |
| Min_samples_split | 2 |
| Subsample | 0.6 |
| Verbose | 0 |
| Warm_start | true |
| alpha | 0.6 |
| Learning_rate | 0.03 |
| loss | lad |
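The parameter names in this table match those of scikit-learn's `GradientBoostingRegressor`, so the configuration can be sketched as below, assuming that library was used. In scikit-learn 1.0 and later the `lad` loss is named `absolute_error`, and `alpha` only affects the `huber` and `quantile` losses, so it is passed here purely to mirror the table. The helper `build_gbr` is hypothetical.

```python
# Values copied from the parameter table; keys follow scikit-learn's naming.
GBR_PARAMS = {
    "n_estimators": 300,
    "max_depth": 4,
    "min_samples_split": 2,
    "subsample": 0.6,
    "verbose": 0,
    "warm_start": True,
    "alpha": 0.6,              # has no effect with the absolute-error loss
    "learning_rate": 0.03,
    "loss": "absolute_error",  # listed as "lad" in the table (pre-1.0 name)
}


def build_gbr(params=GBR_PARAMS):
    """Create an unfitted regressor; requires scikit-learn to be installed."""
    from sklearn.ensemble import GradientBoostingRegressor
    return GradientBoostingRegressor(**params)
```

A model built this way would then be fitted on the selected CpG-site columns of the train set, for example `build_gbr().fit(X_train, y_train)`, where `X_train` and `y_train` are placeholder names.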
Figure 6. Comparison between the real age and the age predicted by GBR on the train and test sets: (a) age range 0–20; (b) age range 20–50; (c) age range 50–103.
Figure 7. (a) Comparison between the real age and the age predicted by GBR on the whole train set; (b) comparison between the real age and the age predicted by GBR on the whole test set; (c) comparison between the real age and the age predicted by 3-fold CV GBR on the whole train set.
Comparison of regression performance between the GBR proposed in this paper and the GBR proposed by Li et al. [4].
| Ref. | Evaluated on | Validation Type | MAD | MSE | RMSE | R² |
|---|---|---|---|---|---|---|
| Li et al. [4] | Train | Split | 2.7171 | 20.7243 | 4.5524 | 0.9747 |
| MR-based PGA | Train | Split | 1.2740 | 6.3339 | 2.5167 | 0.9927 |
| Li et al. [4] | Test | Split | 4.0593 | 39.8269 | 6.3109 | 0.9523 |
| MR-based PGA | Test | Split | 3.6233 | 35.1678 | 5.9302 | 0.9596 |
| MR-based PGA | Train | 3-fold cross-validation on Train | 3.2105 | 23.9033 | 4.3927 | 0.9672 |
Comparison of the MAD of the GBRs built on the three age groups.
| Age Range (years) | MAD (years) |
|---|---|
| | |
| 0–20 | 1.4138 |
| 20–50 | 4.1451 |
| 50–103 | 5.3504 |
| | |
| 0–20 | 0.6002 |
| 20–50 | 1.3036 |
| 50–103 | 2.2216 |
| | |
| 0–20 | 1.3486 |
| 20–50 | 6.1429 |
| 50–103 | 5.5371 |
Figure 8. Progress of MR-based PGA over 100 generations: (a) age group 0–20 years; (b) age group 20–50 years; (c) age group 50–103 years.
Performance test for the parallel execution of the proposed algorithm for each age group presented in minutes (each age group on one machine). Each machine has only one processing core.
| | Using One Machine | Using Three Machines |
|---|---|---|
| Minutes | 4123 | 1642 |
Performance test for the execution time of the parallel evolution of chromosomes on three machines presented in minutes (each chromosome runs on one processing core).
| Parallelism | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| Minutes | 1642 | 986 | 608 | 359 | 207 | 126 |
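The timings above can be turned into speedup and parallel efficiency figures. The sketch below uses the standard definitions (speedup = time at parallelism 1 divided by time at parallelism p; efficiency = speedup divided by p); the function name is hypothetical and the timings are copied from the table.

```python
# Execution time in minutes for each parallelism level, copied from the table.
minutes = {1: 1642, 2: 986, 4: 608, 8: 359, 16: 207, 32: 126}


def speedup_and_efficiency(timings):
    """Map each parallelism level p to (speedup, efficiency) relative to p = 1."""
    base = timings[1]
    return {p: (base / t, base / t / p) for p, t in timings.items()}


results = speedup_and_efficiency(minutes)
for p, (speedup, efficiency) in results.items():
    print(f"parallelism {p:>2}: speedup {speedup:5.2f}, efficiency {efficiency:4.2f}")
# The last line printed is: parallelism 32: speedup 13.03, efficiency 0.41
```

The falling efficiency at higher parallelism is the usual sign of overhead and serial portions of the workload that do not shrink as workers are added.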
Performance test for the execution time of the parallel calculation of the fitness function using MapReduce (MR) on three machines, each with 32 processing cores, presented in minutes (each mapper runs on one thread).
| | Using MR | Without MR |
|---|---|---|
| Minutes | 58 | 126 |
Selected CpG-sites in each of the three age groups.
| Age Range | CpG-Sites |
|---|---|
| 0–20 years | (1) cg14918082, (2) cg27210390, (3) cg01993576, (4) cg19686152, (5) cg19761273, (6) cg13870494, (7) cg19945840, (8) cg09427311, (9) cg17791651, (10) cg06058597, (11) cg10591174, (12) cg23591869, (13) cg21545849, (14) cg15368822, (15) cg20544605, (16) cg03473518, (17) cg09626984, (18) cg03375002, (19) cg00831028, (20) cg08351331, (21) cg16786458, (22) cg19180828. |
| 20–50 years | (1) cg22736354, (2) cg05724065, (3) cg15673110, (4) cg20761322, (5) cg08635242, (6) cg10986043, (7) cg00216361, (8) cg12261786, (9) cg17258195, (10) cg21430666, (11) cg13614181, (12) cg14611174, (13) cg09118625, (14) cg17347389, (15) cg02868123, (16) cg24715735, (17) cg24662961, (18) cg05346899, (19) cg26900154, (20) cg03022541, (21) cg18546419, (22) cg12782180, (23) cg09001953, (24) cg26069252, (25) cg15365950, (26) cg18722841, (27) cg11691938, (28) cg10588377, (29) cg02552572, (30) cg06165395, (31) cg02973263, (32) cg04809787. |
| 50–103 years | (1) cg21296230, (2) cg09809672, (3) cg14094063, (4) cg19560758, (5) cg15297650, (6) cg15399561, (7) cg02228185, (8) cg07944287, (9) cg19945840, (10) cg18815943, (11) cg08005849, (12) cg18113787, (13) cg00635481, (14) cg07091958, (15) cg25809905, (16) cg26508537, (17) cg08395899, (18) cg25671438, (19) cg18630855, (20) cg19722847, (21) cg05361811, (22) cg26526440, (23) cg00915289, (24) cg24490859, (25) cg09462826, (26) cg25490410, (27) cg06885782, (28) cg08158331, (29) cg17022914, (30) cg05140736, (31) cg24110916, (32) cg10940099. |
Figure 9. Progress of the SFS algorithm while selecting CpG-sites on the three age ranges: (a) age group 0–20 years; (b) age group 20–50 years; (c) age group 50–103 years.