Magdalena Kukla-Bartoszek, Paweł Teisseyre, Ewelina Pośpiech, Joanna Karłowska-Pik, Piotr Zieliński, Anna Woźniak, Michał Boroń, Michał Dąbrowski, Magdalena Zubańska, Agata Jarosz, Rafał Płoski, Tomasz Grzybowski, Magdalena Spólnicka, Jan Mielniczuk, Wojciech Branicki.
Abstract
Increasing understanding of human genome variability allows for better use of the predictive potential of DNA. An obvious direct application is the prediction of physical phenotypes. Significant success has been achieved, especially in predicting pigmentation characteristics, but the inference of some phenotypes is still challenging. In search of further improvements in predicting human eye colour, we conducted whole-exome (enriched in regulome) sequencing of 150 Polish samples to discover new markers. For this, we adopted quantitative characterization of eye colour phenotypes using high-resolution photographic images of the iris in combination with DIAT software analysis. An independent set of 849 samples was used for subsequent predictive modelling. The models were built with advanced machine learning algorithms from the newly identified candidates and 114 additional SNPs selected from the literature as previously associated with pigmentation. Whole-exome sequencing analysis found 27 previously unreported candidate SNP markers for eye colour. The highest overall prediction accuracies were achieved with LASSO-regularized and BIC-based selected regression models. A new candidate variant, rs2253104, located in the ARFIP2 gene and identified with the HyperLasso method, revealed predictive potential and was included in the best-performing regression models. Advanced machine learning approaches showed a significant increase in sensitivity of intermediate eye colour prediction (up to 39%) compared to 0% obtained for the original IrisPlex model. We identified a new potential predictor of eye colour and evaluated several widely used advanced machine learning algorithms in predictive analysis of this trait. Our results provide useful hints for developing future predictive models for eye colour in forensic and anthropological studies.
Keywords: DNA phenotyping; Eye colour; Machine learning algorithms; Predictive modelling; Whole-exome sequencing
Year: 2021 PMID: 34259936 PMCID: PMC8523394 DOI: 10.1007/s00414-021-02645-5
Source DB: PubMed Journal: Int J Legal Med ISSN: 0937-9827 Impact factor: 2.686
List of the machine learning approaches evaluated for eye colour prediction
| Algorithm/model | Abbreviation |
|---|---|
| Random classifier | Naive |
| Logistic/multinomial regression with LASSO regularization | LOG REG |
| Logistic/multinomial regression with AIC-based model selection | LOG AIC |
| Logistic/multinomial regression with BIC-based model selection | LOG BIC |
| Logistic/multinomial regression with 1 step (1 SNP) | LOG 1-STEP |
| Logistic/multinomial regression model with all SNPs | LOG FULL |
| Classification and Regression Tree | TREE |
| Random forest | RF |
| Extreme gradient boosting | XGB |
| Multivariate and adaptive regression splines | MARS |
| Neural networks | NN |
| Support vector machine | SVM |
| Naïve Bayes | NB |
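The LOG AIC and LOG BIC entries differ only in the criterion used to choose among candidate regression models. As a minimal sketch of that difference, the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L can be compared directly. The two candidate fits below (their log-likelihoods and parameter counts) are invented for illustration and are not taken from the study; only the sample size of 849 comes from the modelling cohort.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_samples) - 2 * log_likelihood


# Hypothetical candidate models: (maximised log-likelihood, free parameters).
# These numbers are made up purely to illustrate the trade-off.
candidates = {"small": (-420.0, 6), "large": (-410.0, 14)}
n = 849  # size of the predictive modelling cohort

best_aic = min(candidates, key=lambda m: aic(*candidates[m]))
best_bic = min(candidates, key=lambda m: bic(*candidates[m], n))

print("AIC prefers:", best_aic)
print("BIC prefers:", best_bic)
```

Because ln(849) is well above 2, BIC charges more per parameter than AIC, so it tends to keep fewer SNPs. In this toy case AIC picks the larger model and BIC the smaller one.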
Characteristics of the study group
| Characteristic | Discovery cohort [N = 150], n | % | Predictive modelling cohort [N = 849], n | % | Total [N = 999], n | % |
|---|---|---|---|---|---|---|
| Sex | ||||||
| Females | 67 | 44.7% | 259 | 30.5% | 326 | 32.6% |
| Males | 83 | 55.3% | 590 | 69.5% | 673 | 67.4% |
| Age | ||||||
| min | 19 | | 19 | | 19 | |
| max | 77 | | 62 | | 77 | |
| mean value | 31.5 | | 30.4 | | 30.6 | |
| SD | 10.3 | | 8.7 | | 9.0 | |
| Eye colour | ||||||
| Blue | 76 | 50.7% | 551 | 64.9% | 627 | 62.8% |
| Intermediate | 28 | 18.7% | 122 | 14.4% | 150 | 15.0% |
| Brown | 42 | 28.0% | 139 | 16.4% | 181 | 18.1% |
| NA | 4 | 2.7% | 37 | 4.4% | 41 | 4.1% |
| PIE score, min | −1.0 | | −1.0 | | −1.0 | |
| PIE score, max | 1.0 | | 1.0 | | 1.0 | |
| PIE score, mean value | 0.0 | | 0.2 | | 0.2 | |
| PIE score, SD | 0.9 | | 0.8 | | 0.8 | |
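The PIE score rows span the full range from −1.0 to 1.0. The record does not define the score, so the sketch below rests on an assumption: that it follows the commonly described pixel index of the eye, the normalised difference between the numbers of iris pixels classified as blue and as brown. The function name `pie_score` and its arguments are hypothetical and are not part of the DIAT software.

```python
def pie_score(blue_pixels: int, brown_pixels: int) -> float:
    """Normalised blue-minus-brown pixel difference, in [-1, 1].

    Assumed definition (not stated in this record):
    (blue - brown) / (blue + brown).
    """
    if blue_pixels < 0 or brown_pixels < 0:
        raise ValueError("pixel counts must be non-negative")
    total = blue_pixels + brown_pixels
    if total == 0:
        raise ValueError("at least one classified pixel is required")
    return (blue_pixels - brown_pixels) / total


# Made-up pixel counts for three irises.
print(pie_score(9000, 0))      # fully blue iris
print(pie_score(0, 9000))      # fully brown iris
print(pie_score(4500, 4500))   # evenly mixed iris
```

Under this assumed definition, 1.0 corresponds to an entirely blue iris, −1.0 to an entirely brown one, and values near 0 to mixed pigmentation.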
Fig. 1 Selection of markers subjected to predictive modelling
The most important SNP variants, selected in > 50 out of 100 data splits by at least two of the variable selection methods for regression models
| SNP_ID | Chromosome position (GRCh38) | Gene | Selection method |
|---|---|---|---|
| rs10874518 | 1:101,806,756 | | LASSO, AIC |
| rs16891982 | 5:33,951,588 | | LASSO, AIC, BIC |
| rs2253104 | 11:6,479,079 | ARFIP2 | LASSO, AIC |
| rs12913832 | 15:28,120,472 | | LASSO, AIC, BIC |
| rs1800407 | 15:27,985,172 | | LASSO, AIC, BIC |
| rs74653330 | 15:27,983,407 | | LASSO, AIC, BIC |
| rs885479 | 16:89,919,746 | | LASSO, AIC |
| rs8049897 | 16:89,957,794 | | LASSO, AIC |
Fig. 2 Predictive markers selected by the LASSO (LOG REG), AIC (LOG AIC) and BIC (LOG BIC) approaches for eye colour prediction. Only stable markers (selected in at least 50 of the 100 models, i.e. 50%) are presented in the chart
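The stability filter described for the table and chart (how often a SNP is selected across repeated data splits, and by how many selection methods) can be sketched as a simple counting step. This is an illustration under stated assumptions, not the authors' code: the function `stable_markers`, the four-split toy input and the placeholder `snp_x` are all hypothetical. The strict "more than" threshold follows the table caption (> 50 of 100).

```python
from collections import Counter


def stable_markers(selections_by_method, threshold, min_methods=2):
    """Return {snp: sorted methods} for SNPs that are stable in enough methods.

    selections_by_method maps a method name to a list with one set of
    selected SNP ids per data split. A SNP is "stable" for a method when it
    is selected in more than `threshold` splits.
    """
    methods_per_snp = {}
    for method, per_split in selections_by_method.items():
        counts = Counter(snp for chosen in per_split for snp in set(chosen))
        for snp, count in counts.items():
            if count > threshold:
                methods_per_snp.setdefault(snp, []).append(method)
    return {
        snp: sorted(methods)
        for snp, methods in methods_per_snp.items()
        if len(methods) >= min_methods
    }


# Toy input with 4 splits instead of 100; the selections are invented.
toy = {
    "LASSO": [{"rs12913832", "rs2253104"}] * 3 + [{"rs12913832"}],
    "AIC": [{"rs12913832", "rs2253104"}] * 2
    + [{"rs12913832", "rs2253104", "snp_x"}, {"rs12913832"}],
    "BIC": [{"rs12913832"}] * 4,
}

# "More than half of the splits" for 4 splits means more than 2.
result = stable_markers(toy, threshold=2)
for snp in sorted(result):
    print(snp, result[snp])
```

With 100 splits the same call would use `threshold=50`. In the toy data `snp_x` is dropped because it appears in a single split, while the other two SNPs pass for at least two methods.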
Detailed results of predictive analysis of eye colour with various machine learning approaches
| Statistic | Eye colour | Naive | LOG REG | LOG AIC | LOG BIC | LOG 1-STEP | LOG FULL | TREE | RF | XGB | MARS | NN | SVM | NB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | | | | | | | | | | | | | | |
| Mean | | 0.33 | 0.85 | 0.79 | 0.84 | 0.79 | 0.74 | 0.82 | 0.83 | 0.81 | 0.82 | 0.77 | 0.79 | 0.77 |
| SD | | 0.03 | 0.02 | 0.03 | 0.03 | 0.02 | 0.03 | 0.03 | 0.02 | 0.02 | 0.03 | 0.03 | 0.03 | 0.02 |
| AUC | | | | | | | | | | | | | | |
| Mean | Blue | 0.50 | 0.96 | 0.91 | 0.96 | 0.95 | 0.92 | 0.95 | 0.96 | 0.96 | 0.95 | 0.93 | 0.95 | 0.94 |
| SD | Blue | 0.04 | 0.01 | 0.03 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.03 | 0.05 | 0.01 | 0.02 |
| Mean | Intermediate | 0.51 | 0.85 | 0.75 | 0.85 | 0.83 | 0.67 | 0.82 | 0.84 | 0.82 | 0.80 | 0.77 | 0.82 | 0.80 |
| SD | Intermediate | 0.05 | 0.03 | 0.07 | 0.03 | 0.03 | 0.04 | 0.04 | 0.03 | 0.03 | 0.06 | 0.06 | 0.03 | 0.03 |
| Mean | Brown | 0.49 | 0.94 | 0.88 | 0.93 | 0.91 | 0.82 | 0.91 | 0.92 | 0.91 | 0.91 | 0.87 | 0.91 | 0.89 |
| SD | Brown | 0.05 | 0.02 | 0.07 | 0.02 | 0.01 | 0.03 | 0.03 | 0.01 | 0.02 | 0.03 | 0.08 | 0.02 | 0.02 |
| Sensitivity | | | | | | | | | | | | | | |
| Mean | Blue | 0.17 | 0.96 | 0.93 | 0.97 | 0.96 | 0.86 | 0.96 | 0.96 | 0.97 | 0.96 | 0.95 | 0.96 | 0.94 |
| SD | Blue | 0.03 | 0.01 | 0.03 | 0.01 | 0.01 | 0.04 | 0.01 | 0.01 | 0.01 | 0.02 | 0.02 | 0.01 | 0.02 |
| Mean | Intermediate | 0.17 | 0.17 | 0.40 | 0.29 | 0.00 | 0.41 | 0.35 | 0.11 | 0.34 | 0.39 | 0.18 | 0.10 | 0.16 |
| SD | Intermediate | 0.06 | 0.09 | 0.09 | 0.13 | 0.00 | 0.08 | 0.11 | 0.05 | 0.09 | 0.09 | 0.24 | 0.09 | 0.16 |
| Mean | Brown | 0.16 | 0.62 | 0.59 | 0.76 | 0.37 | 0.56 | 0.62 | 0.42 | 0.61 | 0.63 | 0.55 | 0.46 | 0.67 |
| SD | Brown | 0.06 | 0.17 | 0.09 | 0.19 | 0.42 | 0.09 | 0.11 | 0.09 | 0.08 | 0.10 | 0.37 | 0.16 | 0.18 |
| Specificity | | | | | | | | | | | | | | |
| Mean | Blue | 0.84 | 0.94 | 0.85 | 0.93 | 0.94 | 0.88 | 0.93 | 0.94 | 0.93 | 0.92 | 0.82 | 0.88 | 0.79 |
| SD | Blue | 0.04 | 0.02 | 0.05 | 0.03 | 0.02 | 0.04 | 0.03 | 0.03 | 0.03 | 0.04 | 0.14 | 0.04 | 0.06 |
| Mean | Intermediate | 0.83 | 0.97 | 0.88 | 0.96 | 1.00 | 0.85 | 0.91 | 0.98 | 0.92 | 0.91 | 0.94 | 0.97 | 0.95 |
| SD | Intermediate | 0.03 | 0.02 | 0.03 | 0.03 | 0.00 | 0.03 | 0.02 | 0.01 | 0.02 | 0.03 | 0.08 | 0.03 | 0.04 |
| Mean | Brown | 0.83 | 0.93 | 0.93 | 0.90 | 0.94 | 0.88 | 0.90 | 0.94 | 0.90 | 0.92 | 0.89 | 0.92 | 0.86 |
| SD | Brown | 0.02 | 0.03 | 0.02 | 0.03 | 0.09 | 0.03 | 0.03 | 0.02 | 0.02 | 0.02 | 0.07 | 0.03 | 0.05 |
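The per-class sensitivity and specificity values in this table are one-vs-rest quantities, which is why a model can reach high overall accuracy while detecting few intermediate eyes. The sketch below shows one conventional way to derive these metrics from a three-class confusion matrix. The function name `one_vs_rest_metrics` and the confusion matrix are hypothetical; the counts are invented and do not reproduce any model in the study.

```python
def one_vs_rest_metrics(confusion, classes):
    """Compute overall accuracy and per-class sensitivity/specificity.

    confusion[i][j] is the number of samples whose true class is classes[i]
    and whose predicted class is classes[j]. Assumes every class has at
    least one true sample and at least one sample from another class.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(classes)))
    metrics = {"accuracy": correct / total}
    for k, name in enumerate(classes):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                 # true k, predicted other
        fp = sum(row[k] for row in confusion) - tp  # true other, predicted k
        tn = total - tp - fn - fp
        metrics[name] = {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
        }
    return metrics


classes = ["blue", "intermediate", "brown"]
# Invented counts: rows are true classes, columns are predicted classes.
confusion = [
    [96, 2, 2],
    [12, 4, 4],
    [5, 3, 17],
]

m = one_vs_rest_metrics(confusion, classes)
print(round(m["accuracy"], 3))
for name in classes:
    print(name, m[name])
```

In this toy matrix the blue class dominates, so accuracy stays above 0.8 even though only 4 of 20 intermediate eyes are recognised. The table shows the same pattern, for example LOG REG with mean accuracy 0.85 and mean intermediate sensitivity 0.17.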