Wei Li, Yanbin Yin, Xiongwen Quan, Han Zhang.
Abstract
Gene expression profiling is widely used to characterize cell status, to reflect the health of the body, and to diagnose genetic diseases. Although the cost of genome-wide expression profiling has been decreasing in recent years, collecting expression profiles for thousands of genes is still very expensive. Because gene expression levels are usually highly correlated in humans, the expression values of the remaining target genes can be predicted from the values of 943 landmark genes. We therefore designed an algorithm for predicting gene expression values based on XGBoost, which integrates multiple tree models and has stronger interpretability. We tested the XGBoost model on the GEO and RNA-seq datasets and compared the results with those of existing models. Experiments showed that the XGBoost model achieved a significantly lower overall error than the existing D-GEX algorithm, linear regression, and KNN methods. In conclusion, the XGBoost algorithm outperforms existing models and will be a significant contribution to the toolbox for gene expression value prediction.
Keywords: XGBoost; absolute error; gene expression value; landmark gene; regression method; target gene
Year: 2019; PMID: 31781160; PMCID: PMC6861218; DOI: 10.3389/fgene.2019.01077
Source DB: PubMed; Journal: Front Genet; ISSN: 1664-8021; Impact factor: 4.599
Detailed parameter configuration.
| Parameter | Initialization value | Search space |
|---|---|---|
| n_estimators | 300 | [300, 330, 350, 370, 400] |
| γ | 0 | [0, 0.1, 0.2, 0.3, 0.4] |
| | 1 | [1, 2, 3, 4, 5, 6] |
| | 5 | [6, 7, 8, 9, 10, 11] |
| | 0.6 | [0.6, 0.7, 0.8, 0.9] |
| | 0.8 | [0.6, 0.7, 0.8, 0.9] |
| | 0.1 | [0.01, 0.05, 0.08, 0.1] |
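The per-parameter figure and table that follow suggest the parameters were examined one at a time. Assuming that is the case, a minimal sketch of such a search could look like the following. The function name `tune_one_at_a_time` and the `evaluate` callback are hypothetical, and every parameter name other than `n_estimators` and `gamma` is an assumption, because the names are missing from the table.

```python
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]

# Values copied from the configuration table. Only n_estimators and gamma are
# named elsewhere in the text; the remaining names are ASSUMED mappings to
# common XGBoost parameters.
INITIAL: Params = {
    "n_estimators": 300,
    "gamma": 0,
    "min_child_weight": 1,    # assumed name
    "max_depth": 5,           # assumed name
    "subsample": 0.6,         # assumed name
    "colsample_bytree": 0.8,  # assumed name
    "learning_rate": 0.1,     # assumed name
}

SEARCH_SPACE: Dict[str, List[float]] = {
    "n_estimators": [300, 330, 350, 370, 400],
    "gamma": [0, 0.1, 0.2, 0.3, 0.4],
    "min_child_weight": [1, 2, 3, 4, 5, 6],
    "max_depth": [6, 7, 8, 9, 10, 11],
    "subsample": [0.6, 0.7, 0.8, 0.9],
    "colsample_bytree": [0.6, 0.7, 0.8, 0.9],
    "learning_rate": [0.01, 0.05, 0.08, 0.1],
}


def tune_one_at_a_time(
    evaluate: Callable[[Params], float],
    initial: Params,
    space: Dict[str, List[float]],
) -> Tuple[Params, float]:
    """Tune each parameter in turn, keeping the others at their current best.

    `evaluate` must return a validation error (lower is better) for a given
    parameter setting, for example the absolute error on a validation set.
    """
    best = dict(initial)
    best_error = evaluate(best)
    for name, candidates in space.items():
        for value in candidates:
            trial = {**best, name: value}
            error = evaluate(trial)
            # Keep a candidate only if it strictly improves the error.
            if error < best_error:
                best, best_error = trial, error
    return best, best_error
```

In practice `evaluate` would train a model with the given setting and score it on held-out data. This sequential scheme evaluates far fewer settings than a full grid, at the cost of ignoring interactions between parameters.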
Figure 2. The absolute error on the CHAD validation set decreases as n_estimators increases.
Absolute errors of validation set corresponding to different γ.
| γ | Absolute error |
|---|---|
| 0 | 0.1712 |
| 0.1 | |
| 0.2 | 0.1709 |
| 0.3 | 0.1718 |
| 0.4 | 0.1709 |
| 0.5 | 0.1714 |
The figure in bold represents the lowest absolute error.
Optimal values of all parameters.
| Parameter | Optimal value |
|---|---|
| n_estimators | 350 |
| γ | 0.1 |
| | 1 |
| | 8 |
| | 0.8 |
| | 0.8 |
| | 0.1 |
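As a rough illustration, the optimal values above could be passed to XGBoost's scikit-learn style regressor as sketched below. This is not the authors' code: the row-to-name mapping is an assumption for every parameter except `n_estimators` and `gamma`, and `build_regressor` is a hypothetical helper that needs the third-party `xgboost` package.

```python
# Optimal values copied from the table. Names other than n_estimators and
# gamma are ASSUMED, since the table does not show them.
OPTIMAL_PARAMS = {
    "n_estimators": 350,
    "gamma": 0.1,
    "min_child_weight": 1,    # assumed name
    "max_depth": 8,           # assumed name
    "subsample": 0.8,         # assumed name
    "colsample_bytree": 0.8,  # assumed name
    "learning_rate": 0.1,     # assumed name
}


def build_regressor(params=None):
    """Return an untrained XGBRegressor configured with `params`.

    The import is done lazily so the parameter dictionary can be inspected
    even when xgboost is not installed.
    """
    from xgboost import XGBRegressor

    return XGBRegressor(**(params or OPTIMAL_PARAMS))
```

One plausible arrangement is to fit a separate regressor for each target gene, with the 943 landmark gene values as input features, for example `build_regressor().fit(X_landmark, y_target)`, where `X_landmark` and `y_target` are placeholder arrays.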
Figure 3. Top 10 landmark genes with the highest importance scores in the CHAD gene expression prediction task and their specific scores.
Figure 4. The Mean Absolute Error (MAE) distribution boxplot of the six algorithms on the test set.
Figure 5. The Mean Absolute Error (MAE) score of each target gene predicted by the XGBoost model compared with D-GEX on the test set. The x-axis is the MAE score of the XGBoost model, and the y-axis is the MAE score of D-GEX.
The overall error of six algorithms on validation set and test set.
| Algorithm | Validation set | Test set |
|---|---|---|
| LR | 0.378 | 0.378 |
| LR-L1 | 0.377 | 0.378 |
| LR-L2 | 0.378 | 0.378 |
| KNN | 0.586 | 0.587 |
| D-GEX | 0.312 | 0.320 |
| XGBoost | | |
The figures in bold represent the best results on validation set and test set, respectively.
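The figures report an MAE per target gene, while the table reports a single overall error per algorithm. Assuming the overall error is the mean of the per-gene MAE values (this definition is not stated in the text here), it can be sketched as follows. The function names are hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]  # rows are samples, columns are target genes


def per_gene_mae(y_true: Matrix, y_pred: Matrix) -> List[float]:
    """Mean absolute error of each target gene across all samples."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must have the same non-zero number of rows")
    n_samples = len(y_true)
    n_genes = len(y_true[0])
    maes = []
    for g in range(n_genes):
        total = sum(abs(y_true[i][g] - y_pred[i][g]) for i in range(n_samples))
        maes.append(total / n_samples)
    return maes


def overall_error(y_true: Matrix, y_pred: Matrix) -> float:
    """Average of the per-gene MAE values (assumed definition of overall error)."""
    maes = per_gene_mae(y_true, y_pred)
    return sum(maes) / len(maes)
```

For two samples and two genes with true values `[[1.0, 2.0], [3.0, 4.0]]` and predictions `[[1.5, 2.0], [2.5, 5.0]]`, each gene has an MAE of 0.5, so the overall error is also 0.5.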
The overall error of six algorithms on 1000G data and GTEx data.
| Algorithm | Overall error, 1000G data | Overall error, GTEx data |
|---|---|---|
| LR | 0.805 | 0.470 |
| LR-L1 | 0.746 | 0.567 |
| LR-L2 | 0.805 | 0.470 |
| KNN | 0.747 | 0.652 |
| D-GEX | 0.749 | 0.453 |
| XGBoost | | |
The figures in bold represent the best results on 1000G data and GTEx data, respectively.