Abstract
BACKGROUND: Understanding how enhancer-promoter interactions (EPIs) regulate the expression of specific genes in cells contributes to the understanding of gene regulation, cell differentiation, and related processes, yet identifying EPIs remains a challenging task. On the one hand, identifying EPIs with traditional wet-lab experiments is costly in both labor and time. On the other hand, although the currently proposed computational methods achieve good recognition performance, they generally require long training times.
Keywords: Bioinformatics; Enhancer–promoter interaction; Feature extraction; Machine learning; Stacking strategy
Year: 2022 PMID: 35820811 PMCID: PMC9277947 DOI: 10.1186/s12859-022-04821-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 StackEPI overview. It includes (a) data preprocessing, (b) feature extraction, and (c) the integrated framework
Fig. 2 AUROC of the baseline models combined with 6 feature encodings and 5 machine learning algorithms on 6 cell lines
Fig. 3 AUPR of the baseline models combined with 6 feature encodings and 5 machine learning algorithms on 6 cell lines
Fig. 4 F1-score of the baseline models combined with 6 feature encodings and 5 machine learning algorithms on 6 cell lines
Detailed performance evaluation of StackEPI's candidate meta-classifiers in 6 cell lines
| Cell lines | Meta-models | AUROC | AUPR | F1-score |
|---|---|---|---|---|
| GM12878 | DF | 0.925 | 0.760 | 0.585 |
| | LightGBM | 0.929 | 0.716 | 0.672 |
| | RF | 0.865 | 0.708 | 0.701 |
| | SVM | 0.936 | 0.773 | 0.705 |
| | XGBoost | 0.915 | 0.722 | 0.439 |
| | LR | 0.932 | 0.772 | |
| | MLP | 0.715 | | |
| HeLa-S3 | DF | 0.948 | 0.810 | 0.665 |
| | LightGBM | 0.945 | 0.783 | 0.719 |
| | RF | 0.898 | 0.759 | 0.736 |
| | SVM | 0.948 | 0.816 | 0.730 |
| | XGBoost | 0.893 | 0.697 | 0.380 |
| | LR | 0.949 | 0.815 | 0.746 |
| | MLP | | | |
| HUVEC | DF | 0.902 | 0.632 | 0.426 |
| | LightGBM | 0.892 | 0.587 | 0.382 |
| | RF | 0.822 | 0.544 | 0.518 |
| | SVM | 0.838 | 0.616 | |
| | XGBoost | 0.854 | 0.567 | 0.577 |
| | LR | 0.909 | 0.641 | 0.574 |
| | MLP | 0.601 | | |
| IMR90 | DF | 0.931 | 0.717 | 0.568 |
| | LightGBM | 0.860 | 0.625 | 0.620 |
| | RF | 0.849 | 0.665 | 0.624 |
| | SVM | 0.922 | 0.629 | 0.449 |
| | XGBoost | 0.882 | 0.696 | 0.281 |
| | LR | 0.686 | | |
| | MLP | 0.933 | 0.738 | |
| K562 | DF | 0.924 | 0.735 | 0.560 |
| | LightGBM | 0.929 | 0.688 | 0.449 |
| | RF | 0.859 | 0.677 | 0.655 |
| | SVM | 0.928 | 0.745 | 0.662 |
| | XGBoost | 0.900 | 0.697 | 0.655 |
| | LR | 0.927 | 0.745 | 0.680 |
| | MLP | | | |
| NHEK | DF | 0.980 | 0.904 | 0.780 |
| | LightGBM | 0.984 | 0.892 | 0.688 |
| | RF | 0.953 | 0.871 | 0.841 |
| | SVM | 0.984 | 0.913 | |
| | XGBoost | 0.966 | 0.681 | |
| | LR | 0.913 | | |
| | MLP | 0.914 | 0.844 | |
Detailed performance evaluation of StackEPI-faster's candidate meta-classifiers in 6 cell lines
| Cell lines | Meta-models | AUROC | AUPR | F1-score |
|---|---|---|---|---|
| GM12878 | LR | | | |
| | MLP | 0.775 | 0.713 | |
| HeLa-S3 | LR | 0.808 | | |
| | MLP | 0.961 | 0.733 | |
| HUVEC | LR | 0.935 | 0.642 | 0.578 |
| | MLP | | | |
| IMR90 | LR | | | |
| | MLP | 0.944 | 0.734 | 0.686 |
| K562 | LR | 0.684 | | |
| | MLP | 0.758 | | |
| NHEK | LR | | | |
| | MLP | 0.989 | 0.912 | 0.830 |
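The two tables above compare candidate meta-classifiers placed on top of the baseline models. As a rough illustration of that stacking arrangement, and not the authors' implementation, the sketch below uses scikit-learn's `StackingClassifier` on synthetic, imbalanced data. The base learners, the synthetic features, and every hyperparameter are assumptions chosen for brevity; only the use of logistic regression (LR) as a meta-classifier is taken from the tables.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for encoded enhancer-promoter pairs (assumption):
# about 1 positive per 20 negatives, similar to the original dataset.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical base learners standing in for the baseline models.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
]

# The meta-classifier is trained on cross-validated probabilities
# produced by the base learners.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)

scores = stack.predict_proba(X_test)[:, 1]
auroc = roc_auc_score(y_test, scores)
aupr = average_precision_score(y_test, scores)
print(f"AUROC={auroc:.3f}  AUPR={aupr:.3f}")
```

Swapping `final_estimator` for another classifier (for example an MLP) is how one could reproduce the kind of meta-classifier comparison shown in the tables, under the same assumptions.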
AUROC of EPIVAN, EPI-DLMH, StackEPI(-faster) in 6 cell lines
| Methods | GM12878 | HeLa-S3 | HUVEC | IMR90 | K562 | NHEK |
|---|---|---|---|---|---|---|
| EPIVAN | 0.919 | 0.954 | 0.933 | 0.887 | 0.937 | 0.986 |
| EPI-DLMH | 0.929 | 0.962 | 0.936 | 0.893 | 0.939 | 0.987 |
| StackEPI | 0.939 | 0.957 | 0.910 | 0.933 | 0.930 | 0.985 |
| StackEPI-faster | 0.945 | 0.962 | 0.937 | 0.946 | 0.944 | 0.990 |
AUPR of EPIVAN, EPI-DLMH, StackEPI(-faster) in 6 cell lines
| Methods | GM12878 | HeLa-S3 | HUVEC | IMR90 | K562 | NHEK |
|---|---|---|---|---|---|---|
| EPIVAN | 0.756 | 0.819 | 0.640 | 0.688 | 0.752 | 0.910 |
| EPI-DLMH | 0.789 | 0.863 | 0.709 | 0.712 | 0.767 | 0.911 |
| StackEPI | 0.779 | 0.822 | 0.651 | 0.738 | 0.748 | 0.914 |
| StackEPI-faster | 0.777 | 0.810 | 0.644 | 0.737 | 0.760 | 0.913 |
F1-score of EPIVAN, EPI-DLMH, StackEPI(-faster) in 6 cell lines
| Methods | GM12878 | HeLa-S3 | HUVEC | IMR90 | K562 | NHEK |
|---|---|---|---|---|---|---|
| EPIVAN | 0.700 | 0.717 | 0.590 | 0.628 | 0.678 | 0.852 |
| EPI-DLMH | 0.751 | 0.809 | 0.619 | 0.692 | 0.712 | 0.860 |
| StackEPI | 0.715 | 0.753 | 0.601 | 0.692 | 0.717 | 0.844 |
| StackEPI-faster | 0.725 | 0.737 | 0.601 | 0.691 | 0.701 | 0.836 |
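The three comparison tables above report two ranking metrics (AUROC, AUPR) and one thresholded metric (F1-score). As a reminder of what each one measures, here is a minimal, dependency-free sketch. The function names, the toy labels and scores, and the 0.5 decision threshold are illustrative assumptions; AUPR is computed here as average precision, which is one common estimator and may differ from the one used in the paper.

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count pairwise wins; a tie contributes half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean precision at the rank of each positive (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


def f1_score(labels, scores, threshold=0.5):
    """Harmonic mean of precision and recall at a fixed score threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Toy example: 2 positives among 8 samples.
labels = [1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]

a = auroc(labels, scores)               # 11 of 12 positive-negative pairs ranked correctly
ap = average_precision(labels, scores)  # (1/1 + 2/3) / 2
f1 = f1_score(labels, scores)           # tp=2, fp=1, fn=0
print(f"AUROC={a:.3f}  AUPR={ap:.3f}  F1={f1:.3f}")
# prints: AUROC=0.917  AUPR=0.833  F1=0.800
```

In the toy example a single highly ranked negative barely moves AUROC but lowers AUPR and F1-score noticeably, which is why the latter two are informative when positives are rare.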
Training duration (h) of EPIVAN, EPI-DLMH, StackEPI(-faster) in 6 cell lines
| Methods | GM12878 | HeLa-S3 | HUVEC | IMR90 | K562 | NHEK |
|---|---|---|---|---|---|---|
| EPIVAN | 1.870 | 1.299 | 1.460 | 1.063 | 1.664 | 1.079 |
| EPI-DLMH | 16.891 | 12.304 | 14.105 | 10.174 | 16.041 | 10.431 |
| StackEPI | 1.135 | 0.807 | 0.694 | 0.498 | 0.983 | 0.515 |
| StackEPI-faster | 0.088 | 0.046 | 0.043 | 0.031 | 0.060 | 0.036 |
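To put these training times in perspective, the ratios between methods can be computed directly from the table. The snippet below is a small arithmetic sketch; the dictionary layout and the `speedup` helper are illustrative names, while the numbers are copied from the table above.

```python
cell_lines = ["GM12878", "HeLa-S3", "HUVEC", "IMR90", "K562", "NHEK"]

# Training duration in hours, copied from the table.
hours = {
    "EPIVAN": [1.870, 1.299, 1.460, 1.063, 1.664, 1.079],
    "EPI-DLMH": [16.891, 12.304, 14.105, 10.174, 16.041, 10.431],
    "StackEPI": [1.135, 0.807, 0.694, 0.498, 0.983, 0.515],
    "StackEPI-faster": [0.088, 0.046, 0.043, 0.031, 0.060, 0.036],
}


def speedup(slow, fast):
    """Per-cell-line ratio of training times (slow method / fast method)."""
    return {
        cell: s / f
        for cell, s, f in zip(cell_lines, hours[slow], hours[fast])
    }


# Total training time over the six cell lines for each method.
totals = {method: sum(values) for method, values in hours.items()}

for slow in ("EPIVAN", "EPI-DLMH", "StackEPI"):
    ratios = speedup(slow, "StackEPI-faster")
    overall = totals[slow] / totals["StackEPI-faster"]
    per_cell = ", ".join(f"{cell}: {r:.1f}x" for cell, r in ratios.items())
    print(f"{slow} vs StackEPI-faster -> overall {overall:.1f}x ({per_cell})")
```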
Number of positive and negative samples for 6 cell lines in the original dataset
| Cell lines | Positive samples | Negative samples |
|---|---|---|
| GM12878 | 2,113 | 42,200 |
| HeLa-S3 | 1,740 | 34,800 |
| HUVEC | 1,524 | 30,400 |
| IMR90 | 1,254 | 25,000 |
| K562 | 1,977 | 39,500 |
| NHEK | 1,291 | 25,600 |
| Total | 9,899 | 197,500 |
The six human cell lines, GM12878, HeLa-S3, HUVEC, IMR90, K562 and NHEK, represent lymphoblastoid cells, ectoderm-lineage cells from a patient with cervical cancer, umbilical vein endothelial cells, fetal lung fibroblasts, mesoderm-lineage cells from a patient with leukemia, and epidermal keratinocytes, respectively. Total is the sum over the six cell lines
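The counts in this table imply a strong class imbalance, which helps explain why AUPR and F1-score are reported alongside AUROC. A short sketch of the arithmetic follows. The variable names are illustrative, and the remark about a chance-level AUPR assumes evaluation on data with the same class ratio, which this section does not state.

```python
# (positive, negative) sample counts per cell line, copied from the table.
samples = {
    "GM12878": (2113, 42200),
    "HeLa-S3": (1740, 34800),
    "HUVEC": (1524, 30400),
    "IMR90": (1254, 25000),
    "K562": (1977, 39500),
    "NHEK": (1291, 25600),
}

for cell, (pos, neg) in samples.items():
    ratio = neg / pos                 # negatives per positive
    prevalence = pos / (pos + neg)    # fraction of positives
    print(f"{cell}: 1:{ratio:.1f}, positives = {prevalence:.2%}")

total_pos = sum(pos for pos, _ in samples.values())
total_neg = sum(neg for _, neg in samples.values())
overall_prevalence = total_pos / (total_pos + total_neg)

# A classifier that ranks at random has an expected precision equal to the
# positive fraction, so on data with this ratio the chance-level AUPR would
# be about 0.05 rather than 0.5.
print(f"Total: {total_pos} positives, {total_neg} negatives, "
      f"positives = {overall_prevalence:.2%}")
```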
Fig. 5 DF model structure. It consists of four main components: Binner, Cascade Layer, Estimator, and Predictor. The Binner reduces the feature input, the Estimators form the cascade layers, and each Cascade Layer processes its input and generates the feature information for the next layer. The Predictor is an optional estimator that produces the final prediction
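The caption names four components. A heavily simplified sketch of how they could fit together is given below, built from scikit-learn parts rather than the DF implementation used in the paper. `KBinsDiscretizer` stands in for the Binner, two tree ensembles per layer stand in for the Estimators, and the Predictor is omitted because the caption marks it as optional (the last layer's averaged probabilities are used instead). The layer count, bin count, ensemble sizes, and synthetic data are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic data as a placeholder for real feature vectors (assumption).
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Binner stand-in: map every feature to a small set of integer bins.
binner = KBinsDiscretizer(n_bins=16, encode="ordinal", strategy="quantile")
Xb_train = binner.fit_transform(X_train)
Xb_test = binner.transform(X_test)


def make_estimators(seed):
    """Hypothetical pair of estimators forming one cascade layer."""
    return [
        RandomForestClassifier(n_estimators=30, random_state=seed),
        ExtraTreesClassifier(n_estimators=30, random_state=seed),
    ]


n_layers = 2  # fixed depth for the sketch
aug_train, aug_test = Xb_train, Xb_test
for layer in range(n_layers):
    train_probas, test_probas = [], []
    for est in make_estimators(seed=layer):
        # Out-of-fold probabilities, so the next layer is not trained on
        # predictions made for samples the estimator has already seen.
        train_probas.append(
            cross_val_predict(est, aug_train, y_train, cv=3, method="predict_proba")
        )
        test_probas.append(est.fit(aug_train, y_train).predict_proba(aug_test))
    # Cascade step: the next layer receives the binned features plus the
    # class probabilities produced by this layer.
    aug_train = np.hstack([Xb_train] + train_probas)
    aug_test = np.hstack([Xb_test] + test_probas)

# No separate Predictor: average the last layer's estimator outputs.
final_proba = np.mean(test_probas, axis=0)
y_pred = final_proba.argmax(axis=1)
accuracy = (y_pred == y_test).mean()
print(f"augmented width: {aug_train.shape[1]}, test accuracy: {accuracy:.3f}")
```

A fixed number of layers keeps the sketch short; an actual cascade would more likely add layers until validation performance stops improving.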