Jiyun Zhou, Qin Lu, Lin Gui, Ruifeng Xu, Yunfei Long, Hongpeng Wang.
Abstract
MOTIVATION: The prediction of transcription factor binding sites (TFBSs) is crucial for gene expression analysis. Supervised learning approaches for TFBS prediction require large amounts of labeled data. However, for certain cell types many TFs have insufficient labeled data or none at all.
Year: 2019 PMID: 31161194 PMCID: PMC6954652 DOI: 10.1093/bioinformatics/btz451
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Architecture of multi-task learning for TFBS prediction. (A) Fully shared method and (B) shared-private method
Fig. 2. Box plot depicting the AUC performance of data augmentation by the baseline method, the fully shared model and MTTFsite on TFs in the five cell types
Details of the AUC comparison between MTTFsite and the baseline method for data augmentation
| Cell type | GM12878 | H1-hESC | HeLa-S3 | HepG2 | K562 | Average |
|---|---|---|---|---|---|---|
| Sample total | 56 | 42 | 37 | 43 | 63 | 48.2 |
| Improvement total | 52 | 37 | 34 | 41 | 60 | 44.8 |
| Improvement (%) | 92.9 | 88.1 | 91.9 | 95.3 | 95.2 | 92.9 |
| Maximum | 31.7 | 12.7 | 22.1 | 37.8 | 17.9 | 24.5 |
| Average | 3.6 | 3.5 | 3.2 | 3.8 | 2.9 | 3.4 |
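Note: The Average column of the Improvement (%) row is the micro average over the total number of samples. The Maximum row gives the maximum improvement and the Average row gives the average improvement.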
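The micro average in the Improvement (%) row pools the counts from all five cell types before dividing, rather than averaging the five per-cell-type percentages. A minimal Python sketch of that arithmetic, using the counts from the table above; the function names are illustrative and do not come from the paper:

```python
# Counts copied from the MTTFsite vs. baseline data-augmentation table.
cell_types = ["GM12878", "H1-hESC", "HeLa-S3", "HepG2", "K562"]
sample_total = [56, 42, 37, 43, 63]
improved_total = [52, 37, 34, 41, 60]


def micro_average(improved, totals):
    """Pool the counts across cell types, then take one ratio (in percent)."""
    return 100.0 * sum(improved) / sum(totals)


def macro_average(improved, totals):
    """Average the per-cell-type percentages with equal weight."""
    rates = [100.0 * i / t for i, t in zip(improved, totals)]
    return sum(rates) / len(rates)


micro = micro_average(improved_total, sample_total)
macro = macro_average(improved_total, sample_total)
print(f"micro: {micro:.1f}%  macro: {macro:.1f}%")
```

The pooled ratio is 224/241, which rounds to the 92.9 reported in the Average column. The macro average is shown only for contrast: it weights every cell type equally regardless of how many TFs it contributes.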
Details of the AUC comparison between MTTFsite and the fully shared model for data augmentation
| Cell type | GM12878 | H1-hESC | HeLa-S3 | HepG2 | K562 | Average |
|---|---|---|---|---|---|---|
| Sample total | 56 | 42 | 37 | 43 | 63 | 48.2 |
| Improvement total | 47 | 39 | 37 | 39 | 59 | 44.2 |
| Improvement (%) | 83.9 | 92.9 | 100 | 90.7 | 93.7 | 91.7 |
| Maximum | 2.2 | 2.8 | 2.9 | 2.2 | 4.7 | 3.1 |
| Average | 0.6 | 1.2 | 0.7 | 0.6 | 0.8 | 0.8 |
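Note: The Average column of the Improvement (%) row is the micro average over the total number of samples. The Maximum row gives the maximum improvement and the Average row gives the average improvement.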
The AUC of five state-of-the-art methods and MTTFsite on five TFs in five cell types
| TF | Cell type | PWM | DWM | DanQ | DanQ-J | DeepSEA | MTTFsite |
|---|---|---|---|---|---|---|---|
| CTCF | GM12878 | 0.586 | 0.578 | | 0.731 | 0.677 | |
| CTCF | H1-hESC | 0.566 | 0.575 | | 0.758 | 0.689 | |
| CTCF | HeLa-S3 | 0.505 | 0.509 | | 0.698 | 0.670 | |
| CTCF | HepG2 | 0.523 | 0.527 | | 0.757 | 0.697 | |
| CTCF | K562 | 0.923 | | 0.728 | 0.693 | 0.635 | |
| GABP | GM12878 | 0.844 | 0.844 | 0.797 | | 0.791 | |
| GABP | H1-hESC | 0.721 | 0.740 | | | 0.763 | 0.729 |
| GABP | HeLa-S3 | | 0.875 | 0.658 | 0.681 | 0.630 | |
| GABP | HepG2 | 0.786 | 0.791 | 0.794 | | 0.795 | |
| GABP | K562 | 0.756 | 0.754 | 0.775 | | 0.763 | |
| JunD | GM12878 | 0.906 | | 0.621 | 0.606 | 0.589 | |
| JunD | H1-hESC | 0.557 | 0.566 | | 0.686 | 0.643 | |
| JunD | HeLa-S3 | | 0.860 | 0.777 | 0.788 | 0.711 | |
| JunD | HepG2 | | | 0.813 | 0.826 | 0.738 | 0.829 |
| JunD | K562 | 0.684 | | 0.655 | 0.653 | 0.595 | |
| REST | GM12878 | 0.906 | | 0.621 | 0.606 | 0.589 | |
| REST | HeLa-S3 | 0.899 | | 0.602 | 0.597 | 0.559 | |
| REST | HepG2 | 0.886 | | 0.630 | 0.603 | 0.602 | |
| REST | K562 | 0.867 | | 0.646 | 0.645 | 0.623 | |
| USF2 | GM12878 | 0.891 | | 0.673 | 0.698 | 0.615 | |
| USF2 | H1-hESC | 0.841 | | 0.729 | 0.752 | 0.662 | |
| USF2 | HeLa-S3 | 0.908 | | 0.641 | 0.654 | 0.561 | |
| USF2 | HepG2 | 0.952 | | 0.697 | 0.751 | 0.591 | |
| USF2 | K562 | 0.921 | | 0.660 | 0.715 | 0.580 | |
Note: DanQ-J denotes DanQ-JASPAR. The bold and underscore numbers denote the best performer and second best performer, respectively.
Fig. 3. Box plot depicting the AUC performance of cross-cell-type prediction by the baseline method, the fully shared model and MTTFsite on TFs in the five cell types
Details of the AUC comparison between MTTFsite and the baseline method for cross-cell-type prediction
| Cell type | GM12878 | H1-hESC | HeLa-S3 | HepG2 | K562 | Average |
|---|---|---|---|---|---|---|
| Sample total | 56 | 42 | 37 | 43 | 63 | 48.2 |
| Improvement total | 46 | 31 | 29 | 35 | 54 | 39 |
| Improvement (%) | 82.1 | 73.8 | 78.4 | 81.4 | 85.7 | 80.9 |
| Maximum | 40.9 | 31.0 | 25.7 | 42.0 | 34.7 | 36.9 |
| Average | 5.1 | 8.0 | 4.1 | 5.1 | 4.0 | 5.1 |
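Note: The Average column of the Improvement (%) row is the micro average over the total number of samples. The Maximum row gives the maximum improvement and the Average row gives the average improvement.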
Details of the AUC comparison between MTTFsite and the fully shared model for cross-cell-type prediction
| Cell type | GM12878 | H1-hESC | HeLa-S3 | HepG2 | K562 | Average |
|---|---|---|---|---|---|---|
| Sample total | 56 | 42 | 37 | 43 | 63 | 48.2 |
| Improvement total | 54 | 37 | 36 | 41 | 59 | 45.4 |
| Improvement (%) | 96.4 | 88.1 | 97.3 | 95.3 | 93.7 | 94.2 |
| Maximum | 4.2 | 3.6 | 3.5 | 4.0 | 4.4 | 4.0 |
| Average | 1.2 | 1.5 | 1.2 | 1.4 | 1.3 | 1.3 |
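Note: The Average column of the Improvement (%) row is the micro average over the total number of samples. The Maximum row gives the maximum improvement and the Average row gives the average improvement.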
Fig. 4. (A) Scatter plot depicting the distribution of the AUC performance for cell-type-shared TFBSs and cell-type-specific TFBSs. (B) Box plot depicting the AUC performance for cell-type-shared TFBSs and cell-type-specific TFBSs on TFs in the five cell types
Fig. 5. Cosine similarities of cell-type-specific TFBSs in different cell types
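Cosine similarity compares the direction of two vectors and ignores their magnitudes. The sketch below is illustrative only: the feature vectors are made-up placeholders, and how the paper derives a vector for each cell type's TFBSs is not stated in this excerpt.

```python
import math


def cosine_similarity(u, v):
    """Return dot(u, v) / (|u| * |v|) for two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical feature vectors, one per cell type (placeholder numbers,
# not values from the paper).
features = {
    "GM12878": [0.9, 0.1, 0.3],
    "HepG2": [0.8, 0.2, 0.4],
    "K562": [0.1, 0.9, 0.2],
}

# Pairwise similarity keyed by (cell type, cell type).
similarity = {
    (a, b): cosine_similarity(features[a], features[b])
    for a in features
    for b in features
}

for (a, b), s in similarity.items():
    print(f"{a:8s} {b:8s} {s:.3f}")
```

Each cell type has similarity 1 with itself and the matrix is symmetric, so only the off-diagonal entries of one triangle carry information.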
The AUC of the gene expression predictions on the 20 cell types from REMC
| Cells | TFBS | Histone | Combine |
|---|---|---|---|
| Breast_vHMEC | 0.779 | | |
| Fetal_Brain | 0.764 | | |
| Fetal_Muscle_Leg | 0.773 | | |
| Fetal_Muscle_Trunk | 0.759 | | |
| Gastric | 0.752 | | |
| H1_BMP4_Derived_Mesendoderm_Cultured_Cells | 0.746 | | |
| H1_BMP4_Derived_Trophoblast_Cultured_Cells | 0.751 | | |
| H1_Cell_Line | 0.754 | | |
| H1_Derived_Mesenchymal_Stem_Cells | 0.782 | | |
| H1_Derived_Neuronal_Progenitor_Cultured_Cells | 0.752 | | |
| IMR90_Cell_Line | 0.789 | | |
| iPS_DF_19.11_Cell_Line | 0.744 | | |
| iPS_DF_6.9_Cell_Line | 0.746 | | |
| Mobilized_CD34_Primary_Cells | 0.797 | | |
| Pancreas | 0.754 | | |
| Penis_Foreskin_Fibroblast_Primary_Cells | 0.815 | | |
| Penis_Foreskin_Keratinocyte_Primary_Cells | 0.794 | | |
| Penis_Foreskin_Melanocyte_Primary_Cells | 0.801 | | |
| Psoas_Muscle | 0.767 | | |
| Small_Intestine | 0.767 | | |
Note: The bold and underscore numbers denote the best performer and second best performer, respectively.