Quang H Nguyen, Hoang V Tran, Binh P Nguyen, Trang T T Do.
Abstract
Transcription factors (TFs) play an important role in gene expression and in the regulation of 3D genome conformation. TFs can bind to specific DNA fragments called enhancers and promoters. Some TFs bind to promoter DNA fragments near the transcription initiation site and form complexes that allow polymerase enzymes to bind and initiate transcription. Previous studies showed that DNA methylation can inhibit or prevent TFs from binding to DNA fragments. However, recent studies have found that some TFs can bind to methylated DNA fragments. Identifying these TFs is an important stepping stone toward a better understanding of cellular gene expression mechanisms. Because experimental methods are often time-consuming and labor-intensive, developing computational methods is essential. In this study, we propose two machine learning methods for two problems: (1) identifying TFs and (2) identifying TFs that prefer binding to methylated DNA targets (TFPMs). For the TF identification problem, the proposed method uses the position-specific scoring matrix (PSSM) for data representation and a deep convolutional neural network (CNN) for modeling. This method achieved 90.56% sensitivity, 83.96% specificity, and an area under the receiver operating characteristic curve (AUC) of 0.9596 on an independent test set. For the TFPM identification problem, we propose using the reduced g-gap dipeptide composition (reduced GDC) for data representation and the support vector machine (SVM) algorithm for modeling. This method achieved 82.61% sensitivity, 64.86% specificity, and an AUC of 0.8486 on another independent test set. These results are higher than those of other studies on the same problems.
Year: 2022 PMID: 36119976 PMCID: PMC9475634 DOI: 10.1021/acsomega.2c03696
Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343
Datasets for Classification of TFs Versus Non-TFs and TFPMs Versus TFPNMs
| dataset | TFs vs non-TFs: positive | TFs vs non-TFs: negative | TFPM vs TFPNM: positive | TFPM vs TFPNM: negative |
|---|---|---|---|---|
| training set | 416 | 416 | 270 | 146 |
| independent test set | 106 | 106 | 69 | 37 |
Figure 1. Data representation for TF classification.
Figure 2. Distribution of amino acids in the protein sequences in the training set, showing that 97.23% of the sequences have lengths less than 1500.
Figure 3. Architecture of the CNN models used for TF classification.
Figure 4. Ensemble learning for TF classification.
Figure 5. Data representation for TFPM classification.
Op(13) Grouping for Amino Acids
| index | group of amino acids | symbols for the group |
|---|---|---|
| 1 | G (glycine) | G |
| 2 | I (isoleucine), V (valine) | I |
| 3 | F (phenylalanine), Y (tyrosine), W (tryptophan) | F |
| 4 | A (alanine) | A |
| 5 | L (leucine) | L |
| 6 | M (methionine) | M |
| 7 | E (glutamic acid) | E |
| 8 | Q (glutamine), R (arginine), K (lysine) | Q |
| 9 | P (proline) | P |
| 10 | N (asparagine), D (aspartic acid) | N |
| 11 | H (histidine), S (serine) | H |
| 12 | T (threonine) | T |
| 13 | C (cysteine) | C |
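The grouping above can be turned into a feature vector as follows. This is a minimal sketch, not the authors' implementation: it assumes the usual definition of g-gap dipeptide composition (the frequency of residue pairs separated by `gap` positions) applied to the reduced 13-symbol alphabet, which gives 13 × 13 = 169 features. The function names, the handling of unknown letters, and the normalization by the number of pairs are illustrative assumptions.

```python
from itertools import product

# Op(13) grouping from the table above: member residues -> group symbol.
OP13_GROUPS = {
    "G": "G", "IV": "I", "FYW": "F", "A": "A", "L": "L", "M": "M", "E": "E",
    "QRK": "Q", "P": "P", "ND": "N", "HS": "H", "T": "T", "C": "C",
}
RESIDUE_TO_GROUP = {aa: sym for members, sym in OP13_GROUPS.items() for aa in members}
GROUP_SYMBOLS = list(OP13_GROUPS.values())  # 13 symbols, in table order


def reduce_sequence(seq):
    """Replace each residue by its group symbol.

    Letters outside the 20 standard amino acids are skipped (an assumption).
    """
    return "".join(RESIDUE_TO_GROUP[aa] for aa in seq.upper() if aa in RESIDUE_TO_GROUP)


def reduced_gdc(seq, gap=0):
    """Return the 169 g-gap dipeptide frequencies of the reduced sequence."""
    reduced = reduce_sequence(seq)
    pairs = [a + b for a, b in product(GROUP_SYMBOLS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    n_pairs = len(reduced) - gap - 1
    for i in range(max(n_pairs, 0)):
        # Pair the residue at position i with the one gap + 1 positions later.
        counts[reduced[i] + reduced[i + gap + 1]] += 1
    total = max(n_pairs, 1)  # avoid division by zero for very short sequences
    return [counts[p] / total for p in pairs]
```

For example, `reduce_sequence("GIVFYW")` maps V to I and both Y and W to F, so the reduced sequence is `"GIIFFF"`; with `gap=0` its five adjacent pairs are GI, II, IF, FF, and FF.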
Performances of the First Model with Different Input Lengths
| input length | sensitivity | specificity | accuracy | MCC | AUC |
|---|---|---|---|---|---|
| 1000 | 0.7710 | 0.9518 | 0.8614 | 0.7349 | 0.9577 |
| 1500 | 0.7831 | 0.9638 | 0.8734 | 0.7594 | 0.9585 |
| 2000 | 0.7590 | 0.9518 | 0.8554 | 0.7244 | 0.9522 |
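A CNN needs inputs of a fixed size, so each PSSM must be brought to one of the input lengths compared above. The sketch below shows one way this could be done, assuming a PSSM stored as a list of rows with 20 scores per residue; truncating long sequences and zero-padding short ones is an assumption for illustration, and `fit_pssm_length` is a hypothetical helper name, not taken from the paper.

```python
def fit_pssm_length(pssm, target_len=1500, n_cols=20):
    """Truncate or zero-pad a PSSM (one row of n_cols scores per residue).

    Zero-padding at the end is an illustrative assumption.
    """
    for row in pssm:
        if len(row) != n_cols:
            raise ValueError(f"expected {n_cols} scores per residue, got {len(row)}")
    # Keep at most target_len rows, copying so the input is not modified.
    fixed = [list(row) for row in pssm[:target_len]]
    # Append all-zero rows until the matrix has exactly target_len rows.
    fixed.extend([0.0] * n_cols for _ in range(target_len - len(fixed)))
    return fixed
```

With `target_len=1500`, a 300-residue protein yields 300 score rows followed by 1200 zero rows, while a 2000-residue protein keeps only its first 1500 rows.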
Figure 6Best AUC values from repeated fivefold cross-validation for different settings of reduced GDC.
Comparison of the Proposed Frameworks with Other Methods on the Same Independent Test Set
| problem | method | sensitivity | specificity | accuracy | MCC | AUC |
|---|---|---|---|---|---|---|
| TF vs non-TF | Liu et al. | 0.8019 | | 0.8302 | 0.6614 | 0.9116 |
| | Li et al. | 0.8868 | 0.8396 | 0.8663 | 0.7272 | 0.9130 |
| | PSSM + CNN (Ours) | 0.9056 | 0.8396 | | | 0.9596 |
| TFPM vs TFPNM | Liu et al. | 0.7101 | 0.6486 | 0.6887 | 0.3471 | 0.7356 |
| | Li et al. | 0.7826 | | 0.7359 | | 0.8324 |
| | RGDC + SVM (Ours) | 0.8261 | 0.6486 | | 0.4778 | 0.8486 |
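The metrics in this table all follow from the four confusion-matrix counts. The sketch below uses the standard definitions of sensitivity, specificity, accuracy, and the Matthews correlation coefficient (MCC). The example counts are not reported in the source: they are inferred from the 69 positive and 37 negative TFPM test samples together with the reported 82.61% sensitivity and 64.86% specificity, and should be read as an assumption.

```python
from math import sqrt


def binary_metrics(tp, fn, tn, fp):
    """Compute standard binary classification metrics from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Inferred counts (assumption): 57 of 69 positives and 24 of 37 negatives correct.
tfpm = binary_metrics(tp=57, fn=12, tn=24, fp=13)
print({name: round(value, 4) for name, value in tfpm.items()})
```

Under these inferred counts the sensitivity rounds to 0.8261, the specificity to 0.6486, and the MCC to 0.4778, matching the values reported for the RGDC + SVM model.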