K C Kishan1, Sridevi K Subramanya2, Rui Li1, Feng Cui3.
Abstract
BACKGROUND: Most transcription factors (TFs) compete with nucleosomes to gain access to their cognate binding sites. Recent studies have identified several TF-nucleosome interaction modes, including end binding (EB), oriented binding, periodic binding, dyad binding, groove binding, and gyre spanning. However, there are substantial experimental challenges in measuring nucleosome binding modes for thousands of TFs in different species.
Keywords: Machine learning; Nucleosome binding modes; Transcription factors
Year: 2021 PMID: 33784978 PMCID: PMC8008688 DOI: 10.1186/s12859-021-04093-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 A model to predict nucleosome-TF binding patterns. Each sequence is projected to a multivariate Gaussian distribution described by a mean vector and a covariance matrix. Similarities between these multivariate Gaussian distributions are computed to form a kernel matrix, and a multi-label classifier is trained on the kernel matrix to model binding preferences
Fig. 2 A workflow to represent an amino acid sequence as a matrix of ProtVec embeddings. A sequence is split into subsequences of a fixed length, and the ProtVec embeddings of these subsequences are used to build a feature matrix
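The workflow in Fig. 2 can be illustrated with a minimal sketch. It assumes overlapping subsequences taken with a stride of one residue, which the caption does not state, and it replaces the real ProtVec lookup with a hypothetical `toy_embedding` function that returns a deterministic pseudo-random vector per subsequence. The names `split_into_subsequences`, `toy_embedding`, and `feature_matrix`, and the embedding dimension of 8, are illustrative choices only.

```python
import random


def split_into_subsequences(sequence, k):
    """Return every overlapping subsequence of length k (stride 1, an assumption)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def toy_embedding(subsequence, dim=8):
    """Hypothetical stand-in for a ProtVec lookup.

    Seeding the generator with the subsequence makes the vector
    reproducible, so identical subsequences map to identical vectors.
    """
    rng = random.Random(subsequence)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def feature_matrix(sequence, k=4, dim=8):
    """Stack one embedding per subsequence into a (num_subsequences x dim) matrix."""
    return [toy_embedding(s, dim) for s in split_into_subsequences(sequence, k)]


# A made-up 10-residue sequence, used only to show the shapes involved.
matrix = feature_matrix("MKTAYIAKQR", k=4)
print(len(matrix), len(matrix[0]))
```

With a real embedding table, only `toy_embedding` would need to change; the splitting and stacking steps stay the same.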
Datasets used in the study
| Binding preferences | No. of positive samples | No. of negative samples |
|---|---|---|
| End preference | 121 | 46 |
| Periodic preference | 98 | 69 |
| Groove preference | 45 | 122 |
| Dyad preference | 10 | 157 |
| Gyre spanning | 3 | 164 |
| Orientational preference | 12 | 155 |
| Nucleosome stability | 71 | 96 |
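The counts in the table above can be checked with a few lines of arithmetic. The sketch below copies the table values, confirms that every label covers the same number of samples, and prints the positive fraction per label, which makes the class imbalance of labels such as gyre spanning explicit. The variable and function names are illustrative.

```python
# (label, positive samples, negative samples), copied from the dataset table.
datasets = [
    ("End preference", 121, 46),
    ("Periodic preference", 98, 69),
    ("Groove preference", 45, 122),
    ("Dyad preference", 10, 157),
    ("Gyre spanning", 3, 164),
    ("Orientational preference", 12, 155),
    ("Nucleosome stability", 71, 96),
]


def positive_fraction(positives, negatives):
    """Share of samples that carry the label."""
    return positives / (positives + negatives)


totals = {label: pos + neg for label, pos, neg in datasets}

for label, pos, neg in datasets:
    print(f"{label:25s} n={pos + neg}  positive={positive_fraction(pos, neg):.1%}")
```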
The parameters and set of values for various off-the-shelf baselines
| Method | Tuning parameters |
|---|---|
| Logistic regression | Norm used in the penalization: none, L1, L2, elastic net; Regularization coefficient: 100, 10, 1, 0.1, 0.01; Subsequence length: 3, 4, 5, 6 |
| k-nearest neighbors | Number of neighbors: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21; Contribution of members in the neighborhood: uniform, distance; Distance metric: Euclidean, Manhattan, Minkowski; Subsequence length: 3, 4, 5, 6 |
| Support vector machine | Kernel: linear, polynomial, RBF, sigmoid; Regularization parameter (C): 50, 10, 1.0, 0.1, 0.01; Subsequence length: 3, 4, 5, 6 |
| Random forest | Number of features to consider when looking for the best split: sqrt(num features), log2(num features); Number of trees in the forest: 10, 100, 200, 500, 1000; Subsequence length: 3, 4, 5, 6 |
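To give a sense of the size of each search space, the grids in the table above can be enumerated as a Cartesian product. This is only a sketch: the dictionary keys are hypothetical names, not the parameter names of any particular library, and the source does not say whether every combination was actually evaluated.

```python
from itertools import product

# Values copied from the table; key names are hypothetical.
grids = {
    "Logistic regression": {
        "penalty": ["none", "l1", "l2", "elasticnet"],
        "reg_coefficient": [100, 10, 1, 0.1, 0.01],
        "subsequence_length": [3, 4, 5, 6],
    },
    "k-nearest neighbors": {
        "n_neighbors": [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21],
        "weights": ["uniform", "distance"],
        "metric": ["euclidean", "manhattan", "minkowski"],
        "subsequence_length": [3, 4, 5, 6],
    },
    "Support vector machine": {
        "kernel": ["linear", "polynomial", "rbf", "sigmoid"],
        "C": [50, 10, 1.0, 0.1, 0.01],
        "subsequence_length": [3, 4, 5, 6],
    },
    "Random forest": {
        "max_features": ["sqrt", "log2"],
        "n_trees": [10, 100, 200, 500, 1000],
        "subsequence_length": [3, 4, 5, 6],
    },
}


def enumerate_grid(grid):
    """Return one dict per combination of parameter values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


grid_sizes = {method: len(enumerate_grid(grid)) for method, grid in grids.items()}
for method, size in grid_sizes.items():
    print(f"{method}: {size} combinations")
```

Some combinations may be redundant in practice (for example, a regularization coefficient paired with no penalty), so these counts are upper bounds on the number of distinct models.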
Fig. 3 Performance comparison of the model trained on full-length sequences of TFs. a Micro-AUPR and accuracy comparison between models trained on all binding patterns with different subsequence lengths. b Micro-AUPR and accuracy comparison between models trained with subsequences of 4 residues for individual binding patterns
Fig. 4 Impact of the weighting parameter on the performance of the proposed method, measured by a micro-AUPR and b accuracy. The parameter controls the relative impact of the similarity between mean vectors and the similarity between covariance matrices. At one extreme of its range, only the similarity between covariance matrices is considered; at the other, only the similarity between mean vectors is considered
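The trade-off described in Fig. 4 can be sketched as a weighted combination of two similarities. The source does not give the similarity function, so the following is an assumption throughout: each feature matrix is summarized by its sample mean and sample covariance, an RBF kernel is applied to the means and to the flattened covariances, and a hypothetical `weight` in [0, 1] mixes the two (here `weight=1` uses the means only and `weight=0` the covariances only, an orientation chosen for illustration).

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows):
    """Unbiased sample covariance (divides by n - 1)."""
    n = len(rows)
    mu = mean_vector(rows)
    centered = [[x - m for x, m in zip(row, mu)] for row in rows]
    d = len(mu)
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def rbf(a, b, gamma=1.0):
    """RBF kernel between two flat vectors."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def flatten(matrix):
    return [value for row in matrix for value in row]


def gaussian_similarity(rows_a, rows_b, weight):
    """Assumed form: weight * k(means) + (1 - weight) * k(covariances)."""
    k_mean = rbf(mean_vector(rows_a), mean_vector(rows_b))
    k_cov = rbf(flatten(covariance_matrix(rows_a)),
                flatten(covariance_matrix(rows_b)))
    return weight * k_mean + (1.0 - weight) * k_cov


def kernel_matrix(samples, weight):
    """Pairwise similarities between all samples."""
    return [[gaussian_similarity(a, b, weight) for b in samples] for a in samples]


# Two toy feature matrices (3 subsequences x 2 embedding dimensions).
sample_a = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
sample_b = [[0.5, 0.5], [1.5, 0.0], [0.0, 2.0]]
K = kernel_matrix([sample_a, sample_b], weight=0.5)
print(K)
```

Because a convex combination of two valid kernels is itself a valid kernel, a matrix built this way could be passed to any classifier that accepts a precomputed kernel.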
Performance comparison of our proposed method (in bold) with other baselines for all binding mode data
| Method | m-AUPR | Accuracy |
|---|---|---|
| LR | 0.536 ± 0.013 | 0.129 ± 0.011 |
| kNN | 0.532 ± 0.007 | 0.132 ± 0.012 |
| SVM | 0.544 ± 0.018 | 0.155 ± 0.026 |
| RF | 0.558 ± 0.022 | 0.157 ± 0.016 |
Performance comparison of the proposed method (in bold) and various baselines for end binding mode data
| Method | m-AUPR | Accuracy | MCC |
|---|---|---|---|
| SVM | 0.739 | 0.764 | 0.322 |
| RF | 0.718 | 0.713 | 0.122 |
| LR | 0.738 | 0.726 | 0.269 |
| KNN | 0.716 | 0.709 | 0.075 |
Fig. 5 Profiles of nucleosome occupancy around a RFX5 ChIP clusters showing a dip at the centre (dip), b NFATC1 ChIP clusters showing a peak at the centre (peak), and c FOXK2 ChIP clusters showing no clear peak or dip at the centre (questionable)
Fig. 6 Categorization of nucleosome occupancy profiles for 24 TFs with known E-MI penetration (lig147), in terms of peak (red), dip (green), and questionable (orange) nucleosome occupancy profiles. TFs with an E-MI penetration (lig147) less than 20 are defined as having end preference (7)
Performance comparison of proposed method (in bold) and other baselines for nucleosome occupancy data
| Method | m-AUPR | Accuracy | MCC |
|---|---|---|---|
| SVM | 0.864 | 0.822 | -0.073 |
| RF | 0.860 | 0.851 | 0.140 |
| LR | 0.863 | 0.733 | 0.072 |
| KNN | 0.863 | 0.851 | 0.140 |
Comparison of confusion matrix for proposed method (in bold) and other baselines on nucleosome occupancy data
| Method | True positives | True negatives | False negatives | False positives |
|---|---|---|---|---|
| SVM | 76 | 2 | 10 | 13 |
| RF | 78 | 3 | 8 | 12 |
| LR | 68 | 4 | 18 | 11 |
| KNN | 86 | 0 | 0 | 15 |
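For reference, accuracy and the Matthews correlation coefficient (MCC) can be derived from confusion-matrix counts as in the sketch below. The counts are copied from the table above. The source does not state how the reported metrics were aggregated (for example, per fold or pooled), so values recomputed from these pooled counts should not be expected to reproduce the reported figures exactly. Returning 0 when the MCC denominator vanishes is a common convention, adopted here as an assumption.

```python
import math


def accuracy(tp, tn, fn, fp):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fn + fp)


def mcc(tp, tn, fn, fp):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Happens when a whole row or column of the matrix is empty,
        # e.g. a classifier that predicts the positive class for everything.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# (true positives, true negatives, false negatives, false positives)
counts = {
    "SVM": (76, 2, 10, 13),
    "RF": (78, 3, 8, 12),
    "LR": (68, 4, 18, 11),
    "KNN": (86, 0, 0, 15),
}

for method, c in counts.items():
    print(f"{method}: n={sum(c)}, accuracy={accuracy(*c):.3f}, MCC={mcc(*c):.3f}")
```

The KNN row illustrates the degenerate case: with no negative predictions at all, the denominator is zero and the convention above applies.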
Summary of EB mode prediction for transcription factors (TF) in different species
| Species | Predicted end preference: Yes | Predicted end preference: No | Total |
|---|---|---|---|
| Human | 1470 (88.34%) | 194 (11.66%) | 1664 |
| Mouse | 1414 (86.43%) | 222 (13.57%) | 1636 |
| Fruit fly | 625 (96.01%) | 26 (3.99%) | 651 |
| Nematode | 715 (95.59%) | 33 (4.41%) | 748 |
| Yeast | 293 (98.99%) | 3 (1.01%) | 296 |
Fig. 7 A model for the end binding mode of TFs to a nucleosome. Over 88% of TFs in five model species are predicted to bind nucleosomal DNA ends or free DNA, whereas less than 12% of the TFs are predicted not to have the end binding preference. The fraction of the TFs binding to nucleosomal DNA ends or free DNA is the highest in yeast (~99%) and the lowest in mammals (~88%)
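The pooled fraction behind the "over 88%" statement can be recomputed from the species table. The sketch below copies the yes/no counts from that table and sums them across the five species; the variable names are illustrative.

```python
# (predicted end preference: yes, no), copied from the species table.
predictions = {
    "Human": (1470, 194),
    "Mouse": (1414, 222),
    "Fruit fly": (625, 26),
    "Nematode": (715, 33),
    "Yeast": (293, 3),
}

total_yes = sum(yes for yes, _ in predictions.values())
total = sum(yes + no for yes, no in predictions.values())
pooled_fraction = total_yes / total

print(f"pooled: {total_yes}/{total} = {pooled_fraction:.2%}")
```

Pooled over all five species, the predicted end-preference fraction is about 90%, consistent with the caption's lower bound.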