Lei Deng, Wenyi Yang, Hui Liu.
Abstract
Protein-RNA interactions play essential roles in many biological processes. Quantifying the binding affinity of protein-RNA complexes helps in understanding protein-RNA recognition mechanisms and in identifying strong binding partners. Because experimentally measured protein-RNA binding affinity data remain limited, there is a pressing demand for accurate and reliable computational approaches. In this paper, we propose a computational approach, PredPRBA, which can effectively predict protein-RNA binding affinity using gradient boosted regression trees. We build a dataset of protein-RNA binding affinity that includes 103 protein-RNA complex structures manually collected from the related literature. Then, we generate 37 kinds of sequence and structural features and explore the relationship between the features and protein-RNA binding affinity. We find that the binding affinity mainly depends on the structure of RNA molecules. According to the type of RNA bound to the protein in each complex, we split the 103 protein-RNA complexes into six categories. For each category, we build a gradient boosted regression tree (GBRT) model based on the generated features. We perform a comprehensive evaluation of the proposed method on the binding affinity dataset using leave-one-out cross-validation. We show that PredPRBA achieves correlations ranging from 0.723 to 0.897 among the six categories, which is significantly better than other typical regression methods and the pioneering protein-RNA binding affinity predictor SPOT-Seq-RNA. In addition, a user-friendly web server has been developed to predict the binding affinity of protein-RNA complexes. The PredPRBA webserver is freely available at http://PredPRBA.denglab.org/.
Keywords: binding affinity; computational approaches; gradient boosted regression tree; protein-RNA interactions; sequence and structural features
Year: 2019 PMID: 31428122 PMCID: PMC6688581 DOI: 10.3389/fgene.2019.00637
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. The flowchart of the PredPRBA method for predicting the binding affinity of protein-RNA complexes. It involves four steps: (A) Collection of complexes with experimentally measured binding affinities from publications. (B) Classification of complexes according to the type of RNA interacting with the protein. (C) Extraction of sequence and structural features from proteins and RNAs for building a predictive model. (D) Training of gradient boosted regression tree models.
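Steps (C) and (D), together with the leave-one-out evaluation mentioned in the abstract, can be illustrated with a minimal sketch. It is not the authors' implementation: it boosts depth-1 regression trees (stumps) under squared loss using only the Python standard library, and the names `fit_stump`, `fit_gbrt`, `predict_gbrt`, `loocv_predictions`, the hyperparameters, and the toy data are all assumptions made for illustration. In practice a library implementation such as scikit-learn's `GradientBoostingRegressor` would normally be used.

```python
from statistics import mean


def fit_stump(X, residuals):
    """Best single split as (feature, threshold, left_value, right_value), or None."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for t in values[:-1]:  # split "value <= t" keeps both sides non-empty
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lv, rv = mean(left), mean(right)
            sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lv, rv)
    return best[1:] if best else None


def fit_gbrt(X, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = mean(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        if stump is None:  # every feature is constant, nothing to split on
            break
        j, t, lv, rv = stump
        stumps.append(stump)
        pred = [p + learning_rate * (lv if row[j] <= t else rv)
                for p, row in zip(pred, X)]
    return base, learning_rate, stumps


def predict_gbrt(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(lv if row[j] <= t else rv
                                      for j, t, lv, rv in stumps)


def loocv_predictions(X, y, **kwargs):
    """Leave-one-out: train on all complexes but one, predict the held-out one."""
    preds = []
    for i in range(len(y)):
        model = fit_gbrt(X[:i] + X[i + 1:], y[:i] + y[i + 1:], **kwargs)
        preds.append(predict_gbrt(model, X[i]))
    return preds


# Hypothetical toy data: 12 "complexes", 2 features, affinity as a ΔG-like value.
X_demo = [[float(i), float(i % 3)] for i in range(12)]
y_demo = [-(8.0 + 0.5 * i) for i in range(12)]
demo_preds = loocv_predictions(X_demo, y_demo, n_rounds=50, learning_rate=0.1)
```

Under this scheme one such model would be trained per RNA class, and the held-out predictions in `demo_preds` are what the correlation and error measures would be computed on.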
Features selected for predicting the binding affinity of each class of protein-RNA complexes.
| Feature | Class I | Class II | Class III | Class IV | Class V | Class VI |
|---|---|---|---|---|---|---|
| molecular weight of RNA | √ | |||||
| total value of the relative solvent accessible surface area | √ | √ | ||||
| number of hydrophilic residues in the protein | √ | √ | ||||
| number of hydrophobic residues in the protein | √ | |||||
| % of hydrophilic residues in the protein | √ | |||||
| % of hydrophobic residues in the protein | √ | √ | √ | √ | ||
| % of the aromatic and positively charged residues in the protein | √ | |||||
| number of the aromatic and positively charged residues in the protein | √ | |||||
| number of the charged residues in protein | √ | √ | ||||
| number of the polar residues in protein | √ | √ | ||||
| molecular weight of | √ | √ | ||||
| molecular weight of | √ | |||||
| number of cWW | √ | |||||
| relative frequency of cWW | √ | √ | √ | |||
| frequency of the MFE structure | √ |
Performance of models built on the single best feature and on the best pair of features for the six classes of protein-RNA complexes.
| Class | Number of complexes | Maximum correlation coefficient (r), single property | Maximum correlation coefficient (r), two properties |
|---|---|---|---|
| Class I | 21 | 0.565 | 0.725 |
| Class II | 34 | 0.452 | 0.546 |
| Class III | 8 | 0.567 | 0.669 |
| Class IV | 9 | 0.616 | 0.663 |
| Class V | 11 | 0.422 | 0.521 |
| Class VI | 20 | 0.511 | 0.615 |
| All | 103 | 0.178 | 0.332 |
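A search of this kind can be sketched as follows. The source does not state how the one- and two-property models were fitted, so this sketch assumes ordinary least-squares linear models: the single-property score is the absolute Pearson correlation, and the two-property score is the multiple correlation coefficient of a two-predictor linear fit. The function names and the toy feature table are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def multiple_r(x1, x2, y):
    """Multiple correlation of the least-squares fit y ~ a + b*x1 + c*x2."""
    r1, r2, r12 = pearson(x1, y), pearson(x2, y), pearson(x1, x2)
    denom = 1.0 - r12 ** 2
    if denom < 1e-12:  # collinear predictors add no information
        return max(abs(r1), abs(r2))
    r_sq = (r1 ** 2 + r2 ** 2 - 2.0 * r1 * r2 * r12) / denom
    return sqrt(min(max(r_sq, 0.0), 1.0))


def best_single(features, y):
    """(|r|, name) of the feature most correlated with the affinity."""
    return max((abs(pearson(col, y)), name) for name, col in features.items())


def best_pair(features, y):
    """(R, (name_a, name_b)) of the feature pair with the highest multiple R."""
    return max((multiple_r(features[a], features[b], y), (a, b))
               for a, b in combinations(features, 2))


# Hypothetical toy table: y is exactly f1 + f2, so that pair should win.
features_demo = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.0, 1.0, 4.0, 3.0, 6.0],
    "f3": [5.0, 3.0, 1.0, 4.0, 2.0],
}
y_demo = [a + b for a, b in zip(features_demo["f1"], features_demo["f2"])]
single_r, single_name = best_single(features_demo, y_demo)
pair_r, pair_names = best_pair(features_demo, y_demo)
```

Because the pair score can never fall below the better of its two single-feature scores, the two-property column is expected to be at least as high as the single-property column, as it is in the table above.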
Performance measures of PredPRBA under leave-one-out cross-validation.
| Class | Correlation coefficient (r) | Mean absolute error (MAE) | Coefficient of determination (R²) |
|---|---|---|---|
| Class I | 0.818 | 1.215 | 0.623 |
| Class II | 0.731 | 1.145 | 0.518 |
| Class III | 0.894 | 1.270 | 0.288 |
| Class IV | 0.803 | 0.749 | 0.489 |
| Class V | 0.768 | 1.425 | 0.255 |
| Class VI | 0.762 | 0.879 | 0.531 |
| Average value | 0.796 | 1.114 | 0.451 |
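The three measures in this table can be computed from paired experimental and predicted values as in the sketch below. The formulas are the standard definitions; that the source uses exactly these (in particular R² = 1 − SS_res/SS_tot rather than the square of r) is an assumption, and the function names and sample numbers are hypothetical.

```python
from math import sqrt


def pearson_r(y_true, y_pred):
    """Pearson correlation between experimental and predicted values."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    vt = sum((t - mt) ** 2 for t in y_true)
    vp = sum((p - mp) ** 2 for p in y_pred)
    return cov / sqrt(vt * vp)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mt = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mt) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical values: predictions are perfectly correlated but offset by 0.5.
y_true_demo = [1.0, 2.0, 3.0, 4.0]
y_pred_demo = [1.5, 2.5, 3.5, 4.5]
r_demo = pearson_r(y_true_demo, y_pred_demo)
mae_demo = mean_absolute_error(y_true_demo, y_pred_demo)
r2_demo = r_squared(y_true_demo, y_pred_demo)
```

In this toy case r is 1 while R² is 0.8, because R² also penalizes systematic offsets and scale errors. Under the definition assumed here, that would be consistent with rows such as Class III, where r is 0.894 but R² is only 0.288.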
Figure 2. Scatterplot of experimental versus predicted binding affinities for the six classes of protein-RNA complexes.
Figure 3. The predicted and actual binding affinities, represented by ΔG, of each protein-RNA complex in the six classes of complexes.
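For readers unfamiliar with ΔG as an affinity measure, the sketch below converts a dissociation constant into a binding free energy using the standard relation ΔG = RT ln(Kd). The source does not say how its ΔG values were obtained or at what temperature, so the default of 298.15 K, the kcal/mol unit, and the function name `delta_g_from_kd` are assumptions for illustration only.

```python
from math import log

GAS_CONSTANT_KCAL = 0.0019872  # kcal / (mol * K)


def delta_g_from_kd(kd_molar, temperature_k=298.15):
    """Binding free energy in kcal/mol from a dissociation constant in mol/L."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return GAS_CONSTANT_KCAL * temperature_k * log(kd_molar)


# A 1 nM binder gives roughly -12.3 kcal/mol at 298.15 K.
dg_nanomolar = delta_g_from_kd(1e-9)
# A 1 µM binder is weaker, so its ΔG is less negative.
dg_micromolar = delta_g_from_kd(1e-6)
```

More negative values indicate tighter binding, and each tenfold change in Kd shifts ΔG by about 1.36 kcal/mol at this temperature.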
Performance comparison of PredPRBA to protein-based and RNA-based prediction models.
| Class | Protein-based model | RNA-based model | PredPRBA |
|---|---|---|---|
| Class I | 0.562 | 0.818 | 0.818 |
| Class II | 0.652 | 0.436 | 0.731 |
| Class III | 0.894 | 0.634 | 0.894 |
| Class IV | 0.642 | 0.621 | 0.803 |
| Class V | 0.768 | 0.547 | 0.768 |
| Class VI | 0.762 | 0.635 | 0.762 |
| Average | 0.71 | 0.62 | 0.80 |
Performance comparison of PredPRBA to sequence feature-based and structure feature-based models.
| Class | Sequence-based model | Structure-based model | PredPRBA |
|---|---|---|---|
| Class I | 0.661 | 0.711 | 0.818 |
| Class II | 0.618 | 0.635 | 0.731 |
| Class III | 0.883 | 0.765 | 0.894 |
| Class IV | 0.696 | 0.735 | 0.803 |
| Class V | 0.661 | 0.697 | 0.768 |
| Class VI | 0.736 | 0.665 | 0.762 |
| Average | 0.71 | 0.70 | 0.80 |
Comparison of correlation coefficients between PredPRBA and other regression algorithms.
| Class | SVR | DTR | LR | KNNR | ERRT | RFR | PredPRBA |
|---|---|---|---|---|---|---|---|
| Class I | 0.541 | 0.356 | 0.604 | 0.411 | 0.760 | 0.641 | 0.818 |
| Class II | 0.356 | 0.621 | 0.456 | 0.476 | 0.685 | 0.695 | 0.731 |
| Class III | 0.708 | 0.449 | 0.634 | 0.628 | 0.458 | 0.535 | 0.894 |
| Class IV | 0.389 | 0.669 | 0.696 | 0.602 | 0.588 | 0.724 | 0.803 |
| Class V | 0.366 | 0.395 | 0.432 | 0.492 | 0.215 | 0.343 | 0.768 |
| Class VI | 0.157 | 0.377 | 0.374 | 0.636 | 0.519 | 0.400 | 0.762 |
| Average | 0.42 | 0.52 | 0.53 | 0.54 | 0.54 | 0.56 | 0.80 |
Figure 4. Comparison of mean correlation coefficients over the six classes of protein-RNA complexes between PredPRBA and typical regression methods.
Comparison of correlation coefficients between the SPOT-Seq-RNA method and PredPRBA.
| Class | Number of complexes | Correlation coefficient (r), SPOT-Seq-RNA | Correlation coefficient (r), PredPRBA |
|---|---|---|---|
| Class I | 21 | 0.442 | 0.818 |
| Class II | 34 | -0.044 | 0.731 |
| Class III | 8 | -0.038 | 0.894 |
| Class IV | 9 | 0.172 | 0.803 |
| Class V | 11 | 0.756 | 0.768 |
| Class VI | 20 | 0.386 | 0.762 |
| Average | 17 | 0.276 | 0.796 |