Daesik Choi, Byungkyu Park, Hanju Chae, Wook Lee, Kyungsook Han.
Abstract
BACKGROUND: Motivated by the increased amount of data on protein-RNA interactions and the availability of complete genome sequences of several organisms, many computational methods have been proposed to predict binding sites in protein-RNA interactions. However, most computational methods are limited to finding RNA-binding sites in proteins instead of protein-binding sites in RNAs. Predicting protein-binding sites in RNA is more challenging than predicting RNA-binding sites in proteins. Recent computational methods for finding protein-binding sites in RNAs have several drawbacks for practical use.
Keywords: Prediction method; Protein-binding region; RNA-protein interaction
Year: 2017 PMID: 28361677 PMCID: PMC5374631 DOI: 10.1186/s12918-017-0386-4
Source DB: PubMed Journal: BMC Syst Biol ISSN: 1752-0509
Number of RNA sequences in training and test datasets
| P:N | 1:1 | 1:2 | 1:4 | 1:6 | 1:8 | 1:10 |
|---|---|---|---|---|---|---|
| Training dataset | 3,372:3,679 | 3,372:7,200 | 3,372:13,611 | 3,372:19,065 | 3,372:22,826 | 3,372:26,212 |
| Training subtotal | 7,051 | 10,572 | 16,983 | 22,437 | 26,198 | 29,584 |
| Test dataset | 1,000:1,000 | 1,000:2,000 | 1,000:3,998 | 1,000:5,998 | 1,000:7,998 | 1,000:9,998 |
| Test subtotal | 2,000 | 3,000 | 4,998 | 6,998 | 8,998 | 10,998 |
| Total | 9,051 | 13,572 | 21,981 | 29,435 | 35,196 | 40,582 |
Since similar sequences were removed separately in each 1:n dataset, the number of negative data (N) is not an exact multiple of the number of positive data (P)
Fig. 1 Construction of mono-nucleotide position weight matrix (mPWM). Both binding and non-binding sequences are used to generate an mPWM, in which each element (i,j) represents the log-odds score of the i-th nucleotide (i=A, C, G and U) in the j-th position (j=1,2,…, sequence length n). F in PWM+, PWM− and mPWM denotes the frequency of a nucleotide at a position
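The construction described in the Fig. 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the caption only states that each element is a log-odds score built from nucleotide frequencies in binding and non-binding sequences, so the base-2 logarithm, the pseudocount of 1, and the function names (`position_frequencies`, `build_mpwm`, `encode`) are all assumptions made for this example.

```python
import math

NUCLEOTIDES = "ACGU"

def position_frequencies(seqs, pseudocount=1.0):
    """Per-position nucleotide frequencies for equal-length sequences.

    The pseudocount is an assumption; it avoids log(0) for unseen nucleotides.
    """
    n = len(seqs[0])
    freqs = []
    for j in range(n):
        counts = {nt: pseudocount for nt in NUCLEOTIDES}
        for s in seqs:
            counts[s[j]] += 1
        total = sum(counts.values())
        freqs.append({nt: counts[nt] / total for nt in NUCLEOTIDES})
    return freqs

def build_mpwm(binding, non_binding, pseudocount=1.0):
    """Log-odds of binding vs. non-binding frequency at each position (assumed log2)."""
    f_pos = position_frequencies(binding, pseudocount)
    f_neg = position_frequencies(non_binding, pseudocount)
    return [
        {nt: math.log2(fp[nt] / fn[nt]) for nt in NUCLEOTIDES}
        for fp, fn in zip(f_pos, f_neg)
    ]

def encode(seq, mpwm):
    """Look up one score per position, giving n values for a sequence of length n."""
    return [mpwm[j][nt] for j, nt in enumerate(seq)]

# Toy sequences, invented purely for illustration.
binding = ["ACGU", "ACGA"]
non_binding = ["UGCA", "UGCU"]
mpwm = build_mpwm(binding, non_binding)
scores = encode("ACGU", mpwm)
```

A positive score means the nucleotide is more frequent at that position in binding sequences than in non-binding ones; a di-nucleotide matrix would follow the same pattern over the 16 adjacent pairs.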
Fig. 2 Structure of a feature vector. For a sequence of n nucleotides, mPWM and dPWM are represented by n and n−1 elements, respectively. Compositions represent the frequency of each mono-nucleotide (4 elements), di-nucleotide (16 elements) and tri-nucleotide (64 elements) in the RNA sequence. A protein sequence is represented by 63 elements (7 compositions, 21 transitions and 35 distributions)
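The composition part of the feature vector (4 + 16 + 64 = 84 elements) can be illustrated with a short sketch. The caption does not say how frequencies are normalized, so dividing by the number of overlapping windows is an assumption here, and the function names are hypothetical.

```python
from itertools import product

def kmer_composition(seq, k):
    """Frequency of every possible k-mer over ACGU, in lexicographic order.

    Counts overlapping windows and divides by the number of windows
    (the normalization is an assumption, not stated in the source).
    """
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(seq) - k + 1
    for i in range(windows):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing unexpected symbols
            counts[kmer] += 1
    return [counts[m] / windows if windows > 0 else 0.0 for m in kmers]

def composition_features(seq):
    """Concatenate mono-, di- and tri-nucleotide compositions (4 + 16 + 64)."""
    return (kmer_composition(seq, 1)
            + kmer_composition(seq, 2)
            + kmer_composition(seq, 3))

features = composition_features("ACGUACGUAC")
```

For the toy sequence above, `features` has 84 entries and its first four entries (the mono-nucleotide frequencies) sum to 1.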
Results of testing our model and DeepBind on RNA sequences of 25 nucleotides. catRAPID could not be tested on RNA sequences of 25 nucleotides since the minimum length of an RNA sequence required by catRAPID is 50 nucleotides
| RBP | #RBP-binding RNA regions | Sensitivity | Specificity | Accuracy | PPV | NPV | MCC |
|---|---|---|---|---|---|---|---|
| Our model | | | | | | | |
| FUS | 64 | 93.75% | 94.00% | 93.90% | 90.91% | 95.92% | 0.873 |
| FXR1 | 67 | 97.01% | 94.00% | 95.21% | 91.55% | 97.92% | 0.902 |
| FXR2 | 80 | 66.25% | 94.00% | 81.67% | 89.83% | 77.69% | 0.638 |
| IGF2BP2 | 79 | 74.68% | 94.00% | 85.47% | 90.77% | 82.46% | 0.709 |
| LIN28A | 82 | 85.37% | 94.00% | 90.11% | 92.11% | 88.68% | 0.801 |
| QKI | 77 | 84.42% | 94.00% | 89.83% | 91.55% | 88.68% | 0.793 |
| TARDBP | 94 | 12.77% | 94.00% | 54.64% | 66.67% | 53.41% | 0.117 |
| Weighted average | | | | | | | |
| DeepBind | | | | | | | |
| FUS | 64 | 32.81% | 42.00% | 38.41% | 26.58% | 49.41% | -0.246 |
| FXR1 | 67 | 11.94% | 44.00% | 31.14% | 12.50% | 42.72% | -0.444 |
| FXR2 | 80 | 15.00% | 55.00% | 37.22% | 21.05% | 44.72% | -0.320 |
| IGF2BP2 | 79 | 41.77% | 51.00% | 46.93% | 40.24% | 42.58% | -0.072 |
| LIN28A | 82 | 12.20% | 52.00% | 34.07% | 17.24% | 41.94% | -0.382 |
| QKI | 77 | 83.12% | 75.00% | 78.53% | 71.91% | 85.23% | 0.576 |
| TARDBP | 94 | 52.13% | 92.00% | 72.68% | 85.96% | 67.15% | 0.484 |
| Weighted average | | | | | | | |
The specificity of our method is the same for all RBPs because it used the same set of negative data for all RBPs with a single model, whereas DeepBind has a distinct model for each RBP
Results of testing our model, DeepBind and catRAPID on RNA sequences of 51 nucleotides
| RBP | #RBP-binding RNA regions | Sensitivity | Specificity | Accuracy | PPV | NPV | MCC |
|---|---|---|---|---|---|---|---|
| Our model | | | | | | | |
| FUS | 100 | 79.00% | 70.00% | 74.50% | 72.48% | 76.92% | 0.492 |
| FXR1 | 97 | 88.66% | 70.00% | 79.19% | 74.14% | 86.42% | 0.596 |
| FXR2 | 93 | 69.89% | 70.00% | 69.95% | 68.42% | 71.43% | 0.399 |
| IGF2BP2 | 94 | 55.32% | 70.00% | 62.89% | 63.41% | 62.50% | 0.256 |
| LIN28A | 96 | 58.33% | 70.00% | 64.29% | 65.12% | 63.64% | 0.285 |
| QKI | 100 | 78.00% | 70.00% | 74.00% | 72.22% | 76.09% | 0.482 |
| TARDBP | 100 | 22.00% | 70.00% | 46.00% | 42.31% | 47.30% | -0.091 |
| Weighted average | | | | | | | |
| DeepBind | | | | | | | |
| FUS | 100 | 32.00% | 33.00% | 32.50% | 32.32% | 32.67% | -0.350 |
| FXR1 | 97 | 32.99% | 42.00% | 37.56% | 35.56% | 39.25% | -0.251 |
| FXR2 | 93 | 43.01% | 73.00% | 58.55% | 59.70% | 57.94% | 0.168 |
| IGF2BP2 | 94 | 48.94% | 59.00% | 54.12% | 52.87% | 55.14% | 0.080 |
| LIN28A | 96 | 36.46% | 53.00% | 44.90% | 42.68% | 46.49% | -0.107 |
| QKI | 100 | 82.00% | 81.00% | 81.50% | 81.19% | 81.82% | 0.630 |
| TARDBP | 100 | 50.00% | 86.00% | 68.00% | 78.12% | 63.24% | 0.386 |
| Weighted average | | | | | | | |
| catRAPID | | DP value | | | | | |
| FUS | 10 | 16.40% | – | – | – | – | – |
| FXR1 | 10 | 17.60% | – | – | – | – | – |
| FXR2 | 10 | 22.30% | – | – | – | – | – |
| IGF2BP2 | 10 | 16.70% | – | – | – | – | – |
| LIN28A | 10 | 19.10% | – | – | – | – | – |
| QKI | 10 | 15.50% | – | – | – | – | – |
| TARDBP | 10 | 18.10% | – | – | – | – | – |
| Weighted average | | | – | – | – | – | – |
Sensitivity is shown for our model and DeepBind, and the discriminative power (DP) value is shown for catRAPID. The specificity of our method is the same for all RBPs because it used the same set of negative data for all RBPs with a single model, whereas DeepBind has a distinct model for each RBP. Due to the speed of the catRAPID server, catRAPID was tested on 10 RBP-binding sequences of 51 nucleotides for each RBP, whereas both our model and DeepBind were tested on all the RBP-binding sequences. Detailed results are available in Additional file 12
Comparison of different combinations of features in 10-fold cross validation
| Features | Sensitivity | Specificity | Accuracy | PPV | NPV | MCC |
|---|---|---|---|---|---|---|
| mPWM | 89.09% | 90.60% | 89.87% | 89.67% | 90.06% | 0.797 |
| dPWM | 90.48% | 92.06% | 91.31% | 91.27% | 91.34% | 0.826 |
| compositions | 71.44% | 88.23% | 80.20% | 84.76% | 77.12% | 0.608 |
| mPWM + dPWM | 91.46% | 91.98% | 91.73% | 91.27% | 92.16% | 0.834 |
| mPWM + compositions | 91.31% | 91.55% | 91.43% | 90.83% | 92.00% | 0.828 |
| dPWM + compositions | 91.07% | 92.53% | 91.83% | 91.78% | 91.88% | 0.836 |
| mPWM + dPWM + compositions | | | | | | |
Using all 3 features showed the best performance. mPWM: mono-nucleotide position weight matrix, dPWM: di-nucleotide position weight matrix, compositions: frequency of mono-nucleotides, di-nucleotides, and tri-nucleotides in the RNA sequence
Results of 10-fold cross validations of SVM and random forest on 6 datasets with different P:N ratios of positive to negative instances
| P:N | Sensitivity | Specificity | Accuracy | PPV | NPV | MCC |
|---|---|---|---|---|---|---|
| SVM | | | | | | |
| 1:1 | | 92.39% | 92.02% | 91.69% | 92.31% | 0.840 |
| 1:2 | | 92.17% | 91.91% | 84.53% | 95.80% | 0.819 |
| 1:4 | | 92.33% | 92.09% | 74.64% | 97.68% | 0.777 |
| 1:6 | | 91.95% | 91.84% | 66.71% | 98.34% | 0.736 |
| 1:8 | | 91.92% | 91.83% | 62.52% | 98.61% | 0.713 |
| 1:10 | | 91.54% | 91.50% | 58.11% | 98.78% | 0.686 |
| Random forest | | | | | | |
| 1:1 | | 92.06% | 91.62% | 91.32% | 91.89% | 0.832 |
| 1:2 | | 95.21% | 92.09% | 89.31% | 93.32% | 0.816 |
| 1:4 | | 97.18% | 93.85% | 87.59% | 95.24% | 0.802 |
| 1:6 | | 97.77% | 94.78% | 86.01% | 96.16% | 0.788 |
| 1:8 | | 98.01% | 95.18% | 84.95% | 96.51% | 0.777 |
| 1:10 | | 98.14% | 95.53% | 83.90% | 96.86% | 0.770 |
PPV positive prediction value, NPV negative prediction value, MCC Matthews correlation coefficient
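One way to read the falling PPV in the SVM rows is that, with roughly constant sensitivity and specificity, adding more negatives adds more false positives while the number of true positives stays fixed. The sketch below shows that relationship. The sensitivity and specificity values (0.90 and 0.92) and the 1,000-positive class size are illustrative placeholders, not figures from the table, and `ppv` and `ppv_by_ratio` are hypothetical helper names.

```python
def ppv(sensitivity, specificity, positives, negatives):
    """Positive predictive value implied by a classifier's rates and class sizes."""
    tp = sensitivity * positives          # expected true positives
    fp = (1.0 - specificity) * negatives  # expected false positives
    return tp / (tp + fp)

def ppv_by_ratio(sensitivity, specificity, ratios, positives=1000):
    """PPV for each 1:n positive-to-negative ratio, holding the rates fixed."""
    return {n: ppv(sensitivity, specificity, positives, positives * n)
            for n in ratios}

# Illustrative rates only; not taken from the table above.
trend = ppv_by_ratio(0.90, 0.92, ratios=(1, 2, 4, 6, 8, 10))
for n, value in trend.items():
    print(f"1:{n}  PPV = {value:.3f}")
```

Under these assumed rates the PPV decreases monotonically as the ratio moves from 1:1 to 1:10, which is the same direction as the SVM column in the table.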
Results of LOPO cross validation of our method with respect to 14 RBPs
| RBP | TP | TN | FP | FN | Sensitivity | Specificity | Accuracy | PPV | NPV | MCC |
|---|---|---|---|---|---|---|---|---|---|---|
| AGO1 | 37 | 50 | 3 | 18 | 67.27% | 94.34% | 80.56% | 92.50% | 73.53% | 0.638 |
| AGO2 | 39 | 49 | 2 | 18 | 68.42% | 96.08% | 81.48% | 95.12% | 73.13% | 0.664 |
| EWSR1 | 200 | 198 | 14 | 14 | 93.46% | 93.40% | 93.43% | 93.46% | 93.40% | 0.869 |
| FUS | 468 | 534 | 46 | 19 | 96.10% | 92.07% | 93.91% | 91.05% | 96.56% | 0.879 |
| FXR1 | 3 | 7 | 0 | 1 | 75.00% | 100.00% | 90.91% | 100.00% | 87.50% | 0.810 |
| FXR2 | 25 | 33 | 1 | 11 | 69.44% | 97.06% | 82.86% | 96.15% | 75.00% | 0.688 |
| IGF2BP2 | 57 | 55 | 7 | 15 | 79.17% | 88.71% | 83.58% | 89.06% | 78.57% | 0.678 |
| LIN28A | 221 | 263 | 25 | 57 | 79.50% | 91.32% | 85.51% | 89.84% | 82.19% | 0.714 |
| LIN28B | 2,214 | 2,343 | 329 | 227 | 90.70% | 87.69% | 89.13% | 87.06% | 91.17% | 0.783 |
| QKI | 3 | 5 | 0 | 1 | 75.00% | 100.00% | 88.89% | 100.00% | 83.33% | 0.791 |
| TAF15 | 11 | 16 | 1 | 2 | 84.62% | 94.12% | 90.00% | 91.67% | 88.89% | 0.796 |
| TARDBP | 39 | 159 | 14 | 149 | 20.74% | 91.91% | 54.85% | 73.58% | 51.62% | 0.179 |
| YTHDF2 | 35 | 39 | 5 | 6 | 85.37% | 88.64% | 87.06% | 87.50% | 86.67% | 0.741 |
| ZC3H7B | 388 | 438 | 43 | 94 | 80.50% | 91.06% | 85.77% | 90.02% | 82.33% | 0.720 |
| Total | 3,740 | 4,189 | 490 | 632 | | | | | | |
| Weighted average | | | | | | | | | | |
The weighted average was computed from the total values of TP, TN, FP and FN of all runs. TP true positive, TN true negative, FP false positive, FN false negative, PPV positive prediction value, NPV negative prediction value, MCC Matthews correlation coefficient
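The six measures in this table follow from the four confusion-matrix counts by their standard definitions, which the sketch below applies. The function name `confusion_metrics` is hypothetical, and returning 0 for MCC when the denominator is zero is a convention chosen for this example rather than something stated in the source.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Standard binary-classification measures from confusion-matrix counts."""
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        # Convention assumed here: MCC is 0 when any marginal total is 0.
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }

# AGO1 row of the table: TP = 37, TN = 50, FP = 3, FN = 18.
ago1 = confusion_metrics(37, 50, 3, 18)

# Pooled counts from the Total row, as the footnote describes.
pooled = confusion_metrics(3740, 4189, 490, 632)
```

Applied to the AGO1 counts, this reproduces the row's reported values (67.27%, 94.34%, 80.56%, 92.50%, 73.53% and 0.638). Passing the Total row gives pooled measures in the way the footnote describes.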
Results of independent testing of our method on 6 datasets with different P:N ratios of positive to negative instances
| P:N | Sensitivity | Specificity | Accuracy | PPV | NPV | MCC |
|---|---|---|---|---|---|---|
| 1:1 | | | | | | |
| 1:2 | 72.40% | 91.80% | 85.33% | | 86.93% | |
| 1:4 | 74.10% | 91.10% | 87.70% | | 93.36% | |
| 1:6 | 77.00% | 90.26% | 88.37% | | 95.92% | |
| 1:8 | 77.80% | 89.68% | 88.36% | | 97.00% | |
| 1:10 | 79.10% | 89.70% | 88.73% | | 97.72% | |
PPV positive prediction value, NPV negative prediction value, MCC Matthews correlation coefficient
Fig. 3 ROC curves of 10-fold cross validation and independent testing of the RBF-SVM and the linear SVM. In both 10-fold cross validation and independent testing, the SVM model with the RBF kernel yielded a slightly larger area under the ROC curve (AUC) than the SVM model with the linear kernel