Chiou-Yi Hor, Chang-Biau Yang, Chia-Hung Chang, Chiou-Ting Tseng, Hung-Hsin Chen.
Abstract
The prediction of RNA secondary structures has drawn much attention from both biologists and computer scientists. Many useful tools have been developed for this purpose, each with its own strengths and weaknesses. We therefore propose a tool choice method, based on support vector machines (SVM), that integrates three prediction tools: pknotsRG, RNAStructure, and NUPACK. Our method first extracts features from the target RNA sequence and ranks them with two information-theoretic feature selection methods. We also propose a method that combines feature selection and classifier fusion in an incremental manner. Our test data set contains 720 RNA sequences: 225 pseudoknotted RNA sequences obtained from PseudoBase and 495 nested RNA sequences obtained from RNA SSTRAND. The method serves as a preprocessing step for analyzing RNA sequences before the RNA secondary structure prediction tools are employed. In addition, the performance of the various configurations is subjected to statistical tests to examine its significance. The best base-pair accuracy achieved is 75.5%, obtained by the proposed incremental method, which is significantly higher than the 68.8% of the best single predictor, pknotsRG.
Keywords: RNA; feature selection; secondary structure; statistical test; support vector machine
Year: 2013 PMID: 23641141 PMCID: PMC3629938 DOI: 10.4137/EBO.S10580
Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625
Figure 1. The nested (top) and pseudoknotted (bottom) bonded RNA structures.
An example of the BKS table
| Prediction | True class | |||
|---|---|---|---|---|
| 3 | 3 | |||
| 3 | 0 | |||
| 4 | 0 | |||
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 2 | 2 | |||
| 0 | 1 | |||
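A BKS (behavior knowledge space) table of this kind can be read as a lookup from each combination of classifier outputs to the counts of true classes observed for that combination. The following is a minimal Python sketch of that idea, not the authors' implementation; the function names, the integer class coding, the fallback rule for unseen combinations, and the toy training data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_bks(predictions, true_labels):
    """Build a BKS table: one cell per tuple of classifier outputs,
    holding counts of the true classes seen with that tuple."""
    table = defaultdict(Counter)
    for combo, label in zip(predictions, true_labels):
        table[tuple(combo)][label] += 1
    return table

def predict_bks(table, combo, fallback):
    """Return the most frequent true class for this combination of outputs.
    Combinations never seen in training use a caller-supplied fallback."""
    counts = table.get(tuple(combo))
    if not counts:
        return fallback
    return counts.most_common(1)[0][0]

# Hypothetical training data: outputs of three classifiers per sequence.
# Assumed class coding: 0 = pknotsRG, 1 = RNAStructure, 2 = NUPACK.
train_preds = [(0, 0, 1), (0, 0, 1), (0, 0, 1), (2, 1, 1), (2, 1, 1), (0, 2, 2)]
train_true = [0, 0, 1, 1, 1, 2]

bks = train_bks(train_preds, train_true)
choice_seen = predict_bks(bks, (2, 1, 1), fallback=0)    # majority class in that cell
choice_unseen = predict_bks(bks, (1, 1, 1), fallback=0)  # empty cell, uses fallback
```

Because the table is indexed by the full tuple of outputs, it grows with the number of classifiers being fused, so some cells will be sparsely populated or empty on a data set of a few hundred sequences; the fallback argument is one simple way to handle that.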
The feature sets and the 50 top-ranking features by mRMR and mRR
| ID | Feature set | Dimension | Top ranking feature names, mutual information and standard errors |
|---|---|---|---|
| 1 | The compositional factor | 6 | – |
| 2 | The bi-transitional factor | 18 | |
| 3 | The distributional factor | 20 | |
| 4 | The tri-transitional factor | 66 | |
| 5 | The spaced bi-gram factor | 18 | – |
| 6 | The potential base-pairing factor | 3 | |
| 7 | The asymmetry of direct-complementary triplets | 3 | |
| 8 | The nucleotide proportional factor | 12 | – |
| 9 | The potential single-stranded factor | 3 | – |
| 10 | The sequence specific score | 1 | The sequence specific score: 0.0089 ± 0.0049 |
| 11 | The segmental factor | 40 | Normalized |
| 12 | The sequence moment | 15 | |
| 13 | The spectral properties | 20 | |
| 14 | The wavelet features | 20 | |
| 15 | The 2D-dynamic representation | 19 | |
| 16 | The protein features | 375 | |
| 17 | The co-occurrence factor | 10 | – |
| 18 | The 2D graphical representation | 36 | |
| 19 | The dinucleotides factor | 32 | |
| 20 | The wavelet encoding for 2D graphical representation | 24 | |
| 21 | The sequence length | 1 | – |
| Total | | 742 | 50 |
| Output |
| Calculate the conditional mutual information, |
| Use |
| Eliminate every cluster |
| Select the feature with the highest mutual information from each remaining cluster. Remove the least relevant feature from all selected features. |
| Let the selected features be |
| Output |
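As a rough illustration of the kind of information-theoretic ranking involved, the sketch below implements a commonly used greedy form of mRMR (minimum redundancy, maximum relevance) on discrete features. It is not the authors' procedure: the feature names, the toy data, the assumption that features are already discretized, and the choice of the difference-form score (relevance minus mean redundancy) are all illustrative.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

def mrmr(features, labels, k):
    """Greedy mRMR: repeatedly add the feature with the largest
    relevance I(f; class) minus mean redundancy with the selected set."""
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance[name] - redundancy
        best = max((f for f in features if f not in selected), key=score)
        selected.append(best)
    return selected

# Hypothetical discretized features for nine sequences in three classes.
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
features = {
    "f_a":      [1, 1, 1, 0, 0, 0, 0, 0, 0],  # separates class 0 from the rest
    "f_a_copy": [1, 1, 1, 0, 0, 0, 0, 0, 0],  # exact duplicate of f_a
    "f_b":      [0, 0, 0, 0, 0, 1, 1, 1, 1],  # noisy indicator of class 2
}
top_two = mrmr(features, labels, 2)
```

In this toy case the duplicate feature is as relevant as `f_a` but fully redundant with it, so the second pick is `f_b` even though its individual relevance is lower.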
Number of sequences and predicted base-pair accuracies in each tool preference class
| | pk (pknotsRG) | rn (RNAStructure) | nu (NUPACK) | |
|---|---|---|---|---|
| Sequences | 359 | 212 | 149 | Total = 720 |
| BP accuracy | 68.80% | 64.55% | 60.92% | Extreme = 79.20% |
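One way to read this table is that each sequence is assigned to the tool that predicts it best, and that the "Extreme" figure is what would be obtained if the best tool were always chosen. Under that reading (an assumption, since the definitions are not given here), the labelling and the two reference accuracies can be sketched as follows. The per-sequence accuracies are invented for illustration, and a plain per-sequence mean is assumed as the averaging rule.

```python
TOOLS = ("pk", "rn", "nu")  # pknotsRG, RNAStructure, NUPACK

def preference_class(acc):
    """Label a sequence with the tool giving its highest base-pair accuracy."""
    return max(TOOLS, key=lambda tool: acc[tool])

def mean_accuracy(per_sequence, chooser):
    """Mean base-pair accuracy when `chooser` picks one tool per sequence."""
    return sum(acc[chooser(acc)] for acc in per_sequence) / len(per_sequence)

# Hypothetical base-pair accuracies of the three tools on four sequences.
per_sequence = [
    {"pk": 0.80, "rn": 0.60, "nu": 0.50},
    {"pk": 0.40, "rn": 0.90, "nu": 0.70},
    {"pk": 0.70, "rn": 0.50, "nu": 0.75},
    {"pk": 0.90, "rn": 0.85, "nu": 0.60},
]

labels = [preference_class(acc) for acc in per_sequence]
always_pk = mean_accuracy(per_sequence, lambda acc: "pk")   # single best tool
upper_bound = mean_accuracy(per_sequence, preference_class)  # perfect tool choice
```

A trained classifier that predicts the preference class from sequence features would land somewhere between `always_pk` and `upper_bound`, which is the gap the tool choice method tries to close.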
Classification accuracies of various fusion configurations
| | 1 | 1 + 2 | 1 + 2 + 3 | 1 + 2 + 3 + 4 |
|---|---|---|---|---|
| a. mRMR (BKS) | 68.3 | 69.7 (+1.4) | 71.0 (+1.3) | 72.4 (+1.4) |
| b. ImRMR (BKS) | 68.3 | 71.8 (+3.5) | 73.4 (+1.6) | 74.0 (+0.6) |
| c. mRR (BKS) | 68.1 | 69.2 (+1.1) | 70.3 (+1.1) | 70.0 (−0.3) |
| d. ImRR (BKS) | 68.1 | 71.1 (+3.0) | 73.1 (+2.0) | 74.4 (+1.3) |
| e. mRMR (WMJ) | 68.3 | 68.8 (+0.5) | 68.2 (−0.6) | 67.8 (−0.4) |
| f. ImRMR (WMJ) | 68.3 | 69.3 (+1.0) | 69.6 (+0.3) | 70.2 (+0.6) |
| g. mRR (WMJ) | 68.1 | 68.3 (+0.2) | 67.2 (−1.1) | 66.9 (−0.3) |
| h. ImRR (WMJ) | 68.1 | 68.5 (+0.4) | 69.2 (+0.7) | 69.6 (+0.4) |
| 200 features from b. | 69.2 | 71.3 (+2.1) | 72.8 (+1.5) | 72.8 (+0.0) |
The classification accuracies of combined features
| Feature configuration (number of features) | Classification accuracy (%) |
|---|---|
| a (200) | 68.1 |
| b (200) | 69.2 |
| c (200) | 68.3 |
| d (200) | 68.9 |
| 742 | 66.3 |
| Generate data set E from D by bootstrap sampling. |
| Partition E into |
| Perform k-fold cross-validation with configuration |
| Calculate average base-pair accuracy |
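The resampling steps that survive in this procedure (bootstrap a data set E from D, partition E, run k-fold cross-validation, average the accuracy) can be sketched as below. This is a minimal illustration under stated assumptions: k = 10, the seed, the placeholder integer "sequences", and the `evaluate` callback standing in for training and scoring one configuration are all hypothetical and not taken from the source.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def k_fold_partition(data, k, rng):
    """Shuffle a copy of data and deal it into k folds of near-equal size."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def cross_validate(folds, evaluate):
    """Average evaluate(train, test) over the folds, each used once as test."""
    scores = []
    for i, test_fold in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(evaluate(train, test_fold))
    return sum(scores) / len(scores)

rng = random.Random(0)          # fixed seed so the sketch is repeatable
D = list(range(720))            # stand-ins for the 720 RNA sequences
E = bootstrap_sample(D, rng)
folds = k_fold_partition(E, 10, rng)

# Placeholder evaluation: fraction of items held out as the test fold.
# A real run would train a configuration on `train` and return its
# base-pair accuracy on `test`.
held_out_fraction = cross_validate(
    folds, lambda train, test: len(test) / (len(train) + len(test)))
```

Repeating the bootstrap step many times yields one average accuracy per configuration per replicate, which is the kind of paired sample that the statistical tests on configurations need.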
Paired t-test of base-pair accuracies: incremental versus non-incremental configurations under BKS or WMJ fusion
| Configurations | BKS | WMJ |
|---|---|---|
| ImRMR vs. mRMR | 0.11 | 0.09 |
| ImRR vs. mRR | 0.08 | 0.09 |
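For reference, the statistic behind a paired t-test is the mean of the per-pair differences divided by its standard error. The sketch below computes only the t statistic and the degrees of freedom with the Python standard library; the accuracy values are invented, and turning the statistic into a p-value needs the Student t distribution (for instance via `scipy.stats.ttest_rel`, which is not used here).

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)          # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Hypothetical base-pair accuracies (%) of two configurations measured on
# the same five resampled data sets.
incremental = [75.1, 75.6, 75.4, 75.9, 75.3]
non_incremental = [74.8, 75.5, 75.0, 75.2, 75.2]

t_stat, dof = paired_t_statistic(incremental, non_incremental)
```

Pairing matters because both configurations are scored on the same resampled data, so the test works on the differences rather than on the two sets of accuracies separately.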
The classification and base-pair prediction accuracies of various configurations
| Configuration | Features (#) | Classification accuracy (%) | Base-pair accuracy (%) |
|---|---|---|---|
| pknotsRG | – | – | 68.8 |
| All features | 742 | 66.3 | 72.2 (+3.4) |
| mRMR | 50 | 68.3 | 72.9 (+4.1) |
| mRR | 50 | 68.1 | 72.5 (+3.7) |
| Adaboost | 200 | 72.8 | 73.8 (+5.0) |
| ImRMR + WMJ | 50 × 4 | 70.2 | 73.0 (+4.2) |
| ImRR + WMJ | 50 × 4 | 69.6 | 73.2 (+4.4) |
| ImRMR + BKS | 50 × 4 | 74.0 | 75.5 (+6.7) |
| ImRR + BKS | 50 × 4 | 74.4 | 75.2 (+6.4) |
TukeyHSD test for base-pair accuracies
| | pknotsRG | All features | mRMR | mRR | ImRMR + WMJ | ImRR + WMJ | Adaboost | ImRMR + BKS |
|---|---|---|---|---|---|---|---|---|
| All features | 0.000 (++) | |||||||
| mRMR | 0.000 (++) | 1.000 | ||||||
| mRR | 0.000 (++) | 1.000 | 1.000 | |||||
| ImRMR + WMJ | 0.000 (++) | 0.000 (++) | 0.000 (++) | 0.000 (++) | ||||
| ImRR + WMJ | 0.000 (++) | 0.000 (++) | 0.000 (++) | 0.000 (++) | 1.000 | |||
| Adaboost | 0.000 (++) | 0.000 (++) | 0.000 (++) | 0.000 (++) | 0.976 | 0.999 | ||
| ImRMR + BKS | 0.000 (++) | 0.000 (++) | 0.000 (++) | 0.000 (++) | 0.140 | 0.353 | 0.776 | |
| ImRR + BKS | 0.000 (++) | 0.000 (++) | 0.000 (++) | 0.000 (++) | 0.055 (+) | 0.175 | 0.545 | 1.000 |
The upper triangles of the M/M and L/L matrices of the sequence AUGGUGCA
M/M matrix

| Base | A | U | G | G | U | G | C | A |
|---|---|---|---|---|---|---|---|---|
| A | 0.000 | 1.414 | 1.414 | 1.202 | 1.031 | 1.077 | 1.118 | 1.000 |
| U | | 0.000 | 1.414 | 1.118 | 1.000 | 1.031 | 1.077 | 1.014 |
| G | | | 0.000 | 1.000 | 1.118 | 1.000 | 1.031 | 1.077 |
| G | | | | 0.000 | 1.414 | 1.000 | 1.054 | 1.118 |
| U | | | | | 0.000 | 1.414 | 1.414 | 1.054 |
| G | | | | | | 0.000 | 1.414 | 1.414 |
| C | | | | | | | 0.000 | 3.162 |
| A | | | | | | | | 0.000 |

L/L matrix

| Base | A | U | G | G | U | G | C | A |
|---|---|---|---|---|---|---|---|---|
| A | 0.000 | 1.000 | 1.000 | 0.942 | 0.787 | 0.809 | 0.831 | 0.623 |
| U | | 0.000 | 1.000 | 0.926 | 0.784 | 0.787 | 0.809 | 0.620 |
| G | | | 0.000 | 1.000 | 0.962 | 0.784 | 0.787 | 0.641 |
| G | | | | 0.000 | 1.000 | 0.707 | 0.745 | 0.604 |
| U | | | | | 0.000 | 1.000 | 1.000 | 0.528 |
| G | | | | | | 0.000 | 1.000 | 0.618 |
| C | | | | | | | 0.000 | 1.000 |
| A | | | | | | | | 0.000 |
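The tabulated values are consistent with the usual definitions for a 2D curve representation: an M/M entry is the Euclidean distance between two points divided by the number of steps between them along the sequence, and an L/L entry is the same distance divided by the length of the curve between them. The sketch below assumes those definitions. The point coordinates are back-calculated so that they reproduce the table; the rule that maps nucleotides to coordinates is not given here, so the coordinates should be treated as an assumption.

```python
import math

def mm_ll_matrices(points):
    """Upper-triangular M/M and L/L matrices for a sequence of 2D points."""
    n = len(points)
    # Cumulative curve length from the first point to each point.
    cum = [0.0]
    for p, q in zip(points, points[1:]):
        cum.append(cum[-1] + math.dist(p, q))
    mm = [[0.0] * n for _ in range(n)]
    ll = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            mm[i][j] = d / (j - i)            # distance per sequence step
            ll[i][j] = d / (cum[j] - cum[i])  # distance per unit curve length
    return mm, ll

# Assumed coordinates for AUGGUGCA, inferred from the tabulated values.
points = [(0, 0), (1, 1), (2, 2), (3, 2), (4, 1), (5, 2), (6, 3), (7, 0)]
mm, ll = mm_ll_matrices(points)
first_row_mm = [round(v, 3) for v in mm[0]]
```

Every L/L entry between neighbouring bases is 1 by construction, and entries shrink as the curve between two points becomes less straight, which is why the L/L values never exceed 1 while the M/M values can.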