Yang Shu, Ning Zhang, Xiangyin Kong, Tao Huang, Yu-Dong Cai.
Abstract
RNA editing is a post-transcriptional RNA process that provides RNA and protein complexity for regulating gene expression in eukaryotes. Predicting RNA editing by computational methods is challenging. In this study, we developed a novel method to predict RNA editing based on a random forest classifier. A careful feature selection procedure was performed based on the Maximum Relevance Minimum Redundancy (mRMR) and Incremental Feature Selection (IFS) algorithms. Eighteen optimal features were selected from the 77 features in our dataset and used to construct the final predictor. The accuracy and MCC (Matthews correlation coefficient) values for the training dataset were 0.866 and 0.742, respectively; for the testing dataset, the accuracy and MCC were 0.876 and 0.576, respectively. Performance was higher with the 18 selected features than with all 77, suggesting that a small feature set is sufficient for accurate prediction. An analysis of the 18 features was performed and may shed light on the mechanism and dominant factors of RNA editing, providing a basis for future experimental validation.
Year: 2014 PMID: 25338210 PMCID: PMC4206426 DOI: 10.1371/journal.pone.0110607
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. The IFS curves in the training dataset.
The plot shows the MCC values of the predictors constructed using different numbers of top features selected from the corresponding mRMR table during the IFS process. When the first 18 features were selected, the MCC reached its maximum value of 0.7415.
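The IFS procedure behind this curve can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature list is assumed to be ranked already (for example by mRMR), and `evaluate` is a hypothetical callback standing in for "train a predictor on this feature subset and return its cross-validated MCC". The toy scoring function and feature names below are made up purely so the sketch runs.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked_features: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
) -> Tuple[int, float, List[float]]:
    """Score the top-k ranked features for k = 1..N and return the best k.

    `ranked_features` must already be ordered from most to least important.
    `evaluate` is a hypothetical callback that returns the MCC of a
    predictor built on the given feature subset.
    """
    # One point of the IFS curve per prefix of the ranked list.
    curve = [
        evaluate(ranked_features[:k])
        for k in range(1, len(ranked_features) + 1)
    ]
    # max() returns the first maximum, so ties favour the smaller subset.
    best_index = max(range(len(curve)), key=curve.__getitem__)
    return best_index + 1, curve[best_index], curve


def toy_evaluate(subset: Sequence[str]) -> float:
    """Made-up score that rises, peaks at 3 features, then declines."""
    k = len(subset)
    return 0.5 + 0.1 * min(k, 3) - 0.02 * max(0, k - 3)


# Hypothetical feature names, used only for this demonstration.
features = ["f1", "f2", "f3", "f4", "f5"]
best_k, best_mcc, curve = incremental_feature_selection(features, toy_evaluate)
print(best_k, round(best_mcc, 3))
```

In a real setting, `evaluate` would wrap the classifier and the cross-validation loop, and the returned `curve` is what would be plotted against the number of features.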
The prediction performance of the final model using 18 features, by 10-fold cross validation.
| Dataset | Features | | | Accuracy | MCC |
| Training | 18 | 0.945 | 0.787 | 0.866 | 0.742 |
| Testing | 18 | 0.897 | 0.756 | 0.876 | 0.576 |
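The testing dataset shows a higher accuracy but a lower MCC than the training dataset. One common reason for such a pattern is class imbalance, since accuracy can stay high when most samples belong to one class while MCC penalizes errors on the minority class. The sketch below computes both measures from confusion-matrix counts using their standard definitions. The counts are invented for illustration and are not taken from the study.

```python
import math


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts: 100 positives and 100 negatives.
balanced = dict(tp=90, fn=10, tn=80, fp=20)
# Hypothetical counts: 180 positives and only 20 negatives.
imbalanced = dict(tp=170, fn=10, tn=12, fp=8)

for name, counts in (("balanced", balanced), ("imbalanced", imbalanced)):
    print(name, round(accuracy(**counts), 3), round(mcc(**counts), 3))
```

With these made-up counts the imbalanced set has the higher accuracy and the lower MCC, which is why both measures are worth reporting side by side.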
The 18 optimal features selected in this study and their descriptions.
| Rank | Name | Description |
| 1 | wt_treads | The total number of reads from the ‘total’ alignments (sense+antisense) (containing G, A, T, C and gaps) in wild-type RNA |
| 2 | wt_3AGagap | The ratio of the number of gaps in ‘3AG’ antisense alignments to the total number of reads in ‘3AG’ antisense alignments (containing G, A, T, C and gaps) in wild-type RNA |
| 3 | ad_naC | The ratio of the number of C reads in ‘normal’ antisense alignments to the total number of reads in ‘normal’ antisense alignments (containing G, A, T, C and gaps) in ADAR- RNA |
| 4 | G3let_tot_rat | The number of G reads from ‘3AG’ and ‘3TC’ alignments (both sense and antisense) divided by the number of G reads from the ‘total’ (sense+antisense) alignments in wild-type RNA |
| 5 | wt_3AGsT | The ratio of the number of T reads in ‘3AG’ sense alignments to the total number of reads in ‘3AG’ sense alignments (containing G, A, T, C and gaps) in wild-type RNA |
| 6 | wt_3TCsT | The ratio of the number of T reads in ‘3TC’ sense alignments to the total number of reads in ‘3TC’ sense alignments (containing G, A, T, C and gaps) in wild-type RNA |
| 7 | ad_3AGsT | The ratio of the number of T reads in ‘3AG’ sense alignments to the total number of reads in ‘3AG’ sense alignments (containing G, A, T, C and gaps) in ADAR- RNA |
| 8 | ad_nsT | The ratio of the number of T reads in ‘normal’ sense alignments to the total number of reads in ‘normal’ sense alignments (containing G, A, T, C and gaps) in ADAR- RNA |
| 9 | A3let_nG_rat | The number of A reads from ‘3AG’ and ‘3TC’ alignments divided by the number of G reads from normal alignments in wild-type RNA |
| 10 | wtnsas_ratG | The ratio wt_nsG/(wt_naG+0.001) |
| 11 | ad_3TCsgap | The ratio of the number of gaps in ‘3TC’ sense alignments to the total number of reads in ‘3TC’ sense alignments (containing G, A, T, C and gaps) in ADAR- RNA |
| 12 | repeat | 1 if the site falls within a region designated as a repeat, 0 if it does not |
| 13 | wt_3TCaT | The ratio of the number of T reads in ‘3TC’ antisense alignments to the total number of reads in ‘3TC’ antisense alignments (containing G, A, T, C and gaps) in wild-type RNA |
| 14 | wt2adG | The ratio of the number of G reads in the ‘total’ (sense+antisense) alignments in wild-type RNA to the number of G reads in the ‘total’ (sense+antisense) alignments in ADAR- RNA |
| 15 | ad_naG | The ratio of the number of G reads in ‘normal’ antisense alignments to the total number of reads in ‘normal’ antisense alignments (containing G, A, T, C and gaps) in ADAR- RNA |
| 16 | wt3tA | The ratio of (wt_3AGsA+wt_3TCsA+wt_3AGaA+wt_3TCaA)/(the number of A reads from the ‘total’ (sense+antisense) alignments) in wild-type RNA |
| 17 | wt_3TCsgap | The ratio of the number of gaps in ‘3TC’ sense alignments to the total number of reads in ‘3TC’ sense alignments (containing G, A, T, C and gaps) in wild-type RNA |
| 18 | wt_t_AGRL | The average length of G reads for the ‘total’ (sense+antisense) alignments in wild-type RNA |
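Most of the features above are simple read-count ratios. The following sketch shows how two of them could be derived from per-alignment counts. It rests on several assumptions: the dictionary layout and the example counts are hypothetical, `wt_nsG` and `wt_naG` are assumed to be defined by analogy with `ad_naG` (G reads over all reads in ‘normal’ sense or antisense alignments, in wild-type RNA), and returning 0.0 for an empty alignment is a choice made here rather than something stated in the source. Only the 0.001 offset in the denominator comes from the table.

```python
from typing import Dict

# Symbols counted in each alignment column: four bases plus gaps.
SYMBOLS = ("G", "A", "T", "C", "gap")


def fraction(counts: Dict[str, int], symbol: str) -> float:
    """Share of reads showing `symbol` among all reads (G, A, T, C, gaps)."""
    total = sum(counts.get(s, 0) for s in SYMBOLS)
    # Assumption: an alignment with no reads yields 0.0.
    return counts.get(symbol, 0) / total if total else 0.0


def guarded_ratio(numerator: float, denominator: float, eps: float = 0.001) -> float:
    """Ratio with a small offset in the denominator to avoid division by zero."""
    return numerator / (denominator + eps)


# Hypothetical read counts at one site in wild-type RNA.
wt_normal_sense = {"G": 6, "A": 30, "T": 2, "C": 1, "gap": 1}
wt_normal_antisense = {"G": 0, "A": 18, "T": 1, "C": 1, "gap": 0}

wt_nsG = fraction(wt_normal_sense, "G")
wt_naG = fraction(wt_normal_antisense, "G")
wtnsas_ratG = guarded_ratio(wt_nsG, wt_naG)

print(wt_nsG, wt_naG, wtnsas_ratG)
```

The offset keeps the ratio finite when no G reads appear on the antisense strand, at the cost of producing very large values in that case, as the example counts show.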