Justin Bo-Kai Hsu, Neil Arvin Bretaña, Tzong-Yi Lee, Hsien-Da Huang.
Abstract
Regulation of pre-mRNA splicing is achieved through the interaction of RNA sequence elements and a variety of RNA-splicing related proteins (splicing factors). The splicing machinery in humans is not yet fully elucidated, partly because human splicing factors have not been exhaustively identified. Furthermore, experimental methods for splicing factor identification are time-consuming and labor-intensive. Although many computational methods have been proposed for the identification of RNA-binding proteins, none so far focuses specifically on RNA-splicing related proteins. We were therefore motivated to design a method that identifies human splicing factors using experimentally verified splicing factors. An investigation of amino acid composition reveals remarkable differences between splicing factors and non-splicing proteins. A support vector machine (SVM) is used to construct a predictive model, and five-fold cross-validation indicates that the SVM model trained with amino acid composition provides a promising accuracy (80.22%). Another basic feature, dipeptide composition, yields a predictive performance similar to that of amino acid composition. In addition, this work shows that incorporating evolutionary information and domain information can improve the predictive performance. The constructed models effectively classify (73.65% accuracy) an independent data set of human splicing factors. The independent test results indicate that in silico identification could be a feasible means of conducting preliminary analyses of splicing factors and of significantly reducing the number of potential targets that require further in vivo or in vitro confirmation.
Year: 2011 PMID: 22110674 PMCID: PMC3217973 DOI: 10.1371/journal.pone.0027567
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
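The amino acid composition feature named in the abstract can be illustrated with a minimal sketch. It assumes the feature is the fraction of each of the 20 standard residues in a sequence; the function name `amino_acid_composition`, the residue ordering, and the toy sequence are hypothetical and not taken from the paper.

```python
from collections import Counter

# Assumed ordering of the 20 standard residues (alphabetical by one-letter code).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-dimensional list with the fraction of each standard residue."""
    # Non-standard symbols (X, B, Z, gaps) are ignored; this is an assumption.
    counts = Counter(r for r in sequence.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy sequence, invented for illustration only.
toy = "MSRRGGSRRS"
vector = amino_acid_composition(toy)
```

A vector like this could then be passed to any SVM implementation; the choice of kernel and parameters is not stated in this excerpt.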
Figure 1. System flow.
Data abundance after using CD-HIT.
| Sequence identity | Positive data of training set | Positive data of independent test set | Negative data |
| 100% (original) | 283 | 99 | 19512 |
| 90% | 274 | 94 | 18897 |
| 80% | 268 | 94 | 18447 |
| 70% | 256 | 94 | 17727 |
| 60% | 242 | 88 | 16710 |
| 50% | 226 | 82 | 15255 |
| 40% | 202 | 80 | 13333 |
| 30% | 173 | 65 | 11113 |
Figure 2. The detailed process of generating the 400-dimensional PSSM vector from the PSSM profile.
Figure 3. Percent composition of the 20 amino acids in positive data (splicing factors) and negative data (non-splicing proteins).
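One common way to reduce an L×20 PSSM profile to a fixed 400-dimensional vector is sketched below. It assumes the rows of the profile are grouped by the residue type at each position, summed, and divided by the sequence length; whether the paper applies exactly this grouping or any further scaling is not stated in this excerpt. The function name `pssm_to_400` and the toy profile are hypothetical.

```python
# Column order assumed to follow the PSI-BLAST ASCII PSSM layout.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def pssm_to_400(sequence, pssm):
    """Collapse an L x 20 PSSM into a flat 400-element list.

    Rows belonging to the same residue type are summed, giving a 20 x 20
    matrix, which is divided by the sequence length and flattened row by row.
    """
    if len(sequence) == 0 or len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same non-zero length")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    grouped = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence.upper(), pssm):
        if len(row) != 20:
            raise ValueError("each PSSM row must hold 20 scores")
        if residue not in index:
            continue  # skip non-standard residues (an assumption)
        target = grouped[index[residue]]
        for j, score in enumerate(row):
            target[j] += score
    length = len(sequence)
    return [value / length for row in grouped for value in row]


# Toy profile, invented for illustration: three positions with constant scores.
toy_sequence = "AAR"
toy_pssm = [[1] * 20, [3] * 20, [-2] * 20]
features = pssm_to_400(toy_sequence, toy_pssm)
```

The resulting vector has the same length for every protein, which is what allows profiles of different lengths to be fed to a single classifier.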
Five-fold cross validation performance of basic features.
| Features | Sensitivity | Specificity | Accuracy |
| Amino acid composition | 76.90% | 80.33% | 80.22% |
| Dipeptide composition | 78.62% | 78.53% | 78.53% |
| Statistically significant dipeptides | 76.31% | 79.07% | 78.98% |
| PSSM | 79.81% | 79.48% | 79.49% |
| Functional domain | 38.75% | 93.82% | 92.16% |
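The functional domain row shows why accuracy alone can mislead here: accuracy is a class-weighted average of sensitivity and specificity, so with far more negatives than positives it tracks specificity. The sketch below states the three metrics and inverts the weighted average to estimate the positive fraction implied by a table row. The function names are hypothetical, and because the table values are rounded the estimate is only approximate.

```python
def metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


def implied_positive_fraction(sensitivity, specificity, accuracy):
    """Solve accuracy = p * sensitivity + (1 - p) * specificity for p."""
    if specificity == sensitivity:
        raise ValueError("p is undetermined when sensitivity equals specificity")
    return (specificity - accuracy) / (specificity - sensitivity)


# Functional domain row of the table above (values in percent).
p = implied_positive_fraction(38.75, 93.82, 92.16)
```

For that row `p` comes out at roughly 0.03, that is, about 3% of the evaluated proteins would be splicing factors, which explains how a model that misses most positives can still report an accuracy above 92%.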
Figure 4. Probability difference of 20×20 amino acid pairs between splicing factors and non-splicing proteins.
A red box marks an amino acid pair that is over-represented in splicing factors; a green box marks one that is under-represented.
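A difference map of this kind can be sketched as follows. The sketch assumes that a pair means two adjacent residues and that probabilities are pooled over all sequences in a set; the excerpt does not say whether the paper pools counts or averages per sequence. The function names and toy sequences are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pair_probabilities(sequences):
    """Return the pooled probability of each of the 400 adjacent residue pairs."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for first, second in zip(seq, seq[1:]):
            if first in AMINO_ACIDS and second in AMINO_ACIDS:
                counts[first + second] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid residue pairs found")
    return {a + b: counts[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)}


def pair_probability_difference(positives, negatives):
    """Positive-set probability minus negative-set probability for each pair."""
    pos = pair_probabilities(positives)
    neg = pair_probabilities(negatives)
    return {pair: pos[pair] - neg[pair] for pair in pos}


# Toy sets, invented for illustration only.
diff = pair_probability_difference(["RSRS"], ["AAAA"])
```

Positive values correspond to pairs that are over-represented in the first set (the red boxes of the figure) and negative values to under-represented pairs (the green boxes).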
Statistics of InterPro functional annotations in 173 splicing factors.
| InterPro ID | Description | Number of splicing factors |
| IPR012677 | Nucleotide-bd a/b plait | 46 |
| IPR000504 | RRM domain | 45 |
| IPR010920 | LSM-related domain | 11 |
| IPR001163 | LSM domain | 11 |
| IPR006649 | LSM domain euk/arc | 10 |
| IPR015943 | WD40/YVTN repeat-like domain | 10 |
| IPR001680 | WD40 repeat | 9 |
| IPR011046 | WD40 repeat-like domain | 9 |
| IPR019782 | WD40 repeat 2 | 9 |
| IPR017986 | WD40 repeat domain | 9 |
| IPR019781 | WD40 repeat sg | 9 |
| IPR015880 | Znf C2H2-like | 7 |
| IPR019775 | WD40 repeat CS | 7 |
| IPR020472 | G-protein beta WD-40 repeat | 6 |
| IPR013083 | Znf RING/FYVE/PHD | 6 |
InterPro classifies sequences at superfamily, family and subfamily levels and annotates the occurrence of functional domains, repeats and important sites. Annotations that occur in more than five splicing factors are listed with their InterPro ID, description, and number of splicing factors.
Five-fold cross-validation performance of hybrid features.
| Hybrid features | Sensitivity | Specificity | Accuracy |
| | | | |
| Amino acid composition | 77.47% | 82.94% | 81.77% |
| Dipeptide composition | 79.81% | 78.46% | 79.47% |
| Statistically significant dipeptides | 79.81% | 78.46% | 79.47% |
| | | | |
| Amino acid composition | 75.15% | 82.25% | 82.04% |
| Dipeptide composition | 75.76% | 77.34% | 77.29% |
| Statistically significant dipeptides | 75.12% | 76.96% | 76.91% |
| | | | |
| Amino acid composition | 82.68% | 81.77% | 81.79% |
| Dipeptide composition | 82.68% | 81.77% | 81.79% |
| Statistically significant dipeptides | 82.68% | 81.78% | 81.81% |
Predictive performance of basic features on the independent test set.
| Features | Sensitivity | Specificity | Accuracy |
| Amino acid composition | 68.07% | 68.17% | 68.17% |
| Dipeptide composition | 69.61% | 69.64% | 69.63% |
| Statistically significant dipeptides | 69.61% | 70.46% | 70.45% |
| PSSM | 72.69% | 72.20% | 72.21% |
| Functional domain | 21.53% | 93.63% | 92.79% |
Predictive performance of hybrid features on the independent test set.
| Hybrid features | Sensitivity | Specificity | Accuracy |
| | | | |
| Amino acid composition | 72.69% | 72.16% | 72.17% |
| Dipeptide composition | 72.69% | 72.20% | 72.21% |
| Statistically significant dipeptides | 72.69% | 72.20% | 72.21% |
| | | | |
| Amino acid composition | 68.07% | 68.10% | 68.10% |
| Dipeptide composition | 68.07% | 70.53% | 70.50% |
| Statistically significant dipeptides | 66.53% | 66.57% | 66.57% |
| | | | |
| Amino acid composition | 74.23% | 73.10% | 73.61% |
| Dipeptide composition | 74.23% | 73.64% | 73.65% |
| Statistically significant dipeptides | 74.23% | 73.64% | 73.65% |