Fei He1,2,3, Ye Han4,5, Jianting Gong1,3, Jiazhi Song1,3, Han Wang1,3, Yanwen Li1,3.
Abstract
Small interfering RNAs (siRNAs) can induce targeted gene knockdown, and the effectiveness of gene silencing depends on the efficacy of the siRNA. The task of this paper is therefore to construct an effective siRNA efficacy prediction method. In our work, we describe siRNA from both quantitative and qualitative aspects. For the quantitative analysis, we form four groups of features (nucleotide frequencies, thermodynamic stability profile, thermodynamics of the siRNA-mRNA interaction, and mRNA-related features) as a new mixed representation; to the best of our knowledge, the thermodynamics of the siRNA-mRNA interaction is introduced to siRNA efficacy prediction for the first time. An F-score-based feature selection is then employed to investigate the contribution of each feature and to remove weakly relevant features. Meanwhile, we encode the siRNA sequence and existing empirical design rules as a qualitative siRNA representation. These two kinds of siRNA representation are combined at the score level by support vector regression (SVR) to predict siRNA efficacy. The experimental results indicate that our method can select features with strong discriminative ability and make the two kinds of siRNA representation work at full capacity. The prediction results also demonstrate that our method can outperform other popular siRNA efficacy prediction algorithms.
Year: 2017 PMID: 28317874 PMCID: PMC5357899 DOI: 10.1038/srep44836
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
The brief introduction of F .
| Group | Feature | Dimension |
|---|---|---|
| Nucleotide frequencies | Single-nucleotide frequencies | 4 |
| | Dinucleotide frequencies | 16 |
| | Trinucleotide frequencies | 64 |
| Thermodynamic stability profile | Watson-Crick pair free energy | 18 |
| | The sum of all the siRNA local duplex | 1 |
| | The difference of duplex formation at the 5′ and 3′ end of siRNA for 5 terminal nucleotides | 1 |
| Thermodynamic of siRNA-mRNA interaction | The energy necessary to make a potential binding region accessible | 2 |
| | The energy gained from siRNA-mRNA interaction | 1 |
| mRNA related features | Single-nucleotide frequencies in mRNA | 4 |
| | Dinucleotide frequencies in mRNA | 16 |
| | Trinucleotide frequencies in mRNA | 64 |
| | Single-nucleotide frequencies in near siRNA binding site region of mRNA | 4 |
| | Dinucleotide frequencies in near siRNA binding site region of mRNA | 16 |
| | Trinucleotide frequencies in near siRNA binding site region of mRNA | 64 |
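The frequency features in the table above (4, 16 and 64 dimensions for single-, di- and trinucleotides) can be illustrated with a minimal sketch. The function names, the sliding-window counting and the normalisation by the number of counted windows are assumptions made for illustration; the record does not state how the frequencies were normalised.

```python
from itertools import product

ALPHABET = "ACGU"


def kmer_frequencies(seq: str, k: int) -> list[float]:
    """Relative frequencies of all 4**k k-mers, in lexicographic (A, C, G, U) order."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(max(len(seq) - k + 1, 0)):
        word = seq[i:i + k]
        if word in counts:  # windows with ambiguous bases (e.g. N) are skipped
            counts[word] += 1
    total = sum(counts.values())
    # Assumed normalisation: divide by the number of counted windows.
    return [counts[w] / total if total else 0.0 for w in kmers]


def composition_features(seq: str) -> list[float]:
    """Concatenate 1-, 2- and 3-mer frequencies: 4 + 16 + 64 = 84 values."""
    return kmer_frequencies(seq, 1) + kmer_frequencies(seq, 2) + kmer_frequencies(seq, 3)
```

The same helper could be applied to the siRNA, to the whole mRNA, or to the region near the binding site, which is how the table arrives at three blocks of 84 frequency dimensions.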
Figure 1. The PCCs between a subset of the features and siRNA inhibition on Huesken’s dataset.
Figure 2. The distributions of (a) ΔG (b) ΔG (c) ΔG for active and inactive siRNAs.
The brief introduction of F .
| Group | Encoding rule | Dimension of features |
|---|---|---|
| Sequence codes | Map nucleotides at each sequence position to four dimensions in vector space | 84 |
| Rule codes | Encode nucleotides at each sequence position with rule sets | 19 |
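The 84-dimensional sequence codes are consistent with a 21-nucleotide siRNA mapped to four dimensions per position (21 × 4 = 84). The sketch below assumes a plain one-hot mapping; the actual four-dimensional vectors used by the authors are not given in this record, and `sequence_codes` is a hypothetical name.

```python
# Assumed one-hot mapping; the source only says "four dimensions" per position.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "U": (0, 0, 0, 1),
}


def sequence_codes(sirna: str, length: int = 21) -> list[int]:
    """Encode a siRNA of the given length as a flat vector of 4 * length values."""
    seq = sirna.upper().replace("T", "U")
    if len(seq) != length:
        raise ValueError(f"expected {length} nucleotides, got {len(seq)}")
    codes: list[int] = []
    for base in seq:
        codes.extend(ONE_HOT[base])  # raises KeyError on non-ACGU symbols
    return codes
```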
The encoding of the nucleotide at each position according to empirical rules.
| Position | Nucleotide | Encoding | Rule providers |
|---|---|---|---|
| 1 | A | −1 | Ui-Tei, Amarzguioui, Takasaki, Svetlana, Matveeva |
| | C | +1 | Ui-Tei, Amarzguioui, Jagla 1, Jagla 2, Jagla 3, Matveeva |
| | G | +1 | Ui-Tei, Amarzguioui, Takasaki, Svetlana, Jagla 1, Jagla 2, Jagla 3, Matveeva, Jiang |
| | U | −1 | Ui-Tei, Amarzguioui, Takasaki, Svetlana, Matveeva, Jiang |
| 2 | A | −1 | Amarzguioui |
| | C | 0 | |
| | G | +1 | Svetlana, Jiang |
| | U | −1 | Amarzguioui, Matveeva |
| 3 | A | +1 | Reynolds |
| | C | −1 | Matveeva |
| | G | +1 | Svetlana, Jiang |
| | U | −1 | Amarzguioui, Svetlana, Jiang |
| 4 | A | 0 | |
| | C | −1 | Svetlana |
| | G | 0 | |
| | U | +1 | Matveeva |
| 5 | A | +1 | Jagla 4 |
| | C | 0 | |
| | G | 0 | |
| | U | +1 | Jagla 4 |
| 6 | A | +1 | Amarzguioui, Takasaki, Svetlana, Jagla 4, Matveeva, Jiang |
| | C | −1 | Hsieh, Takasaki, Svetlana, Matveeva, Jiang |
| | G | −1 | Svetlana, Svetlana |
| | U | +1 | Svetlana, Jagla 4, Matveeva, Jiang |
| 7 | A | +1 | Svetlana, Matveeva, Jiang |
| | C | −1 | Svetlana, Matveeva, Jiang |
| | G | +1 | Takasaki |
| | U | −1 | Takasaki |
| 8 | A | +1 | Takasaki |
| | C | 0 | |
| | G | −1 | Takasaki |
| | U | 0 | |
| 9 | A | 0 | |
| | C | 0 | |
| | G | −1 | Takasaki, Matveeva |
| | U | −1 | Jagla 1, Jiang |
| 10 | A | +1 | Jagla 1 |
| | C | +1 | Jagla 2 |
| | G | +1 | Jagla 2 |
| | U | +1 | Reynolds, Svetlana, Jagla 1, Matveeva, Jiang |
| 11 | A | 0 | |
| | C | +1 | Hsieh, Jagla 3 |
| | G | +1 | Hsieh, Jagla 3 |
| | U | 0 | |
| 12 | A | +1 | Matveeva |
| | C | 0 | |
| | G | −1 | Matveeva |
| | U | 0 | |
| 13 | A | +1 | Svetlana, Matveeva, Jiang |
| | C | −1 | Svetlana, Jiang |
| | G | −1 | Reynolds, Svetlana, Jiang |
| | U | +1 | Svetlana, Matveeva, Jiang |
| 14 | A | 0 | |
| | C | −1 | Svetlana, Jiang |
| | G | 0 | |
| | U | 0 | |
| 15 | A | +1 | Svetlana, Jiang |
| | C | −1 | Matveeva |
| | G | 0 | |
| | U | −1 | Svetlana, Jiang |
| 16 | A | 0 | |
| | C | 0 | |
| | G | +1 | Hsieh |
| | U | +1 | Matveeva |
| 17 | A | +1 | Amarzguioui, Svetlana, Matveeva, Jiang |
| | C | 0 | |
| | G | −1 | Matveeva |
| | U | +1 | Amarzguioui |
| 18 | A | +1 | Amarzguioui, Svetlana, Matveeva, Jiang |
| | C | −1 | Svetlana, Matveeva, Jiang |
| | G | −1 | Matveeva |
| | U | +1 | Svetlana |
| 19 | A | +1 | Ui-Tei, Amarzguioui, Svetlana, Jagla 1, Jagla 2, Jagla 4, Matveeva, Jiang |
| | C | −1 | Reynolds, Ui-Tei, Matveeva, Jiang |
| | G | −1 | Reynolds, Ui-Tei, Amarzguioui, Hsieh, Takasaki, Svetlana, Matveeva, Jiang |
| | U | +1 | Ui-Tei, Amarzguioui, Hsieh, Svetlana, Jagla 1, Jagla 2, Jagla 4, Matveeva, Jiang |
+1: Preference for high siRNA efficacy. −1: Preference for low siRNA efficacy. 0: No rule followed.
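The rule table can be read as a lookup from (position, nucleotide) to a value in {+1, 0, −1}, giving the 19-dimensional rule codes. The sketch below transcribes the Encoding column of the table; the names `RULE_TABLE` and `rule_codes` are hypothetical, and which strand and orientation the positions refer to is an assumption left to the caller.

```python
# One tuple per position 1..19, ordered (A, C, G, U), transcribed from the table.
RULE_TABLE = [
    (-1, +1, +1, -1),  # 1
    (-1, 0, +1, -1),   # 2
    (+1, -1, +1, -1),  # 3
    (0, -1, 0, +1),    # 4
    (+1, 0, 0, +1),    # 5
    (+1, -1, -1, +1),  # 6
    (+1, -1, +1, -1),  # 7
    (+1, 0, -1, 0),    # 8
    (0, 0, -1, -1),    # 9
    (+1, +1, +1, +1),  # 10
    (0, +1, +1, 0),    # 11
    (+1, 0, -1, 0),    # 12
    (+1, -1, -1, +1),  # 13
    (0, -1, 0, 0),     # 14
    (+1, -1, 0, -1),   # 15
    (0, 0, +1, +1),    # 16
    (+1, 0, -1, +1),   # 17
    (+1, -1, -1, +1),  # 18
    (+1, -1, -1, +1),  # 19
]
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "U": 3}


def rule_codes(sirna: str) -> list[int]:
    """Return the 19 rule codes for the first 19 positions of the sequence."""
    seq = sirna.upper().replace("T", "U")
    if len(seq) < len(RULE_TABLE):
        raise ValueError("sequence must cover at least 19 positions")
    return [row[BASE_INDEX[seq[i]]] for i, row in enumerate(RULE_TABLE)]
```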
The processes of binary search for the optimal subset features .
| Iteration | Number of features | Pearson Correlation Coefficient |
|---|---|---|
| 1 | 275 | 0.670 |
| 2 | 275/2 = 137 | 0.682 |
| 4 | 68/2 = 34 | 0.684 |
| 5 | 34 + (68–34)/2 = 51 | 0.688 |
| 6 | 51 + (68–51)/2 = 59 | 0.687 |
| 7 | 59 + (68–59)/2 = 63 | 0.685 |
| 8 | 63 + (68–63)/2 = 65 | 0.684 |
| 9 | 65 + (68–65)/2 = 66 | 0.687 |
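The size column of the table follows a halve-then-bisect pattern with integer division. The sketch below is one possible reading of that search, not the authors' stated procedure: it assumes features are already ranked, that `score(n)` returns the PCC obtained with the top `n` features, that halving continues while the score does not drop, and that the search then bisects back toward the last good size. The function name and stopping rule are assumptions.

```python
from typing import Callable


def search_subset_size(n_total: int, score: Callable[[int], float]) -> tuple[int, list[int]]:
    """Return (chosen size, list of sizes evaluated) for a ranked feature list."""
    visited = [n_total]
    best_n, best_score = n_total, score(n_total)

    # Phase 1: keep halving while the score does not get worse.
    n = n_total // 2
    while n >= 1:
        visited.append(n)
        s = score(n)
        if s < best_score:
            break
        best_n, best_score = n, s
        n //= 2
    else:
        return best_n, visited  # never got worse; smallest size tried wins

    # Phase 2: bisect between the size that got worse (lo) and the best size (hi).
    lo, hi = n, best_n
    while hi - lo > 1:
        mid = lo + (hi - lo) // 2
        visited.append(mid)
        s = score(mid)
        if s < best_score:
            lo = mid  # still worse than the best size: move up
        else:
            hi, best_score = mid, s  # as good with fewer features: move down
    return hi, visited
```

With a toy score that peaks at 68 features (the size named in Figure 4), for example `lambda n: -abs(n - 68)`, the first nine sizes evaluated are 275, 137, 68, 34, 51, 59, 63, 65 and 66, which matches the sizes listed in the table.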
Figure 3. The comparisons between two linear-SVR models using (a) F and (b) .
Figure 4. The 68-dimensional features selected by F-scores.
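For readers unfamiliar with F-score ranking, the sketch below uses the common two-class F-score (squared distances of the class means from the overall mean, divided by the sum of the within-class sample variances). It assumes siRNAs have been split into active and inactive classes; whether the authors used exactly this definition or this split is not stated in this record, and the function names are hypothetical.

```python
from statistics import mean, variance


def f_score(values, labels):
    """Two-class F-score of one feature; labels are truthy for the positive class."""
    pos = [v for v, y in zip(values, labels) if y]
    neg = [v for v, y in zip(values, labels) if not y]
    overall = mean(values)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1 denominator)
    # A feature that is constant within both classes has no defined score here;
    # returning 0.0 in that case is an arbitrary choice for this sketch.
    return between / within if within > 0 else 0.0


def rank_features(matrix, labels):
    """Column indices of a row-major feature matrix, sorted by descending F-score."""
    columns = list(zip(*matrix))
    scores = [f_score(col, labels) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
```

Each class needs at least two samples, because `statistics.variance` raises an error otherwise.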
The PCCs produced by the SVR models with different kernels and different inputs on the Huesken_test dataset.
| Input | PCC | |||
|---|---|---|---|---|
| Linear | polynomial | RBF | sigmoid | |
| 0.613 | 0.401 | 0.017 | ||
| 0.430 | 0.589 | 0.366 | ||
| 0.667 | 0.697 | 0.007 | ||
| 0.577 | 0.454 | 0.002 | ||
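The PCC values reported in these tables are Pearson correlation coefficients between predicted and measured siRNA inhibition. A minimal sketch of the computation follows; the function name is an assumption.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)
```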
Figure 5. The predicted results from the models for (a) (b) F (c) F and (d) our proposed fusion method.
Figure 6. The ROC curves of the five algorithms.
Performance details of the five algorithms.
| Method | PCC | AUC | Sensitivity | Specificity |
|---|---|---|---|---|
| Biopredsi | 0.660 | 0.867 | 45.2% | 90.7% |
| | | | 17% | 96.9% |
| | | | 9.6% | 99.0% |
| i-score | 0.654 | 0.863 | 48.1% | 90.7% |
| | | | 24.4% | 96.9% |
| | | | 8.9% | 99.0% |
| ThermoComposition-21 | 0.659 | 0.858 | 50.4% | 90.7% |
| | | | 28.9% | 96.9% |
| | | | 16.5% | 99.0% |
| DSIR | 0.670 | 0.874 | 58.5% | 90.7% |
| | | | 25.9% | 96.9% |
| | | | 14.8% | 99.0% |
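The table pairs each sensitivity with a fixed specificity, which corresponds to reading points off a ROC curve. The sketch below shows how sensitivity, specificity and AUC can be computed from predicted scores and binary active/inactive labels. How siRNAs were labelled as active is not stated in this record, so the labels are taken as given, and the function names are hypothetical.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are called active."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores, labels):
    """Area under the ROC curve as the fraction of correctly ordered pairs."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    # Each (active, inactive) pair counts 1 if ordered correctly, 0.5 if tied.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Raising the threshold trades sensitivity for specificity, which is why each method's sensitivity falls as the specificity column moves from 90.7% to 99.0%.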
Figure 7. Comparison of the five algorithms tested on the three independent datasets of Vickers, Reynolds and Harborth.