Bin Liu, Fan Yang, Kuo-Chen Chou.
Abstract
Involved in important cellular and gene functions and implicated in many kinds of cancer, piwi-interacting RNAs (piRNAs) are small non-coding RNAs of around 19-33 nt in length. Given a small non-coding RNA molecule, can we predict whether it is a piRNA from its sequence information alone? Furthermore, there are two types of piRNA: one has the function of instructing target mRNA deadenylation, and the other does not. Can we discriminate one from the other? With the avalanche of RNA sequences emerging in the postgenomic age, addressing these two problems is urgent for both basic research and drug development. Unfortunately, to the best of our knowledge, no computational method so far can deal with the second problem, let alone with the two problems together. Here, by incorporating the physicochemical properties of nucleotides into the pseudo K-tuple nucleotide composition (PseKNC), we propose a powerful predictor called 2L-piRNA. It is a two-layer ensemble classifier: the first layer identifies whether a query RNA molecule is a piRNA or a non-piRNA, and the second layer identifies whether a piRNA has the function of instructing target mRNA deadenylation. Rigorous cross-validation indicated that the proposed predictor achieves high success rates (accuracies of 86.1% for the first layer and 77.6% for the second layer in 5-fold cross-validation on the benchmark dataset). For the convenience of most biologists and drug development scientists, a web server for 2L-piRNA has been established at http://bioinformatics.hitsz.edu.cn/2L-piRNA/, by which users can easily get their desired results without going through the mathematical details.
Keywords: PseKNC; cancers; mRNA deadenylation; non-coding RNA; physicochemical properties; piRNA; web server
Year: 2017 PMID: 28624202 PMCID: PMC5415553 DOI: 10.1016/j.omtn.2017.04.008
Source DB: PubMed Journal: Mol Ther Nucleic Acids
A Comparison of the Proposed Predictor with the Existing State-of-the-Art Methods in Identifying piRNAs (First Layer) and Their Functional Types (Second Layer)
| Layer | Method | Sn (%) | Sp (%) | Acc (%) | MCC |
|---|---|---|---|---|---|
| First | 2L-piRNA | 88.3 | 83.9 | 86.1 | 0.723 |
| First | Accurate piRNA prediction | 83.1 | 82.1 | 82.6 | 0.651 |
| First | GA-WE | 90.6 | 78.3 | 84.4 | 0.694 |
| Second | 2L-piRNA | 79.1 | 76.0 | 77.6 | 0.552 |
| Second | Accurate piRNA prediction | N/A | N/A | N/A | N/A |
| Second | GA-WE | N/A | N/A | N/A | N/A |
All of the data listed were obtained by 5-fold cross-validation on the same benchmark dataset (Supplemental Information). N/A means "not available"; namely, the corresponding method fails to yield any result for the second-layer prediction.
Sn, Sp, Acc, and MCC: see Equation 15 for the metrics' definition.
2L-piRNA: the new method presented in this paper.
Accurate piRNA prediction: the existing state-of-the-art method proposed by Luo et al.
GA-WE: the existing state-of-the-art method proposed by Li et al.
Figure 1. The Performances of the First- and Second-Layer Ensemble Sub-predictors in Comparison with Their Respective Four Individual Basic Predictors
(A and B) ROC curves [29, 30] showing the performance of (A) the first-layer ensemble sub-predictor and (B) the second-layer ensemble sub-predictor in comparison with their respective four individual basic predictors (cf. Equation 14). The greater the area under the ROC curve (AUC), the better the performance.
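As a reminder of what the AUC measures, the sketch below computes it as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, which equals the area under the ROC curve. The function name and the example scores are hypothetical and are not taken from the paper.

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical classifier scores for three positives and three negatives.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2]
print(auc(pos, neg))  # 8 of the 9 pairs are ranked correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.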
Figure 2. Length Distribution of the Sequences in the Benchmark Dataset
The Original Values of Rise, Roll, Shift, Slide, Tilt, and Twist for the 16 Dinucleotides in RNA [51, 142]
| Dimer | Rise | Roll | Shift | Slide | Tilt | Twist |
|---|---|---|---|---|---|---|
| AA | 3.18 | 7.0 | −0.08 | −1.27 | −0.8 | 31 |
| AC | 3.24 | 4.8 | 0.23 | −1.43 | 0.8 | 32 |
| AG | 3.30 | 8.5 | −0.04 | −1.50 | 0.5 | 30 |
| AU | 3.24 | 7.1 | −0.06 | −1.36 | 1.1 | 33 |
| CA | 3.09 | 9.9 | 0.11 | −1.46 | 1.0 | 31 |
| CC | 3.32 | 8.7 | −0.01 | −1.78 | 0.3 | 32 |
| CG | 3.30 | 12.1 | 0.30 | −1.89 | −0.1 | 27 |
| CU | 3.30 | 8.5 | −0.04 | −1.50 | 0.5 | 30 |
| GA | 3.38 | 9.4 | 0.07 | −1.70 | 1.3 | 32 |
| GC | 3.22 | 6.1 | 0.07 | −1.39 | 0.0 | 35 |
| GG | 3.32 | 12.1 | −0.01 | −1.78 | 0.3 | 32 |
| GU | 3.24 | 4.8 | 0.23 | −1.43 | 0.8 | 32 |
| UA | 3.26 | 10.7 | −0.02 | −1.45 | −0.2 | 32 |
| UC | 3.38 | 9.4 | 0.07 | −1.70 | 1.3 | 32 |
| UG | 3.09 | 9.9 | 0.11 | −1.46 | 1.0 | 31 |
| UU | 3.18 | 7.0 | −0.08 | −1.27 | −0.8 | 31 |
The Normalized Values Obtained from Table 2 via the Standard Conversion of Equation 8
| Dimer | Rise | Roll | Shift | Slide | Tilt | Twist |
|---|---|---|---|---|---|---|
| AA | −0.862 | −0.689 | −1.163 | 1.386 | −1.896 | −0.270 |
| AC | −0.149 | −1.698 | 1.545 | 0.510 | 0.555 | 0.347 |
| AG | 0.565 | 0.000 | −0.813 | 0.127 | 0.096 | −0.888 |
| AU | −0.149 | −0.643 | −0.988 | 0.894 | 1.015 | 0.965 |
| CA | −1.931 | 0.643 | 0.497 | 0.346 | 0.862 | −0.270 |
| CC | 0.802 | 0.092 | −0.551 | −1.407 | −0.211 | 0.347 |
| CG | 0.565 | 1.652 | 2.156 | −2.009 | −0.823 | −2.741 |
| CU | 0.565 | 0.000 | −0.813 | 0.127 | 0.096 | −0.888 |
| GA | 1.515 | 0.413 | 0.147 | −0.969 | 1.321 | 0.347 |
| GC | −0.386 | −1.102 | 0.147 | 0.729 | −0.670 | 2.201 |
| GG | 0.802 | 1.652 | −0.551 | −1.407 | −0.211 | 0.347 |
| GU | −0.149 | −1.698 | 1.545 | 0.510 | 0.555 | 0.347 |
| UA | 0.089 | 1.010 | −0.639 | 0.401 | −0.977 | 0.347 |
| UC | 1.515 | 0.413 | 0.147 | −0.969 | 1.321 | 0.347 |
| UG | −1.931 | 0.643 | 0.497 | 0.346 | 0.862 | −0.270 |
| UU | −0.862 | −0.689 | −1.163 | 1.386 | −1.896 | −0.270 |
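The normalized values can be reproduced from the original ones. The sketch below assumes that the "standard conversion" of Equation 8 (not reproduced in this section) is the usual z-score taken over the 16 dinucleotides with the population standard deviation. Under that assumption, the Rise column agrees with the normalized table to about the third decimal place. The function name `standardize` is hypothetical.

```python
import math

# Rise values for the 16 RNA dinucleotides, copied from the table of original values.
RISE = {
    "AA": 3.18, "AC": 3.24, "AG": 3.30, "AU": 3.24,
    "CA": 3.09, "CC": 3.32, "CG": 3.30, "CU": 3.30,
    "GA": 3.38, "GC": 3.22, "GG": 3.32, "GU": 3.24,
    "UA": 3.26, "UC": 3.38, "UG": 3.09, "UU": 3.18,
}

def standardize(values):
    """Z-score each value: subtract the mean, divide by the population SD."""
    n = len(values)
    mean = sum(values.values()) / n
    # Dividing by n (not n - 1) is an assumption that matches the tabulated values.
    sd = math.sqrt(sum((v - mean) ** 2 for v in values.values()) / n)
    return {dimer: (v - mean) / sd for dimer, v in values.items()}

z_rise = standardize(RISE)
for dimer in ("AA", "CA", "GA"):
    print(dimer, round(z_rise[dimer], 3))
```

After this conversion each property has zero mean and unit variance across the 16 dinucleotides, so properties measured on different scales (for example Rise and Twist) contribute comparably to the feature vector.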
Figure 3. A Flowchart Showing How the 2L-piRNA Predictor Works
The input query sequences are first classified by the first-layer sub-predictor as piRNA or non-piRNA. Subsequently, the predicted or asserted piRNAs are further classified by the second-layer sub-predictor according to whether or not they have the function of instructing target mRNA deadenylation. Dataset 1 and dataset 2 refer to S and S+ of Equation 1, respectively.
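The two-layer flow described in this caption can be sketched as a simple cascade in which the second layer is consulted only for sequences the first layer accepts. This is an illustrative outline, not the authors' implementation: the function name `predict_two_layer` is hypothetical, and the two toy classifiers are placeholder rules with no biological meaning that stand in for the trained ensemble sub-predictors.

```python
from typing import Callable, Optional, Tuple

def predict_two_layer(
    seq: str,
    layer1_is_pirna: Callable[[str], bool],
    layer2_instructs_deadenylation: Callable[[str], bool],
) -> Tuple[str, Optional[str]]:
    """Run the second-layer classifier only on sequences the first layer calls piRNA."""
    if not layer1_is_pirna(seq):
        return "non-piRNA", None
    if layer2_instructs_deadenylation(seq):
        return "piRNA", "instructs target mRNA deadenylation"
    return "piRNA", "does not instruct target mRNA deadenylation"

# Toy placeholder rules (assumptions for illustration only, not real predictors).
toy_layer1 = lambda s: 19 <= len(s) <= 33
toy_layer2 = lambda s: s.startswith("U")

print(predict_two_layer("UGACUAGCUAGCUAGGAUCGAUCGA", toy_layer1, toy_layer2))
print(predict_two_layer("ACGU", toy_layer1, toy_layer2))
```

Passing the classifiers as callables keeps the cascade logic independent of how each layer is trained.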
Figure 4. A Flowchart Showing How the Four Representative Classifiers in Equation 13 Are Selected from the 150 Individual Basic Classifiers in Equation 10 for the First and Second Layers, Respectively
List of the Four Individual Representative Base Classifiers Selected by Using the Affinity Propagation Clustering Algorithm for Each of the Two Layers Concerned
| Layer | Base Classifier | Feature | Dimension | Voting Weighted Factor V | Acc (%) | Optimal Parameters |
|---|---|---|---|---|---|---|
| First | 1 | PseKNC | 17 | 0.200 | 84.1 | K = 2, λ = 1, w = 0.1, C = 2^7, γ = 2 |
| First | 2 | PseKNC | 21 | 0.100 | 84.0 | K = 2, λ = 5, w = 0.3, C = 2^15, γ = 2^−1 |
| First | 3 | PseKNC | 69 | 0.300 | 84.6 | K = 3, λ = 5, w = 0.1, C = 2^13, γ = 2^−1 |
| First | 4 | PseKNC | 257 | 0.400 | 82.1 | K = 4, λ = 1, w = 0.3, C = 2^13, γ = 2^−1 |
| Second | 1 | PseKNC | 17 | 0.100 | 73.8 | K = 2, λ = 1, w = 0.9, C = 2^13, γ = 2 |
| Second | 2 | PseKNC | 69 | 0.800 | 77.0 | K = 3, λ = 5, w = 0.1, C = 2^9, γ = 2 |
| Second | 3 | PseKNC | 4,101 | 0.000 | 71.4 | K = 6, λ = 5, w = 0.7, C = 2^7, γ = 2^3 |
| Second | 4 | PseKNC | 4,113 | 0.100 | 70.1 | K = 6, λ = 17, w = 0.9, C = 2^11, γ = 2^3 |
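Two relationships can be read off the listed dimensions, voting factors, and optimal parameters. First, every dimension equals 4^K + λ (for example, 4^2 + 1 = 17), which is consistent with a PseKNC vector made of 4^K K-tuple components plus λ correlation components. Second, the factors V sum to 1 within each layer, so they can serve as weights in a vote. The sketch below checks the first relationship and shows one plausible weighted-vote fusion. The fusion rule, the 0.5 threshold, and the example votes are assumptions, since Equation 14 is not reproduced in this section; the function names are hypothetical.

```python
# (K, lambda, dimension, V) for each base classifier, copied from the listing above.
LAYER1 = [(2, 1, 17, 0.2), (2, 5, 21, 0.1), (3, 5, 69, 0.3), (4, 1, 257, 0.4)]
LAYER2 = [(2, 1, 17, 0.1), (3, 5, 69, 0.8), (6, 5, 4101, 0.0), (6, 17, 4113, 0.1)]

def pseknc_dimension(k, lam):
    """4**k K-tuple components plus lam correlation components."""
    return 4 ** k + lam

def weighted_vote(votes, weights, threshold=0.5):
    """Fuse 0/1 votes with per-classifier weights.

    Returns (decision, score). The threshold of 0.5 is an assumption.
    """
    score = sum(w * v for v, w in zip(votes, weights))
    return score > threshold, score

# Every listed dimension matches 4^K + lambda.
for layer in (LAYER1, LAYER2):
    for k, lam, dim, _ in layer:
        assert pseknc_dimension(k, lam) == dim

# Hypothetical votes from the four first-layer base classifiers (1 = piRNA).
weights1 = [v for *_, v in LAYER1]
decision, score = weighted_vote([1, 0, 1, 1], weights1)
print(decision, round(score, 3))
```

Note that the third second-layer classifier has V = 0.000, so under this kind of fusion it would not affect the ensemble decision.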
Figure 5. A Semi-screenshot Showing the Top Page of the Web Server 2L-piRNA
Its website address is http://bioinformatics.hitsz.edu.cn/2L-piRNA/.