
2L-piRNA: A Two-Layer Ensemble Classifier for Identifying Piwi-Interacting RNAs and Their Function.

Abstract

Piwi-interacting RNAs (piRNAs) are small non-coding RNAs of around 19-33 nt in length that are involved in important cellular and gene functions and have been implicated in many kinds of cancers. Given a small non-coding RNA molecule, can we predict whether it is a piRNA from its sequence information alone? Furthermore, there are two types of piRNA: one has the function of instructing target mRNA deadenylation, and the other does not. Can we discriminate one from the other? With the avalanche of RNA sequences emerging in the postgenomic age, it is urgent to address the two problems for both basic research and drug development. Unfortunately, to the best of our knowledge, no computational method has so far been available to deal with the second problem, let alone with the two problems together. Here, by incorporating the physicochemical properties of nucleotides into the pseudo K-tuple nucleotide composition (PseKNC), we propose a predictor called 2L-piRNA. It is a two-layer ensemble classifier, in which the first layer identifies whether a query RNA molecule is a piRNA or a non-piRNA, and the second layer identifies whether a piRNA has the function of instructing target mRNA deadenylation. In 5-fold cross-validation, the predictor reached an overall accuracy of 86.1% for the first layer and 77.6% for the second layer. For the convenience of most biologists and drug development scientists, the web server for 2L-piRNA has been established at http://bioinformatics.hitsz.edu.cn/2L-piRNA/, by which users can easily get their desired results without the need to go through the mathematical details.


Keywords: PseKNC; cancers; mRNA deadenylation; non-coding RNA; physicochemical properties; piRNA; web server

Year: 2017 PMID: 28624202 PMCID: PMC5415553 DOI: 10.1016/j.omtn.2017.04.008

Source DB: PubMed Journal: Mol Ther Nucleic Acids

Introduction

With a length of around 19–33 nt, piRNAs (piwi-interacting RNAs) form the largest class of small non-coding RNA molecules in animal cells.1, 2, 3, 4 They are involved in many cellular and gene functions, including transposon silencing, specific protein translation, gene expression regulation, and the formation and maintenance of germ cells.5, 6, 7 Moreover, many studies (see, e.g., Mei et al., Cheng et al., Moyano and Stefani, and Hashim et al.) have shown that piRNAs are implicated in many kinds of cancers. Therefore, knowledge about piRNAs and their functions is very important for drug development, as well as for RNA biology and many other relevant areas. Given an RNA molecule, can we identify whether it is a piRNA? Lee et al. and Nishibu et al. developed experimental methods to address this problem, greatly stimulating the development of this area. But relying on experimental methods alone for sequence analysis is not only inefficient and expensive, but also insensitive in many cases (e.g., it is difficult to obtain a sufficient quantity of samples for observation). Facing the explosive growth of RNA sequences in the postgenomic age, computational approaches are needed to make piRNA analysis more efficient, faster, and deeper. In fact, several computational methods have been proposed for distinguishing piRNA from non-piRNA sequences. For instance, by combining the k-mer scheme and support vector machine (SVM), Zhang et al. proposed a model called piRNApredictor. Three years later, Wang et al. proposed a different model for predicting piRNAs by using transposon interaction and SVM. Recently, two more papers were published on identifying piRNAs. One was authored by Luo et al., who considered the physicochemical properties of RNA sequences, and the other by Li et al., who used an ensemble approach. 
Both methods were quite powerful, reaching the state-of-the-art performance. It is instructive to point out, however, that there are two types of piRNA in the real world. One has the function of instructing target mRNA deadenylation and the other does not. But none of the aforementioned methods has the function to distinguish these two types. The present study was initiated in an attempt to fill in this empty area by developing a new predictor that not only can be used to identify piRNAs, but also can be used to identify their functional types.

Results and Discussion

Listed in Table 1 are the success rates, measured by the four metrics of Equation 15, that were achieved by the proposed two-layer predictor 2L-piRNA on the benchmark datasets S and S+ of Equation 1, respectively. To facilitate comparison, the table also lists the corresponding results obtained by the existing state-of-the-art methods16, 17 published very recently. From Table 1, we can see the following: (1) for the first-layer prediction, the new predictor 2L-piRNA is superior to the existing state-of-the-art methods in both accuracy (Acc) and Matthews correlation coefficient (MCC), the two most important metrics; the former reflects the overall accuracy of a predictor, and the latter reflects its stability; (2) in Sn (sensitivity) and Sp (specificity), it is comparable with the existing methods: it has the highest Sp (83.9%), whereas GA-WE has the highest Sn (90.6% versus 88.3%); and (3) for the second-layer prediction, only 2L-piRNA yields results, because the existing state-of-the-art methods were not designed to make predictions at this step.

Table 1

A Comparison of the Proposed Predictor with the Existing State-of-the-Art Methods in Identifying piRNAs, First Layer, and Their Functional Types, Second Layer

Method	Sn (%)a	Sp (%)a	Acc (%)a	MCCa
First Layer

2L-piRNAb	88.3	83.9	86.1	0.723
Accurate piRNA predictionc	83.1	82.1	82.6	0.651
GA-WEd	90.6	78.3	84.4	0.694

Second Layer

2L-piRNAb	79.1	76.0	77.6	0.552
Accurate piRNA predictionc	N/A	N/A	N/A	N/A
GA-WEd	N/A	N/A	N/A	N/A

All of the data listed were obtained by the 5-fold cross-validation on the same benchmark dataset (Supplemental Information). N/A means “not available,” namely, the corresponding method fails to yield any result for the second-layer prediction.

a See Equation 15 for the metrics’ definition.

b The new method presented in this paper.

c The existing state-of-the-art method proposed by Luo et al.

d The existing state-of-the-art method proposed by Li et al.

To further show the advantage of the current 2L-piRNA in using the ensemble classifier approach, we adopted the graphic analysis because it is particularly useful for studying complicated biological systems, as demonstrated by a series of previous studies in many different fields (see, e.g., Jiang et al., Chou and Forsén, Zhou and Deng, Chou, Althaus et al.,23, 24 Wu et al., Chou et al., Zhou, and Zhou et al.). Shown in Figure 1 is the graph of receiver operating characteristic (ROC).29, 30 As we can see from the figure, the area under the ROC curve (AUC) for the ensemble classifier is larger than that of any of the individual classifiers in both the first-layer case (Figure 1A) and the second-layer case (Figure 1B), once again demonstrating the merit of 2L-piRNA via the intuitive graphical approach.

Figure 1

The Performances of the First- and Second-Layer Ensemble Sub-predictors in Comparison with Their Respective Individual Four Basic Predictors

(A and B) A graphical illustration to show the performances of (A) the first-layer ensemble sub-predictor and (B) the second-layer ensemble sub-predictor in comparison with their respective individual four basic predictors (cf. Equation 14). The performances are illustrated by means of the ROC curves.29, 30 The greater the area under the ROC curve (AUC) value is, the better the performance will be.


Conclusions

It is anticipated that the 2L-piRNA will become a very useful high-throughput tool in genome analysis and drug development, particularly in those areas involved with non-coding RNAs.

Materials and Methods

Benchmark Dataset

According to Chou's five-step rule for developing a really useful statistical predictor, the first important thing is to construct or select a reliable benchmark dataset. In the literature, the benchmark dataset usually consists of a training dataset and a testing dataset: the former is used to train a model, whereas the latter is used to test it. But as elucidated in a comprehensive review, there is no need to artificially separate a benchmark dataset into these two parts if the prediction model is examined by the jackknife test or subsampling (K-fold) cross-validation, because the outcome thus obtained is actually from a combination of many different independent dataset tests. Thus, the benchmark datasets for the current study can be formulated as

$$\begin{cases} \mathbb{S} = \mathbb{S}^{+} \cup \mathbb{S}^{-} \\ \mathbb{S}^{+} = \mathbb{S}^{+}_{\text{inst}} \cup \mathbb{S}^{+}_{\text{non-inst}} \end{cases} \tag{1}$$

where $\mathbb{S}^{-}$ is the negative subset that contains the non-piRNA samples only, $\cup$ is the symbol for union in the set theory, $\mathbb{S}^{+}$ is the subset that contains the piRNA samples only, $\mathbb{S}^{+}_{\text{inst}}$ is the sub-subset that contains piRNA samples having the function of instructing target mRNA deadenylation, whereas $\mathbb{S}^{+}_{\text{non-inst}}$ is the sub-subset without such function. The concrete procedures to construct the benchmark dataset of Equation 1 are as follows: (1) the piRNA sequences were taken from piRBase; (2) collected for $\mathbb{S}^{+}_{\text{inst}}$ are only those samples that were annotated as piRNA having the function of instructing target mRNA deadenylation; (3) collected for $\mathbb{S}^{+}_{\text{non-inst}}$ are only those samples that were annotated as piRNA without the function of instructing target mRNA deadenylation; (4) the corresponding non-piRNA sequences for the negative subset $\mathbb{S}^{-}$ were taken from Bu et al.; (5) the CD-HIT software with the cutoff threshold 0.8 was used to remove the redundancy for each of the aforementioned subsets; and (6) to minimize the negative effect caused by the skewed benchmark dataset,35, 36, 37, 38 the random sampling method was applied to balance out each of the subsets with its counterpart.

The final benchmark dataset obtained by strictly following the above procedures contains 2,836 samples, of which 709 belong to $\mathbb{S}^{+}_{\text{inst}}$, 709 to $\mathbb{S}^{+}_{\text{non-inst}}$, and 1,418 to $\mathbb{S}^{-}$. Shown in Figure 2 is the sequence length distribution of the samples in the benchmark dataset; their detailed sequences and the relevant codes are given in the Supplemental Information.

Figure 2

Length Distribution of the Sequences in the Benchmark Dataset

Pseudo K-Tuple Nucleotide Composition

With a good benchmark dataset, the next thing we need to consider is how to formulate the samples therein. Actually, this is one of the most challenging problems in computational biology. This is because all the existing machine learning algorithms were designed to handle discrete models or vectors only. But a biological sequence expressed by a vector may completely miss its sequence order or pattern, which limits the prediction quality. The pseudo amino acid composition (PseAAC) was proposed to deal with such a dilemma.40, 41, 42, 43, 44, 45 Ever since the concept of PseAAC was introduced, it has been rapidly and widely used in nearly all the areas of computational proteomics (see, e.g., Du et al., Lin and Lapointe, Chou, Khan et al., and Meher et al. and a long list of references cited in these papers). Inspired by the great successes of using PseAAC to represent protein-peptide sequences, the PseKNC (pseudo K-tuple nucleotide composition) was introduced to represent DNA/RNA sequences.50, 51, 52, 53, 54 Likewise, since its introduction, PseKNC has also been increasingly applied in many areas of genome analysis.37, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68

For an RNA sample with L nucleotides, its sequence expression is generally given by

$$\mathbf{R} = N_1 N_2 N_3 \cdots N_i \cdots N_L \tag{2}$$

where

$$N_i \in \{A\ (\text{adenine}),\ C\ (\text{cytosine}),\ G\ (\text{guanine}),\ U\ (\text{uracil})\} \tag{3}$$

denotes the nucleotide at the i-th sequence position, and $\in$ is a symbol in the set theory meaning “member of.” According to a recent review paper, the general form of PseKNC for R of Equation 2 can be formulated as

$$\mathbf{R} = [\phi_1\ \ \phi_2\ \ \cdots\ \ \phi_u\ \ \cdots\ \ \phi_Z]^{\mathbf{T}} \tag{4}$$

where Z is an integer and the values of the components $\phi_u$ (u = 1, 2, …, Z) depend on how the desired features are extracted from the RNA sample; T is the transposing operator to a matrix or vector.

In this study, we take

$$\phi_u = \begin{cases} \dfrac{f_u^{K\text{-tuple}}}{\sum_{i=1}^{4^K} f_i^{K\text{-tuple}} + w \sum_{j=1}^{\lambda} \theta_j}, & 1 \le u \le 4^K \\[2ex] \dfrac{w\,\theta_{u-4^K}}{\sum_{i=1}^{4^K} f_i^{K\text{-tuple}} + w \sum_{j=1}^{\lambda} \theta_j}, & 4^K < u \le 4^K + \lambda \end{cases} \tag{5}$$

where $f_u^{K\text{-tuple}}$ is the u-th component of the K-tuple nucleotide composition for the RNA sample sequence, and

$$\theta_j = \frac{1}{L-j-K+1} \sum_{i=1}^{L-j-K+1} \Theta\big(N_i N_{i+1} \cdots N_{i+K-1},\ N_{i+j} N_{i+j+1} \cdots N_{i+j+K-1}\big), \quad j = 1, 2, \ldots, \lambda;\ \lambda < L-K \tag{6}$$

In Equation 6, the correlation function or coupling factor is given by

$$\Theta\big(N_i \cdots N_{i+K-1},\ N_{i+j} \cdots N_{i+j+K-1}\big) = \frac{1}{\mu} \sum_{\xi=1}^{\mu} \big[P_\xi(N_i \cdots N_{i+K-1}) - P_\xi(N_{i+j} \cdots N_{i+j+K-1})\big]^2 \tag{7}$$

where μ is the number of physicochemical properties considered, whereas $P_\xi(N_i \cdots N_{i+K-1})$ is the numerical value of the ξ-th physicochemical property for the K-mer $N_i \cdots N_{i+K-1}$ in the RNA sequence, and so forth. In this study, we consider pseudo dinucleotide composition. Thus, we can substitute K = 2 into Equations 5, 6, and 7. Also, we used the values of the following six physicochemical properties of RNA dimers: rise, roll, shift, slide, tilt, and twist (Table 2). Thus, we can substitute μ = 6 as well as Rise(NN), Slide(NN), Shift(NN), Twist(NN), Roll(NN), Tilt(NN), and so forth into Equation 7 to get the coupling factors.
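The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' released code: the function names are hypothetical, the coupling factors are computed between dinucleotides (the K = 2 case), only the normalized Rise column of Table 3 is used as a property table (the study uses six), and the query sequence is made up. The resulting vector has 4^K + λ components, which matches the dimensions reported in Table 4.

```python
from itertools import product
from math import fsum

NUCLEOTIDES = "ACGU"
DIMERS = [a + b for a in NUCLEOTIDES for b in NUCLEOTIDES]

# Normalized Rise values copied from Table 3 (row order AA, AC, ..., UU).
RISE = dict(zip(DIMERS, [-0.862, -0.149, 0.565, -0.149, -1.931, 0.802, 0.565, 0.565,
                         1.515, -0.386, 0.802, -0.149, 0.089, 1.515, -1.931, -0.862]))

def kmer_frequencies(seq, k):
    """Normalized occurrence frequency of each of the 4**k K-tuples."""
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    for i in range(total):
        counts[seq[i:i + k]] += 1  # raises KeyError on letters outside A, C, G, U
    return [counts[m] / total for m in kmers]

def coupling(d1, d2, props):
    """Mean squared property difference between two dimers (cf. Equation 7)."""
    return fsum((p[d1] - p[d2]) ** 2 for p in props) / len(props)

def theta(seq, j, props):
    """j-th tier correlation factor between dimers j positions apart (cf. Equation 6)."""
    n = len(seq) - j - 1
    return fsum(coupling(seq[i:i + 2], seq[i + j:i + j + 2], props) for i in range(n)) / n

def pseknc(seq, k, lam, w, props):
    """Feature vector with 4**k composition terms and lam correlation terms (cf. Equation 5)."""
    freqs = kmer_frequencies(seq, k)
    thetas = [theta(seq, j, props) for j in range(1, lam + 1)]
    denom = fsum(freqs) + w * fsum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]

query = "UGAGGUAGUAGGUUGUAUAGUUAC"  # hypothetical 24-nt query sequence
vector = pseknc(query, k=2, lam=1, w=0.1, props=[RISE])
print(len(vector))  # 4**2 + 1 components
```

By construction the components sum to 1, so the weight w controls how much of the vector is given to sequence-order information relative to plain K-tuple composition.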

Table 2

The Original Values of Rise, Roll, Shift, Slide, Tilt, and Twist for the 16 Dinucleotides in RNA51, 142

Dimer	Physicochemical Property
Dimer	Rise	Roll	Shift	Slide	Tilt	Twist
AA	3.18	7.0	−0.08	−1.27	−0.8	31
AC	3.24	4.8	0.23	−1.43	0.8	32
AG	3.30	8.5	−0.04	−1.50	0.5	30
AU	3.24	7.1	−0.06	−1.36	1.1	33
CA	3.09	9.9	0.11	−1.46	1.0	31
CC	3.32	8.7	−0.01	−1.78	0.3	32
CG	3.30	12.1	0.30	−1.89	−0.1	27
CU	3.30	8.5	−0.04	−1.50	0.5	30
GA	3.38	9.4	0.07	−1.70	1.3	32
GC	3.22	6.1	0.07	−1.39	0.0	35
GG	3.32	12.1	−0.01	−1.78	0.3	32
GU	3.24	4.8	0.23	−1.43	0.8	32
UA	3.26	10.7	−0.02	−1.45	−0.2	32
UC	3.38	9.4	0.07	−1.70	1.3	32
UG	3.09	9.9	0.11	−1.46	1.0	31
UU	3.18	7.0	−0.08	−1.27	−0.8	31

Note that before substituting them into Equation 7, all the original values in Table 2 were subjected to a standard conversion, as described by the following equations:

$$P_\xi(N_i N_{i+1}) \Leftarrow \frac{P_\xi(N_i N_{i+1}) - \langle P_\xi \rangle}{\mathrm{SD}(P_\xi)}, \qquad \mathrm{SD}(P_\xi) = \sqrt{\big\langle (P_\xi - \langle P_\xi \rangle)^2 \big\rangle}, \qquad \xi = 1, 2, \ldots, 6 \tag{8}$$

where the symbol “< >” means taking the average of the quantity therein over 16 different dinucleotides. The converted values obtained by Equation 8 will have a zero mean value over the 16 different dinucleotides and will remain unchanged if going through the same conversion procedure again. Listed in Table 3 are the corresponding values obtained via the standard conversion of Equation 8 from those of Table 2.

Table 3

The Normalized Values Obtained from Table 2 via the Standard Conversion of Equation 8

Dimer	Physicochemical Property
Dimer	Rise	Roll	Shift	Slide	Tilt	Twist
AA	−0.862	−0.689	−1.163	1.386	−1.896	−0.270
AC	−0.149	−1.698	1.545	0.510	0.555	0.347
AG	0.565	0.000	−0.813	0.127	0.096	−0.888
AU	−0.149	−0.643	−0.988	0.894	1.015	0.965
CA	−1.931	0.643	0.497	0.346	0.862	−0.270
CC	0.802	0.092	−0.551	−1.407	−0.211	0.347
CG	0.565	1.652	2.156	−2.009	−0.823	−2.741
CU	0.565	0.000	−0.813	0.127	0.096	−0.888
GA	1.515	0.413	0.147	−0.969	1.321	0.347
GC	−0.386	−1.102	0.147	0.729	−0.670	2.201
GG	0.802	1.652	−0.551	−1.407	−0.211	0.347
GU	−0.149	−1.698	1.545	0.510	0.555	0.347
UA	0.089	1.010	−0.639	0.401	−0.977	0.347
UC	1.515	0.413	0.147	−0.969	1.321	0.347
UG	−1.931	0.643	0.497	0.346	0.862	−0.270
UU	−0.862	−0.689	−1.163	1.386	−1.896	−0.270

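As a quick check of how Table 3 follows from Table 2, the standard conversion can be sketched as below. The function name is hypothetical, and only the Rise and Twist columns are transcribed for brevity. The standard deviation is taken over the 16 dinucleotides without a sample correction, which is the choice that reproduces the tabulated values for these two columns.

```python
from math import sqrt

# Row order of Tables 2 and 3: AA, AC, AG, AU, CA, ..., UU.
DIMERS = [a + b for a in "ACGU" for b in "ACGU"]

# Original values copied from Table 2.
RISE = [3.18, 3.24, 3.30, 3.24, 3.09, 3.32, 3.30, 3.30,
        3.38, 3.22, 3.32, 3.24, 3.26, 3.38, 3.09, 3.18]
TWIST = [31, 32, 30, 33, 31, 32, 27, 30, 32, 35, 32, 32, 32, 32, 31, 31]

def standard_conversion(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = sum(values) / len(values)
    sd = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

rise_norm = dict(zip(DIMERS, standard_conversion(RISE)))
twist_norm = dict(zip(DIMERS, standard_conversion(TWIST)))

print(round(rise_norm["AA"], 3), round(twist_norm["CG"], 3))
```

Applying `standard_conversion` a second time leaves the values unchanged, because the converted column already has zero mean and unit standard deviation.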

Operation Engine

Below, let us consider the third step of the five-step rule, i.e., what kind of algorithms should be used to operate the training and predicting.

Support Vector Machine

Widely used in many different areas of computational biology (see, e.g., Feng et al., Han et al., Liu et al., Qumar et al., Kiu et al., Liu et al.,75, 76 Rahimi et al., and Chen et al.), SVM is a powerful algorithm for classification. Its basic idea has been elaborated in the aforementioned papers, and hence there is no need to repeat it here. Readers interested in knowing more about SVM can refer to the previous papers79, 80 or a monograph. In this study, we used the SVM implementation in Scikit-learn, which is based on LIBSVM, with the radial basis function (RBF) kernel.
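For readers who want to see what such a base classifier looks like in practice, here is a minimal sketch using Scikit-learn's `SVC`. The helper name `make_rbf_svm` and the synthetic training data are assumptions made for illustration; the exponents 7 and 1 correspond to C = 2^7 and γ = 2, the values listed for the first base classifier in Table 4. Probability outputs are enabled here only because they are convenient for a later voting step.

```python
import numpy as np
from sklearn.svm import SVC

def make_rbf_svm(log2_c, log2_gamma):
    """RBF-kernel SVM with C and gamma given as powers of two."""
    return SVC(kernel="rbf", C=2.0 ** log2_c, gamma=2.0 ** log2_gamma,
               probability=True, random_state=0)

# Synthetic stand-in for 17-dimensional PseKNC vectors (K = 2, lambda = 1).
rng = np.random.default_rng(0)
X = np.vstack([rng.random((40, 17)) * 0.1,           # hypothetical negative samples
               rng.random((40, 17)) * 0.1 + 0.05])   # hypothetical positive samples
y = np.array([0] * 40 + [1] * 40)

clf = make_rbf_svm(7, 1).fit(X, y)
proba = clf.predict_proba(X[:3])  # one row per sample: [P(negative), P(positive)]
print(proba.shape)
```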

Two-Layer Classification Framework

Inspired by the recent study, we constructed a two-layer classification framework as done in Chou and Shen,84, 85, 86 Wang et al., Xiao et al.,88, 89, 90 and Shen and Chou.91, 92, 93 The SVM model in the first-layer classifier was trained with S of Equation 1, serving to predict a query RNA sample as piRNA or non-piRNA; the SVM model in the second layer was trained with S+ of Equation 1 to further identify whether the predicted piRNA sample has the function of instructing target mRNA deadenylation. Shown in Figure 3 is a flowchart to illustrate how the two-layer classifier works.

Figure 3

A Flowchart to Show How the 2L-piRNA Predictor Is Working

The input query sequences are first identified by the first-layer sub-predictor as piRNA or non-piRNA. Subsequently, the predicted piRNAs are further identified by the second-layer sub-predictor as having or not having the function of instructing target mRNA deadenylation. Dataset 1 and dataset 2 refer to S and S+ of Equation 1, respectively.
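The control flow of the two layers can be sketched as a simple cascade. The function name and the two toy decision rules below are hypothetical stand-ins for the trained first- and second-layer sub-predictors; the point is only that the second layer is consulted solely for sequences the first layer accepts as piRNA.

```python
def two_layer_predict(seq, is_pirna, instructs_deadenylation):
    """Run the first-layer decision, then the second layer only for predicted piRNAs."""
    if not is_pirna(seq):
        return "non-piRNA"
    if instructs_deadenylation(seq):
        return "piRNA, instructing target mRNA deadenylation"
    return "piRNA, not instructing target mRNA deadenylation"

# Toy stand-ins for the two trained sub-predictors (not the real decision rules).
def toy_layer1(seq):
    return 19 <= len(seq) <= 33

def toy_layer2(seq):
    return seq.startswith("U")

for query in ["UGAGGUAGUAGGUUGUAUAGUUAC", "ACGU"]:
    print(query, "->", two_layer_predict(query, toy_layer1, toy_layer2))
```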


Ensemble Learning

As we can see from Equations 5 and 6, the RNA sample defined by the PseKNC approach in this study contains three uncertain parameters: K, λ, and w. In this study, the ranges considered for these parameters are

$$1 \le K \le 6, \qquad 1 \le \lambda \le 19, \qquad 0.1 \le w \le 0.9 \tag{9}$$

In other words, K may be 1, 2, 3, 4, 5, and 6; λ may be 1, 5, 9, 13, 17, and 19; w may be 0.1, 0.3, 0.5, 0.7, and 0.9. Accordingly, there are a total of 5 × 6 × 5 = 150 individual classifiers for each layer. Suppose each of these individual classifiers is expressed by $\mathbb{C}(i)$ (i = 1, 2, …, 150); their ensemble classifier can be formulated as

$$\mathbb{C}^{E} = \mathbb{C}(1)\ \forall\ \mathbb{C}(2)\ \forall\ \cdots\ \forall\ \mathbb{C}(150) = \forall_{i=1}^{150}\, \mathbb{C}(i) \tag{10}$$

where the symbol $\forall$ denotes the fusing operator. The ensemble predictor formed by fusing an array of individual predictors via a voting system can yield much better prediction quality, as demonstrated by a series of previous studies including signal peptide prediction,86, 92 membrane protein type classification,84, 94 protein subcellular location prediction,95, 96 protein fold pattern recognition, enzyme functional classification, protein-protein interaction prediction, protein-protein binding site identification, and DNA recombination spot identification. Unfortunately, if all of the 150 classifiers in Equation 10 were directly used to form an ensemble predictor by the voting approach, it would not only be computationally inefficient, but might also reduce the success rate because of too much noise. One effective approach is to select some key classifiers from them. To realize this, let us introduce the concept of “complementing degree” between two individual classifiers, C(i) and C(j), or their “mutually strengthening degree,” D(i,j), as defined below:

[Equation 11]

where m represents the number of training samples, and

[Equation 12]

In Equation 12, p(i) denotes the probability or output when applying the classifier C(i) on the t-th sample, p(j) the corresponding output for C(j), and “both fail” means that both predicted results are incorrect.

By means of Equations 11 and 12, all of the 150 classifiers in each layer were clustered with the AP (affinity propagation) clustering algorithm using the default parameters. Four clusters were thus obtained for each of the two layers. Subsequently, the classifiers at the four cluster centers, which have the highest complementing/strengthening degrees, were selected as the representative classifiers, as illustrated by the flowchart in Figure 4. Suppose the four representative classifiers thus selected for the first and second layers are denoted by

$$\begin{cases} \mathbb{C}^{*}(\text{1st},1),\ \mathbb{C}^{*}(\text{1st},2),\ \mathbb{C}^{*}(\text{1st},3),\ \mathbb{C}^{*}(\text{1st},4) & \text{for the first layer} \\ \mathbb{C}^{*}(\text{2nd},1),\ \mathbb{C}^{*}(\text{2nd},2),\ \mathbb{C}^{*}(\text{2nd},3),\ \mathbb{C}^{*}(\text{2nd},4) & \text{for the second layer} \end{cases} \tag{13}$$

Figure 4

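A Flowchart to Show the Process of How to Select the Four Representative Classifiers in Equation 13 from the 150 Individual Basic Classifiers in Equation 10 for the First and Second Layers, Respectively

Listed in Table 4 are the detailed values of their parameters for the first and second layers, respectively. Thus, instead of Equation 10, the final ensemble classifier should be formulated as

$$\begin{cases} \mathbb{C}^{E}(\text{1st}) = \mathbb{C}^{*}(\text{1st},1)\ \forall\ \mathbb{C}^{*}(\text{1st},2)\ \forall\ \mathbb{C}^{*}(\text{1st},3)\ \forall\ \mathbb{C}^{*}(\text{1st},4) \\ \mathbb{C}^{E}(\text{2nd}) = \mathbb{C}^{*}(\text{2nd},1)\ \forall\ \mathbb{C}^{*}(\text{2nd},2)\ \forall\ \mathbb{C}^{*}(\text{2nd},3)\ \forall\ \mathbb{C}^{*}(\text{2nd},4) \end{cases} \tag{14}$$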

Table 4

List of the Four Individual Representative Base Classifiers Selected by Using the Affinity Propagation Clustering Algorithm for Each of the Two Layers Concerned

Base Classifier	Feature	Dimension	Voting Weighted Factor V_w	Acc (%)
First Layer

C∗(1st,1)	PseKNCa	17	0.200	84.1
C∗(1st,2)	PseKNCb	21	0.100	84.0
C∗(1st,3)	PseKNCc	69	0.300	84.6
C∗(1st,4)	PseKNCd	257	0.400	82.1

Second Layer

C∗(2nd,1)	PseKNCe	17	0.100	73.8
C∗(2nd,2)	PseKNCf	69	0.800	77.0
C∗(2nd,3)	PseKNCg	4,101	0.000	71.4
C∗(2nd,4)	PseKNCh	4,113	0.100	70.1

a The optimal parameters were K = 2, λ = 1, w = 0.1, C = 2^7, γ = 2.

b The optimal parameters were K = 2, λ = 5, w = 0.3, C = 2^15, γ = 2^−1.

c The optimal parameters were K = 3, λ = 5, w = 0.1, C = 2^13, γ = 2^−1.

d The optimal parameters were K = 4, λ = 1, w = 0.3, C = 2^13, γ = 2^−1.

e The optimal parameters were K = 2, λ = 1, w = 0.9, C = 2^13, γ = 2.

f The optimal parameters were K = 3, λ = 5, w = 0.1, C = 2^9, γ = 2.

g The optimal parameters were K = 6, λ = 5, w = 0.7, C = 2^7, γ = 2^3.

h The optimal parameters were K = 6, λ = 17, w = 0.9, C = 2^11, γ = 2^3.

Note that, unlike the ensemble classifiers formed in Chou and Shen102, 103, 104, 105 and Qiu et al.,106, 107 the voting weighted factors V_w were included during the fusion process for each layer, and their optimal values can be easily derived by optimizing success rates during the validation process, as shown in Table 4 (Voting Weighted Factor V_w column). The predictor developed via the above procedures is called 2L-piRNA, where 2L represents the two-layer ensemble classifier and piRNA represents the piwi-interacting RNA and its function.
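One simple way to picture a weighted fusion of four base classifiers is a weighted average of their positive-class probabilities. The sketch below uses the voting weighted factors from Table 4, but the averaging rule, the 0.5 threshold, the function name, and the example probabilities are all assumptions made for illustration; the exact fusion rule of Equation 14 is not reproduced here.

```python
# Voting weighted factors V_w copied from Table 4.
FIRST_LAYER_WEIGHTS = [0.2, 0.1, 0.3, 0.4]
SECOND_LAYER_WEIGHTS = [0.1, 0.8, 0.0, 0.1]

def weighted_vote(probabilities, weights, threshold=0.5):
    """Fuse positive-class probabilities by a weighted average and threshold the score."""
    if len(probabilities) != len(weights):
        raise ValueError("one probability per base classifier is required")
    score = sum(w * p for w, p in zip(weights, probabilities)) / sum(weights)
    return score, score >= threshold

# Hypothetical outputs of the four representative classifiers for one query.
score, is_positive = weighted_vote([0.9, 0.2, 0.6, 0.4], FIRST_LAYER_WEIGHTS)
print(round(score, 2), is_positive)
```

With the second-layer weights, the third base classifier has weight 0.000 and therefore does not influence the fused score at all.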

Prediction Quality Measurement

How to measure the prediction quality is one of the five indispensable steps in developing a new prediction method for a biological system. It consists of two issues: What scales should be used to measure the predictor’s quality? And what test method should be adopted to score them? Below, let us address the two problems one by one.

Formulation of Measurement Scales

The following metrics are widely used in the literature to measure the prediction quality from four different aspects: (1) Acc for the overall accuracy of a predictor, (2) MCC for its stability, (3) Sn for its sensitivity, and (4) Sp for its specificity. Unfortunately, the four metrics’ original formulations copied directly from mathematical books are difficult to understand for most biologists due to lack of intuitiveness. Fortunately, by using the scales defined by Chou in studying signal peptides, Xu et al. and Chen et al. had successfully converted them into a set of intuitive equations that are much easier for most biologists to understand, as given below:

$$\begin{cases} \mathrm{Sn} = 1 - \dfrac{N^{+}_{-}}{N^{+}}, & 0 \le \mathrm{Sn} \le 1 \\[2ex] \mathrm{Sp} = 1 - \dfrac{N^{-}_{+}}{N^{-}}, & 0 \le \mathrm{Sp} \le 1 \\[2ex] \mathrm{Acc} = 1 - \dfrac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}}, & 0 \le \mathrm{Acc} \le 1 \\[2ex] \mathrm{MCC} = \dfrac{1 - \left( \dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}} \right)}{\sqrt{\left( 1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}} \right) \left( 1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}} \right)}}, & -1 \le \mathrm{MCC} \le 1 \end{cases} \tag{15}$$

where $N^{+}$ represents the total number of the positive samples investigated, whereas $N^{+}_{-}$ is the number of the positive samples incorrectly predicted to be negative, and $N^{-}$ is the total number of the negative samples investigated, whereas $N^{-}_{+}$ is the number of the negative samples incorrectly predicted to be positive. Based on the definition of Equation 15, the meanings of Sn, Sp, Acc, and MCC have become much more intuitive and easier to understand, as discussed and used in a series of recent studies in various biological areas (see, e.g., Jia et al.,35, 36, 99, 100, 111, 112, 113 Liu et al.,37, 75, 114, 115 Xiao et al., Lin et al., Chen et al.,61, 116 Qiu et al.,106, 107, 117, 118 Xu et al.,119, 120, 121, 122 and Ding et al.). It should be pointed out, however, that for the multi-label systems (see, e.g., Xiao et al., Qiu et al., Xiao et al., Chou et al., Lin et al., and Cheng et al.), a much more sophisticated set of scales is needed as elaborated by Chou.
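The four metrics are easy to compute from the error counts, as the following sketch shows. The function names are hypothetical. The example counts (166 false negatives and 228 false positives out of 1,418 samples each) are not reported in the paper; they are back-calculated here from the first-layer percentages in Table 1, purely to illustrate that the formulas give figures of that size and that the MCC expression agrees with the conventional confusion-matrix form.

```python
from math import sqrt

def chou_metrics(n_pos, n_neg, false_neg, false_pos):
    """Sn, Sp, Acc and MCC from sample totals and error counts (cf. Equation 15)."""
    sn = 1 - false_neg / n_pos
    sp = 1 - false_pos / n_neg
    acc = 1 - (false_neg + false_pos) / (n_pos + n_neg)
    mcc = (1 - (false_neg / n_pos + false_pos / n_neg)) / sqrt(
        (1 + (false_pos - false_neg) / n_pos) * (1 + (false_neg - false_pos) / n_neg))
    return sn, sp, acc, mcc

def standard_mcc(tp, fn, tn, fp):
    """Conventional MCC written in terms of the confusion matrix."""
    return (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# Hypothetical error counts, back-calculated from the first-layer row of Table 1.
sn, sp, acc, mcc = chou_metrics(n_pos=1418, n_neg=1418, false_neg=166, false_pos=228)
print(f"Sn={sn:.3f} Sp={sp:.3f} Acc={acc:.3f} MCC={mcc:.3f}")
```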

Cross-Validation

There are three different cross-validation methods that are widely used in the literature: (1) jackknife test, (2) subsampling (or K-fold cross-validation) test, and (3) independent dataset test. Of these three, the jackknife test is the least arbitrary, as it always yields a unique outcome for a given benchmark dataset, as elaborated by Chou and widely recognized and increasingly adopted by researchers to analyze the quality of various predictors (see, e.g., Kabir and Hayat, Kumar et al., Chen et al., Ali and Hayat, Khan et al., Mondal and Pai, Dehzangi et al., Ahmad et al., Ju et al., and Behbahani et al.). In this study, however, to reduce the computational time, we adopted the 5-fold cross-validation method for each layer in 2L-piRNA, as done by many investigators with SVM as the prediction engine. For each layer, the benchmark dataset was divided into five subsets; for each run, four subsets were used as the training set, and the remaining one was used as the test set to evaluate the performance. This process was repeated five times until each subset was used as a test set once. To do this, we first randomly divided the benchmark datasets in Equation 1 into five subsets with approximately the same size. For instance, for the first benchmark dataset in Equation 1, we have

$$\mathbb{S} = \mathbb{S}_1 \cup \mathbb{S}_2 \cup \mathbb{S}_3 \cup \mathbb{S}_4 \cup \mathbb{S}_5, \qquad \mathbb{S}_i \cap \mathbb{S}_j = \emptyset \ \ (i \ne j) \tag{16}$$

where $\cup$, $\cap$, and $\emptyset$ represent the symbols for union, intersection, and empty set in the set theory,95, 137 respectively, and

[Equation 17]

with

$$|\mathbb{S}_1| \approx |\mathbb{S}_2| \approx |\mathbb{S}_3| \approx |\mathbb{S}_4| \approx |\mathbb{S}_5| \tag{18}$$

where $|\mathbb{S}_1|$ denotes the number of samples (or cardinality) in $\mathbb{S}_1$, and so forth. Then, each of the five sub-benchmark datasets was singled out one by one and tested by the model trained with the remaining four sub-benchmark datasets. The cross-validation process was repeated five times, with their average as the final outcome. In other words, during the process of 5-fold cross-validation, both the training dataset and testing dataset were actually open, and each sub-benchmark dataset was in turn moved between the two. 
The 5-fold cross-validation test can exclude the “memory” effect, just like conducting five different independent dataset tests.
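The partitioning and rotation described above can be sketched with the standard library alone. The helper names and the scoring callback are hypothetical, and the split is a plain random partition (the text does not say whether the folds were stratified by class). Each sample index lands in exactly one fold, the fold sizes differ by at most one, and every fold serves as the test set once.

```python
import random

def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Randomly partition sample indices into disjoint folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_validate(n_samples, train_and_score, n_folds=5, seed=0):
    """Hold out each fold once, train on the rest, and average the scores."""
    folds = five_fold_indices(n_samples, n_folds, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [t for j, fold in enumerate(folds) if j != i for t in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / len(scores)

folds = five_fold_indices(2836)  # size of the first-layer benchmark dataset
print([len(fold) for fold in folds])
```

In real use, `train_and_score` would fit a classifier on the training indices and return a metric such as Acc on the held-out indices.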

Web Server and User Guide

In Chou's five-step rule for developing a useful predictor, the last step is to establish a user-friendly web server. This not only represents the future direction for developing any computational methods, but is also particularly important for most experimental scientists working in drug development. Accordingly, as done in a series of recent studies,63, 66, 67, 107, 112, 117, 127, 139, 140, 141 the web server for 2L-piRNA has been established as well. Moreover, to maximize users’ convenience, a step-by-step guide is provided below.

Step 1. Open the web server at http://bioinformatics.hitsz.edu.cn/2L-piRNA/ and you will see its top page as shown in Figure 5. Click on the Read Me button to see a brief introduction about the server and the caveat when using it.

Figure 5

A Semi-screen Shot to Show the Top Page of the Web Server 2L-piRNA

Its website address is http://bioinformatics.hitsz.edu.cn/2L-piRNA/.

Step 2. You can either type or copy/paste the query RNA sequence into the input box. You can also directly upload your input data via the Browse button. The input sequence should be in the FASTA format. For examples of sequences in the FASTA format, click the Example button right above the input box.

Step 3. Click on the Submit button to see the predicted results. For example, if you use the four query RNA sequences in the Example window as the input, you will see on your computer screen that the first and second query sequences are non-piRNAs. The third one is a piRNA with the function of instructing target mRNA deadenylation. The fourth one is a piRNA, but without that function. All these predicted results are fully consistent with the experimental observations as reported in Gou et al.

Author Contributions

B.L. conceived of the study and designed the experiments, participated in drafting the manuscript and performing the statistical analysis. F.Y. participated in coding the experiments and drafting the manuscript. K.-C.C. participated in revising the manuscript. All authors read and approved the final manuscript.

132 in total

1. pRNAm-PC: Predicting N(6)-methyladenosine sites in RNA sequences via physical-chemical properties.

Authors: Zi Liu; Xuan Xiao; Dong-Jun Yu; Jianhua Jia; Wang-Ren Qiu; Kuo-Chen Chou
Journal: Anal Biochem Date: 2015-12-31 Impact factor: 3.365

2. ProtIdent: a web server for identifying proteases and their types by fusing functional domain and sequential evolution information.

Authors: Kuo-Chen Chou; Hong-Bin Shen
Journal: Biochem Biophys Res Commun Date: 2008-09-05 Impact factor: 3.575

3. Identification of protein-protein binding sites by incorporating the physicochemical properties and stationary wavelet transforms into pseudo amino acid composition.

Authors: Jianhua Jia; Zi Liu; Xuan Xiao; Bingxiang Liu; Kuo-Chen Chou
Journal: J Biomol Struct Dyn Date: 2015-10-29

4. Prediction of β-lactamase and its class by Chou's pseudo-amino acid composition and support vector machine.

Authors: Ravindra Kumar; Abhishikha Srivastava; Bandana Kumari; Manish Kumar
Journal: J Theor Biol Date: 2014-10-22 Impact factor: 2.691

5. Analysis and comparison of lignin peroxidases between fungi and bacteria using three different modes of Chou's general pseudo amino acid composition.

Authors: Mandana Behbahani; Hassan Mohabatkar; Mokhtar Nosrati
Journal: J Theor Biol Date: 2016-09-08 Impact factor: 2.691

6. Identification of real microRNA precursors with a pseudo structure status composition approach.

Authors: Bin Liu; Longyun Fang; Fule Liu; Xiaolong Wang; Junjie Chen; Kuo-Chen Chou
Journal: PLoS One Date: 2015-03-30 Impact factor: 3.240

7. iPhos-PseEn: identifying phosphorylation sites in proteins by fusing different pseudo components into an ensemble classifier.

Authors: Wang-Ren Qiu; Xuan Xiao; Zhao-Chun Xu; Kuo-Chen Chou
Journal: Oncotarget Date: 2016-08-09

8. iRNA-AI: identifying the adenosine to inosine editing sites in RNA sequences.

Authors: Wei Chen; Pengmian Feng; Hui Yang; Hui Ding; Hao Lin; Kuo-Chen Chou
Journal: Oncotarget Date: 2017-01-17

9. Some remarks on protein attribute prediction and pseudo amino acid composition.

Authors: Kuo-Chen Chou
Journal: J Theor Biol Date: 2010-12-17 Impact factor: 2.691

10. A genetic algorithm-based weighted ensemble method for predicting transposon-derived piRNAs.

Authors: Dingfang Li; Longqiang Luo; Wen Zhang; Feng Liu; Fei Luo
Journal: BMC Bioinformatics Date: 2016-08-31 Impact factor: 3.169
