Literature DB >> 32655289

Computational Identification of piRNAs Using Features Based on RNA Sequence, Structure, Thermodynamic and Physicochemical Properties.

Abstract

RATIONALE: PIWI-interacting RNAs (piRNAs) are a recently-discovered class of small non-coding RNAs (ncRNAs) with a length of 21-35 nucleotides. They play a role in gene expression regulation, transposon silencing, and viral infection inhibition. Once considered as "dark matter" of ncRNAs, piRNAs emerged as important players in multiple cellular functions in different organisms. However, our knowledge of piRNAs is still very limited as many piRNAs have not been yet identified due to lack of robust computational predictive tools.
METHODS: To identify novel piRNAs, we developed piRNAPred, an integrated framework for piRNA prediction employing hybrid features like k-mer nucleotide composition, secondary structure, thermodynamic and physicochemical properties. A non-redundant dataset (D3349 or D1684p+1665n) comprising 1684 experimentally verified piRNAs and 1665 non-piRNA sequences was obtained from piRBase and NONCODE, respectively. These sequences were subjected to the computation of various sequence-structure based features in binary format and trained using different machine learning techniques, of which support vector machine (SVM) performed the best.
RESULTS: During the ten-fold cross-validation approach (10-CV), piRNAPred achieved an overall accuracy of 98.60% with Mathews correlation coefficient (MCC) of 0.97 and receiver operating characteristic (ROC) of 0.99. Furthermore, we achieved a dimensionality reduction of feature space using an attribute selected classifier.
CONCLUSION: We obtained the highest performance in accurately predicting piRNAs as compared to the current state-of-the-art piRNA predictors. In conclusion, piRNAPred would be helpful to expand the piRNA repertoire, and provide new insights on piRNA functions.

Entities: Chemical

Keywords: algorithm; classification; non-coding RNA; physicochemical; piRNA; prediction

Year: 2019 PMID： 32655289 PMCID： PMC7327968 DOI： 10.2174/1389202920666191129112705

Source DB: PubMed Journal: Curr Genomics ISSN： 1389-2029 Impact factor: 2.236

INTRODUCTION

Since its discovery, RNA interference (RNAi) and its effector non-coding RNAs (ncRNAs) namely, microRNAs (miRNAs), small interfering RNAs (siRNAs), and PIWI-interacting RNA (piRNAs) have revolutionized our understanding of the mechanisms regulating gene expression [1, 2]. The common key event in all the RNAi regulatory mechanisms is the binding of small ncRNAs to Argonaute (AGO), forming a ribonucleoprotein complex termed as RNA-induced silencing complex (RISC) [3]. The AGO family of proteins is phylogenetically divided into two sub-families: somatic AGO [4], and germ line specific PIWI (P-element induced wimpy testis) clade [2, 5]. Both miRNAs and siRNAs act through the AGO family and constitute the RISC complexes known as the miRISC and siRISC, whereas the small ncRNAs interacting with PIWI are known as the piRNAs [6]. piRNA is a recently discovered class of ncRNAs, which are in the length range of ~24-32 nucleotides [1, 7, 8]. Initially, piRNAs were described as repeat-associated siRNAs (rasiRNAs) because of their origin from the repetitive elements such as transposable sequences of the genome [9]. However, later it was identified that they acted via PIWI-protein [10]. In addition to having a role in the suppression of genomic transposons, various roles of piRNAs have been recently reported like regulation of 3’ UTR of protein-coding genes via RNAi [11], transgenerational epigenetic inheritance to convey a memory of past transposon activity [12], and RNA-induced epigenetic silencing [13]. Furthermore, piRNA sequences are comparatively diverse than any other class of cellular ncRNAs and they constitute the most prevalent class of ncRNAs [7]. The overall mechanism of piRNA biogenesis is substantially different than that of the other ncRNAs. It includes siRNA, miRNA and long ncRNAs (lncRNAs) [13]. piRNAs are generated from long, single-stranded RNAs, which are transcribed from the genomic loci termed as piRNA clusters [14-19]. Earlier studies suggested multiple mechanisms of piRNA biogenesis in different cell types, tissues and organisms [10]. However, recent reports proposed a single and unified model that explained the mechanism of piRNA biogenesis across a range of evolutionarily diverse organisms [20-23]. These studies indicated that piRNA biogenesis could be divided into two parts: 1) the piRNA-dependent amplification loop, known as the ‘Ping-Pong cycle’, and 2) the piRNA-independent generation of phased trailing piRNAs. The former pathway begins with a maternally inherited 1U-biased piRNA, known as the initiator piRNA [24]. The PIWI-bound initiator piRNA cleaves the complementary single-stranded long transcript sequence into a pre-pre-piRNA with a terminal 5’ monophosphate [23]. The PIWI-bound pre-pre-piRNA undergoes subsequent RNA cleavage from its 5’ end to produce responder pre-piRNA. The responder pre-piRNA further undergoes 3’ end-processing to generate mature responder piRNA with 10A-bias. Subsequently, the mature responder piRNA enters the piRNA biogenesis cycle, acting as an initiator piRNA, and in turn, produces a new responder piRNA with 1U-bias, which is identical to the original initiator piRNA. The pathway operates as an amplification loop and hence, it is termed the ‘Ping-Pong’ cycle [7]. Thus, this arm of the unified model is dependent on a maternal initiator piRNA and also on the availability of long single-stranded piRNA precursor sequences. In contrast, phased piRNA generation is a piRNA-independent process, which operates through interaction with the mitochondrial endonuclease phospholipase D family member 6 (PDL6), an endonuclease present on the outer mitochondrial membrane. PDL6 cleaves the remaining 3’ end of the pre-pre-piRNA in a repeated manner, producing an array of tail-to-head pre-piRNAs, which are known as phased trailing piRNAs [21]. Thus, the current model proposed that the initiator piRNAs produce responder piRNAs and trailing piRNAs, which were known as the secondary and primary piRNAs, respectively, in the old model [22]. Hence, owing to the extensive role of piRNAs in regulating various biological processes, the conserved features of their biogenesis, and their sequence diversity, genome-wide identification of novel piRNAs would be of great importance to help understand piRNA-guided gene regulatory mechanisms across different cell types and organisms. In piRNA identification, the general alignment-centered algorithms such as basic local alignment search tool (BLAST) [25] and Multiple Expectation Maximization for Motif Elicitation (MEME) [26] are well known for their robustness, but not for accuracy due to divergence of piRNAs across different species [27]. Recently, next-generation sequencing (NGS)-based methods have emerged as a powerful platform to identify piRNAs in a high-throughput mode [28]. However, in addition to piRNA, sequencing-generated data may also harbor reads from several other small ncRNAs such as miRNAs, endogenous siRNAs (endo siRNAs), small nuclear RNAs (snRNAs), small nucleolar RNAs (snoRNAs), smaller fragments of mRNAs and long ncRNAs in the piRNA length range. Therefore, the length range alone cannot be a characteristic parameter to screen for piRNAs in small RNA-based sequencing platforms. Hence, more accurate and robust algorithms are required to distinguish the real piRNA sequences from the pseudo piRNAs (non-piRNAs) of similar lengths by employing various discriminative features. Recent studies employed multiple approaches to identify piRNAs with high accuracy in the genome of various organisms (Table ). Betel et al. proposed an algorithm based on the position-specific scoring matrix (PSSM) of 10 nucleotides upstream and downstream of piRNA sequences identified from the pachytene stage of mouse spermatogenesis, and predicted mouse-derived conserved piRNAs with ~60-70% precision [29]. Nevertheless, developed on a homogeneous dataset and trained on static position-specific nucleotide usage, the method shows position-specific features for only one species and hence, has limitations in piRNA identification in other species. Zhang et al. developed an algorithm using k-mer nucleotide frequency to capture the sequence-based characteristics in piRNA sequences [27]. Since the algorithm adopted dynamic nucleotide frequency calculation over the static PSSMs, it was able to identify piRNAs in the organisms beyond the training set species. However, there was a need to include additional features to improve the accuracy of piRNA identification apart from the sequence-based features. Other studies integrated additional features than sequence alone like structure and thermodynamic energy, and developed different piRNA prediction algorithms. The algorithm ‘PIANO’ was developed to predict piRNAs in Drosophila melanogaster using structural features and transposon interactions [30, 31], adopting local contiguous sequence-structure triplet-elements [27]. To predict mouse piRNAs using sequence motifs, ‘Pibomd’ was developed [32]. The above tools were trained on homogeneous datasets generated from single species (Table ). ‘Accurate piRNA prediction’ [33], ‘GA-WE’ (a genetic algorithm-based weighted ensemble method) [34] and ‘2L-piRNA’ (a two-layer ensemble classifier) [35] utilized sequence, structure and k-mer spectrum profile-based features to achieve better performance. Although these tools demonstrated a high accuracy level in predicting piRNAs, there was a need to develop a more robust algorithm capable of analyzing heterogeneous datasets from multiple organisms with high accuracy. In this study, a comprehensive and robust machine learning-based algorithm is developed to predict piRNA sequences relying on the hybrid features such as spectrum profile or k-mer features (k=1 to 5), secondary structure, thermodynamic energy and physicochemical properties of RNA dinucleotides extracted from the piRNA sequences of eight species: Homo sapiens, Mus musculus, D. melanogaster, Caenorhabditis elegans, Danio rerio, Gallus gallus domesticus, Xenopus tropicalus, and Bombyx mori (Table ). During the 10-fold cross-validation (10-CV), our method reached an overall accuracy of >98% and a sensitivity of >98% in the above species. When compared with the existing algorithms, our hybrid predictive model piRNAPred demonstrated higher accuracy level. In conclusion, piRNAPred is the most updated piRNA identification method developed by integrating features of sequence, structure, thermodynamic energy and physicochemical properties extracted from a heterogeneous dataset across phylogenetically diverse organisms.

MATERIALS AND METHODS

Dataset Preparation

The experimentally verified positive piRNA and non-piRNA sequences were extracted from piRBase [28] and NONCODE [36], respectively. Since both the databases contain millions of sequences from numerous organisms, we removed redundant and overlapping sequences using a cluster database at high identity with tolerance (CD-Hit) software. Removal of the redundant sequences was achieved at different percentage identity level; we used 60% sequence identity. Sample sequences having 60% identical overlap among them were grouped into one cluster and represented as a single sequence. The rationale of clustering was to verify the redundancy/overlap of the sequences to reduce the number of sequences at different percentage sequence identity. Finally, a non-redundant benchmark dataset (D3349 or D1684p+1665n) comprising 1684 experimentally verified piRNAs and 1665 non-piRNA sequences was obtained. These piRNA sequences were extracted from eight organisms (Table ). Further, benchmark dataset was subjected to the computation of various sequence-structure based features in binary format, and trained using different machine learning techniques (MLTs). Fig. ( entails the step-wise methodology adopted in the development of piRNAPred.

Fig. (1)

Schematic illustration of the overall workflow adopted to develop piRNAPred: Left and right arm demonstrates the processing of piRNA and non-piRNA sequence from piRBase and NONCODE, respectively to generate a dataset (D1684p+1665n) followed by their downstream conversion into sequence, structure, thermodynamic, physiochemical and BINARY1+10 feature space and predictive model development. (A higher resolution / colour version of this figure is available in the electronic copy of the article).

Apart from the above heterogeneous benchmark dataset representing sequences from evolutionarily distant species, we considered two homogeneous datasets in a tissue-specific manner. The first tissue-specific dataset was extracted from a single study, which involved the identification of piRNAs and non-piRNAs from the testes of C57BL/6P20Miwi +/+ and C57BL/6P20Miwi/ADH mice, respectively, using MIWI-Immunoprecipitation (IP) followed by sequencing [37]. There were 57,5786 piRNA and 55,1640 non-piRNA sequences in GSM822759 and GSM822762, respectively. These sequences were screened, processed and made non-redundant, which resulted in a total of 2000 positive and 1711 negative sequences. Similarly, the second tissue-specific dataset involved 1418 piRNA and 1418 pseudo-piRNA sequences form M. musculus [35]. These tissue-specific homogeneous datasets were subsequently subjected to features formulation and model development using 10-CV.

Features utilized

Spectrum Profile or k-mer Features

One of the important contributing features of predictive model development is the k-mer string [33, 35, 38]. In machine learning, a k-mer string is defined as particular k-mer tuple (1, 2, 3, 4, 5 or even higher) of nucleotide or amino acid sequences that can be used to identify some representative motifs within DNA or proteins. The overall idea of using different k-mer strings is the identification of differential nucleotide usage between the authentic piRNA and pseudo piRNA sequences. However, when the dataset sequences are of different length, spectrum profile or k-mer nucleotide composition (k-MNC) is used [34, 39]. k-MNC is defined as a total number of a particular nucleotide divided by the length of the sequence. e.g., a piRNA sequence with five guanine residues and of length n, where n=19 nucleotides, the percentage composition of guanine residues will be 26.31%. Hence, there are 4 mononucleotide composition-based features (A, T/U, G and C). (k, any amino acid or nucleotide) Similarly, there will be 16-dinucleotide composition-based features (AA, AU, AC, AG, UA, UU, UC, UG, CA, CU, CC, CG, GA, GU, GC, GG). From k=1 to 5, they are designated as mono-, di-, tri-, tetra- and penta-nucleotide compositions. Hybrid predictive models (k=1, 2/1, 2, 3 etc.) involve higher-order combinations like hybrid of mono- and di-nucleotide composition is (md), mono-, di- and tri-nucleotide is mdt, mono-, di-, tri- and tetra-nucleotide is mdtt and mono-, di-, tri-, tetra- and penta-nucleotide is mdttp. Thus, k-MNC can be utilized to identify the motifs or regions of interest, which can distinguish the positive sequences from the negative ones in terms of differential nucleotide usage. Here, we employed a hybrid model of k=1-5, resulting into a total of 1364 strings of feature space.

Sequence-Structure Triplet Elements (SSTE)

A set of 32 features of local contiguous structure-sequence triplet element (SSTE) was proposed by Xue et al., for distinguishing the hairpins of real pre-miRNAs and pseudo pre-miRNAs [31]. These SSTEs can also be incorporated for extracting real piRNAs from pseudo piRNA sequences. In a secondary structure, there are two states for every nucleotide, paired or unpaired, which are designated by brackets: either “(” or “)” or dots “.” Correspondingly, standard secondary structure notation in RNAfold was developed in the Vienna RNA package where, a paired nucleotide near the 5'-side “(”, can form pair with “)” nucleotide at 3’ end [40]. To convert them into a binary format, the paired state was designated as “1”, while the unpaired state as “0”. Further, an SSTE feature was calculated, which consisted of the secondary structure state of three contiguous nucleotides with the identity of the nucleotide in the middle [e.g. A(((]. For example, there will be 8 possible SSTEs of “adenine” as [“A(((“, “A((.”, “A(..”, “A(.(“, “A.((“, “A.(.”, “A..(“ and “A...”]. Therefore, there will be 32 (4 × 8) possible structure-sequence combinations for the nucleotides A, U, G, and C (Table ) [31].

Thermodynamic Energy of Contiguous Dinucleotides

It was reported that thermodynamic properties of miRNA and siRNA sequences contribute to determining strand bias, molecule’s function, and longevity [41]. The thermodynamic energy feature space in the study corresponds to Gibbs free energy of continuous two nucleotides ina siRNA/miRNA sequence [41]. These features were incorporated to develop computational models for siRNA prediction and demonstrated considerable improvement [42, 43]. Therefore, to check the potential of thermodynamic energy of contiguous dinucleotides in piRNA identification from false positive sequences, we calculated 16 thermodynamic energy-related patterns, and develop predictive models.

RNA Physicochemical Properties

There are six physicochemical properties of RNA dinucleotides namely rise, roll, shift, tilt, slide, and twist (Table ). Their values have been used for the prediction of ncRNAs such as miRNA and piRNA [33]. The dinucleotide combination of an individual physicochemical property leads to the generation of 16 features, and these six RNA physicochemical properties collectively create a feature space of 96 strings.

Binary Profile of 1U and 10A Bias of piRNAs

We have included the information about the relative preference of nucleotides at 1st and 10th position of the primary piRNA and secondary piRNA in the form of a new predictive feature termed as the binary profile of 1+10 position (BINARY1+10). This feature reflects the relative occupancy of a particular nucleotide (A/U/G/C) at a given position of the piRNA sequence. Thereafter, this position-specific occupancy was converted into SVM readable binary format resulting in a total of 16 features (4 for 1st and 4 for 10th position, respectively). Thus, it provided information about the relative preference of a nucleotide at each position of piRNA (Table ). All the above features were combined in a hybrid model i.e. k-MNC with k=1-5, leading to 1364 vectors, 32 SSTE, 16 thermodynamic energy-, 96 physicochemical property-related features and 16 features for BINARY1+10 for 1U and 10A bias of piRNA making up to a total of 1516 features space (1364+32+16+96+16). As computational algorithms are only trained on binary notation (0 or 1), these features were further needed to be transformed into binary symbols to implement the feature vectors in different MLTs. In the present study, binary symbolisation for standard 4 nucleotides was in a string of 4 characters: A, U, G and C, and are denoted by 1000, 0101, 0010, and 0001, respectively [38]. Further, the classifier was trained on 1 for “ON” and 0 for “OFF” signals for a particular feature. In this way, we could provide sequence, structure or any other biological features into binary format to train the classifier. The main goal of computing composition is to change the varied length of a sequence to a fixed length vector. This step was crucial while performing the classification of sequences by MLTs.

Algorithm development

We used SVMlight software package (http://svmlight.joachims.org/) to train the 1516 features of D1684p+1665n and to develop the predictive model in 10-CV mode. SVMlight is an implementation of a supervised MLT called support vector machines (SVM), which was proposed by Vapnik [44]. It works on the principle of pattern-recognition in the given dataset during training, and predicts the test-set based on these patterns. We used radial basis function (RBF) kernel having kernel width parameter (gamma, g) and regularization parameter (c) to build the SVM-based models in the range of 0.000001 to 5 and 0.00001 to 2000, respectively. Apart from SVM, we utilized other MLTs (Random Forest (RF)), Bagging, classification via regression, J48 pruned tree, naive Bayes and Instance Based Learner (Ibk)) using Weka package (Table ).

Feature Selection to achieve dimensionality reduction

An optimal classifier depends upon features used for training. Therefore, to screen out the maximum relevant sub-space out of 1516 features, we employed attribute selected classifier in the Waikato Environment for Knowledge Analysis (Weka) package [45], and retrieved reduced feature space and provide 198 maximum relevance features (Table ). Further, we checked the performance of these 198 features in different MLTs (Table ).

Statistical assessment of algorithm

The performance of predictive models on the test set was assessed with the help of the statistical equation Mathews correlation coefficient (MCC). It is generally applied to calculate correlation between the actual and predicted values along with other statistical measures namely, percentage sensitivity (Sn), specificity (Sp), accuracy (Ac), true positive (TP), true negative (TN), false positive (FP), and false negative (FN) [35, 46]. These equations are provided below:

RESULTS

Performance of Individual Features During 10-CV

Spectrum Profile or k-mer Features

The k-mer models i.e. mono-nucleotide (mono-NT), di-NT, tri-NT, tetra-NT and penta-NT achieved a MCC of 0.35, 0.37, 0.41, 0.51, and 0.56, respectively. We also assessed the performance of hybrid k-mer features e.g. a combination of mono- and di-nucleotide termed as md hybrid model. The mono-di (MD), mono-tri (MT), di-tri (DT), mono-di-tri (MDT), mono-di-tri-tetra (MDTT) and mono-di-tri-tetra-penta (MDTTP) achieved a minimum MCC of 0.38 to a maximum MCC of 0.44. Among all the k-mer models, MDTTP performed the best, and hence, it was selected to combine with structural and thermodynamic features to create a hybrid model (Table ).

SSTE Based Features

SSTE based models achieved excellent performance with a percentage accuracy of 95.61, sensitivity and specificity of 93.53 and 97.72, respectively.

Thermodynamic Energy- based Features

Thermodynamic energy-based predictive models scored percentage sensitivity, specificity, accuracy of 76.13, 81.38, and 78.74, respectively, with an MCC of 0.58.

RNA Physicochemical Property-based Features

The 96 RNA physicochemical property-based features exhibited percentage sensitivity, specificity, and accuracy of 77.26, 76.70, and 76.98 respectively, with an MCC of 0.54.

Binary Profile of 1U and 10A Bias of piRNA

The 16 features reflecting the position-specific relative occupancy of a particular nucleotide exhibited percentage sensitivity, specificity, and accuracy of 69.24, 44.85, and 57.12, respectively with an MCC of 0.15 (Table ).

Performance of Hybrid Features During 10-CV

For assessing the performance of hybrid models with MDTTP and other features during 10-CV, we employed SSTE (A), thermodynamic energy-based features (B), RNA physicochemical property-based features (C) and BINARY1+10 (D). Out of all predictive feature combinations, MDTTP+A+B+C i.e. a hybrid of MDTTP, SSTE, thermodynamic energy and RNA physicochemical properties achieved the best result as it exhibited 98.60% accuracy, 98.57% sensitivity, 98.62% specificity, and 0.97 MCC followed by MDTTP+ABC+BINARY1+10, which performed almost equally well: 98.24% accuracy, 99.05% sensitivity, 97.42% specificity and 0.96 MCC. We also plotted the performance of the three best predictive models in the graphic analysis using the receiver operating characteristic (ROC) plot (Fig. ). Additionally, to reflect the performance of a predictive model trained on homogeneous dataset from a single tissue, we developed two models trained on the piRNA and non-piRNA sequences in a single tissue-specific manner from mouse testes and ovary. The dataset was formulated into 1516 features, accessed using 10-CV resulting in models MDTTP+ABC+Binary1+10 (testes) and MDTTP+ABC+  Binary1+10 (ovary) for testes and ovary dataset, respectively. Further, we performed external validation by checking the performance of the testes model to predict the ovary dataset and vice-versa (Table ).

Performance of MDTTP+A+B+C Trained on Other MLTs

To check the performance of other MLTs on the 1508 feature space, we performed 10-CV. However, the SVM performed the best of all the tested MLTs, followed by random forest, bagging, and classification-via-regression (Table ). The best predictive model was termed as “piRNAPred”, and has been provided to users for the prediction of piRNAs (https://github.com/IshaMonga/piRNAPred).

Cross-validation

We adopted the 10-CV method for validating our classifier. During 10-CV, the complete dataset was randomly divided into 10 sets, of which the model was trained on 9 sets (the training set) and 1 set was kept aside for testing (the test set) (Fig. ). This process was recursively repeated ten-times, and the performance of ten steps was averaged to provide the final assessment of the predictive model [38, 42].

Comparison with Other State of the Art Predictors

Since the existing piRNA prediction algorithms were developed on dissimilar features employing diverse MLTs, therefore they exhibited different sensitivity and specificity levels: piRNA- 72.47% and 95.53% [27], PIANO- 95.89% and 94.61% [30], Pibomd- 91.48% and 89.76% [32], accurate piRNA prediction- 83.10% and 82.10% [33], GA-WE -90.6% and 78.3% [34] and 2L-piRNA- 88.3% and 83.9% [35]. On the other hand, piRNAPred achieved 98.57% sensitivity, 98.62% specificity, 98.60% accuracy and 0.97 MCC (Table ).

DISCUSSION

Cell growth, homeostasis and phenotypic expression of cellular functions have underlying complex regulations operating both at the gene expression and epigenetic levels [8, 10]. Gene expression regulation is majorly governed by small ncRNA-based surveillance and silencing in the biological systems [7, 47]. Among different players of RNAi, piRNAs are the most recently discovered and highly diverse class of ncRNAs [2]. They were first reported to govern the gonadal cell development by regulating the expression of transposons in the mouse germ cells [48-50], and loss of their expression leads to sterility [32-34]. Recent findings also suggest their roles in post-transcriptional gene regulation [11, 51], transgenerational epigenetic inheritance [12], regulation of mRNA decay [52], and localization in germplasm part of the embryo during embryogenesis [53]. Although many piRNAs have been recently discovered in some species, their identification has largely remained elusive in most of the organisms due to a lack of robust tools capable of identifying piRNA across phylogenetically diverse organisms with high accuracy. The identification of piRNAs is majorly carried out by the NGS-based methods in which, many sequences from other ncRNAs fall within the range of piRNA sequence-length. Hence, to differentiate the piRNAs from the false-positive sequences, a more accurate and robust algorithm was warranted [33]. Recently, various methods were published which employed different features and used piRNAs from different species. Integration of the k-mer spectrum profile in the algorithm was found to better distinguish the real piRNAs than the pseudo ones, and these k-mer nucleotide strings escalated to hybrid models by combining different k-mer nucleotide combinations [33]. Zhang et al., proposed a k-mer approach to identify piRNAs from five organisms i.e. rat, mouse, human, fruit fly and nematode [27]. Further, all the ncRNAs have a unique secondary structure, which can be computed by SSTE [31]. This approach was utilized to classify piRNAs in D. melanogaster [30]. However, this tool had a homogeneous training dataset, and this limited the discovery of novel piRNAs in the other non-related organisms [54]. ‘Pibomd’ extended the use of sequence-based motifs in the prediction of M. musculus piRNAs [32]. However, a comprehensive dataset was needed to capture all the possible sequence-based motifs and predict piRNAs from various evolutionary diverse organisms. Other algorithms like GA-WE [34] and 2L-piRNA [35] used pseudo dinucleotide composition (pseKNC), PSSM, mismatch profiles, etc., and achieved high accuracy of the classifier. However, the existing piRNA prediction algorithms did not combine various other features like secondary structure, thermodynamic energy and physicochemical properties of RNA, and did not utilize piRNA sequences from phylogenetically diverse organisms, which we considered crucial for the prediction of piRNAs with high accuracy. In the present study, we collected 1684 positive piRNA sequences from eight different species (Table ). The length range of 1684 piRNAs was 24-33 nucleotides, and they were statistically distributed in 21 followed by 24, 29 and 31 nucleotides. To compare piRNAs and non-piRNAs, we investigated the composition-specific properties in detail and calculated the composition up to 5th order of nucleotides that resulted in 1364 vector space. The accuracy of the classifier increased from k=1 to 5, suggesting that a combination of higher-order nucleotide usage (5-mer in our case) would better distinguish piRNAs from the non-piRNA sequences (Table ). Apart from training the classifier on the differential nucleotide usage by incorporating k-MNC, we also trained our classifier on the relative position-specific occupancy of nucleotides at 1st and 10th position in the primary and secondary piRNAs, respectively, which resulted into a feature space of 16. However, their accuracy (57.12%) suggested that these features alone were not enough to distinguish the piRNAs from the non-piRNA sequences. Further, the high accuracy of the triplet element features suggested that sequence and secondary structure together could be a potent and discriminative property to segregate piRNAs from the pseudo sequences of similar length. However, there was moderate accuracy in the classifier trained on the properties based on Gibbs free energy and physicochemical values of dinucleotides. Importantly, the hybrid model piRNAPred which incorporated the k-mer spectrum profile, SSTE, thermodynamic properties of continuous dinucleotides and physicochemical features, demonstrated the highest accuracy among all the models with an MCC of 0.97.

CONCLUSION AND FUTURE IMPLICATIONS

In conclusion, the piRNAPred demonstrated the highest accuracy in predicting piRNAs as compared to the existing piRNA prediction tools. We hope it would be helpful in expanding our current understanding of piRNA biology by predicting novel piRNAs in different organisms. In the future, we will update more piRNAs from the other species, and develop a user-friendly interface. The scripts, predictive models, datasets and other supplementary material are provided on https://github.com/IshaMonga/piRNAPred.

Table 1

Performance of different predictive models using SVM during 10-fold cross-validation.

S. No.	Predictive Model	Features	No. of Features	Thres	TP	FP	TN	FN	Sn (%)	Sp (%)	Acc (%)	Mcc	g	c	ROC
1	MONO	k-MNC based features	4	0	1426	867	798	258	84.7	47.93	66.41	0.35	1	5	0.75
2	DI		16	0.1	1153	531	1134	531	68.5	68.11	68.29	0.37	5.00E-05	50	0.73
3	TRI		64	0.1	1189	501	1164	495	70.6	69.91	70.26	0.41	5.00E-05	500	0.76
4	TETRA		256	0	1126	282	1383	558	66.9	83.06	74.92	0.51	0.01	1	0.81
5	PENTA		1024	0	1363	412	1253	321	80.9	75.26	78.11	0.56	0.0005	50	0.85
6	MD	Hybrid k-MNC based features	20	0	1212	574	1091	472	72	65.53	68.77	0.38	1.00E-05	300	0.74
7	MT		68	0.1	1165	484	1181	519	69.2	70.93	70.05	0.40	5.00E-05	100	0.75
8	DT		80	0	1228	557	1108	456	72.9	66.55	69.75	0.40	0.0001	50	0.76
9	MDT		84	0.1	1172	499	1166	512	69.6	70.03	69.81	0.40	1.00E-05	1000	0.75
10	MDTT		340	0	1233	554	1111	451	73.2	66.73	69.99	0.40	0.0001	10	0.76
11	MDTTP		1364	-0.1	1308	572	1093	376	77.7	65.65	71.69	0.44	5.00E-05	300	0.79
12	A	SSTE based features	32	0	1575	38	1627	109	93.5	97.72	95.61	0.91	0.0001	10	0.70
13	B	Thermodynamic energies of RNA dinucleotides	16	0.1	1282	310	1355	402	76.1	81.38	78.74	0.58	0.05	250	0.86
14	C	Physicochemical properties of RNA dinucleotides	96	0	1301	388	1277	383	77.3	76.7	76.98	0.54	1.00E-05	5	0.85
15	BINARY¹⁺¹⁰	position specific NT usage of 1^st and 10^th position (Binary)	16	0	817	642	522	363	69.24	44.85	57.12	0.15	1.00E-005	100	0.35
16	AB	Hybrids of SSTE, Thermodynamic energies of RNA dinucleotides and Physicochemical properties of RNA dinucleotides (ABC) with MDTTP	48	-0.3	1653	27	1638	31	98.2	98.38	98.27	0.97	0.0005	500	1.00
17	AC		128	0	1468	230	1435	216	87.2	86.19	86.68	0.73	1.00E-05	1000	0.93
18	BC		112	0	1301	388	1277	383	77.3	76.7	76.98	0.54	1.00E-05	5	0.85
19	ABC		144	-0.1	1502	275	1390	182	89.2	83.48	86.35	0.73	1.00E-05	500	0.93
20	MDTTP + A		1396	0	1592	99	1566	92	94.5	94.05	94.3	0.89	5.00E-05	100	0.98
21	MDTTP + B		1380	0	1436	333	1332	248	85.3	80	82.65	0.65	5.00E-05	200	0.90
22	MDTTP + C		1460	0	1369	127	1538	315	81.3	92.37	86.8	0.74	0.01	5	0.89
S. No.	Predictive Model	Features	No. of Features	Thres	TP	FP	TN	FN	Sn (%)	Sp (%)	Acc (%)	Mcc	g	c	ROC
23	MDTTP +AB	Hybrid of k-MNC based features+ SSTE, thermodynamic, RNA physicochemical properties and position specific NT usage of 1^st and 10^th position (Binary)	1412	0	1608	71	1594	76	95.5	95.74	95.61	0.91	5.00E-05	100	0.99
24	MDTTP +BC		1476	0.1	1065	9	1656	619	63.2	99.46	81.25	0.67	0.01	5	0.89
25	MDTTP +AC		1492	0	1488	227	1438	196	88.4	86.37	87.37	0.75	1.00E-05	50	1.00
26	MDTTP +ABC		1508	0.1	1660	23	1642	24	98.57	98.62	98.60	0.97	1.00E-05	50	0.99
27	MDTTP+ABC+BINARY¹⁺¹⁰		1516	0	1668	43	1622	16	99.05	97.42	98.24	0.96	1.00E-005	50	0.99

Abbreviations: Acc, Accuracy; diNT, dinucleotide; c, Regularization parameter; FP, False Positive; FN, False Negative; g, Gamma (a kernel density parameter); k-MF, k-Mer Features OR k-MNC, k-Mer Nucleotide Composition; MMP, Mismatch Profile; MCC, Mathews Correlation Coefficient, pseDNC, pseudo Dinucleotide Composition; PCPseDNC, Parallel Correlation Pseudo Dinucleotide Composition; PSSM, Position-Specific Scoring Matrix; SSTE, Structure-Sequence Triplet Elements; SSP, Subsequence Profile; Sn, Sensitivity; Sp, Specificity; SSTE, Structure-Sequence Triplet Elements (A); B, Thermo-dynamic energies of contiguous dinucleotides; C, RNA physicochemical properties of adjoining dinucleotides (diNTs); AB, hybrid of SSTE and Thermo-dynamic energies of contiguous diNTs; Thres, threshold; TN, True Negative; TP, True Positive, AC, hybrid of SSTE and RNA physicochemical property of adjoining diNTs; ABC, hybrid of SSTE, thermo-dynamic energy and RNA physicochemical property of contiguous dinucleotides; ROC, Receiver Operating Characteristic.

Table 2

Performance of 1508 features on different machine learning techniques during 10-fold cross-validation.

S. No.	Machine Learning Technique	TP	FN	TN	FP	Sn (%)	Sp (%)	Acc (%)	MCC
1	SVM (piRNAPred)	1660	24	1642	23	98.57	98.62	98.60	0.97
2	Random Forest	1616	68	1647	18	95.96	98.92	97.43	0.95
3	Bagging	1611	73	1628	37	95.67	97.78	96.72	0.93
4	Classification via Regression	1586	98	1576	89	94.18	94.65	94.42	0.89
5	J48 Pruned tree	1553	131	1515	150	92.22	90.99	91.61	0.83
6	Naive Bayes	920	764	1362	303	54.63	81.8	68.14	0.38
7	IbK	1302	382	610	1055	77.32	36.64	57.09	0.15

Abbreviations: Acc, Accuracy; FP, False Positive; FN, False Negative; MCC, Mathews Correlation Coefficient, Sn, Sensitivity; Sp, Specificity; SVM, Support Vector Machines; TN, True Negative; TP, True Positive.

Table 3

Comparison of piRNAPred with current state-of-the-art piRNA prediction methods.

S. No.	Tool	Description of Features	Species from which piRNAs Taken	Sn (%)	Sp (%)	ACC (%)	MCC	References
1	piRNA	k-MF (k=5, 1364)	Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans	72.47	95.5	NA	NA	(Zhang et al., 2011)
2	PIANO	SSTE	D. melanogaster	95.89	94.6	95.27	NA	(Wang et al., 2014)
2	PIANO	SSTE	D. melanogaster	95.89	94.6	95.27	NA	(Wang et al., 2014)	3	Pibomd	SSP	M. musculus	91.48	89.8	90.62	NA	(Liu et al., 2014)
4	Accurate piRNA prediction	k-MF, MMP, SSP, PSSM, pseDNC, SSTE	H. sapiens, M. musculus, D. melanogaster	83.10	82.10	82.6	0.651	(Luo et al., 2016)
5	GA-WE	k-MF, PCPseDNC, PSSM	H. sapiens, M. musculus, D. melanogaster	90.6	78.3	84.4	0.694	(Li et al., 2016)
6	2L-piRNA	pseDNC, C	M. musculus	88.3	83.9	86.1	0.723	(Liu et al., 2017)
7	piRNApred	k-MNC, SSTE, B and C	H. sapiens, M. musculus, D. melanogaster, C. elegans Danio rerio, Gallus gallus domesticus, Xenopus tropicalus, Bombyx mori	98.57	98.6	98.6	0.97	Algorithm proposed in the current study

Abbreviations: Acc, Accuracy; k-MF, k-Mer Features OR k-MNC, k-Mer Nucleotide Composition OR SP, Spectrum Profile; MMP, Mismatch Profile; MCC, Mathews Correlation Coefficient, pseDNC, pseudo Dinucleotide Composition; PCPseDNC, Parallel Correlation Pseudo Dinucleotide Composition; PSSM, Position-Specific Scoring Matrix; SSTE, Structure-Sequence Triplet Elements; SSP, Subsequence Profile; Sn, Sensitivity; Sp, Specificity; SSTE, Structure-Sequence Triplet Elements (A); B, Thermo-dynamic energies of contiguous dinucleotides; C, RNA physicochemical properties of adjoining dinucleotides (diNTs).

53 in total

Review 1. The Argonaute family: tentacles that reach into RNAi, developmental control, stem cell maintenance, and tumorigenesis.

Authors: Michelle A Carmell; Zhenyu Xuan; Michael Q Zhang; Gregory J Hannon
Journal: Genes Dev Date: 2002-11-01 Impact factor: 11.361

2. Functional siRNAs and miRNAs exhibit strand bias.

Authors: Anastasia Khvorova; Angela Reynolds; Sumedha D Jayasena
Journal: Cell Date: 2003-10-17 Impact factor: 41.582

3. A k-mer scheme to predict piRNAs and characterize locust piRNAs.

Authors: Yi Zhang; Xianhui Wang; Le Kang
Journal: Bioinformatics Date: 2011-01-11 Impact factor: 6.937

4. Data mining in bioinformatics using Weka.

Authors: Eibe Frank; Mark Hall; Len Trigg; Geoffrey Holmes; Ian H Witten
Journal: Bioinformatics Date: 2004-04-08 Impact factor: 6.937

5. Human Argonaute2 mediates RNA cleavage targeted by miRNAs and siRNAs.

Authors: Gunter Meister; Markus Landthaler; Agnieszka Patkaniowska; Yair Dorsett; Grace Teng; Thomas Tuschl
Journal: Mol Cell Date: 2004-07-23 Impact factor: 17.970

6. Discrete small RNA-generating loci as master regulators of transposon activity in Drosophila.

Authors: Julius Brennecke; Alexei A Aravin; Alexander Stark; Monica Dus; Manolis Kellis; Ravi Sachidanandam; Gregory J Hannon
Journal: Cell Date: 2007-03-08 Impact factor: 41.582

7. MEME SUITE: tools for motif discovery and searching.

Authors: Timothy L Bailey; Mikael Boden; Fabian A Buske; Martin Frith; Charles E Grant; Luca Clementi; Jingyuan Ren; Wilfred W Li; William S Noble
Journal: Nucleic Acids Res Date: 2009-05-20 Impact factor: 16.971

8. Peculiarities of piRNA-mediated post-transcriptional silencing of Stellate repeats in testes of Drosophila melanogaster.

Authors: Roman N Kotelnikov; Mikhail S Klenov; Yakov M Rozovsky; Ludmila V Olenina; Mikhail V Kibanov; Vladimir A Gvozdev
Journal: Nucleic Acids Res Date: 2009-03-24 Impact factor: 16.971

INTRODUCTION

MATERIALS AND METHODS

Dataset Preparation

Features utilized

Spectrum Profile or k-mer Features

Sequence-Structure Triplet Elements (SSTE)

Thermodynamic Energy of Contiguous Dinucleotides

RNA Physicochemical Properties

Binary Profile of 1U and 10A Bias of piRNAs

Algorithm development

Feature Selection to achieve dimensionality reduction

Statistical assessment of algorithm

RESULTS

Performance of Individual Features During 10-CV

Spectrum Profile or k-mer Features

SSTE Based Features

Thermodynamic Energy- based Features

RNA Physicochemical Property-based Features

Binary Profile of 1U and 10A Bias of piRNA

Performance of Hybrid Features During 10-CV

Performance of MDTTP+A+B+C Trained on Other MLTs

Cross-validation

Comparison with Other State of the Art Predictors

DISCUSSION

CONCLUSION AND FUTURE IMPLICATIONS

Review 1. The Argonaute family: tentacles that reach into RNAi, developmental control, stem cell maintenance, and tumorigenesis.

Review 3. Evaluation of exosomal non-coding RNAs in cancer using high-throughput sequencing.

Review 4. A Review of Discovery Profiling of PIWI-Interacting RNAs and Their Diverse Functions in Metazoans.