
Identification of piRNA disease associations using deep learning.

Syed Danish Ali^1,2, Hilal Tayara^3, Kil To Chong^1,4.

Abstract

Piwi-interacting RNAs (piRNAs) play a pivotal role in maintaining genome integrity through repression of transposable elements and maintenance of gene stability, and they are associated with the progression of various diseases. Cost-efficient computational methods for identifying piRNA disease associations can improve the efficacy of disease-specific drug development. In this regard, we developed piRDA, a simple, robust, and efficient deep learning method for identifying piRNA disease associations. The proposed architecture extracts the most significant and abstract information from raw sequences represented as simplified piRNA disease pairs, without any feature engineering. Two-step positive unlabeled learning and a bootstrapping technique are used to avoid false-negative and biased predictions when dealing with positive unlabeled data. The performance of piRDA is evaluated using k-fold cross-validation. piRDA significantly improves on the state-of-the-art method in all performance evaluation measures for the identification of piRNA disease associations. The proposed computational method could therefore serve as a supportive and practical tool for studying fundamental disease mechanisms and for pharmaceutical research, in both academia and drug design. The proposed model is available through a public, user-friendly web tool at http://nsclbio.jbnu.ac.kr/tools/piRDA/.


Keywords: Convolutional Neural Network; Deep learning; Positive unlabeled learning; Reliable negative sample; Sequence analysis; Web-server; piRNA disease associations

Year: 2022 PMID: 35317234 PMCID: PMC8908038 DOI: 10.1016/j.csbj.2022.02.026

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271

Introduction

piRNAs are the largest of the three classes of regulatory small non-coding RNAs (sncRNAs), alongside microRNAs (miRNAs) and small interfering RNAs (siRNAs). They are found in several species, including vertebrates and invertebrates, and specifically in syntenic genomic locations in humans [1], [2], [3]. Despite differences in biogenesis and mode of target regulation, the three classes share certain functions, including guiding Argonaute proteins to target nucleic acids in a sequence-dependent manner [4]. Humans have eight Argonaute proteins: four of the Argonaute subfamily (Ago) and four of the PIWI subfamily (PIWI) [5]. Ago proteins bind siRNAs and miRNAs, which are processed by a Dicer-dependent mechanism from double-stranded precursors into mature small RNAs of 20–22 nucleotides (nt) [6], whereas PIWI proteins form a particular RNA-induced silencing complex (RISC), known as piRISC, with a small RNA population termed piRNAs [7]. The biogenesis of the long single-stranded primary piRNAs is independent of Dicer; instead, different nucleases cut these strands into individual piRNA units [6], [8]. The length of a piRNA sequence varies from 26 to 32 nt [9]. piRNAs are responsible for the self-renewal of stem cells, are abundant in spermatogenic cells, and play a significant role in maintaining germline and genome integrity by suppressing insertional mutations from transposons [10], [11], [12], [13]. The involvement of piRNAs in epigenetic silencing of transposons, regulation of gene transcription, histone modification, heterochromatin modification, and DNA methylation has motivated researchers to further explore their associations with specific human diseases [14], [15], [16]. 
Moreover, the aberrant expression of piRNAs is associated with the development of various human diseases, such as cardiovascular diseases, neurodegenerative disorders including Alzheimer’s disease and Parkinson’s disease, malignant tumors, and hallmarks of cancer such as augmented stemness, cell proliferation, inhibited apoptosis, and metastasis [17], [18], [19], [20], [21], [22]. As an example among neurodegenerative disorders, piRNAs are differentially expressed in brains affected by Alzheimer’s disease compared with the healthy human brain. Among the 146 upregulated and 3 downregulated piRNAs in the diseased brain, five are upregulated more than ten-fold (piR-hsa-25781, piR-hsa-28467, piR-hsa-1177, piR-hsa-26593, and piR-hsa-29114), and these may act as an effective signature for Alzheimer’s disease [23]. With regard to cancers, the expression of piR-651 was upregulated in several gastric, lung, breast, mesothelium, liver, and cervical cancer cell lines [24]. Furthermore, piR-823 was markedly upregulated in colorectal tumorigenesis, where it binds HSF1 and boosts its transcriptional activity and its phosphorylation at Ser326, thereby acting as a tumor promoter [25]. piRNAs are therefore reliable biomarkers for the diagnosis and treatment of diseases, which could be facilitated by identifying the piRNAs associated with each disease. Several piRNA databases [9], [26], [27], [28], together with efficient and cost-effective web-server-based computational predictors for identifying piRNAs and their functions [29], [30], [31], are available, whereas research on human disease-associated piRNAs is in its early stages. Recently, the development of piRDisease v1.0 [32], a collection of experimentally verified piRNA-disease associations, has allowed researchers to develop robust and cost-efficient computational methods for identifying piRNA disease associations [33], [34], [35], [36], [37]. In particular, Wei et al. proposed two computational models for the identification of human disease-associated piRNAs, iPiDi-PUL [33] and iPiDA-sHN [34]. 
iPiDi-PUL, a random forest-based ensemble learning approach, used positive unlabeled learning [38] to predict piRNA disease associations, with association features extracted from three different biological data sources. The negative training data were randomly selected from the unlabeled data, which consist of samples that have not been experimentally verified. The unlabeled samples may therefore contain positive associations, and using such samples as negative data could result in low recall or an inappropriate decision boundary of the classifier, as illustrated in Fig. 2. Although iPiDi-PUL used positive unlabeled learning to mitigate the low-recall (false-negative) problem, the random selection of negative samples from the unlabeled piRNA disease associations compromises the performance of the predictor because of outliers among the negative samples. To mitigate this false-negative problem, Wei et al. [34] recently proposed iPiDA-sHN, which uses a two-step positive-unlabeled learning technique [39] to select reliable negative samples from the unlabeled piRNA disease associations. Three heterogeneous biological sources were combined to describe the piRNA disease association features, a convolutional neural network (CNN) was used to extract features from these multi-source handcrafted features, and a Support Vector Machine (SVM) classifier was then used to predict piRNA disease associations. The biological sources used by both methods are experimentally verified piRNA-disease associations, disease semantic terms, and piRNA sequence information. A shortcoming of fusing multiple biological data sources into a feature descriptor is that it introduces irrelevant and noisy information. 
The performance of a computational method can be compromised when features are described inadequately, without handling redundant, irrelevant, and noisy information.

Fig. 2

Illustration of reliable negative selection. (a) Positive and unlabeled data samples. (b) Training with random negative. (c) Unlabeled samples according to their prediction scores.

Consequently, the issues arising from manual feature extraction, which depends heavily on domain knowledge, need to be addressed. Deep learning algorithms are efficient and effective at extracting the most significant and abstract features from raw data through general-purpose learning [40]. Moreover, deep learning can identify and recognize patterns in unstructured data with little manual configuration [41]. Deep learning has thus led to breakthroughs in natural language processing [42], speech recognition [43], image recognition [44], precision agriculture [45], [46], [47], potential drug molecules [48], post-translation modifications [49], [50], RNA binding proteins [51], [52], post-transcriptional modifications [53], [54], [55], identification of promoters [56], [57], [58], DNA modifications [59], [60], [61], [62], and prediction of disease association [63], [64], [65]. In the present study, we propose piRDA, a deep learning architecture consisting of a CNN and fully connected layers. The CNN is the most commonly used deep learning method owing to its efficacy and efficiency in various applications. A CNN-based deep learning architecture is a hierarchical model capable of learning patterns through a series of convolutional operations [40], and the fully connected layers are used to extract high-level features. To construct reliable negative data from the unlabeled samples, a two-step positive-unlabeled learning technique [39] was employed to reduce the false-negative rate when predicting piRNA disease associations. The raw piRNA sequences are one-hot encoded as feature vectors and fed to the CNN, which recognizes the information concealed in the raw sequences. The disease associated with each piRNA is represented by a one-dimensional feature vector known as the disease association one-hot vector (DAOHV). 
This vector is then concatenated with the piRNA features extracted by the CNN and fed into the fully connected layers. These layers extract piRNA disease association patterns at multiple levels of abstraction, which leads to high-performance identification of piRNA disease associations without losing contextual information among piRNAs and diseases. To reduce the bias of piRDA toward the majority class in its predictions, we used the bootstrapping method [66]. Furthermore, a grid search algorithm was used to select the optimum hyperparameters. We used the subsampling (k-fold cross-validation) test to evaluate the performance comprehensively, and piRDA significantly outperformed the state of the art. Additionally, for the convenience of drug developers and experimental scientists, and considering the importance of web servers in medical sciences research, we developed a publicly available web server for identifying disease-associated piRNAs, accessible at http://nsclbio.jbnu.ac.kr/tools/piRDA/. The overall architecture of piRDA is illustrated in Fig. 1. The major contributions of piRDA are as follows.

Fig. 1

The overall workflow of proposed Architecture piRDA for identifying piRNA disease associations.

(1) A novel and simple supervised learning-based representation of sequences and their disease associations.
(2) Development of a deep learning model for identifying raw piRNA sequences and their associated diseases.
(3) Significantly high performance in the identification of piRNA disease associations.
(4) Visualization of the feature space learned by piRDA in the prediction of piRNA disease associations.
(5) Development of a publicly accessible web server.

Materials and methods

Dataset construction

piRDisease v1.0 [32] is a manually curated database comprising 7939 experimentally verified piRNA disease associations. Redundant and non-human piRNAs were filtered out by extracting the human piRNAs with matching piRNA IDs in piRBase [28]. This left 4350 piRNAs associated with 21 diseases, giving 5002 experimentally validated disease associations. The benchmark data are the same as those introduced and used by Wei et al. [33], [34]. Mathematically, the benchmark dataset is described as:

$$S = S^{+} \cup S^{u} \qquad (1)$$

In Eq. (1), $S$ is the set of all pairs between the 4350 piRNAs and the 21 diseases, with 91,350 samples in total. $S^{+}$ represents the positive samples, comprising the 5002 experimentally validated associations among the 4350 piRNAs and 21 diseases, whereas $S^{u}$ represents the 86,348 unlabeled piRNA disease pairs among the 4350 piRNAs and 21 diseases. The diseases are listed in Table 1.
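The sample counts quoted above follow from simple arithmetic, which the short check below reproduces. The variable names are illustrative only; the three input counts are taken from the text.

```python
# Counts stated in the dataset description.
N_PIRNAS = 4350
N_DISEASES = 21
N_POSITIVE = 5002

# Every piRNA paired with every disease gives the full sample space.
total_pairs = N_PIRNAS * N_DISEASES
# Pairs without experimental validation are treated as unlabeled.
n_unlabeled = total_pairs - N_POSITIVE
# Share of all pairs that are experimentally validated.
positive_fraction = N_POSITIVE / total_pairs

print(total_pairs, n_unlabeled, round(positive_fraction, 4))
```

Only about 5.5% of all pairs are labeled positive, which is why the handling of the unlabeled pairs matters so much for this task.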

Table 1

Diseases in the benchmark dataset and their disease association one-hot vectors (DAOHV).

No.	Disease	DAOHV
1	Renal cell carcinoma	[1]
2	Lung cancer	[0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
3	Breast cancer	[0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
4	Pancreatic carcinoma	[0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
5	Head and neck (squamous cell) carcinoma	[0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
6	Lung cancer (lung adenocarcinoma)	[0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
7	Alzheimer’s disease	[0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
8	Cardiovascular diseases (CDC, CF, CCS) cardiac regeneration	[0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0]
9	Head and neck cancer	[0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0]
10	Gastric cancer	[0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0]
11	Colon cancer	[0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0]
12	Non-small cell lung carcinoma (NSCLC)	[0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0]
13	Prostate cancer	[0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0]
14	Dysplastic liver nodules and hepatocellular carcinoma	[0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0]
15	Rheumatoid arthritis	[0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0]
16	Testicular germ cell carcinoma	[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0]
17	Endometrial carcinogenesis	[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0]
18	Male infertility	[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0]
19	Leukemia	[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0]
20	Heart stroke	[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0]
21	Ovarian cancer	[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1]


Proposed methodology

We propose piRDA, a computational method for identifying disease-associated piRNAs that is efficient in computational cost and effective in prediction. The overall flow of the proposed study is illustrated in Fig. 1. As the flowchart shows, the method comprises three main steps. The first step is a simple and effective one-hot feature representation of each association between a piRNA and a disease. Second, to reduce the false-negative rate of the classifier, high-quality reliable negative samples were selected from the unlabeled dataset. For consistency, fair comparison, and generalization, the reliable negative data were the same as those used in the previous study by Wei et al. [34]. Finally, the features were processed using the CNN-based deep learning architecture (piRDA) to identify the piRNAs associated with the diseases. 10-fold cross-validation is used to evaluate the performance of the proposed architecture.

Feature representation

One-hot encoding was used to represent each piRNA and disease association as input to the proposed model. One-hot encoding is the most prevalent technique because of its simplicity and effectiveness [67]. The piRNA sequences acquired from piRBase v2.0 [28] vary in length. Therefore, the shorter sequences were padded with the dummy symbol “N” so that all sequences have equal length for further processing in the CNN. Hence, the first input of piRDA was the disease-associated raw piRNA sequence $P$, in which each position $j$ holds a nucleotide or the padding symbol. Each nucleotide in a raw piRNA sequence was encoded as one of four four-dimensional unit vectors, [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], and [0, 0, 0, 1], with C corresponding to the last vector. The second input of piRDA carries the information on the disease associated with the piRNA. Each disease was represented by a one-dimensional feature vector, obtained by assigning a distinct unit vector to each disease and known as the disease association one-hot vector (DAOHV). This simple one-dimensional representation lets the model directly extract discriminative information about the disease. The DAOHV is a 21-element vector representing the 21 diseases, in which only the element for the specific disease has the value “1” and the other 20 elements are “0”. The DAOHVs of the diseases are presented in Table 1.
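The two input encodings can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' code, and it rests on several assumptions: the symbol order A, T, G, C (the text only makes clear that C maps to the last unit vector), right-side padding, and an all-zero vector for the padding symbol N. The example sequence is hypothetical.

```python
# Assumed symbol order; only "C is last" is recoverable from the text.
NUCLEOTIDES = "ATGC"
N_DISEASES = 21

def pad_sequence(seq, length):
    """Right-pad a sequence with the dummy symbol N up to `length` (padding side assumed)."""
    if len(seq) > length:
        raise ValueError("sequence longer than target length")
    return seq + "N" * (length - len(seq))

def one_hot_sequence(seq):
    """Encode each symbol as a 4-dimensional vector; N becomes all zeros (assumption)."""
    rows = []
    for symbol in seq:
        row = [0, 0, 0, 0]
        if symbol in NUCLEOTIDES:
            row[NUCLEOTIDES.index(symbol)] = 1
        elif symbol != "N":
            raise ValueError(f"unexpected symbol: {symbol!r}")
        rows.append(row)
    return rows

def daohv(disease_number):
    """Disease association one-hot vector for a 1-based disease number as in Table 1."""
    if not 1 <= disease_number <= N_DISEASES:
        raise ValueError("disease number out of range")
    vec = [0] * N_DISEASES
    vec[disease_number - 1] = 1
    return vec

# Hypothetical 4-nt sequence padded to length 6, paired with disease no. 2 (lung cancer).
encoded = one_hot_sequence(pad_sequence("TGAC", 6))
lung_cancer = daohv(2)
```

Each piRNA disease pair is then the tuple `(encoded, daohv(k))`, so the same sequence paired with a different disease differs only in the second input.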

Positive unlabeled learning

A reliable negative dataset was prepared following the same method as in previous studies [34], [51], [68], [69]. A two-step technique was employed for dealing with positive unlabeled data [39]: the first step identifies reliable negative samples, and the second step builds a predictor from the positive labeled samples and the reliable negative samples [70]. For the first step, an SVM classifier was used to select reliable negative samples. The SVM was trained on the positive samples together with a random selection of unlabeled samples, treated as negatives, of the same size as the positive set in Eq. (1). The parameters used for training the SVM are C = 1.0, gamma = 1, and kernel = ’rbf’. The trained SVM was then used to obtain prediction scores for all the unlabeled piRNA disease association samples in Eq. (1). The unlabeled samples were sorted by prediction score in descending order and divided into three clusters of nearly equal size. The second cluster was taken as the reliable negative set, as having the minimum chance of containing false negatives. Selecting difficult examples also makes training more effective and yields a substantial performance boost [71]. The selection process of reliable negative samples is illustrated in Fig. 2.
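The ranking and clustering part of the first step can be sketched as follows, assuming the prediction scores for the unlabeled samples are already available (for instance from an RBF-kernel SVM trained as described). The function names and the toy scores are hypothetical, and the exact way the source splits the ranking into "nearly the same size" clusters is not stated, so integer boundaries are used here.

```python
def split_into_clusters(ranked, n_clusters=3):
    """Split a ranked list into consecutive clusters whose sizes differ by at most one."""
    n = len(ranked)
    bounds = [k * n // n_clusters for k in range(n_clusters + 1)]
    return [ranked[bounds[k]:bounds[k + 1]] for k in range(n_clusters)]

def select_reliable_negatives(scores):
    """Return indices of unlabeled samples falling in the second (middle) cluster."""
    # Rank sample indices by prediction score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    clusters = split_into_clusters(ranked)
    return clusters[1]

# Hypothetical scores for nine unlabeled piRNA disease pairs.
scores = [0.91, 0.15, 0.48, 0.77, 0.33, 0.05, 0.62, 0.54, 0.20]
reliable = select_reliable_negatives(scores)
```

The top cluster is skipped because high-scoring unlabeled pairs are the most likely hidden positives; the middle cluster keeps samples that are neither likely positives nor trivially easy negatives.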

Bootstrapping technique

During training of piRDA, the number of disease-associated piRNA (positive) samples is smaller than the number of reliable negative samples, so the predictor would be biased toward the majority class. To avoid biased predictions, the bootstrapping technique [66] was employed, as used in the literature [72], [73] to tackle class imbalance. In this technique, we divided the reliable negative samples into chunks, each approximately equal in size to the set of positive disease-associated piRNA samples, resulting in five datasets, each comprising the disease-associated piRNAs and one reliable negative chunk. Moreover, k-fold cross-validation was employed with k = 10, as in the state of the art, for fair comparison and consistency across datasets. The results obtained using k-fold cross-validation are rigorous and unbiased as they are evaluated on k different test sets [74], [75]. The 10-fold cross-validation divides each dataset into 10 subsets, of which eight folds were used for training, one fold for validation, and the remaining fold for testing. This procedure was repeated 10 times so that each fold served once as a distinct test set. The average over these test folds was taken as the final outcome for each dataset, and the final outcome of piRDA was obtained as the average over all five datasets.
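A minimal sketch of this data partitioning is given below. It is not the authors' implementation: shuffling before chunking, the interleaved fold assignment, and the choice of the fold following the test fold as the validation fold are all assumptions, and the sample IDs are hypothetical.

```python
import random

def make_balanced_sets(positives, negatives, n_sets=5, seed=0):
    """Pair the full positive set with each of `n_sets` disjoint negative chunks."""
    rng = random.Random(seed)
    shuffled = negatives[:]
    rng.shuffle(shuffled)  # shuffling is an assumption
    bounds = [k * len(shuffled) // n_sets for k in range(n_sets + 1)]
    return [(positives, shuffled[bounds[k]:bounds[k + 1]]) for k in range(n_sets)]

def ten_fold_splits(n_samples, k=10):
    """Yield (train, validation, test) index lists using 8, 1, and 1 folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for t in range(k):
        v = (t + 1) % k  # validation fold choice is an assumption
        train = [i for f in range(k) if f not in (t, v) for i in folds[f]]
        yield train, folds[v], folds[t]

# Hypothetical IDs: 100 positives and 500 reliable negatives.
positives = list(range(100))
negatives = list(range(1000, 1500))
sets = make_balanced_sets(positives, negatives)
splits = list(ten_fold_splits(len(positives) + len(sets[0][1])))
```

With five datasets and ten test folds each, the procedure yields 50 test subsets in total, which matches the number of test subsets over which the reported results are averaged.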

Proposed architecture

The proposed architecture is a two-input deep learning model, as illustrated in Fig. 3, that can extract abstract features from the raw data. Its main components are a convolutional block and a dense block, which reduce noise and acquire high-level features from raw piRNA sequences and their association with the disease. The first input was the raw piRNA sequence, transformed into a one-dimensional, four-channel array as input to the convolution block. This block comprises a one-dimensional convolution layer (Conv1D), in which multiple filters extract features from the input while preserving the corresponding spatial information. Each filter in the Conv1D thus identifies and extracts the most salient patterns and motifs in raw piRNA sequences [76]. The Conv1D used in piRDA comprises 24 filters, each of size 7. The Rectified Linear Unit (ReLU) [77] was used as the activation function for the Conv1D; it captures nonlinearities and interactions in the feature matrix [78]. Following the ReLU activation, a normalization layer was applied, which acts as a regularizer and stabilizes training by substantially reducing covariate shift [79]. Group normalization (GN), an effective alternative to batch normalization for small batch sizes [80], was employed to normalize the feature matrix. It performs normalization within groups without using the batch dimension; a group size of 4 was selected. A one-dimensional max-pooling layer is applied to the features from the GN layer, which improves generalization by removing redundancy and reducing dimensionality. A pool size of 2 with a stride of 2 was used for the max-pooling layer. 
A flatten layer after the max-pooling layer collapses the spatial dimensions of the extracted feature matrix into a one-dimensional vector. The flattened features are concatenated, using a concatenation layer, with the disease association one-hot vector, which is the second input of piRDA. Thereafter, the final high-level features of the disease-associated piRNA pair were extracted using two fully connected layers with 128 and 32 neurons, respectively. ReLU is used as the activation function for both fully connected layers. An L2 regularizer on the bias and weights is applied to the fully connected layers; it is an effective technique for mitigating overfitting by penalizing large model weights [81]. The value assigned to the L2 regularization penalty is . A dropout layer, an effective regularizer that avoids overfitting by randomly switching off neurons [82], was used between the two fully connected layers, with a dropout probability of 0.25. Finally, these high-level features are fed into the output layer, where the sigmoid activation assigns the prediction scores to the disease-associated piRNA pairs. The mathematical representation of the proposed architecture is formulated as:

$$\mathrm{Conv}(P)_{lk} = \mathrm{ReLU}\left(\sum_{s=0}^{S-1}\sum_{n=0}^{N-1} W^{l}_{sn}\, P_{k+s,\,n}\right) \qquad (2)$$

Eq. (2) represents the one-dimensional convolution layer, where $P$ represents the raw piRNA sample as an input, $l$ is the index of the filter, and $k$ is the index of the output position. $W^{l}$ represents each of the convolution filters, with a weight matrix of $S \times N$ dimensions. $S$ denotes the size of the filter, whereas $N$ represents the number of input channels.

$$\mathrm{ReLU}(x) = \max(0, x) \qquad (3)$$

Eq. (3) is the representation of the ReLU activation function with $x$ as an input.

$$f = w_{d+1} + \sum_{k=1}^{d} m_{k}\, w_{k}\, z_{k} \qquad (4)$$

Eq. (4) is the representation of a fully connected layer, where the additive bias term is denoted by $w_{d+1}$, $m_{k}$ is the dropout operator drawn from a Bernoulli distribution with probability $p$, $z$ represents the $1 \times d$ dimensional feature vector, and $w_{k}$ represents the weight of $z_{k}$ from the previous layer.

$$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}} \qquad (5)$$

Eq. (5) denotes the final prediction layer, with sigmoid as the activation function and $x$ as an input.
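To make the layer sizes concrete, the sketch below traces the tensor shapes through the described blocks and implements the two activation functions. It is only a shape walk-through under stated assumptions: the convolution is taken to be unpadded ("valid"), which the text does not specify, and the padded sequence length of 32 is a hypothetical example value. The filter count, kernel size, pooling parameters, DAOHV length, and dense layer widths come from the text.

```python
import math

def relu(x):
    """Rectified linear unit, Eq. (3)."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic sigmoid used in the output layer, Eq. (5)."""
    return 1.0 / (1.0 + math.exp(-x))

def pirda_shapes(seq_len, n_filters=24, kernel=7, pool=2, stride=2,
                 n_diseases=21, dense=(128, 32)):
    """Trace output shapes layer by layer for one padded input sequence."""
    conv_len = seq_len - kernel + 1             # unpadded convolution (assumption)
    pool_len = (conv_len - pool) // stride + 1  # max-pooling, size 2, stride 2
    flat = pool_len * n_filters                 # flatten layer
    concat = flat + n_diseases                  # append the 21-element DAOHV
    return {
        "conv": (conv_len, n_filters),
        "pool": (pool_len, n_filters),
        "flat": flat,
        "concat": concat,
        "dense": dense,
        "output": 1,
    }

# Hypothetical padded sequence length of 32 positions.
shapes = pirda_shapes(32)
```

Group normalization does not change the shape, so it is omitted from the trace. Under these assumptions the dense block receives a 333-element vector, of which 21 elements carry the disease identity.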

Fig. 3

Detailed architecture of the proposed method piRDA. The convolutional block comprises a convolution layer with ReLU activation, followed by group normalization and max-pooling layers. The dense block consists of two fully connected layers with ReLU activation and dropout, followed by a sigmoid activation that produces the prediction scores.

Model implementation/training

The proposed architecture piRDA was built using the Keras framework (https://keras.io/). The parameters of piRDA were optimized with adaptive moment estimation, commonly known as Adam, an efficient stochastic optimization method in which the magnitude of parameter updates is invariant to rescaling of the gradient [83]. The learning rate used for the optimizer was . The loss function was binary cross-entropy, which computes the classification loss between the actual labels and the predicted probabilities during training [84]. Furthermore, early stopping on the validation loss was employed to reduce overfitting. The patience used in early stopping was 20, meaning that training stops if the validation loss does not improve (decrease) for 20 epochs. The maximum number of training epochs was 200 and the batch size was 32. The optimum hyperparameters of piRDA were selected by grid search using keras-hypetune (https://github.com/cerlymarco/keras-hypetune). Hyperparameter tuning plays a substantial part in selecting an optimal deep learning model.
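The loss and the stopping rule described above can be illustrated without any framework. The sketch below is a plain-Python approximation of the idea, not the Keras implementation: the clipping constant `eps` and the exact tie handling are assumptions, and the validation-loss curve is invented for illustration.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def stopping_epoch(val_losses, patience=20, max_epochs=200):
    """Return the 1-based epoch at which training stops under early stopping."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, wait = loss, 0   # improvement resets the counter
        else:
            wait += 1
            if wait >= patience:   # no improvement for `patience` epochs
                return epoch
    return min(len(val_losses), max_epochs)

# Hypothetical curve: the loss falls for 30 epochs, then plateaus at a worse value.
val_losses = [1.0 - 0.01 * e for e in range(30)] + [0.9] * 60
stopped_at = stopping_epoch(val_losses)
```

In this toy curve the best loss occurs at epoch 30, so with a patience of 20 the run ends at epoch 50, well before the 200-epoch ceiling.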

Results

Evaluation measures

To evaluate the prediction performance of the predictor, the commonly used k-fold cross-validation was employed. The performance evaluation metrics include accuracy (Acc), sensitivity (Sn), specificity (Sp), Matthews correlation coefficient (Mcc), and rank index (RI). Accuracy is the ratio of correctly classified samples to all samples. Sensitivity and specificity are the proportions of positive and negative samples, respectively, that are correctly identified. Mcc is a measure of classifier quality and stability; it takes into account all of the true positives (Tp), false positives (Fp), true negatives (Tn), and false negatives (Fn), which makes it an effective measure under class imbalance. Tp is the number of correctly identified positive samples, whereas Fn is the number of positive samples predicted as negative. Similarly, Tn is the number of correctly identified negative samples and Fp is the number of negative samples incorrectly identified as positive associations. Furthermore, the rank index [33], [34], [85] measures how well positive associations are identified, based on their ranks among all the piRNA-disease pairs of the test subset. Higher values of Acc, Sn, Sp, and Mcc indicate better predictor performance, whereas a lower value of RI indicates better performance. The evaluation measures can be calculated as follows:

$$Acc = \frac{Tp + Tn}{Tp + Tn + Fp + Fn}$$

$$Sn = \frac{Tp}{Tp + Fn}$$

$$Sp = \frac{Tn}{Tn + Fp}$$

$$Mcc = \frac{Tp \times Tn - Fp \times Fn}{\sqrt{(Tp + Fp)(Tp + Fn)(Tn + Fp)(Tn + Fn)}}$$

RI is computed from the number of positive associations in the test subset and the rank position of each positive piRNA disease association among all the piRNA-disease pairs in the test subset. Moreover, the receiver operating characteristic (ROC) curve was used to evaluate the success rate of the classifier. The ROC curve plots the true positive rate against the false positive rate and depicts the predictor’s performance at all classification thresholds. 
Additionally, the precision-recall curve (PRC) evaluates a classifier’s prediction of the positive class. The PRC plots precision against recall at all classification thresholds; both ROC and PRC are significant indicators for positive-class evaluation. The areas under the ROC curve (AUC) and the PRC (AUPRC) quantify the prediction quality of the classifier. Both AUC and AUPRC are composite metrics of the classifier’s success that consider all possible classification thresholds.
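The threshold-based measures and the AUC can be computed directly from their standard definitions, as in the sketch below. The confusion counts and scores are hypothetical, and the AUC is computed here through its pairwise-ranking interpretation rather than by integrating a plotted curve. The rank index is left out because its exact formula is not recoverable from the text.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and Matthews correlation coefficient."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "Mcc": mcc}

def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical confusion counts for a balanced test fold of 200 pairs.
m = confusion_metrics(tp=90, tn=92, fp=8, fn=10)
# Hypothetical labels and prediction scores for four pairs.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Because Mcc uses all four confusion counts, it drops sharply when either class is poorly predicted, even if accuracy remains high.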

Model performance

The proposed method piRDA identifies piRNA-disease associations using two-step positive unlabeled learning together with a supervised learning labeling method that takes the contextual information of the sequence into account. piRDA was evaluated by rigorous k-fold cross-validation. Its performance on the evaluation measures is summarized in Table 2. These outcomes are the average values, with standard deviation errors, over the 50 test subsets from the piRNA disease and reliable negative datasets; the values for Acc, Sn, Sp, Mcc, RI, AUC, and AUPRC are 91.32%, 90.89%, 91.80%, 0.827, 0.056, 0.951, and 0.931, respectively. The AUC and AUPRC with standard deviation errors for the five folds are illustrated in Fig. 4 and Fig. 5, respectively. Similarly, Fig. 6 and Fig. 7 illustrate the AUC and AUPRC with standard deviation errors for the 10 cross-validation sub-folds. The feature space learned by piRDA was represented using UMAP and is shown in Fig. 8.

Table 2

Summary of performance comparison of piRDA with existing methods for identifying piRNA disease associations.

Metric	piRDA	iPiDA-sHN	iPiDi-PUL
Acc	0.913 ± 0.007	0.736 ± 0.020	0.589 ± 0.012
Sn	0.909 ± 0.011	0.779 ± 0.078	0.281 ± 0.027
Sp	0.918 ± 0.014	0.694 ± 0.080	0.897 ± 0.007
Mcc	0.827 ± 0.016	–	–
RI	0.056 ± 0.004	0.307 ± 0.005	0.322 ± 0.005
AUC	0.951 ± 0.001	0.887 ± 0.009	0.856 ± 0.009
AUPRC	0.931 ± 0.003	0.834 ± 0.023	0.764 ± 0.014

”-” denotes Not Applicable.

Fig. 4

Illustration of the five folds success rate (ROC), with associated calculation of prediction quality (AUC) and standard deviation error.

Fig. 5

Illustration of the five folds PRC together with (AUPRC) and standard deviation error.

Fig. 6

Illustration of ROC along with AUC and standard deviation error of sub 10-fold cross-validation.

Fig. 7

Illustration of PRC along with AUPRC and standard deviation error of sub 10-fold cross-validation.

Fig. 8

Clusters of positive and negative piRNA disease associations features of the proposed method obtained from hidden layer activation using UMAP.


Comparative analysis

To assess the significance of the proposed architecture piRDA, we compared its performance with the existing state-of-the-art methods, iPiDi-PUL [33] and iPiDA-sHN [34]. iPiDi-PUL is a random forest-based ensemble learning method whose features combine three biological data sources. The model was trained using positive unlabeled learning, where, for the positive set of labeled piRNA disease associations, a negative set of equal size was randomly selected from the unlabeled piRNA disease associations. iPiDA-sHN is an SVM-based classifier in which a CNN was used to extract features from piRNA similarity and disease similarity computed from three independent biological sources. Furthermore, SVM-based two-step positive unlabeled learning was employed to construct reliable negative samples from unlabeled data and to classify piRNA disease associations. piRDA outperforms all the computational methods in the comparison. The results for the identification of piRNA disease associations, with standard deviation errors, are summarized in Table 2 and illustrated in Fig. 9. The comparative results for the existing methods are taken from the iPiDA-sHN study [34]. piRDA exceeds the state of the art in the performance evaluation measures Acc, Sn, Sp, RI, AUC, and AUPRC by 17.7, 13.0, 22.4, 25.1, 6.4, and 10.0 percentage points, respectively.
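The quoted margins over iPiDA-sHN can be recomputed from the rounded means in Table 2, as sketched below. The helper name is illustrative. Note that RI is a lower-is-better measure, so its sign is flipped, and that the rounded table values give 9.7 rather than 10.0 for AUPRC; the small gap presumably reflects rounding in the table.

```python
# Mean values copied from Table 2.
TABLE2 = {
    "Acc":   {"piRDA": 0.913, "iPiDA-sHN": 0.736, "iPiDi-PUL": 0.589},
    "Sn":    {"piRDA": 0.909, "iPiDA-sHN": 0.779, "iPiDi-PUL": 0.281},
    "Sp":    {"piRDA": 0.918, "iPiDA-sHN": 0.694, "iPiDi-PUL": 0.897},
    "RI":    {"piRDA": 0.056, "iPiDA-sHN": 0.307, "iPiDi-PUL": 0.322},
    "AUC":   {"piRDA": 0.951, "iPiDA-sHN": 0.887, "iPiDi-PUL": 0.856},
    "AUPRC": {"piRDA": 0.931, "iPiDA-sHN": 0.834, "iPiDi-PUL": 0.764},
}
LOWER_IS_BETTER = {"RI"}

def gain_over(metric, baseline="iPiDA-sHN"):
    """Improvement of piRDA over a baseline, in percentage points."""
    row = TABLE2[metric]
    diff = row["piRDA"] - row[baseline]
    if metric in LOWER_IS_BETTER:
        diff = -diff  # a lower rank index is an improvement
    return round(100 * diff, 1)

gains = {metric: gain_over(metric) for metric in TABLE2}
```

Mcc is omitted because Table 2 reports no Mcc values for the two baseline methods.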

Fig. 9

Illustration of the evaluation measure comparison of piRDA with existing methods for identifying piRNA disease associations.

Discussion

The superior performance of the proposed method on all evaluation measures indicates that piRDA is more robust and efficient than the available computational methods in identifying piRNA disease associations. The efficacy and robustness of piRDA are attributable to the selection of reliable negatives using two-step positive unlabeled learning and to DAOHV, a supervised learning representation of the raw piRNAs and their associated disease pairs. This enables the deep learning algorithm to directly extract the most significant and abstract features from the raw inputs without losing the contextual information of the sequences. The multiple levels of abstraction in the deep learning model make it possible to identify piRNA disease associations more precisely and accurately without any hand-crafted feature extraction. In contrast, the available methods construct their feature matrices by fusing information from three different biological sources, which introduces noisy information and leads to misclassification by the machine learning based algorithms. Moreover, calculating a similarity matrix for feature representation results in loss of contextual information among the sequences of the piRNA and disease pair. Furthermore, bias and false negatives were reduced by using the bootstrapping method and two-step positive unlabeled learning, where the selection of reliable negative associations reduces the false-negative problem to a difference of only 1 percent between Sn and Sp. This difference was 8.5 percent for iPiDA-sHN and 61.6 percent for iPiDi-PUL. Such a drastic difference indicates that piRNA disease associations were inaccurately classified as non-piRNA disease associations. The random selection of negative samples from unlabeled data for training iPiDi-PUL is responsible for the biased predictions of the classifier.
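
The gap between Sn and Sp used in this argument can be computed directly from confusion-matrix counts. The following Python sketch shows the calculation; the counts are hypothetical and chosen only to illustrate a balanced and a biased classifier, they are not the results reported for piRDA, iPiDA-sHN, or iPiDi-PUL, and the function names are assumptions.

```python
def sensitivity(tp, fn):
    """Sn: fraction of true associations that are predicted as associations."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Sp: fraction of true non-associations predicted as non-associations."""
    return tn / (tn + fp)


def sn_sp_gap_percent(tp, fn, tn, fp):
    """Absolute difference between Sn and Sp, in percentage points."""
    gap = abs(sensitivity(tp, fn) - specificity(tn, fp))
    return round(100 * gap, 1)


# Hypothetical balanced classifier: few false negatives, few false positives.
balanced = sn_sp_gap_percent(tp=90, fn=10, tn=91, fp=9)

# Hypothetical biased classifier: many true associations are missed
# (high fn), so Sn falls far below Sp.
biased = sn_sp_gap_percent(tp=35, fn=65, tn=97, fp=3)

print(balanced, biased)
```

A large gap with Sn well below Sp means that many true associations are being labeled as non-associations, which is the false-negative behavior described above.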

Case study

We evaluated the proposed method piRDA against the literature on piRNAs as potential biomarkers and therapeutic targets of various diseases. We tested the method using experimentally verified piRNAs that were not involved in training the model. For instance, the piRNAs with piRBase ID (NCBI accession number) piR-hsa-23317 (DQ593039), piR-hsa-1207 (DQ570956), piR-hsa-27730 (DQ597484), piR-hsa-24016 (DQ593768), piR-hsa-26593 (DQ596377), and piR-hsa-29114 (DQ599147), reported in Li et al. [22], show the highest association with Cardiovascular diseases. Likewise, piR-hsa-26686 (DQ596470) [86] and piR-hsa-20266 (DQ590013) [87] are associated with Renal cell carcinoma, and piR-hsa-25783 (DQ595536), piR-hsa-28467 (DQ598252), piR-hsa-24016 (DQ593768), piR-hsa-2107 (DQ571813), piR-hsa-820 (DQ570540), and piR-hsa-515 (DQ570206) [23] with Alzheimer disease. The disease associations for the independent piRNAs are summarized in Table 3.

Table 3

Summary of piRDA performance for identifying piRNA disease associations using independent piRNA IDs.

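piRNA ID	Association	Reported
piR-hsa-23317	Cardiovascular diseases	Li et al. [22]
piR-hsa-1207	Cardiovascular diseases	Li et al. [22]
piR-hsa-24016	Cardiovascular diseases	Li et al. [22]
piR-hsa-26593	Cardiovascular diseases	Li et al. [22]
piR-hsa-29114	Cardiovascular diseases	Li et al. [22]
piR-hsa-26686	Renal cell carcinoma	Wu et al. [86]
piR-hsa-20266	Renal cell carcinoma	Fu et al. [87]
piR-hsa-25783	Alzheimer disease	Roy et al. [23]
piR-hsa-28467	Alzheimer disease	Roy et al. [23]
piR-hsa-24016	Alzheimer disease	Roy et al. [23]
piR-hsa-2107	Alzheimer disease	Roy et al. [23]
piR-hsa-820	Alzheimer disease	Roy et al. [23]
piR-hsa-515	Alzheimer disease	Roy et al. [23]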

piRNA ID refers to the piRBase [28].

Web-server

A user-friendly and freely accessible web server implementing the proposed architecture piRDA is available at http://nsclbio.jbnu.ac.kr/tools/piRDA/. Web servers are an efficient way of maintaining records of computationally analyzed results. The server is built using the Python Flask web framework. Input piRNA sequences can be uploaded in FASTA format, and the output lists the diseases associated with each piRNA sequence.
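
As an illustration of the expected input, the following Python sketch parses and checks FASTA-formatted text before upload. It is a hypothetical client-side helper, not part of the piRDA server code: the function names, the accepted nucleotide alphabet, and the example records are assumptions, and the sequences shown are made up rather than real piRNAs.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping record ID to its sequence.

    The record ID is the first whitespace-delimited token after '>'.
    Sequence lines are upper-cased and joined across line breaks.
    """
    records = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            records[current] = []
        elif current is not None:
            records[current].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def is_nucleotide_sequence(seq, alphabet="ACGTUN"):
    """Return True if seq is non-empty and uses only the assumed alphabet."""
    return bool(seq) and set(seq) <= set(alphabet)


# Hypothetical input with two made-up records.
example = """>example_piRNA_1 hypothetical record
TGAGGTAGTAGG
TTGTATAGTT
>example_piRNA_2
ugaaauguugc
"""

sequences = parse_fasta(example)
valid = {name: is_nucleotide_sequence(seq) for name, seq in sequences.items()}
```

Checking the alphabet locally is a design choice made here for convenience; how the server itself validates uploads is not described in this section.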

Conclusion

In this study, we proposed a deep learning based, computationally efficient, and robust algorithm for identifying piRNA disease associations. The most important features were extracted from disease-associated piRNAs without any hand-designed feature engineering. Two-step positive unlabeled learning and bootstrapping were used to construct a reliable negative dataset and to remove classifier bias, respectively. The experimental outcomes reveal that the proposed architecture piRDA significantly outperforms the state-of-the-art computational methods for predicting piRNA disease associations. Accurate identification of piRNA disease associations would help experimentalists, researchers, and drug developers to further understand the mechanisms of diseases associated with piRNAs. The publicly accessible web tool offers a convenient platform for obtaining this information. At present, research on piRNA disease associations is in its infancy; the model can therefore identify piRNA disease associations for 21 diseases, a scope that could be extended and generalized in the future as more verified disease-associated piRNAs become available.

CRediT authorship contribution statement

Syed Danish Ali: Conceptualization, Methodology, Software, Writing - original draft, Writing - review & editing. Hilal Tayara: Conceptualization, Software, Validation, Supervision, Writing - review & editing. Kil To Chong: Conceptualization, Validation, Supervision, Writing - review & editing, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
