DBMDA: A Unified Embedding for Sequence-Based miRNA Similarity Measure with Applications to Predict and Validate miRNA-Disease Associations.

Kai Zheng¹, Zhu-Hong You², Lei Wang³, Yong Zhou⁴, Li-Ping Li⁵, Zheng-Wei Li⁴.

Abstract

MicroRNAs (miRNAs) play a critical role in human diseases. Determining the associations between miRNAs and diseases contributes to elucidating the pathogenesis of liver diseases and to seeking effective treatments. Despite great recent advances in the study of miRNA-disease associations, verifying and identifying associations efficiently at scale remains a serious challenge for biological experimental approaches. Computational methods for predicting miRNA-disease associations have therefore become a research hotspot. In this paper, we present a new computational method, distance-based sequence similarity for miRNA-disease association prediction (DBMDA), that directly learns a mapping from miRNA sequences to a Euclidean space. The notable feature of our approach is that it infers global similarity from region distances, which are computed from the miRNA sequences with the chaos game representation algorithm. In the 5-fold cross-validation experiment, the area under the curve (AUC) obtained by DBMDA in predicting potential miRNA-disease associations reached 0.9129. To assess DBMDA more thoroughly, we compared it with different classifiers and earlier prediction models. In addition, we carried out two case studies, on prostate neoplasms and colon neoplasms. In both, 39 of the top 40 predicted miRNAs were confirmed by other databases. DBMDA makes a new attempt at sequence similarity and achieves excellent results, while also providing a new perspective for predicting the relationship between diseases and miRNAs. The source code and datasets explored in this work are available online from the University of Chinese Academy of Sciences (http://220.171.34.3:81/).

Keywords: chaos game representation; disease; heterogenous information; miRNAs; rotation forest

Year: 2019 PMID: 31931344 PMCID: PMC6957846 DOI: 10.1016/j.omtn.2019.12.010

Source DB: PubMed Journal: Mol Ther Nucleic Acids ISSN: 2162-2531 Impact factor: 8.886

Introduction

MicroRNAs (miRNAs) are a class of short noncoding RNAs (ncRNAs) of about 22 nt that bind designated messenger RNAs by base pairing and control their translation and stability. Since the first miRNA was discovered by Victor Ambros in 1993, a large number of miRNAs have been identified over the past 20 years in a wide variety of species. Studies have found that miRNAs have an important influence on biological processes such as cell development, proliferation, and apoptosis, and that the regulatory functions of miRNAs are related to the expression of particular genes at the post-transcriptional stage. Based on these findings, more and more miRNAs have been shown to be connected with the development of complex human diseases. For instance, miR-137 controls the mitotic progression of lung cancer cells by targeting Cdc42 and Cdk6. In the study of von Brandenstein et al., miR-15a is a potential biological marker for differentiating benign and malignant renal tumors in biopsy and urine samples. The progression of head and neck carcinomas can also be promoted by miR-211 through targeting transforming growth factor-β receptor 2 (TGF-βR2). However, biological experiments for verifying associations between miRNAs and diseases require demanding conditions and are time-consuming and laborious. Therefore, computational algorithms for predicting potential miRNA-disease associations have become a hot topic and are receiving growing attention. By ranking candidate associations, computational methods can help biological experiments validate disease-associated miRNAs more effectively. Over the years, an increasing number of studies have constructed computational models for predicting miRNA-disease associations.11, 12, 13, 14, 15, 16, 17 There are two main types of computational models: those based on similarity and those based on machine learning.
To be specific, similarity-based methods estimate the strength of association from miRNA and disease networks. For example, Chen et al. proposed Random Walk with Restart for MiRNA-Disease Association prediction (RWRMDA), a method that calculates global network similarity by combining matrices of miRNA functional similarity. Li et al. presented a computational method to predict potential associations by calculating the functional consistency score (FCS) of target genes and disease-related genes. The main advance of heterogeneous graph inference for miRNA-disease association prediction (HGIMDA), given by Chen et al., was to calculate the optimal solution set through an iterative process. On the other hand, methods based on machine learning predict potential miRNA-disease associations by training a model on the known miRNA-disease associations. For example, Xu et al. used a support vector machine (SVM) classifier to identify positive and negative associations in a miRNA-target-dysregulated network. Chen and Yan proposed a semi-supervised method, Regularized Least Squares for MiRNA-Disease Association (RLSMDA), that predicts new disease-related miRNAs without requiring negative samples. Restricted Boltzmann machine for multiple types of miRNA-disease association prediction (RBMMMDA) is a method developed by Chen et al., whose main improvement is that it predicts several types of new associations. In this study, we build a distance-based sequence similarity for miRNA-disease association prediction (DBMDA) based on chaos game representation (CGR). DBMDA combines information on miRNA sequence, miRNA function, confirmed associations, and disease semantics. The motivation for this approach is to map miRNA sequences to Euclidean space, where the regional distance directly corresponds to a measure of miRNA sequence similarity.
In detail, we first obtained miRNA and disease similarity matrices based on miRNA sequence information and disease semantic information. Second, the similarity matrices obtained in the previous step were combined with the Gaussian profile kernel similarity matrices of miRNAs and diseases to get the integrated similarity matrices. Third, each nucleotide of a miRNA sequence was mapped to Euclidean space with the CGR technique. To be specific, the CGR plane is divided into an 8 × 8 grid, and the average coordinates in each grid cell are calculated. The regional distance between miRNAs is then used to quantify their similarity and construct a miRNA sequence similarity matrix, which is integrated with the similarity information obtained in the second step into a comprehensive feature. Finally, the integrated feature vector is fed to the rotation forest (RoF) classifier to predict potential associations. The following experiments were designed to evaluate the reliability of the method. We used 5-fold cross-validation to assess the performance of DBMDA on the Human microRNA Disease Database (HMDD) v.3.0 dataset. The AUC of the 5-fold cross-validation was 0.9129 ± 0.0113. Moreover, two case studies, on prostate neoplasms and colon neoplasms, were carried out. As a result, 39 (prostate neoplasms) and 39 (colon neoplasms) of the top 40 predicted miRNAs were verified by other datasets. This shows that DBMDA is an efficient method for predicting potential miRNA-disease associations.

Results

Performance Evaluation

Evaluation Criteria

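We follow widely used evaluation measures, namely classification accuracy (Accu.), sensitivity (Sen.), precision (Prec.), and F1 score, to assess the performance of DBMDA. They are defined, respectively, by:

$$\mathrm{Accu.} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Sen.} = \frac{TP}{TP + FN},$$

$$\mathrm{Prec.} = \frac{TP}{TP + FP}, \qquad \mathrm{F1} = \frac{2 \times \mathrm{Prec.} \times \mathrm{Sen.}}{\mathrm{Prec.} + \mathrm{Sen.}},$$

where TP, FP, TN, and FN represent the numbers of true positives, false positives, true negatives, and false negatives, respectively. In addition, the receiver operating characteristic (ROC) curve and the area under the curve (AUC) are used to summarize the overall performance of the model.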
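As a small illustration of the four measures named above, the sketch below computes them from confusion-matrix counts using their standard definitions. The function name and the counts are hypothetical and chosen only to show the arithmetic; they are not results from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, precision, F1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # also called recall
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, precision, f1


# Hypothetical counts, for illustration only.
acc, sen, prec, f1 = classification_metrics(tp=80, fp=20, tn=70, fn=30)
print(f"Accu.={acc:.4f} Sen.={sen:.4f} Prec.={prec:.4f} F1={f1:.4f}")
```

The function assumes that every denominator is non-zero; a production version would need to guard against empty classes.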

Prediction of miRNA-Disease Association

We used 5-fold cross-validation to assess the performance of DBMDA based on confirmed associations in HMDD v.3.0. Li et al. selected 17,412 papers and extracted 32,281 known miRNA-disease associations involving 1,102 miRNAs and 850 diseases. Because some miRNAs could not be verified in the public database miRBase, we removed them. After screening, the associations confirmed by miRBase were chosen as positive samples. Meanwhile, negative samples were drawn from all possible miRNA-disease pairs that are not known associations. Figure 1 shows the ROC curves of the 5-fold cross-validation obtained by DBMDA, which achieved an average prediction AUC of 0.9129 ± 0.0113. The AUCs of the five experiments are 0.8904 (fold 1), 0.9177 (fold 2), 0.9174 (fold 3), 0.9206 (fold 4), and 0.9188 (fold 5), respectively. The average accuracy, sensitivity, precision, and F1-score are 85.36%, 85.74%, 85.09%, and 85.40%, as listed in Table 1.
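The aggregation of the per-fold AUCs can be checked with a few lines of arithmetic. The sketch below uses only the five fold values quoted in the text; the use of the population standard deviation (dividing by 5) is an assumption, chosen because it gives a spread of about 0.0113, in line with the reported value. The mean comes out at roughly 0.913, which agrees with the reported 0.9129 up to rounding.

```python
from statistics import mean, pstdev

# Per-fold AUC values as quoted in the text.
fold_aucs = [0.8904, 0.9177, 0.9174, 0.9206, 0.9188]

mean_auc = mean(fold_aucs)
std_auc = pstdev(fold_aucs)  # population standard deviation (assumption)

print(f"AUC = {mean_auc:.4f} ± {std_auc:.4f}")
```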

Figure 1

The ROCs of DBMDA and AUCs Based on 5-Fold Cross-Validation

Table 1

The Comparison Results of DBMDA Based on 5-Fold Cross-Validation

Testing Set	Accuracy	Sensitivity	Precision	F1-Score
1	83.14%	81.55%	84.23%	82.87%
2	86.21%	86.83%	85.77%	86.30%
3	85.57%	86.42%	84.99%	85.70%
4	86.22%	87.07%	85.63%	86.34%
5	85.66%	86.83%	84.85%	85.83%
Average	85.36% ± 1.27%	85.74% ± 2.35%	85.09% ± 0.62%	85.40% ± 1.44%

Comparison with Different Classifier Models

In the 5-fold cross-validation, our proposed method achieved good results on the HMDD v.3.0 dataset using the RoF classifier. To illustrate why RoF was chosen, we compared it with SVM, random forest (RF), and decision tree (DT) in this experiment. The accuracies of the four classifiers are 85.00% (RoF), 83.73% (SVM), 82.06% (RF), and 80.33% (DT). Their AUCs are 91.15% (RoF), 89.01% (SVM), 90.77% (RF), and 80.29% (DT), as shown in Figure 2. The accuracy, sensitivity, precision, and F1-score are listed in Table 2. According to Table 2, the precision of the rotation forest classifier is not the highest among the four classifiers (84.11% for RoF against 85.43% for RF). However, RoF obtained the best results on the other evaluation criteria, especially the AUC, which represents the overall performance of the model. Overall, rotation forest is the best classifier for the features we build.

Figure 2

The ROCs of Four Different Classifiers, which Are RoF, SVM, Random Forest, and Decision Tree

Table 2

Performance Comparison among Four Different Classifiers, which are Rotation Forest, SVM, Random Forest, and Decision Tree

Method	Accuracy	Sensitivity	Precision	F1-Score
SVM	83.73%	83.56%	83.33%	83.45%
RF	82.06%	76.49%	85.43%	80.72%
DT	80.33%	78.12%	81.10%	79.58%
RoF	85.00%	85.60%	84.11%	84.85%

Comparison with Related Methods

Many past studies have explored the associations between miRNAs and diseases. To evaluate the performance of DBMDA, we compared it with nine state-of-the-art methods. Because the database versions used are not the same, we compare only the AUC values reported in the respective articles. Compared with the AUCs of RLSMDA, PBSI, MBSI, NetCBI, MaxFlow, miRGOFS, HGIMDA, MDHGI, and LMTRDA, DBMDA performs better, as shown in Table 3.27, 28, 29, 30, 31 Several reasons explain why DBMDA outperforms methods built on traditional miRNA similarity. First, the sequence information of miRNAs contains attribute features and is an excellent source of knowledge reflecting essential information. Second, miRNA similarity obtained from limited knowledge resources may contain errors caused by information loss. Third, inferring global similarity from regional distances also helps improve performance.

Table 3

The Comparison with Related Models

Methods	AUC Scores
RLSMDA (a)	86.17%
PBSI (b)	54.02%
MBSI (b)	74.83%
NetCBI (b)	80.66%
MaxFlow (c)	86.93%
miRGOFS (d)	87.70%
HGIMDA (e)	87.81%
MDHGI (f)	87.94%
LMTRDA (g)	90.54%
DBMDA	91.29%

(a) The results of the method are reported in Chen and Yan.

(b) The results of the method are reported in Chen and Zhang.

(c) The results of the method are reported in Yu et al.

(d) The results of the method are reported in Yang et al.

(e) The results of the method are reported in Chen et al.

(f) The results of the method are reported in Chen et al.

(g) The results of the method are reported in Wang et al.

Case Studies

Here, DBMDA is applied to two human diseases, prostate neoplasms and colon neoplasms, to further evaluate its effectiveness based on the associations in the HMDD v.3.0 database. The test samples are the miRNA-disease pairs formed by each of the two diseases and all possible miRNAs. We checked the top 40 ranked predictions against dbDEMC v.2.0 and miR2Disease. In the United States, prostate cancer has caused more than 20,000 deaths and has become one of the hidden dangers to men's health today. Age is a major risk factor for prostate cancer, and older people tend to have a higher incidence. However, an increasing number of younger men are being diagnosed with prostate neoplasms. Prostate neoplasms may spread to other areas of the body, such as surrounding tissue and regional lymph nodes. Therefore, we took prostate neoplasms as an example to evaluate the performance of DBMDA. The results are shown in Table 4. Thirty-nine of the top 40 predicted miRNAs were confirmed by the two datasets mentioned above.

Table 4

Prediction of the Top 40 Predicted miRNAs Associated with Prostate Neoplasms Based on Known Associations in dbDEMC v.2.0 and miR2Disease

miRNA	dbDEMC	miR2D	miRNA	dbDEMC	miR2D
hsa-mir-192	confirmed	unconfirmed	hsa-mir-181a-2	confirmed	unconfirmed
hsa-let-7i	confirmed	unconfirmed	hsa-mir-196a	confirmed	unconfirmed
hsa-mir-140	confirmed	unconfirmed	hsa-mir-208a	confirmed	unconfirmed
hsa-mir-199b	confirmed	confirmed	hsa-mir-337	confirmed	unconfirmed
hsa-mir-144	confirmed	unconfirmed	hsa-mir-1246	confirmed	unconfirmed
hsa-mir-372	confirmed	unconfirmed	hsa-mir-30	confirmed	unconfirmed
hsa-let-7e	confirmed	confirmed	hsa-mir-184	confirmed	confirmed
hsa-let-7f	confirmed	confirmed	hsa-mir-509	unconfirmed	unconfirmed
hsa-mir-10b	confirmed	confirmed	hsa-mir-9-3	confirmed	unconfirmed
hsa-mir-129	confirmed	unconfirmed	hsa-let-7f-2	confirmed	unconfirmed
hsa-mir-9-1	confirmed	unconfirmed	hsa-mir-202	confirmed	confirmed
hsa-mir-206	confirmed	unconfirmed	hsa-mir-33a	confirmed	unconfirmed
hsa-mir-125a	confirmed	confirmed	hsa-mir-451a	confirmed	unconfirmed
hsa-mir-30b	confirmed	confirmed	hsa-let-7f-1	confirmed	unconfirmed
hsa-mir-362	confirmed	unconfirmed	hsa-mir-186	confirmed	unconfirmed
hsa-mir-133	confirmed	unconfirmed	hsa-mir-302b	confirmed	unconfirmed
hsa-mir-139	confirmed	unconfirmed	hsa-mir-328	confirmed	unconfirmed
hsa-mir-137	confirmed	unconfirmed	hsa-mir-383	confirmed	unconfirmed
hsa-mir-181b-2	confirmed	unconfirmed	hsa-mir-431	confirmed	unconfirmed
hsa-mir-338	confirmed	unconfirmed	hsa-mir-103a-2	confirmed	unconfirmed

In the United States, colon neoplasms, a common type of malignant cancer, have the third highest morbidity and the third highest fatality rate. One study estimated that more than 135,000 individuals would be diagnosed with colon and rectal neoplasms. Therefore, we chose colon neoplasms as a case study to evaluate the performance of DBMDA. As a result, 39 of the top 40 potential miRNAs associated with colon neoplasms were confirmed by experimental findings recorded in dbDEMC v.2.0 and miR2Disease, as shown in Table 5.

Table 5

Prediction of the Top 40 Predicted miRNAs Associated with Colon Neoplasms Based on Known Associations in dbDEMC v.2.0 and miR2Disease

miRNA	dbDEMC	miR2D	miRNA	dbDEMC	miR2D
hsa-mir-26a	confirmed	confirmed	hsa-mir-497	confirmed	confirmed
hsa-mir-182	confirmed	confirmed	hsa-mir-92a-2	confirmed	unconfirmed
hsa-mir-342	confirmed	confirmed	hsa-mir-124	confirmed	confirmed
hsa-mir-483	confirmed	unconfirmed	hsa-mir-129	confirmed	confirmed
hsa-mir-139	confirmed	unconfirmed	hsa-mir-133a-1	confirmed	confirmed
hsa-mir-372	confirmed	unconfirmed	hsa-mir-181b-1	confirmed	confirmed
hsa-mir-181b-2	confirmed	confirmed	hsa-mir-26a-1	confirmed	confirmed
hsa-mir-181a-2	confirmed	confirmed	hsa-mir-373	confirmed	unconfirmed
hsa-mir-124-1	confirmed	confirmed	hsa-mir-423	confirmed	unconfirmed
hsa-mir-193a	confirmed	unconfirmed	hsa-mir-499	unconfirmed	unconfirmed
hsa-mir-193b	confirmed	unconfirmed	hsa-mir-128	confirmed	confirmed
hsa-mir-26b	confirmed	unconfirmed	hsa-mir-16	confirmed	unconfirmed
hsa-mir-34b	confirmed	unconfirmed	hsa-mir-212	confirmed	unconfirmed
hsa-mir-1	confirmed	confirmed	hsa-mir-340	confirmed	unconfirmed
hsa-mir-133a-2	confirmed	confirmed	hsa-mir-98	confirmed	unconfirmed
hsa-mir-199b	confirmed	unconfirmed	hsa-mir-100	confirmed	unconfirmed
hsa-mir-27b	confirmed	confirmed	hsa-mir-124-3	confirmed	confirmed
hsa-mir-29c	confirmed	unconfirmed	hsa-mir-133	confirmed	confirmed
hsa-mir-451a	confirmed	unconfirmed	hsa-mir-183	confirmed	confirmed
hsa-mir-144	confirmed	unconfirmed	hsa-mir-370	confirmed	unconfirmed

Discussion

Sequence-based miRNA similarity can aid in predicting miRNA-disease associations, extracting biological property information, and enhancing the analytical quality of high-throughput sequencing data. However, most existing methods do not use sequence information, and the current information sources (miRNA-disease associations) do not directly reflect the relationships between miRNAs. Therefore, this paper proposed DBMDA, a predictive model that infers miRNA similarity from sequence information. The improvement of the method is that it directly learns the mapping from miRNA sequences to Euclidean space, where the regional distance directly corresponds to a measure of miRNA sequence similarity. The experimental results indicate that DBMDA performs well in predicting disease-associated miRNAs with the support of new algorithms and sequence information. In addition, sequence information has sufficient coverage of human miRNAs, so DBMDA is broadly applicable in functional analysis.

Conclusions

Materials and Methods

Human miRNA-Disease Associations

We downloaded the confirmed association data from the HMDD dataset for this experiment. HMDD v.3.0 was last updated on October 9, 2018, and includes 32,281 experimentally confirmed associations between 850 diseases and 1,102 miRNAs from 17,412 papers. Based on these data, an adjacency matrix $A \in \{0,1\}^{n_d \times n_m}$ is built to represent the associations, where $n_d$ and $n_m$ are the numbers of diseases and miRNAs in HMDD v.3.0. The element $A(i, j)$ is equal to 1 if miRNA $r(j)$ has been confirmed to be associated with disease $d(i)$, and 0 otherwise.
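The construction of the adjacency matrix can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the toy disease and miRNA lists, and the three association pairs are hypothetical stand-ins for the HMDD v.3.0 records, and rows are taken to index diseases and columns miRNAs.

```python
def build_adjacency(pairs, diseases, mirnas):
    """Build a 0/1 matrix with one row per disease and one column per miRNA."""
    d_index = {d: i for i, d in enumerate(diseases)}
    r_index = {r: j for j, r in enumerate(mirnas)}
    matrix = [[0] * len(mirnas) for _ in diseases]
    for mirna, disease in pairs:
        matrix[d_index[disease]][r_index[mirna]] = 1
    return matrix


# Toy data, for illustration only (not taken from HMDD).
diseases = ["disease_1", "disease_2"]
mirnas = ["mirna_a", "mirna_b", "mirna_c"]
pairs = [("mirna_a", "disease_1"), ("mirna_c", "disease_1"), ("mirna_b", "disease_2")]

A = build_adjacency(pairs, diseases, mirnas)
print(A)
```

A dense list of lists is enough for a toy example; for the full 850-disease dataset a NumPy or sparse array would be the more natural choice.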

miRNA Functional Similarity

Wang et al. proposed a method for quantifying the functional similarity between miRNAs based on the hypothesis that functionally similar miRNAs are more likely to affect the same disease and pathologically similar diseases are more likely to be affected by the same miRNA. The miRNA functional similarity data are available at http://www.cuilab.cn/files/images/cuilab/misim.zip. A 495 × 495 matrix, $FS$, is defined to represent the miRNA functional similarity, and the element $FS(r(i), r(j))$ is the similarity score between miRNA $r(i)$ and miRNA $r(j)$.

Disease Semantic Similarity Model

We built a directed acyclic graph (DAG) to describe the relationships among diseases, following the method proposed by Wang et al., which is based on the Medical Subject Headings (MeSH) descriptors. The MeSH descriptors can be downloaded from the U.S. National Library of Medicine database (https://www.nlm.nih.gov/). A disease $D$ can be represented as $DAG(D) = (D, N(D), E(D))$, where $N(D)$ is a node set including disease $D$ and its ancestor diseases, and $E(D)$ is an edge set including the corresponding edges. Based on the DAG, the contribution of disease $t$ in $DAG(D)$ to the semantic value of disease $D$ was calculated as:

$$D1_D(t) = \begin{cases} 1, & t = D \\ \max\{\Delta \cdot D1_D(t') \mid t' \in \text{children of } t\}, & t \neq D \end{cases} \qquad (1)$$

where $\Delta$ is the semantic contribution decay factor, which is set to 0.5 according to previous studies. If disease $t$ is not disease $D$, its contribution is decreased with its distance from $D$. If disease $t$ is disease $D$, its contribution is defined as 1. The semantic value of disease $D$ is then:

$$DV1(D) = \sum_{t \in N(D)} D1_D(t) \qquad (2)$$

If diseases $d(i)$ and $d(j)$ share more of their DAGs, they will have a larger similarity score. The semantic similarity score is defined as follows:

$$SS1(d(i), d(j)) = \frac{\sum_{t \in N(d(i)) \cap N(d(j))} \left( D1_{d(i)}(t) + D1_{d(j)}(t) \right)}{DV1(d(i)) + DV1(d(j))} \qquad (3)$$

$SS1$ is the 850 × 850 semantic similarity matrix, and the element $SS1(d(i), d(j))$ is the semantic similarity of $d(i)$ and $d(j)$ based on disease semantic similarity model 1. According to the above formulas, diseases in the same layer of $DAG(D)$ have the same contribution value. However, a more specific disease, one that appears in fewer DAGs, should contribute a higher value. Hence, the contribution of disease $t$ in $DAG(D)$ to the semantic value of disease $D$ is described, following the method of Xuan et al., as:

$$D2_D(t) = -\log\left( \frac{\text{the number of DAGs including } t}{\text{the number of diseases}} \right) \qquad (4)$$

where the number of diseases refers to all the diseases in our method. The semantic similarity between diseases $d(i)$ and $d(j)$ under this second model is denoted $SS2(d(i), d(j))$ and is based on the shared ancestor nodes and all the ancestor nodes. To be specific, it is computed as follows:

$$SS2(d(i), d(j)) = \frac{\sum_{t \in N(d(i)) \cap N(d(j))} \left( D2_{d(i)}(t) + D2_{d(j)}(t) \right)}{DV2(d(i)) + DV2(d(j))} \qquad (5)$$

where $DV2(d(i))$ and $DV2(d(j))$ are the semantic values of $d(i)$ and $d(j)$, computed in the same way as in Equation 2.
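The first semantic similarity model can be illustrated on a toy hierarchy. The sketch below assumes the usual form of the Wang et al. model: a disease contributes 1 to its own semantic value, and each ancestor contributes the decay factor (0.5) times the largest contribution among its children. The node names and the `parents` mapping are hypothetical and do not correspond to real MeSH descriptors.

```python
DELTA = 0.5  # semantic contribution decay factor, as stated in the text


def contributions(disease, parents):
    """Contribution of `disease` and each of its ancestors to its semantic value."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        next_frontier = []
        for node in frontier:
            for parent in parents.get(node, ()):
                value = DELTA * contrib[node]
                # Keep the largest value reachable through any child.
                if value > contrib.get(parent, 0.0):
                    contrib[parent] = value
                    next_frontier.append(parent)
        frontier = next_frontier
    return contrib


def semantic_similarity(a, b, parents):
    """Shared contributions divided by the sum of both semantic values."""
    ca, cb = contributions(a, parents), contributions(b, parents)
    shared = set(ca) & set(cb)
    numerator = sum(ca[t] + cb[t] for t in shared)
    return numerator / (sum(ca.values()) + sum(cb.values()))


# Toy hierarchy (hypothetical): two leaves under one intermediate node.
parents = {"leaf_a": ["mid"], "leaf_b": ["mid"], "mid": ["root"], "root": []}

sim_ab = semantic_similarity("leaf_a", "leaf_b", parents)
sim_aa = semantic_similarity("leaf_a", "leaf_a", parents)
print(round(sim_ab, 4), sim_aa)
```

In this toy case each leaf has a semantic value of 1 + 0.5 + 0.25 = 1.75, and the two leaves share the nodes `mid` and `root`, so the similarity is 1.5 / 3.5.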

GIP Similarity for Diseases and miRNA

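The HMDD v.3.0 dataset provides plenty of association information. Based on the hypothesis that pathologically similar diseases tend to be affected by the same miRNAs and vice versa, we calculate disease and miRNA similarity by Gaussian interaction profile (GIP) kernel similarity. Let $A$ be the adjacency matrix, with rows indexing diseases and columns indexing miRNAs. The binary vector $V(d(i))$, the interaction profile of disease $d(i)$, is the $i$-th row vector of $A$. The disease GIP similarity between $d(i)$ and $d(j)$ was computed by:

$$GD(d(i), d(j)) = \exp\left( -\theta_d \, \lVert V(d(i)) - V(d(j)) \rVert^2 \right)$$

where the adjustment coefficient $\theta_d$ controls the kernel bandwidth and is obtained by normalizing the original parameter $\theta'_d$ as follows:

$$\theta_d = \theta'_d \Big/ \left( \frac{1}{n_d} \sum_{i=1}^{n_d} \lVert V(d(i)) \rVert^2 \right)$$

with $n_d$ the number of diseases. Similarly, the GIP similarity between miRNA $r(i)$ and miRNA $r(j)$ can be calculated as follows:

$$GR(r(i), r(j)) = \exp\left( -\theta_r \, \lVert V(r(i)) - V(r(j)) \rVert^2 \right), \qquad \theta_r = \theta'_r \Big/ \left( \frac{1}{n_m} \sum_{i=1}^{n_m} \lVert V(r(i)) \rVert^2 \right)$$

where $n_m$ is the number of miRNAs, and the binary vector $V(r(i))$ [or $V(r(j))$] is the interaction profile of miRNA $r(i)$ [or $r(j)$], which records whether the miRNA is associated with each of the 850 diseases and is equivalent to the $i$-th (or $j$-th) column vector of adjacency matrix $A$.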
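A minimal sketch of the GIP kernel follows, assuming the commonly used form in which the similarity is the exponential of the negative scaled squared distance between two binary interaction profiles, with the bandwidth normalized by the mean squared norm of all profiles. The function name, the toy matrix, and the default original bandwidth parameter of 1.0 are assumptions; the text does not state the value used.

```python
import math


def gip_similarity(profiles, theta_prime=1.0):
    """Gaussian interaction profile kernel between every pair of binary profiles.

    theta_prime is the un-normalized bandwidth; 1.0 is an assumed default.
    """
    sq_norms = [sum(v * v for v in p) for p in profiles]
    theta = theta_prime / (sum(sq_norms) / len(profiles))
    n = len(profiles)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist_sq = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-theta * dist_sq)
    return sim


# Toy adjacency matrix (hypothetical): rows are diseases, columns are miRNAs.
A = [[1, 0, 1],
     [0, 1, 0]]

disease_sim = gip_similarity(A)                               # row profiles
mirna_sim = gip_similarity([list(col) for col in zip(*A)])    # column profiles
print(disease_sim[0][1], mirna_sim[0][2])
```

The sketch does not handle the degenerate case in which every profile is all zeros, where the normalization would divide by zero.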

Multi-source Feature Fusion

By combining the semantic similarity of the disease with the GIP similarity constructed above, a comprehensive similarity matrix incorporating heterogeneous information is computed. The element represented combined similarity between disease and , and was described as follows:The miRNA similarity matrix is constructed from miRNA functional similarity and miRNA GIP similarity . The miRNA similarity matrix [r(i), r(j)] formula for miRNA r(i) and miRNA r(j) is as follows:

CGR

In this study, based on the finding of Jessime et al. that homologs can be effectively detected even if all positions of an ncRNA are treated equally, we introduced CGR to map RNA sequences. In 1990, Jeffrey proposed CGR as a scale-independent representation for nucleotide sequences. CGR is an iterative mapping that can be traced back to chaos theory and is the basis of statistical mechanics. However, the use of the CGR format as a representation scheme for nucleotide sequences has not been fully explored. RNA sequences can be mapped into the CGR space, which is planar. The four possible nucleotides bound the CGR space as the vertices of a unit square (Figure 3). The position of each nucleotide is computed iteratively as:

$$P_i = P_{i-1} + \theta \, (g_i - P_{i-1}), \qquad i = 1, \ldots, n_s$$

where $P_i$ is the position of the $i$-th nucleotide, $n_s$ is the length of the sequence, $g_i$ is the nucleotide coefficient (the vertex assigned to the $i$-th nucleotide), parameter $\theta$ is the decay factor, and we define $\theta = 0.5$ and $P_0 = (0.5, 0.5)$.
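The iterative CGR mapping can be sketched in a few lines. This illustration assumes the classical chaos game setup: each point moves halfway from the previous point towards the vertex of the current nucleotide, starting from the centre of the unit square. The assignment of nucleotides to corners in `VERTICES`, the function name, and the example sequence are assumptions; the text does not specify which corner each nucleotide occupies.

```python
# Assumed corner assignment for the four nucleotides (not given in the text).
VERTICES = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "U": (1.0, 0.0)}


def cgr_points(sequence, theta=0.5, start=(0.5, 0.5)):
    """Map an RNA sequence to a list of (x, y) points in the unit square."""
    x, y = start
    points = []
    for nucleotide in sequence.upper().replace("T", "U"):
        vx, vy = VERTICES[nucleotide]
        # Move a fraction theta of the way towards the nucleotide's vertex.
        x += theta * (vx - x)
        y += theta * (vy - y)
        points.append((x, y))
    return points


# Short hypothetical sequence, for illustration only.
points = cgr_points("AGCU")
print(points)
```

Because each step halves the distance to a corner, the last few nucleotides determine which sub-square a point falls in, which is what makes a grid over the CGR plane informative about local sequence content.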

Figure 3

CGR of the miRNA Named hsa-mir-135

Sequence Similarity for miRNAs

Information on miRNA sequences is mapped to Euclidean space, and its region distance is utilized to quantify the similarity of miRNA sequence. It will be easy to implement assignments such as miRNA sequence recognition, verification, and clustering using standard methods with DBMDA embeddings as feature vectors, if this space has been built. First, we downloaded 1,057 miRNA precursor sequences from the miRBase. Second, each nucleotide is mapped to a Euclidean space, and the CGR space is separated from the appropriately sized grid. After that, average coordinate in each quadrant is figured (Figure 4). Third, the regional distance between each miRNA and other miRNAs is calculated. The region distance as was figured by:where indicates the -th grid, represents the penalty parameter, and, is the average coordinate of in . Fourth, the calculation of the similarity between sequences at any scale was based on the region distance , defined as follows:Finally, we used a grid to get the distance-based similarity matrix of nucleotide length . (). Therefore, each miRNA sequence could be described by a 1,057-dimensional vector:
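The grid-averaging step and a region distance can be sketched as follows. This is only an illustration of the idea, not the paper's formula: the per-cell mean follows the description above, but the way the distance is accumulated (Euclidean distance between cell means, plus a fixed penalty for cells occupied by only one of the two sequences) is an assumption, as are the function names, the default penalty of 1.0, and the hand-made points.

```python
import math


def grid_means(points, k=8):
    """Average coordinate of the points falling in each cell of a k x k grid."""
    sums = {}
    for x, y in points:
        cell = (min(int(x * k), k - 1), min(int(y * k), k - 1))
        sx, sy, count = sums.get(cell, (0.0, 0.0, 0))
        sums[cell] = (sx + x, sy + y, count + 1)
    return {cell: (sx / n, sy / n) for cell, (sx, sy, n) in sums.items()}


def region_distance(means_a, means_b, penalty=1.0):
    """Assumed distance: per-cell gap between means, or a penalty if unmatched."""
    total = 0.0
    for cell in set(means_a) | set(means_b):
        if cell in means_a and cell in means_b:
            total += math.dist(means_a[cell], means_b[cell])
        else:
            total += penalty
    return total


# Hand-made CGR-like points for two hypothetical sequences.
points_a = [(0.10, 0.10), (0.12, 0.10), (0.90, 0.90)]
points_b = [(0.10, 0.10), (0.60, 0.60)]

means_a, means_b = grid_means(points_a), grid_means(points_b)
d_ab = region_distance(means_a, means_b)
print(round(d_ab, 4))
```

In this toy case the two sequences share one occupied cell, whose means differ by about 0.01, and each has one cell the other lacks, so the assumed distance is about 2.01.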

Figure 4

The Flowchart for Quantifying Sequence Similarity Using Regional Distance

(A) The CGR of hsa-mir-27a is plotted with the average coordinates for each cell of the 8 × 8 grid. (B) The CGR of hsa-mir-651 is plotted with the average coordinates for each cell of the 8 × 8 grid. (C) Computing the region distances between hsa-mir-27a and hsa-mir-651.

Rotation Forest

Rotation forest (RoF) independently trains $L$ decision trees using different extracted feature sets. Rodríguez et al. defined $F$ as the set of features (attributes), $X$ as the $N \times n$ matrix that represents the training data ($N$ samples with $n$ features), and $D_1, \ldots, D_L$ as the ensemble of classifiers. Each classifier is trained separately on its own bootstrap sample. The improvement of RoF is to extract features and rebuild a complete training set for each decision tree $D_i$ in the ensemble. Specifically, RoF randomly divides the feature set into $K$ subsets and runs principal-component analysis (PCA) on each subset separately. The data are then mapped into the new feature space and used to train classifier $D_i$. Different subsets yield different extracted features, which, together with bootstrap sampling, improves the diversity of the ensemble.
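The rotation step that gives the method its name can be sketched with NumPy. This is a simplified illustration rather than the full algorithm of Rodríguez et al.: it splits the features into random subsets, runs PCA (via SVD) on a bootstrap sample of each subset, and assembles the principal axes into one block rotation matrix; the class-subset selection of the original algorithm and the training of the decision tree on the rotated data are omitted. The function name and the random toy data are hypothetical.

```python
import numpy as np


def rotation_matrix(X, n_subsets, rng):
    """Build one rotation matrix from PCA on random feature subsets."""
    n_samples, n_features = X.shape
    subsets = np.array_split(rng.permutation(n_features), n_subsets)
    R = np.zeros((n_features, n_features))
    for cols in subsets:
        # Bootstrap sample of the rows for this feature subset.
        rows = rng.choice(n_samples, size=n_samples, replace=True)
        sub = X[np.ix_(rows, cols)]
        sub = sub - sub.mean(axis=0)
        # PCA via SVD: the rows of vt are the principal axes.
        _, _, vt = np.linalg.svd(sub, full_matrices=False)
        R[np.ix_(cols, cols)] = vt.T
    return R


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))          # toy data: 20 samples, 6 features
R = rotation_matrix(X, n_subsets=3, rng=rng)
X_rotated = X @ R                     # input for one tree of the ensemble
print(X_rotated.shape)
```

Repeating this with a fresh random split for each tree yields a different rotation per classifier, which is the source of the ensemble's diversity; all principal components are kept, so no information is discarded by the rotation.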

Method Overview

DBMDA assumes that functionally similar diseases tend to be related to similar miRNAs, an assumption that is also used to compute associations between target proteins and drugs. DBMDA has three main steps: first, choosing positive and negative examples; second, assembling feature vectors from the miRNA and disease similarity matrices; and third, building a prediction model to score potential miRNA-disease pairs. Each step is described in more detail below.

First, we constructed the training examples. We analyzed HMDD v.3.0 and selected the known miRNA-disease associations as positive samples. We then combined the positive samples with negative samples to build the training set. Negative samples were selected in three steps: (1) we selected a disease at random from all known diseases (850), (2) chose a miRNA in the same way, and (3) kept the miRNA-disease pair as a negative sample if it was not among the positive samples.

Second, we built the feature set. We gathered three disease matrices, namely a Gaussian profile kernel similarity matrix and two semantic similarity matrices, into feature vectors as disease features. The feature vector of disease $d(i)$ is described as follows:

$$F_d(d(i)) = (v_1, v_2, \ldots, v_{850})$$

where $F_d(d(i))$ is the $i$-th row vector of the integrated disease similarity matrix, and $v_j$ is the combined similarity value between disease $d(i)$ and disease $d(j)$. In the same way, we calculated each of the 1,057 similarity values to construct a 1,057-dimensional feature vector from the Gaussian interaction profile kernel similarity matrix as follows:

$$F_r(r(j)) = (w_1, w_2, \ldots, w_{1057})$$

where $F_r(r(j))$ is the $j$-th column vector of the integrated miRNA similarity matrix, and $w_k$ is the combined similarity value of miRNA $r(j)$ and miRNA $r(k)$. Each miRNA-disease sample can then be described as a 1,907-dimensional vector $F = (F_d, F_r)$, where $F_d$ holds the 850 combined similarity values of the disease and $F_r$ holds the 1,057 combined similarity values of the miRNA.

After that, we reduced $F$ with an autoencoder (AE) from 1,907 to 32 dimensions, and the sequence feature matrix was reduced in the same way from 1,057 to 32 dimensions. We defined each miRNA-disease sample as the 64-dimensional vector formed by concatenating the two 32-dimensional vectors.

Finally, we used RoF to build the prediction model from the training set. The 64-dimensional vectors obtained in the previous steps were used as the training set. The training samples were fed into RoF, and a model for predicting potential miRNA-disease associations was built. The workflow of the DBMDA model is shown in Figure 5.
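The three-step negative sampling described in the method overview can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the fixed random seed, the toy identifiers, and the choice to draw a caller-specified number of distinct negatives are all hypothetical; the text does not say how many negative samples were drawn.

```python
import random


def sample_negatives(positives, diseases, mirnas, n_samples, seed=0):
    """Draw random (miRNA, disease) pairs that are not known associations."""
    rng = random.Random(seed)
    positives = set(positives)
    capacity = len(diseases) * len(mirnas) - len(positives)
    if n_samples > capacity:
        raise ValueError("not enough unlabeled pairs to sample from")
    negatives = set()
    while len(negatives) < n_samples:
        disease = rng.choice(diseases)   # step 1: random disease
        mirna = rng.choice(mirnas)       # step 2: random miRNA
        pair = (mirna, disease)
        if pair not in positives:        # step 3: keep only unknown pairs
            negatives.add(pair)
    return sorted(negatives)


# Toy data, for illustration only.
diseases = ["disease_1", "disease_2", "disease_3"]
mirnas = ["mirna_a", "mirna_b", "mirna_c", "mirna_d"]
positives = [("mirna_a", "disease_1"), ("mirna_b", "disease_2"), ("mirna_d", "disease_3")]

negatives = sample_negatives(positives, diseases, mirnas, n_samples=len(positives))
print(negatives)
```

Pairs sampled this way are unlabeled rather than verified non-associations, so some of them may be true associations that have not yet been discovered.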

Figure 5

The Workflow of DBMDA Model to Predict Potential miRNA-Disease Associations

Availability of Data and Materials

The datasets analyzed during the current study are available from the corresponding author on reasonable request.

Author Contributions

K.Z. conceived the algorithm, analyzed it, conducted the experiment, and wrote the manuscript. K.Z. and L.W. prepared the dataset. L.-P.L., Z.-W.L., and Y.Z. analyzed the experiment. The final draft was read and approved by all authors.

Conflicts of Interest

The authors declare no competing interests.

39 in total

Review 1. microRNAs: tiny regulators with great potential.

Authors: V Ambros
Journal: Cell Date: 2001-12-28 Impact factor: 41.582

Review 2. The functions of animal microRNAs.

Authors: Victor Ambros
Journal: Nature Date: 2004-09-16 Impact factor: 49.962

3. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases.

Authors: Dong Wang; Juan Wang; Ming Lu; Fei Song; Qinghua Cui
Journal: Bioinformatics Date: 2010-05-03 Impact factor: 6.937

4. miR-211 promotes the progression of head and neck carcinomas by targeting TGFβRII.

Authors: Ting-Hui Chu; Cheng-Chieh Yang; Chung-Ji Liu; Mann-Tin Lui; Shu-Chun Lin; Kuo-Wei Chang
Journal: Cancer Lett Date: 2013-05-29 Impact factor: 8.679

5. MicroRNA 15a, inversely correlated to PKCα, is a potential marker to differentiate between benign and malignant renal tumors in biopsy and urine samples.

Authors: Melanie von Brandenstein; Jency J Pandarakalam; Lukas Kroon; Heike Loeser; Jan Herden; Gabriele Braun; Katherina Wendland; Hans P Dienes; Udo Engelmann; Ullrich Engelmann; Jochen W U Fries
Journal: Am J Pathol Date: 2012-03-17 Impact factor: 4.307

6. Chaos game representation of gene structure.

Authors: H J Jeffrey
Journal: Nucleic Acids Res Date: 1990-04-25 Impact factor: 16.971

7. MiRGOFS: a GO-based functional similarity measurement for miRNAs, with applications to the prediction of miRNA subcellular localization and miRNA-disease association.

Authors: Yang Yang; Xiaofeng Fu; Wenhao Qu; Yiqun Xiao; Hong-Bin Shen
Journal: Bioinformatics Date: 2018-10-15 Impact factor: 6.937

8. FMLNCSIM: fuzzy measure-based lncRNA functional similarity calculation model.

Authors: Xing Chen; Yu-An Huang; Xue-Song Wang; Zhu-Hong You; Keith C C Chan
Journal: Oncotarget Date: 2016-07-19

9. microRNAs and genetic diseases.

Authors: Nicola Meola; Vincenzo Alessandro Gennarino; Sandro Banfi
Journal: Pathogenetics Date: 2009-11-04

10. WBSMDA: Within and Between Score for MiRNA-Disease Association prediction.

Authors: Xing Chen; Chenggang Clarence Yan; Xu Zhang; Zhu-Hong You; Lixi Deng; Ying Liu; Yongdong Zhang; Qionghai Dai
Journal: Sci Rep Date: 2016-02-16 Impact factor: 4.379
