
Applications of Machine Learning in miRNA Discovery and Target Prediction.

Alisha Parveen¹, Syed H Mustafa¹, Pankaj Yadav¹, Abhishek Kumar¹.

Abstract

MicroRNA (miRNA) is a small non-coding RNA molecule involved in gene regulation and RNA silencing through complementary base pairing with its targets. Experimental methods for target identification can be time-consuming and expensive. Computational approaches have therefore been applied to ease these difficulties of experimental studies. However, there is still a need for an optimized approach in miRNA biology. Machine learning (ML) could therefore open a new era of research in miRNA biology directed towards potential disease biomarkers. In this article, we describe the application of ML approaches in miRNA discovery and target prediction, together with functional annotation and future perspectives. The adoption of new computational methodologies in this direction should enable further advances in miRNA discovery.


Keywords: feature generation; feature selection; gene expression; machine learning; microRNA; target prediction

Year: 2019 PMID: 32581642 PMCID: PMC7290058 DOI: 10.2174/1389202921666200106111813

Source DB: PubMed Journal: Curr Genomics ISSN: 1389-2029 Impact factor: 2.236

INTRODUCTION

MicroRNAs (miRNAs) are endogenous, short (around 22 nucleotides) non-coding RNA molecules that play an important role in gene regulation in animals, plants and viruses. A miRNA regulates the post-transcriptional expression of a gene by base pairing with a region of its target transcript, which leads to translational repression or endonucleolytic cleavage of the coding transcript [1]. miRNAs are known to be involved in several biological processes such as muscle development, hematopoiesis, apoptosis, immune response, aging and signal transduction, depending upon target regulation (Fig. ). Moreover, they play a central role during embryogenesis, contributing to tissue identity and differentiation. They can act as biomarkers for several fatal diseases including cancer, heart disease and neurological disorders [2]. During the 1990s, Lee et al. discovered the first miRNA, lin-4, in the nematode Caenorhabditis elegans [3, 4]. Several years later, the let-7 gene was reported in the same nematode species, with similarity to lin-4 [5]. During biogenesis, RNA polymerase II (RNA pol II) transcribes the miRNA gene into a long primary transcript (pri-miRNA), which the Drosha enzyme processes into a precursor miRNA (pre-miRNA) in the nucleus. The Exportin-5 protein transports the pre-miRNA into the cytoplasm for further processing, a step coupled to hydrolysis of the Ran-GTP complex. During the maturation phase, the ATP-dependent protein Dicer recognizes the pre-miRNA and processes it into the miRNA-miRNA* duplex. In the duplex, one strand is the antisense strand, which contains G:U wobble base pairs, other matches, mismatches and unpaired bases at the 5′ end, whereas the other strand is the sense strand [2]. Previous studies found that the duplex is loaded onto the Argonaute-1 (AGO-1) protein within the RNA-induced silencing complex (RISC), and that this complex is guided to the target mRNA, leading to post-transcriptional regulation of gene expression, as described in Fig. ().
There are two different post-transcriptional processes: endonucleolytic cleavage and translational repression. In endonucleolytic cleavage, the miRNA sequence aligns extensively with its target transcript, the poly(A) tail is removed, and the target is silenced by the AGO protein, leading to target fragmentation [6]. In translational repression, the miRNA sequence aligns partially with a binding region of its target transcript, which hinders the binding of ribosomes to the target and results in the inhibition of polypeptide synthesis [6, 7]. Significant progress has been made in the discovery of miRNAs and the prediction of their targets. This includes the determination of several physical and functional characteristics of miRNAs that are indicative of their functions and targets, such as folding patterns, thermodynamic properties, and sequence conservation [8]. Machine-learning (ML) methods are useful for predictive and analytical studies that identify miRNAs and their target genes involved in different diseases [9]. This article presents recent advances in the application of ML in microRNA biology and biomedical research. Furthermore, we present the fundamental concepts behind various classification methods, including supervised and unsupervised approaches.

APPLICATION OF MACHINE LEARNING IN MicroRNA IDENTIFICATION

The generalized process for the application of machine learning in miRNA identification is shown in Fig. () and summarized in Table 1. Training a miRNA identification classifier involves several steps:
a) Data mining is a crucial step for training the classifier model, because it extracts and identifies features from a dataset.
b) To generate the positive dataset, hairpin sequences of miRNAs are extracted from experimentally verified databases [10] and passed through several levels of filtration to obtain a high-confidence positive set.
c) The negative dataset is equally important for training, so that the classifier can distinguish between negative and positive examples. An excess of negative or positive examples can lead to an overfitted or underfitted model, respectively. For instance, the positive sample set should consist of miRNA duplexes derived from experimentally validated miRNAs. To avoid redundant information, only one miRNA duplex should be included in the positive sample set if both the 5′ and 3′ strands of the duplex are functional. The negative samples should consist of pseudo miRNA duplexes derived from segments randomly selected from pre-miRNA hairpins. The miRNA secondary structure can be predicted using the RNAfold program of the Vienna software package [11].
d) The learning classifier is trained after the positive and negative datasets have been generated. Features can also be generated by a separate algorithm based on prediction of the mature miRNA sequences.
e) The features are computed and used to train the classifier and to build a model for identification of miRNA sequences, on platforms such as the scikit-learn Python package and the Keras package for deep learning [12].
f) The best classifier model for the identification of miRNA is selected based upon the cross-validation results.
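The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not any published tool: the hairpins are made-up toy data (in practice the dot-bracket structures would come from a folding program such as RNAfold), the two features (GC content and fraction of paired bases) are chosen only for simplicity, and a nearest-centroid rule stands in for a real classifier such as those provided by scikit-learn. The helper names `toy_hairpin`, `features` and `k_fold_accuracy` are hypothetical.

```python
import random

def toy_hairpin(stem, loop, unit):
    """Build a made-up (sequence, dot-bracket structure) pair for illustration."""
    structure = "(" * stem + "." * loop + ")" * stem
    sequence = (unit * len(structure))[:len(structure)]
    return sequence, structure

def features(sequence, structure):
    """Two simple features: GC content and fraction of paired bases."""
    gc = sum(base in "GC" for base in sequence) / len(sequence)
    paired = sum(ch in "()" for ch in structure) / len(structure)
    return [gc, paired]

def train(X, y):
    """Nearest-centroid stand-in for a classifier: one mean vector per class."""
    model = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        model[label] = [sum(col) / len(col) for col in zip(*rows)]
    return model

def predict(model, x):
    """Return the label whose centroid is closest to x."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))
    return min(model, key=lambda label: dist2(model[label]))

def k_fold_accuracy(X, y, k=5, seed=0):
    """Plain k-fold cross-validation: train on k-1 folds, test on the held-out fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in idx if i not in held_out]
        model = train([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict(model, X[i]) == y[i] for i in fold)
    return correct / len(X)

# Hypothetical positive set (long stems) and negative set (pseudo hairpins).
positives = [toy_hairpin(stem, 4, "GCAU") for stem in range(9, 14)]
negatives = [toy_hairpin(stem, 12, "AUAG") for stem in range(2, 7)]
X = [features(seq, struct) for seq, struct in positives + negatives]
y = [1] * len(positives) + [0] * len(negatives)
print("5-fold accuracy:", k_fold_accuracy(X, y, k=5))
```

The same skeleton applies when the toy features are replaced by real thermodynamic and conservation features and the centroid rule by an SVM, random forest or neural network; only `features`, `train` and `predict` would change.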

SUPPORT VECTOR MACHINE (SVM)

Support vector machine (SVM) is a popular discriminative classifier that has been shown to be efficient in dealing with classification problems. SVM takes labeled data as input and generates decision hyperplanes [13]. Fig. () illustrates two linearly separable classes in a 2D Cartesian plane, with multiple candidate lines that each solve the separation problem. The same concept is applied to problems where more than two categories have to be classified. The goal of the SVM is to find the optimal separating line (hyperplane), the one that lies as far as possible from the data points of both classes [13]. The SVM learning problem thus consists of choosing, among the many possible separating lines, the one with the largest minimum distance to the labelled data. Based on the SVM, there are several methods for identification of miRNA sequences, as described below:
a) miR-abela is an SVM-based classifier program for predicting mammalian miRNAs, and it has shown high specificity on a dataset of 40 pre-miRNAs. The features used for pre-miRNA sequence prediction include thermodynamic energy, loop length, conservation and stem length. However, this algorithm has low specificity for the identification of mature miRNA sequences [14].
b) Triplet-SVM uses triplets of nucleotides to generate structural and sequence features that indicate the pairing state of every three adjacent nucleotides, by which true miRNA hairpins are separated from pseudo hairpins [15]. Based on classifier performance, it has shown higher accuracy for genomic data of animals than for lower species [15].
c) RNAz is based upon selected features such as thermodynamic stability, conservation, and sequence and structural properties for predicting structural non-coding RNAs and cis-acting regulatory elements of mRNAs. It can be used to detect functional RNA structures in genome-wide screens. This algorithm has high sensitivity, but it also has a high type I (false-positive) error rate [16].
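The idea of the largest minimum distance can be illustrated with a small sketch. It does not solve the SVM optimization problem; it only compares a handful of hand-picked candidate lines on made-up 2D points and keeps the separating line whose closest point is farthest away. The function names and the data are assumptions made for this illustration.

```python
import math

def min_distance(w, b, points):
    """Smallest perpendicular distance from any point to the line w.x + b = 0."""
    norm = math.hypot(w[0], w[1])
    return min(abs(w[0] * x + w[1] * y + b) / norm for x, y in points)

def separates(w, b, positives, negatives):
    """True if all positives lie on the positive side and all negatives on the other."""
    pos_ok = all(w[0] * x + w[1] * y + b > 0 for x, y in positives)
    neg_ok = all(w[0] * x + w[1] * y + b < 0 for x, y in negatives)
    return pos_ok and neg_ok

def best_line(candidates, positives, negatives):
    """Among separating candidates, pick the one with the largest minimum distance."""
    valid = [(w, b) for w, b in candidates if separates(w, b, positives, negatives)]
    return max(valid, key=lambda wb: min_distance(wb[0], wb[1], positives + negatives))

# Made-up, linearly separable toy data.
positives = [(3, 3), (4, 3), (3, 4)]
negatives = [(0, 0), (1, 0), (0, 1)]

# Hand-picked candidate lines written as ((w1, w2), b).
candidates = [
    ((1, 1), -2),    # separates, but passes close to the negative class
    ((1, 1), -3.5),  # separates with the widest margin of the candidates
    ((1, 0), -2),    # vertical line x = 2, separates with a smaller margin
    ((1, 1), -7),    # does not separate the classes
]

w, b = best_line(candidates, positives, negatives)
print("chosen line:", w, b, "margin:", min_distance(w, b, positives + negatives))
```

A real SVM finds this line by optimization over all possible hyperplanes rather than by enumeration, and kernel functions extend the same principle to data that are not linearly separable.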

HIDDEN MARKOV MODEL (HMM)

The Hidden Markov Model (HMM) is a statistical model that uses probability distributions to model sequential data such as time series. An HMM can be applied to a stochastic process that has unobservable (hidden) states. HMM-based algorithms are frequently applied to miRNA identification [17].
a) ProMir is a web server for the prediction of non-coding miRNAs in a query sequence. The ProMir method is trained with loop-based features generated from a probabilistic score. ProMir II is an optimized version used to identify both conserved and non-conserved miRNA sequences. The latest version of ProMir implements a miRNA identification score and several filtering criteria, including free energy, GC content and conservation [18].
b) MiRRim uses evolutionary and structural features, represented as a multidimensional vector, to train the classifier. In the miRNA structure, the stem region is more conserved than the loop region, and the surrounding regions are also less conserved across different species [19].
c) HHMMiR uses thermodynamic energy, similar to the RNAfold program [11], as the selected feature to train a hierarchical HMM classifier that predicts the hairpin-loop structure of miRNAs without relying on evolutionary conservation. This method has high sensitivity and specificity for determining functional roles of the miRNAs [20].
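To make the notion of hidden states concrete, the following sketch decodes a toy two-state HMM with the Viterbi algorithm. It is not the model used by ProMir, MiRRim or HHMMiR: the states ("stem" and "loop"), the observation symbols ("P" for a paired base, "U" for an unpaired base) and all probabilities are assumptions chosen for illustration.

```python
import math

# Hypothetical toy parameters, not taken from any published tool.
STATES = ("stem", "loop")
START = {"stem": 0.8, "loop": 0.2}
TRANS = {"stem": {"stem": 0.9, "loop": 0.1},
         "loop": {"stem": 0.2, "loop": 0.8}}
EMIT = {"stem": {"P": 0.9, "U": 0.1},
        "loop": {"P": 0.1, "U": 0.9}}

def to_observations(structure):
    """Map a dot-bracket string to 'P' (paired) / 'U' (unpaired) symbols."""
    return "".join("U" if ch == "." else "P" for ch in structure)

def viterbi(observations):
    """Return the most probable hidden-state path for the observation string."""
    scores = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
              for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state for reaching state s at this position.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][obs]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

obs = to_observations("((((....))))")
print(list(zip(obs, viterbi(obs))))
```

In an identification setting, the likelihood of a candidate sequence under an HMM trained on true hairpins would typically be compared against a background model; the decoding shown here only illustrates how hidden states are inferred from observed symbols.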

NAÏVE BAYES

Naïve Bayes is a simple probabilistic classifier based on applying Bayes’ theorem with strong (naïve) independence assumptions. A more descriptive term for the underlying probability model would be the “independent feature model”. In simple terms, a naïve Bayes classifier assumes that, given the class, the presence (or absence) of a particular feature is unrelated to the presence (or absence) of any other feature. Depending on the probability model, naïve Bayes classifiers can be trained very efficiently in a supervised learning setting [21].
a) BayesMirFind was created for the identification of miRNA sequences in C. elegans and mice. It applies a comparative post-filtering technique to a large set of sequence and structural features, providing >80% sensitivity and >90% specificity. However, its overall classifier performance is poor compared to other algorithms [21].
b) miR-KDE was developed by Chang et al. and treats pre-miRNA identification as a classification problem. For extraction of pre-miRNAs in humans, this method uses hairpin sequence and structural features collected from previously published work. miR-KDE incorporates a variable kernel density method to classify RNA sequences from the generated set of features. Experimentally verified pre-miRNAs collected from 40 species are used to evaluate the overall performance of the classifier [22].
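The independence assumption can be seen directly in code: the class score is the class prior multiplied by one per-feature probability at a time. The sketch below is a generic categorical naïve Bayes with add-one (Laplace) smoothing, written for illustration only. It is not the BayesMirFind implementation, and the three categorical features and the eight training examples are made up.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples, labels):
    """Count classes and per-class feature values for categorical naive Bayes."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    values = defaultdict(set)              # feature index -> set of observed values
    for x, label in zip(samples, labels):
        for i, value in enumerate(x):
            feature_counts[(label, i)][value] += 1
            values[i].add(value)
    return class_counts, feature_counts, values

def predict_nb(model, x):
    """Pick the class with the highest log posterior (add-one smoothing)."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for i, value in enumerate(x):
            # Each feature contributes independently, given the class.
            count = feature_counts[(label, i)][value]
            score += math.log((count + 1) / (n + len(values[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data: (GC content, stem length, conserved?)
samples = [
    ("high", "long", "yes"), ("low", "long", "yes"),
    ("high", "long", "no"), ("high", "long", "yes"),
    ("low", "short", "no"), ("high", "short", "no"),
    ("low", "short", "yes"), ("low", "long", "no"),
]
labels = ["miRNA"] * 4 + ["pseudo"] * 4

model = train_nb(samples, labels)
print(predict_nb(model, ("high", "long", "yes")))
print(predict_nb(model, ("low", "short", "no")))
```

Because training reduces to counting, the model can be fitted in a single pass over the data, which is the efficiency noted above.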

APPLICATION OF LEARNING CLASSIFIER IN THE TARGET PREDICTION

Previous research shows that a miRNA sequence has a seed region, 6-8 nucleotides in length, at the 5′ end of the mature miRNA that aligns with the target mRNA [8]. The binding of a miRNA to its target through different types of pairing sites is an important feature (Fig. ). The steady growth in the amount of genomic and proteomic data calls for accurate and precise prediction algorithms that go beyond traditional rule-based methods [9]. Unlike rule-based prediction, in machine learning-based prediction the decision rules are not created manually; instead, features are generated and used to train a learning classifier [23]. Several machine learning-based miRNA target prediction algorithms were developed in the last decade. The general process of target prediction is as follows:
a) For each miRNA, identify the putative binding sites in validated target (positive) and non-target (negative) mRNAs based on seed site complementarity and other features such as thermodynamic energy and global alignments.
b) Extract or identify features from these interactions (irrespective of whether they are functional or not).
c) Train a classifier to distinguish between targets and non-targets.
d) For an unknown miRNA-mRNA pair, use the classifier to label it as positive (target) or negative (non-target).
A few of the popular machine learning-based algorithms for target prediction are described below:
a) MBSTAR is based on the random forest (RF) algorithm and combines 31 structural and 340 sequence features, to which an unsupervised learning algorithm is applied to select 40 putative features for miRNA-mRNA interaction. After 5-fold cross-validation, the top six multiple instance learning (MIL) techniques are considered to evaluate the classifier. In a comparison of different learning classifiers, RF was shown to have the highest accuracy. In further analysis, MBSTAR reports more putative binding sites at nucleotide conversion positions in biological sequences. The method is built on mature miRNA sequences downloaded from the miRBase database and 3′ UTR sequences of mRNAs for feature extraction [24].
b) TarPmiR detects miRNAs and their corresponding target sites by applying a random forest that integrates six conventional features and seven newly generated features for prediction. To generate the negative dataset, TarPmiR applies randomization to target sites with equal expression level and molecular functions. The selection of a proper negative dataset is based on several criteria, namely no overlap between positive and negative sites and low binding energy. In evaluation, TarPmiR shows good results on a PAR-CLIP dataset from the human HEK293 cell line [25].
c) DeepTarget combines supervised and unsupervised learning. It uses autoencoders to learn sequence-based interaction features, which are used to train a recurrent neural network (RNN) model for target prediction. DeepTarget has a high level of accuracy and eliminates the need for manually curated features. Patterns at particular nucleotide positions in the RNN layer correspond to the interaction. DeepTarget is reported to be a substantial advance on the longstanding challenge of robust miRNA target prediction (Website, http://data.snu.ac.kr/pub/deepTarget/).
d) NBmiRTar is a Naïve Bayes learning approach. To train this classifier, features are extracted from the seed and out-seed regions filtered from the output of the miRanda tool. Incorporating both seed and out-seed structural and sequence features improves the performance of NBmiRTar. For the negative dataset, artificial mature miRNAs are generated; these consist of random strings of the nucleotides A, C, G and U drawn with probabilities of 0.34, 0.19, 0.18 and 0.29, respectively, which are not consistent with the base frequencies in true miRNAs. Several filters based on free energy and conservation are applied to the artificial interactions to produce the negative dataset, which is then used as input to the classifier [26].
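Two of the building blocks described above, seed-site complementarity and the random generation of artificial miRNAs for a negative set, can be sketched as follows. This is a simplified illustration and not the code of any of the listed tools: it considers only perfect Watson-Crick matches to miRNA positions 2-8 (no G:U wobble, no thermodynamics), the UTR fragment is made up, and the function names `seed_sites` and `artificial_mirna` are hypothetical. The base probabilities are the ones quoted above for NBmiRTar.

```python
import random

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def seed_sites(mirna, utr, start=1, end=8):
    """Return 0-based UTR positions perfectly complementary to the miRNA seed.

    With the defaults, the seed is miRNA positions 2-8 (1-based). Both
    sequences are written 5'->3', so the site is the seed's reverse complement.
    """
    seed = mirna[start:end]
    site = "".join(COMPLEMENT[base] for base in reversed(seed))
    return [i for i in range(len(utr) - len(site) + 1)
            if utr[i:i + len(site)] == site]

def artificial_mirna(length=22, rng=None):
    """Random 'mature miRNA' with A, C, G, U drawn at 0.34, 0.19, 0.18, 0.29."""
    rng = rng or random.Random()
    return "".join(rng.choices("ACGU", weights=[0.34, 0.19, 0.18, 0.29], k=length))

# let-7a is used only as an illustrative input; the UTR fragment is made up.
let7a = "UGAGGUAGUAGGUUGUAUAGUU"
toy_utr = "AAUGCUACCUCAGGACUACCUCUU"
print("seed sites:", seed_sites(let7a, toy_utr))

# Artificial miRNAs like these would then be scanned against real UTRs and
# filtered to produce negative interactions.
decoy = artificial_mirna(rng=random.Random(0))
print("decoy sites:", seed_sites(decoy, toy_utr))
```

In a full pipeline, each site found this way would be expanded into a feature vector (for example seed type, binding energy and conservation) before being passed to the classifier.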

MACHINE LEARNING IN FUNCTIONAL CHARACTERIZATION OF MIRNA

In the earlier sections, we provided a brief overview of several machine learning-based methods for miRNA identification and target prediction. Across genomics, machine learning is a useful tool for interpreting large amounts of data [27]. Functional annotation supports miRNA research. In this section, we describe functional information on miRNAs and their involvement in cancer and other disease pathways obtained using machine-learning techniques. Functional analysis that reveals the involvement of miRNAs in physiological processes is essential for determining their association with diseases [28]. miRNAs are emerging therapeutic agents against diseases. Several databases that provide functional information on miRNAs have therefore been developed in the past few decades, such as DAVID [29], a knowledge-based resource for visualization of functional annotation, and miRDB [30], an online database based on functional information from miRNA target prediction. There are two statistical tools for evaluating miRNA functions in different diseases, namely MAGIA (miRNA and genes integrated analysis) [31] and FAME (Functional Assignment of miRNAs via Enrichment) [32]. These methods have several limitations because they depend on target prediction. Their main disadvantage is that they cannot predict whether a miRNA binds outside the predicted binding region, so they cannot support functional analysis beyond that region. GenMiR++ is a generative Bayesian inference method with high sensitivity that is used to evaluate expression profiles. GenMiR++ provides a balanced score between predicted and other generated miRNA-target interactions and identifies their functional annotation. To test the performance of this approach, biological process (BP) annotations were collected from the Gene Ontology (GO) annotation database. GenMiR++ assigns consistent, high-confidence scores to the set of functional targets predicted from the sequence-based predictions [33].
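The intuition behind using expression profiles can be sketched with a simple correlation score: if a miRNA represses a target, the two expression profiles tend to be negatively correlated across samples. The example below is not GenMiR++, which uses a Bayesian model; it only ranks sequence-predicted candidates by Pearson correlation. The gene names and expression values are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def rank_targets(mirna_expr, candidates):
    """Sort candidate targets from most negatively to most positively correlated."""
    scored = [(gene, pearson(mirna_expr, expr)) for gene, expr in candidates.items()]
    return sorted(scored, key=lambda item: item[1])

# Made-up expression values across five samples.
mirna_expr = [1, 2, 3, 4, 5]
candidates = {
    "geneA": [10, 8, 6, 4, 2],   # falls as the miRNA rises
    "geneB": [2, 4, 6, 8, 10],   # rises with the miRNA
    "geneC": [5, 3, 6, 4, 5],    # no clear trend
}

for gene, r in rank_targets(mirna_expr, candidates):
    print(f"{gene}: r = {r:.2f}")
```

Candidates at the top of this ranking are the ones whose expression data are most consistent with repression; a probabilistic method additionally accounts for several miRNAs acting on the same target and for noise in the profiles.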

FURTHER DIRECTIONS

With the growing number of novel therapeutic discoveries, an identified biomarker must undergo a rigorous validation procedure before entering clinical trials. For example, in recent studies, scientists applied advanced learning techniques to determine the stability of tacrolimus (an immunosuppressive drug) in renal transplantation [34]. Deep learning has recently been receiving more attention in genomics. Pharma companies have been trying to develop new learning technology, which has resulted in ventures such as Deep Genomics (costing approx. $13 million) and iCarbonX. With this high-end learning technology, scientists have been able to understand the human genome faster and more accurately. Chen et al. investigated how an advanced learning model could improve the accuracy of identifying metabolism defects [35]. The study primarily covered chemical compounds associated with specific conditions and reduced the number of false positives: from 21 to 2 for phenylketonuria, from 30 to 10 for hypermethioninemia, and from 209 to 46 for 2-methyl crotonyl-CoA-carboxylase deficiency [35]. ML applications are of diverse importance in genomics and personalized medicine, where they can produce fast and accurate models before further clinical trials, beyond what traditional approaches offer. Companies are expecting to adopt advanced learning technology to address unsolved questions in their fields.

CONCLUSION

The discovery of miRNAs has opened a new era in molecular biology worldwide. It excited researchers, and the development of computational approaches in miRNA biology has grown rapidly. An enormous amount of genomic data has been generated by molecular biology techniques such as next-generation sequencing and microarrays. The application of ML has helped overcome not only the difficulties of the experimental procedures involved in miRNA discovery and target prediction but also the limitations of conservation-based computational approaches. Several ML approaches, such as SVM, HMM, naïve Bayes, and deep learning (advanced artificial neural networks, an optimized version of the neural network classifier), have provided an efficient framework for identifying novel miRNAs and predicting their targets with high accuracy and a lower false positive rate. In deep learning, the recurrent neural network is naturally suited to modeling sequential biological molecules such as DNA, RNA, proteins, and miRNA. Therefore, further advances in ML approaches are needed in miRNA research for novel identification, target prediction and functional annotation for clinical biomarkers.

Table 1

Summary of different tools of microRNA prediction, target prediction and functional annotation.

A. miRNA Prediction
Tools	Algorithm	Positive	Negative	Feature
miR-abela	SVM	miRBase	Coding region	Structure
Triplet-SVM	SVM	Rfam	Pseudo miRNA hairpin	Structure and sequence feature
miPred	Random forest	Registry database	Pseudo miRNA hairpin	Structure and thermodynamics
ProMir	Probabilistic colearning	Mature pre-miRNA	Randomly extracted stem loop	Structure, thermodynamics, and sequence
MiRRim	Hidden Markov Model	Conserved miRNA	Non-conserved, moderately conserved, and highly conserved	Sequence alignment
HHMMiR	Hidden Markov Model	microRNA registry	Coding region	Structure and thermodynamics
B. Target Prediction
Tools	Algorithm	Positive	Negative	Feature
MBSTAR	Random forest	miRBase	Randomly generated	Sequential and Structural
NBmiRTar	Naïve Bayes	Tarbase	Probability Randomization	Sequence
TargetBoost	Genetic Programming	let-7, lin-4, miR-13a, and bantam	Random string with same frequency	Sequence
TarPmiR	Random forest	CLASH	Reshuffling site of target site	Structure
DeepTarget	Recurrent Neural Network	miRecords and miRBase	Mocking in alignment	Sequence
TargetMiner	SVM	miRecords	Randomly generated	Seed
C. Functional annotation of miRNA
Tools	Algorithm	Positive	Negative	Feature
GenMiR++	Bayesian learning	Tarbase and miRecords	Negative correlation in expression profiles	Sequence and expression data
Joung et al.	Probabilistic learning	Expression profile	Low profile expression data	Parametric adjusted population size, and minimum subset size
Tran et al.	Rule based	Expression profile from human cancer	Randomly generated	Alignment

4 in total

1. Machine-learning-based Analysis Identifies miRNA Expression Profile for Diagnosis and Prediction of Colorectal Cancer: A Preliminary Study.

Authors: Dorota Pawelka; Izabela Laczmanska; Pawel Karpinski; Stanislaw Supplitt; Wojciech Witkiewicz; Barłomiej Knychalski; Joanna Pelak; Paulina Zebrowska; Lukasz Laczmanski
Journal: Cancer Genomics Proteomics Date: 2022 Jul-Aug Impact factor: 3.395