Yi Zhang, Haiyun Huang, Dahan Zhang, Jing Qiu, Jiasheng Yang, Kejing Wang, Lijuan Zhu, Jingjing Fan, Jialiang Yang.
Abstract
Noncoding RNAs (ncRNAs) play important roles in various cellular activities and diseases. In this paper, we present a comprehensive review of computational methods for ncRNA prediction, which are generally grouped into four categories: (1) homology-based methods, that is, comparative methods involving evolutionarily conserved RNA sequences and structures; (2) de novo methods using RNA sequence and structure features; (3) transcriptional sequencing and assembling based methods, that is, methods designed for single-end and paired-end reads generated from next-generation RNA sequencing; and (4) RNA family specific methods, for example, methods specific for microRNAs and long noncoding RNAs. Finally, we summarize the advantages and limitations of these methods and point out a few possible future directions for ncRNA prediction. In conclusion, many computational methods have been demonstrated to be effective in predicting ncRNAs for further experimental validation. They are critical in reducing the huge number of potential ncRNAs and pointing the community to high-confidence candidates. In the future, highly efficient mapping technology, more intrinsic sequence features (e.g., motif and k-mer frequencies), and structure features (e.g., minimum free energy, conserved stem-loops, or graph structures) should be combined with next- and third-generation sequencing platforms to improve ncRNA prediction.
Year: 2017 PMID: 28553651 PMCID: PMC5434267 DOI: 10.1155/2017/9139504
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. Four popular categories of computational methods for predicting ncRNAs. (a) Homology-based methods, which compare a query RNA with known ncRNAs deposited in databases based on sequence or structure alignment; (b) de novo methods, which predict ncRNAs from primary sequence or structure based on general principles that govern ncRNA folding or on statistical tendencies of k-mer features; (c) transcriptional sequencing and assembling based methods, which utilize next-generation sequencing and transcriptome data; and (d) RNA family specific methods, which predict specific ncRNA classes.
Homology-based ncRNA function prediction methods.
| Name | URL | Feature | Prediction algorithm |
|---|---|---|---|
| BLAST | | Sequence only | BLAST |
| BLAT | | Sequence only | Pairwise alignment algorithm |
| CSHMM | | Structure only | A discriminant function based on likelihood score for a hidden Markov model (CSHMM) |
| Infernal | | Sequence and RNA secondary structure | Stochastic context-free grammars called covariance models (CMs), HMM |
| ERPIN | | Sequence and RNA secondary structure | Profile-based dynamic programming algorithm and |
| QRNA | | Sequence only | Pair hidden Markov model |
| RNAz | | RNA secondary structure and thermodynamic stability | Support vector machine regression |
| Evofold | | A log-odds score | Phylogenetic stochastic context-free grammars |
| MASTR | | Mutual information with gap penalty, six canonical base pairs, stacking of adjacent base pairs, and the score combining the log-likelihood of the alignment, a covariation term, and the base-pair probabilities | Sampling approach by Markov chain Monte Carlo in a simulated annealing framework |
De novo ncRNA function prediction methods using RNA sequence features.
| Name | URL | Feature | Prediction algorithm |
|---|---|---|---|
| RNA | | 3-mer of nucleotides | SVM with parameters and kernels optimized by model training |
| CNCI | | Frequency of adjoining nucleotide triplets (6-mer), the length and | SVM using the standard radial basis function kernel |
| PLEK | | Normalized frequencies of 1–5 mers of RNA sequences | SVM |
| CONC | | Peptide length, amino acid composition, nucleotide frequencies, predicted secondary structure content, predicted percentage of exposed residues, compositional entropy, number of homologs from database searches, and alignment entropy | SVM |
| CPC | | The longest reading frame in the three forward frames, log-odds score, coverage of the predicted ORF, and integrity of the predicted ORF | SVM |
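
Several of the tools listed above build their feature vectors from k-mer frequencies (for example, the normalized frequencies of 1–5 mers attributed to PLEK). The following is a minimal sketch of that idea only, assuming a plain per-k normalization over a four-letter RNA alphabet; the function name `kmer_frequencies` and its parameters are hypothetical, and the code does not reproduce the weighting scheme of any specific tool.

```python
from itertools import product


def kmer_frequencies(seq, k_max=5, alphabet="ACGU"):
    """Return normalized k-mer frequencies for k = 1..k_max.

    Hypothetical helper for illustration: frequencies are normalized
    separately for each k, so the values for a given k sum to 1
    (or are all 0.0 when the sequence is shorter than k).
    """
    # Treat DNA-style input as RNA so that T and U are counted together.
    seq = seq.upper().replace("T", "U")
    features = {}
    for k in range(1, k_max + 1):
        # Start every possible k-mer at zero so the vector has a fixed length.
        counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
        windows = 0
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Windows containing ambiguous symbols (e.g. N) are skipped.
            if kmer in counts:
                counts[kmer] += 1
                windows += 1
        for kmer, count in counts.items():
            features[kmer] = count / windows if windows else 0.0
    return features
```

With the default settings the vector has 4 + 16 + 64 + 256 + 1024 = 1364 entries, which could then be passed to a classifier such as an SVM.
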
De novo ncRNA prediction methods using RNA structure features.
| Name | URL | Feature | Prediction algorithm |
|---|---|---|---|
| RNAfold | | Base-pair probabilities and MFE | Partition function and dynamic programming |
| Mfold | | MFE | Dynamic programming |
| Afold | | Sets of conditionally optimal multibranch loop free structures | Dynamic programming |
| Sfold | | Internal loops, sets of conditionally optimal MLF structures | Nearest-neighbour model (NNM) |
| Nussinov | | Individual base pairs and loop structure with the lowest free energy | Dynamic programming |
| Partition function method | | Full equilibrium partition for secondary structure and the probabilities of various substructures | Dynamic programming |
| Zhang | | MFE and GC content | Dynamic programming |
| ptRNApred | | 91 features including (1) 7 selected dinucleotide properties as well as their dinucleotide values, (2) 52 properties derived from the secondary structure, for example, the number of loops, and (3) 32 triplet element properties | Random forest and SVM |
| incRNA | | 9 genomic features including 4 expression features, 3 sequence information, and 2 RNA structure features | Random forest |
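
Many of the structure-based methods above rely on dynamic programming over possible base pairs, and some add simple composition features such as GC content. The sketch below illustrates the dynamic programming idea using the simplified base-pair maximization form of the Nussinov recursion; this is an assumption made for brevity, since it counts pairs rather than evaluating a thermodynamic energy model. The function names, the allowed pair set (Watson-Crick plus G-U wobble), and the `min_loop` parameter are hypothetical choices for illustration.

```python
def gc_content(seq):
    """Fraction of G and C nucleotides in the sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def nussinov_max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (simplified Nussinov recursion).

    min_loop is the minimum number of unpaired bases enclosed by a pair.
    """
    seq = seq.upper().replace("T", "U")
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] holds the best pair count for the subsequence seq[i..j].
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j stays unpaired
            # case 2: position j pairs with some k, splitting the interval
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]
```

For the toy hairpin `GGGAAAUCC`, the recursion finds three nested pairs closing a three-nucleotide loop. Energy-based tools replace the pair count with loop and stacking energies but keep the same interval-splitting structure.
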
Figure 2. Popular de novo methods and the statistical algorithms applied.
Sequencing-assembling based whole ncRNA set methods.
| Name | URL | Feature | Prediction algorithm |
|---|---|---|---|
| Tiling array | | Synonymous amino acid substitutions, reading frame conservation, and the occurrence of premature stop codons | RNAcode algorithm and biweight kernels |
| DigitagCT | | Genomic sequences, DGE tags, and tiling array expression | Infernal and BLASTN |
| BlockClust | | (1) Block group: entropy of read starts, entropy of read ends, entropy of read lengths, median of normalized read expressions, and normalized read expression levels in first quantile; (2) block: number of multimapped reads, entropy of read lengths, entropy of read expressions, minimum read length, and block length; and (3) block edge: contiguity and difference in median read expressions | Graph-kernel SVM |
| Noncoder | | Sequence homology, evolutionary information, the longest reading frame in three forward frames, log-odds score, coverage of the predicted ORF, and integrity of the predicted ORF | BLAT and PhyloCSF |
| Vicinal | | Chimeric RNA-cDNA fragments and terminal stem-loop | Bowtie 2 local mapping, filtering, and Vicinal mapping |
| CoRAL | | Read length, abundance of antisense transcription, 5′ and 3′ positional entropy, four nucleotide frequencies transformed into a log-odds ratio relative to equal base frequencies, and MFE | Multiclass classification random forest |
| FlaiMapper | | Densities of start and end positions of aligned reads and read lengths | Peak detection on the start and end position densities followed by filtering and a reconstruction process |
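
Several read-based features in the table above (for example, the entropy of read starts, read ends, and read lengths) can be read as Shannon entropies of an empirical distribution. The sketch below shows one way such values might be computed. It assumes that aligned reads are given as `(start, end)` tuples with inclusive coordinates; the function names are hypothetical and the code is not taken from any of the listed tools.

```python
from collections import Counter
from math import log2


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum p * log2(1 / p) over the observed categories.
    return sum((c / total) * log2(total / c) for c in counts.values())


def read_profile_entropies(reads):
    """Entropy of read starts, read ends, and read lengths.

    reads: iterable of (start, end) tuples, inclusive coordinates
    (an assumed representation chosen for illustration).
    """
    reads = list(reads)
    return {
        "start_entropy": shannon_entropy(s for s, _ in reads),
        "end_entropy": shannon_entropy(e for _, e in reads),
        "length_entropy": shannon_entropy(e - s + 1 for s, e in reads),
    }
```

Low entropy means that reads pile up at the same positions or share the same length, as expected for precisely processed small RNAs, whereas high entropy suggests more diffuse, degradation-like read patterns.
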
Methods to predict miRNA.
| Name | URL | Feature | Prediction algorithm |
|---|---|---|---|
| CSHMM | | Structure only | A discriminant function based on likelihood score for a hidden Markov model |
| MiPred | | 32 possible combinations of the middle nucleotide among the triplet elements, local contiguous structure sequence composition, MFE, and | Random forest |
| PlantMiRNAPred | | 115 features including (1) 17 primary sequence-related features, (2) 64 secondary structure-related features, and (3) 34 energy- and thermodynamics-related features | SVM |
| miRdentify | | 5′ heterogeneity, overhangs, negative numbers indicating 5′ overhang, thermodynamics, entropy, tailing, and multimapping | Mapping and seeking duplex-forming reads within 46–80 nt distance with the guide strand |
| CID-miRNA | | Secondary structure likelihood | Stochastic context-free grammar model, Chomsky normal form; Cocke-Young-Kasami algorithm, and Classification tree |
| miRank | | 36 global and local intrinsic features, including the normalized MFE of folding, the normalized base pairing propensities of both arms, and the normalized loop length | Belief propagation on a weighted graph, random walks-based ranking algorithm |
| miRCat | | | Dynamic programming |
| mirTool | | miRNA/miRNA, absolute/relative reads count, and the most abundant tag | Folding the flanking genomic sequence using the miRDeep program |
| miRanalyzer | | Number of bindings in read cluster sequence, normalized mean free energy of precursor sequence, number of bindings in precursor, length of read cluster, the corresponding putative mature star sequence, number of bindings in read cluster divided by the read cluster length, number of reads in read cluster, mean free energy of precursor sequence, degree of bulb asymmetry in precursor, and the number of bulbs in precursor secondary structure | Random forest |
| sRNAbench | | Within cluster ratio, 5′ fluctuations, most frequent to all ratio, minimum number of hairpin bindings, minimum number of mature bindings, most frequent read, length interval, and minimum reads | Hierarchical clustering |
Methods to predict lncRNAs.
| Name | Feature | Prediction algorithm |
|---|---|---|
| Estimating lincRNome size for human | Numbers of experimentally validated lincRNAs in human and mouse, and the number of lincRNAs overlapping between them | System of nonlinear equations |
| Classifying human lncRNA | RNA sequence-structure patterns (RSSPs) describing 42 highly structured families, motif binding sites extracted as 1314 Position-Weight Matrices (PWMs), all | Classifying human lncRNAs by their ability (or inability) to bind the polycomb repressive complex (PRC2), SVM with linear kernel |
| Identify, classify, and localize maize lncRNAs | Transcript length, open reading frame (ORF) size, and homology with known proteins | SVM |
| The GENCODE v7 catalog of human lncRNA | Lack of homology with known proteins, no reasonable-sized open reading frame (ORF), and no high conservation, confirmed by PhyloCSF through the majority of exons conserved promoters | Manual annotation and pattern recognition |
| Highly conserved large noncoding RNAs | Chromatin signatures “K4–K36” domain | Maximum CSF score observed across the entire genomic locus |