Siyu Han, Yanchun Liang, Qin Ma, Yangyi Xu, Yu Zhang, Wei Du, Cankun Wang, Ying Li.
Abstract
Discovering new long non-coding RNAs (lncRNAs) has been a fundamental step in lncRNA-related research. Nowadays, many machine learning-based tools have been developed for lncRNA identification. However, many methods predict lncRNAs using sequence-derived features alone, which tend to display unstable performance across species. Moreover, the majority of tools cannot be re-trained or tailored by users, nor can their features be customized or integrated to meet researchers' requirements. In this study, features extracted from sequence-intrinsic composition, secondary structure and physicochemical property are comprehensively reviewed and evaluated. An integrated platform named LncFinder is also developed to enhance the performance and promote the research of lncRNA identification. LncFinder includes a novel lncRNA predictor using the heterologous features we designed. Experimental results show that our method outperforms several state-of-the-art tools on multiple species with more robust and satisfactory results. Researchers can additionally employ LncFinder to extract various classic features, build classifiers with numerous machine learning algorithms and evaluate classifier performance effectively and efficiently. LncFinder can reveal the properties of lncRNA and mRNA from various perspectives and further inspire lncRNA-protein interaction prediction and lncRNA evolution analysis. It is anticipated that LncFinder can significantly facilitate lncRNA-related research, especially for poorly explored species. LncFinder is released as an R package (https://CRAN.R-project.org/package=LncFinder). A web server (http://bmbl.sdstate.edu/lncfinder/) is also developed to maximize its availability.
Keywords: EIIP physicochemical property; machine learning; multi-scale secondary structure; predictive modeling; sequence intrinsic composition
Year: 2019 PMID: 30084867 PMCID: PMC6954391 DOI: 10.1093/bib/bby065
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Overview of machine learning-based lncRNA identification tools
| Methods | CPC | CPAT | CNCI | PLEK | LncRNA-ID | lncRScan-SVM | DeepLNC | COME | CPC2 |
|---|---|---|---|---|---|---|---|---|---|
| Year | 2007 | 2013 | 2013 | 2014 | 2015 | 2015 | 2016 | 2016 | 2017 |
| Category | Protein-coding potential calculator | Protein-coding potential calculator | lncRNA predictor | lncRNA predictor | lncRNA prediction method | lncRNA predictor | lncRNA predictor | Protein-coding potential calculator | Protein-coding potential calculator |
| Input Format | FASTA | FASTA, BED | FASTA, GTF | FASTA | GTF | FASTA | GTF | FASTA, GTF | |
| Species | Multi-species | Human, mouse, fly, zebrafish | Vertebrate, plant | Vertebrate, plant | Human, mouse | Human | Human, mouse, fly, worm, plant | Multi-species | |
| Requirements | Linux, BLAST, protein database | Linux, Python 2.7, R | Linux, Python 2.7 | Linux, Python 2.7 | Linux, Python 2.7, Biopython | Linux, R | Linux, Python, Biopython | | |
| Model | SVM | Logistic regression | SVM | SVM | Balanced random forest | SVM | deep neural network | balanced random forest | support vector machine |
| Features | ORF information, BLASTX […] | ORF length, transcript length, Fickett TESTCODE score […] | ANT information, codon bias | Improved … | ORF length and coverage, ribosome interaction […] | Count of stop codon, exon information, txCdsPredict score […] | | GC content, BLASTX […] | ORF information, Fickett TESTCODE score […] |
| Re-training | | Yes | | Yes | | | | | |
| Software | Yes | Yes | Yes | Yes | | Yes | | Yes | Yes |
| Web Server | Yes | Yes | | | | | Yes | | Yes |
A brief summary of several lncRNA identification tools. Among the methods summarized above, the majority identify lncRNAs with sequence-derived features alone, and only CPAT and PLEK can be re-trained by users. DeepLNC provides only a web server, while CNCI, PLEK, lncRScan-SVM and COME can be downloaded for local use. CPC, CPAT and CPC2 are released as stand-alone applications as well as web servers. All the stand-alone tools listed in the table require a Linux operating system.
CPAT has updated its features in the latest version. The original feature ‘coverage of ORF’ has been replaced with the feature ‘length of the transcript’. The features ‘Fickett TESTCODE score’ and ‘hexamer score’ are calculated on the ORF region.
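The ORF-based features named in the table (ORF length and ORF coverage) can be illustrated with a short sketch. This is not the implementation used by CPAT or LncFinder: it assumes that an ORF runs from an ATG to the first in-frame stop codon on the forward strand only, and the function name `longest_orf` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (length, coverage) of the longest forward-strand ORF.

    Assumption: an ORF starts at ATG and ends at the first in-frame stop
    codon, with the stop codon counted in the length. Coverage is the ORF
    length divided by the transcript length. Returns (0, 0.0) if no ORF.
    """
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)  # close it at the stop codon
                start = None
    coverage = best / len(seq) if seq else 0.0
    return best, coverage


length, coverage = longest_orf("CCATGAAATAGGG")
print(length, round(coverage, 4))
```

In this toy transcript the only ORF is ATGAAATAG, so the length is 9 nt out of 13 nt.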
Figure 1. Framework of this study. Data sets used in our experiments are collected from GENCODE and Ensembl. Only one transcript from each gene is used. In addition to sequence-intrinsic composition, features are also extracted from multi-scale secondary structure and EIIP-based physicochemical property using two feature selection schemes. Evaluated with 10-fold CV and ROC curves, the optimal feature combination and machine learning algorithm are obtained to develop a new method for lncRNA identification. This method is benchmarked against five popular tools on five species, and it is finally included in LncFinder, which is a highly flexible package for lncRNA identification and analysis. LncFinder is published as an R package as well as a web server.
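The 10-fold CV step in the framework can be sketched as follows. The caption does not say whether the folds were stratified, so stratification here is an assumption, and the helper names `stratified_folds` and `cv_splits` are hypothetical.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds, keeping class ratios similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal indices round-robin per class
    return folds


def cv_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = stratified_folds(labels, k, seed)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# Toy data: 50 lncRNAs (label 1) and 50 protein-coding transcripts (label 0).
labels = [1] * 50 + [0] * 50
for train, test in cv_splits(labels):
    assert not set(train) & set(test)  # no sample is in both sets
```

Each sample lands in exactly one test fold, so every transcript is predicted once by a model that never saw it during training.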
Figure 2. Illustration of multi-scale secondary structure-derived sequences. As a low-scale feature, MFE is a basic structural outline presenting one RNA sequence’s stability. Medium-scale sequences briefly sketch RNA information and can be obtained from dot-bracket notation sequence alone without using sequence nucleotide composition. High-scale sequences can be viewed as a high-resolution panorama displaying the integration of sequence and structural information.
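A minimal sketch of how such structure-derived sequences might be built from dot-bracket notation is given below. The encodings are assumptions made for illustration: the paired/unpaired alphabet (`P`/`U`), the reading of "UP frequency" as the frequency of the dinucleotide `UP`, and the upper/lower-case merge are not taken from the paper, and all function names are hypothetical.

```python
def paired_unpaired(dot_bracket):
    """Medium-scale sketch: brackets become 'P' (paired), dots 'U' (unpaired)."""
    mapping = {"(": "P", ")": "P", ".": "U"}
    return "".join(mapping[c] for c in dot_bracket)


def kmer_frequency(seq, kmer):
    """Frequency of `kmer` among all overlapping windows of the same length."""
    k = len(kmer)
    windows = len(seq) - k + 1
    if windows <= 0:
        return 0.0
    hits = sum(1 for i in range(windows) if seq[i:i + k] == kmer)
    return hits / windows


def merge_sequence_structure(rna, dot_bracket):
    """High-scale sketch: upper case marks paired bases, lower case unpaired."""
    if len(rna) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    return "".join(
        base.upper() if state in "()" else base.lower()
        for base, state in zip(rna, dot_bracket)
    )


structure = "((..))"
pu = paired_unpaired(structure)                   # "PPUUPP"
up_freq = kmer_frequency(pu, "UP")                # 1 of 5 windows
merged = merge_sequence_structure("GGAACC", structure)  # "GGaaCC"
```

The medium-scale string uses only the structure, whereas the merged string needs both the nucleotides and the structure, which mirrors the distinction drawn in the caption.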
Features selected from three feature groups
| Sequence-intrinsic composition | Multi-scale structural information | EIIP-based physicochemical property |
|---|---|---|
| Logarithm-distance | Minimum free energy (MFE) | Signal at 1/3 position ( |
| Length of the longest ORF | UP frequency of paired–unpaired sequence | SNR |
| Coverage of the longest ORF | Logarithm-distance | Quantile statistics (Q1, Q2, min and max) |
| Logarithm-distance |
Logarithm-distance consists of three features: LogDist.LNC, LogDist.PCT and LogDist.Ratio.
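The EIIP-based features in the third column rest on the power spectrum of a numerically encoded sequence. The sketch below uses the standard EIIP values for the four nucleotides and a naive discrete Fourier transform. How LncFinder defines the signal and the SNR in detail is not stated here, so excluding the DC term from the mean and rounding the N/3 index are assumptions, and the function names are hypothetical. Quantile statistics are omitted.

```python
import cmath

# Electron-ion interaction pseudopotential (EIIP) values per nucleotide.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def power_spectrum(seq):
    """Naive O(N^2) DFT power spectrum of the EIIP-encoded sequence."""
    x = [EIIP[b] for b in seq.upper().replace("U", "T")]
    n = len(x)
    spectrum = []
    for k in range(n):
        coeff = sum(
            x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def period3_features(seq):
    """Return (signal at the N/3 position, signal-to-noise ratio).

    Assumptions: the DC term (k = 0) is excluded from the mean power, and
    the N/3 position is rounded to the nearest integer index.
    """
    spec = power_spectrum(seq)
    n = len(spec)
    signal = spec[round(n / 3)]
    noise = sum(spec[1:]) / (n - 1)
    return signal, (signal / noise if noise > 0 else 0.0)


# A perfectly 3-periodic toy sequence concentrates its power at N/3 and 2N/3.
signal, snr = period3_features("ATG" * 10)
print(round(snr, 3))
```

Protein-coding regions tend to show a peak at the N/3 position because of codon periodicity, which is why a high SNR is informative for separating mRNAs from lncRNAs.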
Performances of each feature group on human data set A
| Feature group | Sensitivity | Specificity | Accuracy | F-measure | Kappa |
|---|---|---|---|---|---|
| Sequence | 0.9555 | 0.9705 | 0.9630 | 0.9628 | 0.9261 |
| SS | 0.8129 | 0.8921 | 0.8525 | 0.8464 | 0.7050 |
| EIIP | 0.9021 | 0.8686 | 0.8853 | 0.8872 | 0.7706 |
| All features | | | | | |
SS: multi-scale structural features. The results are obtained from 10-fold CV. Bold numbers indicate the highest value.
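The evaluation measures in this table (sensitivity, specificity, accuracy and kappa), together with the F1 score, can all be derived from a confusion matrix. The sketch below treats lncRNA as the positive class. The counts are hypothetical: they assume 10,000 transcripts per class and are chosen only to match the sensitivity and specificity of the "Sequence" row, so the other measures agree with the table only approximately.

```python
def binary_metrics(tp, fn, tn, fp):
    """Compute common binary classification measures from raw counts."""
    total = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's kappa: agreement corrected for the agreement expected by chance.
    p_yes = ((tp + fn) / total) * ((tp + fp) / total)
    p_no = ((tn + fp) / total) * ((tn + fn) / total)
    p_chance = p_yes + p_no
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical counts for a balanced data set of 10,000 + 10,000 transcripts.
m = binary_metrics(tp=9555, fn=445, tn=9705, fp=295)
print({name: round(value, 4) for name, value in m.items()})
```

With balanced classes the chance agreement is 0.5, so kappa reduces to 2 × accuracy − 1, which is consistent with the accuracy and kappa columns above.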
Figure 3. ROC curves of different feature groups and different tools on three species. (A) Sequence-derived (Seq Group), EIIP-derived (EIIP Group), secondary structure-derived (SS Group) and six other classic feature groups were evaluated on human data set B. All three feature categories we proposed were among the top five feature groups. Logarithm-distance features outperformed other sequence-intrinsic features such as codon bias and hexamer score with the highest AUC. Six EIIP-based features even had performance comparable to that of 64 codon bias features. (B) Nine feature groups were extracted from the training set of human data set B and were used to build classifiers. This panel shows the nine classifiers’ performances on the mouse test set. All feature groups showed some fluctuation in performance, but sequence-derived features still achieved the best AUC. (C) Classifiers built on human data set B were evaluated with a test set of the wheat data set. Compared with panel (A), the AUC of codon bias features decreased by about 18%, while the AUC of hexamer score decreased by about 30%. The EIIP-based feature group surpassed codon bias features and demonstrated satisfactory cross-species performance. Sequence-derived features still obtained the best AUC. (D) LncFinder and five other tools, namely, CPC (offline version), CPAT (re-trained model), CNCI, PLEK (re-trained model) and CPC2, were tested on human data set B. LncFinder and CPAT had the best AUC, but the accuracy of CPAT was lower than that of LncFinder. (E) LncFinder and the other five tools were tested on the mouse data set. LncFinder achieved the best result. (F) LncFinder and the other five tools were tested on the wheat data set. LncFinder achieved the best AUC on the human and mouse data sets. Although the accuracy of CPC on the human and mouse data sets was inferior to that of other tools, CPC surpassed all alignment-free tools on the wheat data set. LncFinder had the best performance among alignment-free tools.
We cannot know which tool is best for one specific species in advance. A tool that can present robust and stable results on multiple species is of crucial importance. LncFinder had the most stable and reliable performances among these tools.
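The AUC values compared in these panels can be computed without drawing the curve, using the rank interpretation of AUC: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below is a simple quadratic-time version for illustration; the function name `roc_auc` and the toy scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores.

    Labels are 1 (positive, e.g. lncRNA) or 0 (negative). Ties between a
    positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ranked correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(auc)  # 0.75
```

Because AUC depends only on the ranking of scores, it is unaffected by the decision threshold, which is why a tool can match another in AUC yet differ in accuracy.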
Figure 4. Performances of different tools on human data set B. LncFinder had the best accuracy of 0.9728. CPC had a strong tendency to classify lncRNAs as protein-coding transcripts and thus had a low accuracy of 0.8304 (web server). As an upgraded version, CPC2 achieved an accuracy of 0.9614. CPC2 was a big improvement on its predecessor and also outperformed CNCI and PLEK, which obtained accuracies of 0.9450 and 0.9274, respectively. CPAT (re-trained model) was inferior only to LncFinder and obtained an accuracy of 0.9642. Even when secondary structure-derived features were excluded, LncFinder (Without.SS) still surpassed the other tools with an accuracy of 0.9716.
Figure 5. Performances of different tools on mouse data set. LncFinder achieved the best accuracy of 0.9347, while PLEK (re-trained model) had an accuracy of 0.8178. The accuracy of CPAT (re-trained model) was 0.9242, better than CNCI’s 0.9133, CPC’s 0.8678 (web server) and CPC2’s 0.8611. LncFinder (Without.SS) outperformed the other tools with an accuracy of 0.9286 even without secondary structure-derived features. When using the model for human, LncFinder outperformed CPC/CPC2, CPAT (web server), CNCI and PLEK (re-trained model) with a satisfactory accuracy of 0.9186. Under this circumstance, LncFinder can even rival CPAT (re-trained model for mouse), which demonstrates LncFinder’s robustness and high cross-species stability.
Figure 6. Performances of different tools on wheat data set. Although CPC had inferior performances on human and mouse, it achieved the best accuracy on wheat. CPC obtained an accuracy of 0.9595, but its alignment-free successor CPC2 only had an accuracy of 0.7870. The accuracies of CPAT (re-trained model) and PLEK (re-trained model) were 0.8743 and 0.8773, respectively, while LncFinder obtained an accuracy of 0.9283. When default models were used, CPAT (model for human), CNCI (default model for plants) and PLEK (default model for plants) had accuracies of 0.7145, 0.6158 and 0.5275, respectively. LncFinder (model for human) had an accuracy of 0.8190. Although CNCI and PLEK provide default models for plants, their performances were substandard. LncFinder had the best performance among alignment-free tools. Even using the model for human, LncFinder still outperformed CPC2, CNCI and the default models of CPAT and PLEK.
Performances of different tools on zebrafish and chicken data sets
| Methods | Zebrafish: Sensitivity | Zebrafish: Specificity | Zebrafish: Accuracy | Zebrafish: F-measure | Zebrafish: Kappa | Chicken: Sensitivity | Chicken: Specificity | Chicken: Accuracy | Chicken: F-measure | Chicken: Kappa |
|---|---|---|---|---|---|---|---|---|---|---|
| CPC | 0.6728 | | | | | 0.5784 | | 0.7836 | 0.7277 | 0.5671 |
| CPAT | 0.8668 | 0.8660 | 0.8664 | 0.8663 | 0.7328 | 0.9189 | 0.9178 | 0.9183 | 0.9183 | 0.8366 |
| CNCI | 0.8535 | 0.8728 | 0.8631 | 0.8618 | 0.7263 | 0.9128 | 0.9051 | 0.9089 | 0.9093 | 0.8179 |
| PLEK | 0.8715 | 0.8255 | 0.8485 | 0.8519 | 0.6970 | 0.9346 | 0.9124 | 0.9235 | 0.9244 | 0.8740 |
| CPC2 | | 0.7835 | 0.8391 | 0.8476 | 0.6783 | 0.7650 | 0.9235 | 0.8443 | 0.8308 | 0.6885 |
| LncFinder | 0.8815 | | | | | | 0.9321 | | | |
Bold numbers indicate the highest value. LncFinder has the best performance. In our test, CPC could not process the protein-coding transcripts of zebrafish; thus, only the result of lncRNAs is obtained.
Functions and descriptions of LncFinder R package
| Function | Description | Option |
|---|---|---|
| Functions for classic features extraction | ||
| compute_EIIP() | Compute EIIP-derived features | 1. |
| compute_EucDist() | Compute Euclidean-distance features | 1. |
| compute_FickettScore() | Compute Fickett TESTCODE Score | |
| compute_GC() | Compute GC content | |
| compute_hexamerScore() | Compute hexamer score | see compute_EucDist() |
| compute_kmer() | Compute k-mer … | |
| compute_LogDist() | Compute Logarithm-distance | see compute_EucDist() |
| compute_pI() | Compute isoelectric point | 1. |
| find_orfs() | Find ORFs | |
| Functions for LncRNA identification and new classifier construction | ||
| lnc_finder() | Identify lncRNAs using LncFinder | 1. |
| build_model() | Build new model using LncFinder | see lnc_finder() |
| extract_features() | Extract features proposed by LncFinder | |
| read_SS() | Load external secondary structure information | |
| run_RNAfold() | Run RNAfold and capture the results | |
| svm_cv() | Perform cross validation for SVM model | 1. |
| svm_tune() | Tune SVM model | see svm_cv() |
This table briefly summarizes the main functions of the LncFinder R package. All functions and descriptions are based on the LncFinder R package (version 1.1.2). The package will be updated regularly. Refer to Supplementary File 4 - R package Manual for detailed descriptions and examples.
Figure 7. Screenshot of LncFinder’s web server.