Mengting Niu, Jun Zhang, Yanjuan Li, Cankun Wang, Zhaoqian Liu, Hui Ding, Quan Zou, Qin Ma.
Abstract
Circular RNA (circRNA) plays an important role in the development of diseases and offers new directions for drug development. Accurate identification of circRNAs is important for a deeper understanding of their functions. In this study, we developed a new classifier, CirRNAPL, which extracts nucleic acid composition and structural features from the circRNA sequence and uses the particle swarm optimization algorithm to optimize an extreme learning machine. We compared CirRNAPL with existing methods, including BLAST, on three datasets and found that CirRNAPL significantly improved identification accuracy on all three, with accuracies of 0.815, 0.802, and 0.782, respectively. Additionally, we performed sequence alignment on the 564 sequences of the independent test set of the third dataset and analyzed the expression level of circRNAs. Results showed that the expression level of a sequence is positively correlated with its abundance. A user-friendly CirRNAPL web server is freely available at http://server.malab.cn/CirRNAPL/.
Keywords: ACC, Accuracy; CNN, Convolutional Neural Networks; Circular RNA; DAC, Dinucleotide-based auto-covariance; DACC, Dinucleotide-based auto-cross-covariance; DCC, Dinucleotide-based cross-covariance; ELM, extreme learning machine; Expression level; Extreme learning machine; GAC, Geary autocorrelation; Identification; MAC, Moran autocorrelation; MCC, Matthews Correlation Coefficient; MRMD, Maximum-Relevance-Maximum-Distance; NMBAC, Normalized Moreau–Broto autocorrelation; PC-PseDNC-General, General parallel correlation pseudo-dinucleotide composition; PCGs, protein coding genes; PSO, particle swarm optimization algorithm; Particle swarm optimization algorithm; PseDPC, Pseudo-distance structure status pair composition; PseSSC, Pseudo-structure status composition; RBF, radial basis function; RF, random forest; SC-PseDNC-General, General series correlation pseudo-dinucleotide composition; SE, Sensitivity; SP, Specificity; SVM, support vector machine; Triplet, Local structure-sequence triplet element; circRNA, circular RNA; lncRNAs, long non-coding RNAs
Year: 2020 PMID: 32308930 PMCID: PMC7153170 DOI: 10.1016/j.csbj.2020.03.028
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1 CircRNA splicing structure. CircRNA is a new class of RNA that differs from traditional linear RNA. It does not have a 5′ cap or a 3′ tail and is not easily degraded by exonuclease. In humans, it is more stable than linear RNA. Most circRNAs are formed by exons, while a few are derived from intron fragments.
Fig. 2 Flowchart of CirRNAPL. CirRNAPL identifies circRNA in four main steps: data, features, classification, and circRNA identification. Data: This step constructs the datasets. Python scripts extract the sequence data corresponding to the intervals in the BED file from the hg38 human genome. Features: This step extracts the features. In this work, information such as the structural characteristics of the RNA sequence is extracted as features in four parts, using a total of 14 calculation models. Classification: This step constructs the classifier. An extreme learning machine optimized by particle swarm optimization is used as the classification algorithm. The classifier CirRNAPL is built with tenfold cross-validation, and the final classification model is output. CircRNA identification: The RNA sequence to be labeled is identified using the classifier CirRNAPL.
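To make the classification step more concrete, the following is a minimal sketch of a plain extreme learning machine (ELM): the hidden-layer weights are drawn at random and left fixed, and only the output weights are solved by least squares. It is not the CirRNAPL implementation. The class name `SimpleELM`, the sigmoid activation, the hidden-layer size, and the 0.5 decision threshold are all assumptions made for illustration, and the particle swarm optimization step is omitted, since how it is coupled to the ELM is not described here.

```python
import numpy as np


class SimpleELM:
    """Illustrative binary ELM (hypothetical helper, not the authors' code)."""

    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activation applied to a fixed random projection of the input.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Input weights and biases are sampled once and never trained.
        self.W = self.rng.uniform(-1.0, 1.0, size=(X.shape[1], self.n_hidden))
        self.b = self.rng.uniform(-1.0, 1.0, size=self.n_hidden)
        H = self._hidden(X)
        # Output weights: least-squares solution via the pseudoinverse.
        self.beta = np.linalg.pinv(H) @ y
        return self

    def predict(self, X):
        scores = self._hidden(np.asarray(X, dtype=float)) @ self.beta
        # Assumed threshold: labels are 0/1, so 0.5 splits the two classes.
        return (scores >= 0.5).astype(int)
```

In a setup like the one described, an optimizer such as PSO could be used to search over the random input weights or the hidden-layer size instead of accepting a single random draw; that is one plausible reading, not a confirmed detail of the method.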
The details of the datasets.
| Datasets | Positive Data | Negative Data |
|---|---|---|
| circRNA vs PCG | 14,084 circRNAs | 9,533 PCGs |
| circRNA vs lncRNA | 14,084 circRNAs | 19,722 lncRNAs |
| Stem cell vs not | 2,082 circRNAs | 2,082 circRNAs |
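The datasets are described as being extracted from a BED file and the hg38 genome with Python scripts. The sketch below shows one way such an extraction could look, assuming the genome is already loaded as a dictionary from chromosome name to sequence and that each BED line describes one contiguous interval. The function names are hypothetical, and multi-exon (BED12 block) handling is deliberately left out because the record does not say how spliced sequences were assembled.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def extract_sequences(bed_lines, genome):
    """Extract (name, sequence) pairs for BED intervals.

    bed_lines: iterable of tab-separated BED lines.
    genome: dict mapping chromosome name to its sequence (assumed preloaded).
    """
    records = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        strand = fields[5] if len(fields) > 5 else "+"
        # BED coordinates are 0-based and half-open, matching Python slicing.
        seq = genome[chrom][start:end]
        if strand == "-":
            seq = reverse_complement(seq)
        records.append((name, seq))
    return records


# Toy usage with a made-up 10-base "chromosome".
toy_genome = {"chr1": "ACGGTTACCA"}
toy_bed = ["chr1\t2\t6\tcirc_plus\t0\t+", "chr1\t2\t6\tcirc_minus\t0\t-"]
print(extract_sequences(toy_bed, toy_genome))
# [('circ_plus', 'GGTT'), ('circ_minus', 'AACC')]
```

For a real genome, a FASTA indexing library would replace the in-memory dictionary; the slicing and strand logic would stay the same.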
Fig. 3 Feature expression method and feature dimension histogram.
Fig. 4 Feature importance analysis. A) Distribution of the top 20 features on the three datasets. B) Feature importance analysis: on the three datasets, feature selection yielded 520-dimensional features, and their distribution is summarized.
Fig. 5 Classifier validity verification. A) Identification results of five activation functions under tenfold cross-validation. B) Validation of the kernel function on the independent test set. C) Results of optimizing the extreme learning machine. D) Comparison with other classifiers under tenfold cross-validation. E) Validation of classifiers on independent test sets. F) Comparison of identification results with conventional BLAST sequence alignment.
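Classifier comparisons of this kind are usually reported with the metrics abbreviated in the keyword list: ACC, SE, SP, and MCC. As a reference, the sketch below computes them from the four confusion-matrix counts using the standard definitions. The function name and the choice to return 0.0 when a denominator is zero are assumptions for illustration; the counts in the usage line are invented and are not results from the paper.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    # Sensitivity: fraction of true positives recovered.
    se = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of true negatives recovered.
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Matthews correlation coefficient; defined as 0.0 here if undefined.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SE": se, "SP": sp, "MCC": mcc}


# Made-up counts, purely to show the call.
example = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```

MCC is often preferred over accuracy when the classes are imbalanced, as in the circRNA vs lncRNA dataset, because it uses all four counts.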
Fig. 6 A) Comparison with state-of-the-art methods. B) Violin plot analysis of feature importance.
Fig. 7 A) Relationship between sequence alignment and E value for stem and non-stem sequences. B) Relationship between the expression level and the abundance of the sequence. C) Partial conserved regions of the sequence alignment and the consensus logo.