Muhammad Nabeel Asim1,2, Muhammad Ali Ibrahim1,2, Muhammad Imran Malik3, Andreas Dengel1,2, Sheraz Ahmed1,4.
Abstract
Circular ribonucleic acids (circRNAs) are novel non-coding RNAs that arise from alternative splicing of precursor mRNA in reversed order across exons. Despite the abundance of circRNAs in human genes and their involvement in diverse physiological processes, the function of most circRNAs remains unknown. As with other non-coding RNAs, knowledge of the sub-cellular localization of circRNAs can help clarify their influence on protein synthesis, degradation, and destination, their association with different diseases, and their potential for drug development. To date, wet-lab experimental approaches are used to detect the sub-cellular locations of circular RNAs. These approaches help to elucidate the role of circRNAs as protein scaffolds, RNA-binding protein (RBP) sponges, micro-RNA (miRNA) sponges, parental gene expression modifiers, alternative splicing regulators, and transcription regulators. To complement wet-lab experiments, and considering the progress made by machine learning approaches in determining the sub-cellular localization of other non-coding RNAs, this paper develops a computational framework, Circ-LocNet, to precisely detect circRNA sub-cellular localization. Circ-LocNet performs a comprehensive extrinsic evaluation of 7 residue frequency-based, residue order and frequency-based, and physico-chemical property-based sequence descriptors using the five most widely used machine learning classifiers. Further, it explores the performance impact of K-order sequence descriptor fusion, in which it ensembles similar as well as dissimilar genres of statistical representation learning approaches to reap their combined benefits. Considering the diversity of statistical representation learning schemes, it assesses the performance of second-order and third-order fusion, going all the way up to seventh-order sequence descriptor fusion.
A comprehensive empirical evaluation of Circ-LocNet on a newly developed benchmark dataset under different settings reveals that standalone residue frequency-based sequence descriptors and tree-based classifiers are better suited to predicting the sub-cellular localization of circular RNAs. Further, K-order heterogeneous sequence descriptor fusion in combination with tree-based classifiers most accurately predicts the sub-cellular localization of circular RNAs. We anticipate this study will act as a rich baseline and push the development of robust computational methodologies for accurately determining the sub-cellular localization of novel circRNAs.
Keywords: circular RNA; classification; machine learning; non-coding RNA; nucleotide frequency; nucleotide physico-chemical properties; sub-cellular localization dataset; subcellular localization; web server
Year: 2022 PMID: 35897818 PMCID: PMC9329987 DOI: 10.3390/ijms23158221
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 6.208
Figure 1. A Hierarchical Classification of Non-Coding RNAs.
Accuracy Produced by 5 Different Machine Learning Classifiers using 7 Distinct Sequence Descriptors.
| Sequence Descriptor | RandomForest | Xgboost | Naive Bayes | SVM | Adaboost |
|---|---|---|---|---|---|
| EIIP | 0.675 | 0.663 | 0.092 | 0.403 | 0.523 |
| zCurve | 0.621 | 0.612 | 0.270 | 0.405 | 0.526 |
| triMonoKGap | 0.688 | 0.678 | 0.225 | 0.463 | 0.559 |
| diMonoKGap | 0.643 | 0.665 | 0.214 | 0.405 | 0.557 |
| Kmer | 0.685 | 0.670 | 0.606 | 0.469 | 0.490 |
| RCkmer | 0.676 | 0.651 | 0.591 | 0.489 | 0.521 |
| pseudoKNC | 0.685 | 0.658 | 0.249 | 0.446 | 0.564 |
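The accuracy table above can be read programmatically. The following is a minimal sketch that copies the reported values into a dictionary, finds the best descriptor and classifier pair, and ranks classifiers by their mean accuracy over the seven descriptors. The variable names are illustrative, and averaging over descriptors is a reading aid assumed here, not a statistic reported in the paper.

```python
# Accuracy values copied from the table above; column order follows its header.
CLASSIFIERS = ["RandomForest", "Xgboost", "Naive Bayes", "SVM", "Adaboost"]
ACCURACY = {
    "EIIP":        [0.675, 0.663, 0.092, 0.403, 0.523],
    "zCurve":      [0.621, 0.612, 0.270, 0.405, 0.526],
    "triMonoKGap": [0.688, 0.678, 0.225, 0.463, 0.559],
    "diMonoKGap":  [0.643, 0.665, 0.214, 0.405, 0.557],
    "Kmer":        [0.685, 0.670, 0.606, 0.469, 0.490],
    "RCkmer":      [0.676, 0.651, 0.591, 0.489, 0.521],
    "pseudoKNC":   [0.685, 0.658, 0.249, 0.446, 0.564],
}

# Best single (descriptor, classifier, accuracy) cell in the table.
best = max(
    ((desc, clf, acc)
     for desc, row in ACCURACY.items()
     for clf, acc in zip(CLASSIFIERS, row)),
    key=lambda cell: cell[2],
)

# Mean accuracy of each classifier across descriptors (an assumed summary).
mean_by_classifier = {
    clf: sum(row[i] for row in ACCURACY.values()) / len(ACCURACY)
    for i, clf in enumerate(CLASSIFIERS)
}
ranking = sorted(mean_by_classifier, key=mean_by_classifier.get, reverse=True)

print(best)
print(ranking)
```

Under this summary, the two tree-based ensembles (RandomForest and Xgboost) rank first and second, which is consistent with the abstract's statement about tree-based classifiers.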
F1-score Produced by 5 Different Machine Learning Classifiers using 7 Distinct Sequence Descriptors.
| Sequence Descriptor | RandomForest | Xgboost | Naive Bayes | SVM | Adaboost |
|---|---|---|---|---|---|
| EIIP | 0.619 | 0.613 | 0.109 | 0.232 | 0.532 |
| zCurve | 0.585 | 0.585 | 0.304 | 0.236 | 0.525 |
| triMonoKGap | 0.634 | 0.635 | 0.260 | 0.334 | 0.561 |
| diMonoKGap | 0.602 | 0.624 | 0.248 | 0.235 | 0.562 |
| Kmer | 0.623 | 0.622 | 0.589 | 0.351 | 0.493 |
| RCkmer | 0.617 | 0.601 | 0.582 | 0.385 | 0.521 |
| pseudoKNC | 0.630 | 0.613 | 0.282 | 0.308 | 0.565 |
Figure 2. Standard Specificity Figures of 7 Different Sequence Descriptors Against 5 Different Classifiers.
Figure 3. Standard MCC Figures of 7 Different Sequence Descriptors Against 5 Different Classifiers.
Figure 4. Average Specificity and MCC Figures of 7 Different Sequence Descriptors Against 5 Different Classifiers.
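For readers unfamiliar with the two measures named in these captions, the sketch below gives the standard binary definitions of specificity and the Matthews correlation coefficient (MCC), computed from confusion-matrix counts. This excerpt does not state how the paper extends them to its multi-class setting, so the one-vs-rest binary form and the function names used here are assumptions.

```python
import math

def specificity(tn: int, fp: int) -> float:
    """True negative rate: TN / (TN + FP). Returns 0.0 when there are no negatives."""
    negatives = tn + fp
    return tn / negatives if negatives else 0.0

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary Matthews correlation coefficient, in the range [-1, 1].

    Returns 0.0 when any marginal total is zero, a common convention.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

# Hypothetical counts for one location treated as the positive class.
print(specificity(tn=45, fp=5))
print(mcc(tp=40, tn=45, fp=5, fn=10))
```

Unlike accuracy, MCC uses all four confusion-matrix cells, which makes it more informative when class sizes are imbalanced.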
Figure 5. AU-ROC Performance Figures Produced by 5 Different Classifiers Using 3 K-gap, 3 K-mer and 2 simple Sequence Descriptors on a Benchmark Circular RNA Sub-Cellular Localization Dataset: (a) TriMonoKgap Peak Performance using 5-mers, (b) DiMonoKgap Peak Performance using 2-mers, (c) RCKmer Peak Performance using 5-mers, (d) Kmer Peak Performance using 5-mers, (e) PseudoKNC Peak Performance using 5-mers, (f) EIIP Peak Performance, (g) Z-Curve Peak Performance.
Best Performing K-Order Sequence Descriptor Fusions across 5 Different Machine Learning Classifiers.
| Machine Learning Classifier | 2nd-order | 3rd-order | 4th-order | 5th-order | 6th-order | 7th-order |
|---|---|---|---|---|---|---|
| Random Forest | TriMonoKGap+ | RCKmer+ | diMonoKGap+ | diMonoKGap+ | diMonoKGap+ | diMonoKGap, |
| Xgboost | Kmer+ | pseudoKNC+ | Kmer, triMonoKGap+ | Kmer+triMonoKGap+ | pseudoKNC+triMonoKGap+ | diMonoKGap+EIIP+RCKmer+ |
| Naive Bayes | RCKmer+ | RCKmer+Kmer | RCKmer+pseudoKNC+ | diMonoKGap+RCKmer+ | diMonoKGap+RCKmer+ | diMonoKGap+EIIP+RCKmer+ |
| SVM | diMonoKGap+ | diMonoKGap+EIIP+ | diMonoKGap+ | diMonoKGap+EIIP+RCKmer | diMonoKGap+EIIP+ | diMonoKGap+EIIP+RCKmer+ |
| AdaBoost | diMonoKGap+ | diMonoKGap+ | diMonoKGap+RCKmer+ | diMonoKGap+RCKmer+ | diMonoKGap+RCKmer+ | diMonoKGap+EIIP+RCKmer+ |
Accuracy Produced by Top Performing K-Order Sequence Descriptor Fusions Across 5 Different Machine Learning Classifiers.
| Encoder Fusion | RandomForest | Xgboost | Naive Bayes | SVM | Adaboost |
|---|---|---|---|---|---|
| 2nd-order | 0.695 | 0.689 | 0.605 | 0.681 | 0.564 |
| 3rd-order | 0.693 | 0.689 | 0.249 | 0.683 | 0.576 |
| 4th-order | 0.694 | 0.688 | 0.249 | 0.682 | 0.571 |
| 5th-order | 0.687 | 0.686 | 0.247 | 0.681 | 0.562 |
| 6th-order | 0.692 | 0.684 | 0.239 | 0.674 | 0.551 |
| 7th-order | 0.683 | 0.678 | 0.221 | 0.673 | 0.531 |
F1-score Produced by Top Performing K-Order Sequence Descriptor Fusions Across 5 Different Machine Learning Classifiers.
| Encoder Fusion | RandomForest | Xgboost | Naive Bayes | SVM | Adaboost |
|---|---|---|---|---|---|
| 2nd-order | 0.643 | 0.637 | 0.587 | 0.621 | 0.566 |
| 3rd-order | 0.632 | 0.641 | 0.282 | 0.624 | 0.575 |
| 4th-order | 0.641 | 0.638 | 0.282 | 0.622 | 0.571 |
| 5th-order | 0.633 | 0.637 | 0.278 | 0.621 | 0.564 |
| 6th-order | 0.637 | 0.634 | 0.271 | 0.613 | 0.551 |
| 7th-order | 0.634 | 0.629 | 0.251 | 0.613 | 0.531 |
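To make the idea of K-order fusion concrete, here is a minimal sketch that enumerates every way of combining k of the seven descriptors and fuses the chosen descriptors by concatenating their feature vectors. Both the exhaustive enumeration and the use of plain concatenation are assumptions for illustration; this excerpt does not specify how Circ-LocNet combines descriptors internally. The function names and toy vectors are hypothetical.

```python
from itertools import combinations

DESCRIPTORS = ["EIIP", "zCurve", "triMonoKGap", "diMonoKGap",
               "Kmer", "RCkmer", "pseudoKNC"]

def fusions_of_order(k: int) -> list[tuple[str, ...]]:
    """All ways of choosing k distinct descriptors out of the seven."""
    return list(combinations(DESCRIPTORS, k))

def fuse(features: dict[str, list[float]], names: tuple[str, ...]) -> list[float]:
    """Concatenate the per-descriptor vectors of one sequence (assumed fusion rule)."""
    fused: list[float] = []
    for name in names:
        fused.extend(features[name])
    return fused

# Toy two-dimensional vectors for a single sequence; the values are made up.
toy = {name: [float(i), float(i) + 0.5] for i, name in enumerate(DESCRIPTORS)}

for k in range(2, 8):
    print(f"order {k}: {len(fusions_of_order(k))} candidate fusions")
print(fuse(toy, ("Kmer", "EIIP")))
```

With seven descriptors there are 21, 35, 35, 21, 7, and 1 candidate fusions for orders 2 through 7, or 120 in total, so an exhaustive search per classifier remains tractable.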
Figure 6. Circ-LocNet: A Computational Framework for Sub-cellular Localization Prediction of circRNAs.
Figure 7. Process of Generating Sequence K-mers (e.g., 3-mers), Where Each Particular Color Frame Denotes a Unique 3-mer.
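The k-mer generation process described in the Figure 7 caption can be sketched as a sliding window over the sequence. The stride of 1, the DNA-style `ACGT` alphabet, and the normalization by the total number of windows are assumptions made for illustration; the function names are hypothetical and not taken from the paper.

```python
from collections import Counter
from itertools import product

def generate_kmers(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Slide a window of width k over the sequence and collect each k-mer."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

def kmer_frequencies(sequence: str, k: int = 3, alphabet: str = "ACGT") -> dict[str, float]:
    """Normalized frequency of every possible k-mer over the alphabet.

    The result has len(alphabet) ** k entries, so it can serve as a
    fixed-length feature vector regardless of sequence length.
    """
    kmers = generate_kmers(sequence, k)
    counts = Counter(kmers)
    total = len(kmers) or 1  # avoid division by zero for very short sequences
    vocabulary = ("".join(p) for p in product(alphabet, repeat=k))
    return {kmer: counts.get(kmer, 0) / total for kmer in vocabulary}

example = "ACGTACG"
print(generate_kmers(example, k=3))
# ['ACG', 'CGT', 'GTA', 'TAC', 'ACG']
print(kmer_frequencies(example, k=3)["ACG"])
# 0.4
```

A sequence of length n yields n - k + 1 overlapping k-mers at stride 1, and the frequency vector has 4^k dimensions, which is 64 for 3-mers and 1024 for the 5-mers mentioned in the Figure 5 caption.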
Figure 8. Workflow for generating the circular RNA sub-cellular localization dataset, comprising the following steps: collecting raw sequences and their associated sub-cellular localizations, eliminating redundancy, and transforming the dataset into a standard format. The bar chart and pie chart illustrate the dataset statistics.