Muhammad Nabeel Asim, Muhammad Ali Ibrahim, Muhammad Imran Malik, Christoph Zehe, Olivier Cloarec, Johan Trygg, Andreas Dengel, Sheraz Ahmed.
Abstract
Subcellular localization of ribonucleic acid (RNA) molecules provides significant insights into RNA functionality and helps to explore the association of RNAs with various diseases. Most existing predictors are single-compartment localization predictors (SCLPs), which cannot explain RNA associations with the diverse biochemical and pathological processes that arise mainly through RNA co-localization in multiple compartments. The few available multi-compartment localization predictors (MCLPs) produce decent performance only for a target RNA class of a particular sub-type. Furthermore, existing computational approaches have limited practical significance and limited potential to optimize therapeutics because of their poor model explainability. This paper presents an explainable Long Short-Term Memory (LSTM) network, "EL-RMLocNet", whose predictive performance and interpretability are optimized using a novel GeneticSeq2Vec statistical representation learning scheme and an attention mechanism, for accurate multi-compartment localization prediction of different RNAs using only raw RNA sequences. GeneticSeq2Vec generates optimized statistical vectors of raw RNA sequences by capturing short- and long-range relations of nucleotide k-mers. From the sequence vectors generated by GeneticSeq2Vec, LSTM layers extract the most informative features, and an attention layer weights these features according to their discriminative potential for accurate multi-compartment localization prediction. Through reverse engineering, the weights of the statistical feature space are mapped to nucleotide k-mer patterns, making multi-compartment localization predictions transparent and explainable for different RNA classes and species. Empirical evaluation indicates that EL-RMLocNet outperforms the state-of-the-art predictor for subcellular localization prediction of 4 different RNA classes by an average accuracy of 8% for Homo sapiens and 6% for Mus musculus.
EL-RMLocNet is freely available as a web server at https://sds_genetic_analysis.opendfki.de/subcellular_loc/.
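
The abstract describes working on nucleotide k-mers taken from raw RNA sequences. As a minimal sketch of that first step only, the snippet below slides a window over a sequence to list its k-mers. The function name `kmers`, the default `k=3`, and the `stride` parameter are illustrative assumptions; the record does not specify how GeneticSeq2Vec tokenizes sequences.

```python
def kmers(sequence, k=3, stride=1):
    """Return the k-mers obtained by sliding a window of width k over the sequence.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    seq = sequence.upper().replace("T", "U")  # treat DNA-style input as RNA
    # range() is empty when the sequence is shorter than k, so the result is [].
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


if __name__ == "__main__":
    print(kmers("AUGGCUA", k=3))  # ['AUG', 'UGG', 'GGC', 'GCU', 'CUA']
```

A stride of 1 yields overlapping k-mers; a stride equal to `k` would yield non-overlapping ones.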
Keywords: Attention mechanism; Deep learning; Explainable; GeneticSeq2Vec; Human; LSTM; Mouse; Multi-class; Multi-label; Neural tricks; RNA subcellular localization prediction; Single or multi compartment
Year: 2022 PMID: 35983235 PMCID: PMC9356161 DOI: 10.1016/j.csbj.2022.07.031
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
A summary of existing computational subcellular localization predictors for miRNA, lncRNA, and mRNA molecules.
| Approach | Subcellular Localization Cardinality | Nucleotide Encoding | Classifier |
|---|---|---|---|
| L2S-MirLoc | Multi-Label | Electron-Ion Interaction PseudoPotentials (EIIP) | Random Forest (RF) |
| miRNALoc | Multi-Label | pseudo dinucleotide compositions and di-nucleotide properties | Support Vector Machine (SVM) |
| MirLocPredictor | Multi-Label | positional and semantic information of k-mers (kmerPR2Vec) | Convolutional Neural Network (CNN) |
| MirGOFS | Multi-Label | functional similarity based encoding matrix | microRNA-based similarity inference model |
| MiRLocator | Multi-Label | K-mer embeddings using Word2vec (RNA2Vec) | BiLSTM encoder-decoder model |
| iLoc-LncRNA 2.0 | Multi-Class | fusing mutual information algorithm and incremental feature selection strategy | SVM |
| lncLocation | Multi-Class | k-mer frequency, physicochemical properties, and secondary structure features; autoencoder and binomial distribution based feature selection | SVM, RF, Logistic Regression, XGBoost, LightGBM, DNN and CNN |
| Locate-R | Multi-Class | K-mer composition and Pearson based filtering | Deep SVM |
| lncLocator 2.0 | Multi-Class | GloVe embeddings | CNN, BiLSTM, MLP |
| lncLocator | Multi-Class | k-mer frequency and stacked autoencoder | stacked ensemble classifier (SVM, RF) |
| iLoc-lncRNA | Multi-Class | binomial distribution-based feature selection, Pseudo K-tuple Nucleotide Composition | SVM |
| DeepLncLoc | Multi-Class | subsequence embeddings | CNN |
| lncLocPred | Multi-Class | k-mer, triplet, and PseDNC; VarianceThreshold, binomial distribution, and F-score based feature selection | Logistic Regression |
| Yang et al. LncRNAPred | Multi-Class | k-mer nucleotide composition, Analysis Of Variance (ANOVA) based feature selection | SVM |
| DeepLncRNA | Multi-Class | k-mer, RNA binding motifs, genomic loci | feed-forward multi-layer deep neural network |
| KD-KLNMF | Multi-Class | k-mer and dinucleotide based spatial autocorrelation, KLD non-negative matrix factorization based feature selection | SVM |
| mLoc-mRNA | Multi-Label | k-mer frequency and elastic-net based feature selection | RF |
| DM3Loc | Multi-Label | One-hot encoding | Attention based CNN |
| Zhang mRNALoc | Multi-Class | 9-mer, binomial distribution and one-way analysis of variance based features | SVM |
| RNATracker | Multi-Class | One-hot encoding | Hybrid (CNN + LSTM + Attention) |
| mRNAloc | Multi-Class | pseudo k-tuple nucleotide composition | SVM |
| mRNALocater | Multi-Class | pseudo k-tuple nucleotide composition, electron–ion interaction pseudopotential, correlation coefficient filtering | Ensemble (CatBoost + LightGBM + XGBoost) |
| SubLocEP | Multi-Class | Nucleotide physicochemical properties | Weighted LightGBM |
| NN-RNALoc | Multi-Class | k-mer frequency, distance-based sub-sequence profiling and PCA for dimensionality reduction | Multi-Layer DNN |
Fig. 2. Workflow of Novel K-hop Neighbourhood Relation based Statistical Representation Learning Scheme for RNA Sequences.
Fig. 1. Illustration of K-order (K-hop) Proximity Information. The Red Dotted Circle Represents First-Order Proximity, the Green Dotted Circle Indicates Second-Order Proximity, the Aqua Dotted Circle Represents Third-Order Proximity, and the Orange Dotted Circle Indicates Fourth-Order Proximity.
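
To make the idea of first- to fourth-order proximity concrete, here is a small sketch that counts k-mer pairs separated by 1 to 4 positions in a tokenized sequence. Reading "hop" as a positional offset between k-mers is an assumption made for illustration, and `k_hop_pairs` is a hypothetical name; the exact neighbourhood definition used by the authors is not given in this record.

```python
from collections import Counter


def k_hop_pairs(tokens, max_hop=4):
    """Count (k-mer, neighbour) pairs for each hop distance 1..max_hop.

    Assumption: a neighbour at hop h is the token h positions to the right.
    """
    counts = {hop: Counter() for hop in range(1, max_hop + 1)}
    for i, centre in enumerate(tokens):
        for hop in range(1, max_hop + 1):
            j = i + hop
            if j < len(tokens):  # skip neighbours that fall off the end
                counts[hop][(centre, tokens[j])] += 1
    return counts


if __name__ == "__main__":
    tokens = ["AUG", "UGG", "GGC", "GCU", "CUA"]
    pairs = k_hop_pairs(tokens)
    for hop, counter in pairs.items():
        print(hop, dict(counter))
```

Small hop values capture short-range relations between adjacent k-mers, while larger hop values capture longer-range relations.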
Optimal Parameter Values of Proposed EL-RMLocNet Approach for 8 Benchmark Datasets Belonging to 4 different RNA classes and 2 Species.
| 3 | 2 | 200 | 0.005 | 1 | 200 | 50 | 0.01 | 0.05 | 0.001 | 32 | |||
| 1 | 1 | 32 | 0.0025 | 1 | 32 | 60 | 0.005 | 0.06 | 0.1 | 32 | |||
| 2 | 2 | 64 | 0.0025 | 1 | 64 | 50 | 0.005 | 0.06 | 0.01 | 32 | |||
| 2 | 2 | 200 | 0.005 | 1 | 200 | 50 | 0.1 | 0.05 | 0.1 | 64 | |||
| 2 | 1 | 200 | 0.0025 | 4 | 64 | 90 | 0.05 | 0.06 | 0.1 | 32 | |||
| 1 | 1 | 32 | 0.0025 | 1 | 32 | 60 | 0.005 | 0.06 | 0.1 | 32 | |||
| 2 | 2 | 16 | 0.0025 | 1 | 16 | 50 | 0.005 | 0.06 | 0.0001 | 32 | |||
| 3 | 2 | 200 | 0.0025 | 4 | 60 | 50 | 0.05 | 0.05 | 0.01 | 128 | |||
Fig. 3. Workflow of an Explainable Deep Learning Model for RNA Associated Multi-Compartment Subcellular Localization Prediction.
Fig. 4. Information Flow in Standard LSTM Cell.
Fig. 5. Architecture of the Attention Model.
Fig. 6. Schematic Illustration of RNA Associated Multi-Compartment Subcellular Localization in Cells.
Fig. 7. Statistical Distribution of Benchmark RNA Associated Multi-Compartment Localization Prediction Datasets Belonging to Homo sapiens (A-D) and Mus musculus (E-H).
Fig. 8. A Comparison of Variations in Sequence Length across 8 Benchmark RNA Associated Multi-Compartment Subcellular Localization Datasets.
Impact of 6 Different Fixed-Length Sequence Generation Approaches on the Performance of the Proposed EL-RMLocNet Approach across 8 Benchmark Datasets of 2 Different Species, in Terms of Average Precision.
| 0.72 | 0.70 | 0.77 | 0.72 | 0.73 | 0.71 | |
| 0.85 | 0.86 | 0.85 | 0.84 | 0.77 | 0.77 | |
| 0.83 | 0.84 | 0.83 | 0.84 | 0.82 | 0.85 | |
| 0.77 | 0.83 | 0.80 | 0.78 | 0.80 | 0.80 | |
| 0.66 | 0.65 | 0.71 | 0.68 | 0.60 | 0.63 | |
| 0.86 | 0.87 | 0.86 | 0.86 | 0.84 | 0.83 | |
| 0.73 | 0.70 | 0.77 | 0.73 | 0.72 | 0.69 | |
| 0.82 | 0.81 | 0.82 | 0.81 | 0.80 | 0.81 | |
Fig. 9. AU-ROC Produced by Proposed EL-RMLocNet Approach for Multi-Compartment Subcellular Localization of mRNA, miRNA, snoRNA, and lncRNA across 2 Species: A. Homo sapiens and B. Mus musculus.
Fig. 10. Multi-Compartment Localization Prediction Performance Produced by EL-RMLocNet on 4 Benchmark Homo sapiens Datasets of mRNA, miRNA, snoRNA, and lncRNA Corresponding to Unique Sequence-Compartment Distribution.
Fig. 11. Multi-Compartment Localization Prediction Performance Produced by EL-RMLocNet on 4 Benchmark Mus musculus Datasets of mRNA, miRNA, snoRNA, and lncRNA Corresponding to Unique Sequence-Compartment Distribution.
Performance Comparison of Proposed EL-RMLocNet Approach with State-of-the-art Approach for Multi-Compartment Localization Prediction of miRNA, mRNA, snoRNA, and lncRNA using 8 Benchmark Datasets of Homo Sapiens (Human) and Mus Musculus (Mouse) Species.
| Average Precision | Accuracy | Coverage | Ranking Loss | One error | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Species | Datasets | State-of-the-art | EL-RMLocNet | State-of-the-art | EL-RMLocNet | State-of-the-art | EL-RMLocNet | State-of-the-art | EL-RMLocNet | State-of-the-art | EL-RMLocNet | |
| miRNA | 0.79 | 0.52 | 1.46 | 0.17 | 0.29 | |||||||
| mRNA | 0.76 | 0.41 | 1.69 | 0.24 | 0.37 | |||||||
| snoRNA | 0.82 | 0.54 | 1.54 | 0.18 | 0.24 | |||||||
| lncRNA | 0.75 | 0.42 | 1.18 | 0.22 | 0.37 | |||||||
| miRNA | 0.79 | 0.58 | 1.31 | 0.18 | 0.31 | |||||||
| mRNA | 0.70 | 0.34 | 1.71 | 0.14 | 0.44 | |||||||
| snoRNA | 0.80 | 0.52 | 1.59 | 0.21 | 0.25 | |||||||
| lncRNA | 0.76 | 0.43 | 0.95 | 0.19 | 0.40 | |||||||
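
The comparison above reports multi-label ranking measures (average precision, coverage, ranking loss, one error). As a hedged sketch, the functions below follow the commonly used textbook definitions of these measures; the record does not state the exact formulas used by the authors, so conventions such as subtracting 1 in coverage and counting ties as errors in ranking loss are assumptions. The code also assumes every sample has at least one relevant and one irrelevant label.

```python
def ranks(scores):
    """Rank labels by descending score; the best label gets rank 1."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    rank = [0] * len(scores)
    for position, label in enumerate(order, start=1):
        rank[label] = position
    return rank


def one_error(y_true, y_score):
    """Fraction of samples whose top-scored label is not a true label."""
    misses = 0
    for truth, scores in zip(y_true, y_score):
        top = max(range(len(scores)), key=lambda j: scores[j])
        misses += truth[top] == 0
    return misses / len(y_true)


def coverage(y_true, y_score):
    """Average depth in the ranking needed to cover all true labels, minus 1."""
    total = 0
    for truth, scores in zip(y_true, y_score):
        r = ranks(scores)
        total += max(r[j] for j, t in enumerate(truth) if t) - 1
    return total / len(y_true)


def ranking_loss(y_true, y_score):
    """Average fraction of (true, false) label pairs that are ordered wrongly."""
    total = 0.0
    for truth, scores in zip(y_true, y_score):
        pos = [s for s, t in zip(scores, truth) if t]
        neg = [s for s, t in zip(scores, truth) if not t]
        wrong = sum(p <= n for p in pos for n in neg)  # ties count as errors
        total += wrong / (len(pos) * len(neg))
    return total / len(y_true)


def average_precision(y_true, y_score):
    """Mean, over true labels, of the precision at each true label's rank."""
    total = 0.0
    for truth, scores in zip(y_true, y_score):
        r = ranks(scores)
        relevant = [j for j, t in enumerate(truth) if t]
        total += sum(
            sum(r[i] <= r[j] for i in relevant) / r[j] for j in relevant
        ) / len(relevant)
    return total / len(y_true)


if __name__ == "__main__":
    # Toy data (made up for illustration): 2 samples, 4 compartments.
    y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
    y_score = [[0.9, 0.2, 0.4, 0.6], [0.1, 0.3, 0.8, 0.2]]
    print(one_error(y_true, y_score), coverage(y_true, y_score))
    print(ranking_loss(y_true, y_score), average_precision(y_true, y_score))
```

Higher is better for average precision, while lower is better for coverage, ranking loss, and one error.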
Fig. 12. Most Informative and Least Informative Nucleotide K-mer Patterns for 4 Different RNAs Belonging to Homo sapiens and Mus musculus Species, Identified by the Attention Layer of the Proposed EL-RMLocNet Approach.