Tanvir Alam, Hamada R. H. Al-Absi, Sebastian Schmeier.
Abstract
Long non-coding RNAs (lncRNAs), the pervasively transcribed part of the mammalian genome, have played a significant role in changing our protein-centric view of genomes. The abundance of lncRNAs and their diverse roles across cell types have opened numerous avenues of lncRNAome research. To discover and understand the lncRNAome, many sophisticated computational techniques have been leveraged. Recently, deep learning (DL)-based modeling techniques have been successfully used in genomics due to their capacity to handle large amounts of data and to produce relatively better results than traditional machine learning (ML) models. DL-based modeling has now become the method of choice for many tasks in lncRNAome research as well. In this review article, we summarize the contribution of DL-based methods in nine different lncRNAome research areas. We also outline the DL-based techniques leveraged in lncRNAome research, highlighting the challenges computational scientists face while developing DL-based models for the lncRNAome. To the best of our knowledge, this is the first review article that summarizes the role of DL-based techniques in multiple areas of lncRNAome research.
Keywords: Attention mechanism; CNN; LSTM; convolutional neural network; deep learning; lncRNA; lncRNAome; long non-coding RNA; machine learning
Year: 2020 PMID: 33266128 PMCID: PMC7711891 DOI: 10.3390/ncrna6040047
Source DB: PubMed Journal: Noncoding RNA ISSN: 2311-553X
Figure 1. A neural network (NN) with four inputs and two hidden layers (adapted from [32]). xi represents an input feature of the network, and yi represents an output class label.
Figure 2. A restricted Boltzmann machine (RBM) (adapted from [34]).
Figure 3. Pretraining of a deep belief network (DBN) (adapted from [36]).
Figure 4. The architecture of a convolutional neural network (CNN) (adapted from [39]).
Figure 5. A graph convolutional network (GCN) (adapted from [41]).
Figure 6. The architecture of a generative adversarial network (GAN) (adapted from [42]).
Figure 7. The architecture of an autoencoder (AE) (adapted from [45]).
Figure 8. A simple architecture of a recurrent neural network (RNN).
Figure 9. A long short-term memory (LSTM) architecture (adapted from [49]).
Figure 10. A bidirectional LSTM (BLSTM) architecture. A and A’ represent an LSTM cell propagating data dependencies in the forward and reverse directions, respectively. xt and yt are the input and output at timestep t of each LSTM cell, respectively. S0 and S’0 denote the initial states, whereas Si and S’i denote the final states.
Figure 11. An attention mechanism (AM) (adapted from [51]). The output map from the middle convolutional layer of a network is propagated to the next layer, and the AM calculates a weighted average of its elements. Fully connected layer calculations are represented by straight lines, and the weighted average calculation is represented by dashed lines. A small neural network is used by the AM to estimate the importance of each element of the map.
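The weighted averaging described for Figure 11 can be made concrete with a minimal numpy sketch. This is not the network from [51]; the scoring vector `w` is a hypothetical stand-in for the small scoring network the AM uses to estimate importance:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(features, w):
    """Weighted average of feature vectors, as in a simple attention layer.

    features: (n, d) array of n intermediate feature vectors.
    w: (d,) scoring vector (stand-in for the scoring sub-network).
    """
    scores = features @ w      # one relevance score per feature vector
    alpha = softmax(scores)    # attention weights, non-negative, sum to 1
    return alpha @ features    # (d,) weighted average of the features

feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
pooled = attention_pool(feats, np.array([1.0, 0.0]))
```

With a zero scoring vector the weights are uniform and the pooled vector reduces to the plain mean of the features, which is a useful sanity check on any attention implementation.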
List of deep learning (DL)-based architectures that have been employed to solve key questions in lncRNA research.
| Research Area | Proposed DL-Based Architecture | References |
|---|---|---|
| LncRNA identification | CNN and RNN | LncRNAnet |
| | DBN | LncADeep |
| | Embedding vector, BLSTM, CNN | Liu et al. |
| | DNN | DeepLNC |
| Distinct transcription regulation of lncRNAs | CNN | DeepCNPP |
| Functional annotation of lncRNAs | DNN | LncADeep |
| Localization prediction | DNN | DeepLncRNA |
| LncRNA–protein interaction | Stacked auto-encoder, Random forest | IPminer |
| | Stacked auto-encoder, CNN | RPITER |
| LncRNA–miRNA interaction | GCN | GCLMI |
| LncRNA–DNA interaction | GCN | |
| LncRNA–disease association | GCN and AM | GCNLDA |
| | CNN and AM | CNNLDA |
| | DNN | NNLDA |
| Cancer type classification | MLP, CNN, LSTM, DAE | |
AM: attention mechanism. BLSTM: bidirectional long short-term memory. CNN: convolutional neural network. DAE: deep autoencoder. DBN: deep belief network. DNN: deep neural network. GCN: graph convolutional network. LSTM: long short-term memory. MLP: multi-layer perceptron. RNN: recurrent neural network.
Overview of articles for lncRNA identification leveraging DL-based techniques.
| | LncRNAnet | LncADeep | Liu et al. | DeepLNC |
|---|---|---|---|---|
| Publication year | 2018 | 2018 | 2019 | 2016 |
| Species | Human and mouse | Human and mouse | Human and mouse | Human |
| Data source used | GENCODE 25, Ensembl | GENCODE 24, RefSeq | GENCODE 28, RefSeq | LNCipedia 3.1, RefSeq |
| Number of lncRNAs considered for training | ~21k (~21k) lncRNA transcripts from human (mouse) | ~66k (~42k) full-length lncRNA transcripts from human (mouse) | ~28k (~17k) lncRNA transcripts from human (mouse) | ~80k lncRNA transcripts and ~100k mRNA transcripts |
| Performance metrics | SN, SP, ACC, F1-score, AUC | SN, SP, Hm | SN, SP, ACC, F1-score, AUC | SN, SP, ACC, F1-score, Precision |
| Metric for comparison against traditional ML-based models * | ACC: 91.79 # | Hm: 97.7 # | ACC: 96.4 # | ACC: 98.07 |
| Intriguing features from the proposed model | ORF length and ratio | ORF length and ratio, k-mer composition and hexamer score, position-specific nucleotide frequency, etc. | k-mer embedding | Solely based on k-mer patterns |
| Source code/Implementation | N/A | | N/A | |
ACC: accuracy. AUC: area under the receiver operating characteristic curve. Hm: harmonic mean of sensitivity and specificity. MCC: Matthews correlation coefficient. N/A: not available. ORF: open reading frame. SN: sensitivity. SP: specificity. * Performance metrics that were highlighted in the original research article for comparison against traditional machine learning (ML)-based models. #: Performance on human data.
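Several of the identification tools above (DeepLNC most directly, which is "solely based on k-mer patterns") feed fixed-length k-mer composition vectors into a deep network. As a hedged illustration of that feature-extraction step, not any one tool's exact pipeline, a minimal sketch:

```python
from itertools import product

def kmer_frequencies(seq, k=3):
    """Frequency of each of the 4**k DNA k-mers in a transcript sequence.

    Returns a fixed-length vector in lexicographic k-mer order
    (AAA, AAC, ..., TTT for k=3), suitable as input to a DNN classifier.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {km: 0 for km in kmers}
    total = max(len(seq) - k + 1, 1)
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in counts:          # skip windows containing N or other symbols
            counts[km] += 1
    return [counts[km] / total for km in kmers]

vec = kmer_frequencies("ACGTACGT", k=2)   # 16-dimensional feature vector
```

Because the vector length is 4^k regardless of transcript length, transcripts of different sizes map to a common input space, which is why k-mer composition is so common as a first feature for lncRNA vs. mRNA discrimination.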
Overview of articles for demystifying transcription regulation of lncRNA leveraging DL-based techniques.
| | DeepCNPP | DeePEL |
|---|---|---|
| Publication year | 2019 | 2019 |
| Species | Human | Human |
| Data source used | Dataset from [ ] | FANTOM CAT |
| Number of lncRNA transcripts or genes considered | ~19k lncRNA genes | ~7k (~3k) p-lncRNA (e-lncRNA) transcripts |
| Performance metrics | SN, SP, ACC | SN, SP, MCC, AUC |
| Metric for comparison against traditional ML-based models * | ACC: 83.34 | No traditional ML model exists for this task |
| Intriguing features from the proposed model | k-mer embedding of promoter regions | k-mer embedding of promoter regions, transcription factor binding sites |
* Performance metrics that were highlighted in the original research article for comparing against traditional ML-based models.
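Both tools in this table use k-mer embeddings of promoter regions: instead of counting k-mers, each k-mer is mapped to a learned dense vector, and the promoter becomes a matrix of such vectors for a CNN or BLSTM. A minimal sketch under assumed (hypothetical) dimensions, with a randomly initialized table standing in for the trained embedding:

```python
import numpy as np

def kmer_ids(seq, k=3):
    # Encode each overlapping k-mer as an integer in [0, 4**k).
    base = {"A": 0, "C": 1, "G": 2, "T": 3}
    ids = []
    for i in range(len(seq) - k + 1):
        idx = 0
        for ch in seq[i:i + k]:
            idx = idx * 4 + base[ch]
        ids.append(idx)
    return ids

rng = np.random.default_rng(0)
k, dim = 3, 8
embedding = rng.normal(size=(4 ** k, dim))   # trainable lookup table

promoter = "ACGTGCA"
mat = embedding[kmer_ids(promoter, k)]       # (5, 8) matrix fed to a CNN/BLSTM
```

Unlike one-hot k-mer counting, the embedding table is updated during training, so k-mers with similar regulatory roles can end up with similar vectors.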
Overview of articles for lncRNA–protein interaction prediction leveraging DL-based techniques.
| | IPminer | RPI-SAN | BGFE | RPITER |
|---|---|---|---|---|
| Publication year | 2016 | 2018 | 2019 | 2019 |
| Species | Multi-species | Multi-species | Multi-species | Multi-species |
| Benchmark data source used | NPInter 2.0, RPI369, RPI488, RPI1807, RPI2241, RPI13254 | NPInter 2.0, RPI488, RPI1807, RPI2241 | RPI488, RPI1807, RPI2241 | NPInter 2.0, RPI369, RPI488, RPI1807, RPI2241 |
| Performance metrics | SN, SP, ACC, Precision, AUC, MCC | SN, SP, ACC, Precision, AUC, MCC | SN, SP, ACC, Precision, AUC, MCC | SN, SP, ACC, Precision, AUC, MCC |
| Metrics for comparison against traditional ML-based models per dataset * | NPInter 2.0 (ACC: 95.7) #, RPI369 (ACC: 75.2), RPI488 (ACC: 89.1), RPI1807 (ACC: 98.6), RPI2241 (ACC: 82.4), RPI13254 (ACC: 94.5) | NPInter 2.0 (ACC: 99.33) #, RPI488 (ACC: 89.7), RPI1807 (ACC: 96.1), RPI2241 (ACC: 90.77) | RPI488 (ACC: 88.68), RPI1807 (ACC: 96.0), RPI2241 (ACC: 91.30) | NPInter 2.0 (ACC: 95.5) #, RPI369 (ACC: 72.8), RPI488 (ACC: 89.3), RPI1807 (ACC: 96.8), RPI2241 (ACC: 89.0) |
| Intriguing features from the proposed model | Sequence composition features, specifically 3-mers and 4-mers from protein and RNA sequences, respectively | k-mer sparse matrix from RNA sequences and PSSM from protein sequences | k-mer sparse matrix from RNA sequences and PSSM from protein sequences; a stacked auto-encoder was employed to achieve high accuracy | k-mer frequency of the sequence and two types of structural information (bracket and dot) from RNA; k-mer frequency of the sequence and three types of structural information (α-helix, β-sheet and coil) from protein |
| Source code/Implementation | N/A | N/A | | |
PSSM: position-specific scoring matrix. * Performance metrics that were highlighted in the original research article for comparison against traditional machine learning (ML)-based models. #: Performance on human data.
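BGFE and RPITER both use stacked auto-encoders to compress high-dimensional sequence features before classification. As a hedged sketch of the idea, not either tool's architecture, here is one auto-encoder layer trained by plain gradient descent on toy feature vectors; stacking means training a second such layer on the codes `H` produced by the first:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for k-mer feature vectors of RNA-protein pairs.
X = rng.random((64, 16))

d_in, d_hid, lr = 16, 4, 0.5
W1 = rng.normal(scale=0.1, size=(d_in, d_hid))   # encoder weights
W2 = rng.normal(scale=0.1, size=(d_hid, d_in))   # decoder weights

def forward(X):
    H = np.tanh(X @ W1)    # low-dimensional code
    return H, H @ W2       # linear reconstruction of the input

losses = []
for _ in range(200):
    H, R = forward(X)
    err = R - X                          # reconstruction error
    losses.append((err ** 2).mean())     # mean squared reconstruction loss
    gW2 = H.T @ err / len(X)             # descent direction for the decoder
    gH = err @ W2.T * (1 - H ** 2)       # backprop through tanh
    gW1 = X.T @ gH / len(X)              # descent direction for the encoder
    W2 -= lr * gW2
    W1 -= lr * gW1
```

After training, `H` (here 4-dimensional) replaces the raw 16-dimensional features as input to the downstream interaction classifier, which is the role the auto-encoder plays in these pipelines.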
Overview of articles for lncRNA–disease association prediction leveraging DL-based techniques.
| | GCNLDA | CNNLDA | NNLDA |
|---|---|---|---|
| Publication year | 2019 | 2019 | 2019 |
| Data source used | LncRNADisease, Lnc2Cancer, GeneRIF | LncRNADisease, Lnc2Cancer, GeneRIF | LncRNADisease |
| Number of lncRNAs considered | 240 | 240 | 19,166 |
| Number of diseases considered | 402 | 402 | 529 |
| Performance metrics | AUC, AUPRC, Precision, Recall | AUC, AUPRC, Precision, Recall | HR(k): probability for the predicted samples to appear in a top-k ranked list |
| Metrics for comparison against traditional ML-based models * | AUC $: 0.959 | AUC $: 0.952 | HR(k), k = 1 to 10 |
| Intriguing features from the proposed model | For ncRNA–lncRNA similarity, Chen's method was applied | For ncRNA–lncRNA similarity, Chen's method was applied | The matrix factorization method was modified in two aspects to fit this model: (a) cross-entropy was used as the loss function; (b) only one batch of data per round was used to minimize the loss |
| Source code/Implementation | N/A | N/A | |
AUPRC: area under the precision-recall curve. HR(k): hit ratio, the probability for the predicted samples to appear in a top-k ranked list. * Performance metrics that were highlighted in the original research article for comparison against traditional ML-based models. $: Average over 402 diseases.
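NNLDA's key modification, matrix factorization trained with a cross-entropy loss, can be sketched in a few lines of numpy. This is an illustrative toy (random binary matrix, full-batch updates, made-up dimensions), not NNLDA itself, which additionally uses mini-batches and a neural architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary association matrix: rows = lncRNAs, cols = diseases.
A = rng.random((20, 12)) < 0.2

d, lr = 5, 2.0
U = rng.normal(scale=0.1, size=(20, d))   # latent lncRNA factors
V = rng.normal(scale=0.1, size=(12, d))   # latent disease factors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

losses = []
for _ in range(400):
    P = sigmoid(U @ V.T)                  # predicted association probability
    # Mean binary cross-entropy over all lncRNA-disease pairs.
    losses.append(-(A * np.log(P) + (~A) * np.log(1 - P)).mean())
    G = (P - A) / A.size                  # gradient of the loss wrt the logits
    U, V = U - lr * (G @ V), V - lr * (G.T @ U)
```

Scoring unseen pairs then amounts to reading off `P[i, j]` and ranking diseases per lncRNA, which is what the HR(k) metric in the table evaluates.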
Figure 12. Loss function and performance metric over training epochs, used to avoid the overfitting problem of deep networks. When model performance on a validation set diminishes relative to performance on the training set, overfitting is indicated.
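The criterion Figure 12 describes is usually operationalized as early stopping: halt training once the validation loss stops improving even though the training loss keeps falling. A minimal sketch, with the `patience` value chosen arbitrarily for illustration:

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch at which training should stop.

    Stops once the validation loss has failed to improve on its best
    value for `patience` consecutive epochs (the Figure 12 criterion:
    validation performance diminishing while training continues).
    """
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch        # stop here; restore weights from best_epoch
    return len(val_losses) - 1  # patience never exhausted

stop = early_stopping([1.0, 0.8, 0.7, 0.72, 0.75, 0.9, 1.1])
```

In practice the model weights from `best_epoch`, not from the stopping epoch, are the ones kept for deployment.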