Simon Orozco-Arias, Gustavo Isaza, Romain Guyot, Reinel Tabares-Soto.
Abstract
BACKGROUND: Transposable elements (TEs) constitute the most common repeated sequences in eukaryotic genomes. Recent studies demonstrated their deep impact on species diversity, adaptation to the environment and diseases. Although there are many conventional bioinformatics algorithms for detecting and classifying TEs, none have achieved reliable results on different types of TEs. Machine learning (ML) techniques can automatically extract hidden patterns and novel information from labeled or non-labeled data and have been applied to solving several scientific problems.
Keywords: Bioinformatics; Classification; Deep learning; Detection; Machine learning; Retrotransposons; Transposable elements
Year: 2019 PMID: 31976169 PMCID: PMC6967008 DOI: 10.7717/peerj.8311
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. PRISMA flow diagram.
PRISMA flow chart for search and article screening process. From: Moher et al. (2009).
Figure 2. Stages of the systematic literature review process.
Based on Wen et al. (2012).
Literature resources used in this review.
| Database | Link |
|---|---|
| Scopus | |
| ScienceDirect | |
| Web of Science | |
| SpringerLink | |
| PubMed | |
| Nature | |
Figure 3. Classification of TEs following Rexdb and GyDB nomenclatures.
Adapted from: Schietgat et al. (2018).
Selected publications and their contribution to each research question.
| Publication identifier | Year | Q1 | Q2 | Q3 | Q4 | References | Publication identifier | Year | Q1 | Q2 | Q3 | Q4 | References |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P1 | 2017 | X | X | X | X | P19 | 2013 | X | X | X | |||
| P2 | 2018 | X | X | X | P20 | 2014 | X | X | |||||
| P3 | 2017 | X | X | X | P21 | 2010 | X | X | |||||
| P4 | 2013 | X | X | X | P22 | 2010 | X | X | |||||
| P5 | 2011 | X | X | X | P23 | 2019 | X | ||||||
| P6 | 2018 | X | X | X | P24 | 2015 | X | X | X | X | |||
| P7 | 2019 | X | X | X | P25 | 2018 | X | X | X | ||||
| P8 | 2018 | X | X | X | P26 | 2018 | X | X | X | ||||
| P9 | 2018 | X | X | P27 | 2009 | X | |||||||
| P10 | 2012 | X | X | X | X | P28 | 2019 | X | X | ||||
| P11 | 2017 | X | X | X | P29 | 2017 | X | X | X | X | |||
| P12 | 2014 | X | X | X | X | P30 | 2014 | X | X | X | |||
| P13 | 2016 | X | X | X | P31 | 2013 | X | ||||||
| P14 | 2018 | X | X | X | P32 | 2019 | X | ||||||
| P15 | 2011 | X | X | X | P33 | 2014 | X | X | |||||
| P16 | 2017 | X | X | X | P34 | 2013 | X | X | X | X | |||
| P17 | 2017 | X | X | X | P35 | 2019 | X | X | |||||
| P18 | 2018 | X | X | X |
Figure 4. Number of relevant publications found per year.
Figure 5. Source of selected publications.
(A) Percentage of publications in each source. (B) Distribution of publications in journals.
Machine learning algorithms used in publications selected in this study.
| Publication | Data source | Task | ML algorithm | Learning method | References |
|---|---|---|---|---|---|
| P2 | Numerical and categorical features based on coding regions | Detect LTR Retrotransposons at the super-family level | RF | Supervised | |
| P3 | Numerical and categorical features | Classify LTR Retrotransposons at the lineage level | DT, BN and lazy algorithms | Supervised | |
| P4 | Numerical and categorical features | Improve the detection and classification of TEs | NN, BN, RF, DT | Supervised | |
| P5 | Numerical features based on structure | Detect boundary sequences of mobile elements | HMM, SVM | Unsupervised and Supervised | |
| P6 | 85 Numerical features in four categories (genomic, epigenetic, expression, network) | Detection of cancer-related long non-coding RNA | RF, NB, SVM, LR and KNN | Supervised | |
| P8 | Z-score features, representing chromosome arm gains and losses | Detection of aneuploidy | SVM | Supervised | |
| P10 | K-mer frequencies and frequencies of certain patterns | Distinguishing endogenous retroviral LTRs from SINEs | RF | Supervised | |
| P11 | Dinucleotide frequencies | Identification and clustering of RNA structure motifs | Density-based clustering | Unsupervised | |
| P12 | Sequences of nucleotides (DNA) and categorical features | Automatization of the process of extracting discriminatory features for determining functional properties of biological sequences | Evolutionary feature construction and evolutionary feature selection | Unsupervised | |
| P14 | Numerical features | Analysis of mutants | RF | Supervised | |
| P15 | Insertion sites | Identification of potential insertion sites of mobile elements | SVM | Supervised | |
| P16 | Numerical features | Identification of somatic LINE-1 insertions | LR | Supervised | |
| P17 | Numerical features, RNA mononucleotides, dinucleotides and trinucleotides frequencies, Fickett score | Identification of most informative features of long non-coding transcripts | 11 different feature selection approaches, SVM, RF, and NB | Supervised | |
| P19 | Numerical and categorical features | Improve the detection and classification of TEs | NN, BN, RF, DT | Supervised | |
| P21 | K-mer frequencies | Classify repetitive sequences | SVM | Supervised | |
| P22 | Numerical features | Prediction of microRNA precursors | SVM | Supervised | |
| P24 | Sequences of nucleotides (DNA) | Detecting repeats de novo | HMM | Supervised | |
| P26 | K-mer frequencies | Classify TEs using hierarchical approaches | DT, RF, NB, KNN, MLP, SVM | Supervised | |
| P27 | K-mer frequencies | Classify TEs | SVM | Supervised | |
| P28 | Numerical features based on structure | Identify sequence motifs conserved in each of the five major TIR superfamilies | NN, KNN, RF, and Adaboost | Supervised | |
| P30 | Numerical features and k-mer frequencies | piRNA prediction | SVM | Supervised | |
| P31 | Aligned genomes and binary representation (1 for mismatches and 0 for matches) | Recognition of local relationship patterns | HMM, SOM | Unsupervised | |
| P32 | Numerical features | Compare multiple transposon insertion sequencing studies | PCA | Unsupervised | |
| P33 | Numerical and categorical features, nucleotide frequencies | Classify the precursors of small non-coding RNAs | RF | Supervised | |
| P34 | Normalized numerical and categorical features | Prediction of transcriptional effects by intronic endogenous retroviruses | MLP NN | Supervised |
Note:
RF, Random Forest; DT, Decision Trees; BN, Bayesian Networks; NN, Neural networks; HMM, Hidden Markov Model; SVM, Support Vector Machine; NB, Naïve Bayes; LR, Logistic Regression; KNN, K-Nearest Neighbors; SOM, Self-Organizing Map; PCA, Principal Component Analysis; MLP, Multi-Layer Perceptron; FORF, first-order random forests. The full version of this table can be consulted in Table S1.
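Several of the publications in the table above use k-mer frequencies as input features. As a minimal sketch of what such a feature vector can look like (the function name `kmer_frequencies` and the choice of relative rather than absolute frequencies are assumptions for illustration, not taken from any of the listed publications):

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=2):
    """Return the relative frequency of every possible DNA k-mer.

    Hypothetical helper: windows containing characters other than
    A, C, G or T (for example N) are ignored.
    """
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Fix the feature order so that every sequence yields the same vector layout.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


freqs = kmer_frequencies("ACGTACGT", k=2)
```

The resulting dictionary has 4^k entries, so the feature vector has the same length for every input sequence, which is what the supervised algorithms listed above require.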
Figure 6. Source of selected publications.
(A) Proportion of publications using supervised and unsupervised learning. (B) Supervised learning algorithms found in publications. Abbreviations: Random Forest (RF), Decision Trees (DT), Bayesian Networks (BN), Neural networks (NN), Hidden Markov Model (HMM), Support Vector Machine (SVM), Naïve Bayes (NB), Logistic Regression (LR), K-Nearest Neighbors (KNN), and Multi-Layer Perceptron (MLP).
Figure 7. Overall workflow in supervised learning ML tasks applied to TEs.
Deep learning architectures used in genomic data reviewed in Eraslan et al. (2019). Architecture details used in each work can be consulted in Table S2.
| Dataset features | Task | DNN type | Framework or language | Year | References |
|---|---|---|---|---|---|
| Presence of binding motifs of splice factors or sequence conservation | Predict the percentage of spliced exons | Fully connected NN | TensorFlow | 2017 | Jha, Gazzara & Barash (2017) |
| Numerical features, k-mer frequencies ( | Prioritize potential disease-causing genetic variants | Fully connected NN | Matlab | 2016 | Liu et al. (2016) |
| Chromatin marks, gene expression and evolutionary conservation | Predict cis-regulatory elements | Fully connected NN | Python | 2018 | Li, Shi & Wasserman (2018) |
| Microarray and sequencing data | Predict binarized in vitro and in vivo binding affinities | Convolutional NN | Python + CUDA | 2015 | Alipanahi et al. (2015) |
| A 1,000 bp sequence | Predict the presence or absence of 919 chromatin features | Convolutional NN | LUA | 2015 | Zhou & Troyanskaya (2015) |
| A 600bp sequence (one-hot matrix) | Predict 164 binarized DNA accessibility features | Convolutional NN | Torch7 | 2016 | Kelley, Snoek & Rinn (2016) |
| DNA sequence (one-hot matrix) | Classify transcription factor binding sites | Convolutional NN | Torch7 | 2018 | Wang et al. (2018) |
| DNA sequence (one-hot matrix) | Predict molecular phenotypes such as chromatin features | Convolutional NN | TensorFlow | 2018 | Kelley et al. (2018) |
| DNA sequence (one-hot matrix) and DNAse signal | DNA contact maps | Convolutional NN | Python | 2018 | Schreiber et al. (2018) |
| DNA sequence (one-hot matrix) and DNAse signal | DNA methylation | Convolutional NN | Theano + Keras | 2017 | Angermueller et al. (2017) |
| DNA sequences | Transform genomic sequences to epigenomic features | Convolutional NN | PyTorch | 2018 | Zhou et al. (2018) |
| K-mer frequencies and their positions | Predict translation efficiency | Convolutional NN | Keras | 2017 | Cuperus et al. (2017) |
| DNA sequence (one-hot matrix) and DNAse signal | Predict RNA-binding proteins | Convolutional NN | TensorFlow | 2018 | Budach & Marsico (2018) |
| Numerical features | Predict microRNA (miRNA) targets | Convolutional NN | – | 2016 | Cheng et al. (2015) |
| Numerical features | Aggregate the outputs of CNNs for predicting single-cell DNA methylation state | Recurrent NN | Theano + Keras | 2017 | Angermueller et al. (2017) |
| RNA sequence (one-hot matrix) | Predict RNA-binding proteins | Recurrent NN | Keras | 2018 | Pan et al. (2018) |
| DNA sequence (one-hot matrix) | Predict transcription factor binding and DNA accessibility | Recurrent NN | Theano + Keras | 2019 | Quang & Xie (2019) |
| RNA sequence (weight matrices) | Predict the occurrence of precursor miRNAs from the mRNA sequence | Recurrent NN | Theano + Keras | 2016 | Park et al. (2016) |
| Gene expression level (binary, over or under-expressed) | Predict binarized gene expression given the expression of other genes | Graph-convolutional NN | Torch7 | 2018 | Dutil et al. (2018) |
| Gene expression profile and protein-protein interaction network | Classify cancer subtypes | Graph-convolutional NN | – | 2017 | Rhee, Seo & Kim (2017) |
Figure 8. Overall FNN architecture used by Nakano et al. to classify TEs.
Based on Nakano et al. (2018b).
Figure 9. Overall CNN architecture used by da Cruz et al. to classify TEs.
Based on Da Cruz et al. (2019).
Metrics used in TEs and other similar tasks.
Adapted from (Kamath, De Jong & Shehu, 2014; Brayet et al., 2014; Ma, Zhang & Wang, 2014; Yu, Yu & Pan, 2017; Smith et al., 2017; Chen et al., 2018; Schietgat et al., 2018; Segal et al., 2018). D for detection and C for classification.
| Metric | Representation | Observations | Tasks in which it was used |
|---|---|---|---|
| Accuracy | | Measures the percentage of samples that are correctly classified | D, C |
| Precision | | Percentage of positive predictions that are correct | D |
| Sensitivity (recall) | | Represents the proportion of positive samples that are correctly predicted | D, C |
| Specificity | | Represents the proportion of negative samples that are correctly predicted | D |
| Matthews correlation coefficient | | A balanced measurement that remains informative even when the numbers of positive and negative samples differ greatly | D |
| Positive predictive value | | Percentage of correctly classified positive samples among all positive-classified ones | D, C |
| Performance coefficient | | Ratio of correct predictions belonging to the positive class and predictions belonging to the false class | D |
| F1 score | | Harmonic mean of precision and sensitivity | D |
| Precision-recall curves | Graphics | Plots the precision of a model as a function of its recall | D, C |
| Receiver operating characteristic curves (ROCs) | Graphics | Commonly used to evaluate the discriminative power of the classification model at different thresholds | C |
| Area under the ROC curve (AUC) | Graphics | Summary measure that indicates whether prediction performance is close to random (0.5) or perfect (1.0). Also describes the sensitivity vs. the specificity of the prediction | D, C |
| Area under the Precision-Recall (auPRC) | Graphics | Measures the fraction of negatives misclassified as positives and plots the precision vs. recall ratio | D |
| False positive rate | 1–Specificity | Percentage of predictions marked as belonging to the positive class, but that are part of the negative class. | D |
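The count-based metrics in the table above can all be derived from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name `binary_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration. The performance coefficient is left out because its exact formula is not given in the table.

```python
import math


def _ratio(numerator, denominator):
    """Divide safely; returning 0.0 for an empty denominator is an assumption."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute standard metrics from confusion-matrix counts (hypothetical helper)."""
    precision = _ratio(tp, tp + fp)    # identical to positive predictive value
    sensitivity = _ratio(tp, tp + fn)  # also called recall
    specificity = _ratio(tn, tn + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": _ratio(fp, fp + tn),  # equals 1 - specificity
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

With strongly imbalanced classes, accuracy can stay high while the Matthews correlation coefficient drops, which is why the table describes the latter as a balanced measurement.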
Figure 10. Equations for hierarchical metrics. Z and C correspond, respectively, to the set of true and predicted classes for an instance i.
(A) Hierarchical precision, (B) hierarchical recall and (C) hierarchical F1-score.
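The equations themselves are not reproduced in this text, so the following is only a sketch of how hierarchical precision, recall and F1-score are commonly computed. It follows the caption's convention (Z is the set of true classes and C the set of predicted classes for instance i) and assumes that each set contains the assigned class together with all of its ancestors in the hierarchy. The function name `hierarchical_scores` and the example labels are hypothetical.

```python
def hierarchical_scores(true_sets, predicted_sets):
    """Return (hP, hR, hF) summed over all instances.

    true_sets[i] plays the role of Z_i and predicted_sets[i] the role of C_i.
    Assumption: each set holds a class plus all of its ancestor classes.
    """
    overlap = sum(len(z & c) for z, c in zip(true_sets, predicted_sets))
    n_predicted = sum(len(c) for c in predicted_sets)
    n_true = sum(len(z) for z in true_sets)
    h_precision = overlap / n_predicted if n_predicted else 0.0
    h_recall = overlap / n_true if n_true else 0.0
    if h_precision + h_recall == 0:
        return h_precision, h_recall, 0.0
    h_f1 = 2 * h_precision * h_recall / (h_precision + h_recall)
    return h_precision, h_recall, h_f1


# Illustrative labels only: a wrong superfamily and a too-shallow prediction.
true_sets = [{"ClassI", "LTR", "Copia"}, {"ClassI", "LTR", "Gypsy"}]
predicted_sets = [{"ClassI", "LTR", "Gypsy"}, {"ClassI", "LTR"}]
hp, hr, hf = hierarchical_scores(true_sets, predicted_sets)
```

Unlike flat accuracy, these scores give partial credit when a prediction is correct at the upper levels of the hierarchy but wrong or missing at the deepest level.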
Coding schemes for translating DNA characters in numerical representations. Adapted from (Yu, Yu & Pan, 2017).
| Encoding schemes | Codebook | References |
|---|---|---|
| DAX | {‘C’:0, ‘T’:1, ‘A’:2, ‘G’:3} | Yu et al. (2015) |
| EIIP | {‘C’:0.1340, ‘T’:0.1335, ‘A’:0.1260, ‘G’:0.0806} | Nair & Sreenadhan (2006) |
| Complementary | {‘C’:-1, ‘T’:-2, ‘A’:2, ‘G’:1} | Akhtar et al. (2008) |
| Enthalpy | {‘CC’:0.11, ‘TT’:0.091, ‘AA’:0.091, ‘GG’:0.11, ‘CT’:0.078, ‘TA’:0.06, ‘AG’:0.078, ‘CA’:0.058, ‘TG’:0.058, ‘CG’:0.119, ‘TC’:0.056, ‘AT’:0.086, ‘GA’:0.056, ‘AC’:0.065, ‘GT’:0.065, ‘GC’:0.1111} | Kauer & Blöcker (2003) |
| Galois (4) | {‘CC’:0.0, ‘CT’:1.0, ‘CA’:2.0, ‘CG’:3.0, ‘TC’:4.0, ‘TT’:5.0, ‘TA’:6.0, ‘TG’:7.0, ‘AC’:8.0, ‘AT’:9.0, ‘AA’:10.0, ‘AG’:11.0, ‘GC’:12.0, ‘GT’:13.0, ‘GA’:14.0, ‘GG’:15.0} | Rosen (2006) |
| Orthogonal (one-hot) Encoding | {‘A’: [1, 0, 0, 0], ‘C’: [0, 1, 0, 0], ‘T’: [0, 0, 1, 0], ‘G’: [0, 0, 0, 1]} | Baldi et al. (2001) |
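As a minimal sketch of how the single-nucleotide codebooks in the table above can be applied, the example below copies the DAX, EIIP, Complementary and one-hot values from the table and maps a sequence position by position. The function name `encode` is an assumption for illustration; the dinucleotide schemes (Enthalpy, Galois) are left out because the table does not say whether overlapping or non-overlapping pairs are used.

```python
# Codebooks copied from the table above.
DAX = {"C": 0, "T": 1, "A": 2, "G": 3}
EIIP = {"C": 0.1340, "T": 0.1335, "A": 0.1260, "G": 0.0806}
COMPLEMENTARY = {"C": -1, "T": -2, "A": 2, "G": 1}
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "T": [0, 0, 1, 0],
    "G": [0, 0, 0, 1],
}


def encode(sequence, codebook):
    """Translate each nucleotide with the given codebook (hypothetical helper).

    Raises KeyError for characters missing from the codebook, such as N.
    """
    return [codebook[base] for base in sequence.upper()]


dax_vector = encode("ACGT", DAX)
one_hot_matrix = encode("acgt", ONE_HOT)
complementary_vector = encode("GATC", COMPLEMENTARY)
```

The scalar schemes yield one number per base, while one-hot encoding yields a four-column matrix with one row per base, the input layout listed for most of the convolutional networks in the deep learning table.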