Simon Orozco-Arias, Gustavo Isaza, Romain Guyot.
Abstract
Transposable elements (TEs) are genomic units able to move within the genome of virtually all organisms. Because of their naturally repetitive nature and high structural diversity, the identification and classification of TEs remain a challenge in sequenced genomes. Although TEs were initially regarded as "junk DNA", it has been demonstrated that they play key roles in chromosome structure, gene expression and regulation, as well as adaptation and evolution. A highly reliable annotation of these elements is therefore crucial to better understand genome functions and their evolution. To date, many bioinformatics tools have been developed to address TE detection and classification, but many problematic aspects remain, such as the reliability, precision, and speed of the analyses. Machine learning and deep learning are approaches that can make automatic predictions and decisions in a wide variety of scientific applications. They have been tested in bioinformatics and, more specifically, for TE classification, with encouraging results. In this review, we discuss important aspects of TEs, such as their structure, importance in the evolution and architecture of the host, and their current classifications and nomenclatures. We also address current methods and their limitations in identifying and classifying TEs.
Keywords: bioinformatics; classification; deep learning; detection; function; machine learning; retrotransposons; structure; transposable elements
Year: 2019 PMID: 31390781 PMCID: PMC6696364 DOI: 10.3390/ijms20153837
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Transposable element domains and their function in the replication mechanism. Adapted from [6,55]. LTR, long terminal repeat.
| Complete Gene Name | Short Name | Function |
|---|---|---|
| Reverse transcriptase | RT | Responsible for DNA synthesis using RNA as a template |
| Ribonuclease H | RNase H | Responsible for the degradation of the RNA template in the DNA-RNA hybrid |
| Integrase | INT | Responsible for catalyzing the insertion of the retrotransposon cDNA into the genome of a host cell |
| Aspartic protease | AP | Responsible for processing large transposon transcripts into smaller protein products |
| Envelope | ENV | Responsible for cell-to-cell transfer of retroviruses |
| Capsid protein | GAG | Structural protein for virus-like particles |
| Chromodomain | CHD | Responsible for targeting the insertion of new LTR retrotransposon copies into heterochromatic regions by recognizing specific heterochromatic histone marks and/or other factors |
Figure 1. Structure of an LTR retrotransposon. The env gene might not be present in some elements. Orange arrows correspond to LTRs.
Figure 2. Structure of non-autonomous elements. Orange arrows correspond to LTRs and single lines correspond to non-coding regions. PBS: primer binding site; PPT: poly-purine tract; TRIM: Terminal-Repeat Retrotransposons in Miniature; LARD: LArge Retrotransposon Derivatives.
Figure 3. Structure of non-LTR retrotransposons.
Figure 4. Structure of Penelope-like elements (PLEs).
Figure 5. Structure of Dictyostelium intermediate repeat sequences (DIRS).
Stress-activated retrotransposons reported in plant genomes. With information from [9,34,84,86,90,91].
| Retrotransposon | Stresses by External Conditions | Species | Reference |
|---|---|---|---|
| | Protoplast and tissue culture, pathogens, pathogen elicitors, compounds related to plant defense, wounding, freezing, in vitro regeneration, mechanical damage, and microbial factors. | Tobacco | [ |
| | Wounding, methyl jasmonate, tissue culture, fungal elicitors, chilling, cytosine demethylation, resistance to bacterial blight, and plant development. | Tobacco | [ |
| | Tissue culture and viral infection. | Rice | [ |
| | Wounding, jasmonic and salicylic acid, UV light, infection with an incompatible race of the crown rust fungus. | Oat | [ |
| | UV light. | Melon | [ |
| | Heat stress. | | [ |
| | Heat stress. | | [ |
| | Hormonal treatments. | Strawberry | [ |
| | Water-induced stress. | Barley, | [ |
| | Phytohormones, wounding, protoplast preparation, high salt concentration and stress-associated signaling molecules. | | [ |
| | Fungal infection. | Wild wheat | [ |
| | Barley stripe mosaic virus infection. | Maize | [ |
| | Cold. | Maize | [ |
| | Wounding and salt stress. | Lemon | [ |
| | Heat shock. | Rice | [ |
| | Interspecific hybridization. | Wheat | [ |
| | Tissue culture. | | [ |
Figure 6. Classification of TEs following REXdb and GyDB nomenclature. Adapted from [26].
Correspondences between names of superfamilies and lineages given for some classification systems and the International Committee on Taxonomy of Viruses (ICTV). Adapted from [16,37,63].
| REXdb a | Wicker and Keller b | GyDB c | ICTV d |
|---|---|---|---|
| Superfamilies | | | |
| Copia | Copia | Ty1/Copia | Pseudoviridae |
| Gypsy | Gypsy | Ty3/Gypsy | Metaviridae |
| Bel-pao | Bel-pao | Bel-pao | Semotiviruses |
| Lineages (Copia) | | | |
| Ale | Ale | Sirevirus/Retrofit | pseudovirus |
| Alesia | Ale | - | - |
| Angela | Angela | - | pseudovirus |
| Bianca | Bianca | - | - |
| Bryco | - | - | - |
| Lyco | - | - | - |
| Gymco-I, II, III, IV | - | - | - |
| Ikeros | Angela | Tork | pseudovirus |
| Ivana | Ivana | Sirevirus/Oryco | - |
| Osser | - | Osser | hemivirus |
| SIRE | Maximus | Sirevirus/SIRE | Sirevirus |
| TAR | TAR | Tork | - |
| Tork | - | Tork | pseudovirus |
| Lineages (Gypsy) | | | |
| chromovirus\|CRM | - | chromoviruses\|CRM | - |
| chromovirus\|Chlamyvir | - | - | - |
| chromovirus\|Galadriel | - | chromoviruses\|Galadriel | - |
| chromovirus\|Reina | - | chromoviruses\|Reina | - |
| chromovirus\|Tekay | - | chromoviruses\|Del | Metavirus (Del1) |
| non-chromovirus\|OTA\|Athila | - | Athila/Tat\|Athila | Metavirus (Athila) |
| non-chromovirus\|OTA\|Tat\|TatI | - | - | - |
| non-chromovirus\|OTA\|Tat\|TatII | - | - | - |
| non-chromovirus\|OTA\|Tat\|TatIII | - | - | - |
| non-chromovirus\|OTA\|Tat\|Ogre | - | Athila/Tat\|Tat (Ogre) | - |
| non-chromovirus\|OTA\|Tat\|Retand | - | Athila/Tat\|Tat | Metavirus (Tat4) |
| non-chromovirus\|Phygy | - | - | - |
| non-chromovirus\|Selgy | - | - | - |
a [16], b [166], c [14], d [167].
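When annotations from different tools are merged, the correspondences in the table above can be applied as a simple lookup. The following is a minimal sketch, not part of any of the cited databases: the dictionary name `REXDB_TO_GYDB` and the helper `to_gydb` are hypothetical, only the rows with a GyDB equivalent are included, and the lineage strings are copied from the table.

```python
# Hypothetical lookup built from the correspondence table above.
# Only REXdb lineages with a GyDB equivalent are listed; rows marked "-" are omitted.
REXDB_TO_GYDB = {
    # Copia lineages
    "Ale": "Sirevirus/Retrofit",
    "Ikeros": "Tork",
    "Ivana": "Sirevirus/Oryco",
    "Osser": "Osser",
    "SIRE": "Sirevirus/SIRE",
    "TAR": "Tork",
    "Tork": "Tork",
    # Gypsy lineages
    "chromovirus|CRM": "chromoviruses|CRM",
    "chromovirus|Galadriel": "chromoviruses|Galadriel",
    "chromovirus|Reina": "chromoviruses|Reina",
    "chromovirus|Tekay": "chromoviruses|Del",
    "non-chromovirus|OTA|Athila": "Athila/Tat|Athila",
    "non-chromovirus|OTA|Tat|Ogre": "Athila/Tat|Tat (Ogre)",
    "non-chromovirus|OTA|Tat|Retand": "Athila/Tat|Tat",
}


def to_gydb(rexdb_lineage):
    """Return the GyDB name for a REXdb lineage, or None if the table lists no equivalent."""
    return REXDB_TO_GYDB.get(rexdb_lineage)


if __name__ == "__main__":
    for name in ("Ikeros", "chromovirus|Tekay", "Bianca"):
        print(name, "->", to_gydb(name))
```

Note that the mapping is many-to-one (Ikeros, TAR, and Tork all correspond to Tork in GyDB), so it cannot be inverted without losing information.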
Bioinformatics software found in the literature. I for identification, C for classification, and O for other analysis; ML for machine learning. With information from [18,29,77].
| Software | Approach | TE Class or Order | Applies ML | Input Format Files | Tasks | Reference |
|---|---|---|---|---|---|---|
| Censor | Homology-based | Any | NO | Any | I | [ |
| Find_ltr | Structure-based, Homology-based | Complete LTR RTs, and solo LTRs | NO | Assembled sequences | I | [ |
| FORRepeats | Homology-based | Any | NO | Any | I | [ |
| Inpactor | Structure-based, Homology-based | LTR RTs | NO | Assembled sequences, LTR_Struc output or REPET output | C, O | [ |
| LTR-FINDER | Structure-based | LTR RTs | NO | Assembled sequences | I | [ |
| LTR_MINER | Structure-based | LTR RTs | NO | RepeatMasker output | I | [ |
| LTR_retriever | Structure-based | LTR RTs | NO | Assembled sequences | I | [ |
| LTR_STRUC | Structure-based | LTR RTs | NO | Assembled sequences | I | [ |
| LTRClassifier | Homology-based | LTR RTs | NO | Assembled sequences | C | [ |
| LTRdigest | Structure-based, Homology-based | LTR RTs | NO | LTRharvest output | C | [ |
| LTRHarvest | Structure-based | LTR RTs | NO | Assembled sequences | I | [ |
| LTRSift | Structure-based | LTR RTs | NO | LTRdigest output | C | [ |
| LTRType | Homology-based | LTR RTs | NO | Assembled sequences | I | [ |
| P-Clouds | De novo | Any | NO | Assembled sequences | I | [ |
| PASTEC | Structure-based, Homology-based | Any | NO | Assembled sequences | C | [ |
| PILER | Structure-based, De novo | Any | NO | Assembled sequences | I | [ |
| RAP | De novo | Any | NO | Assembled sequences | I | [ |
| REannotate | Other | Any | NO | RepeatMasker output | O | [ |
| ReAS | De novo | Any | NO | Unassembled sequence reads | I | [ |
| RECON | De novo | Any | NO | Unassembled and assembled sequences | I | [ |
| Red | De novo (HMM) | Any | YES | Unassembled and assembled sequences | I | [ |
| REDdenovo | De novo | Any | NO | Unassembled sequence reads | I | [ |
| REPCLASS | Structure-based, Homology-based | Any | NO | Assembled sequences | I | [ |
| RepeatExplorer | De novo | Any | NO | Unassembled sequence reads | I | [ |
| RepeatMasker | Homology-based | Any | NO | Assembled sequences | O | |
| RepeatModeler | De novo | Any | NO | Assembled sequences | I | |
| RepeatScout | De novo | Any | NO | Assembled sequences | I | [ |
| Repeat Pattern | De novo | Any | NO | Assembled sequences | I | [ |
| REPET | De novo, Structure-based, | Any | NO | Assembled sequences | I, C | [ |
| Repseek | De novo | Any | NO | Assembled sequences | I | [ |
| REPuter | De novo | Any | NO | Assembled sequences | I | [ |
| TEClass | De novo (SVM) | Any | YES | Assembled sequences | C | [ |
| TEdna | De novo | Any | NO | Unassembled sequence reads | I | [ |
| transposome | De novo | Any | NO | Unassembled sequence reads | I | [ |
TE databases available.
| Database | Genomes | Data Composition | URL |
|---|---|---|---|
| Gypsy database | Several plant genomes | Domains from LTR Retrotransposons | |
| MASiVEdb | Several plant genomes | Sire Retrotransposons | |
| Repbase | Several plant genomes | All TEs | |
| RepPop | | All TEs | |
| RetrOryza | Rice | LTR Retrotransposons | |
| REXdb | Several plant genomes | Domains from LTR Retrotransposons | |
| SINEBase | Several plant genomes | SINEs | |
| SoyTEdb | Soybean | All TEs | |
| TIGR Maize repeat database | Maize | All TEs | |
| TRansposable Elements Platform (TREP) database | Several cereal genomes | All TEs | |
| Plant Genome and System Biology (PGSB) Repeat Database | Several plant genomes | All TEs | |
| RepetDB | Several plant genomes | All TE consensus | |
Coding schemes for the translation of DNA characters into numerical representations. Adapted from [209].
| Coding Schemes | Codebook | Reference |
|---|---|---|
| DAX | {‘C’:0, ‘T’:1, ‘A’:2, ‘G’:3} | [ |
| EIIP | {‘C’:0.1340, ‘T’:0.1335, ‘A’:0.1260, ‘G’:0.0806} | [ |
| Complementary | {‘C’:-1, ‘T’:-2, ‘A’:2, ‘G’:1} | [ |
| Enthalpy | {‘CC’:0.11, ‘TT’:0.091, ‘AA’:0.091, ‘GG’:0.11, ‘CT’:0.078, ‘TA’:0.06, ‘AG’:0.078, ‘CA’:0.058, ‘TG’:0.058, ‘CG’: 0.119, ‘TC’:0.056, ‘AT’:0.086, ‘GA’:0.056, ‘AC’:0.065, ‘GT’:0.065, ‘GC’:0.1111} | [ |
| Galois (4) | {‘CC’:0.0, ‘CT’:1.0, ‘CA’:2.0, ‘CG’:3.0, ‘TC’:4.0, ‘TT’:5.0, ‘TA’:6.0, ‘TG’:7.0, ‘AC’:8.0, ‘AT’:9.0, ‘AA’:10.0, ‘AG’:11.0, ‘GC’:12.0, ‘GT’:13.0, ‘GA’:14.0, ‘GG’:15.0} | [ |
| Orthogonal Encoding | {‘A’: [1, 0, 0, 0], ‘C’: [0, 1, 0, 0], ‘T’: [0, 0, 1, 0], ‘G’: [0, 0, 0, 1]} | [ |
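The coding schemes above all amount to a dictionary lookup applied along the sequence. The sketch below illustrates this under stated assumptions: the function names `encode` and `galois4` are hypothetical, dinucleotide schemes are assumed to be applied with an overlapping sliding window (step 1), and the Galois (4) codes are assumed to run consecutively from 0 to 15 in C, T, A, G order. Ambiguous bases such as N are not handled and raise a `KeyError`.

```python
# Codebooks copied from the table above (single-nucleotide schemes).
DAX = {"C": 0, "T": 1, "A": 2, "G": 3}
EIIP = {"C": 0.1340, "T": 0.1335, "A": 0.1260, "G": 0.0806}
COMPLEMENTARY = {"C": -1, "T": -2, "A": 2, "G": 1}
ORTHOGONAL = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "T": [0, 0, 1, 0],
    "G": [0, 0, 0, 1],
}


def galois4():
    """Build the dinucleotide Galois (4) codebook.

    Assumption: codes are consecutive (0.0 to 15.0) in C, T, A, G order.
    """
    order = "CTAG"
    return {
        first + second: float(4 * i + j)
        for i, first in enumerate(order)
        for j, second in enumerate(order)
    }


def encode(sequence, codebook, k=1):
    """Translate a DNA string into a list of codebook values.

    k is the word length of the scheme (1 for nucleotides, 2 for dinucleotides).
    Assumption: words are read with an overlapping window that advances one base at a time.
    """
    sequence = sequence.upper()
    return [codebook[sequence[i:i + k]] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    print(encode("ACGT", DAX))
    print(encode("ACGT", EIIP))
    print(encode("ACG", galois4(), k=2))
    print(encode("GA", ORTHOGONAL))
```

Scalar schemes such as DAX or EIIP yield one number per position, whereas orthogonal (one-hot) encoding yields a four-element vector per position, which is the usual input shape for convolutional networks.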
Figure 7. Accuracy of machine learning (ML) algorithms tested for TE identification and classification problems. Neural Network and Ridor were each used for only one problem. Adapted from Loureiro et al. [170].
ML algorithms tested by Loureiro et al. [170] in TE identification and classification problems.
| Algorithm (identification) | Accuracy (%) | Algorithm (classification) | Accuracy (%) |
|---|---|---|---|
| Neural Network | 67.01 | Ridor | 96.43 |
| Naïve Bayes Net | 96.30 | Naïve Bayes Net | 96.37 |
| Random Forest | 98.90 | Random Forest | 96.56 |
| Decision Trees | 98.92 | Decision Trees | 96.56 |