Etienne Routhier, Julien Mozziconacci.
Abstract
The tremendous amount of biological sequence data available, combined with recent methodological breakthroughs in deep learning in domains such as computer vision and natural language processing, is transforming bioinformatics through the emergence of deep genomics, the application of deep learning to genomic sequences. We review here the new applications that deep learning enables in the field, focusing on three aspects: the functional annotation of genomes, the sequence determinants of genome functions, and the possibility of writing synthetic genomic sequences.
Keywords: Bioinformatics; Deep learning; Epigenomics; Genetics; Genomics; Metagenomics; Neural networks; Personalized medicine; Review; Synthetic genomes
Year: 2022 PMID: 35769139 PMCID: PMC9235815 DOI: 10.7717/peerj.13613
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 3.061
Figure 1: Positioning of this review within the field of deep learning for biological/biomedical applications.
(A) We adapted the segmentation of the field proposed by Zemouri, Zerhouni & Racoceanu (2019) to position this review. (B) Zoom into the genomics field. This review focuses on the application of deep learning to annotate the genome directly from the DNA sequence (red dashed line boxes).
Overview of studies applying deep learning in genomics, segmented by their usage.
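| Annotation | Usage | Preprocessing | Data | Species | Architecture | Reference |
|---|---|---|---|---|---|---|
| TFBS | Transfer | one-hot-encoding | DNA + gene expression + DNaseI cleavage | human | CNN + RNN | |
| TFBS | Transfer | one-hot-encoding | DNA sequence | human + mouse | CNN | |
| TFBS | Bio. mechanism | one-hot-encoding | DNA sequence | human | CNN | |
| TFBS | Bio. mechanism | one-hot-encoding | DNA sequence | human + mouse + Drosophila | CNN | |
| TFBS | Bio. mechanism | one-hot-encoding | RNA sequence | human | CNN | |
| TFBS | Syn. genomics | one-hot-encoding | DNA sequence | human | RNN + Attention | |
| TFBS | Syn. genomics | one-hot-encoding | DNA sequence | human | CNN | |
| TFBS + histone + chromatin accessibility | Transfer | one-hot-encoding | DNA sequence | human + mouse | CNN | |
| TFBS + histone + chromatin accessibility | Bio. mechanism | one-hot-encoding | DNA sequence | human | CNN | |
| TFBS + histone + chromatin accessibility | Bio. mechanism | one-hot-encoding | DNA sequence | human | CNN | |
| TFBS + histone + chromatin accessibility | Bio. mechanism | one-hot-encoding | DNA sequence | human | CNN | |
| TFBS + histone + chromatin accessibility | Bio. mechanism | one-hot-encoding | DNA sequence | human | CNN | |
| TFBS + histone + chromatin accessibility | Bio. mechanism | one-hot-encoding | DNA sequence | human | CNN | |
| TFBS + histone + chromatin accessibility | Bio. mechanism | one-hot-encoding | DNA sequence | human | CNN | |
| TFBS + histone + chromatin accessibility | Syn. genomics | one-hot-encoding | DNA sequence | human | CNN | |
| TFBS (circRNA) | Bio. mechanism | one-hot-encoding | RNA sequence | human | CNN | |
| chromatin accessibility | Transfer + Bio. mechanism | one-hot-encoding | DNA + gene expression | human | CNN | |
| chromatin accessibility | Bio. mechanism | one-hot-encoding + embedding | DNA sequence | human | CNN | |
| gene expression | Transfer + Bio. mechanism | one-hot-encoding | DNA + TF expression level | yeast | CNN | |
| gene expression | Bio. mechanism | one-hot-encoding | RNA sequence | 7 species | CNN | |
| gene expression | Syn. genomics | one-hot-encoding | RNA sequence | yeast | CNN | |
| gene expression | Syn. genomics | one-hot-encoding | DNA sequence | Random promoters (yeast) | CNN + Attention + RNN | |
| gene expression | Bio. mechanism | one-hot-encoding | DNA + mRNA half-life + CG content + ORF length | human | CNN | |
| gene expression | Bio. mechanism | one-hot-encoding | DNA + promoter-enhancer interaction | human | CNN | |
| gene expression | Bio. mechanism | one-hot-encoding | DNA sequence | human | CNN | |
| gene expression + RNA splicing | Syn. genomics | one-hot-encoding | DNA sequence | human | CNN | |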
Note:
CNN, convolutional neural network; RNN, recurrent neural network. After the pioneering use of CNN in genomics in 2015, the methodologies have diversified according to four different aspects: the model inputs (that may include other annotations on top of the sole DNA sequence), the sequence encoding (mainly one-hot-encoding or k-mer embedding), the neural network architecture (CNN, RNN, Attention mechanism) and the output format, which can be either binary or continuous.
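The two sequence encodings named in the note can be illustrated with a minimal sketch. The function names, the A/C/G/T column order, and the choice to map unknown bases such as "N" to an all-zero row are assumptions made for this example; they are not taken from any of the reviewed studies.

```python
# Minimal sketch of two common DNA sequence encodings.
# Names and conventions here are illustrative assumptions.

ALPHABET = "ACGT"  # assumed column order for the one-hot rows


def one_hot_encode(sequence):
    """Return one 4-element row per nucleotide, in A, C, G, T order.

    Bases outside the alphabet (for example "N") become an all-zero row.
    """
    rows = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        index = ALPHABET.find(base)
        if index >= 0:
            row[index] = 1
        rows.append(row)
    return rows


def kmer_tokens(sequence, k=3):
    """Split a sequence into overlapping k-mers.

    Each k-mer would then be mapped to a learned vector by an embedding layer.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


encoded = one_hot_encode("ACGTN")
tokens = kmer_tokens("ACGTA", k=3)
print(encoded)
print(tokens)
```

A one-hot matrix can be fed directly to a convolutional layer, whereas k-mer tokens require an additional lookup table that maps each token to a vector.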
Figure 2: Different possible uses of deep learning in genomics.
Deep learning models trained with genome annotations together with the underlying genomic sequence (in light blue) can be used for three different applications (in light green): (1) to automatically annotate the genome of a given species and for a given cell type, (2) to determine the sequence determinants of the genome functions by identifying sequence motifs (such as position weight matrices, PWMs) and the effect of sequence variants, or (3) to design artificial sequences.
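The second use, estimating the effect of sequence variants, is often approached by in silico mutagenesis: every single-nucleotide substitution is scored by a trained model and compared with the score of the reference sequence. The sketch below is one way this could look. The `toy_model` function is a hypothetical stand-in for a trained network (it only counts matches to the motif "TATA"), and all names are assumptions made for illustration.

```python
# Minimal sketch of in silico mutagenesis with a toy scoring function.
# toy_model is a hypothetical placeholder for a trained neural network.

ALPHABET = "ACGT"


def toy_model(sequence):
    """Score a sequence by its best match count to the motif "TATA"."""
    motif = "TATA"
    best = 0
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        matches = sum(1 for a, b in zip(window, motif) if a == b)
        best = max(best, matches)
    return best


def in_silico_mutagenesis(sequence, model):
    """Return {(position, alt_base): score change} for every substitution."""
    reference_score = model(sequence)
    effects = {}
    for position, ref_base in enumerate(sequence):
        for alt_base in ALPHABET:
            if alt_base == ref_base:
                continue
            mutant = sequence[:position] + alt_base + sequence[position + 1:]
            effects[(position, alt_base)] = model(mutant) - reference_score
    return effects


effects = in_silico_mutagenesis("GGTATAGG", toy_model)

# Positions where at least one substitution lowers the score
sensitive = sorted({pos for (pos, _), delta in effects.items() if delta < 0})
print(sensitive)
```

With a real model, the positions whose substitutions change the prediction most tend to outline the motifs the network relies on, which is how such maps are typically read.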
Overview of studies developing deep learning methodologies in genomics.
| Annotation | Usage | Preprocessing | Data | Species | Architecture | Reference |
|---|---|---|---|---|---|---|
| epigenetic mark | Bio. mechanism | one-hot-encoding | DNA + chromatin accessibility | human | CNN | |
| epigenetic mark | Bio. mechanism | one-hot-encoding | DNA + CpG neighborhood of cells | human | CNN + RNN | |
| epigenetic mark | Bio. mechanism | one-hot-encoding | DNA sequence | human | CNN | |
| epigenetic mark | Bio. mechanism | one-hot-encoding | RNA sequence | human + mouse + zebrafish | CNN + RNN | |
| polyadenylation | Bio. mechanism | one-hot-encoding | DNA sequence | | CNN | |
| polyadenylation | Bio. mechanism | one-hot-encoding | DNA sequence | human | CNN | |
| polyadenylation | Syn. genomics | one-hot-encoding | DNA sequence | human | CNN | |
| polyadenylation + translation initiation site | Transfer | one-hot-encoding | DNA sequence | human + mouse + bovine + Drosophila | CNN | |
| splicing | Bio. mechanism | one-hot-encoding | DNA sequence | human | CNN | |
| splicing | Bio. mechanism | one-hot-encoding | DNA sequence | human | CNN | |
| splicing | Bio. mechanism | one-hot-encoding | DNA sequence | human | CNN | |
| splicing | Bio. mechanism | one-hot-encoding | RNA sequence | human | CNN | |
| splicing | Bio. mechanism | one-hot-encoding | RNA sequence | human | CNN | |
| 3D architecture | Bio. mechanism | one-hot-encoding | DNA sequence | human | CNN | |
| 3D architecture | Bio. mechanism | one-hot-encoding | DNA sequence | human | CNN | |
| 3D architecture | Bio. mechanism | one-hot-encoding | DNA sequence | human | CNN + RNN | |
| 3D architecture | Bio. mechanism | one-hot-encoding | DNA sequence | human + mouse | CNN | |
| nucleosome | Bio. mechanism | one-hot-encoding | DNA sequence | yeast | CNN | |
| nucleosome + TFBS | Bio. mechanism | one-hot-encoding | DNA sequence | yeast + human | CNN | |
| enhancer | Transfer | embedding | DNA sequence | 6 species | CNN | |
| enhancer | Bio. mechanism + Syn. genomics | one-hot-encoding | DNA sequence | 6 species | CNN + RNN | |
| enhancer | Bio. mechanism | one-hot-encoding | DNA sequence | human | CNN | |
| promoter | Transfer | one-hot-encoding | DNA sequence | 5 species | CNN | |
| promoter + enhancer + TFBS + chromatin accessibility | Bio. mechanism | one-hot-encoding | DNA sequence | human | CNN | |
| translation initiation site | Bio. mechanism | one-hot-encoding | DNA sequence | human | CNN + RNN | |
| sgRNA binding site | Syn. genomics | one-hot-encoding | DNA + TFBS + epigenetic + accessibility | human | CNN | |
| sgRNA binding site | Syn. genomics | one-hot-encoding | DNA sequence | human + mouse | CNN | |
| Virus integration | Bio. mechanism | one-hot-encoding | DNA sequence | human | CNN + Attention | |