Wardah S Alharbi, Mamoon Rashid.
Abstract
Genomics is advancing towards data-driven science. With the advent of high-throughput data-generating technologies in human genomics, we are overwhelmed by the volume of genomic data. Artificial intelligence, especially deep learning, has been instrumental in extracting knowledge and patterns from these data. In this review, we address the development and application of deep learning methods and models in different subareas of human genomics. We assess which areas of genomics have been over- and under-charted by deep learning techniques. The deep learning algorithms underlying the genomic tools are discussed briefly in the later part of the review. Finally, we briefly discuss the delayed application of deep learning tools in genomics. In conclusion, this review is timely for biotechnology and genomics scientists, guiding them on why, when and how to use deep learning methods to analyse human genomic data.
Keywords: Deep learning applications; Disease variants; Epigenomics; Gene expression; Human genomics; NGS; Pharmacogenomics; Variant calling
Year: 2022 PMID: 35879805 PMCID: PMC9317091 DOI: 10.1186/s40246-022-00396-x
Source DB: PubMed Journal: Hum Genomics ISSN: 1473-9542 Impact factor: 6.481
Fig. 1 Timeline of implementing deep learning algorithms in genomics. This timeline plot demonstrates the delay in implementing DL tools in genomics; for example, both the LSTM and BLSTM algorithms were invented in 1997, and their first genomic application was implemented in 2015. Similar delays are observed for the rest of the deep learning algorithms (Table 6)
Table 6 Deep learning algorithms in genomics and their original development and applications
| ANN Algorithms | Natural Language Processing (NLP) | Feedforward neural network | Convolutional neural network (CNN) | Recurrent neural networks (RNNs) | Bidirectional long short-term memory networks (BLSTMs) | Long short-term memory networks (LSTMs) | Gated recurrent unit (GRU) |
|---|---|---|---|---|---|---|---|
| Algorithm Inventor | Applied in dictionary look-up system developed at Birkbeck College, London | Frank Rosenblatt | Fukushima (named the “neocognitron”) | Rumelhart, Hinton and Williams | Schuster and Paliwal | Hochreiter and Schmidhuber | Cho et al |
| Year of Development | 1948 | 1958 | 1980 | 1986 | 1997 | 1997 | 2014 |
| Year of First Use in Genomics | 1996 | 1993 | 2015 | 2005 | 2015 | 2015 | 2017 |
| First User in Genomics | Schuler et al | S Eskiizmililer | Alipanahi et al | Maraziotis, Dragomir and Bezerianos | Quang and Xie | Quang and Xie | Angermueller et al |
| First Genomic Application | Entrez databases | Karyotyping architecture based on Artificial Neural Networks | DeepBind | Predicting the complicated causative associations between genes from microarray datasets based on recurrent neuro-fuzzy technique | DanQ model | DanQ model | DeepCpG |
| Genomic Function Exemplar(s) | Genetic counsellors AI-based chatbots and EPIs prediction | Karyotyping, Prenatal diagnostic for early detection of aneuploidy syndrome | Prediction of variant impacts on expression and disease risk, predicting drug response of tumours from genomic profiles, and pharmacogenomics | Predicting transcription factor binding sites, for Alignment and SNV identification | DNA function predictions and prediction of protein localisation, predict miRNA precursor | Enhancer–promoter interaction (EPI) prediction | Enhancers and methylation states predictions |
| Landmark References | [ | [ | [ | [ | [ | [ | [ |
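
The delay described in the Fig. 1 caption can be read directly off the table above by subtracting the year of development from the year of first genomic use. The following is a minimal sketch of that arithmetic; the years are transcribed from the table, while the dictionary and function names are illustrative and not part of the original review.

```python
# Years transcribed from the table above:
# algorithm -> (year of development, year of first use in genomics).
ALGORITHM_YEARS = {
    "NLP": (1948, 1996),
    "Feedforward neural network": (1958, 1993),
    "CNN": (1980, 2015),
    "RNN": (1986, 2005),
    "BLSTM": (1997, 2015),
    "LSTM": (1997, 2015),
    "GRU": (2014, 2017),
}


def adoption_lags(years):
    """Return {algorithm: years between invention and first genomic use}."""
    return {
        name: first_use - developed
        for name, (developed, first_use) in years.items()
    }


lags = adoption_lags(ALGORITHM_YEARS)

# Print the algorithms from longest to shortest delay.
for name, lag in sorted(lags.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {lag} years")
```

With these values, the gap shrinks from 48 years for NLP to 3 years for GRU, which is consistent with the delay the timeline figure is meant to show.
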
Fig. 2 Deep learning applications in genomics. This figure presents the application of deep learning tools in five major subareas of genomics. For each subarea, one example deep learning tool and its underlying network architecture are shown, together with its input data type and predictive output. Each bar plot depicts the frequency of the most used deep learning algorithms underlying the tools in that subarea of genomics (Tables 1, 2, 3, 4, 5)
Genomic tools/algorithms based on deep learning architectures for variant calling and annotations
| Tools | DL model | Application | Input/Output | Website Code Source | References |
|---|---|---|---|---|---|
| Clairvoyante | CNN | To predict variant type, zygosity, alternative allele and Indel length | BAM/VCF | [ | |
| DeepVariant | CNN | To call genetic variants from next-generation DNA sequencing data | BAM,CRAM/VCF | [ | |
| GARFIELD-NGS | DNN + MLP | To classify true and false variants from WES data | VCF/VCF | [ | |
| Intelli-NGS | ANN | To define good and bad variant calls from Ion Torrent sequencer data | VCF/xlsx | [ | |
| DAVI (Deep Alignment and Variant Identification) | CNN + RNN | To identify variants in NGS reads | FASTQ/VCF | [ | |
| DeepSV | CNN | To call genomic deletions by visualising sequence reads | BAM/VCF | [ |
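
The bar plots in Fig. 2 summarise how often each deep learning algorithm appears among the tools of a subarea. A tally of that kind can be sketched from the "DL model" column of the variant-calling table above. How the review's authors counted hybrid models is not stated, so this sketch assumes that each component of a hybrid (for example `CNN + RNN`) counts once; the function name is illustrative.

```python
import re
from collections import Counter

# "DL model" column of the variant-calling table above, keyed by tool name.
VARIANT_CALLING_MODELS = {
    "Clairvoyante": "CNN",
    "DeepVariant": "CNN",
    "GARFIELD-NGS": "DNN + MLP",
    "Intelli-NGS": "ANN",
    "DAVI": "CNN + RNN",
    "DeepSV": "CNN",
}


def count_architectures(models):
    """Count how many tools use each architecture.

    Assumption: a hybrid label such as "CNN + RNN" contributes one count
    to each of its components.
    """
    counts = Counter()
    for label in models.values():
        parts = [part.strip() for part in re.split(r"[+,]", label)]
        counts.update(part for part in parts if part)
    return counts


counts = count_architectures(VARIANT_CALLING_MODELS)
for architecture, n_tools in counts.most_common():
    print(f"{architecture}: {n_tools}")
```

Under this counting rule, CNN is the most frequent architecture among the listed variant-calling tools, appearing in four of the six.
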
Genomic tools/algorithms based on deep learning architectures for disease variants
| Tools | DL model | Application | Input/Output | Website Code Source | References |
|---|---|---|---|---|---|
| DeepPVP (PhenomeNet Variant Predictor) | ANN | To identify variants in whole-exome or whole-genome sequence data | VCF / VCF | [ | |
| ExPecto | CNN | Accurately predict tissue-specific transcriptional effects of mutations/functional SNPs | VCF/ CSV | [ | |
| PEDIA (Prioritisation of exome data by image analysis) | CNN | To prioritise variants and genes for diagnosis of patients with rare genetic disorders | VCF / CSV | [ | |
| DeepMILO (Deep learning for Modeling Insulator Loops) | CNN + RNN | To predict the impact of non-coding sequence variants on 3D chromatin structure | FASTA / TSV | [ | |
| DeepWAS | CNN | To identify disease or trait-associated SNPs | TSV / TSV | [ | |
| PrimateAI | CNN | To classify the pathogenicity of missense mutations | CSV / CSV + txt | [ | |
| DeepGestalt | CNN | To identify facial phenotypes of genetic disorders | Image / txt | Available through the Face2Gene application | [ |
| DeepMiRGene | RNN, LSTM | To predict miRNA precursor | FASTA / Cross-Validation (CV)-Splits file | [ | |
| Basset | CNN | To predict the causative SNP with sets of related variants | BED, FASTA/ VCF | [ |
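
Several tools in the tables above take or emit VCF files. As a rough illustration of what such input or output looks like programmatically, the sketch below parses the eight fixed columns of a single VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). It is not taken from any of the listed tools; the `VcfRecord` class, the function name, and the example line are invented for illustration, and real pipelines would normally rely on an established VCF library.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class VcfRecord:
    chrom: str
    pos: int
    ref: str
    alts: List[str]
    qual: Optional[float]
    filters: List[str]
    info: Dict[str, Union[str, bool]]


def parse_vcf_line(line: str) -> VcfRecord:
    """Parse the eight fixed, tab-separated columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")
    chrom, pos, _id, ref, alt, qual, filt, info = fields[:8]

    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    info_dict: Dict[str, Union[str, bool]] = {}
    if info != ".":
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            info_dict[key] = value if sep else True

    return VcfRecord(
        chrom=chrom,
        pos=int(pos),
        ref=ref,
        alts=[] if alt == "." else alt.split(","),   # "." means no ALT allele
        qual=None if qual == "." else float(qual),   # "." means missing quality
        filters=[] if filt == "." else filt.split(";"),
        info=info_dict,
    )


# Invented example line, not real data.
example = "chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=30;DB"
record = parse_vcf_line(example)
print(record)
```
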
Genomic tools/algorithms based on deep learning architectures for gene expression regulation
| Tools | DL model | Application | Input/Output | Website Code Source | References |
|---|---|---|---|---|---|
| DanQ | CNN + BLSTM | To predict DNA function directly from sequence data | .mat /.mat | [ | |
| SPEID | CNN + LSTM | For enhancer–promoter interaction (EPI) prediction | .mat /.mat | [ | |
| EP2vec | NLP + GBRT | To predict enhancer–promoter interactions (EPIs) | CSV / CSV | [ | |
| D-GEX (deep learning for gene expression) | FNN | To understand the expression of target genes from the expression of landmark genes | .cel, txt, BAM / txt | [ | |
| DeepExpression | CNN | To predict gene expression using promoter sequences and enhancer–promoter interactions | .txt /.txt | [ | |
| DeepGSR | CNN + ANN | To recognise various types of genomic signals and regions (GSRs) in genomic DNA (e.g. splice sites and stop codon) | FASTA /.txt | [ | |
| SpliceAI | CNN | To identify splice function from pre-mRNA sequencing | VCF / VCF | [ | |
| SpliceRover | CNN | For splice site prediction | FASTA /.txt | N/A | [ |
| Splice2Deep | CNN | For splice site prediction in Genomic DNA | FASTA /.txt | [ | |
| DeepBind | CNN | To characterise DNA- and RNA-binding protein specificity | FASTA /.txt | [ | |
| Gene2vec | NLP | To produce a representation of genes distribution and predict gene–gene interaction | .txt /.txt | [ | |
| MPRA-DragoNN | CNN | To predict and analyse the regulatory DNA sequences and non-coding genetic variants | N/A | [ | |
| BiRen | CNN + GRU + RNN | For enhancers predictions | BED, BigWig /CSV | [ | |
| APARENT (APA REgression NeT) | CNN | To predict and engineer the human 3' UTR Alternative Polyadenylation (APA) and annotate pathogenetic variants | FASTA / CSV | [ | |
| LaBranchoR (LSTM Branchpoint Retriever) | BLSTM | To predict the location of RNA splicing branchpoint | FASTA / FASTA | [ | |
| COSSMO | CNN, BLSTM + ResNet | To predict the splice site sequencing and splice factors | TSV, CSV /CSV | [ | |
| Xpresso | CNN | To predict gene expression levels from genomic sequence | FASTA /.txt | [ | |
| DeepLoc | CNN + BLSTM | To predict subcellular localisation of protein from sequencing data | FASTA/ prediction score | [ | |
| SPOT-RNA | CNN | To predict RNA Secondary Structure | FASTA /.bpseq,.ct, and.prob | [ | |
| DeepCLIP | CNN + BLSTM | For predicting the effect of mutations on protein–RNA binding | FASTA /.txt | [ | |
| DECRES (DEep learning for identifying Cis-Regulatory ElementS) | MLP + CNN | To predict active enhancers and promoters across the human genome | FASTA /.txt | [ | |
| DeepChrome | CNN | For prediction of gene expression levels from histone modification data | Bam / TSV | [ | |
| DARTS | DNN + BHT | Deep learning augmented RNA-seq analysis of transcript splicing | .txt |
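
Many of the tools in the table above are CNN-based and take DNA sequence (FASTA) as input. A common way for such models to consume sequence is one-hot encoding followed by filters that slide along the sequence. The sketch below shows that idea in plain Python under stated assumptions: the sequence and the single "TATA" filter are invented, the filter is hand-set rather than learned, and none of this is code from the listed tools.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode DNA as a list of 4-element rows (A, C, G, T).

    Bases outside ACGT (for example N) become all-zero rows.
    """
    return [
        [1.0 if base == b else 0.0 for b in BASES]
        for base in sequence.upper()
    ]


def scan(encoded, kernel):
    """Slide a (width x 4) filter along the sequence.

    Returns one score per window, as in a 1D convolution without padding.
    """
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(width)
            for j in range(len(BASES))
        )
        scores.append(score)
    return scores


# Hypothetical hand-set filter that responds most strongly to "TATA".
# In a trained CNN these weights would be learned from data.
TATA_KERNEL = one_hot("TATA")

scores = scan(one_hot("GGTATAAC"), TATA_KERNEL)
best_start = max(range(len(scores)), key=scores.__getitem__)
print(scores, best_start)
```

The highest score marks the window where the sequence matches the filter exactly; a real model stacks many learned filters and further layers on top of this step.
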
Genomic tools/algorithms based on deep learning architectures for epigenomics
| Tools | DL model | Application | Input/Output | Website Code Source | References |
|---|---|---|---|---|---|
| DeepSEA | CNN | To predict multiple chromatin effects of DNA sequence alterations | N/A | [ | |
| FactorNet | CNN + RNN | To predict cell-type-specific transcription factor (TF) binding | BED / BED, gzipped bedgraph file | [ | |
| DeMo (Deep Motif Dashboard) | CNN + RNN | For transcription factor binding site (TFBS) prediction as a classification task | FASTA / txt | [ | |
| DeepCpG | CNN + GRU | To predict the methylation states from single-cell data | TSV / TSV | [ | |
| DeepHistone | CNN | To accurately predict histone modification sites based on sequences and DNase-Seq (experimental) data | txt, CSV / CSV | [ | |
| DeepTACT | CNN | To predict 3D chromatin interactions | CSV / CSV | [ | |
| Basenji | CNN | To predict cell-type-specific epigenetic and transcriptional profiles in large mammalian genomes | FASTA / VCF | [ | |
| Deopen | CNN | To predict chromatin accessibility from DNA sequence; downstream analysis also includes QTL analysis | BED, hkl /hkl | [ | |
| DeepFIGV (Deep Functional Interpretation of Genetic Variants) | CNN | To predict impact on chromatin accessibility and histone modification | FASTA / TSV | [ |
Genomic tools/algorithms based on deep learning architectures for pharmacogenomics
| Tools | Function | DL model | Application | Input/Output | Website Code Source | References |
|---|---|---|---|---|---|---|
| DeepDR | Drug Repositioning | DNN | To translate pharmacogenomics features identified from in vitro drug screening to predict the response of tumours | txt / txt | [ | |
| DNN-DTI (Drug–target interaction prediction) | Database | DNN | To predict drug-target interaction | txt / txt | [ | |
| DeepBL | Antibiotic Resistance | CNN | To predict the beta-lactamase (BLs) using protein or genome sequence datasets | FASTA / CSV | [ | |
| DeepDrug3D | Binding Site for drugs | CNN | To characterise and classify the protein 3D binding pockets | pdb / txt | [ | |
| DrugCell | Drug response and synergy for cancer cells | CNN | To predict drug response and synergy | txt / txt | [ | |
| DeepSynergy | Anticancer drug synergy | FNN | To predict anticancer drug synergy | CSV / CSV | [ |
Deep learning packages and resources
| Resource Name | Category | Application | Date created | Link | Free/paid |
|---|---|---|---|---|---|
| Janggu<sup>a</sup> | Python package | Facilitates deep learning in the context of genomics | 2020 | | Free |
| ExPecto<sup>a</sup> | Python-based repository | Contains code for predicting expression effects of human genome variants ab initio from sequence | 2018 | | Free |
| Selene<sup>a</sup> | PyTorch-based library | A library for biological sequence data training and model architecture development | 2019 | | Free |
| Pysster<sup>a</sup> | TensorFlow-based library | Used for learning sequence and structure motifs in biological sequences using convolutional neural networks | 2018 | | Free |
| Kipoi<sup>a</sup> | Python package | Kipoi is an API and a repository of ready-to-use trained models for genomics | 2019 | | Free |
| Google Colaboratory (Colab) | PnP GPUs | Colab allows anybody to write and execute arbitrary Python code through the browser, and is especially well suited to machine learning, data analysis and education | 2017 | | Free |
| IBM Cloud | Cloud service | Cloud computing platform; design complex neural networks, then experiment at scale to deploy optimised learning models within IBM Watson Studio | 2011 | | Free tier / Cost tier |
| Google CloudML | PnP GPUs | For extreme scalability in the long run | 2008 | | Paid |
| Vertex AI | AI platform | Google Cloud’s new unified ML platform | 2021 | | |
| Amazon EC2 | Cloud service | A web service that delivers secure, scalable compute power in the cloud | 2006 | | Free / Paid |
<sup>a</sup> These deep learning libraries/packages are specific to genomic applications