Jan Zrimec, Filip Buric, Mariia Kokina, Victor Garcia, Aleksej Zelezniak.
Abstract
Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states, and mRNA and protein levels. Deep neural networks automatically learn informative sequence representations, and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then on specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology.
Keywords: chromatin accessibility; cis-regulatory grammar; deep neural networks; gene expression prediction; gene regulatory structure; mRNA & protein abundance; machine learning; regulatory genomics
Year: 2021 PMID: 34179082 PMCID: PMC8223075 DOI: 10.3389/fmolb.2021.673363
Source DB: PubMed Journal: Front Mol Biosci ISSN: 2296-889X
FIGURE 1 Principles of gene expression. (A) Protein-DNA interactions in prokaryotic nucleoid and eukaryotic chromosome structure, epigenetics and transcription initiation. The basic repeating structural unit of chromatin is the nucleosome, which contains eight histone proteins. Bacterial nucleoid-associated proteins are the main regulators of nucleoid structure, where the circular genome is supercoiled and uncoiled by these proteins. In cells, genes are switched on and off based on the need for their products in response to cellular and environmental signals. This is regulated predominantly at the level of transcription initiation, where chromatin and nucleoid structure open and close, controlling the accessibility of DNA and defining areas with high amounts of transcription (factories) upon demand. (B) Depiction of eukaryotic transcription across the gene regulatory structure that includes coding and non-coding regulatory regions. The open reading frame (ORF) carries the coding sequence, constructed in the process of splicing by joining exons and removing introns. Each region carries specific regulatory signals, including transcription factor binding sites (TFBS) in enhancers, core promoter elements in promoters, Kozak sequence in 5′ untranslated regions (UTRs), codon usage bias of coding regions and multiple termination signals in 3′ UTRs and terminators, which are common predictive features in ML (highlighted bold). RNAP denotes RNA polymerase, mRNA messenger RNA. (C) Depiction of eukaryotic translation across the mRNA regulatory structure, where initiation involves the 5′ cap, Kozak sequence and secondary structures in the 5′ UTR. Codon usage bias affects elongation, whereas RNA-binding protein (RBP) sites, microRNA (miRNA) response elements and alternative polyadenylation in the 3′ UTR affect post-translational processing and final expression levels. These regulatory elements are common predictive features in ML (highlighted bold).
FIGURE 2 Principles of machine learning from nucleotide sequence. (A) Flowcharts of a typical supervised shallow modeling approach (top) and a typical supervised deep modeling approach (bottom), depicting a one-hot encoding that equals k-mer embedding with k = 1. (B) Overview of convolutional (CNN) and recurrent neural networks (RNN) in interpreting DNA regulatory grammar. A CNN scans a set of motif detectors (kernels) of a specified size across an encoded input sequence, learning motif properties such as specificity, orientation and co-association. An RNN scans the encoded sequence one nucleotide at a time, learning sequential motif properties such as multiplicity, distance from e.g. transcription start site and the relative order of motifs. (C) Interpreting shallow models (top) by evaluating their performance when trained on different feature sets can yield feature importance scores, motifs and motif interactions, as well as compositional and structural properties. Similarly, interpreting the regulatory grammar learned by deep models (bottom), by e.g. perturbing the input, visualizing kernels or using gradient-based methods, can yield feature importance scores spanning nucleotides up to whole regions, as well as motifs and motif interactions. (D) Example of a typical deep neural network (DNN) comprising three separate convolutional layers (Conv) connected via pooling layers (Pool) and a final fully connected network (FC) producing the output gene expression levels. Pool stages compute the maximum or average of each motif detector’s rectified response across the sequence, where maximizing helps to identify the presence of longer motifs and averaging helps to identify cumulative effects of short motifs. The DNN learns distributed motif representations in the initial Conv layers and motif associations that have a joint effect on predicting the target in the final Conv layer, representing DNA regulatory grammar that is mapped to gene expression levels.
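The encoding and convolution steps in panels (A, B, D) can be sketched in plain numpy. The sketch below is illustrative only: the `TATA` motif detector is a hand-written stand-in for a learned kernel, and the max over the activation profile corresponds to the max-pooling stage described in panel (D).

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """One-hot encode a DNA string into a (len, 4) matrix (k-mer embedding with k = 1)."""
    idx = {b: i for i, b in enumerate(BASES)}
    mat = np.zeros((len(seq), 4))
    for pos, base in enumerate(seq):
        mat[pos, idx[base]] = 1.0
    return mat

def conv_scan(encoded, kernel):
    """Slide a motif detector (kernel, shape (w, 4)) across the encoded sequence,
    returning one activation per window, as in a CNN's first Conv layer."""
    w = kernel.shape[0]
    return np.array([np.sum(encoded[i:i + w] * kernel)
                     for i in range(encoded.shape[0] - w + 1)])

# Illustrative hand-written kernel: a perfect-match detector for the motif "TATA".
kernel = one_hot("TATA")
seq = "GGCTATAAGC"
activations = conv_scan(one_hot(seq), kernel)
# Max pooling over the activations reports whether (and how strongly) the motif
# occurs anywhere; average pooling would report cumulative effects of short motifs.
print(activations.argmax(), activations.max())  # strongest match position and score
```

In a trained CNN the kernel weights are learned rather than written by hand, and many kernels are scanned in parallel; the mechanics are otherwise the same.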
Overview of studies modeling protein-DNA interactions that govern the initiation of gene expression from nucleotide sequence properties. Highest achieved or average scores are reported, on test sets where applicable, and include precision (prec) and recall (rec), area under the receiver operating characteristic curve (AUC), area under the precision-recall curve (AUPRC), the coefficient of determination (R²), Pearson’s correlation coefficient (r), Spearman’s correlation coefficient (ρ) and Matthews correlation coefficient (MCC).
| Ref. | Strategy | Target var. | Explan. vars. | Method | Score | Organism |
|---|---|---|---|---|---|---|
| ( | Shallow | TFBS prediction | PWMs | PWM alignment algorithms | | Human |
| ( | Shallow | TFBS prediction | DNA motif and chromatin-based features | Classifier ensembles | AUPRC = 0.81 | Human |
| ( | Shallow | TFBS prediction | PWMs, gapped k-mers | SVM classification | AUC = 0.98 | Human |
| ( | Shallow | TF binding specificity | k-mers, DNA structural variables | L1-regularized LR | | Yeast |
| ( | Shallow | TF binding specificity | k-mers, DNA structural variables | L2-regularized multiple LR | | Human |
| ( | Shallow | TF binding specificity | k-mers | OLS regression | | Human |
| ( | Shallow | σ54 promoter prediction | Pseudo k-tuple nucleotide composition | SVM classification | MCC = 0.88 | |
| ( | Shallow | σ70 promoter prediction | Trinucleotide-based features | SVM classification | MCC = 0.92 | |
| ( | Shallow | Histone modifications | k-mers, TF ChIP-seq data | LR classification | AUC = 0.95 | Human |
| ( | Shallow | Histone modifications, DNA methylation | DNA motifs | RF classification | AUC = 0.96 | Human |
| ( | Shallow | DNA chromatin accessibility | PWMs, gapped k-mers | SVM classification | AUC = 0.75 | Human |
| ( | Deep | TFBS prediction | k-mers | CNN + biLSTM classification | AUC = 0.93 | Human |
| ( | Deep | TFBS prediction | DNA sequence | CNN classification | AUC = 0.88 | Human |
| ( | Deep | TFBS prediction | DNA sequence | CNN classification | AUC = 0.82 | Human, mouse |
| ( | Deep | TFBS prediction | DNA sequence | CNN + biLSTM + attention classification | AUC = 0.99 | Human |
| ( | Deep | TF binding specificity | DNA sequence | CNN classification | AUC = 0.90 | Human |
| ( | Deep | TF binding specificity | DNA sequence | CNN regression | | Human |
| ( | Deep | TF binding specificity | DNA sequence | CNN regression | | Human |
| ( | Deep | Transcription initiation frequency | DNA sequence | CNN ordinal regression | | |
| ( | Deep | Multitask chromatin profiling data | DNA sequence | CNN classification | AUC = 0.96 | Human |
| ( | Deep | Multitask chromatin profiling data | DNA sequence | CNN + biLSTM classification | AUC = 0.97 | Human |
| ( | Deep | Multitask chromatin profiling data | DNA sequence | CNN + biLSTM + attention classification | AUC = 0.95 | Human |
| ( | Deep | Histone modifications | DNA sequence | CNN classification | AUC = 0.80 | Human |
| ( | Deep | Histone modifications | DNA sequence | LSTM + attention classification | AUC = 0.81 | Human |
| ( | Deep | DNA chromatin accessibility | DNA sequence | CNN classification | AUC = 0.90 | Human |
| ( | Deep | DNA chromatin accessibility | DNA sequence | CNN regression |
| Human |
| ( | Deep | DNA chromatin accessibility | DNA sequence | CNN + attention regression |
| Human |
| ( | Deep | DNA methylation | DNA sequence and features | CNN classification | AUC = 0.83 | Human |
| ( | Deep | DNA methylation | DNA sequence | CNN regression | AUC = 0.97 | Human |
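The scores collected in the table above follow standard definitions. As a reference, a minimal numpy sketch of two of them, MCC and the rank-statistic form of AUC, applied to toy binding labels and prediction scores (both are illustrative, not data from any cited study):

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels, as reported for
    the promoter-prediction studies above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def auc(y_true, scores):
    """AUC via its rank-statistic form: the probability that a randomly chosen
    positive is scored above a randomly chosen negative (ties count half)."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 1, 0]            # toy binding / non-binding labels
scores = [0.9, 0.7, 0.4, 0.2, 0.6, 0.8]
print(auc(labels, scores))              # ≈ 0.778 for this toy example
print(mcc(labels, [s > 0.5 for s in scores]))
```

Production code would normally use a library implementation (e.g. scikit-learn's `roc_auc_score` and `matthews_corrcoef`); the point here is only the definition behind the numbers in the table.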
Overview of studies modeling gene expression-related properties from separate regulatory or coding regions. Highest achieved or average scores are reported, on test sets where applicable, and include accuracy (acc), area under the receiver operating characteristic curve (AUC), area under the precision-recall curve (AUPRC), the coefficient of determination (R²) and Pearson's correlation coefficient (r).
| Ref. | Strategy | Region | Target var. | Explan. vars. | Method | Score | Organism |
|---|---|---|---|---|---|---|---|
| ( | Shallow | Coding | Splice site prediction | Sequence and PWM features | Logistic regression | | Human |
| ( | Shallow | Coding | Branch point prediction | Sequence features | Gradient boosting classification | AUC = 0.94 | Human |
| ( | Shallow | Coding | Branch point prediction | Sequence features | Mixture models classification | AUC = 0.82 | Human |
| ( | Shallow | Coding | Protein abundance | Codon usage features | COSEM mathematical model | | |
| ( | Shallow | Coding | Protein abundance | Codon usage | AdaBoost regression | | Yeast |
| ( | Shallow | Coding | Ribosome density at each codon | Codon usage | NN regression | | Yeast |
| ( | Deep | Coding | Splice site prediction | DNA sequence | CNN classification | AUPRC = 0.61 | Human, |
| ( | Deep | Coding | Splice site prediction | DNA sequence | CNN classification | AUC = 0.98 | Human |
| ( | Deep | Coding | Splice site prediction | DNA sequence | CNN classification | AUPRC = 0.98 | Human |
| ( | Deep | Coding | Branch point prediction | DNA sequence | biLSTM classification | AUC = 0.71 | Human |
| ( | Deep | Coding | Branch point prediction | DNA sequence | biLSTM + CNN classification | AUC = 0.81 | Human |
| ( | Deep | Coding | Alternative splicing prediction | Sequence and epigenetic features | Dense DNN classification | AUPRC = 0.89 | Human |
| ( | Deep | Coding | Alternative splicing prediction | Sequence and epigenetic features | RNN classification | AUPRC = 0.8 | Human |
| ( | Deep | Coding | Alternative splicing prediction | RNA-seq data | Dense DNN + Bayesian hypothesis testing | AUC = 0.87 | Human |
| ( | Deep | Coding | Protein abundance | DNA sequence | Multilayer biLSTM regression | | |
| ( | Deep | Coding | Optimal codon usage | DNA sequence | biLSTM encoder-decoder | | |
| ( | Deep | Coding | Transcript abundance | DNA sequence | biLSTM transducer | | |
| ( | Shallow | Enhancer | Transcript abundance | Motifs and pairwise motif interactions | L1-regularized LR | | Human |
| ( | Shallow | Enhancer | Enhancer prediction | k-mers | SVM classification | AUC = 0.93 | Human |
| ( | Deep | Enhancer | Enhancer prediction | DNA sequence | CNN classification | AUPRC = 0.92 | Human |
| ( | Deep | Enhancer | Enhancer prediction | DNA sequence | CNN classification | AUC = 0.92 | 17 mammalian species including human |
| ( | Deep | Enhancer | Transcript abundance | DNA sequence | CNN regression | AUC = 0.92 | Human |
| ( | Deep | Enhancer | Multitask regulatory properties | DNA sequence | Deep residual NN classification | AUPRC = 0.98 | Human |
| ( | Shallow | Promoter | Core promoter activity via reporter fluorescence | k-mers | LR | | Yeast |
| ( | Shallow | Promoter | mRNA abundance | σ factor binding sites | NN regression | | |
| ( | Shallow | Promoter | mRNA abundance | σ factor binding sites | Biophysical model | | |
| ( | Shallow | Promoter | Protein abundance | TF binding and sequence features | L2-regularized multiple LR | | Yeast |
| ( | Shallow | Promoter | mRNA abundance | TF binding and sequence features | L1-regularized multiple LR | | |
| ( | Deep | Promoter | Transcription initiation rate | DNA sequence | CNN regression | | |
| ( | Deep | Promoter | Protein abundance | DNA sequence | CNN regression | | Yeast |
| ( | Shallow | 5′ UTR | Protein levels | DNA sequence features + k-mers | LR | | Yeast |
| ( | Shallow | 5′ UTR | Protein abundance | RBS features | RF regression | | |
| ( | Shallow | 5′ UTR | Protein abundance | RBS features | Thermodynamic model, LR | | |
| ( | Shallow | 5′ UTR | Translation initiation rate | N-terminal mRNA structures | Biophysical model, LR | | |
| ( | Shallow | 5′ UTR | Protein abundance | DNA sequence activity relationships | Partial least-squares (PLS) regression | | Yeast |
| ( | Shallow | 5′ UTR | Translation initiation rate | DNA sequence features | PLS regression | | Yeast |
| ( | Deep | 5′ UTR | Protein abundance | DNA sequence | CNN regression | | Yeast |
| ( | Deep | 5′ UTR | Mean ribosome load | DNA sequence | CNN regression | | Human |
| ( | Shallow | 3′ UTR, terminator | Protein abundance | Nucleosome occupancy score | Weighted LR | | Yeast |
| ( | Shallow | Terminator | Termination efficiency | DNA sequence features (12) | Multiple LR | | |
| ( | Shallow | 3′ UTR, terminator | mRNA abundance | k-mers | L1-regularized logistic regression | | Yeast |
| ( | Deep | 3′ UTR | Alternative polyadenylation signals | DNA sequence | CNN regression | | Human |
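Codon usage bias recurs as a predictive feature throughout the coding-region studies above. A minimal sketch of the simplest such feature set, the relative codon frequencies of an ORF (the toy ORF below is illustrative, chosen to show synonymous-codon choice for leucine):

```python
from collections import Counter

def codon_usage(orf):
    """Relative codon frequencies of an ORF; a basic form of the codon-usage
    features used by the shallow coding-region models above."""
    assert len(orf) % 3 == 0, "ORF length must be a multiple of 3"
    codons = [orf[i:i + 3] for i in range(0, len(orf), 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: n / total for codon, n in counts.items()}

# Toy ORF: start codon, two CTG and one TTA leucine codons, stop codon.
usage = codon_usage("ATGCTGCTGTTATAA")
print(usage)
```

Real codon-usage features are usually normalized within synonymous groups (e.g. relative synonymous codon usage or CAI) and computed over a reference gene set; the per-codon frequencies above are the raw ingredient.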
FIGURE 3 Quantifying gene expression and interpreting its regulatory grammar with machine learning. (A) Recently identified DNA regulatory elements predictive of mRNA abundance that expand the base knowledge depicted in Figure 1B. These include motif associations (Zrimec et al., 2020) (red), structural motifs (e.g. DNA shape, blue) (Zhou et al., 2015; Yang et al., 2017), weak interactions (de Boer et al., 2020) (green), nucleotides upstream of the Kozak sequence (Li et al., 2017a) (yellow), CpG dinucleotides (Agarwal and Shendure, 2020) (gray) and mRNA stability features (Neymotin et al., 2016; Cheng et al., 2017; Agarwal and Shendure, 2020; Zrimec et al., 2020) (dashed line, see text for details) identified in specific regions or across the whole gene regulatory structure. The table specifies the variation of mRNA abundance explained by DNA sequence and features using deep learning (Zrimec et al., 2020). Note that with alternative approaches, higher predictive values were obtained for certain regions in Table 2. (B) mRNA regulatory elements recently found to be predictive of protein abundance apart from features depicted in Figure 1C. These include specific motifs found across all regions (Li et al., 2019a; Eraslan et al., 2019b) (red), upstream ORFs (Vogel et al., 2010; Li et al., 2019a) and AUGs (Neymotin et al., 2016; Li et al., 2019a) (blue), AA composition (Vogel et al., 2010; Guimaraes et al., 2014) and post-translational modifications (PTMs) (Eraslan et al., 2019b) (gray) as well as lengths and GC content of all regions (Neymotin et al., 2016; Cheng et al., 2017; Li et al., 2019a) (dashed line). The table specifies the variation of protein abundance explained by mRNA levels and translational elements, using comparable shallow approaches in E. coli (Guimaraes et al., 2014), S. cerevisiae (Lahtvee et al., 2017) and H. sapiens (Vogel et al., 2010). Note that with alternative approaches, higher values were obtained for certain regions in Table 2.
(C) Quantifying the central dogma of molecular biology with variance explained by mapping DNA to mRNA levels (Agarwal and Shendure, 2020; Zrimec et al., 2020) and mRNA levels to protein abundance (Vogel et al., 2010; Guimaraes et al., 2014; Lahtvee et al., 2017), using deep and shallow learning, respectively. Note that highly different modeling approaches were used.
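The variance-explained figures in panel (C) are R² values. A minimal sketch of the computation, with hypothetical measured and predicted expression levels standing in for real data:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: the fraction of variance in the measured
    values explained by the predictions (the R² reported in the figure)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)   # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical toy example: log-scale mRNA levels vs. model predictions.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(r_squared(measured, predicted))
```

Note that R² computed this way can be negative for a model worse than predicting the mean, which is why reported scores should always state the evaluation set.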
Overview of studies modeling transcript and protein-abundance related properties from combined regulatory and coding regions. Highest achieved or average scores are reported, on test sets where applicable, and include area under the receiver operating characteristic curve (AUC), area under the precision-recall curve (AUPRC), the coefficient of determination (R²) and Spearman's correlation coefficient (ρ).
| Ref. | Strategy | Region | Target var. | Explan. vars. | Method | Score | Organism |
|---|---|---|---|---|---|---|---|
| ( | Shallow | Promoter, coding | mRNA abundance | DNA sequence features | LR | | Yeast |
| ( | Shallow | mRNA transcript | mRNA stability (degradation rates) | mRNA features | Multiple LR | | Yeast |
| ( | Shallow | mRNA transcript | mRNA stability (half-life) | mRNA features | Multivariate LR | | Yeast |
| ( | Deep | Whole gene regulatory structure | mRNA abundance | DNA sequence | CNN + L2-regularized LR | AUC = 0.82 | Human |
| ( | Deep | Whole gene regulatory structure | mRNA abundance | DNA sequence and features | CNN regression | | Yeast, |
| ( | Deep | Promoter, coding | mRNA abundance | DNA sequence and features | CNN regression | | Human |
| ( | Deep | Whole gene regulatory structure | mRNA abundance | DNA sequence | ResNet regression | | Human |
| ( | Shallow | mRNA transcript | Protein abundance | mRNA features | PLS regression | | |
| ( | Shallow | mRNA transcript | Protein abundance | mRNA features | MARS nonlinear regression | | Yeast |
| ( | Shallow | mRNA transcript | Protein abundance | mRNA features | MARS nonlinear regression | | Human |
| ( | Shallow | mRNA transcript | Translation rates | mRNA features | Multivariate LR | | Yeast, human |
| ( | Shallow | mRNA transcript | Protein abundance | mRNA features of translation initiation | RF regression | | |
| ( | Shallow | mRNA transcript | Translation rates | mRNA features | Bayesian model | | Yeast |
| ( | Shallow | mRNA transcript | Protein-to-RNA ratio | mRNA sequence and features | Multivariate LR | | Human |
| ( | Deep | mRNA transcript | Translation initiation sites | mRNA sequence | CNN + RNN classification | AUPRC = 0.62 | Human |
| ( | Deep | mRNA transcript | Translation elongation dynamics | mRNA sequence | CNN classification | AUC = 0.88, 0.83, respectively | Yeast, human |
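Several shallow entries above pair hand-crafted sequence features with regularized linear regression, the workflow depicted in Figure 2A. A minimal sketch of that pipeline, k-mer counting followed by closed-form ridge regression (L2-regularized LR), on hypothetical toy sequences and expression values:

```python
import numpy as np
from itertools import product

def kmer_features(seq, k=2):
    """Count-based k-mer feature vector, a typical shallow-model input above."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    vec = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        vec[index[seq[i:i + k]]] += 1
    return vec

def ridge_fit(X, y, lam=1.0):
    """L2-regularized LR in closed form: w = (XᵀX + λI)⁻¹ Xᵀy."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Hypothetical toy data, not from any cited study.
seqs = ["ATATATAT", "GCGCGCGC", "ATGCATGC", "GGGGCCCC"]
y = np.array([0.1, 0.9, 0.5, 0.8])
X = np.stack([kmer_features(s) for s in seqs])
w = ridge_fit(X, y, lam=0.5)
print(X.shape, w.shape)   # (4, 16) feature matrix, 16 learned weights
```

The published models add feature selection, cross-validation and far richer feature sets (structural variables, TF binding, mRNA folding energies), but they share this basic shape: explicit features in, a regularized linear map out.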
The current advantages, disadvantages and further challenges of machine learning methods in genetics and genomics.
| | Deep methods | Shallow methods |
|---|---|---|
| Advantages | Lower entry barrier to develop new models and save research time by abstracting mathematical details ( | |
| | Scale effectively with data and support use of latest computational and technological advances, including large genomic datasets and results of HTS technologies ( | Classic statistical models are better characterized mathematically and some ML algorithms are easier to understand and explain ( |
| | Ability to automatically learn features from raw input data and unlock an additional level of information from it ( | Less computationally expensive and faster to train, enabling more iterations and testing of different techniques in a shorter period of time |
| | Ability to learn and approximate complex functions without prior assumptions, frequently achieving improved predictive power ( | Possibility to train on much smaller datasets (e.g. hundreds of examples vs. thousands or more with deep learning) ( |
| | Capability to integrate multiple pre-processing steps into a single end-to-end model ( | Can be easier to interpret due to inherently interpretable structure and direct feature engineering/selection ( |
| | Ability to effectively model multimodal data ( | Usually a small number of hyperparameters ( |
| | Highly useful as experiment simulators due to the ability to generalize over an experimental dataset ( | Useful for proof-of-principle and initial model or parameter testing using only numerical variables |
| | Easily adaptable to different domains and applications, with transfer learning on pre-trained deep networks accelerating training and improving performance | — |
| Disadvantages | Dependence on accurately labeled data: cannot achieve higher accuracy than that allowed by the noise inherent in the given experimental target labels ( | |
| | Data-driven rather than hypothesis-driven modeling ( | |
| | Dependence on large amounts of data (at least thousands of training examples) and specialized computational resources (e.g. GPUs) | Dependence on feature engineering |
| | Potential problems with generalizability, as models can be overfit to the experiment rather than the biological function ( | Many different algorithms, each with its own advantages and disadvantages, can be daunting and require extensive specialized study ( |
| | Potential lack of model interpretability ( | Cannot unlock information directly from nucleotide sequence ( |
| Challenges | Methods to interpret heterogeneous multi-omic and highly dimensional data ( | |
| | Methods and high-quality datasets to benchmark existing and new interpretation strategies ( | |
| | Methods to join findings from multiple interpretation strategies into more complete and coherent interpretations of both models ( | |
| | Making interpretable ML more accessible to biologists by further lowering the entry barriers and requirements of computational knowledge ( | |
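One interpretation strategy named in Figure 2C, perturbing the input, can be sketched independently of any particular model. Below, a toy stand-in model (a `TATA` motif counter) replaces a trained network; in practice `score_fn` would be the model's prediction function, and the resulting effect matrix is what in silico saturation mutagenesis plots visualize:

```python
import numpy as np

BASES = "ACGT"

def saturation_mutagenesis(seq, score_fn):
    """In silico saturation mutagenesis: re-score every single-nucleotide
    substitution and report the change from the wild-type score."""
    base_score = score_fn(seq)
    effects = np.zeros((len(seq), len(BASES)))
    for pos in range(len(seq)):
        for j, base in enumerate(BASES):
            if base != seq[pos]:
                mutant = seq[:pos] + base + seq[pos + 1:]
                effects[pos, j] = score_fn(mutant) - base_score
    return effects

# Toy stand-in for a trained model: counts occurrences of the motif "TATA".
def toy_model(seq):
    return sum(seq[i:i + 4] == "TATA" for i in range(len(seq) - 3))

effects = saturation_mutagenesis("GGTATAGG", toy_model)
# Positions inside the motif show negative effects when mutated away from it,
# exposing the "regulatory grammar" the scoring function has captured.
print(effects.min())
```

This brute-force approach needs one forward pass per substitution (length × 3 evaluations); gradient-based attribution methods mentioned in Figure 2C approximate the same effect map in a single backward pass.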