Abstract
Mendel proposed an experimentally verifiable paradigm of particle-based heredity that has been influential for over 150 years. The historical arguments have resurfaced in the recent past as Mendel's concept has been diversified by new types of omics data. As omics data accumulate, a virtual gene concept is forming, giving rise to genetical data science. The concept integrates genetical, functional, and molecular features of the Mendelian paradigm. I argue that the virtual gene concept should be deployed pragmatically. Indeed, the concept has already inspired a practical research program related to systems genetics. The program includes questions about the functionality of structural and categorical gene variants, the regulation of gene expression, and the roles of epigenetic modifications. Its methodology includes bioinformatics, machine learning, and deep learning. Education, funding, careers, standards, benchmarks, and tools to monitor research progress should be provided to support the research program.
Keywords: bioinformatics; computational biology; data science; experimentalism; gene concept; genomics; molecular biology; scientific method; technology; virtualization
Year: 2021 PMID: 35052043 PMCID: PMC8774939 DOI: 10.3390/e24010017
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Examples of deep learning applied to the problem of prediction of gene expression.
| Reference | Main Result | Biological Interpretation of the Model | Data Inputs | Model Output | Other Points |
|---|---|---|---|---|---|
| Rajpurkar et al. [ | Convolutional neural networks predict gene expression better than dense neural networks or a random forest. Blanking could reveal important motifs (for example, enhancers and silencers). | Chromatin architecture predicts gene expression. However, the effect is diffuse, extending beyond sequence-identifiable motifs such as promoters or enhancers. | Optical reconstruction of chromatin architecture (ORCA) of the Bithorax gene cluster in | ON or OFF binary prediction of expression. | This innovative approach builds on the strength of a novel dataset. |
| Zrimec et al. [ | Up to 82% of the variation of transcript levels could be predicted from DNA sequences. | Both coding and cis-regulatory regions contribute to prediction of gene expression. | DNA sequences of proximal promoters *, plus 64 codon frequencies from coding regions, and eight mRNA stability variables. | Expression levels recoded as transcripts per million. | Motif interactions were key for the control of the dynamic range of gene expression. |
| Singh et al. [ | A model derived from histone marks predicts expression better than traditional machine learning. | Histone marks correlate with expression, although it is unclear which marks are causative. | Histone marks from 56 different cell types [ | Binary | Complex interactions of chromatin features could be detected and visualized for intuitive interpretation. |
| Cuperus et al. [ | A CNN trained on random 50 bp 5′-UTRs can predict the expression of a reporter gene from both artificial and native UTRs. | Alternative 5′-UTRs confer different mRNA stability or translational efficiency. | Nucleotide sequence of the 5′-UTR. | Scalar score for each UTR (proportional to protein expression). | Shorter UTRs did not work as well. |
* One-hot encoding is a standard approach to sequence representation in deep learning.
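
The one-hot encoding mentioned in the footnote can be sketched as follows. This is a minimal illustration, not the encoding code of any study in the table. The function name `one_hot_encode`, the A/C/G/T column order, and the all-zero row for ambiguous bases such as N are assumptions made here.

```python
# Assumed column order; the studies in the table may use a different one.
ALPHABET = "ACGT"


def one_hot_encode(sequence):
    """Return one row per base, each row a 4-element list with a single 1.

    Bases outside ALPHABET (for example N) become an all-zero row,
    which is an assumption made for this sketch.
    """
    rows = []
    for base in sequence.upper():
        row = [0] * len(ALPHABET)
        index = ALPHABET.find(base)  # -1 when the base is not A, C, G, or T
        if index >= 0:
            row[index] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Print each base next to its encoded row.
    for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
        print(base, row)
```

A sequence of length L becomes an L x 4 matrix, which is a shape that a convolutional layer can scan for motifs.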
Examples of deep learning applied to structural annotation of virtual genes.
| Reference | Main Result | Biological Interpretation of the Model | Data Inputs | Model Output | Other Points |
|---|---|---|---|---|---|
| Oubounyt et al. [ | The prediction method improves performance over comparable approaches (fewer false positives). The improvement is attributed to a novel negative learning set. | Short eukaryotic promoter sequences are sufficient to predict both TATA and non-TATA promoters in both human and mouse. | Genomic sequence from −249 to +50 bp relative to the TSS *. | A real-valued promoter score. | Impact of mutations on output scores was also studied (150 substitutions on the interval from −40 to +10 bp relative to the TSS). |
| Kelley et al. [ | The method uses chromatin accessibility to predict gene promoters with high accuracy. | Promoters and transcription factor binding motifs could be predicted, but the method was developed to annotate point mutations. | DNase-seq mapping of accessible genomic sites in 164 cell types (encoded as a binary vector of length 164), plus a DNA sequence of 600 bp *. | Probability value for chromatin accessibility. | Every mutation in the genome could be annotated with respect to its impact on chromatin accessibility. |
| Feng et al. [ | A deep learning model can predict Pol II pausing events. (An attention layer and data integration provide good interpretability.) | The pausing events also provide insights into alternative splicing, TF binding sites, and epigenetic modifications. | 200-50-bp DNA sequence * integrated with ChIP-seq and epigenetic data via an attention layer. | Probability value for Pol II pausing. | Strongest sequence determinants were typically −14 to 12 bp around the pausing sites. The model was relatively interpretable due to an attention mechanism analogous to DeepHINT [ |
* One-hot encoding is a standard approach to sequence representation in deep learning.
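
Several rows in the table describe scoring the effect of mutations on a model's output. The general idea, often called in silico mutagenesis, can be sketched as below. This is an illustration under stated assumptions, not the procedure of any cited study: `toy_score` is a hypothetical stand-in for a trained model, `substitution_effects` is a name introduced here, and positions are 0-based string indices, not TSS-relative coordinates.

```python
BASES = "ACGT"


def toy_score(sequence):
    """Hypothetical stand-in for a trained model: 1.0 if a TATA motif is present."""
    return 1.0 if "TATA" in sequence else 0.0


def substitution_effects(sequence, score, start, end):
    """Score every single-base substitution at 0-based positions start..end-1.

    Returns a list of (position, reference_base, alternative_base, delta),
    where delta = score(mutant) - score(reference).
    """
    reference_score = score(sequence)
    effects = []
    for position in range(start, end):
        reference_base = sequence[position]
        for alternative in BASES:
            if alternative == reference_base:
                continue  # skip the unmutated base
            mutant = sequence[:position] + alternative + sequence[position + 1:]
            delta = score(mutant) - reference_score
            effects.append((position, reference_base, alternative, delta))
    return effects


if __name__ == "__main__":
    # Report only substitutions that change the toy score.
    for effect in substitution_effects("GGTATAGG", toy_score, 0, 8):
        if effect[3] != 0.0:
            print(effect)
```

With three alternative bases per position, a 50-position window yields 150 substitutions, which is consistent with the count reported for the −40 to +10 interval in the table.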