Nicolae Sapoval, Amirali Aghazadeh, Michael G Nute, Dinler A Antunes, Advait Balaji, Richard Baraniuk, C J Barberan, Ruth Dannenfelser, Chen Dun, Mohammadamin Edrisi, R A Leo Elworth, Bryce Kille, Anastasios Kyrillidis, Luay Nakhleh, Cameron R Wolfe, Zhi Yan, Vicky Yao, Todd J Treangen.
Abstract
Deep Learning (DL) has recently enabled unprecedented advances in one of the grand challenges in computational biology: the half-century-old problem of protein structure prediction. In this paper we discuss recent advances, limitations, and future perspectives of DL on five broad areas: protein structure prediction, protein function prediction, genome engineering, systems biology and data integration, and phylogenetic inference. We discuss each application area and cover the main bottlenecks of DL approaches, such as training data, problem scope, and the ability to leverage existing DL architectures in new contexts. To conclude, we provide a summary of the subject-specific and general challenges for DL across the biosciences.
Year: 2022 PMID: 35365602 PMCID: PMC8976012 DOI: 10.1038/s41467-022-29268-7
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1: An overview of machine learning scenarios and commonly used DL architectures.
The top panel encapsulates the three most common paradigms of machine learning: supervised learning, in which the dataset contains ground truth labels; unsupervised learning, in which the dataset does not contain ground truth labels; and reinforcement learning, in which an algorithmic agent interacts with a real or simulated environment. The bottom panels provide an overview of the most prevalent DL architecture ideas, each designed to achieve the specific goals highlighted. An additional set of short descriptions is provided for other common components of DL architectures mentioned in the manuscript.
Fig. 2: A summary view of the major labeled and unlabeled datasets, and the architectures being used in deep-learning methods in computational biology.
For each of the areas considered in this manuscript, the figure summarizes the estimated sizes of key datasets and databases, as well as their projected growth rates. Additionally, the rightmost column summarizes the most popular DL architectures applied to the corresponding areas in the biosciences.
Impact of Deep Learning on Computational Biology.
Each of the subareas in biosciences considered in this manuscript is assigned a level of success of the DL applications based on the relative performance of DL as compared to other ML and classical methods.
Commonly faced challenges in computational biology and potential solution avenues when using DL.
| Challenge | Experimental/non-DL solution | DL solution |
|---|---|---|
| Biased results | Improve study design | Identify forms and sources of technical bias; Fair AI approaches |
| High infrastructure costs | Optimize code performance; Parallelize code; Sub-sample analyzed data | Optimize DL architecture; Parallelize to low-cost devices; Condense training data (e.g. coresets) |
| Lack of explainability | Statistical analyses | Explainable post-hoc methods |
| Limited training data | Generate and label more data | Data augmentation (e.g. GANs) |
| Overfitting | Regularization | Dropout; Early stopping; Smaller models; Additional training data |
| Poor performance on novel data | Expand databases | Use larger models; Analyze generalization potential |
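Of the overfitting remedies listed in the table, early stopping is perhaps the simplest to illustrate. The following is a minimal sketch, not a method taken from the paper: the function name `early_stopping_epoch`, the `patience` and `min_delta` parameters, and the example loss values are all assumptions chosen for illustration. It scans a sequence of per-epoch validation losses and reports where training would halt once the loss has failed to improve for `patience` consecutive epochs.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a list of validation losses.

    stop_epoch is the index of the epoch at which training would halt,
    or None if the patience budget is never exhausted.
    best_epoch is the index of the lowest validation loss seen so far,
    i.e. the checkpoint one would typically restore.
    All names and defaults here are illustrative assumptions.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Validation loss improved: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            # No improvement: spend one unit of patience.
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch

    return None, best_epoch


# Hypothetical validation curve that starts to rise after epoch 2.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

In practice the same check would run inside the training loop, with the model weights from `best_epoch` restored after stopping.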
Fig. 3: Standard and DL approaches to phylogenetic inference.
The input consists of sequences (DNA sequences in this illustration) obtained from the taxa of interest. Here, the taxa are A, B, C, and D. In standard approaches, such as maximum likelihood and maximum parsimony, a generative model in the form of a tree whose leaves are labeled by the four taxa is inferred. In the recently introduced DL approach to phylogenetic inference, the problem is viewed as a classification task where the network outputs correspond to the three possible tree topologies whose leaves are labeled by the taxa A, B, C, and D.
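The three output classes in this caption correspond to the three unrooted binary topologies on four taxa. The sketch below enumerates them and, for context, counts unrooted binary topologies for larger taxon sets using the standard double-factorial formula (2n − 5)!!. The function names are assumptions made for this example and do not come from the paper or any phylogenetics library.

```python
def quartet_topologies(taxa):
    """Enumerate the three unrooted topologies on exactly four taxa.

    Each topology is represented by the split it induces: the pair
    containing the first taxon, and the pair of remaining taxa.
    """
    if len(taxa) != 4:
        raise ValueError("a quartet needs exactly four taxa")
    first, *others = taxa
    topologies = []
    for partner in others:
        # Pair the first taxon with each of the other three in turn.
        rest = tuple(t for t in others if t != partner)
        topologies.append(((first, partner), rest))
    return topologies


def num_unrooted_topologies(n):
    """Number of unrooted binary tree topologies on n labeled taxa, (2n-5)!!."""
    if n < 3:
        raise ValueError("defined here for n >= 3")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # product of the odd numbers 3, 5, ..., 2n-5
        count *= k
    return count


# Class labels a quartet classifier would choose between.
for label, (left, right) in enumerate(quartet_topologies(["A", "B", "C", "D"])):
    print(label, f"{left[0]}{left[1]}|{right[0]}{right[1]}")

# Growth of the label space with the number of taxa.
print([num_unrooted_topologies(n) for n in range(4, 9)])
```

The rapid growth of this count shows why treating topology inference as a plain classification task is natural for four taxa but does not extend directly to larger taxon sets.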