Noam Auslander, Ayal B Gussow, Eugene V Koonin.
Abstract
The exponential growth of biomedical data in recent years has spurred the application of numerous machine learning techniques to emerging problems in biology and clinical research. By automating feature extraction, feature selection, and the generation of predictive models, these methods enable efficient study of complex biological systems. Machine learning techniques are frequently integrated with bioinformatic methods, as well as with curated databases and biological networks, to enhance training and validation, identify the most interpretable features, and enable investigation of features and models. Here, we review recently developed methods that combine machine learning, within a single framework, with techniques from molecular evolution, protein structure analysis, systems biology, and disease genomics. We outline the challenges that biomedicine poses for machine learning and, in particular, for deep learning, and suggest unique opportunities for machine learning techniques integrated with established bioinformatics approaches to overcome some of these challenges.
Keywords: bioinformatics methods; deep learning; machine learning; phylogenetics
Year: 2021 PMID: 33809353 PMCID: PMC8000113 DOI: 10.3390/ijms22062903
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Machine learning algorithms frequently used in bioinformatics research. An example of the usage of each algorithm and the respective input data are indicated on the right. Abbreviations: SVM, support vector machines; KNN, K-nearest neighbors; CNN, convolutional neural networks; RNN, recurrent neural networks; PCA, principal component analysis; t-SNE, t-distributed stochastic neighbor embedding; NMF, non-negative matrix factorization.
Figure 2. Applications of integrated machine learning techniques with bioinformatics in molecular evolution, protein structure analysis, systems biology, and disease genomics.
Representative problems and methods addressing them by incorporating machine learning (ML) with bioinformatics tools in four areas.
| Bioinformatics Area | Problem Category | Goal | ML Method | Bioinformatic Tools |
|---|---|---|---|---|
| Molecular evolution | Biological sequence clustering | Protein family prediction | CNN | Clusters of Orthologous Groups (COGs) and G protein-coupled receptor (GPCR) dataset [ |
| | | Protein function prediction | Deep RNN | BLAST and HMMER search [ |
| | | Anti-CRISPR protein identification | Random forest | MSA and PSI-BLAST [ |
| | | | eXtreme Gradient Boosting | K-mer based clustering (CD-HIT), BLAST [ |
| | | Viral pathogenicity feature identification | SVM | MSA, phylogenetic tree construction [ |
| | Alignment-free biological sequence analysis | Identification of viral genomes | RNN | BLAST, sequence clustering, HHPRED [ |
| | | | CNN | BLAST [ |
| Protein structure analysis | Post-translational modifications | Phosphorylation site prediction | KNN | Local sequence similarity [ |
| | | | CNN | K-mer based clustering (CD-HIT), BLAST [ |
| | | Glycosylation site prediction | Ensemble SVM | Curated glycosylated protein database (O-GLYCBASE) [ |
| | Protein structure prediction | Protein contact prediction | CNN | MSA [ |
| | | Prediction of distances between pairs of residues | CNN | MSA, HHPRED, PSI-BLAST [ |
| Systems biology | Inference of biological networks | Gene regulatory network prediction | SVM | GeneNetWeaver, RegulonDB [ |
| | | Protein-protein interaction network prediction | SVM | Domain affinity and frequency tables [ |
| | | | Elastic-net regression | Protein descriptors [ |
| | Analysis of biological networks | Drug target prediction | K-means | Network analysis tools [ |
| | | Drug side effect prediction | SVM | Genome-scale metabolic modeling [ |
| | | Drug synergism prediction | Random forest ensemble | A chemical-genetic interaction matrix [ |
| | Multi-omics integration | Cancer subtype prediction | Neighborhood-based clustering | Similarity-based integration [ |
| | | Drug response prediction | Logistic regression | Cancer hallmarks datasets, pathway data [ |
| Biomarker analysis for disease research | Disease-associated genes investigation | Pulmonary sarcoidosis genes identification | Hierarchical clustering | Differential expression analysis [ |
| | | Identification of miRNA-disease association | NMF | Disease semantic information and miRNA functional information [ |
| | | Disease-phenotype visualization | t-SNE | OMIM database and human disease networks [ |
| | Biomarker discovery | Cancer diagnosis | SVM | Reference gene selection [ |
| | | Biomarker signature identification | SVM | Network-based gene selection [ |
| | | Cancer outcome prediction | Random forest | Evolutionary conservation estimation [ |
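
Several rows in the table refer to alignment-free sequence analysis and k-mer based clustering. As a rough illustration of the kind of input representation such methods can start from, the sketch below turns a nucleotide sequence into a fixed-length vector of normalized k-mer frequencies. The function name `kmer_profile`, the default `k=2`, and the `ACGT` alphabet are assumptions made for this example; none of them is taken from the reviewed methods.

```python
from collections import Counter
from itertools import product


def kmer_profile(sequence, k=2, alphabet="ACGT"):
    """Return normalized k-mer frequencies in lexicographic k-mer order.

    Hypothetical helper for illustration only. Windows that contain
    characters outside `alphabet` (for example N) are ignored.
    """
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Fixed feature order, so that vectors from different sequences align.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


profile = kmer_profile("ACGTACGT", k=2)
print(len(profile))  # number of features: 4 ** k
```

Vectors of this kind have the same length for every sequence, so they can be passed to a standard classifier or clustering routine without computing an alignment first.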
Challenges posed for ML and DL in biomedicine, existing strategies to overcome these challenges and proposed solutions by integrating ML techniques with established bioinformatics approaches.
| Problem | Bottleneck | Example Solutions | Potential Integrated ML/DL and Bioinformatics Solutions |
|---|---|---|---|
| Small and dependent datasets | Data availability | Restricting the number of parameters [ | Neural network architectures for small and sparse datasets |
| | | Separating training and test sets by phylogenetic similarity [ | Methods to evaluate data dependency by protein and sequence similarities |
| Biological sequence representation | Methodological | NLP with neural networks-based modeling [ | Incorporating amino acid substitution and codon usage matrices to representation frameworks |
| | | | Incorporating conserved domain databases to the training framework |
| Incorporation of different data types | Methodological | | Integration of multi-omics datasets through existing network topologies |
| Reproducibility | Acceptance | Documentation and deposition of the processed data [ | - |
| | | Benchmarking of the processing pipeline and optimized parameters [ | - |
| Interpretability | Acceptance | Incorporation of established bioinformatic methods and databases with ML and DL frameworks [ | |
| | | Generation of interpretable DL models [ | |
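
One strategy listed in the table for small and dependent datasets is to separate training and test sets by similarity, so that closely related sequences do not appear on both sides of the split. A minimal sketch of that idea follows, assuming each sample already carries a cluster label (for instance, one produced by a sequence clustering tool). The function name `cluster_split`, its parameters, and the greedy assignment rule are assumptions made for this example, not a procedure described in the source.

```python
import random


def cluster_split(cluster_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that no cluster spans both train and test.

    Hypothetical helper for illustration only. `cluster_ids[i]` is the
    cluster label of sample i. Whole clusters are moved to the test set
    until it holds at least `test_fraction` of the samples.
    """
    clusters = {}
    for idx, cid in enumerate(cluster_ids):
        clusters.setdefault(cid, []).append(idx)
    order = sorted(clusters)  # deterministic order before shuffling
    random.Random(seed).shuffle(order)
    target = test_fraction * len(cluster_ids)
    train, test = [], []
    for cid in order:
        # Assign the whole cluster to one side, never splitting it.
        (test if len(test) < target else train).extend(clusters[cid])
    return sorted(train), sorted(test)


labels = ["a", "a", "b", "c", "c", "c", "d", "e", "e", "f"]
train_idx, test_idx = cluster_split(labels, test_fraction=0.3)
print(train_idx, test_idx)
```

Because whole clusters are assigned, the test set can end up somewhat larger than the requested fraction. That trade-off is usually preferable to leaking near-duplicate sequences into the evaluation data.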