| Literature DB >> 35154283 |
Babak Arjmand, Shayesteh Kokabi Hamidpour, Akram Tayanloo-Beik, Parisa Goodarzi, Hamid Reza Aghayan, Hossein Adibi, Bagher Larijani.
Abstract
Cancer is defined as a large group of diseases associated with abnormal cell growth and uncontrollable cell division, and it may invade other tissues of the body through various mechanisms of metastasis. What makes cancer so important is that its incidence rate is growing worldwide, which can have major health, economic, and even social impacts on both patients and governments. Early cancer prognosis, diagnosis, and treatment can therefore play a crucial role at the front line of combating cancer. The onset and progression of cancer occur under the influence of complicated mechanisms and alterations at the genome, proteome, transcriptome, and metabolome levels, among others. Consequently, the advent of omics science and its broad research branches (such as genomics, proteomics, transcriptomics, and metabolomics) as revolutionary biological approaches has opened new doors to a comprehensive perception of the cancer landscape. Because of the complexity of cancer formation and development, the study of the mechanisms underlying cancer has gone beyond any single field of the omics arena. Therefore, connecting the resultant data from different branches of omics science and examining them in a multi-omics framework can pave the way for discovering novel prognostic, diagnostic, and therapeutic approaches. As the volume and complexity of data from omics studies in cancer increase dramatically, leading-edge technologies such as machine learning can play a promising role in assessing cancer research data. Machine learning is a subset of artificial intelligence that aims at data parsing, classification, and data pattern identification by applying statistical methods and algorithms. This acquired knowledge subsequently allows computers to learn and to improve the accuracy of their predictions through experience gained from data processing.
In this context, the application of machine learning, as a novel computational technology, offers new opportunities for achieving in-depth knowledge of cancer through the analysis of data from multi-omics studies. Therefore, it can be concluded that the use of artificial intelligence technologies such as machine learning can play a revolutionary role in the fight against cancer.
Keywords: artificial intelligence; cancer; data analysis; machine learning; multi-omics
Year: 2022 PMID: 35154283 PMCID: PMC8829119 DOI: 10.3389/fgene.2022.824451
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. Cancer treatment approaches (Tacón, 2003; Roffe et al., 2005; Jones and Demark-Wahnefried, 2006; Sagar et al., 2007; Lu et al., 2008; Giustini et al., 2010; Masafi et al., 2011; Stanczyk, 2011; Fleisher et al., 2014; Drãgãnescu and Carmocan, 2017; Bilgin et al., 2018; Carlson et al., 2018; Hojman et al., 2018; Yadav et al., 2018; Bidram et al., 2019; Psihogios et al., 2019; Pucci et al., 2019; Laoudikou and McCarthy, 2020; Najafpour and Shayanfard, 2020).
FIGURE 2. ML approaches. The main approaches of machine learning are: 1) supervised learning, 2) unsupervised learning, 3) semi-supervised learning, and 4) reinforcement learning. In supervised learning, the inputs and outputs are specified and the data are labeled. In unsupervised learning, no labels exist in advance and no input-output mapping is sought; the aim is only to categorize the data. Semi-supervised learning uses labeled and unlabeled data simultaneously to improve learning accuracy. The reinforcement learning loop consists of a sequence of states, actions, and rewards (Sedghi et al., 2020).
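To make the distinction in Figure 2 concrete, the sketch below contrasts a supervised learner (a nearest-centroid classifier trained on labeled data) with an unsupervised one (a tiny two-cluster k-means on unlabeled data). This is a minimal, dependency-free illustration; the toy "expression profiles" and the class names "normal"/"tumor" are invented for the example and are not taken from the article.

```python
# Supervised vs. unsupervised learning on invented toy data (illustration only).

def nearest_centroid_fit(samples, labels):
    """Supervised: learn one centroid per class from labeled samples."""
    centroids = {}
    for label in set(labels):
        group = [s for s, l in zip(samples, labels) if l == label]
        centroids[label] = [sum(col) / len(group) for col in zip(*group)]
    return centroids

def nearest_centroid_predict(centroids, sample):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(sample, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

def kmeans_2(samples, iters=10):
    """Unsupervised: split unlabeled samples into 2 clusters (tiny k-means)."""
    c0, c1 = samples[0], samples[-1]  # naive initialization from the data
    for _ in range(iters):
        g0 = [s for s in samples
              if sum((a - b) ** 2 for a, b in zip(s, c0))
              <= sum((a - b) ** 2 for a, b in zip(s, c1))]
        g1 = [s for s in samples if s not in g0]
        c0 = [sum(col) / len(g0) for col in zip(*g0)]
        c1 = [sum(col) / len(g1) for col in zip(*g1)]
    return g0, g1

# Toy two-feature "profiles" for two hypothetical sample groups.
data = [(1.0, 1.2), (0.9, 1.0), (5.0, 5.1), (5.2, 4.9)]
labels = ["normal", "normal", "tumor", "tumor"]

model = nearest_centroid_fit(data, labels)
print(nearest_centroid_predict(model, (4.8, 5.0)))  # prints "tumor"
```

The supervised model needs the `labels` list to learn; the unsupervised `kmeans_2` recovers the same two groups from `data` alone, which mirrors the labeled vs. unlabeled distinction drawn in the figure.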
Challenges of multi-omics data analysis by ML and their solutions.
| Challenges | Consequences of the challenge | Solutions | References |
|---|---|---|---|
| Complex data sets containing a large amount of meaningless additional information | Existing patterns are difficult to analyze or describe across the vast omics data layers; classification accuracy decreases and the prediction of meaningful data becomes difficult | 1) Ensemble techniques; 2) Distance-based algorithms; 3) Single-learning-based techniques; 4) Deep learning based on an autoencoder architecture; 5) The EMD method; 6) Embedding label correlations (ELCs) | |
| High number of omics data variables relative to the sample size | Data dimensionality increases ("curse of dimensionality") | Dimensionality reduction methods: 1) Linear FE methods such as PCA, MCIA, joint NMF, and MOFA; 2) Nonlinear FE methods such as t-SNE, autoencoders, and representation learning; 3) Filter FS methods such as mRMR, FCS, Information Gain, and ReliefF; 4) Wrapper FS methods such as RFE-SVM, Boruta, and jackstraw; 5) Embedded FS methods such as LASSO, Elastic Net, and stability selection | |
| Data heterogeneity (data of different types or with different distributions) | The balance of ML is upset and data integration is hindered | With naive feature concatenation-based integration: 1) Tree-based methods such as decision trees and random forests; 2) Penalized linear models such as Elastic Net, LASSO, and TANDEM. Beyond simple feature concatenation-based integration: 1) MKL methods such as SimpleMKL and Bayesian multitask MKL; 2) Graph and network methods such as SNF, NetICS, PARADIGM, and HetroMed; 3) Latent-subspace methods such as iCluster+, Scluster, and MV-RBM; 4) Deep learning methods such as multimodal DBN, multimodal DNN, improved CPR, and AuDNNsynergy | |
| Class imbalance | The degree of overlap among classes increases and the size of the training data is limited; class distributions become highly imbalanced, and loss of within-class balance gives rise to small disjuncts | 1) Data-sampling methods: undersampling the majority class, oversampling the minority class, or a combination of the two; 2) Cost-sensitive learning methods such as Mnet, UNIPred, SVM_weight, and Spotlite; 3) Ensemble methods such as BalanceCascade, EasyEnsemble, ensembles with WMV, and WELM; 4) Evaluation-measure methods such as Diablo, SNN, WMV, and FPRF | |
| Missing data | Increased parameter bias and analytical complexity; reduced sample representativeness and statistical power | With a sufficient number of samples: 1) Listwise deletion. Otherwise: 1) Matrix factorization methods such as ALRA, SVD-impute, and SparRec; 2) Autoencoder methods such as MIDA, multilayer autoencoders, and AutoImpute; 3) Integrative imputation methods such as MOFA, LF-IMVC, and ensemble regression imputation; 4) Maximum likelihood approaches such as the EM algorithm and direct maximization; 5) Single imputation methods such as replacement with mean or mode values, hot-deck imputation, regression imputation, and k-nearest neighbors; 6) MI methods for linear analysis such as MI-MFA, MCMC, and MICE; 7) MI methods for nonlinear analysis such as MICE with RF, MIDA, and GMM-ELM | |
| Data scalability | A practical ML-based data processing workflow for multi-omics projects becomes difficult and problematic on a single computer | 1) Efficient algorithms for big data such as non-iterative neural networks, scalable MKL methods, and convex optimization for big data; 2) Online training algorithms such as OS-ELM, IDSVM, and online deep learning; 3) Distributed data processing frameworks such as Spark's MLlib, Apache Mahout, and Google's TensorFlow; 4) Cloud computing-based solutions such as Galaxy Cloud, MetaboAnalyst, XCMS Online, Omics Pipe, and ML-as-a-service | |
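Among the single-imputation remedies listed for the missing-data challenge, replacement with mean values and k-nearest-neighbor imputation are the simplest to illustrate. The sketch below is a minimal, dependency-free version of both; the 4-sample × 3-feature matrix is invented for the example (None marks a missing entry), and this is not the article's own implementation.

```python
# Two single-imputation strategies from the "Missing data" row, on toy data.

def mean_impute(matrix):
    """Replace each missing entry with the column mean of the observed values."""
    cols = list(zip(*matrix))
    means = [sum(v for v in c if v is not None) / sum(v is not None for v in c)
             for c in cols]
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in matrix]

def knn_impute(matrix, k=2):
    """Fill a gap with the mean of that feature in the k nearest complete rows,
    measuring distance only on the features observed in the incomplete row."""
    complete = [r for r in matrix if None not in r]
    out = []
    for row in matrix:
        if None not in row:
            out.append(list(row))
            continue
        obs = [j for j, v in enumerate(row) if v is not None]
        neigh = sorted(complete,
                       key=lambda r: sum((row[j] - r[j]) ** 2 for j in obs))[:k]
        out.append([sum(r[j] for r in neigh) / len(neigh) if v is None else v
                    for j, v in enumerate(row)])
    return out

data = [
    [1.0, 2.0, 3.0],
    [1.1, None, 3.2],   # gap in feature 1
    [9.0, 8.0, 7.0],
    [9.2, 8.1, None],   # gap in feature 2
]
print(mean_impute(data)[1][1])  # column mean of observed 2.0, 8.0, 8.1 (about 6.03)
print(knn_impute(data, k=1)[1][1])  # nearest complete row is [1.0, 2.0, 3.0]
```

The contrast shows why neighbor-based imputation is preferred when samples cluster: the column mean (about 6.03) is far from both groups, while the k=1 neighbor estimate (2.0) respects the local structure.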
ALRA, Adaptively-thresholded low-rank approximation; AuDNNsynergy, Deep Neural Network Synergy model with Autoencoders; Diablo, data integration analysis for biomarker discovery using latent components; ELCs, embedding label correlations; ELM, extreme learning machine; EM, expectation-maximization; EMD, empirical mode decomposition; FCS, correlation-based FS; FE, Feature extraction; FPRF, fuzzy pattern random forest; FS, Feature selection; GMM, Gaussian mixture model; IDSVM, incremental and decremental support vector machine; improved CPR, improved Clustering and PageRank; Joint NMF, Joint non-negative matrix factorization; KRR, kernel ridge regression; LASSO, least absolute shrinkage and selection operator; LF-IMVC, Late Fusion Incomplete Multi-View Clustering; MCIA, Multiple co-inertia analysis; MCMC, Markov-chain Monte Carlo; MI, Multiple imputation; MICE, multivariate imputation by chained equation; MIDA, denoising autoencoder-based MI; MI-MFA, MI for multiple factor analysis; MKL, Multiple kernel learning; ML, Machine learning; MOFA, Multi-omics factor analysis; mRMR, maximal-relevance and minimal-redundancy; multimodal DBN, multimodal deep belief networks; multimodal DNN, multimodal deep neural networks; MV-RBM, mixed variable restricted Boltzmann machine; NetICS, Network-based Integration of Multi-omics Data; OS-ELM, online sequential extreme learning machine; PARADIGM, PAthway Recognition Algorithm using Data Integration on Genomic Models; PCA, Principal component analysis; RF, random forest; RFE-SVM, recursive feature elimination-support vector machine; SNF, similarity network fusion; SNN, super-layered neural network architecture; SparRec, Sparse Recovery; SVD, singular value decomposition; SVM, support vector machine; t-SNE, t-distributed stochastic neighbor embedding; UNIPred, unbalance-aware network integration and prediction of protein functions; WELM, weighted extreme learning machine; WMV, weighted majority voting.
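For the class-imbalance challenge in the table, the most basic data-sampling remedy is random oversampling of the minority class: duplicating minority samples until the class distribution is even. The sketch below is a dependency-free illustration under invented data; the 5:1 "healthy"/"tumor" counts and class names are assumptions for the example, not figures from the article.

```python
import random

def random_oversample(samples, labels, seed=0):
    """Return a class-balanced copy of (samples, labels) by duplicating
    randomly chosen minority-class samples (seeded for reproducibility)."""
    rng = random.Random(seed)
    by_class = {}
    for s, l in zip(samples, labels):
        by_class.setdefault(l, []).append(s)
    target = max(len(group) for group in by_class.values())
    out_s, out_l = [], []
    for l, group in by_class.items():
        resampled = group + [rng.choice(group) for _ in range(target - len(group))]
        out_s.extend(resampled)
        out_l.extend([l] * target)
    return out_s, out_l

# Toy 1-feature profiles: 5 "healthy" samples vs. only 1 "tumor" sample.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [9.0]]
y = ["healthy"] * 5 + ["tumor"]

Xb, yb = random_oversample(X, y)
print(yb.count("healthy"), yb.count("tumor"))  # prints "5 5"
```

Undersampling the majority class is the mirror-image operation (discarding majority samples down to the minority count), and the combined strategy in the table applies both, meeting somewhere between the two class sizes.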