Hala Ahmed, Louai Alarabi, Shaker El-Sappagh, Hassan Soliman, Mohammed Elmogy.
Abstract
BACKGROUND AND OBJECTIVES: This paper presents an in-depth review of state-of-the-art genetic variation analysis for discovering complex genes associated with the brain's genetic disorders. We first introduce the genetic analysis of complex brain diseases, genetic variation, and DNA microarrays. The review then focuses on available machine learning methods for complex brain disease classification, discussing the various datasets, preprocessing, feature selection and extraction, and classification strategies. In particular, we concentrate on single nucleotide polymorphisms (SNPs), which support the highest resolution of genomic fingerprinting for tracking disease genes. Subsequently, the study provides an overview of applications for some specific diseases, including autism spectrum disorder (ASD), brain cancer, and Alzheimer's disease (AD). The study argues that despite significant recent developments in the analysis and treatment of genetic disorders, considerable challenges remain in elucidating causative mutations, especially from the viewpoint of implementing genetic analysis in clinical practice. The review finally provides a critical discussion of the applicability of genetic variation analysis for complex brain disease identification, highlighting future challenges.
Keywords: Brain disease; Deep learning; Genetic analysis; Machine learning; Microarrays; Single nucleotide polymorphism (SNP)
Year: 2021 PMID: 34616886 PMCID: PMC8459785 DOI: 10.7717/peerj-cs.697
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Comparison of early symptoms of various brain diseases.
| Early symptom | Dementia with Lewy bodies | Parkinson’s disease (PD) | Alzheimer’s disease (AD) |
|---|---|---|---|
| Age of onset | >60 years old | >70 years old | >60 years old |
| Gender-specific | Men > Women | Conflicting | Men = Women |
| Family history | No | Conflicting | Yes |
| Significant Loss of Memory | Possible | Possible Years After Diagnosis | Always |
| Problems of Language | Possible | Possible | Possible |
| Fluctuating Cognitive Abilities | Likely | Possible | Possible |
| Planning or Problem-solving Abilities | Likely | Possible | Possible |
| Decline in Thinking Abilities that Interfere with Everyday Life | Always | Possible Years After Diagnosis | Always |
| Difficulty with a Sense of Direction or Spatial Relationships between Objects | Likely | Possible | Possible |
Figure 1The structure of the survey.
Figure 2The guidelines for systematic review analysis.
Databases used to select the academic articles in this review.
| Academic database |
|---|
| Science Direct |
| SpringerLink |
| IEEE Xplore |
| Web of Science |
| PubMed |
| PeerJ |
| Scopus |
Figure 3The haplotype of SNPs for individuals.
Figure 4The common ML procedure for the complex brain disease diagnosis.
Some of the benchmark datasets.
| Dataset | Diseases | Source |
|---|---|---|
| GEO database | ASD | |
| KEGG database | AD | |
| ADNI / whole-genome sequencing (WGS) datasets | AD | |
| TCGA | Cancer | The Cancer Genome Atlas (TCGA) |
| UCI for Cancer | Cancer | UCI Machine Learning Repository |
| NCBI Gene Expression Omnibus (GEO) | Cancer | NCBI repository |
Figure 5Methods for pattern classification with missing data.
The main feature selection techniques.
| Feature selection | Advantages | Disadvantages |
|---|---|---|
| Filter (univariate) | Fast; scalable; selection is independent of the classifier | Ignores feature dependencies; ignores interaction with the classifier |
| Filter (multivariate) | Models feature dependencies; independent of the classifier; better computational complexity than wrapper methods | Slower and less scalable than univariate methods; ignores interaction with the classifier |
| Wrapper (deterministic) | Simple; interacts with the classifier; models feature dependencies; less computationally intensive than randomized methods | Risk of overfitting; more prone than randomized algorithms to getting stuck in a local optimum (greedy search); selection depends on the classifier |
| Wrapper (randomized) | Less prone to local optima; models feature dependencies | Computationally intensive; selection depends on the classifier; higher risk of overfitting than deterministic algorithms |
| Embedded | Interacts with the classifier; better computational complexity than wrapper methods; models feature dependencies | Selection depends on the classifier |
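The filter/wrapper contrast in the table can be sketched in a few lines. The example below is illustrative only (not from the survey): on synthetic data, scikit-learn's `SelectKBest` plays the univariate filter (fast, classifier-agnostic) and `RFE` plays the deterministic wrapper (greedy, classifier-dependent).

```python
# Illustrative sketch: univariate filter vs. deterministic wrapper selection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a genotype/expression matrix.
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=5, random_state=0)

# Filter: scores each feature independently of any classifier,
# so it is fast and scalable but ignores feature dependencies.
filt = SelectKBest(score_func=f_classif, k=5).fit(X, y)
filter_idx = np.sort(filt.get_support(indices=True))

# Wrapper: greedily eliminates features using the classifier itself,
# so it interacts with the classifier but is slower and classifier-dependent.
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
wrapper_idx = np.sort(np.where(wrap.support_)[0])

print(filter_idx, wrapper_idx)
```

The two methods need not pick the same five features, which is exactly the classifier-dependence the table describes.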
A comparison between selection and extraction techniques.
| Method | Advantages | Disadvantages |
|---|---|---|
| Selection | Preserves data interpretability; lower training times; reduces overfitting | Lower discriminative power |
| Extraction | Higher discriminative power; controls overfitting when it is unsupervised | Interpretability of the data is lost; the transformation can be costly |
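The selection/extraction trade-off can also be shown concretely. This minimal sketch (not from the survey) contrasts `SelectKBest`, which keeps five of the original, interpretable features, with `PCA`, which extracts five new components that are linear combinations of all inputs.

```python
# Illustrative sketch: feature selection vs. feature extraction.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=100, n_features=20, random_state=0)

# Selection: the 5 output columns are original features (interpretable).
X_sel = SelectKBest(f_classif, k=5).fit_transform(X, y)

# Extraction: the 5 output columns are new components (interpretability lost).
X_ext = PCA(n_components=5).fit_transform(X)

print(X_sel.shape, X_ext.shape)
```

Both reduce 20 dimensions to 5, but only the selected columns map back to named genes or SNPs.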
An overview of ASD studies using different ML methods.
| Authors | Application on diseases | Method | Results | Problem with method |
|---|---|---|---|---|
| | ASD | FPM algorithms & contrast mining | Identified significant associations from 286 connected genes, including 193 novel autism candidates. | Storing many combinations of items is a memory-requirement challenge for FPM. |
| | ASD | Data mining and fuzzy rules | Fuzzy rules achieved up to 91.35% accuracy and a 91.40% sensitivity rate. | The assessment of features in this study is not extensive and does not consider other target datasets such as adults, adolescents, and infants. |
| | ASD | SVM, NB, linear discriminant analysis, and KNN | Classification accuracy up to 96%. | Their work remains incomplete until the genetic basis of diseases and traits is well understood. |
| | ASD | ML approach | Area under the receiver operating characteristic curve (AUC) = 0.80. | The accuracy of the system needs to be enhanced. |
| | ASD | SVM | Mean accuracy = 76.7%. | - |
| | ASD | DSs, ADTrees, and FlexTrees | DS and FlexTree achieved an accuracy of 67%. | One limitation of this work is that the study includes only 29 SNPs. |
An overview of AD studies using different ML methods.
| Authors | Application on diseases | Methods | Results | Problem with method |
|---|---|---|---|---|
| | AD | Classification techniques fed by each feature-selection subset | Classification accuracy | They cannot generalize that all machine learning techniques are efficient in disease classification. |
| | AD | Feature selection & ML & 10-fold validation | Classification accuracy of 91.6% | The system cannot be applied generally to all diseases, so a novel system is needed to detect all diseases. |
| | AD | Classification with ML techniques | Accuracy of NB 99.63% | Their system does not support integrating metadata such as gender, age, and smoking to check whether these data are associated with SNPs. |
| | AD | ML & DM techniques | Accuracy of GLM is 88.24% | The model cannot be trained with unbalanced data, and the data are insufficient for all disease classes. |
| | AD | ANN | The methodology can reliably generate novel markers | The power and speed of the algorithms need to be improved. |
| | AD | CNN | Accuracy of 90.91% and an F1-score of 0.897 | This methodology needs to be deployed in a cloud architecture that collects accelerometer data, with a subscription service for monitoring changes in the AD stage. |
Different applications used to detect biomarkers for diseases.
| Authors | Application on diseases | Methods | Results | Problem with method |
|---|---|---|---|---|
| | CAD for malaria, brain tumors, etc. | ML | Their approach provided reasonable performance for all these applications | Their research needs to be extended further by studying the ROI determined by class activation mapping. |
| | Bacterial and archaeal | ML approach using ANN and SVM | RNA genes could be recognized with high confidence using ML | Future studies are necessary to characterize the extent of these elements and the accuracy of prediction. |
| | Lung cancer | SVM | CAD achieved a sensitivity of 82.82% | They should optimize the feature set for SVM classification. |
| | AD | ML | The AUC to | |
| | LOAD | ML | Classification performance is 72% | The classification performance needs further enhancement. |
| | Business big data analytics | Fuzzy logic dependent on ML tools | Their improved version yielded an additional 10% accuracy. | Their tool needs to handle further outliers and edge cases and explore factors that are more explicit and/or implicit in business processes. |
| | Network systems | Distributed mean-field | Mean square error = 2.01 | They use no more complex vision-based detectors; in addition to reducing measurement noise, they do not use deep-learned features. |
| | Smoking prediction models | SVM and RF | AUC of SVM = 0.720, and AUC of RF = 0.667. | Their future work needs to identify the complex inner relations between these SNPs and smoking status. |
Some publicly available datasets.
| Authors | Application on diseases | Dataset |
|---|---|---|
| | ASD | UCI data repository |
| | Breast cancer disease | NCBI |
| | AD | KEGG database |
| | AD | ADNI Phase 1 (ADNI-1) / whole-genome sequencing (WGS) datasets |
The confusion matrix elements.
| | | Predicted | |
|---|---|---|---|
| | | Normal | Abnormal |
| Actual | Normal | TN | FP |
| | Abnormal | FN | TP |
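These four elements can be obtained directly from predictions. The sketch below is illustrative only: the labels are toy data, and it assumes "abnormal" is encoded as the positive class (1), as is conventional in diagnostic classification.

```python
# Illustrative sketch: extracting TN, FP, FN, TP with scikit-learn.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # 0 = normal, 1 = abnormal (assumed encoding)
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```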
Some of the performance evaluation metrics.
| Metrics | Description | Formula |
|---|---|---|
| Accuracy | The sum of TP and TN divided by the total population | (TP + TN) / (TP + TN + FP + FN) |
| Sensitivity, Recall | TP divided by the sum of TP and FN | TP / (TP + FN) |
| Specificity | TN divided by the sum of TN and FP | TN / (TN + FP) |
| AUROC | The average area under the ROC curve | |
| DSC | Twice TP divided by the sum of 2TP, FP, and FN | 2TP / (2TP + FP + FN) |
| Precision | TP divided by the sum of TP and FP | TP / (TP + FP) |
| MCC | An effective metric for overcoming the class imbalance issue | (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)) |
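All of these metrics follow from the four confusion-matrix counts. The helper below is an illustrative sketch (the function name `metrics` is ours, not from the survey), implementing the standard formulas for accuracy, sensitivity, specificity, precision, DSC, and MCC.

```python
# Illustrative sketch: performance metrics from confusion-matrix counts.
import math

def metrics(tp, tn, fp, fn):
    acc  = (tp + tn) / (tp + tn + fp + fn)     # accuracy
    sens = tp / (tp + fn)                      # sensitivity / recall
    spec = tn / (tn + fp)                      # specificity
    prec = tp / (tp + fp)                      # precision
    dsc  = 2 * tp / (2 * tp + fp + fn)         # Dice similarity coefficient
    mcc  = (tp * tn - fp * fn) / math.sqrt(    # Matthews correlation coefficient
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, sens, spec, prec, dsc, mcc

print(metrics(3, 3, 1, 1))  # (0.75, 0.75, 0.75, 0.75, 0.75, 0.5)
```

Unlike accuracy, MCC stays informative when the normal/abnormal classes are heavily imbalanced, which is common in disease cohorts.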
Figure 6Diagnosis system for ASD based on DM techniques.
Figure 7The main tasks of the gene-based analysis system.
Figure 8The computational system for AD disorders.
Figure 9The identification of genetic interactions system.
Different applications of ML and their accuracy.
| Authors | Methods | Accuracy | Disease |
|---|---|---|---|
| | ML | 96% | ASD |
| | ML | 93% | AD |
| | DM and fuzzy rules | 91.35% | ASD |
| | NB | 100% | Cancer |
| | ML | 91.6% | AD |
| | SVM | 82.82% | Lung cancer |
| | NB | 99.63% | AD |
| | GLM | 88.24% | AD |
| | CNN | 90.91% | AD |
| | ML | 84% | AD |
Figure 10Limitations and future directions.
An overview of cancer studies using different ML methods.
| Authors | Application on diseases | Method | Results | Problem with method |
|---|---|---|---|---|
| | Cancer disease | Classification by ANN | The best classifier achieved a mean square error of 0.0000001. | The system cannot be applied generally to all diseases, so a novel system is needed to detect all diseases. |
| | Cancer | NB classifier with stratified 10-fold cross-validation | Accuracy up to 100% | This approach is less reliable on microarray datasets, which have small sample sizes. |
| | Periodontal disease (PD) and cardiovascular disease (CVD) | NB, SVM, and UDC | The performance of NB and SVM was better than that of the uncorrelated normal-based quadratic Bayes classifier (UDC). | The number of features is limited, so the differences between the tests’ accuracy levels were not noticeable. |
| | Cancer | ANN, k-NN, DTs, NB, RF, and SVM | RF provides the best performance for the classification | |