Xiaotian Dai, Guifang Fu, Shaofei Zhao, Yifei Zeng.
Abstract
Although imbalance between case and control groups is prevalent in genome-wide association studies (GWAS), it is often overlooked. This imbalance is becoming more significant and urgent as the rapid growth of biobanks and electronic health records has enabled the collection of thousands of phenotypes from large cohorts, in particular for diseases with low prevalence. Unbalanced binary traits pose serious challenges to traditional statistical methods in both genomic selection and disease prediction. For example, the well-established linear mixed models (LMM) yield inflated type I error rates in the presence of unbalanced case-control ratios. In this article, we review multiple statistical approaches that have been developed to overcome the inaccuracy caused by unbalanced case-control ratios, and comment on the advantages and limitations of each approach. We also explore the potential of several powerful and popular state-of-the-art machine-learning approaches that have not yet been applied to the GWAS field. This review paves the way for better analysis and understanding of unbalanced case-control disease data in GWAS.
Keywords: GWAS; disease; genomic prediction; genomic selection; unbalanced case-control
Year: 2021 PMID: 34068248 PMCID: PMC8153154 DOI: 10.3390/genes12050736
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
The mean (standard error) of the simulation example across 100 replications.
| Simulation Settings | Standard Error | |
|---|---|---|
| Balanced data | 0.5956 (0.0275) | 0.0275 (0.0689) |
| Unbalanced data | 0.9731 (0.1410) | 0.1916 (0.2664) |
Figure 1. Overview of GEV-NN structure.
A summary of the methods, evaluated on the aspects discussed in the manuscript. Each method has its own advantages and limitations.
| Method | Can the Method Be Applied to Genomic Selection? | Can the Method Be Applied to Genomic Prediction? | Can the Method Handle an Unbalanced Binary Response? |
|---|---|---|---|
| GMMAT | √ GMMAT is designed to perform a significance test on each variant. | ✘ GMMAT is a single-SNP method and is not suited to prediction. | ✘ Its significance test assumes a Gaussian distribution, which does not hold for unbalanced data. |
| SAIGE | √ SAIGE is designed to perform a significance test on each variant. | ✘ SAIGE is a single-SNP method and is not suited to prediction. | √ SAIGE uses the entire cumulant generating function to approximate |
| B-LORE | √ B-LORE is a joint Bayesian variable selection regression method designed for high-dimensional variants. | √ B-LORE is a joint Bayesian regression and can be used for prediction. | ✘ B-LORE cannot handle extremely unbalanced binary data. |
| SVM | √ SVM has not been widely used in the GWAS field yet, but it has the potential to select important variants or to use permutation-based testing to obtain significance. | √ SVM is a machine-learning method whose strength is producing accurate predictions. | √ SVM with weighted DWD can handle extremely unbalanced binary data. |
| AdaBoost | √ AdaBoost has not been widely used in the GWAS field yet, but it has the potential to select important variants or to use permutation-based testing to obtain significance. | √ AdaBoost is a machine-learning method whose strength is producing accurate predictions. | √ AdaBoost can handle extremely unbalanced binary data by assigning higher misclassification costs to the minority class. |
| Neural Network | √ Neural networks have not been widely used in the GWAS field yet, but they have the potential to select important variants or to use permutation-based testing to obtain significance. | √ A neural network is a machine-learning method whose strength is producing accurate predictions. | √ RCOSNet and GEV-NN can handle extremely unbalanced binary data. |
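
The idea of assigning higher misclassification costs to the minority class, noted in the table for AdaBoost, can be illustrated with a minimal sketch. This is not the implementation of any method reviewed here: the inverse-frequency weighting rule, the function names, and the toy cohort of 5 cases and 95 controls are all assumptions chosen for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n / (n_classes * class_count).

    This common heuristic is an assumption here, not the specific
    cost scheme of any method in the table.
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_error(labels, predictions, weights):
    """Misclassification rate where each sample counts by its class weight."""
    total = sum(weights[y] for y in labels)
    wrong = sum(weights[y] for y, p in zip(labels, predictions) if y != p)
    return wrong / total


# Hypothetical toy cohort: 5 cases (label 1) and 95 controls (label 0).
labels = [1] * 5 + [0] * 95

# A trivial classifier that calls every subject a control.
always_control = [0] * len(labels)

weights = balanced_class_weights(labels)
plain = weighted_error(labels, always_control, {0: 1.0, 1: 1.0})
costed = weighted_error(labels, always_control, weights)

print(f"class weights:    {weights}")
print(f"unweighted error: {plain:.3f}")
print(f"weighted error:   {costed:.3f}")
```

With equal costs, the classifier that labels everyone a control looks accurate because it misses only the 5 cases. Once the minority class is upweighted, missing every case costs as much as misclassifying every control, so the same classifier scores no better than chance.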