Hannah L Nicholls1,2, Christopher R John2,3, David S Watson2,4, Patricia B Munroe1,5, Michael R Barnes1,2,5,6, Claudia P Cabrera1,2,5.
Abstract
Genome-wide association studies (GWAS) have revealed thousands of genetic loci that underpin the complex biology of many human traits. However, the strength of GWAS - the ability to detect genetic association by linkage disequilibrium (LD) - is also its limitation. Whilst the ever-increasing study size and improved design have augmented the power of GWAS to detect effects, differentiation of causal variants or genes from other highly correlated genes associated by LD remains the real challenge. This has severely hindered the biological insights and clinical translation of GWAS findings. Although thousands of disease susceptibility loci have been reported, causal genes at these loci remain elusive. Machine learning (ML) techniques offer an opportunity to dissect the heterogeneity of variant and gene signals in the post-GWAS analysis phase. ML models for GWAS prioritization vary greatly in their complexity, ranging from relatively simple logistic regression approaches to more complex ensemble models such as random forests and gradient boosting, as well as deep learning models, i.e., neural networks. Paired with functional validation, these methods show important promise for clinical translation, providing a strong evidence-based approach to direct post-GWAS research. However, as ML approaches continue to evolve to meet the challenge of causal gene identification, a critical assessment of the underlying methodologies and their applicability to the GWAS prioritization problem is needed. This review investigates the landscape of ML applications in three parts: selected models, input features, and output model performance, with a focus on prioritizations of complex disease associated loci. Overall, we explore the contributions ML has made towards reaching the GWAS end-game with consequent wide-ranging translational impact.
Keywords: artificial intelligence; candidate gene; clinical translation; data science; deep learning; genome-wide association study; genomics; machine learning
Year: 2020 PMID: 32351543 PMCID: PMC7174742 DOI: 10.3389/fgene.2020.00350
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. Supervised Machine Learning Algorithm Training. (A) Data containing labeled genes (e.g., genes labeled as causal or non-causal for blood pressure, BP) and columns of features describing those genes are input into a machine learning algorithm. The algorithm first initializes itself by applying its rules at random to a subset of the data (the training data) and its features. For example, an algorithm’s first practice iteration can involve assigning feature importance at random (importance is denoted by the size of the feature image). The algorithm uses this initialization to classify genes as either affecting BP (red genes) or not affecting BP (blue genes). (B) The algorithm then uses these practice predictions to calculate a loss (an error rate) and iterates over the data again, applying the previous iteration’s loss to adjust how each feature is handled. By using the loss calculations, the algorithm aims to improve predictive performance with each training iteration.
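
The training loop described in Figure 1 can be sketched in a few lines. This is a minimal illustration only, assuming logistic regression trained by gradient descent as the algorithm; the figure itself is not tied to any one model. The two feature columns and the six labeled genes are hypothetical toy values, and the function names are invented for this example.

```python
import math
import random


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, row):
    """Probability that a gene (one row of features) is labeled causal."""
    return sigmoid(sum(w * v for w, v in zip(weights, row)) + bias)


def log_loss(weights, bias, X, y):
    """Mean cross-entropy between predictions and the true labels."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for row, label in zip(X, y):
        p = predict_proba(weights, bias, row)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(X)


def train(X, y, n_iter=200, learning_rate=0.5, seed=0):
    """Start from random feature weights, then repeatedly use the loss
    gradient from the current predictions to adjust the weights."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in X[0]]  # random initialization
    bias = 0.0
    losses = [log_loss(weights, bias, X, y)]
    for _ in range(n_iter):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, label in zip(X, y):
            error = predict_proba(weights, bias, row) - label
            for j, value in enumerate(row):
                grad_w[j] += error * value
            grad_b += error
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(X)
        losses.append(log_loss(weights, bias, X, y))
    return weights, bias, losses


# Hypothetical toy data: each row is a gene, each column a feature
# (for example, an expression score and a conservation score).
X = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2], [0.2, 0.8], [0.1, 0.6], [0.3, 0.9]]
y = [1, 1, 1, 0, 0, 0]  # 1 = labeled causal for BP, 0 = labeled non-causal

weights, bias, losses = train(X, y)
predictions = [int(predict_proba(weights, bias, row) >= 0.5) for row in X]
```

The `losses` list records the error before training and after each iteration, so comparing its first and last entries shows whether the iterations improved the fit on the training data.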
FIGURE 2. Supervised Machine Learning Models. Diagram detailing three machine learning model bases used in supervised learning, each providing varying algorithms most commonly used in post-GWAS prioritization.
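Curation of machine learning studies applied to post-GWAS prioritization of variants and genes.

| PMID | Model(s); prioritization target – disease or trait | Approach | Validation and dataset |
| --- | --- | --- | --- |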
| 30692607* | LR; Genes – Crohn’s disease | Uses backward stepwise regression to build significant expression datasets (with emphasis on epigenetic data) to give prediction in combination with genotype data. Expression data reduces the uncertainty of smaller effect loci shown in fine-mapping and prioritization was followed-up with protein network analyses for validation | 10-fold cross validation (2,000 genes per fold) |
| 25935003* | LR; Genes – Crohn’s disease | Combines GWAS results with gene expression features and whether genes are associated with other autoimmune diseases to better identify disease-related genes. More powerful for prioritizing rare missense variants | Cross-validation performed. 50:50 training:testing ratio. Training iterated 500 times. 54 Crohn’s disease genes used as positively labeled training genes |
| 29407288 | SVM, LASSO, classification-regression trees; Variants – major depressive disorder and adverse drug response (duloxetine) | Models used features selected by LASSO regression and classified variants based on a clinical depression scoring defining drug response and remission | Dataset size: 186 patients. Nested 5-fold cross-validation. 80:20 training:testing ratio |
| 21317188* | SVM, RF; Variants – arthritis and T1D | Compares support vector machine and random forest performance to chi-squared ranking | Dataset size: 452,176 T1D SNPs; 63 arthritis SNPs |
| 31779641 | RF; Variants – intronic variants associated to cellular sensitivity to clofarabine-induced cytotoxicity | Focuses on integrating splicing data features with other types. Validates model prioritization with laboratory follow-up – limited by technical noise during laboratory work | 3-fold cross-validation. Training data size: 6,676 variants. Testing data size: 1,222 variants |
| 24564704* | Parallel RF Regression; Variants – brain structure and function. Alzheimer’s disease GWAS | Designed to run on large Hadoop clusters, including those available through cloud computing. Multivariate applications not available on Hadoop | Each tree bootstraps to form training data (63.2%) with out-of-bag samples for test data. 500 simulated datasets |
| 28592878* | RF Hyper-ensemble; Non-coding variants – curated mendelian diseases | Addresses class imbalance via resampling using simultaneous oversampling of minority class and undersampling of majority class. HyperSMURF can detect disease variants nearby to non-disease variants | 10-fold cross-validation partitioning variants into chromosomal bands so no variants had same location, gene or disease in training and testing. GWAS total size approximately 2,000 variants |
| 25633252* | GB; Genes – cardiovascular diseases and traits | Explored prioritization of 38 phenotypes (predominantly cardiovascular). Each tree within model updates a log-odds of disease association per gene. GWA-prediction assigns scores to genes in loci based on reasonings (transcription sites, experimental evidence, etc.) to identify likely positives which are used in training for phenotypes with GWAS training data | Six rounds of 8-fold cross-validation. Seventy percent of loci as positive training examples with matching numbers of negative samples |
| 30591030* | LR and DL; Genes and variants – schizophrenia and autism | Performed variant prioritization which fed into gene prioritization. Variant prioritization used eQTL and pathogenic scoring data features. Gene prioritization used the variant rank in combination with genotypic data. Used to prioritize an individual’s variants and genes and can be re-applied to GWAS data | 10-fold cross-validation on four training and test sets |
| 28795970 | LR with elastic net, RF, SVM with polynomial kernel, extreme GB; Genes – inflammatory bowel diseases | All genes in dataset were annotated with 1,027 features. 16,390 genes scored and classified, with prediction as a score between 0 and 1. Models evaluated separately and together in combined performance score | 5-fold cross-validation repeated 10 times. Training data: 314 positive genes and 1,736 negative genes |
| 30013180* | DL – ExPecto; Variants – publicly available GWAS for four immune diseases | Data profiling >140 million promoter-proximal mutations allowed for deep learning to predict variant effect, with effect feeding into the prioritization of SNPs | Dataset size: 390,085 variants. Whole-chromosome holdout of chromosome 8 with 990 genes – using these genes for testing |
| 30859622 | LR with stochastic gradient descent, SVM, RF, K-Nearest Neighbors; Genes – colorectal cancer | Used a network approach – collecting both global and local data to create an epistasis network. Topology of the network was then used as features in machine learning, with different types of feature selection compared, to prioritize genes biologically relevant to colorectal cancer | Dataset size: 185,180 SNPs. Training on 90% of the dataset with 10-fold cross validation |
| doi: | SVM, RF, extra trees, GB, extreme GB, DNN and a stacking classifier with four base classifiers (RF, extra trees, GB and SVM) followed by a DNN in the second layer. Genes – chronic kidney disease, amyotrophic lateral sclerosis, epilepsy | Models applied with positive-unlabeled learning – stochastic semi-supervised learning. Explored combinational impact of all models, and chose best performing model for each disease. There was a dependency on existing patterns – beneficial for finding new causal associated genes which may impact known mechanisms | 10-fold cross validation. Gene samples: 25,000 for chronic kidney disease, 17,000 for epilepsy and 79,500 for amyotrophic lateral sclerosis |
| 21687685 | Bayesian latent variable model; Variants – ovarian GWAS | Used features about a SNP to estimate a latent quality score, with SNPs prioritized based on the posterior probability distribution of the rankings of latent quality scores. Incorporated the uncertainty of the ranking into the prioritization via probability calculation | NA |
| 23369106* | Genetic algorithm; Variants – select OMIM diseases | Algorithm estimates feature weights to characterize SNPs related to an input dataset of genes, biological processes or GWAS results. Users can select features and assign a custom relevance and model relies on data mining of public data | Leave one out cross validation – single disease in the set used to validate (repeats for each disease) |
| 29874547* | Network representation learning (random walk); Genes – Parkinson’s, RA, Crohn’s, Ulcerative Colitis, CAD, T2D | Unsupervised model learns embeddings of genes from multiple gene networks and develops hierarchical statistical model to integrate the learned embeddings of genes with GWAS summary data. Gene-level | NA |
| 21977986* | Multi-task learning ProDiGe; Genes – 265 diseases and 936 associations | Model learns from positive and unlabeled examples. The model shared information across diseases to improve the predictive performance for diseases with minimal positive labeled genes. The information shared is weighed depending on similarity of one disease to another | Training set: at least one known disease gene in training data. Training data per disease >11 genes. Leave one out validation on select diseases |
| 26504140* | Unsupervised model – bayes classifier – GenoWAP; Variants – schizophrenia and Crohn’s disease | Unsupervised learning – integrates GenoCanyon (their previous model) functional prediction and GWAS | NA |
| 27058395* | Unsupervised model – bayes classifier – Genoskyline; Variants – schizophrenia and coronary artery disease | Successor of GenoWAP model, building from it by using annotations integrating tissue-specificity. Customizable with researchers able to input many feature annotations. Whilst tissue-specific it also lacked data from all tissue types | NA |
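
Several of the validation schemes in the table keep related variants or genes together, for example by partitioning into chromosomal bands or by holding out a whole chromosome, so that nothing in the test set shares a location with the training set. A minimal sketch of that idea follows, assuming each sample carries a group label such as a chromosome name. The function `grouped_k_fold` and the example labels are hypothetical and are not taken from any of the cited studies.

```python
from collections import defaultdict


def grouped_k_fold(groups, k):
    """Yield (train_indices, test_indices) pairs such that all samples
    sharing a group label land in the same fold."""
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)
    if k < 2 or k > len(by_group):
        raise ValueError("k must be between 2 and the number of distinct groups")

    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest samples.
    folds = [[] for _ in range(k)]
    for group in sorted(by_group, key=lambda g: (-len(by_group[g]), g)):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_group[group])

    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for j in range(k) if j != f for i in folds[j])
        yield train, test


# Hypothetical example: eight variants labeled by chromosome.
chromosomes = ["chr1", "chr1", "chr2", "chr8", "chr8", "chr8", "chr3", "chr2"]
splits = list(grouped_k_fold(chromosomes, k=3))
```

Each element of `splits` gives the indices to train on and the indices to test on. A model evaluated this way cannot score well merely by recognizing the neighbors of a training example.
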
Comparison of machine learning model performance. The most common models used in post-GWAS prioritization are compared using each model’s highest performance score per study.

| Model | PMID | Highest performance score (metric) – disease or trait | Advantages and disadvantages |
| --- | --- | --- | --- |
| Logistic regression | 25935003 | 0.94 (AUC) – Crohn’s disease | Advantages: Easy to implement; Efficient to train; High interpretability; Can act as a benchmark for exploring more complex algorithms. Disadvantages: Difficulty recognizing complicated data patterns; Difficulty handling large datasets |
| | 28795970 | 0.775 (ROC) – inflammatory bowel diseases | |
| Random forest | 28592878 | 0.635 (AUCROC) – curated Mendelian diseases | Advantages: Can handle large data with higher dimensions; Ensemble method reduces overfitting by several models testing multiple hypotheses. Disadvantages: Many parameters to tune, affecting computational efficiency; Ensemble method lowers interpretability |
| | 31779641 | 0.96 (AUCROC) – cellular sensitivity to clofarabine-induced cytotoxicity | |
| | 21317188 | 0.81 (AUC) – T1D | |
| | 28795970 | 0.80 (ROC) – inflammatory bowel diseases | |
| | doi: | 0.85 (AUC) – average between all diseases | |
| Gradient boosting | 28795970 | 0.783 (ROC) – inflammatory bowel diseases | Advantages: High power performance; Flexible with several parameter tuning options; Ensemble method reduces overfitting by several models testing multiple hypotheses. Disadvantages: Reliance on high quality training data; Many parameters to tune, affecting computational efficiency |
| | doi: | 0.848 (AUC) – average between all diseases | |
| | 25633252 | 0.959 (ROC) – HCM | |
| Support vector machine | 28795970 | 0.786 (ROC) – inflammatory bowel diseases | Advantages: Computationally efficient; Can handle large data and high dimensions. Disadvantages: Does not provide class probabilities; Difficult to interpret |
| | 29407288 | 0.66 (Accuracy) – major depressive disorder and adverse drug response (duloxetine) | |
| | doi: | 0.832 (AUC) – average between all diseases | |
| Deep neural network | 30013180 | 0.815 (AUCROC) – lymphoblastoid expression | Advantages: Recognizes patterns in large complex data; High power performance; Able to handle noisy data. Disadvantages: Difficult to interpret; Computationally expensive, requiring GPUs for high power performance |
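
Most scores in the table are areas under the ROC curve (reported as AUC, ROC, or AUCROC), which is not the same quantity as the accuracy reported for one study. As a rough illustration of what the AUC measures, the sketch below computes it as the fraction of positive and negative pairs in which the positive example receives the higher score, with ties counted as half. The function name and the example labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison: the probability
    that a randomly chosen positive outranks a randomly chosen negative."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical example: 1 = known causal gene, 0 = non-causal gene.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]
auc = roc_auc(labels, scores)  # 8 of the 9 positive-negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking. Unlike accuracy, it does not depend on a single classification threshold, so the two kinds of score in the table are not directly comparable.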