Sebastian Okser, Tapio Pahikkala, Tero Aittokallio.
Abstract
A central challenge in systems biology and medical genetics is to understand how interactions among genetic loci contribute to complex phenotypic traits and human diseases. While most studies have so far relied on statistical modeling and association testing procedures, machine learning and predictive modeling approaches are increasingly being applied to mining genotype-phenotype relationships, including associations that do not necessarily meet statistical significance at the level of individual variants, yet still contribute to the combined predictive power at the level of variant panels. Network-based analysis of genetic variants and their interaction partners is another emerging trend by which to explore how sub-network level features contribute to complex disease processes and related phenotypes. In this review, we describe the basic concepts and algorithms behind machine learning-based genetic feature selection approaches, their potential benefits and limitations in genome-wide settings, and how physical or genetic interaction networks could be used as a priori information for providing improved predictive power and mechanistic insights into the disease networks. These developments are geared toward explaining a part of the missing heritability, and when combined with individual genomic profiling, such systems medicine approaches may also provide a principled means for tailoring personalized treatment strategies in the future.
Year: 2013 PMID: 23448398 PMCID: PMC3606427 DOI: 10.1186/1756-0381-6-5
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Figure 1. The figure illustrates how the external and internal cross-validation results behave as functions of the number of selected features. The external cross-validation consists of three training/test splits. The wrapper-based feature selection method, greedy RLS [23], is run separately during each round of the external cross-validation. Greedy RLS, in turn, employs an internal leave-one-out cross-validation on the training set for scoring the feature set candidates. The red curve depicts the mean values over these internal cross-validation errors. As the red curve shows, the internal cross-validation MSE used for model training keeps improving, which is expected, because the internal cross-validation quickly overfits to the training data when it is used as a selection measure. The blue curve depicts the area under the curve (AUC) on the test data held out during the external cross-validation round, that is, data completely unseen during the internal cross-validation and feature selection process. In contrast to the red curve, the blue curve starts to level off soon after the number of selected variants reaches around 10, indicating that adding extra features is no longer beneficial even if the internal scoring function keeps improving. The green curve depicts the AUC of the RLS model trained using features selected by a single-locus p-value based filter method, Fisher’s exact test, which is run with the same external training/test splits as the greedy selection method. Like the blue curve, the green one also stops improving soon after a relatively small set of features has been selected. The data used in the experiments is the Wellcome Trust Case Control Consortium (WTCCC) Hypertension dataset combined with the UK National Blood Services’ controls.
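The nested evaluation scheme described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the greedy RLS algorithm of [23]: it uses a naive forward search that refits a ridge (RLS) model for every candidate feature, and it scores candidates with the standard closed-form leave-one-out residuals of ridge regression. The function names, the regularization value `lam=1.0`, the ±1 label coding, and the fold assignment are all assumptions made for the example.

```python
import numpy as np

def loo_mse_ridge(X, y, lam=1.0):
    """Leave-one-out MSE of ridge regression via the hat-matrix identity."""
    d = X.shape[1]
    # Hat matrix H = X (X^T X + lam I)^{-1} X^T
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
    loo_resid = (y - H @ y) / (1.0 - np.diag(H))
    return float(np.mean(loo_resid ** 2))

def greedy_forward_selection(X, y, k, lam=1.0):
    """Naive wrapper: at each step add the feature with the lowest internal LOO MSE."""
    selected, internal_errors = [], []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        scores = [loo_mse_ridge(X[:, selected + [j]], y, lam) for j in remaining]
        best = remaining[int(np.argmin(scores))]
        selected.append(best)
        remaining.remove(best)
        internal_errors.append(min(scores))
    return selected, internal_errors

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos, neg = scores[labels == 1], scores[labels == -1]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def nested_cv(X, y, k=10, n_folds=3, lam=1.0, seed=0):
    """External CV around the selection; returns mean internal MSE and mean test AUC per size."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    internal = np.zeros((n_folds, k))
    external = np.zeros((n_folds, k))
    for f, test_idx in enumerate(folds):
        train_idx = np.concatenate([folds[g] for g in range(n_folds) if g != f])
        Xtr, ytr = X[train_idx], y[train_idx]
        # Selection sees only the training part of this external split.
        selected, errors = greedy_forward_selection(Xtr, ytr, k, lam)
        internal[f] = errors
        for m in range(1, k + 1):
            cols = selected[:m]
            A = Xtr[:, cols]
            w = np.linalg.solve(A.T @ A + lam * np.eye(m), A.T @ ytr)
            external[f, m - 1] = auc(X[test_idx][:, cols] @ w, y[test_idx])
    return internal.mean(axis=0), external.mean(axis=0)
```

Calling `nested_cv(X, y, k=10)` on a genotype matrix `X` (samples by variants) and a ±1 label vector `y` returns two curves indexed by the number of selected features, corresponding to the red (internal MSE) and blue (held-out AUC) curves. The key design point is that feature selection is repeated inside every external training split, so the held-out data never influences which variants are chosen.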
Figure 2. Sample network visualization constructed for type 1 diabetes. The risk variants were selected using greedy RLS on the WTCCC type 1 diabetes GWAS data and the UK National Blood Services’ controls, extended with genes selected in another work [62]. The biological processes and pathways were then mapped using DAVID [112,113], and the network visualization was done with the Enrichment Map plugin for Cytoscape [114,115]. The nodes represent pathways, and the edges represent the amount of overlap between the members of the pathways. The visualized network represents a selected sub-network of complex interconnections and cross-talk between a number of pathways, including MHC-related processes and other biological pathways associated with diabetes phenotypes. The pathways were initially identified using DAVID, with the criterion that they demonstrate enrichment when compared to the genome-wide background. The retrieved pathways were subsequently filtered in Cytoscape through the Enrichment Map plugin, using the false-discovery rate and overlap coefficient to filter out non-significant pathways.
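To make the edge definition in the Figure 2 caption concrete, the sketch below computes the overlap coefficient, |A ∩ B| / min(|A|, |B|), between pathway gene sets and keeps only pairs above a cutoff. It does not call DAVID or the Enrichment Map plugin; the function names, the placeholder gene identifiers, and the cutoff of 0.5 are assumptions chosen for illustration only.

```python
def overlap_coefficient(a, b):
    """Overlap coefficient of two gene sets: |A & B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def pathway_edges(pathways, min_overlap=0.5):
    """Return (pathway, pathway, weight) edges whose overlap meets the cutoff."""
    names = sorted(pathways)
    edges = []
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            w = overlap_coefficient(pathways[p], pathways[q])
            if w >= min_overlap:
                edges.append((p, q, w))
    return edges

if __name__ == "__main__":
    # Hypothetical pathways with placeholder gene identifiers.
    toy = {
        "pathway_A": {"g1", "g2", "g3", "g4"},
        "pathway_B": {"g3", "g4"},
        "pathway_C": {"g5", "g6"},
    }
    for edge in pathway_edges(toy):
        print(edge)
```

Because the denominator is the size of the smaller set, a small pathway fully contained in a larger one receives the maximum weight of 1, which is why this measure suits networks in which pathways are nested subsets of broader processes.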