Daniel R Schrider, Andrew D Kern.
Abstract
As population genomic datasets grow in size, researchers are faced with the daunting task of making sense of a flood of information. To keep pace with this explosion of data, computational methodologies for population genetic inference are rapidly being developed to best utilize genomic sequence data. In this review, we discuss a new paradigm that has emerged in computational population genomics: that of supervised machine learning (ML). We review the fundamentals of ML, discuss recent applications of supervised ML to population genetics that outperform competing methods, and describe promising future directions in this area. Ultimately, we argue that supervised ML is an important and underutilized tool that has considerable potential for the world of evolutionary genomics.
Year: 2018 PMID: 29331490 PMCID: PMC5905713 DOI: 10.1016/j.tig.2017.12.005
Source DB: PubMed Journal: Trends Genet ISSN: 0168-9525 Impact factor: 11.639
Figure I. An Imaginary Training Set of Two Types of Fruit, Oranges (Orange Filled Points) and Apples (Green Filled Points), Where Two Measurements Were Made for Each Fruit
With a training set in hand, we can use supervised ML to learn a function that differentiates between classes (broken line), so that the unknown class of new datapoints (unlabeled points above) can be predicted.
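The idea in this caption can be sketched in a few lines of Python. The example below is a minimal illustration, not a method from the review: it uses a nearest-centroid classifier, which for two classes yields a straight-line decision boundary similar to the broken line in the figure. The measurements, the `TRAINING_SET` name, and the helper functions `fit_centroids` and `predict` are all hypothetical.

```python
from statistics import mean

# Hypothetical training set: two measurements per fruit plus a known label.
# All values are invented for illustration.
TRAINING_SET = [
    ((7.0, 2.0), "orange"),
    ((8.0, 3.0), "orange"),
    ((7.5, 2.5), "orange"),
    ((3.0, 7.0), "apple"),
    ((2.0, 8.0), "apple"),
    ((2.5, 7.5), "apple"),
]


def fit_centroids(examples):
    """Learn one centroid (mean feature vector) per class from labeled data."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(axis) for axis in zip(*points))
        for label, points in by_label.items()
    }


def predict(centroids, features):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


centroids = fit_centroids(TRAINING_SET)
print(predict(centroids, (7.2, 3.1)))  # prints "orange"
print(predict(centroids, (3.3, 6.4)))  # prints "apple"
```

The training step (`fit_centroids`) sees only labeled examples, and the prediction step is then applied to points whose class is unknown. More flexible learners replace the centroid rule with a richer function, but the train-then-predict workflow is the same.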
Figure III. A Visualization of S/HIC Feature Vector and Classes
The S/HIC feature vector consists of π [77], θ_W [74], θ_H [34], the number (#) of distinct haplotypes, average haplotype homozygosity, H12 and H2/H1 [78,79], Z_nS [37], and the maximum value of ω [48]. The expected values of these statistics are shown for genomic regions containing hard and soft sweeps (as estimated from simulated data). Fay and Wu’s H [34] and Tajima’s D [39] are also shown, though these may be omitted from the vector because they are redundant with π, θ_W, and θ_H. To classify a given region, the spatial patterns of these statistics are examined across a genomic window to infer whether the center of the window contains a hard selective sweep (blue shaded area on the left, using statistics calculated within the larger blue window), is linked to a hard sweep (purple shaded area and larger window, left), contains a soft sweep (red, on the right), is linked to a soft sweep (orange, right), or is evolving neutrally (not shown).
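The redundancy mentioned in this caption can be made concrete with a short sketch. Assuming the standard definitions, Fay and Wu's H is the difference between π and Fay and Wu's θ_H, and Tajima's D is the difference between π and Watterson's θ_W scaled by a variance term that depends only on the sample size and the number of segregating sites. The function below computes all of these from an unfolded site frequency spectrum; the function name `sfs_summaries` and the dictionary input format are hypothetical choices for illustration, and this is not the S/HIC implementation.

```python
from math import sqrt


def sfs_summaries(sfs, n):
    """Compute diversity estimators from an unfolded site frequency spectrum.

    sfs maps a derived-allele count i (1 <= i <= n - 1) to the number of
    segregating sites at which the derived allele appears in i of n samples.
    """
    pairs = n * (n - 1)
    seg_sites = sum(sfs.values())
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))

    # The three estimators of theta used in the feature vector.
    pi = sum(2 * count * i * (n - i) for i, count in sfs.items()) / pairs
    theta_w = seg_sites / a1
    theta_h = sum(2 * count * i**2 for i, count in sfs.items()) / pairs

    # Constants for the variance term in Tajima's D.
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)

    return {
        "pi": pi,
        "theta_w": theta_w,
        "theta_h": theta_h,
        # Both statistics below are built only from the estimators above.
        "tajimas_d": (pi - theta_w) / sqrt(variance),
        "fay_wu_h": pi - theta_h,
    }


# Invented spectra for n = 5 samples: one proportional to 1/i (the neutral
# expectation) and one with an excess of high-frequency derived alleles.
neutral_like = sfs_summaries({1: 12, 2: 6, 3: 4, 4: 3}, n=5)
high_freq_excess = sfs_summaries({1: 12, 2: 6, 3: 4, 4: 10}, n=5)
```

For the first spectrum the three θ estimators agree, so both D and H are approximately zero; for the second, θ_H exceeds π and H becomes negative. Because D and H carry no information beyond π, θ_W, and θ_H (given the sample size), a classifier that already receives those three values gains nothing from also receiving D and H.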