Literature DB >> 29331490

Supervised Machine Learning for Population Genetics: A New Paradigm.

Abstract

As population genomic datasets grow in size, researchers are faced with the daunting task of making sense of a flood of information. To keep pace with this explosion of data, computational methodologies for population genetic inference are rapidly being developed to best utilize genomic sequence data. In this review we discuss a new paradigm that has emerged in computational population genomics: that of supervised machine learning (ML). We review the fundamentals of ML, discuss recent applications of supervised ML to population genetics that outperform competing methods, and describe promising future directions in this area. Ultimately, we argue that supervised ML is an important and underutilized tool that has considerable potential for the world of evolutionary genomics.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2018 PMID： 29331490 PMCID： PMC5905713 DOI： 10.1016/j.tig.2017.12.005

Source DB: PubMed Journal: Trends Genet ISSN： 0168-9525 Impact factor: 11.639

Machine Learning for Population Genetics

Population genetics over the past 50 years has been squarely focused on reconciling molecular genetic data with theoretical models that describe patterns of variation produced by a combination of evolutionary forces. This interplay between empiricism and theory means that many advances in the field have come from the introduction of new stochastic population genetic models, often of increasing complexity, that describe how population parameters (e.g., recombination or mutation rates) might generate specific features of genetic polymorphism (e.g., the site frequency spectrum, SFS; see Glossary). The goal, broadly stated, is to formulate a model that describes how nature will produce patterns of variation that we observe. With such a model in hand, all one would need to do would be to estimate its parameters, and in so doing learn everything about the evolution of a given population. Thus an overwhelming majority of population genetics research has focused on classical statistical estimation from a convenient probabilistic model (i.e., the Wright–Fisher model), or through an approximation to that model (i.e., the coalescent). The central assertion here is that the model sufficiently describes the data such that insights into nature can be made through parameter estimation. This mode of analysis that pervades population genetics is what Leo Breiman [1] famously referred to as the ‘data modeling culture’, wherein independent variables (i.e., the evolutionary and genomic parameters) are fed into a model and the response variables (some aspect of genetic variation) come out the other side. Models are validated in this worldview through the use of goodness-of-fit tests or examination of residuals (a recent modern example can be found in [2]). In this review we argue that researchers should consider utilizing a powerful mode of analysis that has recently emerged within population genetics – the ‘algorithmic modeling culture’, or what is now commonly called machine learning (ML). Over the past decade ML methods have revolutionized entire fields, including speech recognition [3], natural language processing [4], image classification [5], and bioinformatics [6-8]. However, the application of ML to problems in population and evolutionary genetics is still in its infancy, except for a few examples [9-18]. ML approaches have several desirable features, and perhaps foremost among them is their potential to be agnostic about the process that creates a given dataset. ML, as a field, aims to optimize the predictive accuracy of an algorithm rather than perform parameter estimation of a probabilistic model. What this means in practice is that ML methods can teach us something about nature, even if our models used to describe nature are imprecise. An equally important advantage of the ML paradigm is that it enables the efficient use of high-dimensional inputs which act as dependent variables, without specific knowledge of the joint probability distribution of these variables. Inputs that consist of thousands of variables (also known as ‘features’ in the ML world) have been used with great success (e.g., [19,20]), and increases in the number of features can often yield greater predictive power [1]. Given the ever-increasing dimensionality of modern genomic data, this is a particularly desirable property of ML. In this paper we describe several examples where, through a hybrid of the ‘data modeling’ and ‘algorithmic modeling’ paradigms, ML methods can leverage high-dimensional data to attain far greater predictive power than competing methods. These early successes demonstrate that ML approaches could have the potential to revolutionize the practice of population genetic data analysis.

An Introduction to Machine Learning

ML is generally divided into two major categories (although hybrid strategies exist): supervised learning [21] and unsupervised learning [22]. Unsupervised learning is concerned with uncovering structure within a dataset without prior knowledge of how the data are organized (e.g., identifying clusters). A familiar example of unsupervised learning is principal component analysis (PCA), which in the context of population genetics is used for discovering unknown relatedness relationships among individuals. PCA takes as input a matrix of genotypes (often of very high dimensionality) and then produces a lower-dimensional summary that can reveal how genotypes cluster. An excellent example of the application of PCA to population genetics can be found in Novembre et al. [23] where PCA was used to show how relationships among individuals sampled from Europe largely mirrored geography. Supervised learning, by contrast, relies on prior knowledge about an example dataset to make predictions about new datapoints. Generally, supervised ML is concerned with predicting the value of a response variable, or label (either a categorical or continuous value), on the basis of the input variables/features. Supervised learning accomplishes this feat through the use of a training set of labeled data examples, whose true response values are known, to train the predictor (Boxes 1 and 2). Perhaps the simplest way to understand supervised ML is graphically. Imagine a scenario in which we wish to train a computer to differentiate between two kinds of fruit, for examples apples and oranges, on the basis of two measurements (x1 and x2) taken from each example (Figure I). In supervised ML we will use known, labeled examples, in other words a ‘training set’ (the filled-in datapoints in Figure I) to learn a function that can discriminate between our data classes. Once we have ‘learned’ this function we can then use our trained oracle to predict class membership of new, unlabeled examples (the unfilled datapoints in Figure I).

Figure I

An Imaginary Training Set of Two Types of Fruit, Oranges (Orange Filled Points) and Apples (Green Filled Points), Where Two Measurements Were Made for Each Fruit

With a training set in hand we can use supervised ML to learn a function that can differentiate between classes (broken line) such that the unknown class of new datapoints (unlabeled points above) can be predicted.

Supervised ML approaches algorithmically create from a given dataset a function that takes as input a vector and then emits a predicted value for each datapoint. More formally, these methods learn a function,f, that predicts a response variable, y, from a feature vector, x, containing M input variables, such that f(x) = y. If y is a categorical variable, we refer to the task as a classification problem, whereas if y is a continuous variable we refer to it as regression. In supervised learning, the objective is to optimize f:x → y using a ‘training set’ of labeled data (i.e., whose response values are known). That is, we assume we have a set of training data of length n of the form {(x1, y1), ..., (x, y)}, where x ∈ R. A variety of learning algorithms exist which can create functions that can perform either classification or regression, including support vector machines (SVMs [80]), decision trees [81], random forests [50], boosting [82], and artificial neural networks (ANNs [83]) which in modern form are subsumed under the umbrella of deep learning [84]. These algorithms differ in how they structure and train f (see brief descriptions in the Glossary). To proceed with building f we must define a loss function, L, that indicates how good or bad a given prediction is. A simple choice for a loss function in the context of classification would be the indicator function such that L(f(x), y) = 1(f (x) ≠ y). For regression one might consider the squared deviation L(f(x), y) = (f(x) – y)2. Finally, we define the risk function which is typically the average value of L across the training set. Training is the process of minimizing this risk function. Once training is complete, we must evaluate our performance on an independent test set. This step allows one to assess whether f has become sensitive to the general characteristics of the problem at hand, rather than to characteristics particular to data examples in the training set (what is known as overfitting). For binary classification we might characterize the false positive and false negative rates or related measures such as precision and recall. A particularly helpful construct in the case of multiclass classification is the confusion matrix, which is simply the contingency table of true versus predicted class labels for each class. For regression, one could use any tool for evaluating model fit (e.g., R) or examine the distribution of values of one or more loss functions. Residuals can also be checked for evidence of bias to anticipate which types of data are likely to produce erroneous predictions. There have been a multitude of important applications of unsupervised ML in evolutionary genomics beyond PCA. One popular methodology that has been wildly successful in population and evolutionary genetics is hidden Markov models (HMMs [24]). HMMs are a class of probabilistic graphical model that are well suited to segmenting data that appears as a linear sequence, such a chromosomes. For instance, with phylogenetic data HMMs have been used to uncover differences in evolutionary rates along a chromosome [25,26]. Furthermore, HMMs have been used to infer how the phylogeny itself changes across chromosomes as a result of recombination [10,27,28]. In the context of population genetic data HMMs have been leveraged to detect regions of the genome under positive or negative selection [11], as well as to localize selective sweeps [12,29]. Although unsupervised ML has been deployed widely and effectively throughout the field, to date less attention has been paid to supervised learning. We give here a brief overview of the paradigm of supervised ML and highlight recent population genetic studies leveraging these approaches.

Why use Machine Learning?

Our basic description of supervised ML approaches in Box 2 demonstrates their central rationale: ML focuses on algorithmically constructed models with optimal prediction as their goal rather than parametric data modeling. Furthermore, ML offers several advantages in addition to accurate prediction. Perhaps most important among them is the ability to circumvent using idealized, parametric models of the data when labeled training data can be obtained from empirical observation (an example of this scenario is given in the following section). Indeed, in such cases we can use ML to train algorithms to recognize phenomena as they are in nature, rather than how we choose to represent them in a model. Further, in cases where empirically derived training sets are not available, simulation can be used to generate training sets. This ability to use simulation as a stand-in for observed data is key for population genetics applications, where adequately sized datasets with high-confidence labels are currently hard to obtain. Of course, using simulation for training obviates the model agnosticism that is attractive about ML in the first place, and thus in using simulation to generate training sets one must be concerned with issues of model mis-specification exactly as when working with traditional, generative models. While that is so, discriminative ML models have been shown to be more robust to model mis-specification than traditional data models [30]. Even when empirical training data cannot feasibly be obtained, there are notable advantages of supervised ML methods. Most importantly, these methods are specifically geared toward using high-dimensional data as the input. Typically, classical statistical methods suffer from what has been called the ‘curse of dimensionality’ whereby high-dimensional data become sparse and thus very difficult to fit models to. By contrast, most supervised ML methods perform better when the input data have a large number of features, in what is commonly called the ‘blessing of dimensionality’ (e.g., [1,31]). A good example of this comes from the highly cited work of Amit and Geman [19] on using a random forest-like procedure for handwriting recognition: it took as input a feature vector containing thousands of variables, and proved to be highly accurate. In a more modern setting, deep learning methods have been shown both theoretically and in practice to be able to circumvent the curse of dimensionality in many settings [32,33]. This attribute lends significant strength to population genetics analysis: while inferences are traditionally based on a single summary statistic devised for the given task (e.g., [34-40]), below we describe several recent studies which demonstrate that far greater statistical power can be achieved by simultaneously examining multiple aspects of genetic variation across the genome. Importantly, many ML methods offer direct ways to assess which features of the input are driving inferences, information which can yield insights about the underlying processes [1]. The last benefit we wish to touch upon is computational efficiency. While training of supervised ML algorithms is computationally costly – especially if simulation is used for the training set – once an algorithm is trained, prediction from it is exceedingly fast even in situations where a large number of predictions is required (e.g., genome-wide scans). This means that there will be an upfront cost to training (typically hours or days), but genome-wide inference proceeds rapidly thereafter. Moreover, because many ML approaches (e.g., deep learning) have the ability to generalize beyond their input parameters (e.g., [41]), training sets can be considerably smaller than those used by approaches such as approximate Bayesian computation (ABC [42]).

Supervised ML in Population Genetics by Training on Real Data: Finding Purifying Selection

When empirically derived training data are available, supervised ML can be used to make accurate predictions in datasets that cannot be adequately modeled with a reasonable number of parameters. For instance, a current goal in modern genomics is to be able to predict functional regions of the genome using bioinformatics techniques. While there are numerous sources of information to leverage for this problem, including comparative [26] and functional genomics [43], the best manner in which to incorporate population genomic variation to aid in these predictions is a matter of active research. Toward this end a supervised ML approach was recently used to discriminate between genomic regions experiencing purifying selection and those free from selective constraint on the basis of population genomic data alone [16]. In this study a support vector machine (SVM) was used that employed as its input the SFS from all 1092 individuals from the Phase I release of 1000 Genomes Dataset which consisted of 14 population samples from diverse global locations [44]. Had this been done using all these data simultaneously in a ‘classical’ population genetics setting the researchers would have been forced to fit a demographic model that described the joint divergence and population size changes of all 14 population samples, a daunting task indeed. While the SFS is well known to be affected by demography as well as by selection [45], by constructing a training set of regions experiencing purifying selection (inferred from a phylogenetic comparison of non-human mammals) the intractable problem of modeling the joint demographic history of the dataset was able to effectively be sidestepped. An SVM was thus able to be trained and tested using empirical data, achieving ~88% accuracy [16]. By comparing the predictions from this classifier, which reveal purifying selection occurring in recent evolutionary history, with phylogenetic signatures of more ancient selection, regions showing evidence of functional turnover in the human genome were able to be identified. These candidate regions were found to be highly enriched in the regulatory domains of genes important for proper central nervous system development. Moreover, another study [46] recently found that the presence of these candidate regions near a gene was more predictive of human-specific changes of expression in the brain than was the presence of well-known human-accelerated regions identified from interspecific comparisons [47]. This result lends credence both to our own predictions and more generally to the utility of supervised ML approaches in evolutionary genetics.

Finding Selective Sweeps in the Genome

One population genetic question that has received recent attention using ML approaches is that of detecting selective sweeps: the signature left by an adaptive mutation that rapidly increases in allele frequency until reaching fixation [48]. While the classical population genetic strategy for finding sweeps has been to carefully devise test statistics sensitive to selective perturbations [34-40], in recent years several groups have begun leveraging combinations of statistics through supervised ML to improve inferential power. While each of these methods differ in the exact combination of summary statistics used, their unifying feature is that training sets are generated using coalescent simulations with and without selective sweeps. The first of these studies [13] used a SVM to combine the ω statistic of Kim and Nielsen (which measures the spatial pattern of LD expected around a sweep [38]) with composite-likelihood ratio of Nielsen et al. (also known as CLR, which highlights the spatial skew in the SFS expected around a sweep [49]). They found that these two statistics in concert had greater power to detect sweeps. Another study [15] took the approach of encoding the SFS as the feature vector (i.e., each bin in the SFS is one feature), and then used an SVM to discriminate between selective sweeps and neutrality. Others [9] have used boosting to identify sweeps on the basis of a feature vector containing six different summary statistics each measured across several genomic subwindows surrounding the focal window. In a related effort, a series of boosting classifiers were recently used to detect selective sweeps and classify them according to whether they have reached fixation (complete vs incomplete) as well as by their timing (recent vs ancient) [14]. Finally, S/HIC (soft/hard inference through classification), which uses a variant of a random forest [50] called an extra-trees classifier [51] to detect both classic hard sweeps from de novo mutations and soft sweeps resulting from selection on previously segregating variants [52,53], was recently reported [17]. As described in Box 3, S/HIC is able to detect sweeps with high sensitivity and specificity even in the face of non-equilibrium demography which confounds many other methods. The success of S/HIC and the other efforts listed above demonstrates that an appropriately designed ML approach can make rapid advances in performance on difficult problems that have received attention for decades. S/HIC [17] uses a feature vector designed to be not only sensitive to hard and soft sweeps but also robust to the confounding effects of both linked positive selection (i.e., ‘soft shoulders’ [85]) and non-equilibrium demography [45,86]. This feature vector included values of nine different statistics that were each measured in several adjacent subwindows (Figure III) in a similar vein to the evolBoosting of Lin et al. [9]. What set this feature vector apart is that, for each statistic, the value in each subwindow was normalized by dividing by the sum across all subwindows. Thus, the true value of a given statistic in a given subwindow is ignored, while the relative values across the larger window are examined. The reasoning behind this choice is that, although demographic events may affect values of population genetic summaries genome-wide (which S/HIC ignores), selective sweeps may result in more dramatic localized skews in these statistics (which S/HIC captures). The results of this design are impressive: S/HIC is able to detect sweeps under challenging demographic scenarios, often with no loss in power even when the demographic history is grossly mis-specified during training (e.g., if there is an unknown population bottleneck), a scenario which catastrophically compromises many other methods [17,87]. Thus, ML methods – especially those with appropriately designed feature vectors – can be robust to modeling choices even when training data are simulated.

Figure III

A Visualization of S/HIC Feature Vector and Classes

The S/HIC feature vector consists of π [77], [74], [34], the number (#) of distinct haplotypes, average haplotype homozygosity, H12 and H2/H1 [78,79], Z [37], and the maximum value of ω [48]. The expected values of these statistics are shown for genomic regions containing hard and soft sweeps (as estimated from simulated data). Fay and Wu’s H [34] and Tajima’s D [39] are also shown, though these may be omitted from the vector because they are redundant with π, , and . To classify a given region the spatial patterns of these statistics are examined across a genomic window to infer whether the center of the window contains a hard selective sweep (blue shaded area on the left, using statistics calculated within the larger blue window), is linked to a hard sweep (purple shaded area and larger window, left), contains a soft sweep (red, on the right), is linked to soft sweep (orange, right), or is evolving neutrally (not shown).

In Figure III we illustrate the S/HIC classification strategy and the values included in its feature vector. This figure demonstrates how much additional information S/HIC utilizes in making its predictions in comparison to more traditional population genetic tests, especially those relying on a single statistic. In particular, the S/HIC feature vector not only includes multiple statistics, each of which is designed to capture different aspects of genealogies, but also how these statistics vary along the chromosome. In addition to greater robustness to demography as discussed above, incorporating all of this information yields greater discriminatory power, and for this reason such multidimensional methods will be preferable to univariate approaches. We recently applied S/HIC to six human populations with complex demographic histories, where it revealed that soft sweeps appear to account for the majority of recent adaptive events in humans [88]; the success of this analysis demonstrates the practicality of applying such ML strategies to real data. The methods listed above have two commonalities: they use ML to perform classification on multidimensional input, and they handily outperform more traditional univariate methods. However, these methods also differ from one another substantially in several facets: the particular ML framework used, the makeup of the feature vector, and the types of sweeps they seek to detect. Thus, the success of these methods underscores not only the power but also remarkable flexibility of supervised ML. By working within the supervised ML paradigm one can effectively tailor a predictor to whatever task is at hand simply by altering the construction of the feature vector and training dataset, and in so doing make more detailed predictions than is possible using a single statistic. Unlike the problem of detecting purifying selection, for which a training set may be constructed, we lack an adequate number of selective sweeps whose parameters are known precisely (e.g., the time of the sweep, strength of selection). Thus, the studies described above used simulation to generate training sets. The general idea is to simulate data from one or several population genetic models in which parameters are either specified precisely or defined by prior distributions, use those data to train an ML algorithm, and then perform either classification or regression (i.e., parameter estimation). In this context supervised ML allows for likelihood-free inference of population genetic models similar in spirit to ABC. Although, like ABC, this approach requires modeling assumptions, it nonetheless offers numerous advantages as described in Box 4 where we contrast ABC with supervised ML. Using supervised ML with training data simulated from a specified set of population genetic models is similar in spirit to approximate Bayesian computation (ABC), except for some notable distinctions. ABC begins by simulating a large number of examples whose model parameters are drawn from prior distributions, and then summarizes these simulations with vectors of population genetic summary statistics. Next, in ‘classical’ ABC, only those simulations most similar to the observed dataset are retained – a process known as rejection sampling – to approximate the probability distribution for each parameter value given the observed data. ABC is easy to implement, flexible, and has been proven effective in several scenarios. However, ABC has some important drawbacks that ML overcomes. Most importantly, when using large feature vectors, ABC is susceptible to the curse of dimensionality [59] – much effort has therefore gone into dimensionality reduction and feature selection for ABC (reviewed in [89]). While this is so, reducing dimensionality might lead to loss of information if the remaining summaries are not sufficient statistics of the data. This contrasts with modern ML algorithms which can benefit from high-dimensional data rather that suffer from them. A second drawback of classical ABC is its computational burden. Although both ML and ABC require a large number of simulations, ABC does not make efficient use of all of this computation because it typically depends on rejection sampling. Work has been done to retain more of the simulations in ABC, for instance by weighing their influence on parameter estimation according to their similarity to the observed data [62]. However, ML methods naturally use all of the simulations to learn the mapping of data to parameters. Further, deep learning methods have the potential to generalize non-locally [32], allowing them to make accurate predictions for data very different from those in the training set. For these reasons, ML may require considerably fewer simulations than ABC. Furthermore, ML methods need not re-examine these simulations to perform downstream prediction, unlike ABC, and thus further inference is very fast. A third difference between ML and ABC is that of interpretability. In the realm of ABC it is not clear which summaries are responsible for a signal. By contrast, many ML methods allow direct measurement of the contribution of each feature. Thus, despite their use of algorithmically generated models, ML algorithms are far from black boxes. Finally, it is important to note that newer versions of ABC, that do not depend on rejection sampling, are often simply examples supervised ML approaches at their core [62,63,90], thus to some large degree the dichotomy pointed to above is destined to become moot.

Inferring Demography and Recombination

Another emerging use of supervised ML in population genetics has been for inference of demographic history and recombination rates. Indeed, much attention in the field has been placed on developing methods for the inference of population size histories and patterns of population splitting and migration [54-58]. ABC methods are among the most popular for inferring demographic histories [59]. Interestingly, several groups have experimented with augmenting ABC by using ML for selecting the optimal combination of summary statistics [60] or even generating them [61]. While this is a promising direction for feature engineering, others have directly used ML to estimate posterior distributions of demographic parameters. For instance, Blum and François [62] used a feed-forward artificial neural network (ANN) to learn the mapping of summary statistics onto parameters with excellent results, particularly with respect to computational cost savings. In addition to demographic parameter estimation, supervised ML has been used recently for demographic model selection (a possibility pointed to by Blum and François). For instance, it was recently shown [63] that random forests outperform ABC in both accuracy and computational cost when performing demographic model selection, together with greater robustness to the choice of summary statistics included in the input vector. In a recent preprint [64], Extra-Trees classifiers were applied to a problem of locus-specific demographic model selection: that of identifying regions with gene flow between a pair of closely related species with far greater accuracy than previous methods. Thus in general, ML methods show great promise in demographic estimation and model selection, and may soon be the preferred choice over ABC. Supervised ML has also been applied to characterize the rates and patterns of recombination in the genome. This work has again been done with or without simulation of training data. For instance, a random forest classifier was trained to distinguish among recombination rate classes on the basis of sequence motifs to show that such motifs are predictive of recombination rate in Drosophila melanogaster [65]. This work used annotated rates of recombination based on a classical population genetics estimator to define the training set. By contrast, methodology has been developed [66,67] that uses boosting to infer recombination rate maps from large sample sizes on the basis of simulated training data. The latest method, FastEPRR (fast estimation of population recombination rates), has much greater computational efficiency than, and equal accuracy to, the widely used LDhat [68]. Although application of supervised ML methods to this problem has begun only recently, the success of FastEPRR suggests the potential of future gains using these approaches.

Coestimation of Selection and Demography

It is well known that demographic events can mimic the effects of selection [45], and conversely that selection can confound demographic estimation [69,70]. This implies that, although one can attempt to design more robust approaches (e.g., S/HIC, discussed above), the ideal strategy would be to simultaneously make inferences about both of these phenomena. How then can one perform coestimation of parameters related to multiple evolutionary phenomena? A promising approach that utilizes supervised ML, in this case deep learning, was recently introduced by [18]. Here a deep neural network, called evoNet, was developed to simultaneously infer population size changes in a three-epoch model and detect hard and soft selective sweeps as well as regions under balancing selection. What makes this research particularly important is that using this method the researchers were able to perform simultaneous classification of loci into selective classes and demographic parameter estimation (based on averages estimated over loci classified as neutral) through the use of a neural network architecture that outputs both categorical and continuous parameters. This inherent flexibility of ML, and deep learning architectures in particular, opens up a whole slew of opportunities for doing population genomic inference in ways that have never before been possible (discussed below).

Concluding Remarks and Future Directions

The future of population genomic analysis rests in our ability to make sense of large and ever-growing datasets. Toward this end, supervised ML techniques represent a new paradigm for analysis, one uniquely suited for making inferences in the context of high-dimensional data produced by an unknown or imprecisely parameterized model. We have reviewed here a selection of early applications of supervised ML tools to population genomic data. The overwhelming take-home message is that supervised ML provides robust, computationally efficient inference for several problems that are difficult to gain traction on via classical statistical approaches. We believe that population genetics is now poised for an explosion in the use of supervised ML approaches. Deep learning in particular, with its incredibly flexible input and output structure, should be an important area of future research, and its earliest application [18] has yielded the crucial ability to coestimate selection and demography, a central goal of population genetics analysis over the past 15 years. Indeed, deep learning could potentially alter the way that we even think about the nature of our input data itself. For example, one flavor of deep learning, convolutional neural networks (CNNs), have made astounding advances in our ability to learn parameters from image data [71]. Rather than learning on population genetic summary statistics calculated from a multiple sequence alignment (e.g., [9,17]), one could instead treat an image of the alignment itself as the input. While these data would be extremely high-dimensional, the structure of CNNs allows them to implicitly perform dimensionality reduction while capturing salient structures in the input data [72], allowing accurate and efficient classification and regression (additional possible future avenues of ML in population genetics are listed in the Outstanding Questions). While these are exciting prospects, a general challenge lies ahead in making more structured population genetics inferences beyond simple parameter estimation or classification. For instance, it is not clear to what extent the supervised ML techniques discussed above could be used to infer genealogies or other tree-like structures (but see [73]). In general, however, the current explosion in deep learning research promises future improvements in our ability to make evolutionary inferences well beyond current capabilities; the challenge for population geneticists then is to adapt such methods for our own uses. While a few comparisons have shown that ML can outperform ABC, a more thorough assessment of the strengths and limitations of each approach across a variety of problems (e.g., on simulated data) is warranted. In what scenarios would either strategy be preferable? Like more traditional methods, ML applications relying on simulated training data must make modeling assumptions. To what extent can ML methods be made more robust to these assumptions (e.g., by appropriately designing the feature vector, as done by S/HIC, or through simulating a greater breadth of training examples)? ML methods have the ability to infer the values of multiple parameters simultaneously. How feasible will parameter estimation be in more complex evolutionary models using ML tools such as deep neural networks? As described here, supervised ML relies on summaries of population genetic data as feature vectors, but what summaries are best, and can we do better than standard population genetic statistics? The recent rise of convolutional neural networks for image recognition suggests that encoding alignments as images might enable more powerful population genetics inferences – how best can we encode population genetic data? Can we use ML to infer structured output in population genetics such as genealogies or ancestral recombination graphs? A type of ANN called generative adversarial networks has been shown to generate data examples that can mimic true data with increasing accuracy. Can such methods be used as a substitute for population genetic simulation, perhaps to generate very large samples and chromosomes that are computationally costly to simulate? Applications of supervised ML to population genetic data can be relatively involved, necessitating simulating data, encoding both simulated and real data as feature vectors, training the algorithm, and applying it. Can efforts to create self-contained, efficient, and user-friendly software packages capable of performing this entire workflow streamline this approach and make it more accessible to researchers? While point estimation of population genetic model parameters is important, equally important is establishing credible intervals on our parameter estimates. How can we most effectively use ML for estimating intervals associated with parameter estimates?

60 in total

1. On the number of segregating sites in genetical models without recombination.

Authors: G A Watterson
Journal: Theor Popul Biol Date: 1975-04 Impact factor: 1.570

2. Searching for footprints of positive selection in whole-genome SNP data from nonequilibrium populations.

Authors: Pavlos Pavlidis; Jeffrey D Jensen; Wolfgang Stephan
Journal: Genetics Date: 2010-04-20 Impact factor: 4.562

3. Soft shoulders ahead: spurious signatures of soft and partial selective sweeps result from linked hard sweeps.

Authors: Daniel R Schrider; Fábio K Mendes; Matthew W Hahn; Andrew D Kern
Journal: Genetics Date: 2015-02-25 Impact factor: 4.562

4. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism.

Authors: F Tajima
Journal: Genetics Date: 1989-11 Impact factor: 4.562

5. The hitch-hiking effect of a favourable gene.

Authors: J M Smith; J Haigh
Journal: Genet Res Date: 1974-02 Impact factor: 1.588

6. Statistical tests of neutrality of mutations.

Authors: Y X Fu; W H Li
Journal: Genetics Date: 1993-03 Impact factor: 4.562

7. Inference of human population history from individual whole-genome sequences.

Authors: Heng Li; Richard Durbin
Journal: Nature Date: 2011-07-13 Impact factor: 49.962

8. Forces shaping the fastest evolving regions in the human genome.

Authors: Katherine S Pollard; Sofie R Salama; Bryan King; Andrew D Kern; Tim Dreszer; Sol Katzman; Adam Siepel; Jakob S Pedersen; Gill Bejerano; Robert Baertsch; Kate R Rosenbloom; Jim Kent; David Haussler
Journal: PLoS Genet Date: 2006-08-23 Impact factor: 5.917

9. Soft Sweeps Are the Dominant Mode of Adaptation in the Human Genome.

Authors: Daniel R Schrider; Andrew D Kern
Journal: Mol Biol Evol Date: 2017-08-01 Impact factor: 16.240

10. An integrated map of genetic variation from 1,092 human genomes.

Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal: Nature Date: 2012-11-01 Impact factor: 49.962

87 in total

1. Comparative evolutionary genetics of deleterious load in sorghum and maize.

Authors: Roberto Lozano; Elodie Gazave; Jhonathan P R Dos Santos; Markus G Stetter; Ravi Valluru; Nonoy Bandillo; Samuel B Fernandes; Patrick J Brown; Nadia Shakoor; Todd C Mockler; Elizabeth A Cooper; M Taylor Perkins; Edward S Buckler; Jeffrey Ross-Ibarra; Michael A Gore
Journal: Nat Plants Date: 2021-01-15 Impact factor: 15.793

Review 2. Methods for detecting introgressed archaic sequences.

Authors: Sriram Sankararaman
Journal: Curr Opin Genet Dev Date: 2020-07-24 Impact factor: 5.578

3. Population inference based on mitochondrial DNA control region data by the nearest neighbors algorithm.

Authors: Fu-Chi Yang; Bill Tseng; Chun-Yen Lin; Yu-Jen Yu; Adrian Linacre; James Chun-I Lee
Journal: Int J Legal Med Date: 2021-02-14 Impact factor: 2.686

Review 4. The importance of genomic variation for biodiversity, ecosystems and people.

Authors: Madlen Stange; Rowan D H Barrett; Andrew P Hendry
Journal: Nat Rev Genet Date: 2020-10-16 Impact factor: 53.242

5. Genomic islands of differentiation in a rapid avian radiation have been driven by recent selective sweeps.

Authors: Hussein A Hejase; Ayelet Salman-Minkov; Leonardo Campagna; Melissa J Hubisz; Irby J Lovette; Ilan Gronau; Adam Siepel
Journal: Proc Natl Acad Sci U S A Date: 2020-11-16 Impact factor: 11.205

Review 6. Host-parasite co-evolution and its genomic signature.

Authors: Dieter Ebert; Peter D Fields
Journal: Nat Rev Genet Date: 2020-08-28 Impact factor: 53.242

7. Accurate Inference of Tree Topologies from Multiple Sequence Alignments Using Deep Learning.

Authors: Anton Suvorov; Joshua Hochuli; Daniel R Schrider
Journal: Syst Biol Date: 2020-03-01 Impact factor: 15.683

8. A cautionary note on the use of unsupervised machine learning algorithms to characterise malaria parasite population structure from genetic distance matrices.

Authors: James A Watson; Aimee R Taylor; Elizabeth A Ashley; Arjen Dondorp; Caroline O Buckee; Nicholas J White; Chris C Holmes
Journal: PLoS Genet Date: 2020-10-09 Impact factor: 5.917

9. ProkEvo: an automated, reproducible, and scalable framework for high-throughput bacterial population genomics analyses.

Authors: Natasha Pavlovikj; Joao Carlos Gomes-Neto; Jitender S Deogun; Andrew K Benson
Journal: PeerJ Date: 2021-05-21 Impact factor: 2.984

10. A demonstration of unsupervised machine learning in species delimitation.

Authors: Shahan Derkarabetian; Stephanie Castillo; Peter K Koo; Sergey Ovchinnikov; Marshal Hedin
Journal: Mol Phylogenet Evol Date: 2019-07-16 Impact factor: 4.286