BalLeRMix+: Mixture model approaches for robust joint identification of both positive selection and long-term balancing selection.

Abstract

SUMMARY: The growing availability of genomewide polymorphism data has fueled interest in detecting diverse selective processes affecting population diversity. However, no model-based approaches exist to jointly detect and distinguish the two complementary processes of balancing and positive selection. We extend the BalLeRMix B-statistic framework described in Cheng and DeGiorgio (2020) for detecting balancing selection and present BalLeRMix+, which implements five B statistic extensions based on mixture models to robustly identify both types of selection. BalLeRMix+ is implemented in Python and computes the composite likelihood ratios and associated model parameters for each genomic test position.

AVAILABILITY: BalLeRMix+ is freely available at https://github.com/bioXiaoheng/BallerMixPlus.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2021 PMID: 34664624 PMCID: PMC8756184 DOI: 10.1093/bioinformatics/btab720

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Footprints of natural selection provide valuable insights into the evolutionary history of populations. As a result, they have been key features that evolutionary biologists probe for within sequenced genomes. Positive selection, one of the most examined modes of natural selection, increases the prevalence of beneficial genetic variation and can reduce genetic diversity in regions near the selected loci (Booker et al., 2017, offer a good review). Meanwhile, balancing selection maintains polymorphisms at selected loci and sharply increases genetic diversity in regions adjacent to the selected loci. The deluge of polymorphism data available from contemporary sequencing technologies has fueled interest in both method development (e.g. Bitarello et al.; Cheng and DeGiorgio, 2019, 2020; DeGiorgio et al., 2014; Isildak et al., 2021; Sheehan and Song, 2016; Siewert and Voight, 2017, 2020) and empirical data analysis (e.g. Croze et al., 2017; Leffler et al.; Teixeira et al., 2015) on balancing selection. However, despite these methodological advancements, few model-based methods exist to jointly detect and distinguish positive and balancing selection from genomic data. Most approaches suited to this task, such as the summary statistics Tajima’s D (Tajima, 1989) and the HKA test (Hudson et al., 1987), as well as the anomaly detection approach of Tsel (Hunter-Zinck and Clark, 2015), identify genomic regions displaying patterns of variation unexpected under neutrality. Though Tsel showcases improved power and robustness compared with previous summary statistics, it cannot indicate the nature of selection, and none of these statistics provide direct details about features of the selection footprint at outlier regions, such as the balanced polymorphism frequency, the width of the footprint and the magnitude of distortion of the allele frequency distribution in support of either positive or balancing selection.
Alternatively, contemporary machine learning strategies for distinguishing between balancing selection and positive selection have been developed and proven to be powerful (Isildak et al., 2021; Sheehan and Song, 2016), yet these methods often rely on accurate estimates of key population parameters such as demographic histories, and require extensive training datasets and substantial computational resources to deploy. Hence, it is desirable to have a computationally efficient model-based approach that makes minimal assumptions, has power to discriminate both positive and balancing selection from neutrality, and can classify the mode of selection at genomic regions strongly deviating from neutrality. Initially aiming to accommodate the variability of footprint sizes of long-term balancing selection, Cheng and DeGiorgio (2020) described a flexible mixture model framework, collectively termed B statistics, that we extend here to consider positive selection as well. The B statistics assume that the number of balanced alleles follows a binomial distribution with n trials (sample size) and success rate x (equilibrium frequency). Given n and x, the mean and variance of observed allele counts are fixed at nx and nx(1 − x), respectively. However, many factors not accounted for by the binomial model can inflate the variance, such as accumulation of mutations and uncertainty or oscillation in equilibrium frequency (e.g. Bergland et al.). We extend these B statistics to instead adopt a beta-binomial distribution to approximate the allele count probability distribution. By incorporating the overdispersion parameter a of the beta-binomial into the models, the B statistics not only fit the observed data under long-term balancing selection (where a > 1, suggesting an enrichment of sites with intermediate frequencies; example in Fig. 1A), but can also fit data generated by selective sweeps (where a < 1, suggesting a depletion of sites with intermediate frequencies; example in Fig. 1B).
Therefore, extending the mixture models in this way broadens the applicability of the Cheng and DeGiorgio (2020) B statistics to jointly detect and distinguish (depending on the value of a) balancing and positive selection using a unifying model based on the same set of assumptions.
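
The contrast between the binomial and beta-binomial models can be illustrated with a minimal sketch. The parameterization used here (shape parameters a·x and a·(1 − x), which keeps the mean at nx) is an assumption chosen for illustration and is not necessarily the one adopted in BalLeRMix+, so the value of a at which the shape changes need not match the a = 1 boundary described above. All function names are hypothetical.

```python
from math import comb, exp, lgamma, log


def log_beta(p, q):
    """Natural log of the Beta function B(p, q)."""
    return lgamma(p) + lgamma(q) - lgamma(p + q)


def binom_pmf(k, n, x):
    """Binomial probability of observing k copies among n sampled alleles."""
    return comb(n, k) * x**k * (1 - x) ** (n - k)


def betabinom_pmf(k, n, x, a):
    """Beta-binomial probability with mean n*x.

    Assumed (illustrative) parameterization: shape parameters
    alpha = a*x and beta = a*(1-x), so a controls the dispersion.
    """
    alpha, beta = a * x, a * (1 - x)
    return exp(log(comb(n, k))
               + log_beta(k + alpha, n - k + beta)
               - log_beta(alpha, beta))


def mean_and_variance(pmf):
    """Mean and variance of a distribution given as a list indexed by count."""
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    return mean, var


if __name__ == "__main__":
    n, x = 20, 0.5
    binomial = [binom_pmf(k, n, x) for k in range(n + 1)]
    print("binomial variance:", mean_and_variance(binomial)[1])
    for a in (0.5, 5.0, 500.0):
        bb = [betabinom_pmf(k, n, x, a) for k in range(n + 1)]
        mean, var = mean_and_variance(bb)
        print(f"a={a}: mean={mean:.2f}, variance={var:.2f}, "
              f"P(k=0)={bb[0]:.3f}, P(k=n/2)={bb[n // 2]:.3f}")
```

Under this assumed parameterization the variance equals nx(1 − x)(a + n)/(a + 1), so it always exceeds the binomial variance and approaches it as a grows, while small a pushes probability mass toward the extreme counts.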

Fig. 1.

(A, B) Extended B2 score along the simulated sequences undergoing (A) long-term balancing selection and (B) recent positive selection at the center of the sequence. Line color reflects the sign and magnitude of the estimated dispersion parameter, and the color bar is common to both panels A and B. Positive values of the plotted estimate suggest more support for balancing selection, whereas negative values suggest greater support for positive selection. The line colors plotted in panels A (balancing selection) and B (positive selection) are consistent with expectations based on the sign of this estimate. (C, D) Receiver operating characteristic curves of the original (dashed lines) and extended (solid lines) B statistics for identifying sequences under (C) balancing selection or (D) positive selection.

2 Implementation

BalLeRMix+ is written in Python, uses basic packages together with numpy and scipy.special (Harris et al., 2020), and is currently compatible with Python 3.6 and above. Its primary input is a plain text file listing the physical and genetic positions, allele counts and sample sizes of each informative site along a chromosome. For convenience, we include an auxiliary script in the software repository to help parse commonly used standard file formats, i.e. VCF, AXT and recombination maps, into formats fit for BalLeRMix+. In addition to the input file, users are expected to provide a helper file that is either the site frequency spectrum (B2 and B2,MAF statistics) or the genomewide polymorphism to substitution ratios among all informative sites (B1 statistic). Both helper files can be generated by BalLeRMix+ from the concatenated genomewide allele count input files across chromosomes. Users specify which B statistic (B1, B2 or B2,MAF) to scan with using the --nofreq or --MAF arguments. We do not recommend applying the B0 or B0,MAF statistics to detect positive selection, based on their performance on simulated data (see Supplementary Notes). We ran BalLeRMix+ on a single core of an Intel i5-6300U CPU (2.4 GHz) with 8 GB of RAM to compute the B2 statistic across a simulated 100 kilobase sequence with 757 informative sites, and the software ran in approximately 17 min. During a genomic scan, BalLeRMix+ computes the composite likelihood ratio of selection versus neutrality for each informative site across the parameter space and outputs the maximum composite likelihood ratio, the optimal equilibrium frequency, the optimal dispersion parameter, the optimal linkage parameter (related to the width of the selection signal), as well as the number of informative sites included in the computation. The sign of the reported dispersion estimate can be indicative of the mode of selection, as exemplified in Figure 1A and B.
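
The per-position search described above can be sketched as a grid search over equilibrium frequency, dispersion and linkage parameters. This is an illustration under stated assumptions rather than the BalLeRMix+ implementation: the beta-binomial shapes (a·x and a·(1 − x)), the exponentially decaying mixture weight exp(−A·d) on the selection component, the use of a single sample size, and every function and argument name are hypothetical choices made for the example.

```python
from itertools import product
from math import comb, exp, lgamma, log


def betabinom_pmf(k, n, x, a):
    """Beta-binomial pmf with assumed shapes alpha = a*x, beta = a*(1-x)."""
    alpha, beta = a * x, a * (1 - x)

    def log_b(p, q):
        return lgamma(p) + lgamma(q) - lgamma(p + q)

    return exp(log(comb(n, k)) + log_b(k + alpha, n - k + beta) - log_b(alpha, beta))


def site_log_ratio(k, n, dist, x, a, A, neutral):
    """Log likelihood ratio contributed by one informative site.

    dist is the distance from the test position. The weight exp(-A*dist)
    placed on the selection component is an illustrative assumption.
    neutral maps an allele count k to its genomewide proportion.
    """
    w = exp(-A * dist)
    p_neutral = neutral[k]
    p_mix = w * betabinom_pmf(k, n, x, a) + (1 - w) * p_neutral
    return log(p_mix) - log(p_neutral)


def scan_position(sites, neutral, grid_x, grid_a, grid_A):
    """Grid search for the parameters maximizing the composite log ratio.

    sites: list of (allele_count, sample_size, distance) tuples.
    Returns (best_log_ratio, x, a, A, number_of_sites).
    """
    best = None
    for x, a, A in product(grid_x, grid_a, grid_A):
        total = sum(site_log_ratio(k, n, d, x, a, A, neutral) for k, n, d in sites)
        if best is None or total > best[0]:
            best = (total, x, a, A, len(sites))
    return best


if __name__ == "__main__":
    n = 20
    neutral = {k: 1 / n for k in range(1, n + 1)}  # toy flat spectrum
    sites = [(10, n, 0.0), (9, n, 0.5), (11, n, 1.0), (10, n, 2.0)]
    print(scan_position(sites, neutral,
                        grid_x=(0.25, 0.5, 0.75),
                        grid_a=(0.1, 1.0, 10.0, 100.0),
                        grid_A=(0.1, 1.0, 10.0)))
```

The returned tuple mirrors the outputs listed above (maximum ratio, equilibrium frequency, dispersion, linkage parameter and number of sites). A finer grid or a numerical optimizer could replace the exhaustive loop when runtime matters.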

3 Performance evaluation

To compare the performance of BalLeRMix+ and BalLeRMix in detecting balancing selection, we simulated sequences under both neutrality and long-term balancing selection using SLiM 3.3 (Haller and Messer, 2019) following the protocol of Cheng and DeGiorgio (2020). Both the original and extended B statistics show comparable power (Fig. 1C), confirming that BalLeRMix+ can powerfully detect long-term balancing selection. For positive selection, we simulated sequences evolving along the inferred demographic history of Europeans (see Supplementary Note), and introduced a de novo mutation with per-generation selective advantage of s = 0.01 at 104 or 2500 generations before sampling. Unlike the original B statistics, which show little to no power, the extended B1, B2 and B2,MAF statistics of BalLeRMix+ exhibit high power to identify the selective sweep (Fig. 1D). Moreover, Figure 1A and B displays peaks of high-magnitude B2 statistic at the centers of the simulated sequences, showing that the signals of both balancing and positive selection can be localized. We also simulated scenarios of partial selective sweeps, sweeps on standing variation, adaptive introgression and recent balancing selection (see Supplementary Note), and our results confirm that BalLeRMix+ can powerfully and robustly identify and distinguish diverse modes of selection. Given the overall power and robustness of BalLeRMix+, we believe that it represents a comprehensive suite of statistics and will be a welcome addition to the evolutionary biologists’ toolbox.
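
Power comparisons of this kind are typically summarized from the scores of neutral and selected replicates. The following sketch shows one generic way to build a receiver operating characteristic curve and read off power at a fixed false positive rate. It is not the evaluation code used for Figure 1, and the function names and toy scores are hypothetical.

```python
def roc_points(neutral_scores, selected_scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    thresholds = sorted(set(neutral_scores) | set(selected_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        fpr = sum(s >= t for s in neutral_scores) / len(neutral_scores)
        tpr = sum(s >= t for s in selected_scores) / len(selected_scores)
        points.append((fpr, tpr))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (fpr, tpr) pairs."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def power_at_fpr(neutral_scores, selected_scores, fpr):
    """Fraction of selected replicates scoring above the neutral cutoff."""
    ranked = sorted(neutral_scores, reverse=True)
    allowed = int(fpr * len(ranked))  # neutral replicates allowed above the cutoff
    if allowed >= len(ranked):
        return 1.0
    cutoff = ranked[allowed]
    return sum(s > cutoff for s in selected_scores) / len(selected_scores)


if __name__ == "__main__":
    # Toy scores, e.g. the maximum statistic from each simulated replicate.
    neutral = [0.1, 0.4, 0.35, 0.8, 0.2]
    selected = [0.9, 0.7, 0.5, 0.3, 0.85]
    print(roc_points(neutral, selected))
    print(area_under_curve(roc_points(neutral, selected)))
    print(power_at_fpr(neutral, selected, fpr=0.2))
```

Taking the cutoff from the neutral replicates alone keeps the false positive rate controlled regardless of how the selected replicates are distributed.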

References

1. A genome-wide scan for genes under balancing selection in Drosophila melanogaster.

Authors: Myriam Croze; Andreas Wollstein; Vedran Božičević; Daniel Živković; Wolfgang Stephan; Stephan Hutter
Journal: BMC Evol Biol Date: 2017-01-13 Impact factor: 3.260

2. A test of neutral molecular evolution based on nucleotide data.

Authors: R R Hudson; M Kreitman; M Aguadé
Journal: Genetics Date: 1987-05 Impact factor: 4.562

3. Distinguishing between recent balancing selection and incomplete sweep using deep neural networks.

Authors: Ulas Isildak; Alessandro Stella; Matteo Fumagalli
Journal: Mol Ecol Resour Date: 2021-03-22 Impact factor: 7.090

4. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism.

Authors: F Tajima
Journal: Genetics Date: 1989-11 Impact factor: 4.562

5. Long-Term Balancing Selection in LAD1 Maintains a Missense Trans-Species Polymorphism in Humans, Chimpanzees, and Bonobos.

Authors: João C Teixeira; Cesare de Filippo; Antje Weihmann; Juan R Meneu; Fernando Racimo; Michael Dannemann; Birgit Nickel; Anne Fischer; Michel Halbwax; Claudine Andre; Rebeca Atencia; Matthias Meyer; Genís Parra; Svante Pääbo; Aida M Andrés
Journal: Mol Biol Evol Date: 2015-01-19 Impact factor: 16.240

6. A model-based approach for identifying signatures of ancient balancing selection in genetic data.

Authors: Michael DeGiorgio; Kirk E Lohmueller; Rasmus Nielsen
Journal: PLoS Genet Date: 2014-08-21 Impact factor: 5.917

7. Deep Learning for Population Genetic Inference.

Authors: Sara Sheehan; Yun S Song
Journal: PLoS Comput Biol Date: 2016-03-28 Impact factor: 4.475

8. Detecting positive selection in the genome.

Authors: Tom R Booker; Benjamin C Jackson; Peter D Keightley
Journal: BMC Biol Date: 2017-10-30 Impact factor: 7.431

9. Array programming with NumPy.

Authors: Charles R Harris; K Jarrod Millman; Stéfan J van der Walt; Ralf Gommers; Pauli Virtanen; David Cournapeau; Eric Wieser; Julian Taylor; Sebastian Berg; Nathaniel J Smith; Robert Kern; Matti Picus; Stephan Hoyer; Marten H van Kerkwijk; Matthew Brett; Allan Haldane; Jaime Fernández Del Río; Mark Wiebe; Pearu Peterson; Pierre Gérard-Marchant; Kevin Sheppard; Tyler Reddy; Warren Weckesser; Hameer Abbasi; Christoph Gohlke; Travis E Oliphant
Journal: Nature Date: 2020-09-16 Impact factor: 49.962

10. Detecting Long-Term Balancing Selection Using Allele Frequency Correlation.

Authors: Katherine M Siewert; Benjamin F Voight
Journal: Mol Biol Evol Date: 2017-11-01 Impact factor: 16.240