
Struct-f4: an Rcpp package for ancestry profile and population structure inference from f4 statistics.

Pablo Librado, Ludovic Orlando.

Abstract

SUMMARY: Visualisation and inference of population structure are increasingly important for fundamental and applied research. Here, we present Struct-f4, providing automated solutions to characterize and summarize the genetic ancestry profile of individuals, assess their genetic affinities, identify admixture sources and quantify admixture levels. AVAILABILITY AND IMPLEMENTATION: Struct-f4 is written in Rcpp and relies on f4-statistics and MCMC optimization. It is freely available under the GNU General Public License on Bitbucket (https://bitbucket.org/plibradosanz/structf4/). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2022. Published by Oxford University Press.

Year:  2022        PMID: 35080599      PMCID: PMC8963280          DOI: 10.1093/bioinformatics/btac046

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Next-generation sequencing has opened the way for the routine characterization of genome variation at the population scale, including in non-model organisms, which provides invaluable insights into evolution (Nielsen et al., 2017). Enhanced characterization of population structure has also found a range of applications in medicine, forensics, conservation biology and more. Many statistical methods are available to visualize population structure (e.g. Principal Component Analysis; Patterson et al., 2006) and to model it, e.g. as combinations of K ancestry clusters (e.g. ADMIXTURE; Alexander et al., 2009). However, these methods can be biased by the amount of genetic drift exclusive to single populations (Lawson et al., 2018). Other methods, based on shared drift, aim to overcome such limitations and have gained popularity over the last decade (Patterson et al., 2012). For example, qpGraph and qpAdm leverage patterns of allele sharing between population quartets (the so-called f4-statistics) to model evolutionary histories, including admixture coefficients. This methodology is, however, highly supervised: it requires the specification of homogeneous groups potentially acting as admixture sources, and alternative models must be assessed individually. This becomes practically challenging as the number of populations and/or admixture events increases, rapidly exceeding the current capacity of automated solutions (Leppälä et al., 2017). To remedy this situation, we developed Struct-f4, a package leveraging the power of f4-statistics and automating statistical inference within an MCMC framework. Struct-f4 first estimates the shared drift across pairs of individuals, allowing visualization of population structure through Multi-Dimensional Scaling (MDS). It also models individual genetic profiles as mixtures from K ancestral populations, not assumed to follow Hardy–Weinberg equilibrium, and accommodates both supervised and unsupervised analyses.
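To make the f4-statistics mentioned above concrete, the following minimal sketch computes one from allele frequencies, using the standard definition (Patterson et al., 2012): f4(A, B; C, D) is the average across sites of (pA − pB)(pC − pD). The function name `f4` and the toy frequencies are illustrative assumptions only; this is not the Calc-f4 or Struct-f4 code, and it ignores missing data and standard-error estimation.

```python
from statistics import mean


def f4(p_a, p_b, p_c, p_d):
    """Average over sites of (pA - pB) * (pC - pD)."""
    if not (len(p_a) == len(p_b) == len(p_c) == len(p_d)):
        raise ValueError("all frequency vectors must cover the same sites")
    return mean((a - b) * (c - d) for a, b, c, d in zip(p_a, p_b, p_c, p_d))


# Toy allele frequencies at four sites (made-up numbers, for illustration only).
# For a single haploid individual, each frequency would simply be 0 or 1.
pop_a = [0.10, 0.50, 0.90, 0.30]
pop_b = [0.20, 0.40, 0.80, 0.30]
pop_c = [0.15, 0.55, 0.95, 0.60]
pop_d = [0.25, 0.45, 0.85, 0.60]

# A positive value means that the A-B and C-D frequency differences covary.
print(f4(pop_a, pop_b, pop_c, pop_d))
```

Swapping the first two arguments flips the sign of the statistic, and f4 is zero whenever the first two (or last two) entries are identical.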

2 Materials and methods

Struct-f4 was originally proposed by Fages et al. (2019) to visualize the genetic structure within ancient and modern horse populations. The methodology involved maximum likelihood optimization to place individuals within the 3D Euclidean space that best fits the observed combination of f4-statistics. Here, we redesigned the underlying statistical model to retrieve direct estimates of the shift in allele frequency that occurred between pairs of individuals, with d12 denoting the difference in allele frequency between individuals H1 and H2. Assuming that f4-statistics follow normal distributions, the likelihood of the d parameters can be calculated, allowing for their optimization within an adaptive Metropolis-Hastings MCMC framework (Supplementary Information). This approach can be generalized to model individual profiles as mixtures of K ancestral populations, where K is user-defined, d now represents the allele frequency shift that occurred between the ancestral components i and j, and Q1 is the proportion of ancestry i inherited by individual H1 (i.e. its mixture coefficient).

Struct-f4 is implemented in Rcpp for computational efficiency and requires a matrix of f4-statistics as input. We also provide the Calc-f4 C program, which was parallelized and optimized for fast computation of f4-statistics. On a single 2700 MHz core, Calc-f4 computes 82 215 f4-statistics in 48 min 49 s, versus 1011 min 11 s for qpDstat (roughly a 20-fold speed-up).

Struct-f4 outputs posterior mean values and credible intervals for each estimated parameter, together with the full MCMC sample and the corresponding probability used to assess convergence. It also provides (i) an MDS plot of genetic affinities between individuals (Supplementary Fig. S1), (ii) an unsupervised clustering based on the allele frequency shifts that occurred across pairs of individuals and/or K ancestral populations (Supplementary Fig. S2) and (iii) a barplot representation of ancestry profiles (Fig. 1).
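The inference scheme described above (a normal likelihood for the observed statistics, explored with an adaptive Metropolis-Hastings sampler and summarized by posterior means and credible intervals) can be sketched generically as follows. Everything specific here is an assumption made for illustration: the `predict` function is a hypothetical linear stand-in rather than the Struct-f4 model equation, the prior is flat, and the step-size rule (rescaling every 50 burn-in iterations towards roughly 30% acceptance) is one simple adaptation choice, not necessarily the one detailed in the Supplementary Information.

```python
import math
import random


def log_likelihood(theta, observed, std_errors, predict):
    """Log-likelihood of theta when each observed statistic is normally
    distributed around its model prediction with a known standard error."""
    total = 0.0
    for i, (obs, se) in enumerate(zip(observed, std_errors)):
        z = (obs - predict(theta, i)) / se
        total += -0.5 * z * z - math.log(se * math.sqrt(2.0 * math.pi))
    return total


def adaptive_metropolis(observed, std_errors, predict, theta0,
                        n_iter=6000, burn_in=2000, step=0.01, seed=1):
    """Random-walk Metropolis-Hastings with a flat prior; the proposal scale
    is tuned during burn-in towards ~30% acceptance, then frozen."""
    rng = random.Random(seed)
    theta = list(theta0)
    current = log_likelihood(theta, observed, std_errors, predict)
    samples, accepted = [], 0
    for it in range(1, n_iter + 1):
        proposal = [t + rng.gauss(0.0, step) for t in theta]
        candidate = log_likelihood(proposal, observed, std_errors, predict)
        # Symmetric proposal: accept with probability min(1, L(new) / L(old)).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
            accepted += 1
        if it <= burn_in:
            if it % 50 == 0:  # adapt the step size every 50 iterations
                step *= 1.2 if accepted / 50 > 0.3 else 0.8
                accepted = 0
        else:
            samples.append(list(theta))
    return samples


def summarize(values, level=0.95):
    """Posterior mean and equal-tailed credible interval of one parameter."""
    ordered = sorted(values)
    lo = ordered[int((1.0 - level) / 2.0 * (len(ordered) - 1))]
    hi = ordered[int((1.0 + level) / 2.0 * (len(ordered) - 1))]
    return sum(ordered) / len(ordered), lo, hi


# Toy data: 40 statistics generated from a HYPOTHETICAL linear model in two
# parameters (a stand-in only; this is not the Struct-f4 equation).
true_theta = [0.02, -0.01]
coeffs = [(1, 0), (0, 1), (1, -1), (1, 1)] * 10
std_errors = [0.002] * len(coeffs)
noise = random.Random(42)


def predict(theta, i):
    return coeffs[i][0] * theta[0] + coeffs[i][1] * theta[1]


observed = [predict(true_theta, i) + noise.gauss(0.0, se)
            for i, se in enumerate(std_errors)]

samples = adaptive_metropolis(observed, std_errors, predict, theta0=[0.0, 0.0])
for k in range(2):
    mean_k, lo_k, hi_k = summarize([s[k] for s in samples])
    print(f"theta[{k}]: mean={mean_k:.4f}, 95% CI=({lo_k:.4f}, {hi_k:.4f})")
```

Freezing the step size after burn-in keeps the retained samples drawn from a fixed proposal, which is a common way to preserve the validity of the chain when the adaptation rule is this simple.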
Fig. 1.

(a) Simulated tree including 18 populations, with three sampled individuals per population. One admixture event (m) into population 14 was simulated, with proportion m in the range [0, 0.25]. (b) Admixture proportions estimated by Struct-f4. For consistency with Harney et al. (2021), we simulated the full population history but restricted inference to 10 populations only. The name of each sample is composed of its population of origin concatenated with a unique sample identifier. Samples from population 14 show increasing ancestry from populations 0, 5 and 7 (colored in red) for greater simulated introgression proportions, as expected, even though the true donor population remained unsampled.

3 Results

We evaluated Struct-f4 using the same simulation framework as that implemented by Harney et al. (2021) for assessing qpAdm performance. Individuals simulated as belonging to populations that were either closely related or increasingly connected by gene flow appeared next to each other in the MDS space. Each individual clustered with the phylogenetically closest cladal group in the simulated model and showed a genetic profile consistent with the intensity of admixture (Fig. 1). Unsupervised analyses slightly underestimated admixture proportions when only three haploid individuals were sampled per population; supervised inference, however, fully corrected this bias (Fig. 1). Therefore, Struct-f4 can be used even when sampling efforts are limited, expanding the analytical toolkit in statistical genomics with a robust, flexible and user-friendly platform to automatically characterize population genetic structure. We successfully applied Struct-f4 to characterize the population structure underlying 284 ancient and modern horse genomes from over 11 million permutations of f4-statistics (Librado et al., 2021).

Funding

This work received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme [Grant 681605].

Conflict of Interest: none declared.
References

1.  Fast model-based estimation of ancestry in unrelated individuals.

Authors:  David H Alexander; John Novembre; Kenneth Lange
Journal:  Genome Res       Date:  2009-07-31       Impact factor: 9.043

2.  Ancient admixture in human history.

Authors:  Nick Patterson; Priya Moorjani; Yontao Luo; Swapan Mallick; Nadin Rohland; Yiping Zhan; Teri Genschoreck; Teresa Webster; David Reich
Journal:  Genetics       Date:  2012-09-07       Impact factor: 4.562

3.  Tracing the peopling of the world through genomics.

Authors:  Rasmus Nielsen; Joshua M Akey; Mattias Jakobsson; Jonathan K Pritchard; Sarah Tishkoff; Eske Willerslev
Journal:  Nature       Date:  2017-01-18       Impact factor: 49.962

4.  Population structure and eigenanalysis.

Authors:  Nick Patterson; Alkes L Price; David Reich
Journal:  PLoS Genet       Date:  2006-12       Impact factor: 5.917

5.  Assessing the performance of qpAdm: a statistical tool for studying population admixture.

Authors:  Éadaoin Harney; Nick Patterson; David Reich; John Wakeley
Journal:  Genetics       Date:  2021-04-15       Impact factor: 4.562

6.  The origins and spread of domestic horses from the Western Eurasian steppes.

Authors:  Pablo Librado; Naveed Khan; Antoine Fages; Mariya A Kusliy; Tomasz Suchan; Laure Tonasso-Calvière; Stéphanie Schiavinato; Duha Alioglu; Aurore Fromentier; Aude Perdereau; Jean-Marc Aury; Charleen Gaunitz; Lorelei Chauvey; Andaine Seguin-Orlando; Clio Der Sarkissian; John Southon; Beth Shapiro; Alexey A Tishkin; Alexey A Kovalev; Saleh Alquraishi; Ahmed H Alfarhan; Khaled A S Al-Rasheid; Timo Seregély; Lutz Klassen; Rune Iversen; Olivier Bignon-Lau; Pierre Bodu; Monique Olive; Jean-Christophe Castel; Myriam Boudadi-Maligne; Nadir Alvarez; Mietje Germonpré; Magdalena Moskal-Del Hoyo; Jarosław Wilczyński; Sylwia Pospuła; Anna Lasota-Kuś; Krzysztof Tunia; Marek Nowak; Eve Rannamäe; Urmas Saarma; Gennady Boeskorov; Lembi Lōugas; René Kyselý; Lubomír Peške; Adrian Bălășescu; Valentin Dumitrașcu; Roxana Dobrescu; Daniel Gerber; Viktória Kiss; Anna Szécsényi-Nagy; Balázs G Mende; Zsolt Gallina; Krisztina Somogyi; Gabriella Kulcsár; Erika Gál; Robin Bendrey; Morten E Allentoft; Ghenadie Sirbu; Valentin Dergachev; Henry Shephard; Noémie Tomadini; Sandrine Grouard; Aleksei Kasparov; Alexander E Basilyan; Mikhail A Anisimov; Pavel A Nikolskiy; Elena Y Pavlova; Vladimir Pitulko; Gottfried Brem; Barbara Wallner; Christoph Schwall; Marcel Keller; Keiko Kitagawa; Alexander N Bessudnov; Alexander Bessudnov; William Taylor; Jérome Magail; Jamiyan-Ombo Gantulga; Jamsranjav Bayarsaikhan; Diimaajav Erdenebaatar; Kubatbeek Tabaldiev; Enkhbayar Mijiddorj; Bazartseren Boldgiv; Turbat Tsagaan; Mélanie Pruvost; Sandra Olsen; Cheryl A Makarewicz; Silvia Valenzuela Lamas; Silvia Albizuri Canadell; Ariadna Nieto Espinet; Ma Pilar Iborra; Jaime Lira Garrido; Esther Rodríguez González; Sebastián Celestino; Carmen Olària; Juan Luis Arsuaga; Nadiia Kotova; Alexander Pryor; Pam Crabtree; Rinat Zhumatayev; Abdesh Toleubaev; Nina L Morgunova; Tatiana Kuznetsova; David Lordkipanize; Matilde Marzullo; Ornella Prato; Giovanna Bagnasco Gianni; Umberto Tecchiati; Benoit Clavel; Sébastien Lepetz; Hossein Davoudi; Marjan Mashkour; Natalia Ya Berezina; Philipp W Stockhammer; Johannes Krause; Wolfgang Haak; Arturo Morales-Muñiz; Norbert Benecke; Michael Hofreiter; Arne Ludwig; Alexander S Graphodatsky; Joris Peters; Kirill Yu Kiryushin; Tumur-Ochir Iderkhangai; Nikolay A Bokovenko; Sergey K Vasiliev; Nikolai N Seregin; Konstantin V Chugunov; Natalya A Plasteeva; Gennady F Baryshnikov; Ekaterina Petrova; Mikhail Sablin; Elina Ananyevskaya; Andrey Logvin; Irina Shevnina; Victor Logvin; Saule Kalieva; Valeriy Loman; Igor Kukushkin; Ilya Merz; Victor Merz; Sergazy Sakenov; Victor Varfolomeyev; Emma Usmanova; Viktor Zaibert; Benjamin Arbuckle; Andrey B Belinskiy; Alexej Kalmykov; Sabine Reinhold; Svend Hansen; Aleksandr I Yudin; Alekandr A Vybornov; Andrey Epimakhov; Natalia S Berezina; Natalia Roslyakova; Pavel A Kosintsev; Pavel F Kuznetsov; David Anthony; Guus J Kroonen; Kristian Kristiansen; Patrick Wincker; Alan Outram; Ludovic Orlando
Journal:  Nature       Date:  2021-10-20       Impact factor: 49.962

7.  Tracking Five Millennia of Horse Management with Extensive Ancient Genome Time Series.

Authors:  Antoine Fages; Kristian Hanghøj; Naveed Khan; Charleen Gaunitz; Andaine Seguin-Orlando; Michela Leonardi; Christian McCrory Constantz; Cristina Gamba; Khaled A S Al-Rasheid; Silvia Albizuri; Ahmed H Alfarhan; Morten Allentoft; Saleh Alquraishi; David Anthony; Nurbol Baimukhanov; James H Barrett; Jamsranjav Bayarsaikhan; Norbert Benecke; Eloísa Bernáldez-Sánchez; Luis Berrocal-Rangel; Fereidoun Biglari; Sanne Boessenkool; Bazartseren Boldgiv; Gottfried Brem; Dorcas Brown; Joachim Burger; Eric Crubézy; Linas Daugnora; Hossein Davoudi; Peter de Barros Damgaard; María de Los Ángeles de Chorro Y de Villa-Ceballos; Sabine Deschler-Erb; Cleia Detry; Nadine Dill; Maria do Mar Oom; Anna Dohr; Sturla Ellingvåg; Diimaajav Erdenebaatar; Homa Fathi; Sabine Felkel; Carlos Fernández-Rodríguez; Esteban García-Viñas; Mietje Germonpré; José D Granado; Jón H Hallsson; Helmut Hemmer; Michael Hofreiter; Aleksei Kasparov; Mutalib Khasanov; Roya Khazaeli; Pavel Kosintsev; Kristian Kristiansen; Tabaldiev Kubatbek; Lukas Kuderna; Pavel Kuznetsov; Haeedeh Laleh; Jennifer A Leonard; Johanna Lhuillier; Corina Liesau von Lettow-Vorbeck; Andrey Logvin; Lembi Lõugas; Arne Ludwig; Cristina Luis; Ana Margarida Arruda; Tomas Marques-Bonet; Raquel Matoso Silva; Victor Merz; Enkhbayar Mijiddorj; Bryan K Miller; Oleg Monchalov; Fatemeh A Mohaseb; Arturo Morales; Ariadna Nieto-Espinet; Heidi Nistelberger; Vedat Onar; Albína H Pálsdóttir; Vladimir Pitulko; Konstantin Pitskhelauri; Mélanie Pruvost; Petra Rajic Sikanjic; Anita Rapan Papeša; Natalia Roslyakova; Alireza Sardari; Eberhard Sauer; Renate Schafberg; Amelie Scheu; Jörg Schibler; Angela Schlumbaum; Nathalie Serrand; Aitor Serres-Armero; Beth Shapiro; Shiva Sheikhi Seno; Irina Shevnina; Sonia Shidrang; John Southon; Bastiaan Star; Naomi Sykes; Kamal Taheri; William Taylor; Wolf-Rüdiger Teegen; Tajana Trbojević Vukičević; Simon Trixl; Dashzeveg Tumen; Sainbileg Undrakhbold; Emma Usmanova; Ali Vahdati; Silvia Valenzuela-Lamas; Catarina Viegas; Barbara Wallner; Jaco Weinstock; Victor Zaibert; Benoit Clavel; Sébastien Lepetz; Marjan Mashkour; Agnar Helgason; Kári Stefánsson; Eric Barrey; Eske Willerslev; Alan K Outram; Pablo Librado; Ludovic Orlando
Journal:  Cell       Date:  2019-05-02       Impact factor: 66.850

8.  admixturegraph: an R package for admixture graph manipulation and fitting.

Authors:  Kalle Leppälä; Svend V Nielsen; Thomas Mailund
Journal:  Bioinformatics       Date:  2017-06-01       Impact factor: 6.937

9.  A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots.

Authors:  Daniel J Lawson; Lucy van Dorp; Daniel Falush
Journal:  Nat Commun       Date:  2018-08-14       Impact factor: 14.919
