Martin Petr, Benjamin Vernot, Janet Kelso. Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany.
Abstract
SUMMARY: We present a new R package admixr, which provides a convenient interface for performing reproducible population genetic analyses (f3, D, f4, f4-ratio, qpWave and qpAdm), as implemented by command-line programs in the ADMIXTOOLS software suite. In a traditional ADMIXTOOLS workflow, the user must first generate a set of text configuration files tailored to each individual analysis, often using a combination of shell scripting and manual text editing. The non-tabular output files then need to be parsed to extract values of interest prior to further analyses. Our package simplifies this process by automating all low-level configuration and parsing steps, making analyses as simple as running a single R command. Furthermore, we provide a set of R functions for processing, filtering and manipulating datasets in the EIGENSTRAT format. By unifying all steps of the workflow under a single R framework, this package enables the automation of analytic pipelines, significantly improving the reproducibility of population genetic studies. AVAILABILITY AND IMPLEMENTATION: The source code of the R package is available under the MIT license. Installation instructions, reference manual and a tutorial can be found on the package website at https://bioinf.eva.mpg.de/admixr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
1 Introduction
The growing number of ancient and modern genome sequences has transformed our understanding of the evolutionary history of humans and other species. Several statistical methods have been developed to make inferences about past population movements and admixture from genomic data. Chief among these is a series of population genetic methods (D, f3, f4, f4-ratio, qpWave and qpAdm) for estimating the amount of genetic drift shared between populations, testing admixture hypotheses and estimating admixture proportions, implemented as command-line utilities in the ADMIXTOOLS software suite (Patterson et al., 2012). Although ADMIXTOOLS has been used in many recent studies of human ancient DNA (Fu et al., 2016; Haak et al., 2015; Hajdinjak et al., 2018; Lazaridis et al., 2016), the tools in this suite are cumbersome to use. First, each individual analysis or hypothesis test relies on a set of configuration files, which have to be generated using a combination of shell scripting and manual editing. Second, after running an ADMIXTOOLS program on the command line, the user needs to extract the relevant values from a non-tabular text file before they can be imported into software such as R for further analysis and plotting. This workflow is slow and error-prone, especially if the user wishes to iterate quickly through different hypotheses involving many populations or samples. Most importantly, it makes it difficult to conduct fully reproducible research. To overcome these challenges, we present a new R package for population admixture analyses that uses the ADMIXTOOLS software suite for the underlying calculations but provides a unified and convenient R interface. The package completely automates the generation, processing and parsing of all intermediate files, hiding the low-level details from users and allowing them to focus on the analysis itself.
Importantly, unifying the entire analytic workflow in a single environment makes it possible to implement and share fully automated, reproducible analytic pipelines.
2 Implementation
The admixr package is implemented using the R programming language. It consists of several wrapper functions (calling ADMIXTOOLS commands internally from R), and a set of complementary functions for filtering and processing datasets in the EIGENSTRAT file format required by ADMIXTOOLS (Patterson et al., 2012).
An EIGENSTRAT dataset is represented by an S3 object of the class EIGENSTRAT, which is created using the eigenstrat() constructor function, and encapsulates the paths to a trio of 'ind', 'snp' and 'geno' files:
    > snps <- eigenstrat("~/path/to/eigenstrat/data")
    > snps
    EIGENSTRAT object
    =================
    components:
      ind file: ~/path/to/eigenstrat/data.ind
      snp file: ~/path/to/eigenstrat/data.snp
      geno file: ~/path/to/eigenstrat/data.geno
All other functions in the package accept this object as their first argument, and either perform a requested calculation on it (returning an R data frame for further analysis), or return a new, modified EIGENSTRAT S3 object (in the case of filtering and processing functions) which can be used in additional downstream steps or calculations.
The core functionality of the package consists of the following set of R functions: f3(), d(), f4(), f4ratio(), qpWave() and qpAdm(), each implemented as a wrapper around one of the command-line programs distributed as part of the ADMIXTOOLS package.
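The S3 design described above can be illustrated with a few lines of base R. The following is only a minimal sketch of the general pattern (a list of file paths with a class attribute and a print method). The names new_eigenstrat() and EIGENSTRAT_sketch are hypothetical and chosen for illustration; this is not the package's actual source code.

```r
# Hypothetical constructor: stores the paths to the three component files.
# It mimics the behaviour described in the text, not admixr's real code.
new_eigenstrat <- function(prefix) {
  obj <- list(
    ind  = paste0(prefix, ".ind"),
    snp  = paste0(prefix, ".snp"),
    geno = paste0(prefix, ".geno")
  )
  class(obj) <- "EIGENSTRAT_sketch"
  obj
}

# S3 print method, dispatched automatically when the object is printed.
print.EIGENSTRAT_sketch <- function(x, ...) {
  cat("EIGENSTRAT object\n=================\ncomponents:\n")
  cat("  ind file:", x$ind, "\n")
  cat("  snp file:", x$snp, "\n")
  cat("  geno file:", x$geno, "\n")
  invisible(x)
}

snps <- new_eigenstrat("~/path/to/eigenstrat/data")
print(snps)
```

Because the object only holds file paths, a filtering step can return a new object pointing at a modified file while leaving the original data untouched, which is consistent with the behaviour described above.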
3 Example usage
Performing even the most trivial analysis using ADMIXTOOLS presents a significant amount of overhead for the user. For example, to estimate the proportion of Neandertal ancestry in a set of individuals X, the user would typically calculate an f4-ratio statistic such as:
$$\alpha = \frac{f_4(\mathrm{Altai}, \mathrm{Chimp}; X, \mathrm{Mbuti})}{f_4(\mathrm{Altai}, \mathrm{Chimp}; \mathrm{Vindija}, \mathrm{Mbuti})} \qquad (1)$$
The user first needs to create a file listing the samples in each position of both f4 statistics and a parameter file specifying the paths to the trio of EIGENSTRAT component files, then manually run the qpF4ratio command-line program, and finally capture and parse its output to obtain the relevant values (see Supplementary Information for a complete example workflow using a traditional ADMIXTOOLS approach). Note that changing the analysis setup [such as including a different set of populations in Equation (1)], performing the analysis on a subset of the genome, or modifying the analysis in another way, requires changes to its configuration files. This is a considerable burden, especially when iterating through a complex set of population genetic hypotheses.
In contrast, using the admixr package, the same analysis can be performed with just the following snippet of R code:
    result <- f4ratio(X = c("French", "Han", "Papuan"),
                      A = "Altai", B = "Vindija", C = "Mbuti", O = "Chimp",
                      data = eigenstrat(""))
Internally, the f4ratio() function performs all configuration and parsing work, and returns an R data frame which can be immediately used for further statistical analysis and plotting:
    > result
        A       B      X     C     O    alpha   stderr Zscore
    Altai Vindija French Mbuti Chimp 0.019696 0.003114  6.324
    Altai Vindija    Han Mbuti Chimp 0.024379 0.003364  7.248
    Altai Vindija Papuan Mbuti Chimp 0.032167 0.003499  9.193
All other admixr wrapper functions have a similar interface and are described in more detail in the tutorial vignette on the package website.
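As an illustration of the kind of downstream analysis that a tabular result makes straightforward, the sketch below turns the alpha and stderr columns into approximate 95% confidence intervals. To keep it runnable without ADMIXTOOLS, the data frame is rebuilt by hand from the values printed above rather than obtained from f4ratio(). The use of a normal approximation and the column names ci_low, ci_high and percent are our own assumptions for this example.

```r
# Values copied from the f4ratio() output table shown in the text.
result <- data.frame(
  X      = c("French", "Han", "Papuan"),
  alpha  = c(0.019696, 0.024379, 0.032167),
  stderr = c(0.003114, 0.003364, 0.003499)
)

# Approximate 95% confidence interval, assuming normally distributed error.
z <- qnorm(0.975)
result$ci_low  <- result$alpha - z * result$stderr
result$ci_high <- result$alpha + z * result$stderr

# Express the point estimates as percentages for reporting.
result$percent <- round(100 * result$alpha, 2)

print(result)
```

In a real analysis the first step would be replaced by the f4ratio() call itself, so that the whole path from genotype data to summary table lives in a single R script.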
4 Additional functionality
The fact that ADMIXTOOLS requires the data to be in EIGENSTRAT format presents additional challenges for quality control, processing and filtering, as this format is not supported by standard bioinformatics tools. Our R package therefore provides additional functionality to simplify the processing and filtering of EIGENSTRAT genotype data. This includes:
- Reading and writing of ind, snp and geno file components.
- Filtering of SNPs based on regions specified in a BED file.
- Restricting analyses to sites carrying transversion SNPs.
- Renaming samples or grouping them into larger population groups.
- Merging of EIGENSTRAT datasets.
- Counting the number of sites present or missing in each sample.
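To make one of the filtering steps listed above concrete, the following base R sketch shows the logic behind restricting a SNP table to transversions. It operates on a small in-memory data frame whose six columns follow the layout of an EIGENSTRAT snp file (SNP identifier, chromosome, genetic position, physical position, reference allele, alternative allele). The toy data, the column names and the helper is_transition() are hypothetical; the package's own function works on the files referenced by an EIGENSTRAT object and may be implemented differently.

```r
# Toy SNP table with the six columns of an EIGENSTRAT snp file
# (column names are our own; the file itself has no header line).
snp <- data.frame(
  id    = c("rs1", "rs2", "rs3", "rs4"),
  chrom = c(1, 1, 2, 2),
  gen   = c(0, 0, 0, 0),
  pos   = c(1000, 2000, 1500, 3000),
  ref   = c("A", "C", "A", "G"),
  alt   = c("G", "T", "C", "T"),
  stringsAsFactors = FALSE
)

# Transitions are A<->G and C<->T; every other substitution is a transversion.
is_transition <- function(ref, alt) {
  paste0(ref, alt) %in% c("AG", "GA", "CT", "TC")
}

# Keep only the rows that are not transitions.
transversions <- snp[!is_transition(snp$ref, snp$alt), ]
print(transversions)
```

Here rs1 (A/G) and rs2 (C/T) are transitions and are dropped, while rs3 (A/C) and rs4 (G/T) are retained.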
Authors: Wolfgang Haak; Iosif Lazaridis; Nick Patterson; Nadin Rohland; Swapan Mallick; Bastien Llamas; Guido Brandt; Susanne Nordenfelt; Eadaoin Harney; Kristin Stewardson; Qiaomei Fu; Alissa Mittnik; Eszter Bánffy; Christos Economou; Michael Francken; Susanne Friederich; Rafael Garrido Pena; Fredrik Hallgren; Valery Khartanovich; Aleksandr Khokhlov; Michael Kunst; Pavel Kuznetsov; Harald Meller; Oleg Mochalov; Vayacheslav Moiseyev; Nicole Nicklisch; Sandra L Pichler; Roberto Risch; Manuel A Rojo Guerra; Christina Roth; Anna Szécsényi-Nagy; Joachim Wahl; Matthias Meyer; Johannes Krause; Dorcas Brown; David Anthony; Alan Cooper; Kurt Werner Alt; David Reich Journal: Nature Date: 2015-03-02 Impact factor: 49.962
Authors: Mateja Hajdinjak; Qiaomei Fu; Alexander Hübner; Martin Petr; Fabrizio Mafessoni; Steffi Grote; Pontus Skoglund; Vagheesh Narasimham; Hélène Rougier; Isabelle Crevecoeur; Patrick Semal; Marie Soressi; Sahra Talamo; Jean-Jacques Hublin; Ivan Gušić; Željko Kućan; Pavao Rudan; Liubov V Golovanova; Vladimir B Doronichev; Cosimo Posth; Johannes Krause; Petra Korlević; Sarah Nagel; Birgit Nickel; Montgomery Slatkin; Nick Patterson; David Reich; Kay Prüfer; Matthias Meyer; Svante Pääbo; Janet Kelso Journal: Nature Date: 2018-03-21 Impact factor: 49.962
Authors: Iosif Lazaridis; Dani Nadel; Gary Rollefson; Deborah C Merrett; Nadin Rohland; Swapan Mallick; Daniel Fernandes; Mario Novak; Beatriz Gamarra; Kendra Sirak; Sarah Connell; Kristin Stewardson; Eadaoin Harney; Qiaomei Fu; Gloria Gonzalez-Fortes; Eppie R Jones; Songül Alpaslan Roodenberg; György Lengyel; Fanny Bocquentin; Boris Gasparian; Janet M Monge; Michael Gregg; Vered Eshed; Ahuva-Sivan Mizrahi; Christopher Meiklejohn; Fokke Gerritsen; Luminita Bejenaru; Matthias Blüher; Archie Campbell; Gianpiero Cavalleri; David Comas; Philippe Froguel; Edmund Gilbert; Shona M Kerr; Peter Kovacs; Johannes Krause; Darren McGettigan; Michael Merrigan; D Andrew Merriwether; Seamus O'Reilly; Martin B Richards; Ornella Semino; Michel Shamoon-Pour; Gheorghe Stefanescu; Michael Stumvoll; Anke Tönjes; Antonio Torroni; James F Wilson; Loic Yengo; Nelli A Hovhannisyan; Nick Patterson; Ron Pinhasi; David Reich Journal: Nature Date: 2016-07-25 Impact factor: 49.962
Authors: Qiaomei Fu; Cosimo Posth; Mateja Hajdinjak; Martin Petr; Swapan Mallick; Daniel Fernandes; Anja Furtwängler; Wolfgang Haak; Matthias Meyer; Alissa Mittnik; Birgit Nickel; Alexander Peltzer; Nadin Rohland; Viviane Slon; Sahra Talamo; Iosif Lazaridis; Mark Lipson; Iain Mathieson; Stephan Schiffels; Pontus Skoglund; Anatoly P Derevianko; Nikolai Drozdov; Vyacheslav Slavinsky; Alexander Tsybankov; Renata Grifoni Cremonesi; Francesco Mallegni; Bernard Gély; Eligio Vacca; Manuel R González Morales; Lawrence G Straus; Christine Neugebauer-Maresch; Maria Teschler-Nicola; Silviu Constantin; Oana Teodora Moldovan; Stefano Benazzi; Marco Peresani; Donato Coppola; Martina Lari; Stefano Ricci; Annamaria Ronchitelli; Frédérique Valentin; Corinne Thevenet; Kurt Wehrberger; Dan Grigorescu; Hélène Rougier; Isabelle Crevecoeur; Damien Flas; Patrick Semal; Marcello A Mannino; Christophe Cupillard; Hervé Bocherens; Nicholas J Conard; Katerina Harvati; Vyacheslav Moiseyev; Dorothée G Drucker; Jiří Svoboda; Michael P Richards; David Caramelli; Ron Pinhasi; Janet Kelso; Nick Patterson; Johannes Krause; Svante Pääbo; David Reich Journal: Nature Date: 2016-05-02 Impact factor: 49.962