Benet Oriol Sabat (1,2), Daniel Mas Montserrat (2), Xavier Giro-I-Nieto (1), Alexander G Ioannidis (2,3).
1. Department of Signal Theory and Communications, Universitat Politecnica de Catalunya, Barcelona 08034, Spain.
2. Department of Biomedical Data Science, Stanford Medical School.
3. Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA 94305, USA.
Abstract
MOTIVATION: Local ancestry inference (LAI) is the high-resolution prediction of ancestry labels along a DNA sequence. LAI is important in the study of human history and migrations, and it is beginning to play a role in precision medicine applications, including ancestry-adjusted genome-wide association studies (GWASs) and polygenic risk scores (PRSs). Existing LAI models do not generalize well between species, chromosomes or even ancestry groups, and must be retrained for each new setting. Furthermore, such methods can lack interpretability, which is an important element in each of these applications.
RESULTS: We present SALAI-Net, a portable statistical LAI method that can be applied to any set of species and ancestries (species-agnostic), requiring only haplotype data and no other biological parameters. Inspired by identity-by-descent methods, SALAI-Net estimates a population label for each segment of DNA by matching it against reference haplotypes, which makes the technique both interpretable and fast. We benchmark our models on human whole-genome data and test their ability to generalize to dog breeds when trained on human data. SALAI-Net outperforms previous methods in terms of balanced accuracy, while generalizing between different settings, species and datasets. Moreover, it is up to two orders of magnitude faster and uses considerably less RAM than competing methods.
AVAILABILITY AND IMPLEMENTATION: We provide an open-source implementation and links to publicly available data at github.com/AI-sandbox/SALAI-Net. The data are publicly available as follows: https://www.internationalgenome.org (1000 Genomes), https://www.simonsfoundation.org/simons-genome-diversity-project (Simons Genome Diversity Project), https://www.sanger.ac.uk/resources/downloads/human/hapmap3.html (HapMap), ftp://ngs.sanger.ac.uk/production/hgdp/hgdp_wgs.20190516 (Human Genome Diversity Project) and https://www.ncbi.nlm.nih.gov/bioproject/PRJNA448733 (Canid genomes).
SUPPLEMENTARY INFORMATION: Supplementary data are available from Bioinformatics online.
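ILLUSTRATIVE SKETCH: The abstract describes the method only at a high level, so the following is a toy illustration of the general idea of window-wise reference matching and of the balanced accuracy metric, not the SALAI-Net implementation. The function names, the fixed window size, the allele-match similarity and the "best reference per population" rule are all assumptions made for this example; the published model may differ on each of these points.

```python
import numpy as np


def window_reference_matching(query, references, ref_labels, window_size):
    """Toy window-wise reference matching (hypothetical helper, illustrative only).

    query:       1-D array of alleles (0/1) for one haplotype.
    references:  2-D array (n_references, n_snps) of reference haplotypes.
    ref_labels:  population label of each reference haplotype.
    window_size: number of SNPs per window (an assumed, fixed value).
    """
    query = np.asarray(query)
    references = np.asarray(references)
    ref_labels = np.asarray(ref_labels)
    populations = np.unique(ref_labels)

    n_windows = query.shape[0] // window_size
    predictions = np.empty(n_windows, dtype=populations.dtype)
    for w in range(n_windows):
        start, stop = w * window_size, (w + 1) * window_size
        # Fraction of alleles in this window that match each reference.
        similarity = (references[:, start:stop] == query[start:stop]).mean(axis=1)
        # Keep the best-matching reference of every population.
        best = [similarity[ref_labels == p].max() for p in populations]
        # Assign the window to the population with the closest reference.
        predictions[w] = populations[int(np.argmax(best))]
    return predictions


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall, so every ancestry counts equally."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(recalls))


# Made-up data: two references per population, eight SNPs, windows of four.
references = [
    [0, 0, 0, 0, 1, 1, 1, 1],  # population A
    [0, 0, 0, 1, 1, 1, 1, 1],  # population A
    [1, 1, 1, 1, 0, 0, 0, 0],  # population B
    [1, 1, 0, 1, 0, 0, 0, 0],  # population B
]
ref_labels = ["A", "A", "B", "B"]
query = [0, 0, 0, 0, 0, 0, 0, 0]

pred = window_reference_matching(query, references, ref_labels, window_size=4)
print(pred.tolist())
print(balanced_accuracy(["A", "B"], pred))
```

Because each window's label can be traced back to the specific reference haplotype it matched best, a scheme of this kind is easy to inspect, which is the sense of interpretability the abstract refers to. It also needs nothing beyond haplotypes and their population labels.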