bammds: a tool for assessing the ancestry of low-depth whole-genome data using multidimensional scaling (MDS).

Anna-Sapfo Malaspinas¹, Ole Tange¹, José Víctor Moreno-Mayar¹, Morten Rasmussen², Michael DeGiorgio¹, Yong Wang², Cristina E Valdiosera², Gustavo Politis², Eske Willerslev¹, Rasmus Nielsen².

Abstract

SUMMARY: We present bammds, a practical tool that allows visualization of samples sequenced by second-generation sequencing when compared with a reference panel of individuals (usually genotypes) using a multidimensional scaling algorithm. Our tool is aimed at determining the ancestry of unknown samples, typical of ancient DNA data, particularly when only low amounts of data are available for those samples.
AVAILABILITY AND IMPLEMENTATION: The software package is released under the GNU General Public License v3 and is freely available, together with test datasets, at https://savannah.nongnu.org/projects/bammds/. It relies on R (http://www.r-project.org/), parallel (http://www.gnu.org/software/parallel/) and samtools (https://github.com/samtools/samtools).
CONTACT: bammds-users@nongnu.org
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2014 PMID: 24974206 PMCID: PMC4184259 DOI: 10.1093/bioinformatics/btu410

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Population structure plays an important role in determining the evolutionary history of a group. A great deal has been learned from single nucleotide polymorphism (SNP) array technology, which has provided unmatched information on the population structure of several species [for humans, see (Novembre and Ramachandran, 2011)]. The advent of new sequencing platforms, which can deliver millions to billions of sequencing reads within days, has shifted the focus from SNP array data to whole-genome shotgun (WGS) data. While the cost has steadily decreased (Sboner et al., 2011), obtaining many high-depth genomes remains prohibitive for many laboratories, in particular when working with ancient DNA (aDNA) samples, where it is often desirable to screen many samples of potential interest while keeping the cost at a minimum. Methods based on non-parametric multidimensional statistics (more specifically principal components analysis, PCA) were first applied to genetic data more than 30 years ago (Menozzi et al., 1978). PCA has since become a standard tool in population genetics (Patterson et al., 2006; Wang et al.) owing in particular to (i) the low computational demand of such analyses, (ii) the appealing graphical result and (iii) its ease of use. Here, we describe a tool that assigns an ancestry to low-depth mapped WGS data by comparing them with an existing reference panel of genotype data using multidimensional scaling (MDS) based on genetic distances, a related method that provides results similar to those of PCA (Cox and Cox, 2000).

2 METHODS

In what follows, we assume that WGS data have been mapped to a reference genome and that files in BAM format are available (Li et al., 2009). Calling genotypes for low-depth data is a challenging task (Nielsen et al., 2011), particularly for aDNA, as ancient damage (Briggs et al., 2007) and contamination are not incorporated into sequence data error models. To avoid calling genotypes, we sample a read at every position for the WGS data, similar in spirit to previous aDNA approaches (Green et al., 2010). Specifically, for the reference set of individuals, we randomly sample one of the alleles from each individual, and for the WGS data, we choose an allele from a randomly selected read covering that site. If no read covers that site or if the sampled allele is neither the minor nor the major allele in the reference panel, we treat the data for this site as missing for that sample. In other words, the data in both the reference panel and the WGS samples become either one allele (A, C, G or T) or missing data.

For site $k$, let $d_{ij}^{k} = 1$ if individuals $i$ and $j$ have a different randomly chosen allele and $d_{ij}^{k} = 0$ if that allele is the same or if one of the individuals has missing data. Assume that the number of sites in the reference panel is $K$. Denote by $K_{ij}$ the number of sites where neither individual $i$ nor individual $j$ has missing data. Then, the allele-sharing distance between individuals $i$ and $j$ is as follows:

$$d_{ij} = \frac{1}{K_{ij}} \sum_{k=1}^{K} d_{ij}^{k}$$

A matrix of allele-sharing distances between all pairs of individuals is computed. We then apply classical MDS to this matrix [e.g. (Cox and Cox, 2000)].

Our implementation has three major features: (i) it is user friendly and is intended to be used by biologists with limited familiarity with a UNIX system, (ii) it is flexible in terms of formats of the reference panel and in terms of the visual output and (iii) it runs in ∼20 min on a machine with four 2.2 GHz cores with a reference panel including >600 000 SNPs and ∼950 individuals, making it practical to screen samples of an ongoing experiment progressively as additional data are produced.
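The distance and scaling steps described in this section can be illustrated with a minimal Python sketch. It is not the bammds implementation (which relies on R); the function names, the integer coding of alleles (0 to 3 for A, C, G, T) and the `MISSING` sentinel are assumptions made for illustration. The input is taken to be the already-sampled data, one allele or a missing value per individual and site.

```python
import numpy as np

MISSING = -1  # assumed code for missing data; alleles A, C, G, T are coded 0..3


def allele_sharing_distance(alleles):
    """Pairwise allele-sharing distance for an (individuals x sites) array.

    For a pair (i, j), the distance is the number of mismatching sites
    divided by the number of sites where both individuals have data.
    """
    alleles = np.asarray(alleles)
    n = alleles.shape[0]
    observed = alleles != MISSING
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            both = observed[i] & observed[j]       # sites without missing data
            k_ij = both.sum()
            if k_ij == 0:
                dist[i, j] = dist[j, i] = np.nan   # no shared site: undefined
                continue
            mismatches = (alleles[i] != alleles[j]) & both
            dist[i, j] = dist[j, i] = mismatches.sum() / k_ij
    return dist


def classical_mds(dist, n_dims=2):
    """Classical (Torgerson) MDS via eigendecomposition of the
    double-centred matrix of squared distances. Assumes no NaN entries."""
    n = dist.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centring @ (dist ** 2) @ centring
    eigvals, eigvecs = np.linalg.eigh(b)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_dims]     # keep the largest ones
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Toy data: three reference individuals and one low-depth sample (last row).
panel = np.array([
    [0, 1, 2, 3, 0, 1],
    [0, 1, 2, 3, 0, 1],
    [1, 1, 2, 0, 0, 2],
    [0, MISSING, 2, MISSING, 0, 1],
])
asd = allele_sharing_distance(panel)   # asd[0, 2] is 3 mismatches over 6 sites
coords = classical_mds(asd)            # one row of 2-D coordinates per individual
```

Sites with missing data drop out of both the numerator and the denominator, so a low-depth sample is compared with each reference individual only at the sites it actually covers.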
We first tested bammds through simulations using publicly available modern and ancient human data. For the WGS data, we used 10 modern human genomes from HGDP cell lines, published in Meyer et al., an Australian aboriginal genome (Rasmussen et al., 2011) and the Anzick-1 genome (Rasmussen et al.). We mapped and processed the data identically for all genomes (see Supplementary data). We used a public reference panel that we make available in the Supplementary data, i.e. HGDP (Li et al., 2008), which includes >600 000 SNPs and ∼950 individuals subdivided into 53 populations and 7 geographic regions (Africa, Eastern-, Western-, Central- and South Asia, Europe, Oceania and Native America). For each genome, we sampled $3 \times 10^{4}$, $3 \times 10^{5}$, $3 \times 10^{6}$, $1.5 \times 10^{7}$, $3 \times 10^{7}$ and $1.5 \times 10^{8}$ reads (which corresponds to a depth of coverage around 0.001×, 0.01×, 0.1×, 0.5×, 1× and 5×, assuming ∼100 bp sequence reads). For each sub-sampled genome, we ran bammds with the HGDP reference panel.

We summarized the simulation results using only dimensions 1 and 2 of the MDS output, as we expect this to be the common usage. For each population in HGDP, we defined its centroid (or center of gravity) based on the coordinates of its members for those two dimensions. We then evaluated the results using two criteria: (i) by assessing which population centroid was closest to the position of the WGS sample, and (ii) by determining whether the position of the genome is within a two-dimensional 99% confidence region. We built the confidence region by assuming that the points of a population follow a bivariate normal distribution centred on the centroid of that population (‘population ellipse’).

We also present a practical example of how to use the tool to determine whether a library is heavily contaminated, by processing a newly sequenced phalange (‘Gus’) from Argentina, dated to ∼10 000 years BP, that clusters with the Europeans (Supplementary data).
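The two evaluation criteria can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used for the simulations: the function names are hypothetical, the ellipse is built from the sample mean and sample covariance of the population members, and the 99% boundary is taken from the chi-square distribution with two degrees of freedom, whose quantile has the closed form $-2\ln(1-p)$. How the published ellipses were actually computed is not specified here.

```python
import numpy as np


def population_centroids(coords, labels):
    """Centroid (mean MDS coordinates) of each reference population."""
    coords = np.asarray(coords, dtype=float)
    labels = np.asarray(labels)
    return {str(pop): coords[labels == pop].mean(axis=0)
            for pop in np.unique(labels)}


def closest_populations(sample, centroids, n=3):
    """Names of the n populations whose centroids are nearest to the sample."""
    sample = np.asarray(sample, dtype=float)
    ranked = sorted(centroids,
                    key=lambda pop: np.linalg.norm(sample - centroids[pop]))
    return ranked[:n]


def in_population_ellipse(sample, members, level=0.99):
    """True if the sample lies inside the `level` region of a bivariate
    normal fitted to the population members (two dimensions only)."""
    members = np.asarray(members, dtype=float)
    diff = np.asarray(sample, dtype=float) - members.mean(axis=0)
    cov = np.cov(members, rowvar=False)
    d2 = diff @ np.linalg.solve(cov, diff)     # squared Mahalanobis distance
    threshold = -2.0 * np.log(1.0 - level)     # chi-square quantile, 2 d.f.
    return bool(d2 <= threshold)
```

With `coords` holding the first two MDS dimensions of the reference panel and `labels` the population of each row, `closest_populations(sample, population_centroids(coords, labels))` gives the three nearest centroids used in criterion (i), and `in_population_ellipse(sample, coords[labels == pop])` corresponds to criterion (ii) when `labels` is a NumPy array.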

3 RESULTS

The graphical result with all 10 modern individuals at a depth of 0.1× is shown in Figure 1.

Fig. 1.

First two dimensions of an MDS plot including the ten 0.1× modern human genomes and the HGDP SNP data

We find in the simulations that for all but two cases, we recover the geographic region as the first hit for as few as 30 000 reads (∼0.001×, Table 1). In the remaining two cases, the Sardinian and the Karitiana individuals, a depth of 0.1× and 0.01×, respectively, is enough. In most cases (7/10), the true nearest population was also identified within the three closest centroids at a depth of 0.01× and above. For the second criterion, we find that in 9/10 of the cases, the WGS sample was within the population ellipse at 0.5× and above. Only in one case (San individual) was a depth of 1× necessary to be placed within the population ellipse.

Table 1.

Summary of the simulation results for the ten modern genomes. For more details, see Supplementary data

Min. approx. depth of coverage to …	… recover geographic region as closest centroid	… recover true population within three closest centroids	… be placed within population ellipse
Mbuti (Africa)	0.001	0.001	0.1
French (Europe)	0.001	0.01	0.1
Papuan (Oceania)	0.001	0.001	0.5
Sardinian (Europe)	0.1	0.01	0.5
Han (Eastern Asia)	0.001	0.1	0.01
Yoruba (Africa)	0.001	0.001	0.1
Karitiana (America)	0.01	0.01	0.1
San (Africa)	0.001	0.001	1
Mandenka (Africa)	0.001	0.1	0.1
Dai (Eastern Asia)	0.001	0.5	0.5

For the ancient data, we get similar results for the Aborigine, which is assigned to the correct geographic region (Oceania) as a first hit with a depth of ∼0.001× and above. At a depth higher than 0.01×, we also recover the expected population as the closest population. For the Anzick-1 individual, presumably because of increased damage, a depth of 1× is needed to recover the geographic region as the first hit. On the other hand, a Native American population is among the three closest populations from a depth of 0.1× and above. The results for Gus are given in Supplementary data.

4 CONCLUSION

The tool we present in this article is based on classical MDS, a technique that originated in the 1930s and is commonly used in other fields [see, e.g. (Borg and Groenen, 1997) and citations therein]. It was designed as a practical means of assessing the ancestry of mapped WGS data for samples sequenced at low depth, assuming that a reference panel relevant in terms of ancestry is provided. We show through simulations that useful ancestry information can be recovered for as few as 30 000 reads, corresponding to a fraction (∼1/60 in early 2014) of a HiSeq 2000 lane (www.illumina.com) for a sample with 1% endogenous content (or ∼1/4800 of a lane for a typical modern sample).
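The read counts and depths quoted above are linked by simple arithmetic, sketched here in Python. The genome size of 3 × 10^9 bp is an assumed round figure for the human genome, the 100 bp read length follows the approximation used in the Methods, and both function names are hypothetical.

```python
def depth_of_coverage(n_reads, read_length=100, genome_size=3.0e9):
    """Expected average depth: sequenced bases divided by genome size.

    genome_size is an assumed round value for the human genome.
    """
    return n_reads * read_length / genome_size


def reads_to_sequence(n_endogenous_reads, endogenous_fraction):
    """Total reads to sequence so that the expected number of endogenous
    reads reaches n_endogenous_reads."""
    return n_endogenous_reads / endogenous_fraction


depth = depth_of_coverage(30_000)          # about 0.001x
total = reads_to_sequence(30_000, 0.01)    # about 3 million reads at 1% endogenous
```

Under these assumptions, 30 000 reads of 100 bp give a depth of about 0.001×, and a sample with 1% endogenous content needs about 3 million sequenced reads to yield them.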

13 in total

1. Patterns of damage in genomic DNA sequences from a Neandertal.

Authors: Adrian W Briggs; Udo Stenzel; Philip L F Johnson; Richard E Green; Janet Kelso; Kay Prüfer; Matthias Meyer; Johannes Krause; Michael T Ronan; Michael Lachmann; Svante Pääbo
Journal: Proc Natl Acad Sci U S A Date: 2007-08-21 Impact factor: 11.205

2. A draft sequence of the Neandertal genome.

Authors: Johannes Krause; Adrian W Briggs; Tomislav Maricic; Udo Stenzel; Martin Kircher; Nick Patterson; Richard E Green; Heng Li; Weiwei Zhai; Markus Hsi-Yang Fritz; Nancy F Hansen; Eric Y Durand; Anna-Sapfo Malaspinas; Jeffrey D Jensen; Tomas Marques-Bonet; Can Alkan; Kay Prüfer; Matthias Meyer; Hernán A Burbano; Jeffrey M Good; Rigo Schultz; Ayinuer Aximu-Petri; Anne Butthof; Barbara Höber; Barbara Höffner; Madlen Siegemund; Antje Weihmann; Chad Nusbaum; Eric S Lander; Carsten Russ; Nathaniel Novod; Jason Affourtit; Michael Egholm; Christine Verna; Pavao Rudan; Dejana Brajkovic; Željko Kucan; Ivan Gušic; Vladimir B Doronichev; Liubov V Golovanova; Carles Lalueza-Fox; Marco de la Rasilla; Javier Fortea; Antonio Rosas; Ralf W Schmitz; Philip L F Johnson; Evan E Eichler; Daniel Falush; Ewan Birney; James C Mullikin; Montgomery Slatkin; Rasmus Nielsen; Janet Kelso; Michael Lachmann; David Reich; Svante Pääbo
Journal: Science Date: 2010-05-07 Impact factor: 47.728

3. Synthetic maps of human gene frequencies in Europeans.

Authors: P Menozzi; A Piazza; L Cavalli-Sforza
Journal: Science Date: 1978-09-01 Impact factor: 47.728

4. The real cost of sequencing: higher than you think!

Authors: Andrea Sboner; Xinmeng Jasmine Mu; Dov Greenbaum; Raymond K Auerbach; Mark B Gerstein
Journal: Genome Biol Date: 2011-08-25 Impact factor: 13.583

Review 5. Perspectives on human population structure at the cusp of the sequencing era.

Authors: John Novembre; Sohini Ramachandran
Journal: Annu Rev Genomics Hum Genet Date: 2011 Impact factor: 8.929

Review 6. Genotype and SNP calling from next-generation sequencing data.

Authors: Rasmus Nielsen; Joshua S Paul; Anders Albrechtsen; Yun S Song
Journal: Nat Rev Genet Date: 2011-06 Impact factor: 53.242

7. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

8. Worldwide human relationships inferred from genome-wide patterns of variation.

Authors: Jun Z Li; Devin M Absher; Hua Tang; Audrey M Southwick; Amanda M Casto; Sohini Ramachandran; Howard M Cann; Gregory S Barsh; Marcus Feldman; Luigi L Cavalli-Sforza; Richard M Myers
Journal: Science Date: 2008-02-22 Impact factor: 47.728

9. An Aboriginal Australian genome reveals separate human dispersals into Asia.

Authors: Morten Rasmussen; Xiaosen Guo; Yong Wang; Kirk E Lohmueller; Simon Rasmussen; Anders Albrechtsen; Line Skotte; Stinus Lindgreen; Mait Metspalu; Thibaut Jombart; Toomas Kivisild; Weiwei Zhai; Anders Eriksson; Andrea Manica; Ludovic Orlando; Francisco M De La Vega; Silvana Tridico; Ene Metspalu; Kasper Nielsen; María C Ávila-Arcos; J Víctor Moreno-Mayar; Craig Muller; Joe Dortch; M Thomas P Gilbert; Ole Lund; Agata Wesolowska; Monika Karmin; Lucy A Weinert; Bo Wang; Jun Li; Shuaishuai Tai; Fei Xiao; Tsunehiko Hanihara; George van Driem; Aashish R Jha; François-Xavier Ricaut; Peter de Knijff; Andrea B Migliano; Irene Gallego Romero; Karsten Kristiansen; David M Lambert; Søren Brunak; Peter Forster; Bernd Brinkmann; Olaf Nehlich; Michael Bunce; Michael Richards; Ramneek Gupta; Carlos D Bustamante; Anders Krogh; Robert A Foley; Marta M Lahr; Francois Balloux; Thomas Sicheritz-Pontén; Richard Villems; Rasmus Nielsen; Jun Wang; Eske Willerslev
Journal: Science Date: 2011-09-22 Impact factor: 47.728

10. Population structure and eigenanalysis.

Authors: Nick Patterson; Alkes L Price; David Reich
Journal: PLoS Genet Date: 2006-12 Impact factor: 5.917
