
gargammel: a sequence simulator for ancient DNA.

Gabriel Renaud, Kristian Hanghøj, Eske Willerslev, Ludovic Orlando.

Abstract

Summary: Ancient DNA has emerged as a remarkable tool to infer the history of extinct species and past populations. However, many of its characteristics, such as extensive fragmentation, damage and contamination, can influence downstream analyses. To help investigators measure in silico how these could impact their analyses, we have developed gargammel, a package that simulates ancient DNA fragments given a set of known reference genomes. Our package simulates the entire molecular process from post-mortem DNA fragmentation and DNA damage to experimental sequencing errors, and reproduces the most common biases observed in ancient DNA datasets. Availability and Implementation: The package is publicly available on GitHub: https://grenaud.github.io/gargammel/ and released under the GPL. Contact: gabriel.renaud@snm.ku.dk. Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author 2016. Published by Oxford University Press.

Year:  2017        PMID: 27794556      PMCID: PMC5408798          DOI: 10.1093/bioinformatics/btw670

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

DNA retrieved from subfossils, also called ancient DNA (aDNA), is increasingly used to reconstruct population histories (Leonardi et al.). The analysis of aDNA data remains challenging, however, because several factors can affect downstream inferences. First, DNA degrades over time, leaving short fragments (30–80 bp) that show substantial nucleotide misincorporations (Briggs et al.). Second, environmental microbes tend to colonize the organism post-mortem (Green et al.). As a result, the endogenous DNA fraction can be extremely low, making shotgun sequencing approaches uneconomical. Third, such exogenous sequences can impact the reconstruction of ancient genomes if they are not properly identified during read alignment. In the case of aDNA retrieved from hominin species, DNA from present-day humans, which can be introduced at any stage including during excavation and in the laboratory, is particularly problematic as it mixes unrelated population histories within a single sample.

Ancient DNA researchers often use simulations to test the robustness of summary statistics aimed at inferring population parameters. While some packages simulate platform-specific sequencing errors, no package is currently available to properly simulate aDNA sequence datasets, including their most prominent characteristics, such as damage, fragmentation, and human and microbial contamination.

Here, we present gargammel, a package that simulates aDNA sequence datasets from a set of genome references representing the microbial fraction, the endogenous fraction and the present-day human contamination. The package can simulate the most common features of aDNA sequences, including post-mortem DNA damage and base misincorporations. In addition, it simulates base compositional bias due to the molecular tools used in library preparation, sequencing bias against GC-rich fragments and errors introduced by the sequencing platform.

2 Methods

Our algorithm reflects the entire molecular and experimental process leading to the retrieval of aDNA fragments (Fig. 1). In its simplest mode, the user first provides three sets of references in fasta format: (i) the microbial contaminants, (ii) the endogenous genome and (iii) the present-day human contamination. The user can also provide full microbial profiles, including taxonomic abundances, to represent more complex sources of microbial contamination. In this case, the corresponding (or closely related) microbial genomes are automatically downloaded from NCBI. The user provides either the desired endogenous coverage or a fixed number of fragments to simulate. The endogenous genome can contain one sequence for haploid organisms, or two sequences to simulate a diploid organism, in which case fragments are sampled from each sequence with equal probability.
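The relationship between the requested endogenous coverage, the composition of the final set and the number of fragments drawn from each source can be sketched as follows. This is a minimal illustration, not gargammel's actual code: the function names `fragment_counts` and `sample_fragment`, the use of a single mean fragment length, and the source labels are all assumptions made for the example.

```python
import random


def fragment_counts(genome_length, coverage, mean_fragment_length, composition):
    """Return how many fragments to draw from each source.

    composition maps a source name to its fraction of the final fragment
    set; it must sum to 1 and contain the key "endogenous" (hypothetical
    labels chosen for this sketch).
    """
    if abs(sum(composition.values()) - 1.0) > 1e-9:
        raise ValueError("composition fractions must sum to 1")
    # coverage = fragments * mean length / genome length, solved for fragments
    n_endogenous = round(coverage * genome_length / mean_fragment_length)
    # Scale up to the whole set, then split it by the requested fractions.
    total = n_endogenous / composition["endogenous"]
    return {name: round(total * frac) for name, frac in composition.items()}


def sample_fragment(haplotypes, length, rng):
    """Draw one fragment from a haploid (1 sequence) or diploid (2) genome."""
    sequence = rng.choice(haplotypes)  # each haplotype is equally likely
    start = rng.randrange(len(sequence) - length + 1)
    return sequence[start:start + length]


counts = fragment_counts(
    genome_length=1_000_000,
    coverage=1.0,
    mean_fragment_length=50,
    composition={"microbial": 0.7, "endogenous": 0.2, "contaminant": 0.1},
)
fragment = sample_fragment(["ACGTACGTAC", "TTTTTTTTTT"], 4, random.Random(0))
```

With a 1 Mb genome, 1X coverage and 50 bp fragments, 20 000 endogenous fragments are needed; at a 20% endogenous fraction the full set therefore holds 100 000 fragments.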
Fig. 1. Gargammel flowchart. A precise number of fragments are selected from the endogenous, present-day human contaminant and microbial genomes. Damage characteristic of aDNA is added and the various types of fragments, contaminants and endogenous ones, are combined; sequencing adapters are then added in silico and sequencing errors with corresponding quality scores are produced. Microbial contamination occurs at a rate of Cbact. whereas present-day human contamination occurs at a rate of Chom. Molecular damage can be added using different models for all three sources via command-line options.

Fragments are selected from all three sets depending on the desired composition of the final set (e.g. 70% microbial, 20% endogenous and 10% present-day human contamination). The size of the fragments can be drawn from a user-specified distribution. As aDNA base composition can differ from that of modern DNA (Jónsson et al.) and vary with the molecular tools used during library preparation (Seguin-Orlando et al.), the base composition can also be modeled. Subsequently, post-mortem deamination is added according to the parameters of standard aDNA damage models (Briggs et al.) or a user-specified matrix of position-specific misincorporation rates. Because fragmented aDNA templates can be shorter than the read length, gargammel then appends the necessary length of the sequencing adapter. Next, the ART sequencing simulator (Huang et al.) is run on the resulting sequences to produce Illumina reads with sequencing errors and quality scores. The various subprograms of the pipeline are called by a wrapper script and are detailed in the Supplementary Methods. Finally, the wrapper script combines reads from the three sources to create the final sequence set.
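The damage and adapter steps can be illustrated with a small sketch. It is a deliberate simplification and not gargammel's implementation: it assumes only two substitution types (C to T near the 5' end, G to A near the 3' end), takes per-position rates as plain lists, and uses hypothetical function names `deaminate` and `add_adapter`. The adapter string is only an illustrative value.

```python
import random


def deaminate(fragment, rate_5p, rate_3p, rng):
    """Apply position-specific C->T (5' end) and G->A (3' end) changes.

    rate_5p[i] is the C->T probability i bases from the 5' end;
    rate_3p[i] is the G->A probability i bases from the 3' end.
    Positions beyond the end of a rate list are left undamaged.
    """
    bases = list(fragment)
    n = len(bases)
    for i, base in enumerate(bases):
        from_3p = n - 1 - i  # distance of this base from the 3' end
        if base == "C" and i < len(rate_5p) and rng.random() < rate_5p[i]:
            bases[i] = "T"
        elif base == "G" and from_3p < len(rate_3p) and rng.random() < rate_3p[from_3p]:
            bases[i] = "A"
    return "".join(bases)


def add_adapter(fragment, adapter, read_length):
    """Append just enough adapter for the template to fill one read.

    Fragments at least as long as the read are returned unchanged; if the
    adapter itself is too short, the result stays shorter than read_length.
    """
    missing = read_length - len(fragment)
    if missing <= 0:
        return fragment
    return fragment + adapter[:missing]


rng = random.Random(0)
damaged = deaminate("CCGG", rate_5p=[1.0], rate_3p=[1.0], rng=rng)
undamaged = deaminate("CCGG", rate_5p=[0.0], rate_3p=[0.0], rng=rng)
template = add_adapter("ACGT", adapter="AGATCGGAAGAGC", read_length=8)
```

Rates of 1.0 at the terminal positions are used here only to make the outcome deterministic; realistic rates are far lower and decay with distance from the fragment ends.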

3 Features

We tested gargammel for its ability to reproduce empirical features found in six previously released aDNA datasets (see Supplementary Results). These include: (i) size distribution; (ii) base composition; (iii) GC-bias due to the DNA polymerase used for library amplification; and (iv) DNA misincorporation. The Supplementary Results show high consistency between observed and simulated distributions, demonstrating the applicability of gargammel as a sequence simulator for aDNA.

Gargammel gives researchers the opportunity to evaluate the robustness of various analyses to aDNA properties. In the Supplementary Results, we present two such types of analyses. First, we evaluated the potential impact of present-day contamination on admixture tests based on the D-statistics (Durand et al.). Simulated sequences were obtained from coalescence models without any admixture. We find that present-day human contamination in ancient human datasets can create spurious signals of admixture depending on the coalescent model, especially when a distant outgroup is used in the test and when the modern contamination source originates from a population coalescing deeply with the endogenous individual. Second, we evaluated whether microbial fragments of increasing size could impact the false positive alignment rate against the human reference genome (see Schubert et al. for an evaluation of aDNA mapping). We identified a size threshold (35 bp) that reduces such impact for the microbial community identified in the 12.8 kyr-old Clovis individual (Rasmussen et al.). Such tests can be adapted to any other situation of interest.
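For readers unfamiliar with the admixture test mentioned above, the D-statistic contrasts two site patterns across a four-taxon tree (H1, H2, H3, outgroup). The sketch below computes it from one allele per taxon per site; this single-sequence form and the function name `d_statistic` are assumptions made for illustration, and the analysis in the Supplementary Results may be set up differently.

```python
def d_statistic(sites):
    """Compute D = (ABBA - BABA) / (ABBA + BABA).

    sites is an iterable of (h1, h2, h3, outgroup) alleles, one tuple per
    site. The outgroup allele is treated as ancestral. ABBA sites are those
    where H2 and H3 share the derived allele; BABA sites are those where
    H1 and H3 share it. All other sites are uninformative and skipped.
    """
    abba = baba = 0
    for h1, h2, h3, out in sites:
        if h1 == out and h2 == h3 and h2 != out:
            abba += 1
        elif h2 == out and h1 == h3 and h1 != out:
            baba += 1
    if abba + baba == 0:
        raise ValueError("no informative sites")
    return (abba - baba) / (abba + baba)


example_sites = [
    ("A", "G", "G", "A"),  # ABBA
    ("A", "G", "G", "A"),  # ABBA
    ("G", "A", "G", "A"),  # BABA
    ("A", "A", "A", "A"),  # uninformative
]
d = d_statistic(example_sites)
```

Without admixture the two patterns are expected in equal numbers, so D is close to zero; contaminating reads that resemble one population can shift the balance and mimic an admixture signal.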
References

1.  Patterns of damage in genomic DNA sequences from a Neandertal.

Authors:  Adrian W Briggs; Udo Stenzel; Philip L F Johnson; Richard E Green; Janet Kelso; Kay Prüfer; Matthias Meyer; Johannes Krause; Michael T Ronan; Michael Lachmann; Svante Pääbo
Journal:  Proc Natl Acad Sci U S A       Date:  2007-08-21       Impact factor: 11.205

2.  Testing for ancient admixture between closely related populations.

Authors:  Eric Y Durand; Nick Patterson; David Reich; Montgomery Slatkin
Journal:  Mol Biol Evol       Date:  2011-02-15       Impact factor: 16.240

3.  ART: a next-generation sequencing read simulator.

Authors:  Weichun Huang; Leping Li; Jason R Myers; Gabor T Marth
Journal:  Bioinformatics       Date:  2011-12-23       Impact factor: 6.937

4.  The Neandertal genome and ancient DNA authenticity.

Authors:  Richard E Green; Adrian W Briggs; Johannes Krause; Kay Prüfer; Hernán A Burbano; Michael Siebauer; Michael Lachmann; Svante Pääbo
Journal:  EMBO J       Date:  2009-08-06       Impact factor: 11.598

5.  mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters.

Authors:  Hákon Jónsson; Aurélien Ginolhac; Mikkel Schubert; Philip L F Johnson; Ludovic Orlando
Journal:  Bioinformatics       Date:  2013-04-23       Impact factor: 6.937

6.  Evolutionary Patterns and Processes: Lessons from Ancient DNA.

Authors:  Michela Leonardi; Pablo Librado; Clio Der Sarkissian; Mikkel Schubert; Ahmed H Alfarhan; Saleh A Alquraishi; Khaled A S Al-Rasheid; Cristina Gamba; Eske Willerslev; Ludovic Orlando
Journal:  Syst Biol       Date:  2017-01-01       Impact factor: 9.160

7.  Ligation bias in Illumina next-generation DNA libraries: implications for sequencing ancient genomes.

Authors:  Andaine Seguin-Orlando; Mikkel Schubert; Joel Clary; Julia Stagegaard; Maria T Alberdi; José Luis Prado; Alfredo Prieto; Eske Willerslev; Ludovic Orlando
Journal:  PLoS One       Date:  2013-10-29       Impact factor: 3.240

8.  The genome of a Late Pleistocene human from a Clovis burial site in western Montana.

Authors:  Morten Rasmussen; Sarah L Anzick; Michael R Waters; Pontus Skoglund; Michael DeGiorgio; Thomas W Stafford; Simon Rasmussen; Ida Moltke; Anders Albrechtsen; Shane M Doyle; G David Poznik; Valborg Gudmundsdottir; Rachita Yadav; Anna-Sapfo Malaspinas; Samuel Stockton White; Morten E Allentoft; Omar E Cornejo; Kristiina Tambets; Anders Eriksson; Peter D Heintzman; Monika Karmin; Thorfinn Sand Korneliussen; David J Meltzer; Tracey L Pierre; Jesper Stenderup; Lauri Saag; Vera M Warmuth; Margarida C Lopes; Ripan S Malhi; Søren Brunak; Thomas Sicheritz-Ponten; Ian Barnes; Matthew Collins; Ludovic Orlando; Francois Balloux; Andrea Manica; Ramneek Gupta; Mait Metspalu; Carlos D Bustamante; Mattias Jakobsson; Rasmus Nielsen; Eske Willerslev
Journal:  Nature       Date:  2014-02-13       Impact factor: 49.962

9.  Improving ancient DNA read mapping against modern reference genomes.

Authors:  Mikkel Schubert; Aurelien Ginolhac; Stinus Lindgreen; John F Thompson; Khaled A S Al-Rasheid; Eske Willerslev; Anders Krogh; Ludovic Orlando
Journal:  BMC Genomics       Date:  2012-05-10       Impact factor: 3.969
