MOTIVATION: Ancient DNA (aDNA) molecules in fossilized bones and teeth, coprolites, sediments, mummified specimens and museum collections represent fantastic sources of information for evolutionary biologists, revealing the agents of past epidemics and the dynamics of past populations. However, the analysis of aDNA generally faces two major issues. Firstly, sequences consist of a mixture of endogenous and various exogenous backgrounds, mostly microbial. Secondly, high nucleotide misincorporation rates can be observed as a result of severe post-mortem DNA damage. Such misincorporation patterns are instrumental to authenticate ancient sequences versus modern contaminants. We recently developed the user-friendly mapDamage package that identifies such patterns from next-generation sequencing (NGS) sequence datasets. The absence of formal statistical modeling of the DNA damage process, however, precluded rigorous quantitative comparisons across samples. RESULTS: Here, we describe mapDamage 2.0 that extends the original features of mapDamage by incorporating a statistical model of DNA damage. Assuming that damage events depend only on sequencing position and post-mortem deamination, our Bayesian statistical framework provides estimates of four key features of aDNA molecules: the average length of overhangs (λ), nick frequency (ν) and cytosine deamination rates in both double-stranded regions ( ) and overhangs ( ). Our model enables rescaling base quality scores according to their probability of being damaged. mapDamage 2.0 handles NGS datasets with ease and is compatible with a wide range of DNA library protocols. AVAILABILITY: mapDamage 2.0 is available at ginolhac.github.io/mapDamage/ as a Python package and documentation is maintained at the Centre for GeoGenetics Web site (geogenetics.ku.dk/publications/mapdamage2.0/). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
MOTIVATION: Ancient DNA (aDNA) molecules in fossilized bones and teeth, coprolites, sediments, mummified specimens and museum collections represent fantastic sources of information for evolutionary biologists, revealing the agents of past epidemics and the dynamics of past populations. However, the analysis of aDNA generally faces two major issues. Firstly, sequences consist of a mixture of endogenous and various exogenous backgrounds, mostly microbial. Secondly, high nucleotide misincorporation rates can be observed as a result of severe post-mortem DNA damage. Such misincorporation patterns are instrumental to authenticate ancient sequences versus modern contaminants. We recently developed the user-friendly mapDamage package that identifies such patterns from next-generation sequencing (NGS) sequence datasets. The absence of formal statistical modeling of the DNA damage process, however, precluded rigorous quantitative comparisons across samples. RESULTS: Here, we describe mapDamage 2.0 that extends the original features of mapDamage by incorporating a statistical model of DNA damage. Assuming that damage events depend only on sequencing position and post-mortem deamination, our Bayesian statistical framework provides estimates of four key features of aDNA molecules: the average length of overhangs (λ), nick frequency (ν) and cytosine deamination rates in both double-stranded regions ( ) and overhangs ( ). Our model enables rescaling base quality scores according to their probability of being damaged. mapDamage 2.0 handles NGS datasets with ease and is compatible with a wide range of DNA library protocols. AVAILABILITY: mapDamage 2.0 is available at ginolhac.github.io/mapDamage/ as a Python package and documentation is maintained at the Centre for GeoGenetics Web site (geogenetics.ku.dk/publications/mapdamage2.0/). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
DNA in historical samples is subject to a plethora of environmental conditions and degradation reactions (Sawyer ). Abasic sites, strand breaks, interstrand cross-links and a wide diversity of atypic nucleotidic bases are formed following oxidative and hydrolytic degradation (Lindahl, 1993; Pääbo ), even in the most favorable preservation conditions.Post-mortem DNA damage limits our ability to access ancient DNA (aDNA) sequences and increases the risk of exogenous modern contamination, as undamaged DNA molecules are more prone to enzymatic manipulation. Nucleotide misincorporation patterns, which are mostly driven by deaminated forms of cytosines (uracils), have been suggested as a powerful approach to authenticate aDNA sequences generated on next-generation sequencing (NGS) platforms (Briggs ) and motivated the creation of the mapDamage package (Ginolhac ). Such patterns could vary according to the specific molecular approach used for constructing (Meyer ) and/or amplifying (Ginolhac ) second-generation DNA libraries. For instance, for one of the most popular protocols (Meyer and Kircher, 2010), we observe inflated cytosine deamination rates at 5′-overhangs, an increase in substitution rates toward sequencing starts and complementary increase in rates toward reads ends (Briggs ). Conversely, a novel procedure targeting single-stranded templates has shown elevated substitution rates at both ends (Meyer ).Statistical modeling of such patterns has been developed by Briggs with strand break, overhangs and cytosine deamination as key factors. Using read alignment to reference genomes and maximum likelihood optimization, this approach has delivered the first quantitative estimates of damage parameters. However, the likelihood framework originally implemented scales poorly with the size of NGS datasets, and extensive running times have prevented common usage. Here, we present an extension of mapDamage, which implements a fast approximation of the DNA damage model using a Bayesian framework. mapDamage 2.0 opens the possibility of comparing DNA damage levels across temporal and environmental gradients. Posterior distributions of damage parameters also enable penalizing the quality score of likely damaged bases, reducing noise in downstream single-nucleotide polymorphism (SNP) calling procedures.
2 APPROACH
Here we build on the DNA damage model described in Briggs . We make the simplifying assumption that mutations and post-mortem DNA damage are independent within a fragment, with occurrences depending only on the relative position from the sequence ends.
3 METHODS
The general idea is to mutate bases following an Hasegawa, Kishino and Yano (HKY) transition matrix (Hasegawa ) and then independently add post-mortem damage on top of mutated bases. In this framework, we have multinomial distributions describing the position-specific substitutions for any given base ( and ).
Θ is the HKY transition matrix, and is defined as the DNA damage transition matrix. We assume post-mortem cytosine deamination is the main driver of nucleotide misincorporations in agreement with experimental evidence (Briggs ), providingWhere the base-specific damage probabilities are defined asThe motivation for the base-specific damage probabilities is best explained by the Markov chain in Figure 1 where the first jump decides if the position is before or after a nick; then a substitution could be observed following deamination in overhang or double-stranded DNA regions. A similar Markov chain could be drawn for substitutions (Supplementary Section 1).
Fig. 1.
A schematic view describing the DNA damage Markov chain, which extends the DNA substitution model. The states and correspond to the final nucleotides in the sequences
A schematic view describing the DNA damage Markov chain, which extends the DNA substitution model. The states and correspond to the final nucleotides in the sequencesFor rescaling base quality scores, we assume that and substitutions either originate from true biological differences or from damage driven misincorporations. We can derive an estimate for the probability that a (similar for ) misincorporation at position i along the reads is due to damage using
We can now correct base quality scores provided in alignment BAM files [ at position i for read r] using
4 DISCUSSION
We applied mapDamage2.0 on a series of aDNA sequence datasets generated from a range of periods, source materials and environments (Supplementary Section 3). Posterior predictive intervals and empirical frequencies are in general agreement, as shown for the ancient plague dataset (Supplementary Table S2 and Supplementary Figs S4–S9) (Schuenemann ), demonstrating the adequacy of our method. We observed a ratio of cytosine deamination rates for double- and single-stranded regions orders of magnitude greater than estimates based on in vitro experiments in aqueous solution (0.007 in Lindahl, 1993 versus 0.026–0.070 for Schuenemann in Supplementary Table S1). This suggests that tissue- and sample-specific micro-environmental characteristics drive different DNA damage kinetics in situ. We also found a significant rank correlation between the posterior mean for single-stranded cytosine deamination and sample age (Supplementary Table S3) in agreement with Sawyer . However, remains of similar age and location showed diverse parameter estimates (Supplementary Table S2), suggesting a prominent role of micro-environmental characteristics over age in diagenesis.We also applied our quality rescaling scheme to the sequence data of an Australian Aboriginal individual who died in 1920s (Rasmussen ). This increased the overlap of genotype calls to dbSNP v137, suggesting that lower false-positive SNP calls were achieved (Supplementary Table S4).
5 CONCLUSION
We have developed a computational method for inferring aDNA damage parameters from NGS sequence datasets, with minimal changes to the DNA damage model presented by Briggs . Our model is compatible with the specificities of different sequencing and library building protocols. We believe that downscaling quality scores of likely damaged bases is the first from a long list of possible applications for damage parameter posterior distributions, limiting the impact of nucleotide misincorporations in downstream sequence analyses. The knowledge of such distributions could also be instrumental for improving mapping procedures to reference genomes (Schubert ).
Authors: Svante Pääbo; Hendrik Poinar; David Serre; Viviane Jaenicke-Despres; Juliane Hebler; Nadin Rohland; Melanie Kuch; Johannes Krause; Linda Vigilant; Michael Hofreiter Journal: Annu Rev Genet Date: 2004 Impact factor: 16.830
Authors: Adrian W Briggs; Udo Stenzel; Philip L F Johnson; Richard E Green; Janet Kelso; Kay Prüfer; Matthias Meyer; Johannes Krause; Michael T Ronan; Michael Lachmann; Svante Pääbo Journal: Proc Natl Acad Sci U S A Date: 2007-08-21 Impact factor: 11.205
Authors: Matthias Meyer; Martin Kircher; Marie-Theres Gansauge; Heng Li; Fernando Racimo; Swapan Mallick; Joshua G Schraiber; Flora Jay; Kay Prüfer; Cesare de Filippo; Peter H Sudmant; Can Alkan; Qiaomei Fu; Ron Do; Nadin Rohland; Arti Tandon; Michael Siebauer; Richard E Green; Katarzyna Bryc; Adrian W Briggs; Udo Stenzel; Jesse Dabney; Jay Shendure; Jacob Kitzman; Michael F Hammer; Michael V Shunkov; Anatoli P Derevianko; Nick Patterson; Aida M Andrés; Evan E Eichler; Montgomery Slatkin; David Reich; Janet Kelso; Svante Pääbo Journal: Science Date: 2012-08-30 Impact factor: 47.728
Authors: Verena J Schuenemann; Kirsten Bos; Sharon DeWitte; Sarah Schmedes; Joslyn Jamieson; Alissa Mittnik; Stephen Forrest; Brian K Coombes; James W Wood; David J D Earn; William White; Johannes Krause; Hendrik N Poinar Journal: Proc Natl Acad Sci U S A Date: 2011-08-29 Impact factor: 11.205
Authors: Morten Rasmussen; Xiaosen Guo; Yong Wang; Kirk E Lohmueller; Simon Rasmussen; Anders Albrechtsen; Line Skotte; Stinus Lindgreen; Mait Metspalu; Thibaut Jombart; Toomas Kivisild; Weiwei Zhai; Anders Eriksson; Andrea Manica; Ludovic Orlando; Francisco M De La Vega; Silvana Tridico; Ene Metspalu; Kasper Nielsen; María C Ávila-Arcos; J Víctor Moreno-Mayar; Craig Muller; Joe Dortch; M Thomas P Gilbert; Ole Lund; Agata Wesolowska; Monika Karmin; Lucy A Weinert; Bo Wang; Jun Li; Shuaishuai Tai; Fei Xiao; Tsunehiko Hanihara; George van Driem; Aashish R Jha; François-Xavier Ricaut; Peter de Knijff; Andrea B Migliano; Irene Gallego Romero; Karsten Kristiansen; David M Lambert; Søren Brunak; Peter Forster; Bernd Brinkmann; Olaf Nehlich; Michael Bunce; Michael Richards; Ramneek Gupta; Carlos D Bustamante; Anders Krogh; Robert A Foley; Marta M Lahr; Francois Balloux; Thomas Sicheritz-Pontén; Richard Villems; Rasmus Nielsen; Jun Wang; Eske Willerslev Journal: Science Date: 2011-09-22 Impact factor: 47.728
Authors: Lara M Cassidy; Rui Martiniano; Eileen M Murphy; Matthew D Teasdale; James Mallory; Barrie Hartwell; Daniel G Bradley Journal: Proc Natl Acad Sci U S A Date: 2015-12-28 Impact factor: 11.205
Authors: Ariane Düx; Sebastian Lequime; Philippe Lemey; Sébastien Calvignac-Spencer; Livia Victoria Patrono; Bram Vrancken; Sengül Boral; Jan F Gogarten; Antonia Hilbig; David Horst; Kevin Merkel; Baptiste Prepoint; Sabine Santibanez; Jasmin Schlotterbeck; Marc A Suchard; Markus Ulrich; Navena Widulin; Annette Mankertz; Fabian H Leendertz; Kyle Harper; Thomas Schnalke Journal: Science Date: 2020-06-19 Impact factor: 47.728
Authors: Tim H Heupink; Sankar Subramanian; Joanne L Wright; Phillip Endicott; Michael Carrington Westaway; Leon Huynen; Walther Parson; Craig D Millar; Eske Willerslev; David M Lambert Journal: Proc Natl Acad Sci U S A Date: 2016-06-06 Impact factor: 11.205
Authors: Kirsten I Bos; Günter Jäger; Verena J Schuenemann; Åshild J Vågene; Maria A Spyrou; Alexander Herbig; Kay Nieselt; Johannes Krause Journal: Philos Trans R Soc Lond B Biol Sci Date: 2015-01-19 Impact factor: 6.237
Authors: Hannes Schroeder; María C Ávila-Arcos; Anna-Sapfo Malaspinas; G David Poznik; Marcela Sandoval-Velasco; Meredith L Carpenter; José Víctor Moreno-Mayar; Martin Sikora; Philip L F Johnson; Morten Erik Allentoft; José Alfredo Samaniego; Jay B Haviser; Michael W Dee; Thomas W Stafford; Antonio Salas; Ludovic Orlando; Eske Willerslev; Carlos D Bustamante; M Thomas P Gilbert Journal: Proc Natl Acad Sci U S A Date: 2015-03-09 Impact factor: 11.205