| Literature DB >> 26458810 |
Gabriel Renaud1, Viviane Slon2, Ana T Duggan3, Janet Kelso4.
Abstract
UNLABELLED: Ancient DNA is typically highly degraded with appreciable cytosine deamination, and contamination with present-day DNA often complicates the identification of endogenous molecules. Together, these factors impede accurate assembly of the endogenous ancient mitochondrial genome. We present schmutzi, an iterative approach to jointly estimate present-day human contamination in ancient human DNA datasets and reconstruct the endogenous mitochondrial genome. By using sequence deamination patterns and fragment length distributions, schmutzi accurately reconstructs the endogenous mitochondrial genome sequence even when contamination exceeds 50 %. Given sufficient coverage, schmutzi also produces reliable estimates of contamination across a range of contamination rates. AVAILABILITY: https://bioinf.eva.mpg.de/schmutzi/ license:GPLv3.Entities:
Mesh:
Substances:
Year: 2015 PMID: 26458810 PMCID: PMC4601135 DOI: 10.1186/s13059-015-0776-0
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1Schematic illustration of mitochondrial sequences from an ancient DNA library. When DNA from an ancient human sample is sequenced, DNA from the ancient human (endogenous fragments represented in green) as well as contaminant DNA fragments from the individuals who have handled the bone (contaminating fragments represented in red) are included. Because DNA undergoes deamination over time, endogenous fragments are likely to carry deaminated cytosines (represented as T’s in a blue frame), particularly near the ends of the DNA fragments. The non-deaminated cytosines are represented as unframed blue C’s. Schmutzi first identifies the endogenous fragments and, in a second step, uses these to quantify contamination. These steps are repeated until convergence is achieved and a single mitochondrial genome is identified
Fig. 2Schmutzi workflow. An initial contamination estimate is computed using the deamination rates of fragments by conditioning on the other end being deaminated and comparing these to the deamination rate of all fragments in the dataset (contDeam). This prior is provided to call an endogenous consensus (endoCaller). The consensus call is, in turn, used to re-estimate mitochondrial contamination (mtCont). Deamination rates and fragment length distributions are measured for fragments that support endogenous and contaminant mitochondrial genomes (splitEndo). The information from mtCont and splitEndo is used as input for re-calling the endogenous consensus (endoCaller). This cycle is repeated until a stable contamination rate is reached. db database
Inputs and outputs for the different programs described in ‘Methods’. Overview of the three main programs, contDeam, endoCaller and mtCont, with the helper program splitEndo for the iterative mode
| Program | Input | Output |
|---|---|---|
| contDeam | BF | CRDP, EDR |
| endoCaller | BF, EDR, CP, DFL | EB |
| mtCont | BF, EDR, DB, EB | CRDB, CS |
| splitEndo | BF, EB | EDR, DFL |
BF BAM file, CRDP present-day human contamination rate using deamination patterns, CRDB present-day human contamination rate using a database of putative contaminants and the endogenous base, EDR endogenous deamination rates, CP contamination prior, DFL distribution of endogenous/contaminant fragment lengths, EB endogenous base, DB database of putative mitochondrial contaminant genomes, CS most likely contamination source
Fig. 3Effect of increasing contamination on endogenous genome sequence reconstruction and contaminant genome sequence reconstruction of simulated data. Accuracy of the ancient (a) and present-day contaminant (b) mitochondrial consensus sequences produced by schmutzi on simulated data for an early modern human, a Neanderthal and a Denisovan mitochondrial genome. We define an error as either a mismatch or an indel between the predicted endogenous sequence and the published mitochondrial sequence used for simulations. As contamination increases, inference of the endogenous mitochondrial genome becomes more difficult (a). In contrast, the prediction of the contaminant genome becomes more accurate at higher levels of present-day human contamination (b)
Similarity of the predicted endogenous mitochondrial genome sequence to the original Neanderthal reference sequence, at various rates of simulated contamination with present-day human DNA. An endogenous consensus call was performed using schmutzi on all fragments, and using PMDtools followed by htslib on the fragments labeled by PMDtools as endogenous. For comparison, we generated a simple consensus by running htslib on all sequenced fragments. While this approach works well at low amounts of contamination, it produces an incorrect consensus at higher levels of contamination when the presence of contaminating fragments is not accounted for using approaches like PMDtools and schmutzi. The number of indels are reported as either insertions or deletions in either the predicted consensus or the Neanderthal reference; hence, discrepancies in the final sum may occur
| Contamination | Endogenous prediction | Endogenous prediction from | Mitochondrial consensus, called | ||||||
|---|---|---|---|---|---|---|---|---|---|
| rate | from schmutzi | PMDtools and htslib | using htslib on all fragments | ||||||
| Matches | Mismatches | Indels | Matches | Mismatches | Indels | Matches | Mismatches | Indels | |
| 1 % | 16,565 | 0 | 0 | 16,561 | 2 | 6 | 16,561 | 3 | 5 |
| 5 % | 16,565 | 0 | 0 | 16,561 | 2 | 6 | 16,561 | 3 | 5 |
| 10 % | 16,565 | 0 | 0 | 16,561 | 2 | 6 | 16,561 | 3 | 5 |
| 15 % | 16,565 | 0 | 0 | 16,560 | 3 | 6 | 16,553 | 11 | 5 |
| 20 % | 16,565 | 0 | 0 | 16,560 | 3 | 6 | 16,488 | 76 | 5 |
| 25 % | 16,565 | 0 | 0 | 16,558 | 5 | 6 | 16,374 | 190 | 5 |
| 30 % | 16,564 | 1 | 0 | 16,558 | 5 | 6 | 16,371 | 193 | 5 |
| 35 % | 16,564 | 1 | 0 | 16,556 | 7 | 6 | 16,371 | 193 | 5 |
| 40 % | 16,564 | 1 | 0 | 16,555 | 8 | 6 | 16,371 | 193 | 5 |
| 45 % | 16,564 | 1 | 0 | 16,553 | 10 | 6 | 16,371 | 193 | 5 |
| 50 % | 16,563 | 2 | 0 | 16,553 | 10 | 6 | 16,371 | 193 | 5 |
| 55 % | 16,564 | 1 | 0 | 16,554 | 9 | 6 | 16,370 | 194 | 5 |
| 60 % | 16,563 | 2 | 0 | 16,551 | 12 | 6 | 16,368 | 196 | 5 |
| 65 % | 16,563 | 1 | 1 | 16,551 | 12 | 6 | 16,361 | 203 | 5 |
| 70 % | 16,562 | 1 | 2 | 16,548 | 15 | 6 | 16,358 | 206 | 5 |
| 75 % | 16,563 | 1 | 1 | 16,546 | 17 | 6 | 16,355 | 209 | 5 |
| 80 % | 16,561 | 2 | 2 | 16,545 | 18 | 6 | 16,355 | 209 | 5 |
| 85 % | 16,563 | 1 | 1 | 16,544 | 19 | 6 | 16,355 | 209 | 5 |
| 90 % | 16,561 | 3 | 1 | 16,539 | 24 | 6 | 16,355 | 209 | 5 |
| 95 % | 16,550 | 15 | 7 | 16,532 | 31 | 6 | 16,355 | 209 | 5 |
Empirical mitochondrial datasets. The numbers in parentheses represent the deamination rates when conditioning on the other end of the fragment being deaminated for heavily contaminated samples
| Sample | mtDNA | Deamination | Present-day | Library ID | |
|---|---|---|---|---|---|
| ID | coverage | rates (%) | contamination | and reference | |
| (×) | 5′ | 3′ | |||
| Altai Neanderthal | 1076 | 5.7 | 28.4 | Low (∼1 %) | L9198 from [ |
| Denisovan | 258 | 14.8 | 33.9 | Low (∼1 %) | B1108 from [ |
| Ust’-ishim | 124 | 2.7 | 3.4 | Low (∼1 %) | B3899 from [ |
| Mezmaiskaya Neanderthal B9687 | 711 | 8.8 (17.3) | 13.3 (25.8) | High (∼40–50 %) | B9687 from [ |
| Mezmaiskaya Neanderthal B9688 | 636 | 8.5 (15.0) | 12.7 (24.1) | High (∼40–50 %) | B9688 from [ |
Fig. 4Consensus call and contamination estimate accuracy for empirical datasets. a The htslib consensus call (yellow) and the schmutzi consensus call (red) were performed on a subset of the data from three Neanderthals, one Denisovan and one early modern human. The number of mismatches between the mitochondrial consensus sequence and the published mitochondrial genome from the same individual was calculated. b Contamination was estimated using schmutzi (red) and contamMix v.1.0-10 (blue) and compared to the contamination computed using diagnostic positions (gray per fragment and black per base). For the two Mezmaiskaya individuals, the endogenous genome used for comparison was obtained using another library with low levels of contamination from the same individual. diag pos diagnostic position, Nean Neanderthal
Fig. 5Contamination estimates and phylogenetic placement of Mezmaiskaya 1 (library ID B9687). a The posterior probability distribution for contamination in Mezmaiskaya 1. The dotted line represents the estimate obtained using an ad hoc method based on fixed sites. b A maximum-likelihood tree showing the placement of the mitochondrial genome of Mezmaiskaya 1 (labeled MT in the tree) and the inferred contaminant (labeled MTc in the tree), compared to 20 present-day humans and nine archaic humans
Fig. 6Simulated versus measured contamination rates. Several sets contained simulated aDNA fragments from a mitochondrial genome belonging to an early modern human (left), a Neanderthal (middle) or a Denisovan (right). All simulated sets had damage patterns associated with a single-stranded library protocol. The double-stranded figure can be found in Additional file 1: Results. A contaminating present-day human was pooled together at various rates to simulate contamination. The dotted black line represents a perfect prediction, and blue dots are the predicted rates of contamination by schmutzi once convergence was achieved. The red dots represent sets for which the algorithm stopped prematurely due to lack of information about the contaminant fragments. The black whiskers represent the 95 % confidence interval for contamination
Accuracy of contamination estimates on a simulated early modern human with double-stranded deamination patterns and high present-day modern human contamination. Three cores were used for every program. The programs contamMix and contDeam estimate contamination on a per fragment basis while mtCont estimates contamination on a per nucleotide basis. The contamination on a per nucleotide basis is higher due to the longer average length of contaminating fragments
| Contamination | Contamination | Run time |
|---|---|---|
| estimate | estimate | |
| method | ||
| Target contamination rate: 50 % (fragment basis) | ||
| contamMix 1.0-10 | 54.9±0.7 % | 4 days |
| Schmutzi (contDeam) | 49.0±0.5 % | 68 s |
| Target contamination rate: 58.2 % (nucleotide basis) | ||
| Schmutzi (mtCont without the predicted contaminant) | 32.0±1.0 % | 183 m |
| Schmutzi (mtCont with the predicted contaminant) | 60.0±1.0 % | 200 m |
Fig. 7Robustness of the contamination estimate to lower coverage. The simulated dataset with a contamination rate of ∼47 % and single-stranded deamination patterns was subsampled at various coverages from 0 to 1250 ×. Top: Contamination rates were estimated across a range of coverages in simulated data for a Neanderthal, a Denisovan and an early modern human (Ust’-Ishim). Bottom: Contamination estimates when a high-quality mtDNA sequence from a closely related individual is used as the endogenous genome. Robust estimates can be made down to 5 × coverage even at 47 % contamination. For the early modern human, the contamination estimate provided was computed using the database alone and not the prediction of the contaminant genome thus leading to underestimates (see Table 4 for an example of the effect of using the predicted contaminant in the contamination estimate)
Notation used in ‘Methods’
| Symbol | Definition |
|---|---|
|
| Set of all fragments |
|
| Set of all fragments from the endogenous genome |
|
| a particular fragment in |
|
| The event that a sequencing error has occurred |
|
| The event that deamination has occurred |
|
| The event that |
|
| The event that |
|
| Probability that |
|
| The base from the endogenous genome |
|
| The base from the contaminant genome |
|
| The base from the contaminant genome used by mtCont, obtained from a database |
|
| The base at position |
|
| The probability that base |
| ¬ | Denotes the complement of an event (event has not occurred) |
|
| Contamination rate, estimated by contDeam |
|
| Contamination rate, estimated by mtCont |
|
| Prior on contamination rate provided as input to endoCaller |
| endodist | log-normal distribution of the fragment length for the endogenous fragments |
| contdist | log-normal distribution of the fragment length for the contaminant fragments |