Hansi Weissensteiner1, Lukas Forer1, Liane Fendt1, Azin Kheirkhah1, Antonio Salas2, Florian Kronenberg1, Sebastian Schoenherr1.
Abstract
Within-species contamination is a major issue in sequencing studies, especially for mitochondrial studies. Contamination can be detected by analyzing the nuclear genome or by inspecting polymorphic sites in the mitochondrial genome (mtDNA). Existing methods using the nuclear genome are computationally expensive, and no appropriate tool for detecting sample contamination in large-scale mtDNA data sets is available. Here we present haplocheck, a tool that requires only the mtDNA to detect contamination in both targeted mitochondrial and whole-genome sequencing studies. Our in silico simulations and amplicon mixture experiments indicate that haplocheck detects mtDNA contamination accurately and is independent of the phylogenetic distance within a sample mixture. By applying haplocheck to The 1000 Genomes Project Consortium data, we further evaluate the application of haplocheck as a fast proxy tool for nDNA-based contamination detection using the mtDNA and identify the mitochondrial copy number within a mixture as a critical component for the overall accuracy. The haplocheck tool is available both as a command-line tool and as a cloud web service producing interactive reports that facilitate navigation through the phylogeny of contaminated samples.
Year: 2021 PMID: 33452015 PMCID: PMC7849411 DOI: 10.1101/gr.256545.119
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Figure 1. All possible contamination scenarios. Here, a contamination level of 20% is shown in all three scenarios (A–C). Shared polymorphisms of two haplotypes are included in a single branch, whereas the split into two branches displays the different lineage haplotypes. (A) Shared mutations defining H1a1 (last common ancestor [LCA]) are present at 100%, whereas 7961C is present only at 20%, defining the minor haplogroup H1a1b, and 4639C and 10993A are present at 80%, defining the major haplogroup H1a1a1. (B) A mixture of two haplotypes within a single lineage but of different lineage depths (minor haplotype H1a1 and major haplotype H1a1a1) is observed if no minor haplotype can be found. (C) A mixture of two haplotypes within a single lineage but of different lineage depths (minor H1a1a1 and major H1a1) is detected if the minor haplotype results in a stable haplogroup. Shared homoplasmic sites facilitate the identification of the branching pattern in all three scenarios and improve the overall haplogroup quality. The notation used for variants (e.g., 1438G) consists of the mtDNA position (1438) followed by the actual base change (G).
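Scenario A of Figure 1 can be illustrated with a minimal Python sketch. It is not haplocheck's implementation: the helper names (`Variant`, `parse_variant`, `split_profiles`), the 95% homoplasmy cutoff, and the 50% major/minor boundary are assumptions made for illustration only. The variant 1438G is taken from the caption's notation example and is placed at 100% purely as a stand-in for a shared site.

```python
import re
from typing import List, NamedTuple, Tuple


class Variant(NamedTuple):
    position: int   # mtDNA position, e.g. 1438
    base: str       # observed base change, e.g. "G"
    level: float    # fraction of reads carrying the variant (0..1)


def parse_variant(label: str, level: float) -> Variant:
    """Parse the caption's notation: position followed by base (e.g. '1438G')."""
    match = re.fullmatch(r"(\d+)([ACGT])", label)
    if match is None:
        raise ValueError(f"not a position+base label: {label!r}")
    return Variant(int(match.group(1)), match.group(2), level)


def split_profiles(
    variants: List[Variant],
    homoplasmy_cutoff: float = 0.95,  # assumed cutoff, not from the source
) -> Tuple[List[Variant], List[Variant], List[Variant]]:
    """Split variants into shared, major-only, and minor-only groups by level."""
    shared = [v for v in variants if v.level >= homoplasmy_cutoff]
    major = [v for v in variants if 0.5 <= v.level < homoplasmy_cutoff]
    minor = [v for v in variants if v.level < 0.5]
    return shared, major, minor


# Scenario A from the caption: 20% contamination.
observed = [
    parse_variant("1438G", 1.00),   # illustrative shared site (assumption)
    parse_variant("7961C", 0.20),   # defines minor haplogroup H1a1b
    parse_variant("4639C", 0.80),   # defines major haplogroup H1a1a1
    parse_variant("10993A", 0.80),  # defines major haplogroup H1a1a1
]

shared, major, minor = split_profiles(observed)

# A naive contamination estimate: mean level of the minor-only variants.
estimate = sum(v.level for v in minor) / len(minor)
print(f"shared={len(shared)} major={len(major)} minor={len(minor)} "
      f"estimate={estimate:.2f}")
```

In this sketch the minor-only and major-only variant levels sum to 100%, which is what the two branches below the last common ancestor represent; each resulting group would then be classified into a haplogroup separately.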
Four mixtures (M1–M4) have been analyzed using haplocheck with varying coverage
Four in silico MiSeq mixtures (S1–S4) have been generated and analyzed using haplocheck with varying coverage
F1-Score for different noise categories using the finally chosen setup 3
Four samples have been created, each including two different haplotypes that differ in their amount of mtCN (see mtCN ratio)
Tissue cell types of all 2504 samples from The 1000 Genomes Project Consortium (low-coverage data set)
Figure 2. Violin plot representing the mean coverage over all 2504 samples in the two data sets of The 1000 Genomes Project Consortium (high-coverage and low-coverage). Because of different tissues in the low-coverage data, different clusters of coverage can be observed, resulting in wrong mtDNA contamination estimates for nDNA. The second peak within the low-coverage group vanishes for the high-coverage data, resulting in better estimates for extrapolation.
Haplocheck v1.1.3 runtime for 26 samples of The 1000 Genomes Project Consortium low-coverage data