Contamination detection in sequencing studies using the mitochondrial phylogeny.

Hansi Weissensteiner¹, Lukas Forer¹, Liane Fendt¹, Azin Kheirkhah¹, Antonio Salas², Florian Kronenberg¹, Sebastian Schoenherr¹.

Abstract

Within-species contamination is a major issue in sequencing studies, especially for mitochondrial studies. Contamination can be detected by analyzing the nuclear genome or by inspecting polymorphic sites in the mitochondrial genome (mtDNA). Existing methods using the nuclear genome are computationally expensive, and no appropriate tool for detecting sample contamination in large-scale mtDNA data sets is available. Here we present haplocheck, a tool that requires only the mtDNA to detect contamination in both targeted mitochondrial and whole-genome sequencing studies. Our in silico simulations and amplicon mixture experiments indicate that haplocheck detects mtDNA contamination accurately and is independent of the phylogenetic distance within a sample mixture. By applying haplocheck to The 1000 Genomes Project Consortium data, we further evaluate the application of haplocheck as a fast proxy tool for nDNA-based contamination detection using the mtDNA and identify the mitochondrial copy number within a mixture as a critical component for the overall accuracy. The haplocheck tool is available both as a command-line tool and as a cloud web service producing interactive reports that facilitate navigation through the phylogeny of contaminated samples.

Year: 2021 PMID: 33452015 PMCID: PMC7849411 DOI: 10.1101/gr.256545.119

Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043

The human mitochondrial DNA (mtDNA) is an extranuclear DNA molecule of ∼16.6 kb in length (Andrews et al. 1999). It is inherited exclusively through the maternal line, facilitating the reconstruction of the human maternal phylogeny and female (pre-)historical demographic patterns worldwide. The strict maternal inheritance of mtDNA results in a natural grouping of haplotypes into monophyletic clusters, referred to as haplogroups (Kivisild et al. 2006; Kloss-Brandstätter et al. 2011). Furthermore, second-generation sequencing enables the detection of heteroplasmy over the complete mitochondrial genome. Heteroplasmy is the occurrence of at least two different haplotypes of mtDNA in the investigated biological samples (e.g., cells or tissues). Depending on the sequencing coverage, heteroplasmic positions are reliably detectable down to the 1% variant level (Ye et al. 2014; Weissensteiner et al. 2016a). It has been shown that external or cross-contamination (Yao et al. 2007; Just et al. 2014, 2015; Yin et al. 2019; Brandhagen et al. 2020), artificial recombination (Bandelt et al. 2004), or index hopping (Van Der Valk et al. 2019) can generate polymorphic sites that can be erroneously interpreted as heteroplasmic sites (He et al. 2010; Bandelt and Salas 2012; Just et al. 2014, 2015; Ye et al. 2014). Sample contamination is still a major issue in both nuclear DNA (nDNA) and mtDNA sequencing studies that must be prevented to avoid mistakes as they occurred with Sanger sequencing studies in the past (Salas et al. 2005). Because of the accuracy and sensitivity of second-generation sequencing combined with the availability of improved computational models, within-species contamination is traceable down to the 1% level in whole-genome sequencing (WGS) studies (Jun et al. 2012). Several approaches exist to detect contamination in mtDNA sequencing studies. 
We and others previously showed that a contamination detection approach based on the coexistence of phylogenetically incompatible mitochondrial haplotypes observable as polymorphic sites is feasible (Li et al. 2010, 2015; Avital et al. 2012; Weissensteiner et al. 2016a). The method of Dickins et al. (2014) facilitates the check for contamination by building neighbor joining trees. Mixemt (Vohr et al. 2017) incorporates the mitochondrial phylogeny and estimates the most probable haplogroup for each sequence read. The implemented algorithm reveals advantages for contamination detection by detecting several haplotypes within one sample and is independent of variant frequencies. However, it is too computationally expensive when applied to thousands of samples. For ancient DNA studies, schmutzi (Renaud et al. 2015) uses sequence deamination patterns and fragment-length distributions to estimate contamination. Additionally, specific laboratory protocols were designed for eliminating contamination, for example, double-barcode sequencing approaches (Yin et al. 2019). For contamination detection in mitochondrial studies, mostly DNA cross-contamination is investigated (Ding et al. 2015; Wei et al. 2019; Yuan et al. 2020) by applying VerifyBamID (Jun et al. 2012; Zhang et al. 2020). Nevertheless, it becomes apparent that a tool for mitochondrial studies that rapidly and accurately detects contamination in thousands of samples is still missing. Because mtDNA is present at hundreds to several thousand copies per cell, depending on the cell type, even WGS data sets that specifically target the autosomal genome yield high coverage over the mitochondrial genome. In this study, we systematically evaluate the approach of using the mtDNA phylogeny for contamination detection and present haplocheck, a tool to report contamination in mtDNA targeted sequencing and WGS studies. In general, haplocheck works by identifying polymorphic sites down to 1% within an input sample.
By grouping polymorphic sites into haplotypes, haplocheck identifies contamination using the mitochondrial phylogeny and the concept of haplogroups. Overall, this work should show the merits of the mitochondrial genome as an instrument for additional quality control in sequencing studies. It additionally presents haplocheck, a fast and accurate tool that takes advantage of a solid well-known mitochondrial phylogeny for detecting contamination.

Methods

Haplocheck takes BAM or VCF files as input. For BAM files, an initial variant calling step based on a maximum likelihood (ML) function (Ye et al. 2014) is performed. Detected polymorphic variants are then reported in VCF format and split by their variant allele frequency (AF) into a major and minor haplotype profile. A haplotype profile consists of all detected homoplasmic variants and the corresponding allele of each polymorphic variant. Alleles with an AF ≥ 50% are added to the major haplotype profile; otherwise, they are added to the minor haplotype profile. A haplogroup for each haplotype is then determined using HaploGrep 2 (Weissensteiner et al. 2016b). By using the mitochondrial phylogeny, the phylogenetic distance (i.e., number of nodes between the two haplogroups) is calculated. The identification of two stable haplogroups allows haplocheck to report the contamination level for each sample. Three different scenarios need to be considered for contamination detection based on the mitochondrial phylogeny. First, two haplotypes branch into two different nodes: a major haplotype with a mutation level x and a minor haplotype with a mutation level 1 − x (Fig. 1A); here, H1a1 represents the last common ancestor (LCA) of both haplotypes. Second, if polymorphic sites are only identified in the major haplotype, the minor haplotype H1a1 is defined as the LCA (Fig. 1B). Third, if polymorphic sites are only present in the minor haplotype, the major haplotype H1a1 defines the LCA (Fig. 1C).
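The split into a major and a minor profile can be sketched as follows. This is a minimal illustration, not haplocheck's implementation: the `Call` record and its field names are hypothetical, the handling of heterozygous calls follows the rule given in the contamination detection model section, and assigning 1 − AF to the second allele of a call without a reference allele is an assumption of this sketch. Position 1438G stands in for a shared homoplasmic site.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Call:
    """One variant call; field names are illustrative, not mutserve's schema."""
    pos: int                 # mtDNA position
    alts: Tuple[str, ...]    # alternate alleles in VCF order
    gt: str                  # "1" (homoplasmic ALT), "0/1", or e.g. "1/2"
    af: float = 1.0          # level reported in the AF tag


def split_profiles(calls: List[Call]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (major, minor) profiles mapping variant name -> level."""
    major: Dict[str, float] = {}
    minor: Dict[str, float] = {}
    for c in calls:
        first = f"{c.pos}{c.alts[0]}"
        if "/" not in c.gt:
            # Homoplasmic ALT: present in both haplotypes.
            major[first] = 1.0
            minor[first] = 1.0
        elif c.gt == "0/1":
            # AF refers to the non-reference allele. The other haplotype
            # carries the reference base, which is not listed as a variant.
            target = major if c.af >= 0.5 else minor
            target[first] = c.af
        else:
            # No reference allele (e.g. 1/2): the first allele is major.
            # Giving the second allele 1 - AF is an assumption of this sketch.
            major[first] = c.af
            minor[f"{c.pos}{c.alts[1]}"] = 1.0 - c.af
    return major, minor


# Input resembling Figure 1A: a mixture with 20% contamination.
calls = [
    Call(1438, ("G",), "1"),
    Call(4639, ("C",), "0/1", 0.80),
    Call(10993, ("A",), "0/1", 0.80),
    Call(7961, ("C",), "0/1", 0.20),
]
major, minor = split_profiles(calls)
```

With this input, 4639C and 10993A land in the major profile, 7961C in the minor profile, and 1438G in both, which is the pattern each profile would then hand to haplogroup classification.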

Figure 1.

All possible contamination scenarios. Here, a contamination level of 20% is shown in all three scenarios (A–C). Shared polymorphisms of two haplotypes are included in a single branch, whereas the split into two branches displays the different lineage haplotypes. (A) Shared mutations defining H1a1 (last common ancestor [LCA]) are present at 100%; 7961C is present only at 20%, defining the minor haplogroup H1a1b, whereas 4639C and 10993A are present at 80%, defining the major haplogroup H1a1a1. (B) A mixture of two haplotypes within a single lineage but of different lineage depths (minor haplotype H1a1 and major haplotype H1a1a1) is observed if no minor haplotype can be found. (C) A mixture of two haplotypes within a single lineage but of different lineage depths (minor H1a1a1 and major H1a1) is detected if the minor haplotype results in a stable haplogroup. Shared homoplasmic sites facilitate the identification of the branching pattern in all three scenarios and improve the overall haplogroup quality. The notation used for variants (e.g., 1438G) consists of the mtDNA position (1438) followed by the actual base change (G).

Variant calling

The overall performance of haplocheck relies on accurate variant calling. Previously, we developed mtDNA-Server (Weissensteiner et al. 2016a) for the detection of polymorphic sites down to 1% (Ye et al. 2014) in combination with several quality-control criteria such as (1) base quality ≥ 20, (2) >10× depth per strand, (3) 1% minor AF on each strand, and (4) a log-likelihood ratio (LLR) of ≥5. LLR represents the ratio between the estimated frequency of the major allele within the ML function of the polymorphic and the homoplasmic model. For this work, we developed a multithreaded version of mtDNA-Server and integrated it into haplocheck (https://github.com/seppinho/mutserve). As mentioned, detected polymorphic positions are reported in VCF format as heterozygous genotypes (GT) using the AF tag for the estimated contamination level. Although the term genotype applies to autosomal diploid scenarios, we use it here to refer to mtDNA variation patterns that resemble a genotype status. For homoplasmic positions, the final genotype GT ∈ {A,C,G,T} is detected using all input reads (reads) and calculating the genotype probability P using Bayes’ theorem, $P(GT \mid reads) = P(reads \mid GT) \times P(GT) / P(reads)$. To calculate the prior probability P(GT), we used The 1000 Genomes Project Consortium Phase 3 VCF file (The 1000 Genomes Project Consortium 2015) and calculated the frequencies for all sites using VCFtools (Danecek et al. 2011). To compute P(reads|GT), we calculated the sequence error rate ($e_i = 10^{-Q_i/10}$) for each base i of a read, where $Q_i$ is the reported quality value. For each genotype GT (GT ∈ {A,C,G,T}) of a read, we determined the genotype likelihood by multiplying $1 - e_i$ in case the base of the read $r_i$ = GT and $e_i/3$ otherwise over all reads (Ding et al. 2015). The denominator P(reads) is the sum of all four P(reads|GT).
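The genotype probability for a homoplasmic position can be illustrated with a short sketch. It is not the mutserve code: the function name, the input layout, and the example prior are assumptions, and the posterior is normalized over the four products P(reads|GT) × P(GT), which is one reading of the denominator described above. Likelihoods are accumulated in log space to avoid underflow at high coverage; base qualities are assumed to be at least 1 and priors to be positive.

```python
import math
from typing import Dict, Sequence

BASES = "ACGT"


def genotype_posterior(
    read_bases: Sequence[str],
    quals: Sequence[int],
    prior: Dict[str, float],
) -> Dict[str, float]:
    """Posterior P(GT | reads) for each GT in {A, C, G, T} at one position."""
    log_post = {}
    for gt in BASES:
        log_lik = 0.0
        for base, q in zip(read_bases, quals):
            e = 10 ** (-q / 10)  # per-base error rate from the Phred quality
            # 1 - e if the read supports GT, otherwise e / 3 (three wrong bases).
            log_lik += math.log(1 - e if base == gt else e / 3)
        log_post[gt] = log_lik + math.log(prior[gt])
    # Normalize in a numerically stable way (subtract the maximum first).
    top = max(log_post.values())
    weights = {gt: math.exp(v - top) for gt, v in log_post.items()}
    total = sum(weights.values())
    return {gt: w / total for gt, w in weights.items()}


# Four reads support G and one supports A, all at base quality 30.
# The prior is invented for illustration and favors A.
posterior = genotype_posterior(
    "GGGGA", [30] * 5, {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
)
best_genotype = max(posterior, key=posterior.get)
```

Even with a prior that favors A, the four high-quality G reads dominate the likelihood, so `best_genotype` is G in this example.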

Contamination detection model

The contamination model within haplocheck includes steps for (1) splitting homoplasmic and polymorphic sites into two haplotype profiles, (2) haplogroup classification for each haplotype profile, and (3) filtering based on quality-control criteria. Homozygous genotypes for the alternate alleles (ALT; i.e., homoplasmic sites) are added to both haplotypes, whereas heterozygous genotypes are split using the AF tag. Because mutserve always reports the AF of the nonreference allele, the split method applies the following rule: In case a GT 0/1 (e.g., Ref: G, ALT: C) with an AF of 0.20 is included, the split method defines C as the minor allele, 0.2 as the minor level, and 0.8 as the major level. Conversely, when a GT 0/1 (e.g., Ref: G, ALT: C) with an AF of 0.80 is included, the C is defined as the major allele. If no reference allele is included (e.g., 1/2), we use the first allele as the major allele and assign the included AF to that allele. For haplogroup classification, we use HaploGrep 2 (Weissensteiner et al. 2016b) based on Phylotree 17 (van Oven and Kayser 2009), which has been refactored as a module and integrated directly into haplocheck. As a result, HaploGrep 2 reports the haplogroup of both the major and minor haplotype. For each analyzed sample, the LCA is required to estimate the final contamination level and to calculate the distance between the two haplotypes. Therefore, we traverse Phylotree from the rCRS reference to each haplotype node. The LCA is determined by starting at the final node of haplotype 1 (h1) and by iterating back until the reference (rCRS) is reached. Then, we iterate back to rCRS for haplotype 2 (h2) until the first node included in h1 is identified. This node then defines the LCA of both haplotypes. Only polymorphic positions starting from the LCA and showing a phylogenetic weight greater than five are taken into account for the subsequent filtering step. 
The phylogenetic weight describes the frequency of each mutation in Phylotree and is scaled from one to 10 in a nonlinear way. Variants with a high occurrence in Phylotree are assigned a small phylogenetic weight. Furthermore, back mutations (i.e., mutation changes back to the rCRS reference within a specific haplogroup) and deletions on polymorphic sites are ignored by haplocheck. By using all previous information, we finally estimate the contamination level for samples fulfilling the following three quality-control criteria: (1) two or more polymorphic variants starting from the LCA, (2) ≥0.5 haplogroup quality for each haplotype (calculated by HaploGrep 2 using the Kulczynski metric), and (3) phylogenetic distance of two or more. The median mutation level of all detected polymorphic sites reaching the described criteria is calculated independently for both haplotypes (h1 and h2). Haplocheck reports the median level of the minor haplotype as the final contamination level.
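The LCA search and the final estimate can be sketched on a toy tree. Everything here is illustrative: the parent map is a hand-made stand-in for Phylotree, the distance is counted as the number of edges on the path between the two haplogroups, the filter on phylogenetic weight is omitted, criterion (1) is read as applying to the polymorphic sites of both haplotypes combined, and the fallback to 1 − x when the minor haplotype has no polymorphic sites of its own is an assumption based on the scenarios in Figure 1.

```python
from statistics import median
from typing import Dict, List, Optional, Sequence

# Toy parent map (child -> parent); a stand-in for Phylotree, not real data.
PARENT: Dict[str, str] = {
    "H1a1a1": "H1a1a",
    "H1a1a": "H1a1",
    "H1a1b": "H1a1",
    "H1a1": "H1a",
    "H1a": "H1",
    "H1": "H",
    "H": "root",
}


def path_to_root(node: str, parent: Dict[str, str] = PARENT) -> List[str]:
    """Nodes visited when iterating back from a haplogroup to the tree root."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def find_lca(h1: str, h2: str, parent: Dict[str, str] = PARENT) -> str:
    """First node on the path of h2 that also lies on the path of h1."""
    on_h1_path = set(path_to_root(h1, parent))
    for node in path_to_root(h2, parent):
        if node in on_h1_path:
            return node
    raise ValueError("haplogroups do not share a root")


def distance(h1: str, h2: str, parent: Dict[str, str] = PARENT) -> int:
    """Number of edges between h1 and h2 through their LCA."""
    lca = find_lca(h1, h2, parent)
    return path_to_root(h1, parent).index(lca) + path_to_root(h2, parent).index(lca)


def contamination_level(
    major_levels: Sequence[float],
    minor_levels: Sequence[float],
    quality_major: float,
    quality_minor: float,
    dist: int,
) -> Optional[float]:
    """Median minor level if the three quality criteria hold, else None."""
    if len(major_levels) + len(minor_levels) < 2:
        return None  # fewer than two polymorphic variants below the LCA
    if min(quality_major, quality_minor) < 0.5:
        return None  # haplogroup quality too low for one haplotype
    if dist < 2:
        return None  # haplogroups too close in the phylogeny
    if minor_levels:
        return median(minor_levels)
    # Minor haplotype has no polymorphic sites of its own: use 1 - x.
    return 1.0 - median(major_levels)


lca = find_lca("H1a1a1", "H1a1b")
dist = distance("H1a1a1", "H1a1b")
level = contamination_level([0.8, 0.8], [0.2], 0.9, 0.9, dist)
```

For the Figure 1A configuration, the sketch finds H1a1 as the LCA and reports the level of the single minor site as the contamination level.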

Report

Haplocheck produces a tab-delimited text file and an interactive HTML report. For each sample, haplocheck determines the final contamination status, the contamination level, and quality metrics such as the phylogenetic distance or the coverage. Additionally, a graphical phylogenetic tree is generated dynamically for each sample, including the path from the rCRS to the two final haplotypes. This allows the user to manually inspect edge cases, visualize the contamination graphically, and analyze the source of contamination (see Supplemental Fig. S1).

Results

Haplocheck is available as a standalone command-line tool and as a cloud web service. In both cases, the identical computational workflow consisting of variant calling (for BAM input only), haplogroup classification, and contamination detection is applied. The Cloudgene framework (Schönherr et al. 2012) is used to provide the workflow as a service to users. Cloudgene also powers large-scale genetic services such as the Michigan Imputation Server (Das et al. 2016) and the mtDNA-Server (Weissensteiner et al. 2016a), and it greatly improves user experience and productivity.

Evaluation

To test the performance of haplocheck within targeted mtDNA and WGS studies, we analyzed several data sets. First, we checked previously generated mtDNA mixtures of two samples including different haplotypes (Weissensteiner et al. 2016a). The mitochondrial genomes of the mixed fragments (1%–50%) were amplified by PCR and sequenced on an Illumina HiSeq system. We analyzed the original samples (coverage 60,000×) and down-sampled them accordingly. Our results show that a coverage of >100× and >600× is required to detect contamination of 10% and 1%, respectively (see Table 1). Of note, The 1000 Genomes Project Consortium low-coverage sample collection (2–4× nDNA coverage) already includes sufficient coverage over the mitochondrial genome to detect contamination down to 1% (1800× mtDNA coverage).

Table 1.

Four mixtures (M1–M4) have been analyzed using haplocheck with varying coverage

We further simulated sequencing data mixtures for different sequencing instruments by using the ART-NGS read simulator (Huang et al. 2012). The generated mixtures differ in (1) contamination level (1%–50%), (2) coverage (10×–5000×), and (3) phylogenetic distance between the two mixed haplotypes (three to 23 phylogenetic nodes between them). The results were highly concordant with the mixtures and show that haplocheck is able to detect contamination accurately even for samples including haplotypes with a close phylogenetic distance (see Table 2; for other phylogenetic distances, see Supplemental Table S1).

Table 2.

Four in silico MiSeq mixtures (S1–S4) have been generated and analyzed using haplocheck with varying coverage

In a second step, we created and analyzed in silico data by mixing random genotype profiles from the currently best available mtDNA phylogeny derived from Phylotree Build 17 (code on GitHub). The overall performance of haplocheck depends on a good classification of samples into haplogroups even from noisy variant calling data sets. Therefore, we initially created input profiles for each displayed haplogroup, amounting to 5426 profiles in total. Each input profile consists of a list of polymorphisms from the tree reference (rCRS) to the actual node (or haplogroup). Our test data consist of 500,000 unique mixtures of pairwise haplogroup profiles derived from the overall phylogeny comprising 5500 haplogroups (250,000 contaminated, 250,000 not-contaminated samples) and 100,000 mixtures from the haplogroup H-subtree, including 977 haplogroups. The generation of in silico data from the H-subtree allows us to test the performance of samples showing a smaller phylogenetic distance. To account for noisy input data, we artificially added random variants to each input profile. This was done by removing expected variants from the input profile and adding random variants available within Phylotree. The amount of noise varies from zero to eight variants for each mixture. The proportion of added versus removed variants is calculated randomly. To make the test more restrictive, we only added phylogenetically relevant variants from Phylotree. Variants that are not present in Phylotree (i.e., so far unknown in the phylogeny) would not affect the contamination estimation. Finally, three data sets (noise 0, 4, 8) derived from two different trees (complete tree, haplogroup H subtree) have been generated, consisting of 500,000 and 100,000 mixtures, respectively.
The F1-score, defined as (2 × precision × sensitivity)/(precision + sensitivity), has been calculated for each mixture to analyze the overall accuracy of haplocheck. To determine the best haplocheck configuration regarding accuracy, we tested different setups for all six data sets. Each setup includes a different threshold for (1) the amount of major and minor polymorphic sites, (2) the minimum allowed phylogenetic distance between two profiles, and (3) the haplogroup classification model (Kulczynski, Hamming, Jaccard). The six best setups have been tested to determine the optimal trade-off between noise, haplogroup distance, and the overall F1-score (see Supplemental Fig. S2). In our experiments, setup 3 showed the best trade-off between haplogroup distance and overall accuracy. This setup allows us to detect contamination of samples with a phylogenetic distance of at least two and has been used as the final setup for the contamination method. Table 3 summarizes the F1-score statistics for Setup 3. The result indicates that haplocheck is able to accurately detect contamination of two samples also in the case in which noise is included in the input profiles and the distance between the two haplogroups is small.
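As a worked illustration of the F1-score definition above, the following sketch computes it from confusion counts. The counts in the example are invented and are not results from the paper; returning 0.0 when there are no true positives is a convention chosen here to avoid division by zero.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = (2 x precision x sensitivity) / (precision + sensitivity)."""
    if tp == 0:
        return 0.0  # convention of this sketch: nothing correctly detected
    precision = tp / (tp + fp)    # detected contaminations that are real
    sensitivity = tp / (tp + fn)  # real contaminations that were detected
    return 2 * precision * sensitivity / (precision + sensitivity)


# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
example_f1 = f1_score(tp=90, fp=10, fn=30)
```

Here precision is 0.90 and sensitivity is 0.75, so the score sits between the two but closer to the lower value, which is why F1 penalizes a configuration that trades many missed contaminations for few false alarms.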

Table 3.

F1-Score for different noise categories using the finally chosen setup 3

In a last step, we also evaluated the performance of haplocheck as a tool to extrapolate the nDNA contamination level from mtDNA data. Therefore, we generated four whole-genome in silico samples from two random The 1000 Genomes Project Consortium samples showing no signs of contamination based on the VerifyBamID score (Supplemental Table S2). To analyze the impact of the mitochondrial copy number (mtCN), four samples with different amounts of mtCN were chosen from The 1000 Genomes Project Consortium sample collection. The mtCN has been inferred using the formula (mtDNA coverage)/(nDNA coverage × 2) (Ding et al. 2015). For each sample, again four different in silico whole-genome mixtures between 1% and 10% have been created and analyzed using VerifyBamID2 (for nDNA) and haplocheck (for mtDNA). Table 4 summarizes the findings, whereby each sample cell includes the average delta between the calculated and the expected value for all four different mixtures per sample. Levels obtained from VerifyBamID2 and haplocheck correlate if the copy number (CN) for each haplotype in the sample is similar (see samples 1 and 2). Values obtained from sample 3 still correlate, because the main haplotype shows a higher mtCN and is therefore less affected by the lower mtCN of haplotype 2. In a worst-case scenario (sample 4), in which the main haplotype has a lower mtCN and the minor haplotype a higher mtCN, the values obtained from haplocheck and VerifyBamID2 differ substantially.
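Why the mtCN matters can be made concrete with a simple mixing model. This model is an assumption of the sketch, not a formula from the paper: the nDNA contamination level is taken as the fraction of nuclear genomes contributed by the contaminant, and each source contributes mtDNA in proportion to that fraction times its own copy number. All numbers are invented.

```python
def expected_mtdna_level(ndna_level: float, cn_major: float, cn_minor: float) -> float:
    """Expected minor-haplotype fraction in mtDNA reads under the mixing model.

    ndna_level: fraction of nuclear genomes from the contaminating sample
    cn_major:   mtDNA copy number of the main sample
    cn_minor:   mtDNA copy number of the contaminating sample
    """
    minor_copies = ndna_level * cn_minor
    major_copies = (1.0 - ndna_level) * cn_major
    return minor_copies / (minor_copies + major_copies)


# 5% nDNA contamination under three hypothetical copy-number ratios.
equal_cn = expected_mtdna_level(0.05, cn_major=200, cn_minor=200)
minor_higher = expected_mtdna_level(0.05, cn_major=100, cn_minor=600)
major_higher = expected_mtdna_level(0.05, cn_major=600, cn_minor=100)
```

With equal copy numbers the mtDNA level matches the nDNA level. When the contaminant carries six times more mtDNA copies, the same 5% nDNA contamination appears as roughly 24% on mtDNA, whereas the reverse ratio shrinks it to below 1%, which mirrors the pattern described for samples 3 and 4.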

Table 4.

Four samples including two different haplotypes, in which each haplotype shows a different amount of mtCN have been created (see mtCN ratio)

A drastic shift in the CN is atypical for a large sequencing project. In work by Zhang et al. (2017), the CN of 1500 women aged 17–85 was analyzed, showing that most samples are within a range of 100–300 (mean, 169; DNA source, whole blood). In work by Fazzini et al. (2019), the mtCN was analyzed in a cohort of 4812 chronic kidney disease patients, also showing only moderate differences (mean, 107.2; SD, 36.4; DNA source, whole blood).

Contamination detection in The 1000 Genomes Project Consortium

To evaluate haplocheck on a WGS study, we extracted the mtDNA genome reads (labeled as chromosome MT) from samples (Phase 3, low-coverage) from The 1000 Genomes Project Consortium (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/), resulting in a sample size of 2504 and a total file size of 95 GB. As an initial check, we compared variants detected by mutserve to the official The 1000 Genomes Project Consortium data release using callMom (https://github.com/juansearch/callMom) and determined the haplogroup using HaploGrep 2. Overall, 98% of the samples (n = 2504) result in an identical haplogroup (see Supplemental Fig. S3). The downloaded BAM files were then used as an input for haplocheck to test for contamination. Based on the mitochondrial genome, 5.07% (127 of 2504) of all samples show signs of contamination on mtDNA (see Supplemental Table S3). As previously shown, the performance of haplocheck as a proxy for nDNA is dependent on the mtCN. Because this is uneven for the low-coverage data from The 1000 Genomes Project Consortium, we looked at the tissue source used for DNA extraction. As depicted in Table 5 and Supplemental Figure S4, there is a significant difference in the mtCN owing to the two tissue types used within The 1000 Genomes Project Consortium ($P < 2.2 \times 10^{-16}$, independent t-test).

Table 5.

Tissue cell types of all 2504 samples from The 1000 Genomes Project Consortium (low-coverage data set)

Because of the different mtCN, we split The 1000 Genomes Project Consortium samples into two groups and calculated the Pearson correlation coefficient (R) separately. Group 1 (mtCN ≥ 300, n = 2004) shows a correlation of R = 0.72 between the contamination levels of VerifyBamID2 and haplocheck, with the contamination levels reported by haplocheck ranging from 0.8% to 4.8% (see Supplemental Table S4). The levels are in a very similar range as the VerifyBamID estimates for The 1000 Genomes Project Consortium, because only samples showing a VerifyBamID level of <3% are included. Group 2 (mtCN < 300, n = 500) shows a correlation of only R = 0.31, and contamination levels reported by haplocheck are between 1.8% and 25.5% (see Supplemental Table S5). Because of the higher mtCN, samples of group 1 are more stable, and contamination levels are in a similar range. Samples with a lower mtCN (group 2) differ substantially, because a contamination with a sample showing a higher amount of mtCN affects the mtDNA contamination level. Therefore, group 2 shows a much higher discrepancy in the contamination level compared with VerifyBamID2. To verify the feasibility of haplocheck in WGS studies with only moderate differences between the samples, we downloaded the mtDNA (labeled as chromosome chrM) of the deep-sequenced The 1000 Genomes Project Consortium sample collection (30× coverage; ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/1000G_2504_high_coverage/), amounting to 176 GB. Compared to the previously analyzed low-coverage sample collection, the mtDNA coverage is much more homogeneous for the high-coverage data (see Fig. 2). Haplocheck detected only minor mtDNA contamination in seven samples (0.9%–1.7%) (see Supplemental Table S6); all contamination spuriously detected in the low-coverage data owing to the different mtCN has vanished.

Figure 2.

Violin plot representing the mean coverage over all 2504 samples in the two The 1000 Genomes Project Consortium data sets (high-coverage and low-coverage). Because of different tissues in the low-coverage data, different clusters of coverage can be observed, resulting in wrong mtDNA contamination estimates for nDNA. It can be seen that the second peak within the low-coverage group vanishes for the high-coverage data, resulting in better estimates for extrapolation.

In the last step, we looked at samples that have been excluded from The 1000 Genomes Project Consortium sample collection (nDNA contamination level >3% using VerifyBamID). In total, four samples have been excluded by VerifyBamID using its sequence-only method (free-mix parameter) and seven samples using its sequence and array methods (chip-mix parameter). Haplocheck was able to identify these samples as contaminated with a correlation of 89% between the nDNA and mtDNA level (Supplemental Table S7).

nDNA of mitochondrial origin

nDNA of mitochondrial origin (NUMT) can result either in a coverage drop on mtDNA sites owing to the alignment of mitochondrial reads to NUMTs or in false-positive polymorphic calls owing to the alignment of NUMT reads to the mitochondrial genome (Maude et al. 2019). Approaches exist (Goto et al. 2011; Samuels et al. 2013) that exclude reads mapping to the nDNA, but they reduce overall coverage and may result in false negatives (Albayrak et al. 2016). In work by Weissensteiner et al. (2016a), we annotated mitochondrial sites coming from an NUMT reference database (Li et al. 2012; Dayama et al. 2014), although limited to known NUMTs. For contamination detection with haplocheck, false-positive polymorphic sites owing to NUMTs are expected to have only a minor effect because they typically do not resemble complete mitochondrial haplotypes. Nevertheless, sufficient coverage for the haplogroup-defining variants is still required when dealing with NUMTs. In a study conducted by Maude et al. (2019), an in silico model was set up to analyze the homology between mitochondrial variants and NUMTs. They show that 29 variants representing haplogroups A, H, L2, M, and U did not cause loss of coverage, but nevertheless, a substantial loss of coverage has been identified for specific sites (e.g., G1888A, A4769G). In a recent work, the presence of a mega-NUMT that could mimic contamination on mitochondrial haplogroup level is described (Balciuniene and Balciunas 2019). This indicates that in very rare cases, NUMTs could indeed resemble complete mitochondrial haplotypes and lead to a false-positive contamination result (Salas et al. 2020; Wei et al. 2020). Although we did not observe NUMT-related issues in the validation of The 1000 Genomes Project Consortium, we cannot entirely rule out possible NUMT effects on contamination detection.

Runtime and performance

Haplocheck scales linearly with the data size (i.e., sequence reads). For the complete sample collection of The 1000 Genomes Project Consortium in BAM format, the contamination estimate has been calculated within 5.95 h using a single core (Intel Xeon CPU 2.30 GHz) and 1 GB RAM and 1.85 h using four cores and 4 GB RAM, respectively. Table 6 includes the runtime for 26 samples in BAM format for VerifyBamID2 (input WGS data, varying amounts of markers and cores) and haplocheck (input mtDNA only).

Table 6.

Haplocheck v1.1.3 runtime for 26 samples of The 1000 Genomes Project Consortium low-coverage data

Contamination source

Haplocheck always reports both the major and minor haplotypes for each sample. Therefore, possible sources of contamination can be investigated. For example, sample HG00740 from The 1000 Genomes Project Consortium low-coverage data set shows a contamination level of 2.74% on nDNA (using VerifyBamID2) and 3% on mtDNA (using haplocheck). By looking at the phylogenetic tree that is created for each sample by haplocheck, the contaminating minor haplogroup B2b3a can be identified. The identical haplogroup is also assigned to sample HG01079, which has been analyzed in the same center with a similar mtCN. Such phylogenetic information provided within the interactive HTML report can help in identifying the source of contamination for all three types of contamination.

Discussion

There are many examples in the literature showing the negative impact of artifacts on mtDNA data sets in different areas of research, including medical studies, forensic genetics, and human population studies (He et al. 2010; Bandelt and Salas 2012; Just et al. 2014; Ye et al. 2014). The approach described in this paper takes advantage of the mitochondrial phylogeny and is capable of detecting sample contamination based on mitochondrial haplotype mixtures. By creating several in silico data sets and analyzing The 1000 Genomes Project Consortium samples, we show that haplocheck can be used both in studies using targeted amplification of mtDNA and in those using WGS data. We also investigated the influence of the mtCN and advise taking the mtCN into consideration when using mtDNA estimates for extrapolating nDNA levels. Several other methods for contamination detection exist. For nDNA sequences, VerifyBamID2 (Zhang et al. 2020) offers an ancestry-agnostic DNA contamination estimation method and is widely used in WGS studies. For ancient studies, schmutzi (Renaud et al. 2015) provides a contamination estimation tool by using sequence deamination patterns; the approach presented in Fu et al. (2013) includes a likelihood-based method to estimate the frequency of present-day human mtDNA haplotypes in the contaminator population. A further approach was suggested by Dickins et al. (2014), describing a pipeline for contamination detection accessible through the Galaxy online platform (Afgan et al. 2018). Some limitations apply to the phylogenetic-based contamination check proposed in the present investigation, previously applied in a semi-automatic manner (Li et al. 2010; Avital et al. 2012). There is currently a publication bias in favor of the European mtDNA haplogroups that provide the most phylogenetic details, whereas especially African haplogroups are underrepresented (626 African haplogroups compared to 2546 European haplogroups in Phylotree 17). 
Although the major changes in the phylogeny were performed during the initial growing process of the tree, the last few years showed only refinements of lineages and branches. Therefore, major changes are no longer expected in the human phylogeny, but data from upcoming sequencing studies will help to refine existing groups. Further, contamination detection based on mitochondrial genomes is not applicable in scenarios in which samples belong to the same maternal line (e.g., mother–offspring) owing to an identical haplogroup. The application of haplocheck to ancient DNA studies is limited due to the coverage required for detecting polymorphic sites. Importantly, it has also been previously shown for ancient studies that the mtDNA-to-nDNA ratio influences the accuracy of extrapolating nDNA contamination levels from mtDNA estimates (Furtwängler et al. 2018). Overall, we showed that haplogroup-based contamination detection as performed by haplocheck can be used systematically as a quality measure for mtDNA data. This kind of analysis could be performed before data interpretation and publication of mtDNA sequencing projects.

Software availability

Haplocheck is available on GitHub (https://github.com/genepi/haplocheck) under the MIT license and requires Java 8 or higher for local execution. All generated data, scripts, and reports are available within this repository. The web service can be accessed via Mitoverse (https://mitoverse.i-med.ac.at). The complete source code from GitHub has been uploaded to the Supplemental Material as Supplemental Code.

Competing interest statement

The authors declare no competing interests.

46 in total

1. Extensive tissue-related and allele-related mtDNA heteroplasmy suggests positive selection for somatic mutations.

Authors: Mingkun Li; Roland Schröder; Shengyu Ni; Burkhard Madea; Mark Stoneking
Journal: Proc Natl Acad Sci U S A Date: 2015-02-09 Impact factor: 11.205

2. An Effective Strategy to Eliminate Inherent Cross-Contamination in mtDNA Next-Generation Sequencing of Multiple Samples.

Authors: Chun Yin; Yang Liu; Xu Guo; Deyang Li; Wan Fang; Jin Yang; Feng Zhou; Wancheng Niu; Yongfeng Jia; Hushan Yang; Jinliang Xing
Journal: J Mol Diagn Date: 2019-04-23 Impact factor: 5.568

3. A phylogenetic approach for haplotype analysis of sequence data from complex mitochondrial mixtures.

Authors: Samuel H Vohr; Rachel Gordon; Jordan M Eizenga; Henry A Erlich; Cassandra D Calloway; Richard E Green
Journal: Forensic Sci Int Genet Date: 2017-05-29 Impact factor: 4.882

4. Detecting and estimating contamination of human DNA samples in sequencing and array-based genotype data.

Authors: Goo Jun; Matthew Flickinger; Kurt N Hetrick; Jane M Romm; Kimberly F Doheny; Gonçalo R Abecasis; Michael Boehnke; Hyun Min Kang
Journal: Am J Hum Genet Date: 2012-10-25 Impact factor: 11.025

5. Detecting heteroplasmy from high-throughput sequencing of complete human mitochondrial DNA genomes.

Authors: Mingkun Li; Anna Schönberg; Michael Schaefer; Roland Schroeder; Ivane Nasidze; Mark Stoneking
Journal: Am J Hum Genet Date: 2010-08-13 Impact factor: 11.025

6. Fidelity of capture-enrichment for mtDNA genome sequencing: influence of NUMTs.

Authors: Mingkun Li; Roland Schroeder; Albert Ko; Mark Stoneking
Journal: Nucleic Acids Res Date: 2012-05-30 Impact factor: 16.971

7. Heteroplasmic mitochondrial DNA mutations in normal and tumour cells.

Authors: Yiping He; Jian Wu; Devin C Dressman; Christine Iacobuzio-Donahue; Sanford D Markowitz; Victor E Velculescu; Luis A Diaz; Kenneth W Kinzler; Bert Vogelstein; Nickolas Papadopoulos
Journal: Nature Date: 2010-03-03 Impact factor: 49.962

8. The variant call format and VCFtools.

Authors: Petr Danecek; Adam Auton; Goncalo Abecasis; Cornelis A Albers; Eric Banks; Mark A DePristo; Robert E Handsaker; Gerton Lunter; Gabor T Marth; Stephen T Sherry; Gilean McVean; Richard Durbin
Journal: Bioinformatics Date: 2011-06-07 Impact factor: 6.937

9. Assessing Mitochondrial DNA Variation and Copy Number in Lymphocytes of ~2,000 Sardinians Using Tailored Sequencing Analysis Tools.

Authors: Jun Ding; Carlo Sidore; Thomas J Butler; Mary Kate Wing; Yong Qian; Osorio Meirelles; Fabio Busonero; Lam C Tsoi; Andrea Maschio; Andrea Angius; Hyun Min Kang; Ramaiah Nagaraja; Francesco Cucca; Gonçalo R Abecasis; David Schlessinger
Journal: PLoS Genet Date: 2015-07-14 Impact factor: 5.917

10. Ancestry-agnostic estimation of DNA sample contamination from sequence reads.

Authors: Fan Zhang; Matthew Flickinger; Sarah A Gagliano Taliun; Gonçalo R Abecasis; Laura J Scott; Steven A McCaroll; Carlos N Pato; Michael Boehnke; Hyun Min Kang
Journal: Genome Res Date: 2020-01-24 Impact factor: 9.043

8 in total

1. A bioinformatics pipeline for estimating mitochondrial DNA copy number and heteroplasmy levels from whole genome sequencing data.

Authors: Stephanie L Battle; Daniela Puiu; Joost Verlouw; Linda Broer; Eric Boerwinkle; Kent D Taylor; Jerome I Rotter; Stephan S Rich; Megan L Grove; Nathan Pankratz; Jessica L Fetterman; Chunyu Liu; Dan E Arking
Journal: NAR Genom Bioinform Date: 2022-05-17

2. Benchmarking Low-Frequency Variant Calling With Long-Read Data on Mitochondrial DNA.

Authors: Theresa Lüth; Susen Schaake; Anne Grünewald; Patrick May; Joanne Trinh; Hansi Weissensteiner
Journal: Front Genet Date: 2022-05-19 Impact factor: 4.772

3. An in-depth analysis of the mitochondrial phylogenetic landscape of Cambodia.

Authors: Anita Kloss-Brandstätter; Monika Summerer; David Horst; Basil Horst; Gertraud Streiter; Julia Raschenberger; Florian Kronenberg; Torpong Sanguansermsri; Jürgen Horst; Hansi Weissensteiner
Journal: Sci Rep Date: 2021-05-24 Impact factor: 4.379

4. From Forensics to Clinical Research: Expanding the Variant Calling Pipeline for the Precision ID mtDNA Whole Genome Panel.

Authors: Filipe Cortes-Figueiredo; Filipa S Carvalho; Ana Catarina Fonseca; Friedemann Paul; José M Ferro; Sebastian Schönherr; Hansi Weissensteiner; Vanessa A Morais
Journal: Int J Mol Sci Date: 2021-11-06 Impact factor: 5.923

5. The Value of Whole-Genome Sequencing for Mitochondrial DNA Population Studies: Strategies and Criteria for Extracting High-Quality Mitogenome Haplotypes.

Authors: Kimberly Sturk-Andreaggi; Joseph D Ring; Adam Ameur; Ulf Gyllensten; Martin Bodner; Walther Parson; Charla Marshall; Marie Allen
Journal: Int J Mol Sci Date: 2022-02-17 Impact factor: 5.923

6. Mitochondrial DNA variation across 56,434 individuals in gnomAD.

Authors: Kristen M Laricchia; Nicole J Lake; Nicholas A Watts; Megan Shand; Andrea Haessly; Laura Gauthier; David Benjamin; Eric Banks; Jose Soto; Kiran Garimella; James Emery; Heidi L Rehm; Daniel G MacArthur; Grace Tiao; Monkol Lek; Vamsi K Mootha; Sarah E Calvo
Journal: Genome Res Date: 2022-01-24 Impact factor: 9.438

7. Benchmarking the Effectiveness and Accuracy of Multiple Mitochondrial DNA Variant Callers: Practical Implications for Clinical Application.

Authors: Eddie K K Ip; Michael Troup; Colin Xu; David S Winlaw; Sally L Dunwoodie; Eleni Giannoulatou
Journal: Front Genet Date: 2022-03-08 Impact factor: 4.599

8. A method for multiplexed full-length single-molecule sequencing of the human mitochondrial genome.

Authors: Ieva Keraite; Philipp Becker; Davide Canevazzi; Cristina Frias-López; Marc Dabad; Raúl Tonda-Hernandez; Ida Paramonov; Matthew John Ingham; Isabelle Brun-Heath; Jordi Leno; Anna Abulí; Elena Garcia-Arumí; Simon Charles Heath; Marta Gut; Ivo Glynne Gut
Journal: Nat Commun Date: 2022-10-06 Impact factor: 17.694
