Liping Su, Shanshan Guo, Wenjie Guo, Xiaoying Ji, Yang Liu, Huanqin Zhang, Qichao Huang, Kaixiang Zhou, Xu Guo, Xiwen Gu, Jinliang Xing.
Abstract
Next-generation sequencing (NGS) of mitochondrial DNA (mtDNA) is widely used in aging and cancer studies. However, cross-contamination of mtDNA is a major concern. Previous methods for detecting mtDNA contamination focus mainly on haplogroup-level phylogeny and neglect haplotype-level differences, which limits their sensitivity and accuracy. Here, we present mitoDataclean, a random-forest-based machine learning package for accurate identification of cross-contamination, evaluation of contamination levels and detection of contamination-derived variants in mtDNA NGS data. Comprehensive optimization of mitoDataclean revealed that training on simulated mixtures with small haplogroup distance and low polymorphic difference was critical for optimal modeling. Compared to existing methods, mitoDataclean exhibited significantly improved sensitivity and accuracy for the detection of sample contamination in simulated data. In addition, mitoDataclean achieved area under the curve values of 0.91 and 0.97 for distinguishing genuine from contamination-derived mtDNA variants in a simulated Western dataset and in private sequencing contamination data, respectively, suggesting that the tool may be applicable to different populations and to samples with different sources of contamination. Finally, mitoDataclean was evaluated in several private and public datasets and showed a robust ability to detect contamination. Altogether, our study demonstrates that mitoDataclean may be used for accurate detection of contaminated samples and contamination-derived variants in mtDNA NGS data.
Keywords: machine learning; mitochondrial DNA; next-generation sequencing; sample cross-contamination
Year: 2022 PMID: 35001369 DOI: 10.1002/ijc.33927
Source DB: PubMed Journal: Int J Cancer ISSN: 0020-7136 Impact factor: 7.396