Stephanie J Müller, Rebone L Meraba, Gciniwe S Dlamini, Darlington S Mapiye.
Abstract
The persistence and emergence of new multi-drug resistant Mycobacterium tuberculosis (M. tb) strains continue to advance the devastating tuberculosis (TB) epidemic. Robust systems are needed to perform drug-resistance profiling accurately and rapidly, and machine learning (ML) methods combined with genomic sequence data may provide novel insights into drug-resistance mechanisms. Using 372 M. tb isolates, the combined utility of ML and bioinformatics for drug-resistance profiling is demonstrated. SNPs, InDels, and dinucleotide frequencies are explored as input features for three ML models, namely Decision Trees, Random Forest, and eXtreme Gradient Boosting (XGBoost). Using SNPs and InDels, all three models performed equally well, yielding 99% accuracy, 97% recall, and a 99% F1-score. Using dinucleotide frequencies, the XGBoost algorithm was superior, with 97% accuracy, 94% recall, and a 97% F1-score. This study validates the use of variants and presents dinucleotide features as another effective feature encoding method for ML-based phenotype classification. ©2021 AMIA - All rights reserved.
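To make the dinucleotide feature encoding mentioned in the abstract concrete, here is a minimal sketch of how a sequence might be turned into a 16-value frequency vector. The abstract does not describe the authors' exact procedure, so the function name `dinucleotide_frequencies`, the use of overlapping windows, and the skipping of ambiguous bases are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# The 16 canonical dinucleotides in a fixed order: AA, AC, AG, AT, CA, ...
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def dinucleotide_frequencies(sequence):
    """Return a 16-element list of relative dinucleotide frequencies.

    Assumptions (not taken from the paper): windows overlap, and any
    window containing a non-ACGT character (e.g. N) is ignored.
    """
    seq = sequence.upper()
    # Count every overlapping 2-mer in the sequence.
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    # Normalise only over the canonical dinucleotides.
    total = sum(counts[d] for d in DINUCLEOTIDES)
    if total == 0:
        return [0.0] * len(DINUCLEOTIDES)
    return [counts[d] / total for d in DINUCLEOTIDES]


if __name__ == "__main__":
    features = dinucleotide_frequencies("ACGTACGT")
    for name, value in zip(DINUCLEOTIDES, features):
        if value > 0:
            print(name, round(value, 3))
```

A vector like this, computed per isolate, could serve as one row of a feature matrix for a tree-based classifier. How the study actually derived and combined these features is not stated in the abstract.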
Year: 2022 PMID: 35309001 PMCID: PMC8861754
Source DB: PubMed Journal: AMIA Annu Symp Proc ISSN: 1559-4076