| Literature DB >> 33789960 |
Bert Bogaerts1,2, Thomas Delcourt1, Vanessa Mathys3, Kevin Vanneste4, Karine Soetaert3, Samira Boarbi3, Pieter-Jan Ceyssens3, Raf Winand1, Julien Van Braekel1, Sigrid C J De Keersmaecker1, Nancy H C Roosens1, Kathleen Marchal5,2,6.
Abstract
The use of whole-genome sequencing (WGS) for routine typing of bacterial isolates has increased substantially in recent years. For Mycobacterium tuberculosis (MTB), in particular, WGS has the benefit of drastically reducing the time required to generate results compared to most conventional phenotypic methods. Consequently, a multitude of solutions for analyzing WGS MTB data have been developed, but their successful integration in clinical and national reference laboratories is hindered by the requirement for their validation, for which a consensus framework is still largely absent. We developed a bioinformatics workflow for (Illumina) WGS-based routine typing of MTB complex (MTBC) member isolates allowing complete characterization, including (sub)species confirmation and identification (16S, csb/RD, hsp65), single nucleotide polymorphism (SNP)-based antimicrobial resistance (AMR) prediction, and pathogen typing (spoligotyping, SNP barcoding, and core genome multilocus sequence typing). Workflow performance was validated on a per-assay basis using a collection of 238 in-house-sequenced MTBC isolates, extensively characterized with conventional molecular biology-based approaches supplemented with public data. For SNP-based AMR prediction, results from molecular genotyping methods were supplemented with in silico modified data sets, allowing us to greatly increase the set of evaluated mutations. The workflow demonstrated very high performance with performance metrics of >99% for all assays, except for spoligotyping, where sensitivity dropped to ∼90%. The validation framework for our WGS-based bioinformatics workflow can aid in the standardization of bioinformatics tools by the MTB community and other SNP-based applications regardless of the targeted pathogen(s). The bioinformatics workflow is available for academic and nonprofit use through the Galaxy instance of our institute at https://galaxy.sciensano.be.Entities:
Keywords: Mycobacterium tuberculosis; antimicrobial resistance; national reference center; public health; single nucleotide polymorphism; validation; whole genome sequencing
Mesh:
Year: 2021 PMID: 33789960 PMCID: PMC8316078 DOI: 10.1128/JCM.00202-21
Source DB: PubMed Journal: J Clin Microbiol ISSN: 0095-1137 Impact factor: 5.948
FIG 1Overview of the bioinformatics workflow. Each box represents a component corresponding to a series of tasks that provide a certain well-defined functionality (indicated in bold). The major bioinformatics utilities employed in each module are also mentioned (indicated in italics). Data processing steps are indicated in yellow, bioinformatics assays are indicated in red, and the dotted red boxes group the assays into the three main categories. PE, paired-end; QC, quality control; RD, regions of difference; AMR, antimicrobial resistance.
Employed QC metrics
| Metric | Definition | Warning threshold | Failure threshold |
|---|---|---|---|
| Contamination (%) | Percentage of reads classified as highest-occurring species other than | 1.00 | 5.00 |
| Median coverage against reference genome (×) | Median coverage based on mapping of the trimmed reads against the H37Rv reference genome | 20 | 10 |
| Reads mapping back to reference genome (%) | Percentage of the trimmed reads mapping back to the H37Rv reference genome | 95 | 90 |
| cgMLST genes identified (%) | Percentage of cgMLST genes identified. Only perfect hits (i.e., full length and 100% identity) are considered | 95 | 90 |
| Average read quality (Q-score) | Q-score of the trimmed reads averaged over all reads and positions | 30 | 25 |
| GC content deviation (%) | Deviation of the average GC content of the trimmed reads from the expected value for | 2.00 | 4.00 |
| N-fraction | Average N-fraction per read position of the trimmed reads | 0.05 | 0.10 |
| Per base sequence content (%) | Difference between AT and GC frequencies averaged at every read position. Since primer artefacts can cause fluctuations at the start of reads due to the nonrandom nature of enzymatic tagmentation when the Nextera XT protocol is used for library preparation, the first 20 bases are not included in this test. As fluctuations can also exist at the end of reads caused by the low abundance of very long reads because of read trimming, the 0.5% longest reads are similarly excluded. | 3.00 | 6.00 |
| Minimum read length (%) | Minimum read length after trimming (denoted as percentage of untrimmed read length) that minimum half of all trimmed reads must obtain (e.g., half of all trimmed reads should either be minimally 120 or 200 bases long when raw input reads lengths are 300 bases long) | 66.67 | 40.00 |
FIG 2Schematic representation of the validation strategy. The colored boxes represent the different components of the validation strategy as follows: whole-genome sequencing (WGS, yellow), characterization with molecular methods (molecular, green), in silico modification or characterization of WGS data (in silico, orange), and validation of the assays (validation, blue). Arrows indicate the flow of the data between the different steps. The “Method” columns represent the molecular or bioinformatics method that was used to generate data. The “# observations” columns correspond to the number of observations that were obtained or evaluated at the listed level. Abbreviations: MTBC, Mycobacterium tuberculosis complex; SIT, shared international type; RD, regions of difference; WGS, whole-genome sequencing; SRA Sequence Read Archive; AMR, antimicrobial resistance; LPAs line probe assays. (a) csb/RD allows distinguishing the following 4 species within the MTBC: Mycobacterium tuberculosis, Mycobacterium africanum, Mycobacterium bovis BCG, and Mycobacterium bovis. (b) Validation was performed on a per-species basis reusing the positive observations of the other species as negative controls. (c) Some observations of the in silico modified dataset were removed because of incompatibilities with preexisting mutations in the WGS data.
Evaluated performance metrics and their corresponding definitions and formulas
| Metric | Definition | Formula | Assay-specific definitions | Bioinformatics assay | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (Sub)species identification and confirmation | SNP-based antimicrobial resistance detection | Pathogen typing | ||||||||||
| 16S |
| Conventional | Spoligotyping | SNP barcoding | Sequence typing | |||||||
| Repeatability | Agreement of the assay based on intra-assay replicates | Repeatability = 100% × (no. of intra-assay replicates in agreement)/(total no. of intra-assay replicates) | Intra-assay replicate | Running the bioinformatics workflow twice on the same data set on the same computational environment | ||||||||
| Reproducibility | Agreement of the assay based on interassay replicates | Reproducibility = 100% × (no. interassay replicates in agreement)/(total no. interassay replicates) | Interassay replicate | Running the bioinformatics workflow twice on the same data set on a separate computational environment | ||||||||
| Accuracy | The likelihood that results of the assay are correct | Accuracy = 100% × (TP + TN)/(TN + FN + TP + FP) | True-positive result (TP) | Detection of MTBC species in positive sample | Targeted species detected in positive sample | Detection of MTBC species in positive sample | Detection of expected variant | Detection of expected spoligotype | Expected lineage in detected lineages | Detection of the same allele as in the reference standard | ||
| Precision | The likelihood that detected results of the assay are truly present | Precision = 100% × TP/(TP + FP) | False-negative result (FN) | No detection of MTBC species in positive sample | Targeted species not detected in positive sample | No detection of MTBC species in positive sample | No detection of expected variant | No detection of spoligotype or detection of wrong spoligotype | Expected lineage not in detected lineages | Detection of a different allele as in the reference standard | ||
| Sensitivity | The likelihood that a result will be correctly picked up by the assay when present | Sensitivity = 100% × TP/(TP + FN) | True-negative result (TN) | No detection of MTBC species in negative sample | Targeted species not detected in negative sample | No detection of MTBC species in negative sample | No variant detected at WT position | No detection of spoligotype when none expected | No detection of lineage when none expected | No detection of an allele in negative-control sample | ||
| Specificity | The likelihood that a result will not be falsely picked up by the assay when not present | Specificity = 100% × TN/(TN + FP) | False-positive result (FP) | Detection of MTBC species in negative sample | Targeted species detected in negative sample | Detection of MTBC species in negative sample | Detection of nonexisting variant or other variant than expected | Detection of spoligotype when none expected | Detection of lineage when none expected | Detection of an allele in negative-control sample | ||
TP, true positive; FP, false positive, TN, true negative; FN, false negative; WT, wild-type.
FIG 3Overview of the mutations included in the (validation of) SNP-based AMR detection. The y axis represents the number of mutations for the validation steps listed on the x axis. Each subplot represents a different category of mutations. The colors indicate if mutations were included (turquoise) or not included (red) in the corresponding validation step. The total height of each bar (i.e., red + turquoise) corresponds to the maximum number of mutations of the corresponding category that could be included in the validation. For the comparison with conventional methods, this corresponds to all mutations located in genomic regions targeted by the conventional methods. For the WGS-based workflow (validated through in silico conventional and in silico database), this corresponds to the total number of AMR mutations in the database. The white labels show the number of times mutations of the corresponding category were present in the validation samples according to the reference information (note that this number can be higher than the number of mutations on the y axis because the same mutation can have been validated multiple times in different samples). This figure demonstrates that (i) the number of mutations that are screened with WGS far exceeds the mutations detected with conventional molecular methods, and (ii) the in silico approach enables us to validate a much larger fraction of mutations in the database. obs., observations; db, database; conv. conventional; AMR, antimicrobial resistance.
Performance results for the different evaluated bioinformatics assays
| Group | Assay | No. of TP | No. of FN | No. of TN | No. of FP | Accuracy (%) | Precision (%) | Sensitivity (%) | Specificity (%) | Repeatability (%) | Reproducibility (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| (Sub)species confirmation and identification | 16S rRNA | 150 | 1 | 64 | 0 | 99.53 | 100 | 99.34 | 100 | 100 | 100 |
| 12/16/13/12 | 0 | 41/37/40/41 | 0 | 100 | 100 | 100 | 100 | 100 | 100 | ||
|
| 151 | 0 | 64 | 0 | 100 | 100 | 100 | 100 | 100 | 100 | |
| SNP-based AMR detection | AMR detection: comparison with molecular methods | 118 | 0 | 17,776 | 1 | 99.99 | 99.16 | 100 | 99.99 | 100 | 100 |
| AMR detection | 138 | 0 | 137 | 0 | 100 | 100 | 100 | 100 | 100 | 100 | |
| AMR detection | 11,371 | 29 | 11,456 | 0 | 99.87 | 100 | 99.75 | 100 | 100 | 100 | |
| Pathogen typing | Spoligotyping | 148 | 18 | 16 | 0 | 90.11 | 100 | 89.16 | 100 | 100 | 100 |
| SNP barcoding | 44 | 0 | 16 | 0 | 100 | 100 | 100 | 100 | 100 | 100 | |
| Sequence typing (cgMLST) | 31,245 | 3 | 11,904 | 0 | 99.99 | 100 | 99.99 | 100 | 100 | 100 |
TP, true positive; FP, false positive; TN, true negative; FN false negative; RD, regions of difference; AMR, antimicrobial resistance; IS, in silico.
Validation was performed on a per-species basis. Listed values correspond to M. africanum, M. bovis, M. bovis BCG, and M. tuberculosis (in order).
FIG 4Maximum likelihood tree containing an overview of the diversity of the in-house samples included in the validation data set. The scale bar is expressed as average changes per site. The annotations are (from inner to outer rings) sample name, main lineage determined with the workflow (top-right legend), full lineage detected with the workflow, SIT number determined with conventional spoligotyping, total and unique (i.e., only occurring in the corresponding sample) number of detected mutations with known association(s) with resistance to antibiotics, and number of in silico inserted variants selected from the database (“database”) and from the positions covered by the conventional methods (“conventional”). Note that the color scale for the number of in silico inserted mutations was capped at 20 to increase clarity; the total number of mutations for all data points in the “database” ring ranged from 217 to 225. Nb, number; SIT, shared international type.