Bert Bogaerts, Raf Winand, Qiang Fu, Julien Van Braekel, Pieter-Jan Ceyssens, Wesley Mattheus, Sophie Bertrand, Sigrid C J De Keersmaecker, Nancy H C Roosens, Kevin Vanneste.
Abstract
Despite being a well-established research method, the use of whole-genome sequencing (WGS) for routine molecular typing and pathogen characterization remains a substantial challenge due to the required bioinformatics resources and/or expertise. Moreover, many national reference laboratories and centers, as well as other laboratories working under a quality system, require extensive validation to demonstrate that employed methods are "fit-for-purpose" and provide high-quality results. However, a harmonized framework with guidelines for the validation of WGS workflows does not yet exist, despite several recent case studies highlighting the urgent need for one. We present a validation strategy focusing specifically on the exhaustive characterization of the bioinformatics analysis of a WGS workflow designed to replace conventionally employed molecular typing methods for microbial isolates in a representative small-scale laboratory, using the pathogen Neisseria meningitidis as a proof of concept. We adapted several classically employed performance metrics specifically toward three different bioinformatics assays: resistance gene characterization (based on the ARG-ANNOT, ResFinder, CARD, and NDARO databases), several commonly employed typing schemas (including, among others, core genome multilocus sequence typing), and serogroup determination. We analyzed a core validation dataset of 67 well-characterized samples typed by means of classical genotypic and/or phenotypic methods and sequenced in-house, allowing us to evaluate repeatability, reproducibility, accuracy, precision, sensitivity, and specificity of the different bioinformatics assays. We also analyzed an extended validation dataset composed of publicly available WGS data for 64 samples by comparing results of the different bioinformatics assays against results obtained from commonly used bioinformatics tools.
We demonstrate high performance, with values for all performance metrics >87%, >97%, and >90% for the resistance gene characterization, sequence typing, and serogroup determination assays, respectively, for both validation datasets. Our WGS workflow has been made publicly available as a "push-button" pipeline for Illumina data at https://galaxy.sciensano.be to showcase its implementation for non-profit and/or academic usage. Our validation strategy can be adapted to other WGS workflows for other pathogens of interest, and demonstrates the added value and feasibility of employing WGS with the aim of integrating it into routine use in an applied public health setting.
Keywords: Neisseria meningitidis; national reference center; public health; validation; whole-genome sequencing
Year: 2019 PMID: 30894839 PMCID: PMC6414443 DOI: 10.3389/fmicb.2019.00362
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
FIGURE 1. Overview of the bioinformatics workflow. Each box represents a component corresponding to a series of tasks that provide a certain well-defined functionality (indicated in bold). Major bioinformatics utilities employed in each module are also mentioned (indicated in italics). Abbreviation: PE, paired-end.
Advanced quality control metrics with their associated definitions and threshold values for warnings and failures.
| Metric | Definition | Warning threshold | Failure threshold |
|---|---|---|---|
| Median coverage | Median coverage based on mapping of the trimmed reads against the assembly | 20 | 10 |
| % reads mapping back to assembly | Percentage of the trimmed reads mapping back to the assembly | 95 | 90 |
| % cgMLST genes identified | Percentage of cgMLST genes identified. Only perfect hits (i.e., full length and 100% identity) are considered | 95 | 90 |
| Average read quality | Average base quality of the trimmed reads | 30 | 25 |
| GC-content deviation | Deviation of the average GC content of the trimmed reads from the expected value for N. meningitidis | 2 | 4 |
| N-fraction | Average N-fraction per read position of the trimmed reads | 0.05 | 0.10 |
| Mean quality drop-off | Average position in the trimmed reads where the average base quality drops below the acceptance threshold (denoted as percentage of read length) | 66.67% | 50.00% |
| Per base sequence content | Difference between AT and GC frequencies averaged at every read position. Since primer artifacts can cause fluctuations at the start of reads due to the non-random nature of enzymatic tagmentation when the Nextera XT protocol is used for library preparation, the first 20 bases are not included in this test. As fluctuations can also exist at the end of reads caused by the low abundance of very long reads because of read trimming, the 0.5% longest reads are similarly excluded | 3 | 6 |
| Minimum read length | Minimum read length after trimming (denoted as percentage of untrimmed read length) that at least half of all trimmed reads must attain (e.g., half of all trimmed reads should be at least 175 or 150 bases long, for the warning and failure thresholds respectively, when raw input reads are 300 bases long) | 58.33% | 50% |
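The warning and failure thresholds above lend themselves to a simple two-tier check. The sketch below is an illustrative assumption about how such a classification could be implemented; the metric keys and evaluation logic are ours, not the workflow's actual code:

```python
# Illustrative two-tier QC check based on the thresholds in the table above.
# Metric names and threshold values follow the table; the warn/fail logic
# is an assumption, not the published workflow's implementation.

QC_THRESHOLDS = {
    # metric: (warning threshold, failure threshold, direction)
    # direction "min": higher values are better; "max": lower values are better
    "median_coverage": (20, 10, "min"),
    "pct_reads_mapping": (95, 90, "min"),
    "pct_cgmlst_genes": (95, 90, "min"),
    "avg_read_quality": (30, 25, "min"),
    "gc_deviation": (2, 4, "max"),
    "n_fraction": (0.05, 0.10, "max"),
}

def evaluate(metric: str, value: float) -> str:
    """Classify a QC metric value as PASS, WARNING, or FAILURE."""
    warn, fail, direction = QC_THRESHOLDS[metric]
    if direction == "min":          # value should stay above the thresholds
        if value < fail:
            return "FAILURE"
        if value < warn:
            return "WARNING"
    else:                           # value should stay below the thresholds
        if value > fail:
            return "FAILURE"
        if value > warn:
            return "WARNING"
    return "PASS"
```

For example, a median coverage of 15 would trigger a warning (below 20 but above 10), while a GC-content deviation of 5 would be flagged as a failure.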
Overview of employed typing schemas.
| Schema name | #Total loci | # Nucleotide loci | # Protein loci |
|---|---|---|---|
| Classic MLST | 7 | 7 | 0 |
| | 1 | 1 | 0 |
| cgMLST | 1605 | 1605 | 0 |
| Bexsero antigen sequence typing | 5 | 0 | 5 |
| | 2 | 0 | 2 |
| | 1 | 1 | 0 |
| | 1 | 0 | 1 |
| | 9 | 2 | 7 |
| Resistance genes | 9 | 9 | 0 |
| Vaccine targets | 3 | 3 | 0 |
Overview of performance metrics and their corresponding definitions and formulas adopted for our validation strategy.
| Metric | Definition | Formula | Assay-specific term | Resistance gene characterization (ARG-ANNOT, ResFinder, CARD, NDARO) | Sequence typing (cgMLST) | Serogroup determination |
|---|---|---|---|---|---|---|
| Repeatability | Agreement of the assay based on within-run replicates | Repeatability = 100% × (# within-run replicates in agreement)/(total # within-run replicates) | Within-run replicate | Repeated bioinformatics analysis on the same sample using the same dataset (all assays) | | |
| Reproducibility | Agreement of the assay based on between-run replicates | Reproducibility = 100% × (# between-run replicates in agreement)/(total # between-run replicates) | Between-run replicate | Repeated bioinformatics analysis on the same sample using a different dataset (generated using a different library and/or sequencing run) (all assays) | | |
| Accuracy | The likelihood that results of the assay are correct | Accuracy = 100% × (TP+TN)/(TP+TN+FP+FN) | TP result | Detection of a gene present in the reference standard | Detection of the same allele as in the reference standard | Detection of the same serogroup as in the reference standard |
| Precision | The likelihood that detected results of the assay are truly present | Precision = 100% × TP/(TP+FP) | FN result | No detection of a gene present in the reference standard | Detection of a different allele than in the reference standard | Detection of a different serogroup than in the reference standard |
| Sensitivity | The likelihood that a result will be correctly picked up by the assay when present | Sensitivity = 100% × TP/(TP+FN) | TN result | No detection of a gene not present in the reference standard | No detection of an allele when challenged with the cgMLST schema of a different species | No detection of a serogroup when challenged with the serogroup schema of a different species |
| Specificity | The likelihood that a result will not be falsely picked up by the assay when not present | Specificity = 100% × TN/(TN+FP) | FP result | Detection of a gene not present in the reference standard | Detection of an allele when challenged with the cgMLST schema of a different species | Detection of a serogroup when challenged with the serogroup schema of a different species |
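The metric formulas in the table above can be expressed directly from TP/FP/TN/FN counts. A minimal sketch (function names are illustrative, not part of the published workflow):

```python
# The performance metrics from the table above, computed from the four
# confusion-matrix counts and returned as percentages.

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    # Accuracy = 100% x (TP+TN)/(TP+TN+FP+FN)
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)

def precision(tp: int, fp: int) -> float:
    # Precision = 100% x TP/(TP+FP)
    return 100.0 * tp / (tp + fp)

def sensitivity(tp: int, fn: int) -> float:
    # Sensitivity = 100% x TP/(TP+FN)
    return 100.0 * tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    # Specificity = 100% x TN/(TN+FP)
    return 100.0 * tn / (tn + fp)
```

For instance, 97 TPs and 3 FNs yield a sensitivity of 97%, matching how the cgMLST and serogroup values in the results tables below were derived from per-locus and per-sample classifications.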
Results for the core validation dataset.
| Standard | Metric | ARG-ANNOT | ResFinder | CARD | NDARO | Sequence typing (cgMLST) | Serogroup determination |
|---|---|---|---|---|---|---|---|
| | Repeatability | 100% | 100% | 100% | 100% | 100% | 100% |
| | Reproducibility | 100% | 100% | 100% | 100% | 99.65% | 99.50% |
| Database standard | Accuracy | – | – | – | – | 98.62% | 95.27% |
| | Precision | – | – | – | – | 100% | 100% |
| | Sensitivity | – | – | – | – | 97.12% | 90.55% |
| | Specificity | – | – | – | – | 100% | 100% |
| Tool standard | Accuracy | – | 100% | 99.87% | 100% | 99.37% | 100% |
| | Precision | – | 100% | 87.52% | 100% | 100% | 100% |
| | Sensitivity | – | 100% | 99.93% | 100% | 98.68% | 100% |
| | Specificity | – | 100% | 99.87% | 100% | 100% | 100% |
Results for the extended validation dataset.
| Standard | Metric | ARG-ANNOT | ResFinder | CARD | NDARO | Sequence typing (cgMLST) | Serogroup determination |
|---|---|---|---|---|---|---|---|
| | Repeatability | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| | Reproducibility | – | – | – | – | – | – |
| Database standard | Accuracy | – | – | – | – | – | 96.09% |
| | Precision | – | – | – | – | – | 100.00% |
| | Sensitivity | – | – | – | – | – | 92.19% |
| | Specificity | – | – | – | – | – | 100.00% |
| Tool standard | Accuracy | – | 100.00% | 99.88% | 100.00% | 99.78% | 100.00% |
| | Precision | – | 100.00% | 87.50% | ∗ | 100.00% | 100.00% |
| | Sensitivity | – | 100.00% | 100.00% | ∗ | 99.52% | 100.00% |
| | Specificity | – | 100.00% | 99.88% | 100.00% | 100.00% | 100.00% |
FIGURE 2. Reproducibility of the sequence typing assay for the core validation dataset. The abscissa depicts the sequencing runs being compared, while the ordinate represents the percentage of cgMLST loci that were concordant between the same samples of different sequencing runs. Note that the ordinate starts at 94% instead of 0% to illustrate the variation between run comparisons more clearly. Each comparison is presented as a boxplot based on 67 samples, where the boundary of the box closest to the abscissa indicates the 25th percentile, the thick line inside the box indicates the median, and the boundary farthest from the abscissa indicates the 75th percentile. See also Supplementary Table S9 for detailed values for all samples and sequencing runs.
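The per-sample reproducibility values plotted in Figure 2 come down to the fraction of cgMLST loci with identical allele calls between two runs of the same sample. A minimal sketch, assuming allele calls are available as locus-to-allele mappings (the locus names and data structures are illustrative, not the workflow's actual output format):

```python
# Illustrative computation of per-sample cgMLST concordance between two
# sequencing runs: the percentage of shared loci with identical allele calls.

def cgmlst_concordance(run_a: dict, run_b: dict) -> float:
    """Percentage of cgMLST loci called in both runs with identical alleles."""
    shared = set(run_a) & set(run_b)
    if not shared:
        return 0.0
    same = sum(1 for locus in shared if run_a[locus] == run_b[locus])
    return 100.0 * same / len(shared)

# Hypothetical allele calls for one sample in two runs (locus -> allele number):
run_a = {"locus_0001": 5, "locus_0002": 12}
run_b = {"locus_0001": 5, "locus_0002": 13}
# One of two shared loci agrees, giving 50.0% concordance for this toy example.
```

In the study itself, such values were computed over the 1605-locus cgMLST schema for each of the 67 samples and summarized per pair of sequencing runs as boxplots.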
FIGURE 3. Database standard results of the sequence typing assay for the core validation dataset. The abscissa depicts the sequencing run, while the ordinate represents the percentages of cgMLST loci as indicated by the title above each graph. Each sequencing run is presented as a boxplot based on 67 samples (see the legend of Figure 2 for a brief explanation). The upper left graph depicts the percentage of concordant cgMLST loci, i.e., loci for which our workflow identified the same allele as the database standard; these were classified as TPs. Note that the ordinate starts at 93% instead of 0% to illustrate the results more clearly. All other cases were classified as FNs and encompass three categories. First, the upper right graph depicts the percentage of cgMLST loci for which our workflow detected a different allele than present in the database standard. Second, the bottom left graph depicts the percentage of cgMLST loci for which our workflow did not detect any allele although an allele was present in the database standard. Third, the bottom right graph depicts the percentage of cgMLST loci for which our workflow detected an allele but no allele was present in the database standard. Most FNs are explained by no information being present in the database standard, followed by actual mismatches; only a few cases are due to our workflow failing to detect an allele. See also Supplementary Table S10 for detailed values for all samples and runs.
FIGURE 4. Tool standard results of the sequence typing assay for the core validation dataset. The abscissa depicts the sequencing run, while the ordinate represents the percentages of cgMLST loci as indicated by the title above each graph. Each sequencing run is presented as a boxplot based on 67 samples (see the legend of Figure 2 for a brief explanation). The upper left graph depicts the percentage of concordant cgMLST loci, i.e., loci for which our workflow identified the same allele as the tool standard; these were classified as TPs. Note that the ordinate starts at 98% instead of 0% to illustrate the results more clearly. All other cases were classified as FNs and encompass two categories. First, the upper right graph depicts the percentage of cgMLST loci for which our workflow identified multiple perfect hits, of which at least one corresponded to the tool standard but was reported differently. Second, the lower left graph depicts the percentage of cgMLST loci for which our workflow detected a different allele than the tool standard. Most FNs are therefore explained by a different manner of handling multiple perfect hits, and only a small minority are due to an actual mismatch between our workflow and the tool standard. Furthermore, upon closer inspection, these mismatches were due to an artifact of the reference tool used for the tool standard that has since been resolved (see Supplementary Figure S2). See also Supplementary Table S11 for detailed values for all samples and runs.