Bert Bogaerts, Raf Winand, Julien Van Braekel, Stefan Hoffman, Nancy H C Roosens, Sigrid C J De Keersmaecker, Kathleen Marchal, Kevin Vanneste.
Keywords: Illumina; NGS; WGS; real-time; validation
Year: 2021 PMID: 34739368 PMCID: PMC8743554 DOI: 10.1099/mgen.0.000699
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Fig. 1. Schematic overview of the protocol for real-time data generation and analysis. The grey boxes indicate the different environments involved, which are located on separate (virtual) machines or servers. The ‘MiSeq-sequencer’ environment corresponds to the sequencer itself. The ‘MiSeq-sync’ environment corresponds to a server that periodically mounts the MiSeq drive to transfer intermediate BCL files generated by the MiSeq. The ‘MiSeq-agent’ environment corresponds to a server that collects BCL files from the ‘MiSeq-sync’ environment and converts them to FASTQ files that can be analysed with bioinformatics workflows. See Methods S1 for a detailed description.
Fig. 2. Schematic representation of the workflow to obtain minimal requirements for sequencing coverage and read lengths. The flowchart provides a schematic overview of the different scenarios for which the performance of various bioinformatics assays was evaluated. The arrows indicate the steps that need to be followed to set up optimized WGS runs on Illumina sequencers. Underlined text refers to other figures or tables included in this manuscript. The ‘=’, ‘↑’, and ‘↓’ symbols indicate that the associated metric was kept constant, increased, or decreased, respectively. Abbreviations: cov, coverage; Nb., number.
Overview of evaluated bioinformatics assays for all three species. The publication(s) describing the assay in more detail are listed in the second column, and more information about the bioinformatics methodology is also presented in the Supplementary Material Section S1.3. The primary bioinformatics tools used for the assay are listed in the third column.
| Bioinformatics assay | Short description | Bioinformatics tool(s) | Evaluated species 1 | Evaluated species 2 | Evaluated species 3 | Assay-specific definition: TP | Assay-specific definition: FN | Assay-specific definition: TN | Assay-specific definition: FP |
|---|---|---|---|---|---|---|---|---|---|
| 16S rRNA species confirmation | Confirmation that the targeted species is present through alignment of the assembled contigs against the NCBI 16S rRNA database […] | blastn […] | No | No | Yes | Matching species detected with full dataset | Unmatched species detected with full dataset | No detection of targeted species in negative control sample | Detection of targeted species in negative control sample |
| Gene detection: virulence genes | Detection of genes associated with virulence using an alignment-based approach […] | blastn […] | Yes | Yes | No | Detection of a gene detected in full dataset | No detection of a gene detected in full dataset | No detection of a gene not detected in full dataset | Detection of a gene not detected in full dataset |
| Gene detection: AMR genes | Detection of genes associated with AMR using an alignment-based approach […] | blastn […] | Yes | No | No | Detection of a gene detected in full dataset | No detection of a gene detected in full dataset | No detection of a gene not detected in full dataset | Detection of a gene not detected in full dataset |
| SNP-based antimicrobial resistance detection | Detection of SNPs with a known association with AMR using a read mapping-based approach […] | SAMtools […] | No | No | Yes | Detection of mutation present in full dataset | No detection of mutation present in full dataset | Mutation from database not detected in full and modified datasets | Mutation from database detected in modified dataset but not in full dataset |
| PointFinder | Detection of SNPs with a known association with AMR using an alignment-based approach […] | PointFinder […] | Yes | No | Yes | Detection of mutation present in full dataset | No detection of mutation present in full dataset | Mutation from database not detected in full and modified datasets | Mutation from database detected in modified dataset but not in full dataset |
| Serotype determination* | In silico serotyping based on alignment-based detection of serotype-determining genes […] | blastn […] | Yes | Yes | No | Detection of same serotype as in full dataset | Detection of a different serotype than in full dataset | No detection of serotype in negative control sample | Detection of serotype in negative control sample |
| Sequence typing (cgMLST) | Alignment-based detection of alleles from species-specific cgMLST schemes […] | blastn […] | Yes | Yes | Yes | Detection of same allele as in full dataset | Detection of a different allele than in full dataset | No detection of allele in negative control sample | Detection of allele in negative control sample |
* Algorithm for serotype determination is different for E. coli and N. meningitidis.
AMR, antimicrobial resistance; FN, false negative; FP, false positive; TN, true negative; TP, true positive.
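The gene detection definitions above compare each modified dataset against the full dataset, which acts as the reference. A minimal Python sketch of that bookkeeping follows. The function name, the set-based representation, and the placeholder gene names are illustrative assumptions and are not taken from the authors' workflow.

```python
def classify_gene_detection(database_genes, full_hits, modified_hits):
    """Count TP/FN/TN/FP for one modified dataset, using the full dataset as reference.

    All three arguments are collections of gene identifiers (hypothetical names here).
    """
    database_genes = set(database_genes)
    # Only genes that exist in the database can be scored.
    full_hits = set(full_hits) & database_genes
    modified_hits = set(modified_hits) & database_genes
    return {
        # Gene detected in both the full and the modified dataset.
        "TP": len(full_hits & modified_hits),
        # Gene detected in the full dataset but missed in the modified dataset.
        "FN": len(full_hits - modified_hits),
        # Gene detected in neither dataset.
        "TN": len(database_genes - full_hits - modified_hits),
        # Gene detected in the modified dataset only.
        "FP": len(modified_hits - full_hits),
    }


# Placeholder identifiers, purely for illustration.
counts = classify_gene_detection(
    database_genes={"geneA", "geneB", "geneC", "geneD", "geneE"},
    full_hits={"geneA", "geneB", "geneC"},
    modified_hits={"geneA", "geneB", "geneD"},
)
print(counts)
```

The same pattern would apply to the mutation-based assays by replacing gene identifiers with mutation identifiers. Assays whose TN and FP definitions rely on a negative control sample need a separate comparison against that control.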
Overview of performance metrics and their corresponding definitions and formulas adopted for our validation strategy
| Metric | Definition | Formula |
|---|---|---|
| Repeatability | Agreement of assay based on intra-assay replicates* | Repeatability = 100 % × (# intra-assay replicates in agreement) / (total # intra-assay replicates) |
| Reproducibility | Agreement of assay based on inter-assay replicates* | Reproducibility = 100 % × (# inter-assay replicates in agreement) / (total # inter-assay replicates) |
| Accuracy | The likelihood that results of the assay are correct | Accuracy = 100 % × (TP + TN) / (TN + FN + TP + FP) |
| Precision | The likelihood that detected results of the assay are truly present | Precision = 100 % × TP / (TP + FP) |
| Sensitivity | The likelihood that a result will be correctly picked up by the assay when present | Sensitivity = 100 % × TP / (TP + FN) |
| Specificity | The likelihood that a result will not be falsely picked up by the assay when not present | Specificity = 100 % × TN / (TN + FP) |
| Matthews correlation coefficient (MCC) | Compound performance metric that considers all confusion matrix categories (TN, TP, FP, FN) expressed as a value between zero (very low performance) and 1 (very high performance) | |
*Intra- and inter-assay replicates were defined as repeated bioinformatics analysis on the same dataset on the same and different computational environments, respectively.
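The formulas in this table can be sketched in a few lines of Python. The function names below are illustrative assumptions. The MCC formula is not given in the table, so the sketch uses the standard textbook definition, which ranges from −1 to 1. Whether the authors rescale or truncate it is not stated here.

```python
import math


def performance_metrics(tp, fn, tn, fp):
    """Return accuracy, precision, sensitivity and specificity (in %) plus the MCC."""

    def pct(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return 100.0 * numerator / denominator if denominator else float("nan")

    # Standard MCC definition (assumption: the table does not show its formula).
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else float("nan")

    return {
        "accuracy": pct(tp + tn, tp + fn + tn + fp),
        "precision": pct(tp, tp + fp),
        "sensitivity": pct(tp, tp + fn),
        "specificity": pct(tn, tn + fp),
        "mcc": mcc,
    }


def agreement_percentage(n_in_agreement, n_total):
    """Repeatability or reproducibility: share of replicates in agreement, in %."""
    return 100.0 * n_in_agreement / n_total


# Hypothetical confusion-matrix counts, for illustration only.
metrics = performance_metrics(tp=95, fn=5, tn=90, fp=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Returning NaN for an undefined ratio keeps an assay with, for example, no positive results from being silently scored as 0 % or 100 %.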
Fig. 3. WGS data quality metrics as a function of coverage and read length for E. coli. The x- and y-axes in each panel denote the (symmetric) read length and theoretical coverage. For Scenario 2, the theoretical coverage is calculated based on full-length reads, and the effective coverage is therefore lower than indicated on the y-axis. The z-axis and colour scale show the value of the corresponding metric for (a) effective coverage, (b) total assembly length, and (c) N50. The evaluated coverage and read length combinations are indicated with blue open circles, and the three-dimensional planes were extrapolated based on the observed values. Contour lines at the bottom of each panel are coloured according to the colour legend. Figures for all three species are provided in Figs S2–S7.
Fig. 4. Bioinformatics assay performance as a function of coverage and read length. The x- and y-axes in each panel denote the (symmetric) read length and theoretical coverage. For Scenario 2, the theoretical coverage is calculated based on full-length reads, and the effective coverage is therefore lower than indicated on the y-axis. The z-axis denotes the Matthews correlation coefficient (MCC). The evaluated data points are indicated with blue open circles, and the three-dimensional planes were extrapolated based on the observed values. The red and cyan planes correspond to MCC thresholds of 95 % and 99 %, respectively. Contour lines at the bottom of each panel are coloured according to the colour legend, and the red and cyan lines correspond to the 95 % and 99 % MCC thresholds, respectively. Abbreviations: AMR, antimicrobial resistance.
Overview of the minimal sequencing duration and associated sequencing set-ups. Dashes indicate that performance for the corresponding theoretical coverage was below the 95 % MCC threshold for all read length combinations. For non-time-critical situations, the number of samples that can be multiplexed for the targeted assay(s) can be determined using the Lander/Waterman equation and the minimum coverage values listed in section A. For time-critical situations, the number of samples can be determined similarly, using the values in sections B and C. Section B lists the minimum read lengths that need to be sequenced. Section C lists the earliest combination that provides accurate results when the real-time sequencing protocol is used. Note that for Scenario 2, the indicated coverage is based on full-length 2×251 reads, which should be considered when determining the number of samples to multiplex in a run.
| Bioinformatics assay | 16S rRNA | Gene detection | Gene detection | Gene detection | SNP-based AMR detection | PointFinder | PointFinder | Sequence typing | Sequence typing | Sequence typing | Serotype determination | Serotype determination |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Section A | | | | | | | | | | | | |
| | 10X | 15X | 15X | 10X | 10X | 10X | 20X | 15X | 20X | 15X | 50X | 25X |
| Section B | | | | | | | | | | | | |
| | 14.07 | – | 25.30 | 19.68 | 14.07 | 14.07 | 25.30 | – | – | – | – | 19.68 |
| | 51F - 51R | – | 201F - 51R | 126F - 51R | 51F - 51R | 51F - 51R | 201F - 51R | – | – | – | – | 126F - 51R |
| | 14.07 | 21.56 | 14.07 | 14.07 | 14.07 | 14.07 | 14.07 | 14.07 | 19.68 | 19.68 | – | 14.07 |
| | 51F - 51R | 151F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 126F - 51R | 126F - 51R | – | 51F - 51R |
| | 14.07 | 14.07 | 14.07 | 14.07 | 14.07 | 14.07 | 14.07 | 14.07 | 14.07 | 14.07 | – | 14.07 |
| | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | – | 51F - 51R |
| | 14.07 | 14.07 | 14.07 | 14.07 | 14.07 | 14.07 | 14.07 | 14.07 | 14.07 | 14.07 | 30.42 | 14.07 |
| | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 151F - 151R | 51F - 51R |
| | 14.07 | 14.07 | 14.07 | 14.07 | 14.07 | 14.07 | 14.07 | 14.07 | 14.07 | 14.07 | 46.77 | 14.07 |
| | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 51F - 51R | 251F - 251R | 51F - 51R |
| Section C | | | | | | | | | | | | |
| | 34.16 | – | – | 35.69 | 25.30 | 29.04 | – | – | – | – | – | 29.04 |
| | 201F - 151R | – | – | 251F - 126R | 201F - 51R | 251F - 51R | – | – | – | – | – | 251F - 51R |
| | 19.68 | 30.42 | 25.30 | 25.30 | 15.94 | 19.68 | 25.30 | 29.04 | 37.91 | 31.26 | – | 25.30 |
| | 126F - 51R | 151F - 151R | 201F - 51R | 201F - 51R | 76F - 51R | 126F - 51R | 201F - 51R | 251F - 51R | 251F - 151R | 251F - 76R | – | 201F - 51R |
| | 17.81 | 25.30 | 21.56 | 19.68 | 14.07 | 17.81 | 21.56 | 25.30 | 25.30 | 25.30 | – | 19.68 |
| | 101F - 51R | 201F - 51R | 151F - 51R | 126F - 51R | 51F - 51R | 101F - 51R | 151F - 51R | 201F - 51R | 201F - 51R | 201F - 51R | – | 126F - 51R |
| | 17.81 | 25.30 | 19.68 | 17.81 | 14.07 | 15.94 | 19.68 | 21.56 | 25.30 | 21.56 | – | 17.81 |
| | 101F - 51R | 201F - 51R | 126F - 51R | 101F - 51R | 51F - 51R | 76F - 51R | 126F - 51R | 151F - 51R | 201F - 51R | 151F - 51R | – | 101F - 51R |
| | 13.98 | 21.46 | 17.72 | 17.72 | 13.98 | 13.98 | 17.72 | 19.59 | 19.59 | 19.59 | 38.28 | 17.72 |
| | 51F - 51R | 151F - 51R | 101F - 51R | 101F - 51R | 51F - 51R | 51F - 51R | 101F - 51R | 126F - 51R | 126F - 51R | 126F - 51R | 201F - 201R | 101F - 51R |
AMR, antimicrobial resistance; F, forward; R, reverse.
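The table caption refers to the Lander/Waterman equation for deciding how many samples can be multiplexed. In its usual form the equation gives coverage as C = L × N / G, with read length L, number of reads N, and genome size G. The Python sketch below applies it under stated assumptions: the function names, the genome size, and the number of reads per run are hypothetical illustration values, and every individual read (each mate of a pair) counts as one read.

```python
import math


def coverage(read_length, n_reads, genome_size):
    """Lander/Waterman coverage: C = L * N / G."""
    return read_length * n_reads / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target coverage for one sample."""
    return math.ceil(target_coverage * genome_size / read_length)


def max_samples(run_reads, target_coverage, read_length, genome_size):
    """Number of samples that fit in one run if reads are divided evenly."""
    return run_reads // reads_needed(target_coverage, read_length, genome_size)


# Hypothetical values: a 5 Mb genome, 20 million reads per run, 20X target coverage.
GENOME_SIZE = 5_000_000
RUN_READS = 20_000_000

per_sample = reads_needed(target_coverage=20, read_length=251, genome_size=GENOME_SIZE)
n_samples = max_samples(RUN_READS, target_coverage=20, read_length=251, genome_size=GENOME_SIZE)
print(per_sample, n_samples)
```

Because the Scenario 2 coverage values are based on full-length 2×251 reads, the read length passed in for that scenario would stay at 251 even when the run is stopped earlier. In practice a safety margin may be sensible, since reads are rarely distributed evenly across multiplexed samples.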