Varvara K Kozyreva, Chau-Linda Truong, Alexander L Greninger, John Crandall, Rituparna Mukhopadhyay, Vishnu Chaturvedi.
Abstract
Public health microbiology laboratories (PHLs) are on the cusp of unprecedented improvements in pathogen identification, antibiotic resistance detection, and outbreak investigation by using whole-genome sequencing (WGS). However, considerable challenges remain due to the lack of common standards. Here, we describe the validation of WGS on the Illumina platform for routine use in PHLs according to Clinical Laboratory Improvement Amendments (CLIA) guidelines for laboratory-developed tests (LDTs). We developed a validation panel comprising 10 Enterobacteriaceae isolates, 5 Gram-positive cocci, 5 Gram-negative nonfermenting species, 9 Mycobacterium tuberculosis isolates, and 5 miscellaneous bacteria. The genome coverage range was 15.71× to 216.4× (average, 79.72×; median, 71.55×); the limit of detection (LOD) for single nucleotide polymorphisms (SNPs) was 60×. The accuracy, reproducibility, and repeatability of base calling were >99.9%. The accuracy of phylogenetic analysis was 100%. The specificity and sensitivity inferred from multilocus sequence typing (MLST) and genome-wide SNP-based phylogenetic assays were 100%. The following objectives were accomplished: (i) the establishment of the performance specifications for WGS applications in PHLs according to CLIA guidelines, (ii) the development of quality assurance and quality control measures, (iii) the development of a reporting format for end users with or without WGS expertise, (iv) the availability of a validation set of microorganisms, and (v) the creation of a modular template for the validation of WGS processes in PHLs. The validation panel, sequencing analytics, and raw sequences could facilitate multilaboratory comparisons of WGS data. Additionally, the WGS performance specifications and modular template are adaptable for the validation of other platforms and reagent kits.
Keywords: CLIA; WGS; bacteria; bioinformatics pipeline; laboratory-developed test; performance specifications; public health; quality management; validation; whole-genome sequencing
Year: 2017 PMID: 28592550 PMCID: PMC5527429 DOI: 10.1128/JCM.00361-17
Source DB: PubMed Journal: J Clin Microbiol ISSN: 0095-1137 Impact factor: 5.948
Performance characteristics, definitions, and formulas used for validation
| Performance characteristic for WGS applications | Definition of performance characteristic for WGS applications | Formula used for calculation | Assay-specific definition | Result of assay used for validation of parameter: hqSNP-based genotyping | Result of assay used for validation of parameter: MLST | Result of assay used for validation of parameter: 16S | Result of assay used for validation of parameter: antibiotic resistance gene detection |
|---|---|---|---|---|---|---|---|
| Accuracy | Degree of agreement between the nucleic acid sequences derived from the assay (measured value) and those from a reference sequence (true value) | | | | | | |
| Accuracy of platform | Accuracy of base calling against the reference sequence; the accuracy of the platform was established as the agreement between base calling made by the MiSeq sequencer (measured value) and the NCBI/CDC reference sequence (true value) | % agreement with reference = [(covered genome length) − (total no. of SNPs differing from reference)]/(covered genome length) × 100 | Accuracy of the platform | 99.999378% | | | |
| Accuracy of assay | Accuracy of assay is determined as an agreement of the assay result for validation sequences generated by the PHL with the assay result for reference sequences of the same strains | Accuracy = (no. of correct results)/(total no. of results) × 100 | Definition of correct result | Congruence of phylogenetic trees built using reference sequences and validation sequences | Detection and correct identification of each of the MLST alleles | ID of the 16S rRNA sequence of the validation sample matches the ID of the 16S rRNA sequence of the reference sequence | Presence of ABR genes characteristic of the reference strain, absence of any other ABR genes |
| | | | Single test unit | Individual sample clustering | Allele | 16S rRNA ID result | Antibiotic resistance gene |
| | | | Accuracy of assay | 100% | 100% | 100% | 100% |
| Accuracy of bioinformatics pipeline | Agreement of the clustering suggested by previous investigators with the clustering achieved by analysis using PHL validation bioinformatics pipeline | % agreement = (no. of outbreak isolates clustered correctly in validation tree)/(total no. of outbreak isolates clustered together in the study tree) × 100 | Accuracy of bioinformatics pipeline | 100% | NA | NA | NA |
| Precision | Degree to which repeated sequence analyses give the same result repeatably (within-run precision) and reproducibly (between-run precision) | | | | | | |
| Repeatability | Repeatability was established by sequencing the same samples multiple times under the same conditions and evaluating the concordance of the assay results and performance | Repeatability = (no. of within-run replicates in agreement)/(total no. of tests performed for within-run replicates) × 100 | Definition of correct result | Repeatability of single nucleotide variant detection | Repeatability of allele detection | Repeatability of 16S ID | NA |
| | | | Single test unit | SNP (precision per replicate), SNP (precision per genome size) | Allele | 16S rRNA ID | NA |
| | | | Repeatability | 99.02%, 99.9999997% | 100% | 100% | NA |
| Reproducibility | Reproducibility was assessed as the consistency of the assay results and performance characteristics for the same sample sequenced under different conditions, such as between different runs, operators, and sample preparations | Reproducibility = (no. of between-run replicates in agreement)/(total no. of tests performed for between-run replicates) × 100 | Definition of correct result | Reproducibility of single nucleotide variant detection | Reproducibility of allele detection | Reproducibility of 16S ID | NA |
| | | | Single test unit | SNP (precision per replicate), SNP (precision per base pair) | Allele | 16S rRNA ID | NA |
| | | | Reproducibility | 97.05%, 99.999998% | 100% | 100% | NA |
| Analytical sensitivity (LOD) | Minimum coverage that allows accurate SNP detection (LODSNP) | NA | LODSNP | 60× | NA | | |
| Analytical specificity (interference) | Ability of an assay to detect only the intended target in the presence of potentially cross-reacting nucleotide sequences | NA | Interference/cross-reactivity | Cross-reactivity and interference from contaminating sequencing reads are possible | NA | | |
| Diagnostic sensitivity | Likelihood that a WGS assay will detect sequence variation when present within the analyzed genomic region (this value reflects the false-negative rate of the assay) | Diagnostic sensitivity = TP/(TP + FN) × 100 | Definition of true-positive result | Clustering of related samples (no. of validation samples with clustering results matching the reference) | No. of correctly identified alleles | NA | NA |
| | | | Definition of false-negative result | No. of validation samples that clustered together with samples genetically distant according to the reference tree | No. of unidentified or misidentified alleles in validation samples | NA | NA |
| | | | Single test unit | Individual sample clustering | Allele | NA | NA |
| | | | Diagnostic sensitivity | 100% | 100% | NA | NA |
| Diagnostic specificity | Probability that a WGS assay will not detect sequence variations when none are present within the analyzed genomic region (this value reflects an assay's false-positive rate) | Diagnostic specificity = TN/(TN + FP) × 100 | Definition of true-negative result | No clustering between unrelated samples (no. of validation samples with clustering results matching the reference) | No. of unidentified alleles in negative-control samples | NA | NA |
| | | | Definition of false-positive result | No. of validation samples that failed to cluster together with samples genetically similar according to the reference tree | No. of identified alleles in negative-control samples | NA | NA |
| | | | Single test unit | Individual sample clustering | Allele | NA | NA |
| | | | Diagnostic specificity | 100% | 100% | NA | NA |
| Reportable range | Region of the genome in which a sequence of an acceptable quality can be derived by the laboratory assay | NA | | Genome-wide hqSNPs | Housekeeping genes in MLST scheme | 16S rRNA gene | Genes in ResFinder database |
See details in Document S1 in the supplemental material. Abbreviations: TP, true-positive results; TN, true-negative results; FP, false-positive results; FN, false-negative results; LOD, limit of detection; LODSNP, limit of SNP detection; ID, identification; NA, the parameter was not defined for the given assay.
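The formulas in the table above are simple ratios, so they can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names and the example inputs (a 5,000,000-bp covered genome with 3 SNP differences) are assumptions chosen for demonstration and are not taken from the study.

```python
def _percentage(numerator, denominator):
    """Return numerator/denominator as a percentage, rejecting empty denominators."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator * 100


def platform_agreement(covered_genome_length, snps_vs_reference):
    """% agreement with reference = (covered length - SNPs) / covered length x 100."""
    return _percentage(covered_genome_length - snps_vs_reference,
                       covered_genome_length)


def proportion_in_agreement(n_in_agreement, n_total):
    """Shared form of the accuracy, repeatability, and reproducibility formulas."""
    return _percentage(n_in_agreement, n_total)


def diagnostic_sensitivity(tp, fn):
    """Diagnostic sensitivity = TP / (TP + FN) x 100."""
    return _percentage(tp, tp + fn)


def diagnostic_specificity(tn, fp):
    """Diagnostic specificity = TN / (TN + FP) x 100."""
    return _percentage(tn, tn + fp)


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    print(platform_agreement(5_000_000, 3))
    print(proportion_in_agreement(7, 7))
    print(diagnostic_sensitivity(tp=9, fn=0))
    print(diagnostic_specificity(tn=2, fp=0))
```

Accuracy, repeatability, and reproducibility share one helper because the table defines each as a count in agreement divided by a total count; only the choice of test unit (sample clustering, allele, 16S rRNA ID, SNP) differs between assays.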
FIG 1. Summary of WGS validation. The estimated performance parameters are shown in blue boxes. The components of WGS accuracy determined in this study are shown in purple boxes. The WGS assays evaluated in order to deduce the corresponding performance parameters are shown in green boxes. Percentages alongside the boxes represent values measured during this validation for the corresponding parameters.
Summary of data from previous studies used for validation of the bioinformatics pipeline
| Study parameter | Value for MRSA study | Value for Salmonella study |
|---|---|---|
| Microorganism | Methicillin-resistant Staphylococcus aureus | |
| Source of isolates | Human | Human |
| No. of isolates analyzed | 7 outbreak isolates (1 outbreak cluster) + 2 epidemiologically unrelated isolates | 9 outbreak isolates (4 outbreak clusters) + 2 epidemiologically unrelated isolates |
| Type of outbreak | Hospital-associated outbreak | Foodborne outbreaks |
| Samples used for validation | P1, P2, P3, P4, P16, P21, and P25; an isolate identified by infection control investigation as belonging to nonoutbreak ST1; a MRSA isolate identified by searching a microbiology database as belonging to nonoutbreak ST772 | 0803T57157, 0808S61603, 0808F31478, 0903R11327, 0811R10987, 0804R9234, 0810R10649, 0901M16079, 0110T17035, 1005R12913, and 1006R12965 |
| GenBank accession no. of corresponding samples | ||
| No. of clusters in the study tree | 1 | 4 |
| No. of clusters in the validation tree | 1 | 4 |
| No. of outbreak isolates in each cluster in the study tree | 7 for cluster 1 | 2 for cluster 1, 3 for cluster 2, 2 for cluster 3, and 2 for cluster 4 |
| No. of outbreak isolates in each cluster in the validation tree | 7 for cluster 1 | 2 for cluster 1, 3 for cluster 2, 2 for cluster 3, and 2 for cluster 4 |
| No. of epidemiologically unrelated isolates in the set | 2 | 2 |
| No. of epidemiologically unrelated isolates that clustered with outbreak isolates | 0 | 0 |
| % agreement {[(no. of outbreak isolates clustered correctly in the validation tree) × 100]/(total no. of outbreak isolates that clustered together in the study tree)} | (7 × 100/7) = 100 | (9 × 100/9) = 100 |
MRSA study: see reference 38. ST1, sequence type 1; ST772, sequence type 772.
Salmonella study: see reference 39.
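The % agreement row in the table above can be reproduced with a small partition comparison. The sketch below is one possible reading of "clustered correctly" and not the method used in the study: it counts an outbreak isolate as correct when the set of isolates sharing its cluster in the validation tree equals the set sharing its cluster in the study tree. The isolate labels (`S1` to `S9`, `U1`, `U2`) are placeholders, because the table gives cluster sizes (2, 3, 2, and 2) but not which sample belongs to which cluster.

```python
def cluster_members(assignments):
    """Group isolate names by cluster label: {label: {isolates}}."""
    clusters = {}
    for isolate, label in assignments.items():
        clusters.setdefault(label, set()).add(isolate)
    return clusters


def percent_cluster_agreement(study, validation):
    """Percentage of outbreak isolates whose cluster-mates match in both trees.

    `study` maps outbreak isolates to their cluster in the published tree.
    `validation` maps all isolates (outbreak and unrelated) to their cluster
    in the validation tree. Cluster labels need not match between the two.
    """
    if not study:
        raise ValueError("study tree has no outbreak isolates")
    study_clusters = cluster_members(study)
    validation_clusters = cluster_members(validation)
    correct = 0
    for isolate, label in study.items():
        expected = study_clusters[label]
        observed = validation_clusters.get(validation.get(isolate), set())
        if expected == observed:
            correct += 1
    return correct * 100 / len(study)


# Placeholder layout matching the cluster sizes reported for the second study:
# 9 outbreak isolates in clusters of 2, 3, 2, and 2, plus 2 unrelated isolates.
study_tree = {"S1": 1, "S2": 1, "S3": 2, "S4": 2, "S5": 2,
              "S6": 3, "S7": 3, "S8": 4, "S9": 4}
validation_tree = {"S1": "A", "S2": "A", "S3": "B", "S4": "B", "S5": "B",
                   "S6": "C", "S7": "C", "S8": "D", "S9": "D",
                   "U1": "E", "U2": "F"}

if __name__ == "__main__":
    print(percent_cluster_agreement(study_tree, validation_tree))
```

Comparing sets of cluster-mates rather than cluster labels means an unrelated isolate that joins an outbreak cluster in the validation tree also lowers the score, which matches the table's separate check that no epidemiologically unrelated isolate clustered with outbreak isolates.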
FIG 2. Bioinformatics pipeline validation with outbreak isolates from two previously published studies. (A) Phylogenetic tree of outbreak isolates reported in the “MRSA study” by Harris et al. (38). The isolates from the MRSA study that were picked for validation are indicated by arrows and numbers assigned for purposes of validation (1 to 7). (B) Phylogenetic tree validation using the samples from the MRSA study and the validation bioinformatics pipeline. The same isolates in the original tree and the validation tree are marked with the same numbers. (C) Comparison of the group of related isolates (isolates 1 to 7) from the MRSA study with epidemiologically unrelated isolates from the same study using the validation bioinformatics pipeline. (D) Phylogenetic tree combining epidemiologically related and nonrelated isolates reported in the “Salmonella study” by Leekitcharoenphon et al. (39). The isolates from the Salmonella study that were picked for validation are marked with green node circles and have the numbers 1 to 11 assigned for purposes of validation. Epi, epidemiologically. (E) Validation phylogenetic tree generated for the samples from the Salmonella study using the in-house bioinformatics pipeline. The same isolates in the tree from the Salmonella study and the validation tree are marked with the same numbers.
FIG 3. WGS quality control scheme. The preanalytical, analytical, and postanalytical steps of WGS are shown on the left in orange boxes. The QC metrics that are being evaluated at each step are presented in light green boxes. Three vertical blocks represent the types of samples to which major QC metrics are being applied: test samples, positive controls, and negative controls. The overlap between the boxes designating QC metrics and the vertical blocks shows which QC metrics are applicable to which of the three types of samples.
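One QC metric that follows directly from the reported results is coverage relative to the 60× limit of detection for SNPs. The sketch below is an assumption-laden illustration: the source does not state how coverage was computed, so it uses the common estimate of mean depth as total sequenced bases divided by genome length. The function names and the example read counts are hypothetical.

```python
LOD_SNP_COVERAGE = 60.0  # limit of detection for SNPs reported in the abstract (60x)


def estimate_mean_depth(read_count, read_length, genome_length):
    """Common estimate of mean depth: total sequenced bases / genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return read_count * read_length / genome_length


def meets_snp_lod(mean_depth, lod=LOD_SNP_COVERAGE):
    """True when coverage is at or above the SNP limit of detection."""
    return mean_depth >= lod


if __name__ == "__main__":
    # Hypothetical run: 1,000,000 reads of 250 bp over a 5,000,000-bp genome.
    depth = estimate_mean_depth(1_000_000, 250, 5_000_000)
    print(depth, meets_snp_lod(depth))
```

Under this check the lowest coverage reported in the abstract (15.71×) would fall below the SNP limit of detection, while the median (71.55×) would pass.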
List of strains used for validation and corresponding reference materials
| MDL ID | Species | Reference genome: NCBI strain | Reference genome: NCBI accession no. |
|---|---|---|---|
| C1 | | | |
| C3 | | | |
| C55 | | | |
| C4 | | | |
| C6 | | | |
| C5 | | | |
| C46 | | | |
| C47 | | | |
| C48 | | | |
| C49 | | ATCC 700669 | |
| C50 | | FRD1 | |
| C51 | | | |
| C52 | | | |
| C53 | | ATCC 25240 | |
| C54 | | PKAB07 | |
| C103 | | 638R | |
| C104 | | KR494 | |
| C2 | | | |
| C105 | | | |
| C106 | | MS11 | |
| C56 | | H37Rv | |
| C57 | | H37Rv | |
| C58 | | H37Rv | |
| C59 | | H37Rv | |
| C61 | | H37Rv | |
| C65 | | H37Rv | |
| C67 | | H37Rv | |
| C68 | | H37Rv | |
| C69 | | H37Rv | |
Boldface type indicates reference strains for which genomes are available from the NCBI database. Lightface type indicates cases where the genome is not available from the NCBI database and an alternative reference genome was used for mapping. MDL, Microbial Diseases Laboratory, California Department of Public Health, Richmond, CA.
List of strains used for validation and corresponding reference materials available from the CDC
| MDL ID | Species | Reference raw reads generated by the CDC | Reference genome used for mapping | ||
|---|---|---|---|---|---|
| CDC strain | GenBank accession no. | NCBI strain | NCBI accession no. | ||
| C72 | 2011C-3493 | ||||
| C73 | P125109 | ||||
| C74 | 1326/28 | ||||
| C75 | P125109 | ||||
| C76 | P125109 | ||||
| C77 | 14028S | ||||
Boldface type indicates the reference strains for which genomes are available from the NCBI database. MDL, Microbial Diseases Laboratory, California Department of Public Health, Richmond, CA.
Sample C77 was sequenced by the MDL only for genotyping assay accuracy validation. No replicates were done.