Darlene D Wagner, Heather A Carleton, Eija Trees, Lee S Katz.
Abstract
BACKGROUND: Whole genome sequencing (WGS) has gained increasing importance in responses to enteric bacterial outbreaks. Common analysis procedures for WGS, namely single nucleotide polymorphism (SNP) calling and genome assembly, are highly dependent upon WGS data quality.
Keywords: Assembly; Multiheal; Read cleaning; Read healing; SNP
Year: 2021 PMID: 34900416 PMCID: PMC8627651 DOI: 10.7717/peerj.12446
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Read quality metrics: R1/R2 PHRED quality, median insert lengths, and percentage of R1 + R2 reads with Ns.
Ranges of average read quality scores and insert lengths across isolates; medians are shown in parentheses. Results of healing pipelines are shown per data set. The noNmin100 pipeline is not shown because its results did not differ significantly from those of raw reads.
|
| R1 qual. | 33.0–36.1 (34.8) | 32.5–35.7 (34.2) | 32.2–36.9 (35.3) | 34.2–36.8 (35.9) |
| R2 qual. | 26.0–33.5 (30.3) | 27.5–32.8 (30.2) | 26.1–35.3 (32.2) | 30.2–35.4 (33.9) |
| insert (bp) | 161–371 (264) | 175–531 (290) | 184–556 (383) | 212–492 (309) |
| R1 + R2 Ns % | 0.0–5.1 | 0.0–4.6 | 0.0–31.9 | 0.0–12.2 |
|
| R1 qual. | 33.1–36.1 (34.8) | 32.8–35.3 (34.4) | 32.3–36.9 (35.3) | 34.3–36.9 (35.9) |
| R2 qual. | 26.5–33.8 (30.8) | 27.6–32.8 (30.4) | 26.8–35.6 (32.6) | 30.5–35.6 (34.2) |
| insert (bp) | 171–372 (268) | 176–532 (293) | 198–558 (389) | 213–493 (310) |
| R1 + R2 Ns % | 0.0–4.7 | 0.0–4.5 | 0.0–31.9 | 0.0–3.2 |
|
| R1 qual. | 34.4–36.7 (35.7) | 34.5–36.7 (35.4) | 33.9–37.2 (36.0) | 35.0–37.2 (36.5) |
| R2 qual. | 30.8–35.0 (32.9) | 31.4–34.6 (33.2) | 30.8–36.2 (34.2) | 33.0–36.2 (35.3) |
| insert (bp) | 192–382 (283) | 187–539 (296) | 229–564 (406) | 219–494 (312) |
| R1 + R2 Ns % | 0.0–3.5 | 0.0–3.2 | 0.0–1.5 | 0.0–3.7 |
|
| R1 qual. | 34.5–36.7 (35.8) | 34.5–36.7 (35.5) | 34.1–37.2 (36.1) | 35.0–37.2 (36.5) |
| R2 qual. | 31.1–35.2 (33.2) | 31.6–34.8 (33.4) | 31.3–36.3 (34.4) | 33.1–36.3 (35.4) |
| insert (bp) | 192–383 (283) | 187–546 (296) | 229–564 (406) | 220–494 (312) |
| R1 + R2 Ns % | 0.0–3.4 | 0.0–3.2 | 0.0–1.5 | 0.0–2.3 |
|
| R1 qual. | 34.5–36.7 (35.8) | 34.6–36.8 (35.6) | 34.1–37.2 (36.1) | 35.0–37.3 (36.5) |
| R2 qual. | 31.1–35.2 (33.2) | 31.6–35.2 (33.6) | 31.3–36.3 (34.4) | 33.1–36.3 (35.5) |
| insert (bp) | 190–380 (281) | 183–538 (293) | 227–562 (404) | 218–492 (310) |
| R1 + R2 Ns % | 0.0–3.4 | 0.0–3.1 | 0.0–1.4 | 0.0–2.3 |
|
| R1 qual. | 33.1–36.1 (34.9) | 32.6–35.7 (34.2) | 32.3–36.9 (35.3) | 34.2–36.9 (35.9) |
| R2 qual. | 26.8–33.6 (30.5) | 27.6–32.8 (30.3) | 26.3–35.4 (32.3) | 30.4–35.5 (34.0) |
| insert (bp) | 189–412 (290) | 182–540 (300) | 227–587 (401) | 216–499 (314) |
| R1 + R2 Ns % | 0.0–1.7 | 0.0–0.6 | 0.0–32.0 | 0.0–6.5 |
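The read-level metrics in the table above (average PHRED quality and the percentage of reads containing Ns) can be illustrated with a minimal Python sketch. It assumes four-line FASTQ records with PHRED+33 encoding and averages quality over all bases; the function names `read_fastq` and `read_metrics` are hypothetical, and this is not necessarily how the study computed its values.

```python
import io


def read_fastq(handle):
    """Yield (sequence, quality_string) for each four-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield seq, qual


def read_metrics(handle, offset=33):
    """Return (mean PHRED score over all bases, percent of reads with an N).

    Assumption: PHRED+33 encoding, as used by current Illumina pipelines.
    """
    total_q = total_bases = n_reads = reads_with_n = 0
    for seq, qual in read_fastq(handle):
        total_q += sum(ord(c) - offset for c in qual)
        total_bases += len(qual)
        n_reads += 1
        reads_with_n += "N" in seq.upper()
    if n_reads == 0 or total_bases == 0:
        return 0.0, 0.0
    return total_q / total_bases, 100.0 * reads_with_n / n_reads


# Two toy reads: 'I' encodes Q40 and '#' encodes Q2.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACNT\n+\nII#I\n")
mean_q, pct_n = read_metrics(example)
print(mean_q, pct_n)
```

Running the same function separately on the R1 and R2 files of an isolate would give per-direction averages comparable in kind to the "R1 qual." and "R2 qual." rows.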
Figure 1. Illumina Read Chemistries Used in the Study.
Forward reads (R1, in blue) and reverse reads (R2, in red) with the range of lengths and average lengths found in raw reads. Insert sizes were inferred from SMALT mapping to draft genome assemblies and are given here as per-isolate median lengths between the 5′ position of R1 and the 5′ position of R2.
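The insert-size definition in this caption (distance between the 5′ positions of R1 and R2, summarized as a per-isolate median) can be sketched as follows. The tuple layout `(start, end, is_reverse)` and the helper names are assumptions made for illustration; coordinates are taken to be 1-based and inclusive, and the sketch does not reproduce SMALT's own output format.

```python
from statistics import median


def five_prime(start, end, is_reverse):
    """Reference coordinate of a read's 5' end.

    A forward-strand read has its 5' end at its leftmost aligned base,
    a reverse-strand read at its rightmost aligned base.
    """
    return end if is_reverse else start


def insert_length(r1, r2):
    """Inclusive span between the 5' ends of R1 and R2.

    Each read is a hypothetical (start, end, is_reverse) tuple.
    """
    return abs(five_prime(*r2) - five_prime(*r1)) + 1


# Toy mapped pairs for one isolate (values are illustrative only).
pairs = [
    ((100, 349, False), (214, 463, True)),
    ((500, 749, False), (640, 889, True)),
    ((1200, 1449, True), (1050, 1299, False)),
]
median_insert = median(insert_length(r1, r2) for r1, r2 in pairs)
print(median_insert)
```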
Figure 2. MSA Lengths from Unambiguous SNPs.
(A) SNPs for E. coli O26 (Cluster 1) with Lyve-SET (dark gray bars) and CFSAN_SNP_Pipeline (light gray bars). (B) SNPs for Salmonella enterica Reading (Cluster 2) with Lyve-SET and CFSAN_SNP_Pipeline. (C) SNPs for S. enterica Pomona (Cluster 3) with Lyve-SET and CFSAN_SNP_Pipeline. (D) SNPs for Shigella sonnei (Cluster 4) with Lyve-SET and CFSAN_SNP_Pipeline. For both Lyve-SET and CFSAN_SNP_Pipeline across all four clusters, SNPs were counted from final multiple sequence alignments including positions with ambiguous nucleotides.
Figure 3. Read Healing Effects on SNP Identification.
ROC-like plots of unique Lyve-SET SNPs (estimated false discovery rate in Data S5) compared to detected concordant SNPs or True Positive Rate (estimated sensitivity in Data S5). (A) E. coli O26 (Cluster 1). (B) S. enterica serovar Reading (Cluster 2). (C) S. enterica serovar Pomona (Cluster 3). (D) S. sonnei (Cluster 4). Estimated false discovery and True Positive Rate for the CFSAN SNP Pipeline are plotted in Fig. S5.
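As a rough illustration of the two axes in these plots, the sketch below derives a false discovery proxy from SNPs unique to one pipeline and a sensitivity proxy from concordant SNPs. The exact estimators used in the study are given in Data S5 and may differ; the function `snp_rates`, the `expected` reference set, and the formulas here are assumptions for illustration only.

```python
def snp_rates(lyveset, cfsan, expected):
    """Return (estimated FDR, estimated sensitivity) for one read set.

    Assumptions (not taken from the source):
      * SNP sites called by both pipelines are treated as concordant.
      * Sites called only by Lyve-SET are treated as likely false discoveries.
      * `expected` is a hypothetical set of sites regarded as true SNPs.
    """
    concordant = lyveset & cfsan
    unique = lyveset - cfsan
    fdr = len(unique) / len(lyveset) if lyveset else 0.0
    sensitivity = len(concordant & expected) / len(expected) if expected else 0.0
    return fdr, sensitivity


# Toy SNP positions (illustrative values, not study data).
lyveset_sites = {10, 25, 40, 77, 91}
cfsan_sites = {10, 25, 40, 60}
expected_sites = {10, 25, 40, 60}
fdr, sensitivity = snp_rates(lyveset_sites, cfsan_sites, expected_sites)
print(fdr, sensitivity)
```

Computing one such pair per healing pipeline and plotting sensitivity against the false discovery proxy would give a ROC-like point cloud of the kind the caption describes.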
Average values of SPAdes assembly quality metrics.
Boldface underlined values indicate improvement over raw read scores by a one-sided Dunn post-hoc test at the α = 0.05 level of significance (with Benjamini-Hochberg correction). Reads healed through the noNmin100 and fastxOnly-3pr pipelines did not produce assemblies with statistically significant improvements in any metric and are not shown.
| Metric | Raw Reads | noNmin100-3pr | prinseq | prinseq-5pr3pr | prinseq-3pr | bayesHammer |
|---|---|---|---|---|---|---|
| Contigs (1.95 × 10⁻¹⁰) | 330.5 | 318.6 | | | | |
| N50 (0.01364) | 81,036.8 | 84,793.3 | 92,749.2 | 92,574.8 | 92,391.1 | |
| Maximum contig (0.06474) | 218,070.1 | 213,638.3 | 245,939.0 | 239,557.8 | 250,158.6 | 251,436.1 |
| Contigs (0.6353) | 96.0 | 91.8 | 77.5 | 76.8 | 76.5 | 86.0 |
| N50 (0.9972) | 243,104.2 | 265,306.8 | 270,566.3 | 277,891.2 | 270,003.6 | 253,005.8 |
| Maximum contig (0.9552) | 552,949.7 | 625,007.5 | 653,148.7 | 658,147.9 | 602,089.9 | 581,313.7 |
| Contigs (0.506) | 44.5 | 44.2 | 35.5 | 35.3 | 34.7 | 35.6 |
| N50 (0.9714) | 412,213.0 | 431,232.0 | 412,871.0 | 431,778.0 | 465,935.0 | 431,969 |
| Maximum contig (0.5936) | 911,229.0 | 838,277.0 | 1,000,451.0 | 1,019,369.0 | 1,023,669.0 | 1,039,598 |
| Contigs (0.5547) | 449.8 | 448.2 | 445.1 | 448.0 | 446.2 | 443.4 |
| N50 (0.8932) | 23,931.1 | 24,054.5 | 24,117.1 | 23,931.1 | 23,929.4 | 24,009.8 |
| Maximum contig (0.999) | 89,893.8 | 89,378.5 | 90,046.7 | 89,870.1 | 89,632.6 | 89,352.8 |
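For readers unfamiliar with the three assembly metrics tabulated above, the following sketch computes the contig count, N50, and maximum contig length from a list of contig lengths. It uses the standard N50 definition (the length of the contig at which the cumulative sum of contigs, sorted longest first, reaches half the assembly size); the function name `assembly_metrics` is hypothetical, and the study's own values came from its assembly reports rather than from this code.

```python
def assembly_metrics(contig_lengths):
    """Return contig count, N50, and maximum contig length for one assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        return {"contigs": 0, "n50": 0, "max_contig": 0}
    half_total = sum(lengths) / 2
    running = 0
    n50 = lengths[-1]
    for length in lengths:
        running += length
        if running >= half_total:
            # First contig at which the cumulative length covers half the assembly.
            n50 = length
            break
    return {"contigs": len(lengths), "n50": n50, "max_contig": lengths[0]}


# Toy assembly: total length 300, so N50 is reached at the second contig.
metrics = assembly_metrics([20, 100, 60, 80, 40])
print(metrics)
```

Averaging these per-isolate values within a cluster and pipeline would yield figures of the same kind as the table entries.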