Nathan D Olson, Justin M Zook, Daniel V Samarov, Scott A Jackson, Marc L Salit.
Abstract
The rapid adoption of microbial whole genome sequencing in public health, clinical testing, and forensic laboratories requires the use of validated measurement processes. Well-characterized, homogeneous, and stable microbial genomic reference materials can be used to evaluate measurement processes, improving confidence in microbial whole genome sequencing results. We have developed a reproducible and transparent bioinformatics tool, PEPR, Pipelines for Evaluating Prokaryotic References, for characterizing the reference genome of prokaryotic genomic materials. PEPR evaluates the quality, purity, and homogeneity of the reference material genome, and purity of the genomic material. The quality of the genome is evaluated using high coverage paired-end sequence data; coverage, paired-end read size and direction, as well as soft-clipping rates, are used to identify mis-assemblies. The homogeneity and purity of the material relative to the reference genome are characterized by comparing base calls from replicate datasets generated using multiple sequencing technologies. Genomic purity of the material is assessed by checking for DNA contaminants. We demonstrate the tool and its output using sequencing data generated while developing a Staphylococcus aureus candidate genomic reference material. PEPR is open source and available at https://github.com/usnistgov/pepr.
Keywords: Bioinformatics; Microbiology; Whole genome sequencing
Year: 2016 PMID: 26935931 PMCID: PMC4819933 DOI: 10.1007/s00216-015-9299-5
Source DB: PubMed Journal: Anal Bioanal Chem ISSN: 1618-2642 Impact factor: 4.142
Fig. 1 PEPR workflow. White objects are pipeline inputs, grey objects are the three pipeline components, and light blue objects are the pipeline products
Summary of sequencing datasets
| Acc. | Plat. | Vial | Lib. | Reads | Length (bp) | Insert (bp) | Cov. |
|---|---|---|---|---|---|---|---|
| SRR1979039 | miseq | 0 | 1 | 3305082 | 230 | 257 | 247 |
| SRR1979040 | miseq | 0 | 2 | 3732088 | 216 | 233 | 263 |
| SRR1979041 | miseq | 1 | 1 | 3973320 | 218 | 242 | 279 |
| SRR1979042 | miseq | 1 | 2 | 3941040 | 223 | 247 | 285 |
| SRR1979043 | miseq | 2 | 1 | 3442554 | 234 | 268 | 261 |
| SRR1979070 | miseq | 2 | 2 | 3226726 | 232 | 268 | 240 |
| SRR1979044 | miseq | 3 | 1 | 3025028 | 233 | 264 | 229 |
| SRR1979045 | miseq | 3 | 2 | 4796382 | 200 | 210 | 303 |
| SRR1979046 | miseq | 4 | 1 | 3338456 | 239 | 278 | 260 |
| SRR1979047 | miseq | 4 | 2 | 2995090 | 237 | 277 | 231 |
| SRR1979048 | miseq | 5 | 1 | 3495384 | 225 | 255 | 255 |
| SRR1979049 | miseq | 5 | 2 | 3116128 | 241 | 281 | 244 |
| SRR1979050 | miseq | 6 | 1 | 3129282 | 237 | 271 | 240 |
| SRR1979060 | miseq | 6 | 2 | 2976312 | 242 | 280 | 233 |
| SRR1979064 | miseq | 7 | 1 | 2630544 | 241 | 283 | 204 |
| SRR1979065 | miseq | 7 | 2 | 3416580 | 225 | 248 | 247 |
| SRR2002412 | pgm | 0 | 1 | 556903 | 231 | | 42 |
| SRR2002413 | pgm | 1 | 1 | 530117 | 224 | | 38 |
| SRR2002414 | pgm | 2 | 1 | 437527 | 231 | | 33 |
| SRR2002415 | pgm | 3 | 1 | 552692 | 232 | | 42 |
| SRR2002416 | pgm | 4 | 1 | 498479 | 232 | | 37 |
| SRR2002418 | pgm | 5 | 1 | 390070 | 235 | | 30 |
| SRR2002419 | pgm | 6 | 1 | 426196 | 232 | | 32 |
| SRR2002420 | pgm | 7 | 1 | 439119 | 238 | | 34 |
| SRR2056302 | pacbio | 9 | 1 | 163475 | 10510 | | 108 |
| SRR2056306 | pacbio | 9 | 2 | 163471 | 10436 | | 103 |
| SRR2056310 | pacbio | 9 | 3 | 163474 | 9863 | | 91 |
Acc. - Sequence read archive (SRA) database accessions. Plat. - sequencing platform, miseq: Illumina MiSeq, pgm: Ion Torrent PGM, pacbio: Pacific Biosciences RSII. Lib. - library replicate number for miseq and pgm, smartcell replicate for pacbio. Reads - number of sequencing reads in the dataset. Length - median read length in base pairs. Insert - median insert size in base pairs for paired-end reads. Cov. - median sequence coverage across the genome
Fig. 2 Comparison of base purity values for PGM and MiSeq. Positions are colored based on high and low purity values for the two sequencing platforms, MiSeq - Illumina MiSeq and PGM - Ion Torrent PGM. A purity value of 0.99 was used to differentiate between high and low purity positions. Positions with high purity for both platforms were excluded from the figure
Number of genome positions with high and low purity (purity metric values higher and lower than 0.99, respectively) for the Illumina MiSeq and Ion Torrent PGM sequencing platforms
| | PGM-High | PGM-Low |
|---|---|---|
| MiSeq-High | 2864925 | 44534 |
| MiSeq-Low | 394 | 115 |
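The high/low grouping in the table above can be sketched in a few lines of Python. This is a minimal illustration, not PEPR's implementation: the function names `purity_group` and `tally` and the example purity values are hypothetical, and treating a value of exactly 0.99 as low is an assumption, since the source only distinguishes values higher and lower than 0.99. The reported counts are copied from the table.

```python
from collections import Counter

PURITY_THRESHOLD = 0.99  # cutoff stated in the figure captions


def purity_group(miseq_purity, pgm_purity, threshold=PURITY_THRESHOLD):
    """Label one genome position as high or low purity on each platform."""
    # Assumption: a value exactly equal to the threshold is counted as low.
    miseq = "MiSeq-High" if miseq_purity > threshold else "MiSeq-Low"
    pgm = "PGM-High" if pgm_purity > threshold else "PGM-Low"
    return miseq, pgm


def tally(purities):
    """Count positions in each of the four purity groups."""
    return Counter(purity_group(m, p) for m, p in purities)


# Hypothetical (MiSeq, PGM) purity values for four positions.
example = [(0.999, 0.998), (0.999, 0.95), (0.97, 0.995), (0.90, 0.80)]
example_counts = tally(example)

# Counts reported in the table, used to derive the share of each group.
reported = {
    ("MiSeq-High", "PGM-High"): 2864925,
    ("MiSeq-High", "PGM-Low"): 44534,
    ("MiSeq-Low", "PGM-High"): 394,
    ("MiSeq-Low", "PGM-Low"): 115,
}
total_positions = sum(reported.values())
fractions = {group: n / total_positions for group, n in reported.items()}

for group, share in fractions.items():
    print(group, f"{share:.5f}")
```

With the reported counts, most positions fall in the group that is high purity on both platforms, and positions that are low purity on PGM only outnumber those that are low purity on MiSeq only.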
Fig. 3 Distribution of genome positions by purity group. Bases with high and low purity have purity values greater than and less than 0.99, respectively, for the two platforms, MiSeq - Illumina MiSeq and PGM - Ion Torrent PGM. Positions with high purity for both platforms were excluded from the figure
Pairwise variant analysis results
| Position | Proportion of Pairs | Median Frequency | Minimum P-value | N Significant |
|---|---|---|---|---|
| 244332 | 0.01 | 21.31 | 0.51 | 0.00 |
| 2615986 | 0.03 | 20.48 | 0.45 | 0.00 |
| 2616058 | 0.08 | 25.29 | 0.15 | 0.00 |
| 2619808 | 0.01 | 20.78 | 0.61 | 0.00 |
| 2619886 | 0.01 | 21.54 | 0.50 | 0.00 |
Position is the position in the genome where differences in variant frequency for at least one of the 16 pairwise comparisons were reported. Proportion of pairs is the fraction of the pairwise comparisons between the 16 Illumina MiSeq datasets where VarScan reported a difference in variant frequency. Median frequency is the median variant frequency for datasets with reported difference at that genome position. Minimum p-value is the lowest p-value reported by VarScan for all pairwise dataset comparisons with reported differences in variant frequency. N Significant is the number of datasets with reported statistically significant differences at that genome position
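The Proportion of Pairs column depends on how many pairwise comparisons were run. The sketch below assumes that every unordered pair of the 16 MiSeq datasets is compared once; the legend does not spell out the pairing scheme, so this is an assumption rather than a description of the PEPR pipeline. The accessions are taken from the dataset summary table, and `proportion_of_pairs` is a hypothetical helper.

```python
from itertools import combinations

# MiSeq accessions from the dataset summary table.
miseq_accessions = [
    "SRR1979039", "SRR1979040", "SRR1979041", "SRR1979042",
    "SRR1979043", "SRR1979070", "SRR1979044", "SRR1979045",
    "SRR1979046", "SRR1979047", "SRR1979048", "SRR1979049",
    "SRR1979050", "SRR1979060", "SRR1979064", "SRR1979065",
]

# Assumption: each unordered pair of datasets is compared exactly once.
pairs = list(combinations(miseq_accessions, 2))
n_pairs = len(pairs)


def proportion_of_pairs(n_with_difference, n_total):
    """Fraction of pairwise comparisons reporting a frequency difference."""
    return n_with_difference / n_total


print(n_pairs)
print(round(proportion_of_pairs(1, n_pairs), 2))
```

Under this assumption, a single comparison reporting a difference corresponds to a proportion that rounds to 0.01.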
Fig. 4 Breakdown of contaminants by organism