Seth Sims, Atkinson G Longmire, David S Campo, Sumathi Ramachandran, Magdalena Medrzycki, Lilia Ganova-Raeva, Yulin Lin, Amanda Sue, Hong Thai, Alexander Zelikovsky, Yury Khudyakov.
Abstract
BACKGROUND: Molecular surveillance and outbreak investigation are important for elimination of hepatitis C virus (HCV) infection in the United States. A web-based system, Global Hepatitis Outbreak and Surveillance Technology (GHOST), has been developed using Illumina MiSeq-based amplicon sequence data derived from the HCV E1/E2-junction genomic region to enable public health institutions to conduct cost-effective and accurate molecular surveillance, outbreak detection and strain characterization. However, as there are many factors that could impact input data quality to which the GHOST system is not completely immune, the accuracy of epidemiological inferences generated by GHOST may be affected. Here, we analyze the data submitted to the GHOST system during its pilot phase to assess the nature of the data and to identify common quality concerns that can be detected and corrected automatically.
Keywords: HCV; HVR1; Molecular surveillance; Outbreak detection; Quality control; Transmission
Year: 2018 PMID: 30343674 PMCID: PMC6196402 DOI: 10.1186/s12859-018-2329-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
GHOST QC filters listed in order of execution. All filters except for “Primer dimer” are discussed in detail in Longmire et al., [12]
| Order | Filter name | Description | Position relative to |
|---|---|---|---|
| 1 | Ambiguity | After standard demultiplexing, read pairs are filtered out if a read has more than three N’s. | Before |
| 2 | Primer dimer | Checks for the existence of primer dimers or non-specific product. For each read pair, this filter inspects the forward read for the forward primer using the same search parameters as the filter dedicated to primer verification and read orientation. Once found, the reverse complement is searched for the reverse primer. If the distance between the forward and reverse primers is found to be less than the threshold set in the filter dedicated to read length (185 bp), the pair is discarded. If both primers are not found, the process is repeated with the reverse read. | Before |
| 3 | Short read | Read pairs are filtered out if either read has a length less than 185 bp. | Before |
| 4 | MID mismatch | Each identifier on both forward and reverse reads is examined, and the pair is discarded if either identifier is not an exact match to an entry in a given list of valid identifiers. | Before |
| 5 | Minority MID | Pairs containing valid identifiers are discarded if they are not a constituent of the majority identifier tuple. If 25% or more of the read pairs are found to contain valid identifiers that are not the majority tuple, the entire sample is discarded from analysis without further processing. | Before |
| 6 | Primer verification | Primer sequence patterns are searched for in the forward and reverse reads. Primer sequences are located in each read using fuzzy matching that allows at most 2 substitutions, at most 1 insertion (relative to the reference), at most 1 deletion (relative to the reference), and at most 3 errors in total. Read pairs in which either primer cannot be found are discarded. The primer locations are used to orient the reads into a uniform orientation. | After |
| 7 | Casper mismatch | Read pairs are unified into a single error-corrected sequence using the Casper error correction method with a quality threshold of 15, k-mer length of 17, k-mer neighborhood of 8, and minimum match threshold of 95%. Overlap fitness is evaluated by the classical Hamming Distance. The overlap corresponding to the highest ratio of correct positions to overlap length is selected, with the longest overlap being preferred in the event of there being more than one overlap with equal ratios. | After |
| 8 | Nonsense | Merged sequences are discarded if a nonsense-free reading frame cannot be found. | After |
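
As an illustration of how the simplest filters in the table operate, the following minimal Python sketch applies the Ambiguity and Short read rules to read pairs held as plain strings. The function names and the in-memory representation are assumptions made for illustration, not the GHOST implementation, and the Primer dimer filter that runs between these two steps is omitted.

```python
MAX_N = 3          # Ambiguity filter: a read may contain at most three N's
MIN_LENGTH = 185   # Short read filter: both reads must be at least 185 bp


def passes_ambiguity(read: str) -> bool:
    """Return True if the read has no more than MAX_N ambiguous bases."""
    return read.upper().count("N") <= MAX_N


def passes_length(read: str) -> bool:
    """Return True if the read is at least MIN_LENGTH bases long."""
    return len(read) >= MIN_LENGTH


def filter_pairs(pairs):
    """Apply the two filters in order (hypothetical helper).

    Returns the surviving pairs and the number of pairs discarded by each
    filter, so that per-filter counts can be normalized later.
    """
    kept = []
    discarded = {"ambiguity": 0, "short_read": 0}
    for fwd, rev in pairs:
        if not (passes_ambiguity(fwd) and passes_ambiguity(rev)):
            discarded["ambiguity"] += 1
        elif not (passes_length(fwd) and passes_length(rev)):
            discarded["short_read"] += 1
        else:
            kept.append((fwd, rev))
    return kept, discarded
```

Because the filters run in a fixed order, a pair is counted against the first filter it fails, which matches the idea of reporting each filter relative to the read pairs that actually reach it.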
Fig. 1 Sankey diagram of read pair allocation for all samples after deduplication. Arrow thickness represents the proportion of read pairs removed by the filter step. The “not sampled” step represents those reads not used after 20,000 read pair random sampling
Deduplicated data partitioned into 4 categories
| Category | Count |
|---|---|
| Failing non-negative | 225 |
| Passing non-negative | 1750 |
| Failing negative | 87 |
| Passing negative | 25 |
| Total | 2087 |
Fig. 2 Performance of primer dimer filter. a) Mean values of filters before the introduction of the primer dimer filter. b) Mean values of filters after the introduction of the primer dimer filter
Fig. 3 Histogram of primer dimer filter normalized values. Normalization is calculated with respect to the number of read pairs entering into the filter
Fig. 4 Scatter plots of the primer dimer filter compared against 3 other filters
Fig. 5 Mean values (before normalization) of all GHOST QC task filters with respect to the 4 mutually exclusive sample sets
Fig. 6 Boxplots of filter distributions after normalization for all deduplicated samples
Welch’s test for comparison of means
| Comparison | Filter | t | p | Corrected p | Reject |
|---|---|---|---|---|---|
| P vs F | Primer dimer | 2.485304 | 3.220E-02 | 7.210E-01 | FALSE |
| P vs F | Ambiguity | −2.64516 | 8.258E-03 | 2.999E-01 | FALSE |
| P vs F | Short read | 1.227051 | 2.478E-01 | 9.995E-01 | FALSE |
| P vs F | MID mismatch | 1.721249 | 1.159E-01 | 9.828E-01 | FALSE |
| P vs F | Minority MID | 1.491232 | 1.666E-01 | 9.949E-01 | FALSE |
| P vs F | Primer verification | 0.318833 | 7.561E-01 | 1.000E+00 | FALSE |
| P vs F | Casper mismatch | 2.09018 | 6.284E-02 | 9.033E-01 | FALSE |
| P vs F | Nonsense | 0.600825 | 5.611E-01 | 1.000E+00 | FALSE |
| P vs F | raw pairs passed | −1.65813 | 1.279E-01 | 9.875E-01 | FALSE |
| P vs F | r1_maxlength | 0.027582 | 9.785E-01 | 1.000E+00 | FALSE |
| P vs F | r2_maxlength | 0.027582 | 9.785E-01 | 1.000E+00 | FALSE |
| P vs F | r1_numseqs | 0.496099 | 6.303E-01 | 1.000E+00 | FALSE |
| P vs F | r2_numseqs | 0.496099 | 6.303E-01 | 1.000E+00 | FALSE |
| P vs F | r1_minlength | −2.61235 | 2.243E-02 | 6.055E-01 | FALSE |
| P vs F | r2_minlength | −1.40858 | 1.889E-01 | 9.972E-01 | FALSE |
| P vs F | r1_gc | 0.9154 | 3.810E-01 | 1.000E+00 | FALSE |
| P vs F | r2_gc | 1.115745 | 2.897E-01 | 9.999E-01 | FALSE |
| P vs F | r1_qual | −0.7403 | 4.759E-01 | 1.000E+00 | FALSE |
| P vs F | r2_qual | −0.22578 | 8.259E-01 | 1.000E+00 | FALSE |
| PN vs PNN | Primer dimer | −20.708 | 2.919E-67 | 0.000E+00 | TRUE |
| PN vs PNN | Ambiguity | −3.81125 | 1.589E-04 | 7.125E-03 | TRUE |
| PN vs PNN | Short read | −9.98183 | 3.000E-21 | 0.000E+00 | TRUE |
| PN vs PNN | MID mismatch | −20.5454 | 1.447E-66 | 0.000E+00 | TRUE |
| PN vs PNN | Minority MID | −18.9948 | 1.667E-58 | 0.000E+00 | TRUE |
| PN vs PNN | Primer verification | −6.14462 | 1.853E-09 | 9.082E-08 | TRUE |
| PN vs PNN | Casper mismatch | −14.2144 | 3.668E-39 | 0.000E+00 | TRUE |
| PN vs PNN | Nonsense | −6.44656 | 3.119E-10 | 1.559E-08 | TRUE |
| PN vs PNN | raw pairs passed | 15.95968 | 1.747E-47 | 0.000E+00 | TRUE |
| PN vs PNN | r1_maxlength | 3.883107 | 1.138E-04 | 5.446E-03 | TRUE |
| PN vs PNN | r2_maxlength | 3.883107 | 1.138E-04 | 5.446E-03 | TRUE |
| PN vs PNN | r1_numseqs | 2.021773 | 4.358E-02 | 8.161E-01 | FALSE |
| PN vs PNN | r2_numseqs | 2.021773 | 4.358E-02 | 8.161E-01 | FALSE |
| PN vs PNN | r1_minlength | 0.927906 | 3.538E-01 | 1.000E+00 | FALSE |
| PN vs PNN | r2_minlength | −0.5458 | 5.854E-01 | 1.000E+00 | FALSE |
| PN vs PNN | r1_gc | −0.26695 | 7.896E-01 | 1.000E+00 | FALSE |
| PN vs PNN | r2_gc | −0.90928 | 3.635E-01 | 1.000E+00 | FALSE |
| PN vs PNN | r1_qual | 0.559492 | 5.760E-01 | 1.000E+00 | FALSE |
| PN vs PNN | r2_qual | 3.245425 | 1.234E-03 | 5.290E-02 | FALSE |
| PN vs FN | Primer dimer | −1.03201 | 3.207E-01 | 9.999E-01 | FALSE |
| PN vs FN | Ambiguity | −1.51352 | 1.333E-01 | 9.881E-01 | FALSE |
| PN vs FN | Short read | −1.47375 | 1.586E-01 | 9.944E-01 | FALSE |
| PN vs FN | MID mismatch | −5.29945 | 1.447E-04 | 6.634E-03 | TRUE |
| PN vs FN | Minority MID | −13.2917 | 1.590E-24 | 0.000E+00 | TRUE |
| PN vs FN | Primer verification | −2.51412 | 1.338E-02 | 4.321E-01 | FALSE |
| PN vs FN | Casper mismatch | −1.91524 | 7.643E-02 | 9.381E-01 | FALSE |
| PN vs FN | Nonsense | −0.52093 | 6.037E-01 | 1.000E+00 | FALSE |
| PN vs FN | raw pairs passed | 2.537808 | 2.448E-02 | 6.290E-01 | FALSE |
| PN vs FN | r1_maxlength | 0.388275 | 7.045E-01 | 1.000E+00 | FALSE |
| PN vs FN | r2_maxlength | 0.388275 | 7.045E-01 | 1.000E+00 | FALSE |
| PN vs FN | r1_numseqs | 0.79751 | 4.400E-01 | 1.000E+00 | FALSE |
| PN vs FN | r2_numseqs | 0.79751 | 4.400E-01 | 1.000E+00 | FALSE |
| PN vs FN | r1_minlength | −1.71001 | 9.675E-02 | 9.686E-01 | FALSE |
| PN vs FN | r2_minlength | −0.98698 | 3.430E-01 | 1.000E+00 | FALSE |
| PN vs FN | r1_gc | 0.433694 | 6.711E-01 | 1.000E+00 | FALSE |
| PN vs FN | r2_gc | 0.353036 | 7.286E-01 | 1.000E+00 | FALSE |
| PN vs FN | r1_qual | −0.67196 | 5.144E-01 | 1.000E+00 | FALSE |
| PN vs FN | r2_qual | 0.15547 | 8.790E-01 | 1.000E+00 | FALSE |
P: pass; F: fail; PN: passing negative; PNN: passing non-negative; FN: failing negative
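
For readers unfamiliar with Welch’s test, the statistic for two groups with unequal variances can be computed as in the sketch below. The function name `welch_t` and the toy inputs are illustrative assumptions. The sketch returns only the t statistic and the Welch-Satterthwaite degrees of freedom; it does not reproduce the p-value correction applied in the table, which is not specified here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom.

    `a` and `b` are sequences of per-sample filter values for two groups
    (for example PN and PNN). Each group needs at least two observations
    and the groups must not both have zero variance.
    """
    na, nb = len(a), len(b)
    # Squared standard error contributed by each group (sample variance / n).
    sa, sb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(sa + sb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (sa + sb) ** 2 / (sa ** 2 / (na - 1) + sb ** 2 / (nb - 1))
    return t, df


t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t, 4), round(df, 4))
```

Where SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns a p-value.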
Fig. 7 Scatter plot showing samples in categories PNN, FN, and PN. Box shows the application of the three-threshold combination using minimization of Gini impurity index
Fig. 8 Breakdown of data categorizations using parameters from the three-filter threshold combination. Top row shows histograms of each of the three filters. Bottom row shows results of using any two of the filters alone
Fig. 9 Histogram of the ratio of bit score-derived log probabilities of best to second-best subtype matches of the sequences in all deduplicated samples submitted to GHOST. Solid line indicates the cutoff ratio of 2, with the area under the curve to the left of the cutoff representing unique sequences that are classified only at the genotype level
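
The idea of choosing a threshold by minimizing Gini impurity, referred to in Fig. 7, can be illustrated for a single filter with the sketch below. It is a simplified, one-dimensional example under stated assumptions: the labels are binary (1 for a negative sample, 0 otherwise), the function names are hypothetical, and the three-threshold combination used for the figures is not reproduced.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 or 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)  # equals 1 - p**2 - (1 - p)**2


def best_threshold(values, labels):
    """Find the cut on `values` that minimizes the weighted Gini impurity.

    Candidate cuts are midpoints between consecutive distinct sorted values.
    Returns (threshold, weighted_impurity); threshold is None if all values
    are identical.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_score, best_cut = float("inf"), None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut can separate equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_score = score
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cut, best_score
```

With normalized filter values `[0.1, 0.2, 0.3, 0.8, 0.9]` and labels `[0, 0, 0, 1, 1]`, the search places the cut midway between 0.3 and 0.8, where both sides are pure.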
Sequence counts for unique sequences found to have a ratio of first to second-best hit scores under 2
| major | minor | count | proportion |
|---|---|---|---|
| 1a | 1c | 704,704 | 0.057086 |
| 1b | 1c | 407,112 | 0.032979 |
| 1a | 1b | 3548 | 0.000287 |
| 1b | 1a | 3390 | 0.000275 |
| 2a | 2c | 172 | 1.39E-05 |
| 2c | 2a | 7 | 5.67E-07 |
| 3a | 1a | 3 | 2.43E-07 |
| 1b | 3a | 2 | 1.62E-07 |
Fig. 10 Histogram of prevalence ratios for all non-dominant subtypes where prevalence ratio is defined as the total frequency of the subtype divided by the total frequency of the dominant type
Fig. 11 All deduplicated samples submitted to GHOST, including artificially created panel verification samples and non-linking samples. Node and link colors were arbitrarily assigned to clusters
Fig. 12 All links found in GHOST. Nodes representing samples artificially created for panel verifications by state pilot participants were removed, along with non-linking samples. Box encloses a chordless cycle. Node and link colors were arbitrarily assigned to clusters
Fig. 13 All links found in GHOST with removal of nodes representing samples artificially created for panel verifications by state pilot participants and nodes representing samples associated with a project with known quality control issues. Non-linking samples removed. Node and link colors were arbitrarily assigned to clusters
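
The prevalence ratio defined in the Fig. 10 caption can be computed per sample as in the following sketch. The function name and the dictionary input (subtype mapped to its total frequency in one sample) are assumptions for illustration.

```python
def prevalence_ratios(subtype_freqs):
    """Return the dominant subtype and the prevalence ratio of every other one.

    `subtype_freqs` maps subtype name to total frequency in a sample and must
    be non-empty. The prevalence ratio of a non-dominant subtype is its total
    frequency divided by the total frequency of the dominant subtype.
    """
    dominant = max(subtype_freqs, key=subtype_freqs.get)
    dominant_freq = subtype_freqs[dominant]
    ratios = {
        subtype: freq / dominant_freq
        for subtype, freq in subtype_freqs.items()
        if subtype != dominant
    }
    return dominant, ratios


dominant, ratios = prevalence_ratios({"1a": 900, "1b": 90, "3a": 9})
print(dominant, ratios)
```

In this toy sample the dominant subtype is 1a, and subtypes 1b and 3a have prevalence ratios of about 0.1 and 0.01.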
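
Fig. 12 highlights a chordless cycle in the link graph. One way to test whether a graph contains any chordless cycle of 4 or more nodes is to check whether it is chordal, since a graph is chordal exactly when it has no such cycle. The sketch below does this by repeatedly removing simplicial vertices (vertices whose remaining neighbors are all linked to each other). It is an illustrative approach, not necessarily the method used in GHOST, and the function name and edge-list input are assumptions.

```python
def has_chordless_cycle(edges):
    """Return True if the undirected graph has a chordless cycle of >= 4 nodes.

    `edges` is an iterable of (u, v) pairs; self-loops are ignored.
    A graph is chordal if and only if its vertices can all be removed one at
    a time, each being simplicial at the moment of removal.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    remaining = set(adj)
    while remaining:
        simplicial = None
        for v in remaining:
            nbrs = adj[v] & remaining
            # v is simplicial if every pair of its remaining neighbors is linked.
            if all(b in adj[a] for a in nbrs for b in nbrs if a != b):
                simplicial = v
                break
        if simplicial is None:
            return True  # no simplicial vertex left: the graph is not chordal
        remaining.remove(simplicial)
    return False
```

A square of four samples linked only along its sides is reported as containing a chordless cycle, while the same square with one diagonal link is not.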
Quality Control event descriptions, triggers, actions, and notifications
| ID | Event | Action | Indicator | Notification | Suggestion |
|---|---|---|---|---|---|
| A | Poor quality | Warn | Casper mismatch filter > 95% PNN data | In sample X, the Casper alignment step discarded a high level of pairs due to mismatches. | Please ensure that the quality of amplification reagents is not compromised (check polymerase expiration date, proper storage conditions, quality of primers). Please confirm the concentration and quality of the pooled library and ensure that the correct concentration is loaded on the chip. Check the expiration date on the MiSeq reagent kit. |
| B | Poor purification | Warn | Primer dimer filter > 95% PNN data | In sample X, the primer dimer filter shows a high level of pairs discarded | Please check the concentration of primers (barcode and index) and review magnetic beads cleaning procedure. Ensure the quality of your final pooled library exceeds 95% purity. |
| C | Negative or loss of product detection | Reject, warn | Primer dimer filter > 0.785 or minority MID filter > 0.11 | Sample X was determined to be either a negative control or to have suffered a loss of product during library preparation. | If this was not intended to be a negative control, please check the sample’s proximity to an intended negative control, and if the negative control passes, consider that there may have been mislabeling. Please repeat the library preparation for this sample. |
| D | Unclassified sequences form dominant population for sample | Warn, notify CDC | Proportion of sequences that cannot be classified is higher than for any other subtype. | Sample X cannot be classified. CDC/DVH has been notified. | Please wait to be contacted by CDC staff. |
| E | Subtype classification issues | Warn | Sequences within the population have a best subtype match and second-best subtype match with ratio < 2 | For sample X, ambiguous subtype classifications have been detected. | Please note that this sample’s subtype is questionable. |
| F | Chordless cycle detected | Warn, notify CDC | Analysis task contains a chordless cycle of 4 nodes or more | Chordless cycle detected in samples X, Y, and Z. | Please check for signs of contamination between samples X, Y, and Z, and repeat library if feasible. |
| G | Residual read-pair level too low | Reject | Read pair count < 10,000 after all other filters execute | Sample X does not have enough reads after all filters execute to proceed. | Please review the sample preparation for sample X. Repeat this sample in next library if feasible. If not, contact ghost@cdc.gov about relaxing read pair level restrictions. |
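
The two rejection rules in the table (events C and G) translate directly into threshold checks, as the sketch below shows. The `SampleQC` record, its field names, and the `qc_events` function are hypothetical; the sketch assumes the primer dimer and minority MID values are the normalized fractions of read pairs removed by each filter.

```python
from dataclasses import dataclass

PRIMER_DIMER_CUTOFF = 0.785   # event C indicator
MINORITY_MID_CUTOFF = 0.11    # event C indicator
MIN_RESIDUAL_PAIRS = 10_000   # event G indicator


@dataclass
class SampleQC:
    name: str
    primer_dimer: float    # assumed: normalized primer dimer filter value
    minority_mid: float    # assumed: normalized minority MID filter value
    residual_pairs: int    # read pairs left after all other filters


def qc_events(sample: SampleQC):
    """Return the (event ID, action) pairs triggered by rules C and G."""
    events = []
    if (sample.primer_dimer > PRIMER_DIMER_CUTOFF
            or sample.minority_mid > MINORITY_MID_CUTOFF):
        events.append(("C", "reject"))  # negative control or loss of product
    if sample.residual_pairs < MIN_RESIDUAL_PAIRS:
        events.append(("G", "reject"))  # residual read-pair level too low
    return events
```

For example, `qc_events(SampleQC("s1", 0.80, 0.01, 15000))` returns `[("C", "reject")]`, because only the primer dimer value exceeds its cutoff.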