George John Kastanis, Luis V Santana-Quintero, Maria Sanchez-Leon, Sara Lomonaco, Eric W Brown, Marc W Allard.
Abstract
Whole genome sequencing of bacterial isolates has become a daily task in many laboratories, generating incredible amounts of data. However, data acquisition is not an end in itself; the goal is to acquire high-quality data useful for understanding genetic relationships. Having a method that could rapidly determine which of the many available run metrics are the most important indicators of overall run quality and having a way to monitor these during a given sequencing run would be extremely helpful to this effect. Therefore, we compared various run metrics across 486 MiSeq runs, from five different machines. By performing a statistical analysis using principal components analysis and a K-means clustering algorithm of the metrics, we were able to validate metric comparisons among instruments, allowing for the development of a predictive algorithm, which permits one to observe whether a given MiSeq run has performed adequately. This algorithm is available in an Excel spreadsheet: that is, MiSeq Instrument & Run (In-Run) Forecast. Our tool can help verify that the quantity/quality of the generated sequencing data consistently meets or exceeds recommended manufacturer expectations. Patterns of deviation from those expectations can be used to assess potential run problems and plan preventative maintenance, which can save valuable time and funding resources.
Keywords: Forecast; In-Run; MiSeq; sequencing; tool
Year: 2019 PMID: 30506954 PMCID: PMC6487961 DOI: 10.1111/1755-0998.12973
Source DB: PubMed Journal: Mol Ecol Resour ISSN: 1755-098X Impact factor: 7.090
Summary of MiSeq run metrics. The following table depicts the run metrics to be compared and analysed in this study. Manufacturer recommended metrics are shown when available as well as a brief description of each run metric
| Run metric | Manufacturer recommended range/value | Brief description |
|---|---|---|
| Q30 yield | N/A | The number of gigabases (Gb) that passed the chastity filter |
| Reads PF | 24–30 million reads | The number of reads that passed the chastity filter |
| Total yield | 7.5–8.5 Gb | The total number of Gb expected to be generated during the sequencing run |
| %≥Q30 (Overall) | ≥75.0% | The percentage of bases having a quality score of 30 or higher |
| %≥Q30 (R1, R2, IR1, IR2) | N/A | The %≥Q30 score broken down into its component parts |
| Phasing (R2), Prephasing (R2) | N/A | The amount of asynchrony during the reverse sequencing read |
| Phasing (R1), Prephasing (R1) | <0.1% | The amount of asynchrony during the forward sequencing read |
| Cluster density (CD) | 1,000–1,200 K/mm² | The quantity of clusters that are generated per flow cell surface area during the cluster generation stage of a sequencing run |
| Total reads | N/A | The total number of reads generated during a sequencing run |
| Clusters PF | ≥80.0% | The percentage of generated clusters that pass the chastity filter |
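The recommended ranges in the table above can be turned into a simple pass/fail check. The following is a minimal sketch, not the In-Run Forecast spreadsheet itself: the dictionary keys and the function name `check_run` are hypothetical, and only metrics with a manufacturer value in the table are covered.

```python
# Manufacturer-recommended ranges copied from the table above.
# Each entry is (lower bound, upper bound); None means no bound on that side.
# The key names are hypothetical labels chosen for this sketch.
RECOMMENDED = {
    "reads_pf_millions": (24.0, 30.0),
    "total_yield_gb": (7.5, 8.5),
    "pct_q30_overall": (75.0, None),
    "phasing_r1_pct": (None, 0.1),
    "prephasing_r1_pct": (None, 0.1),
    "cluster_density_k_mm2": (1000.0, 1200.0),
    "pct_clusters_pf": (80.0, None),
}

# The table gives "<0.1%" for these, so the upper bound is exclusive.
STRICT_UPPER = {"phasing_r1_pct", "prephasing_r1_pct"}


def check_run(metrics):
    """Return {metric: bool} for each supplied metric that has a recommended range."""
    result = {}
    for name, (low, high) in RECOMMENDED.items():
        if name not in metrics:
            continue  # metric not reported for this run
        value = metrics[name]
        ok = True
        if low is not None and value < low:
            ok = False
        if high is not None:
            if name in STRICT_UPPER:
                ok = ok and value < high
            else:
                ok = ok and value <= high
        result[name] = ok
    return result
```

Calling `check_run` with a dictionary of observed values returns one boolean per metric, so a run can be flagged on the specific metric that fell outside its range rather than on a single overall verdict.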
Comparison of 15 MiSeq run metrics according to the 5 MiSeq instruments
| Run metric | MiSeq A (n = …) | MiSeq B (n = …) | MiSeq C (n = …) | MiSeq D (n = …) | MiSeq E (n = …) |
|---|---|---|---|---|---|
| Cluster density | 929.1 ± 31.7 | 886.8 ± 35.8 | 1,032.1 ± 37.1 | … | … |
| Clusters PF | 86.9% ± 0.8% | … | … | 84.4% ± 1.3% | 86.5% ± 0.8% |
| %≥Q30 (Overall) | … | … | 78.9% ± 1.4% | 76.4% ± 1.7% | 78.5% ± 0.7% |
| %≥Q30 (R1) | … | 82.7% ± 1.0% | 85.2% ± 1.4% | … | 85.6% ± 0.6% |
| %≥Q30 (IR1) | 93.8% ± 0.6% | … | … | 93.5% ± 1.2% | 90.1% ± 1.0% |
| %≥Q30 (IR2) | 90.5% ± 0.8% | … | … | 89.1% ± 1.5% | 82.9% ± 1.2% |
| %≥Q30 (R2) | … | … | 72.1% ± 1.5% | 70.4% ± 1.7% | 71.4% ± 0.9% |
| Total yield | 7.5 ± 0.2 Gb | 7.0 ± 0.3 Gb | … | 8.3 ± 0.3 Gb | … |
| Q30 yield | 6.1 ± 0.2 Gb | … | 6.4 ± 0.3 Gb | … | |
| Total reads | 1.7 × 10⁷ ± 5.4 × 10⁵ | 1.6 × 10⁷ ± 6.4 × 10⁵ | … | 1.9 × 10⁷ ± 7.9 × 10 | … |
| Reads PF | 1.5 × 10⁷ ± 4.2 × 10⁵ | … | 1.6 × 10⁷ ± 6.5 × 10 | … | |
| Prephasing (R1) | … | 0.111% ± 0.008% | 0.113% ± 0.031% | … | 0.074% ± 0.004% |
| Prephasing (R2) | … | 0.171% ± 0.016% | 0.169% ± 0.028% | … | 0.131% ± 0.007% |
| Phasing (R1) | 0.074% ± 0.005% | … | 0.092% ± 0.017% | … | 0.087% ± 0.005% |
| Phasing (R2) | … | … | 0.172% ± 0.008% | 0.162% ± 0.007% | 0.172% ± 0.008% |
“n” is equal to the number of MiSeq runs performed per sequencer. Values represent the means ± SE. Values that are deemed statistically significant are indicated by the corresponding symbol according to the figure legend below. The best and worst values based on manufacturer recommendations for each variable are indicated with underline and bold, respectively
Significant compared to MiSeq A.
Significant compared to MiSeq B.
Significant compared to MiSeq C.
Significant compared to MiSeq D.
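The "mean ± SE" values reported per instrument can be reproduced from per-run measurements as sketched below. This assumes the conventional definition of the standard error (sample standard deviation divided by the square root of n); the source does not state which formula was used, and the function name `mean_se` is hypothetical.

```python
import math


def mean_se(values):
    """Return (mean, standard error of the mean) for a list of per-run values.

    Assumes SE = s / sqrt(n), where s is the sample standard deviation
    computed with n - 1 in the denominator.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)
```

For example, `mean_se(cluster_densities_for_one_instrument)` would give the two numbers shown in a single cell of the comparison table, where the input list is whatever per-run values were recorded for that sequencer.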
Heatmap comparison of changes in run metrics between the start and end of a MiSeq sequencing run for each MiSeq instrument. “n” is equal to the number of MiSeq runs performed per sequencer. Values represent the arithmetic means + SE. The color palette in the note indicates the range of the most favorable average % loss/fold changes (green) to least favorable average % loss/fold changes (red), comparatively for the 5 MiSeq instruments [Colour table can be viewed at wileyonlinelibrary.com]
Pearson's correlation matrix for 15 analysed MiSeq run metrics. Strong correlations (values between 0.80–1.00) are indicated in green; moderate correlations (0.50–0.79) are in yellow, and values with no color indicate weak correlations (0.00–0.49)
| | % Clusters PF | %≥Q30 (Overall) | %≥Q30 (R1) | %≥Q30 (IR1) | %≥Q30 (IR2) | %≥Q30 (R2) | Phasing (R1) | Prephasing (R1) | Phasing (R2) | Prephasing (R2) | Total reads | Reads PF | Total yield | Q30 yield | Cluster density |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| % Clusters PF | * | ||||||||||||||
| %≥Q30 (Overall) | 0.54 | * | |||||||||||||
| %≥Q30 (R1) | 0.49 | 0.89 | * | ||||||||||||
| %≥Q30 (IR1) | 0.52 | 0.46 | 0.38 | * | |||||||||||
| %≥Q30 (IR2) | 0.55 | 0.5 | 0.41 | 0.6 | * | ||||||||||
| %≥Q30 (R2) | 0.52 | 0.9 | 0.77 | 0.46 | 0.5 | * | |||||||||
| Phasing (R1) | −0.3 | −0.32 | −0.38 | −0.29 | −0.3 | −0.27 | * | ||||||||
| Prephasing (R1) | −0.14 | −0.58 | −0.67 | −0.16 | −0.15 | −0.48 | 0.26 | * | |||||||
| Phasing (R2) | −0.03 | 0.05 | 0.03 | 0.09 | −0.19 | 0.04 | 0.13 | −0.03 | * | ||||||
| Prephasing (R2) | −0.02 | −0.17 | −0.2 | 0.02 | −0.12 | −0.15 | 0.11 | 0.31 | 0.56 | * | |||||
| Total reads | −0.14 | −0.15 | −0.13 | 0.04 | 0.13 | −0.15 | −0.04 | −0.05 | −0.29 | −0.23 | * | ||||
| Reads PF | 0.16 | 0.02 | 0.02 | 0.23 | 0.32 | 0.02 | −0.16 | −0.1 | −0.29 | −0.23 | 0.94 | * | |||
| Total yield | 0.14 | −0.01 | 0.01 | 0.21 | 0.3 | −0.01 | −0.16 | −0.09 | −0.3 | −0.24 | 0.93 | 0.99 | * | ||
| Q30 yield | 0.23 | 0.2 | 0.19 | 0.3 | 0.4 | 0.2 | −0.22 | −0.18 | −0.28 | −0.34 | 0.86 | 0.96 | 0.96 | * | |
| Cluster density | −0.25 | −0.2 | −0.16 | −0.03 | 0.05 | −0.21 | −0.04 | −0.08 | −0.27 | −0.21 | 0.9 | 0.82 | 0.84 | 0.77 | * |
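Each entry in the matrix above is a Pearson correlation coefficient between two run metrics, binned by the thresholds given in the caption. The sketch below computes the coefficient and applies those bins. Applying the bins to the absolute value of r is an assumption (the caption lists only positive ranges), and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant input")
    return sxy / math.sqrt(sxx * syy)


def strength(r):
    """Bin a coefficient using the caption's thresholds.

    Assumption: the bins are applied to |r|, so -0.67 counts as moderate.
    """
    magnitude = abs(r)
    if magnitude >= 0.80:
        return "strong"
    if magnitude >= 0.50:
        return "moderate"
    return "weak"
```

Under these bins, the 0.99 between Reads PF and Total yield is strong, the −0.67 between Prephasing (R1) and %≥Q30 (R1) is moderate, and the 0.49 between % Clusters PF and %≥Q30 (R1) is weak.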
Figure 1. PCA loading plot of the 15 observed MiSeq run metrics across 486 MiSeq runs. Three groups can be distinguished from the plot and are indicated in different colours [Colour figure can be viewed at wileyonlinelibrary.com]
Figure 2. A three‐dimensional PCA plot of each MiSeq instrument. Each colour represents a particular MiSeq desktop sequencer (MiSeq A–E), and each point on the plot represents a single observation (2 × 250, 500‐cycle V2 MiSeq sequencing run) [Colour figure can be viewed at wileyonlinelibrary.com]
Figure 3. (a) A three‐dimensional k‐means cluster analysis plot of the PCA (Euclidean). The analysis was run several times with a priori number of clusters from 2 to 5. In this plot, higher quality MiSeq runs, which exhibited metrics that met performance criteria, are represented by the green cluster, MiSeq runs that failed primarily due to %≥Q30 issues are represented in blue, and MiSeq runs that failed due to Phasing or Prephasing are depicted in red. (b) A three‐dimensional k‐means cluster analysis plot of the PCA (Mahalanobis). The analysis was run several times with a priori number of clusters from 2 to 5. In this plot, higher quality MiSeq runs, which exhibited metrics that met performance criteria, are represented by the green cluster, MiSeq runs that failed due to %≥Q30 issues and other metrics are represented in blue, and MiSeq runs that failed due to Phasing or Prephasing and other metrics are depicted in red [Colour figure can be viewed at wileyonlinelibrary.com]
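The PCA followed by k‐means analysis described in the Figure 2 and Figure 3 captions can be sketched with NumPy as below. This is an illustration under stated assumptions, not the authors' pipeline: the metrics are standardised before PCA, three components are kept by default, only the Euclidean variant is shown, and a deterministic farthest-point initialisation is used in place of whatever initialisation the original software applied. All function names are hypothetical.

```python
import numpy as np


def pca_scores(X, n_components=3):
    """Standardise each metric, then project runs onto the leading principal axes.

    Assumes no metric column is constant (its standard deviation would be 0).
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the principal axes of the standardised data.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T


def farthest_point_init(points, k):
    """Deterministic start: first point, then repeatedly the point farthest from chosen centres."""
    idx = [0]
    for _ in range(1, k):
        diffs = points[:, None, :] - points[idx][None, :, :]
        nearest = np.linalg.norm(diffs, axis=2).min(axis=1)
        idx.append(int(nearest.argmax()))
    return points[idx].copy()


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with Euclidean distance; returns (labels, centres)."""
    centers = farthest_point_init(points, k)
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

With a runs-by-metrics array `X`, `kmeans(pca_scores(X), k)` can be repeated for k from 2 to 5, mirroring the range of cluster counts mentioned in the caption. Interpreting the resulting clusters as passing or failing runs still requires inspecting which metrics drive each cluster.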
Figure 4. MiSeq In‐Run Forecast Algorithm Flow Chart. A stepwise representation depicting the “MiSeq In‐Run Forecast” operation [Colour figure can be viewed at wileyonlinelibrary.com]