Yanqiu Zhou, Chen Liu, Rongfang Zhou, Anzhi Lu, Biao Huang, Liling Liu, Ling Chen, Bei Luo, Jin Huang, Zhijian Tian.
Abstract
BACKGROUND: The sequencing platform BGISEQ-500 is based on DNBSEQ technology and provides high throughput at low cost. This sequencer has been widely used in various areas of scientific and clinical research. A better understanding of the sequencing process and performance of this system is essential for stabilizing the sequencing process, accurately interpreting sequencing results and efficiently solving sequencing problems. To address these concerns, a comprehensive database, SEQdata-BEACON, was constructed to accumulate run performance data from BGISEQ-500.
Keywords: BGISEQ-500; Data analysis; Database; Prediction tools; Sequencing run performance
Year: 2019 PMID: 31807141 PMCID: PMC6857306 DOI: 10.1186/s13040-019-0209-9
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Fig. 1 Schematic architecture of SEQdata-BEACON. The architecture of the database and applications contains four parts. “Sequencing Files” shows the files generated by BGISEQ-500 sequencers. “SEQdata-BEACON” shows the metrics stored in the database. “Yield simulation model” shows the special function of our database, which predicts the final yield from input metrics. “Web services” shows the three applications on our website that provide interactive functions for users: Predicting, Browsing and Data querying
Fig. 2 Web visual interface of SEQdata-BEACON. The screenshots show views of the functional modules. a The navigation bar on the “Home” page. b The drop-down menu and distribution charts of metrics on the “Browse” page. c The download sources listed on the “Download” page. d The input windows and the results of the yield simulation model on the “Tools” page. Please note that not all fields are shown
Fig. 3 Numerical metrics correlation. Hierarchical clustering of the Pearson’s correlation matrix between 52 metrics. The three branches, Yield and Quality, Machine State, and Sequencing Calibration, are marked with purple, orange and yellow, respectively, on the right side of the Y-axis in the figure. Red blocks in the heatmap indicate a positive correlation, blue blocks indicate a negative correlation, and white blocks indicate no correlation
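The correlation matrix behind a heatmap like Fig. 3 can be sketched in a few lines of Python. This is only an illustration of Pearson's correlation, not the authors' pipeline: the metric names `metric_a`, `metric_b`, `metric_c` and all values are hypothetical, and the hierarchical clustering step is omitted.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(metrics):
    """Return a nested dict: matrix[a][b] is the correlation of metrics a and b."""
    names = list(metrics)
    return {a: {b: pearson(metrics[a], metrics[b]) for b in names} for a in names}

# Hypothetical values of three metrics across five runs (not taken from the database).
runs = {
    "metric_a": [88.0, 90.5, 85.2, 91.0, 87.3],
    "metric_b": [11.0, 12.4, 9.8, 12.9, 10.5],
    "metric_c": [0.30, 0.22, 0.41, 0.20, 0.33],
}
matrix = correlation_matrix(runs)
```

In a heatmap of such a matrix, a value near +1 would be drawn red, a value near −1 blue, and a value near 0 white.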
Fig. 4 Distribution of metrics in SEQdata-BEACON. a Q30 versus Reads; histograms of Q30 and Reads are shown in gray and density profiles in blue. b Histogram of FIT in all entries. c Scatterplot of FIT through 200 cycles
Model summaries of linear regressions for predicting yield outputs
| Independent variables | Analytical results for Eq. ( | | | Analytical results for Eq. ( | | | Analytical results for Eq. ( | | |
|---|---|---|---|---|---|---|---|---|---|
| | Std. error | t value | p value | Std. error | t value | p value | Std. error | t value | p value |
| Constants | 11.424 | −11.353 | < 2e-16** | 10.662 | −12.173 | < 2e-16** | 39.411 | −3.578 | 0.00035** |
| TotalEsr*Dnbnumber | 0.010 | 95.192 | < 2e-16** | 0.010 | 96.130 | < 2e-16** | 0.033 | 38.837 | < 2e-16** |
| BIC | 0.192 | 7.278 | 4.73e-13** | 0.178 | 7.873 | 5.46e-15** | 0.724 | 6.556 | 6.91e-11** |
| accGRR | 6.175 | 0.023 | 0.981 | – | – | – | 59.009 | −10.243 | < 2e-16** |
| SNR | 0.388 | −8.586 | < 2e-16** | 0.388 | −8.596 | < 2e-16** | 0.653 | −5.307 | 1.23e-07** |
| FIT | 4.757 | 5.675 | 1.57e-08** | 4.577 | 5.905 | 4.08e-09** | 11.249 | 2.567 | 0.010* |
TotalEsr: ESR (Effective Spot Rate) is the percentage of filtered Reads among the DNBs recognized by Basecalling. ESR = Total Reads/theoretical maximum number of reads in one sequencing lane. TotalEsr is the ESR value calculated over the first 15 cycles of read1 and read2 and held constant for the rest of each read
Dnbnumber: The theoretical maximum number of DNBs on the patterned array
BIC: Basecall information content, the percentage of DNBs that can be used for Basecalling among the DNBs recognized by the optical system. BIC = (number of DNBs that can be used for Basecalling/number of DNBs that can be recognized by the optical system) × 100%
accGRR: Accumulated Good Reads Rate, the percentage of filtered Reads among the DNBs recognized by Basecalling, taking chastity greater than 0.6 as the filtering criterion. accGRR = Total Reads/theoretical maximum number of reads in one sequencing lane. This value is only a statistical indicator that reflects the overall quality of the read (multi-cycle state)
SNR: Signal to Noise Ratio. Taking the SNR calculation of a single DNB as an example: base A (the maximum light intensity) is used as the signal, the C, G and T channels (CGT) are the background, and the variance of the CGT light intensity is the noise. A_SNR = A_mean/CGT_dev
FIT: The FIT value represents the distribution of differences between signal and noise for each base. The FIT value is higher when the distribution of differences between signal and noise for each channel/color is more concentrated
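The ESR, BIC and SNR formulas above can be written out as a small Python sketch. The function names and input numbers are hypothetical, and reading `CGT_dev` as the population standard deviation of the background intensities is an assumption, since the definition does not say which deviation is meant.

```python
from statistics import mean, pstdev

def esr(total_reads, max_reads_per_lane):
    """ESR = Total Reads / theoretical maximum number of reads in one lane."""
    return total_reads / max_reads_per_lane

def bic(dnbs_for_basecalling, dnbs_recognized):
    """BIC = usable DNBs / optically recognized DNBs, as a percentage."""
    return dnbs_for_basecalling / dnbs_recognized * 100.0

def a_snr(a_intensities, cgt_intensities):
    """A_SNR = A_mean / CGT_dev.

    Assumption: CGT_dev is taken as the population standard deviation
    of the background (C, G, T) intensities.
    """
    return mean(a_intensities) / pstdev(cgt_intensities)

# Hypothetical inputs, chosen only to exercise the formulas.
example_esr = esr(450_000_000, 500_000_000)
example_bic = bic(480, 500)
example_snr = a_snr([100, 110, 90], [8, 12, 10, 10])
```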
** Significant at the 1% probability level
* Significant at the 5% probability level
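The table gives standard errors and test statistics but not the coefficient estimates themselves. In ordinary least squares the t statistic is the estimate divided by its standard error, so rough estimates can be recovered by multiplying the two. The sketch below does this for the first group of columns. It assumes that the second number in each group is the t statistic. Because the printed standard errors are rounded to three decimals, the results are approximations only.

```python
# (std_error, t_value) pairs copied from the first group of columns in the table.
first_model = {
    "Constants": (11.424, -11.353),
    "TotalEsr*Dnbnumber": (0.010, 95.192),
    "BIC": (0.192, 7.278),
    "accGRR": (6.175, 0.023),
    "SNR": (0.388, -8.586),
    "FIT": (4.757, 5.675),
}

def approx_estimate(std_error, t_value):
    """Recover an approximate coefficient from t = estimate / std_error."""
    return std_error * t_value

estimates = {
    name: approx_estimate(se, t) for name, (se, t) in first_model.items()
}

for name, value in estimates.items():
    print(f"{name}: {value:.3f}")
```

The sign of each recovered value follows the sign of its t statistic, so under this reading SNR enters the first model with a negative coefficient and BIC and FIT with positive ones.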
Fig. 5 A comparison of predicted vs. actual yield in training sets and test sets. a Training sets. b Test sets. The linear regression line is shown in blue